Wednesday, December 21, 2022

ZDM troubleshooting part 4: NONCDBTOPDB_CONVERSION fails (GUID conflicts with existing PDB)

Intro

Every time I think I’m done with my ZDM troubleshooting series, a new topic pops up :). I have also learned that every migration is distinct and brings its own unique challenges, so I have decided to keep the series open-ended. In this post, we will discuss the steps you should take when the conversion of a non-CDB into a PDB fails during a ZDM migration, so you can identify the root cause and get your migration back on track. So, let's dive in!

Note: you can always explore the other ZDM troubleshooting related posts below:
- Migration failing at ZDM_CONFIGURE_DG_SRC
- Migration failing at ZDM_SWITCHOVER_SRC plus hack

My ZDM environment

  • ZDM: 21.3 build

Property     Source           Target
RAC          NO               YES
Encrypted    NO               YES
CDB          NO               YES
Release      12.2             12.2
Platform     On-prem Linux    ExaCC


 

Prerequisites

All the prerequisites related to the ZDM VM and the source and target database systems were satisfied before running the migration.

Responsefile

Prepare a response file for a Physical Online migration with the required parameters (see the excerpt below). I will just point out that ZDM 21.3 now supports Data Guard Broker configurations.

$ cat physical_online_demo.rsp | grep -v ^#
TGT_DB_UNIQUE_NAME=TGTCDB
MIGRATION_METHOD=ONLINE_PHYSICAL
DATA_TRANSFER_MEDIUM=DIRECT
PLATFORM_TYPE=EXACC
...etc

 

Run the migration until the DG config – step 1

As usual, I run the migrate command with -pauseafter ZDM_CONFIGURE_DG_SRC to stop once the replication is configured, so I can resume the full migration at a later time.

$ZDM_HOME/bin/zdmcli migrate database -sourcedb SRCDB \
  -sourcenode srcHost -srcauth zdmauth \
  -srcarg1 user:zdmuser \
  -targetnode tgtNode \
  -tgtauth zdmauth \
  -tgtarg1 user:opc \
  -rsp ./physical_online_demo.rsp -ignore ALL \
  -pauseafter ZDM_CONFIGURE_DG_SRC

Resume the migration – step 2

Now that the Data Guard configuration is complete, it’s time to resume the migration and run it through to the end.

$ zdmcli resume job -jobid 1

Querying job status

As you can see, it didn’t take long to notice that the non-CDB to PDB conversion step failed.

$ zdmcli query job -jobid 1
zdmhost.domain.com: Audit ID: 39
Job ID: 1
User: zdmuser
Client: zdmhost
Job Type: "MIGRATE"
Current status: FAILED
Result file path: "/u01/app/oracle/zdmbase/chkbase/scheduled/job-1-*log" ...
Job execution elapsed time: 1 hours 25 minutes 41 seconds
ZDM_GET_SRC_INFO .............. COMPLETED
ZDM_GET_TGT_INFO .............. COMPLETED
ZDM_PRECHECKS_SRC ............. COMPLETED
ZDM_PRECHECKS_TGT ............. COMPLETED
ZDM_SETUP_SRC ................. COMPLETED
ZDM_SETUP_TGT ................. COMPLETED
ZDM_PREUSERACTIONS ............ COMPLETED
ZDM_PREUSERACTIONS_TGT ........ COMPLETED
ZDM_VALIDATE_SRC .............. COMPLETED
ZDM_VALIDATE_TGT .............. COMPLETED
ZDM_DISCOVER_SRC .............. COMPLETED
ZDM_COPYFILES ................. COMPLETED
ZDM_PREPARE_TGT ............... COMPLETED
ZDM_SETUP_TDE_TGT ............. COMPLETED
ZDM_RESTORE_TGT ............... COMPLETED
ZDM_RECOVER_TGT ............... COMPLETED
ZDM_FINALIZE_TGT .............. COMPLETED
ZDM_CONFIGURE_DG_SRC .......... COMPLETED
ZDM_SWITCHOVER_SRC ............ COMPLETED
ZDM_SWITCHOVER_TGT ............ COMPLETED
ZDM_POST_DATABASE_OPEN_TGT .... COMPLETED
ZDM_DATAPATCH_TGT ............. COMPLETED
ZDM_NONCDBTOPDB_PRECHECK ...... COMPLETED
ZDM_NONCDBTOPDB_CONVERSION .... FAILED
ZDM_POST_MIGRATE_TGT .......... PENDING
ZDM_POSTUSERACTIONS ........... PENDING
ZDM_POSTUSERACTIONS_TGT ....... PENDING
ZDM_CLEANUP_SRC ............... PENDING
ZDM_CLEANUP_TGT ............... PENDING


Troubleshooting the error

To see the full error message, the best approach is to check the $ZDM_BASE logs hosted locally on the target node under “$ORACLE_BASE/zdm/zdm_targetDB_$jobID/zdm/log”.

-- Target node
$ cd $ORACLE_BASE/zdm/zdm_TGTCDB_1/zdm/log

$ tail ./zdm_noncdbtopdb_conversion_*.log
[jobid-1][2022-12-14][mZDM_Queries.pm:556]:[DEBUG] Output is :
SQL*Plus: Release 12.2.0.1.0 Production on Wed Dec 14 2022 ..
Connected to: Oracle Database 12c EE Extreme Perf Release 12.2.0.1.0 - 64bit Production
CREATE PLUGGABLE DATABASE zdm_aux_SRCDB using '/tmp/zdm_aux_SRCDB.xml' NOCOPY
TEMPFILE REUSE

 *
ERROR at line 1:
ORA-65122: Pluggable database GUID conflicts with the GUID of an existing container.

[jobid-1][2022-12-14][mZDM_convert_noncdb2pdb.pl:522]:[ERROR]
failed to create the PDB 'zdm_aux_SRCDB'

As you can see above, the issue is related to the new PDB being created in the target CDB from the auxiliary database.


What Happened


In an online physical migration from a non-CDB database to a PDB on a target container, ZDM creates an auxiliary standby database in the background to ensure replication consistency before the final switchover. After the switchover is complete, a data patch is applied and an unplug and plug operation is done to convert the auxiliary DB into a PDB on the target container database (CDB).
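For context, here is a minimal, hand-driven sketch of that unplug/plug mechanism (object names reused from the log shown earlier; ZDM's scripts automate all of this and do considerably more, so treat it purely as an illustration):

-- On the auxiliary (former standby) database, opened read-only: generate the XML manifest
SQL> exec DBMS_PDB.DESCRIBE(pdb_descr_file => '/tmp/zdm_aux_SRCDB.xml');

-- On the target CDB: plug it in, convert the dictionary, then open the new PDB
SQL> CREATE PLUGGABLE DATABASE zdm_aux_SRCDB USING '/tmp/zdm_aux_SRCDB.xml' NOCOPY TEMPFILE REUSE;
SQL> ALTER SESSION SET CONTAINER = zdm_aux_SRCDB;
SQL> @?/rdbms/admin/noncdb_to_pdb.sql
SQL> ALTER PLUGGABLE DATABASE zdm_aux_SRCDB OPEN;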


Why is ZDM failing to create the new PDB?


It turns out that ZDM attempted to construct the new PDB from the generated XML manifest but was unsuccessful, as the GUID carried over from the manifest conflicted with the GUID of another PDB already present in the CDB. I wouldn't say this happens for every target CDB with existing PDBs (I have completed such migrations in the past), but in this case, two databases had already been migrated to the same target CDB before this one.

Subsidiary question
Why did the CREATE PLUGGABLE DATABASE command end up with an existing GUID instead of generating a new one?
I don’t have the answer to this yet, so we chose to open an SR and see if there was a workaround for this issue.
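Before (or while) engaging support, a quick check can confirm the conflict. A hedged sketch: the XML manifest normally records the source database GUID (the exact tag may vary by release), and DBA_PDBS on the target CDB lists the GUIDs already in use.

-- On the target node
$ grep -i guid /tmp/zdm_aux_SRCDB.xml

$ sqlplus / as sysdba
SQL> SELECT pdb_name, guid FROM dba_pdbs;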


SOLUTION: ZDM pl script rewrite

 
Force ZDM to use the clone option:

  • ZDM uses a PDB plugin script called mZDM_convert_noncdb2pdb.pl to perform the PDB conversion

  • All we need to do is update the plug-in section of the script and add AS CLONE to the CREATE PLUGGABLE DATABASE command

Location: make a copy of the file below on the ZDM host before the change

    cp $ZDM_HOME/rhp/zdm/mZDM_convert_noncdb2pdb.pl mZDM_convert_noncdb2pdb.pl.old

    The Perl script uses variables for the PDB name & XML manifest, but the error here occurred because the statement was built with NOCOPY and without an AS CLONE clause.

    To fix the issue, we just need to amend it and add the missing AS CLONE clause.

      ...else
      {
        @sql_stmts = (
        ...
        "CREATE PLUGGABLE DATABASE $sdb AS CLONE USING '$descfile' NOCOPY TEMPFILE REUSE");
      }
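      To double-check the change, a simple sketch on the ZDM host (the exact line number will differ between ZDM builds):

      -- Locate the statement before and after editing it
      $ grep -n "CREATE PLUGGABLE DATABASE" $ZDM_HOME/rhp/zdm/mZDM_convert_noncdb2pdb.pl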


      Note:

      • It is only advised to perform this change in this particular case or when asked to by Oracle Support.


      Resume the job

      That's it: after resuming the job, ZDM will create the PDB AS CLONE (implying a new GUID) and complete the rest of our online physical migration.

      $ zdmcli resume job -jobid 1
      $ zdmcli query job -jobid 1
      ...
      ZDM_CONFIGURE_DG_SRC .......... COMPLETED
      ZDM_SWITCHOVER_SRC ............ COMPLETED
      ZDM_SWITCHOVER_TGT ............ COMPLETED
      ZDM_POST_DATABASE_OPEN_TGT .... COMPLETED
      ZDM_DATAPATCH_TGT ............. COMPLETED
      ZDM_NONCDBTOPDB_PRECHECK ...... COMPLETED
      ZDM_NONCDBTOPDB_CONVERSION .... COMPLETED
      ZDM_POST_MIGRATE_TGT .......... PENDING
      ZDM_POSTUSERACTIONS ........... PENDING
      ZDM_POSTUSERACTIONS_TGT ....... PENDING
      ZDM_CLEANUP_SRC ............... PENDING
      ZDM_CLEANUP_TGT ............... PENDING


      Conclusion

      • We learned that ZDM may sometimes try to reuse an existing GUID while converting a non-CDB into a PDB
      • This may be fixed natively in future releases of ZDM.
      • I can’t assume this behavior would be the same in all cases, because I have already moved databases into a CDB with many PDBs without any problem in the past
      • Oracle documentation is explicit about the "AS CLONE" clause, even though I don't think the same DB was migrated to the same destination in the past:
        "Specifying AS CLONE also ensures that Oracle Database generates new identifiers (GUID, DBID) if the target CDB already contains a PDB that was created using the same set of data files."
      • You might not run into the same error, but this is the quickest fix in case it happens.

              Thank you for reading

      Monday, December 5, 2022

      What's ODABR snapshot & how to efficiently use it to patch ODA from 18 to 19.x



      Intro

      Although most of the focus nowadays has shifted to migrating on-premises workloads into the cloud, companies still leverage Oracle engineered systems like Oracle Database Appliance (ODA) to run their databases on-prem. As a matter of fact, ODA is a low entry-price platform with flexible CPU licensing that can still host workloads that aren’t mature enough to go to the cloud. Until then, system updates fall under the customer’s responsibility. In today’s use case, patching your ODA software version from 18.8 to 19.x will require upgrading the OS from Oracle Linux 6 to 7. But how does Oracle make that move seamless and safe in case of failure? This is why I chose to discuss a tool called ODABR, which provides rollback capability during the OS upgrade on ODAs.

      BACKUP BEFORE YOU PATCH
      It will be especially interesting to learn how to use it effectively with reduced available storage when patching an ODA to 19.6. Read more about the ODA release matrix in the official Oracle blog.


      Patching process to ODA 19.6

      The upgrade from 18.8 to 19.6 has two main stages:

      1. A first pass to upgrade Linux from OEL 6 to OEL 7.

      2. A second pass to update the ODA binaries (DCS and Grid), as for previous versions.

       

      What’s ODABR

      ODABR (ODA backup & recovery) is a utility developed by Oracle engineer Ruggero Citton, which allows you to back up and recover an ODA node using consistent & incremental system backups on bare-metal ODAs, as described in Oracle Support Note ID 2466177.1. ODABR is a prerequisite for the first stage (the OS upgrade to OEL7), as it saves a disk restore point for a rollback in case the ODA patching fails (the precheck will even fail if the tool is not installed).

      ODA backups

      System Node Backup includes the following filesystems:

      • / : Root file system

      • /boot : Boot partition

      • /opt : opt file system (OAK/DCS,TFA, OWG, ASR)

      • /u01: Grid Infrastructure, RDBMS binaries

      • Grid Infrastructure OCR file

      There are 2 types of backups with ODABR, but only one is needed when patching the ODA to 19.6:

      • Consistent backup, guaranteed by the LVM snapshot feature (used during patching)

      • Incremental backup, managed automatically using rsync (physical copy to a specified destination)


      LVM snapshot used by ODABR

      ODABR simply reuses the Linux LVM snapshot feature, which creates two copies of the same logical volume: one is used for backup purposes while the other continues in operation, and the delta is tracked from the moment the snapshot is created (a minimal illustration follows the list below).

      • Snapshot creation is quick & doesn’t require stopping the server.

      • A snapshot uses only the space needed to accommodate the difference between the two LVs (the delta, also called Copy-on-Write (CoW)).
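      For illustration only, this is essentially what happens at the LVM level; a minimal sketch using generic LVM commands and the VolGroupSys/LogVolRoot names shown later in this post (root_snap_demo is a hypothetical snapshot name and the size is arbitrary):

      # Create a 5 GiB copy-on-write snapshot of the root logical volume
      lvcreate --snapshot --size 5G --name root_snap_demo /dev/VolGroupSys/LogVolRoot

      # The Data% column shows how much of the snapshot space is consumed as blocks change
      lvs VolGroupSys

      # Drop the snapshot once it is no longer needed
      lvremove /dev/VolGroupSys/root_snap_demo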

      ODABR installation
      Download and install the rpm: >> odabr-2.0.1

      [root@odadev1~]# rpm -Uvh odabr-2.0.1-62.noarch.rpm 
      odabr-2.0.1.62 has been installed on /opt/odabr succesfully!


      Backup Syntax

      Usage:
      odabr backup [-snap] [-destination <dest path> [-dryrun][-silent]] | [-mgmtdb]
             [-osize <opt snapsize>][-rsize <root snapsize>][-usize <u01 snap size>]

      odabr infosnap --- show available snapshots
      odabr delsnap --- delete all snapshots

      The backup syntax is pretty straightforward, with -snap & -destination (an NFS/local path or an ssh/rsync target) as the main options.
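      For example, based strictly on the usage above (the destination path is a placeholder to adapt to your environment):

      -- Snapshot-only backup (the mode relied on while patching)
      [root@odadev1 ~]# /opt/odabr/odabr backup -snap

      -- Incremental, rsync-based copy to a local or NFS-mounted destination
      [root@odadev1 ~]# /opt/odabr/odabr backup -destination /nfs/odabr_backup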


      Patching to 19.6 challenge with limited Free space

       
      Before upgrading the OS, ODABR will create LVM snapshots for the file systems, which together require 190GB of free space:

      root LVM snapshot   30GB
      opt  LVM snapshot   60GB
      u01  LVM snapshot  100GB

      But in most situations, older systems have less unused space than that.
      Example: a node with only 78GB of unused space (22 + 37 + 19 GB in the df output below), which will cause an error during the patching prechecks.

      [root@odadev2 ~]# df -Ph / /u01 /opt
      Filesystem                          Size  Used Avail Use% Mounted on
      /dev/mapper/VolGroupSys-LogVolRoot   30G  6.9G   22G  25% /
      /dev/mapper/VolGroupSys-LogVolU01   148G  104G   37G  75% /u01  
      /dev/mapper/VolGroupSys-LogVolOpt    59G   38G   19G  68% /opt
      === 78GB available only

      PRECHECK ERROR

      # odacli create-prepatchreport -v 19.6.0.0.0 -os
      # odacli describe-prepatchreport -i 12d61cda-1cef-40b9-ad7d-8e087007da23v

      Patch pre-check report
      ------------------------------------------------------------------------
      Job ID: 666f7269-7f9a-49b1-8742-2447e94fe54e
      Description: Patch pre-checks for [OS]
      Status: FAILED
      Created: November 7, 2022 5:30:42 PM CEST
      Result: One or more pre-checks failed for [OS]
      Pre-Check Status Comments
      ----------------------- -------- --------------------------------------
      Validate LVM free space Failed Insufficient space to create LVM
      snapshots on node: odadev1.

      Expected free space (GB): 190, available space (GB): 78.



      Workarounds  

      In case of limited free space, we have 2 options:

      1. Cowboy

      My Oracle ACE peer Fernando Simon explains a drastic way to reduce the /u01 footprint in his excellent blog post "Patch ODA from 18.3 to 19.8. Part 2", by unmounting the disk and using both resize2fs & lvreduce to reclaim free space.


      2. Manual ODABR backup with custom snapshot sizes

      A snapshot only requires as much storage as the changes made in its logical volume, which means the OS upgrade will be the main source of the changes stored in the snapshots.
      Solution: run a manual backup specifying lower sizes for the /, /opt, and /u01 snapshots. Note that you need to run the prepatch report at least once first.
      Example

      With only 98G of free space, we can adapt the FS snapshots to lower sizes (opt=30G, root=5G, u01=70G)

      [root@odadev1 ~]# df -Ph / /opt /u01

      Filesystem                          Size  Used Avail Use% Mounted on
      /dev/mapper/VolGroupSys-LogVolRoot   30G  7.6G   21G  28% /
      /dev/mapper/VolGroupSys-LogVolOpt    59G   41G   16G  73% /opt
      /dev/mapper/VolGroupSys-LogVolU01   148G   80G   61G  57% /u01

      -- Actual free space

      [root@odadev1 ~]# pvs  

      PV         VG          Fmt  Attr PSize   PFree
      /dev/md1   VolGroupSys lvm2 a--u 446.00g 98.00g

      Note: specify lower values for the LVM snapshot sizes than the actual filesystem usage.
      - The odacli update-server command will use these custom snapshots (98GB) during the upgrade instead of automatically creating larger ones, which would take 190GB.

      [root@odadev1 ~]# /opt/odabr/odabr backup -snap -osize 30 -rsize 5 -usize 70

      --------------------------------------------------------
       odabr - ODA node Backup Restore - Version: 2.0.1-62
       Copyright Oracle, Inc. 20
       Author: Ruggero Citton <ruggero.citton@oracle.com>
               RAC Pack, Cloud Innovation
      --------------------------------------------------------
      SUCCESS: 2022-11-7 12:10:18: ...snapshot backup for 'opt' created successfully
      SUCCESS: 2022-11-7 12:10:20: ...snapshot backup for 'u01' created successfully
      SUCCESS: 2022-11-7 12:10:20: ...snapshot backup for 'root' created successfully
      SUCCESS: 2022-11-7 12:10:20: LVM snapshots backup done successfully


      -- Check the created LVM snapshots

      [root@odadev02 ~]# /opt/odabr/odabr infosnap

      LVM snap name         Status                COW Size              Data%
      -------------         ----------            ----------            ------
      root_snap             active                5.00 GiB              0.05%
      opt_snap              active                30.00 GiB             0.02%
      u01_snap              active                70.00 GiB             0.02%

      As shown above and below, the snapshots only contain the changes written during the OS upgrade.

      [root@odadev1 ~]# lvs
        LV         VG          Attr       LSize   Pool Origin     Data% Meta% Move Log
        LogVolDATA VolGroupSys -wi-a-----  10.00g
        LogVolOpt  VolGroupSys owi-aos---  60.00g
        LogVolRECO VolGroupSys -wi-a-----  10.00g
        LogVolRoot VolGroupSys owi-aos---  30.00g
        LogVolSwap VolGroupSys -wi-ao----  24.00g
        LogVolU01  VolGroupSys owi-aos--- 150.00g
       
        opt_snap   VolGroupSys swi-a-s---  30.00g      LogVolOpt  0.01  <-- snapshot
        root_snap  VolGroupSys swi-a-s---   5.00g      LogVolRoot 0.04  <-- snapshot
        u01_snap   VolGroupSys swi-a-s---  70.00g      LogVolU01  0.02  <-- snapshot



      ODABR tips when patching

      • You can use the odabr -dryrun option before choosing the right sizes (see the sketch below).
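        A sketch only; I am assuming -dryrun can be combined with -snap and the size flags, as the tip above implies:

        [root@odadev1 ~]# /opt/odabr/odabr backup -snap -dryrun -osize 30 -rsize 5 -usize 70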

      • When custom snapshots already exist on the system when odacli create-prepatchreport runs, the precheck fails, because it expects to create these snapshots itself. However, odacli update-server -c OS still continues with the upgrade.

      • Use the --force option during the upgrade to skip the auto backup.

        # odacli update-server -v 19.6.0.0.0 -c os --local --force
        Verifying OS upgrade
        Current OS base version: 6 is less than target OS base version: 7
        OS needs to upgrade to 7.7

      • Run the ODABR backup right after the repository update, in order to avoid extracting the patch a second time

        $ odacli update-repository -f oda-asm-zipfile1,zipfile2,zipfile3,zipfile4


            You can now follow the rest of the guided steps to patch ODA from 18.8 to 19.9

      • When running the post-upgrade checks, you’ll be asked to delete the snapshots

        [root@odadev1]# ./odacli update-server-postcheck -v 19.6.0.0.0
        Comp Pre-Check Status Comments
        ---- --------------- -------- ---------------------------------
        OS ODABR snapshot WARNING ODABR snapshot found. Run 'odabr delsnap'

        -- Delete the snapshots
        [root@odadev1]# /opt/odabr/odabr delsnap
        INFO: 2022-11-07 20:44:55: Removing LVM snapshots
        SUCCESS: 2022-11-07 20:44:55: ...snapshot for 'opt' removed successfully
        SUCCESS: 2022-11-07 20:44:55: ...snapshot for 'u01' removed successfully
        SUCCESS: 2022-11-07 20:44:56: ...snapshot for 'root' removed successfully



       
      Recovering from a Failed Operating System Upgrade

      In case things go south, we can always roll back since we have a restore point.

      1. Download the ODARescue Live Disk ISO image for the 19.6 release to enable booting the node on which the OS upgrade failed (see Oracle Support Note 2495272.1).
        Then configure the ODA system in Oracle ILOM to boot from the ISO image.

      2. Specify the NFS location, including the IP address and path with file name, for the ISO image.

        -> set /SP/services/kvms/host_storage_device/remote server_URI=nfs://10.10.1.1:/export/iso/ODARescue_LiveDisk.iso
      3. Configure the ISO image from the Oracle ILOM Service Processor (SP) serial console so that you can use the ISO image to boot the Oracle Database Appliance system.

        -> set /SP/services/kvms/host_storage_device/ mode=remote
        -> set /HOST boot_device=cdrom

      4. Reboot the ODA host from ILOM using ODARescue ISO image.

      5. Log in as the root user with password "welcome1" (the user "odalive" can also be used).

      6. If you decide to revert to the Oracle Linux 6 configuration after troubleshooting, then run the command below:

        # odarescue ol6restore
        ol6restore will restore:
        boot/efi partition
        LVM snapshots (root, opt, u01)
        grub v1

        This command restores the Oracle Linux 6 configuration using the snapshots that were taken using ODABR.

      Conclusion

      • ODABR is a very convenient tool that can help you back up & recover your server from OS corruption
      • We also learned how to reduce the snapshot footprint before upgrading the ODA from 18.8 to 19.6
      • With this in mind, you can patch your ODA to 19.6 safely even if your free space is lower than 190GB
      • I hope this helps you learn more about this tool, which got me curious back when I first patched an ODA to 19.6 a couple of years ago

              Thank you for reading

      Monday, November 14, 2022

      ExaC@C DB state failed in OCI console while up & running in reality (fix)



      Intro

      Exadata Cloud@Customer has the particularity of bringing the best of both worlds, where on-premises data sovereignty meets the innovation & capabilities of the cloud. Thanks to the control plane network that links the ExaCC servers to OCI, users can create and manage resources through the Console or any API-based cloud tooling (Terraform, OCI CLI, SDKs, ...). Everything you do on the ExaC@C is synchronized into OCI through that layer.


      Issue of the day

      I’ll describe a small glitch that sometimes happens to a database resource. It has no impact on the database itself, because under ExaC@C it works just fine. However, as you can see below, databases are marked as FAILED while they are actually up and running (and accessible).

      +-------------+-----------+------------------------------------+-----------+
      | Unique-Name | charset   | id                                 | state     |
      +-------------+-----------+------------------------------------+-----------+
      | MYCDB1_DOM  | AL32UTF8  | ocid1.database.oc1.ca-toronto-1.xxa|  FAILED   |
      +-------------+-----------+------------------------------------+-----------+
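      The same state can also be pulled with any OCI tooling. A hedged OCI CLI sketch (assuming the CLI is configured; the compartment OCID is a placeholder and the JMESPath field names may need adjusting to your CLI version):

      $ oci db database list --compartment-id <compartment_ocid> \
          --query 'data[].{name:"db-unique-name", state:"lifecycle-state"}' \
          --output table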


      State

      We need to be mindful of what the state column really means. It’s quite self-explanatory after a deployment attempt, but for an existing DB, the state usually indicates whether the database resource is up or down. In our case, however, OCI couldn’t detect the resource anymore, hence the state showing “FAILED”.
      But before delving into it, let’s review how ExaCC database resources are seen & registered on the OCI side.


      Database registration in ExaCC


      DB registration allows you to perform admin tasks on the ExaC@C database through the OCI console & cloud tooling.
      Each database created in Exadata Cloud@Customer using the API/Console is automatically registered in OCI.
      There are a few exceptions, where OCI allows for a manual registration:
       - A database that you manually created on Exadata Cloud@Customer using DBCA
       - An existing database that you migrated from another platform to Exadata Cloud@Customer
      This is done through the dbaascli registerdb function; read more on Registering a Database.

      Files created after registration
      Each registered database will generate a cloud registration file (DBname.ini) located under the below directory.

      $ ll /var/opt/oracle/creg/*ini
      MYCDB1.ini


      Troubleshooting 

      I first decided to check a workaround described below:
      Doc ID 2764524.1 EXACS DBs Show Wrong State (Failed) on OCI Webconsole

      Cause: DBs registered in CRS with dbname in lowercase (dborcl) instead of uppercase (DBORCL).
      Suggested solution: Create a symbolic link to creg db ini file to match the case for the db name registered in CRS.

      Outcome: this didn’t fix my problem, so I opened an SR to get to the bottom of it.


      Diagnosis

      This took help from Support, as they have a better view of the control plane resource metadata. Taking a look at the cloud registration file content, we can see that it contains DB information usually present in CRS, plus a few parameters present in the spfile.

      $ more /var/opt/oracle/creg/MYCDB1.ini

      #################################################################
      # This file is automatically generated by database as a service #
      #################################################################
      acfs_vol_dir=/var/opt/oracle/dbaas_acfs
      acfs_vol_sizegb=10
      agentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb
      aloc=/var/opt/oracle/ocde/assistants
      archlog=yes
      bkup_asm_spfile=+DATA1/MYCDB1_DOM/spfilemycdb1.ora

      Agent resource id
      Notice the agentdbid in the .ini registration file. The agent resource id is actually the id that the control plane layer uses to identify & interact with the DB:
      agentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb

      On top of the registration file, the agent id is also written in a .rec file under /var/opt/oracle/dbaas_acfs/<DBNAME>:

      $ more /var/opt/oracle/dbaas_acfs/MYCDB1/83112625-52d2-4b39-b98xx.rec
      {
         "agentdbid" : "83112625-52d2-4b39-b987-1b0d7d2d70cb" }


      Root cause

      According to OCI support, the Agent Resource ID seen in the control plane UI console was somehow different from the agentdbid in the corresponding *.ini file.


      Solution

      Take note of the agent id communicated by the support engineer & replace the id in the .ini and the .rec file.

      • Take a backup of the {DBNAME}.ini file of the affected DBs on all DB nodes

      sudo su - oracle
      $ cd /var/opt/oracle/creg
      $ cp /var/opt/oracle/creg/MYCDB1.ini /var/opt/oracle/creg/MYCDB1.ini.old

      • Modify the ID in the {DBNAME}.ini file of the DB with the value of the Agent Resource ID seen in the support console.

      -- Replace the agentdbid= value with 47098321-43d1-4b44-b997-1b0d5d1d90cb

      $ vi /var/opt/oracle/creg/MYCDB1.ini

      • Remove the old .rec file with the wrong resource id and create a new .rec file containing the right agent id

      $ rm /var/opt/oracle/dbaas_acfs/MYCDB1/83112625-52d2-4b39-b987-1b0d7d2d70cb.rec

      $ vi /var/opt/oracle/dbaas_acfs/MYCDB1/47098321-43d1-4b44-b997-1b0d5d1d90cb.rec

      {
         "agentdbid" : "47098321-43d1-4b44-b997-1b0d5d1d90cb" << new value }
      • After the change, wait for an hour or so for the Control Plane to get in sync, then verify the DB state

      +-------------+-----------+------------------------------------+-----------+
      | Unique-Name | charset   | id                                 | state     |
      +-------------+-----------+------------------------------------+-----------+
      | MYCDB1_DOM  | AL32UTF8  | ocid1.database.oc1.ca-toronto-1.xxa| AVAILABLE |
      +-------------+-----------+------------------------------------+-----------+

       

      Can we spot the actual agent id in OCI?

      As an end user, you can't see the agent resource id in your console; it is unfortunately internal control plane metadata. This means you will have to open an SR each time an issue like this happens. However, I have opened an enhancement request to allow visibility of the control plane agent id for end users.



      Conclusion

      • We can say that a FAILED database state in the OCI console doesn’t always mean the resource is down
      • It is possible that a database migrated from another platform could lead to this phenomenon
      • There is no way, as of now, for you to know the agent resource id that the control plane is seeing
      • Hopefully, visibility of control plane metadata like the agent resource id can be achieved in a future release
      • Until then, this workaround can still help those who face such behaviour

              Thank you for reading