Monday, November 14, 2022

ExaC@C DB state failed in OCI console while up & running in reality (fix)



Intro

Exadata Cloud@Customer (ExaC@C) brings the best of both worlds: on-premises data sovereignty meets the innovation and capabilities of the cloud. Thanks to the control plane network that links the ExaC@C servers and OCI, users can create and manage resources through the Console or any API-based cloud tooling (Terraform, OCI CLI, SDKs, etc.). Everything you do on ExaC@C is synchronized with OCI through that layer.


Issue of the day

I’ll describe a small glitch that sometimes affects a database resource. It has no impact on the database itself, which keeps working just fine on ExaC@C. However, as the listing below shows, the database is marked as FAILED in OCI while it is actually up and running (and accessible).

+-------------+-----------+------------------------------------+-----------+
| Unique-Name | charset   | id                                 | state     |
+-------------+-----------+------------------------------------+-----------+
| MYCDB1_DOM  | AL32UTF8  | ocid1.database.oc1.ca-toronto-1.xxa|  FAILED   |
+-------------+-----------+------------------------------------+-----------+


State

We need to be mindful of what the state column really means. It’s self-explanatory right after a deployment attempt, but for an existing DB the state usually indicates whether the database resource is up or down. In our case, however, OCI could no longer detect the resource, hence the “FAILED” state.
But before delving into it, let’s review how ExaC@C database resources are seen and registered on the OCI side.


Database registration in ExaCC


DB registration allows you to perform admin tasks on the ExaC@C database through the OCI console and cloud tooling.
Each database created in Exadata Cloud@Customer through the API or Console is automatically registered in OCI.
There are a few exceptions, for which OCI allows a manual registration:
   - A database that you created manually on Exadata Cloud@Customer using DBCA
   - An existing database that you migrated from another platform to Exadata Cloud@Customer
  This is done through the dbaascli registerdb function; read more in Registering a Database.

Files created after registration
Each registered database will generate a cloud registration file (DBname.ini) located under the below directory.

$ ll /var/opt/oracle/creg/*ini
MYCDB1.ini


Troubleshooting 

I first tried the workaround described in the following My Oracle Support note:
Doc ID 2764524.1 EXACS DBs Show Wrong State (Failed) on OCI Webconsole

Cause: DBs registered in CRS with dbname in lowercase (dborcl) instead of uppercase (DBORCL).
Suggested solution: Create a symbolic link to creg db ini file to match the case for the db name registered in CRS.

Outcome: This didn’t fix my problem, so I opened an SR to get to the bottom of it.
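
For reference, the symbolic-link workaround from the note could look roughly like the sketch below. It runs against a scratch directory rather than /var/opt/oracle/creg, and the direction of the link (an existing DBORCL.ini, with a lowercase dborcl.ini link added to match CRS) is my assumption; the helper name link_ini_for_crs_case is hypothetical. Check the note itself before trying anything similar on a real node.

```shell
#!/usr/bin/env bash
# Sketch only: works in a scratch directory, not in /var/opt/oracle/creg.
set -euo pipefail

# link_ini_for_crs_case CREG_DIR EXISTING_INI CRS_DBNAME   (hypothetical helper)
# Adds CREG_DIR/CRS_DBNAME.ini as a symlink to EXISTING_INI when it is missing.
link_ini_for_crs_case() {
  local creg_dir="$1" existing_ini="$2" crs_dbname="$3"
  local link="$creg_dir/$crs_dbname.ini"
  if [ -e "$link" ]; then
    echo "already present: $link"
  else
    # Relative target: the link resolves inside the same directory.
    ln -s "$existing_ini" "$link"
    echo "linked: $link -> $existing_ini"
  fi
}

# Demo with the names used in the note (assumed file layout).
demo_dir="$(mktemp -d)"
echo "dbname=DBORCL" > "$demo_dir/DBORCL.ini"
link_ini_for_crs_case "$demo_dir" "DBORCL.ini" "dborcl"
ls -l "$demo_dir"
```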


Diagnosis

This took help from support, as they have a better view of the control plane resource metadata. Looking at the content of the cloud registration file, we can see that it contains DB information usually present in CRS, plus a few parameters present in the spfile.

$ more /var/opt/oracle/creg/MYCDB1.ini

#################################################################
# This file is automatically generated by database as a service #
#################################################################
acfs_vol_dir=/var/opt/oracle/dbaas_acfs
acfs_vol_sizegb=10
agentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb
aloc=/var/opt/oracle/ocde/assistants
archlog=yes
bkup_asm_spfile=+DATA1/MYCDB1_DOM/spfilemycdb1.ora

Agent resource id
Notice the agentdbid in the .ini registration file. The agent resource id is the id that the control plane layer uses to identify and interact with the DB.
agentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb

On top of the registration file, the agent id is also written to a .rec file under /var/opt/oracle/dbaas_acfs/<DBNAME>

$ more /var/opt/oracle/dbaas_acfs/MYCDB1/83112625-52d2-4b39-b98xx.rec
{
   "agentdbid" : "83112625-52d2-4b39-b987-1b0d7d2d70cb" }
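
Since the same id lives in two files, a quick local consistency check can be scripted. The sketch below is only an illustration: the function names are hypothetical, it assumes the key=value and JSON layouts shown above, and the demo uses scratch copies of the sample content. It compares the two local files with each other; it cannot tell you what the control plane holds.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the agentdbid value from a creg .ini file (key=value lines).
ini_agent_id() {
  awk -F= '$1 == "agentdbid" { print $2; exit }' "$1"
}

# Print the agentdbid value from a .rec file (small JSON document).
rec_agent_id() {
  tr -d '\n' < "$1" |
    sed -n 's/.*"agentdbid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Compare both values; returns non-zero when they differ or are empty.
check_agent_ids() {
  local ini_file="$1" rec_file="$2" ini_id rec_id
  ini_id="$(ini_agent_id "$ini_file")"
  rec_id="$(rec_agent_id "$rec_file")"
  echo "ini: $ini_id"
  echo "rec: $rec_id"
  if [ -n "$ini_id" ] && [ "$ini_id" = "$rec_id" ]; then
    echo "MATCH"
  else
    echo "MISMATCH"
    return 1
  fi
}

# Demo on scratch copies of the sample content shown above.
demo_dir="$(mktemp -d)"
printf 'archlog=yes\nagentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb\n' \
  > "$demo_dir/MYCDB1.ini"
printf '{\n   "agentdbid" : "83112625-52d2-4b39-b987-1b0d7d2d70cb" }\n' \
  > "$demo_dir/sample.rec"
check_agent_ids "$demo_dir/MYCDB1.ini" "$demo_dir/sample.rec"
```

On a real node you would point check_agent_ids at the .ini file under /var/opt/oracle/creg and the .rec file under /var/opt/oracle/dbaas_acfs/<DBNAME>.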


Root cause

According to OCI support, the Agent Resource ID seen in the control plane UI console had somehow become different from the agentdbid in the corresponding *.ini file.


Solution

Take note of the agent id communicated by the support engineer and replace the id in the .ini and the .rec files.

  • Take a backup of the {DBNAME}.ini file of each affected DB on all DB nodes

$ sudo su - oracle
$ cd /var/opt/oracle/creg
$ cp /var/opt/oracle/creg/MYCDB1.ini /var/opt/oracle/creg/MYCDB1.ini.old

  • Replace the id in the {DBNAME}.ini file of the DB with the Agent Resource ID value seen in the support console.

-- Replace the value of agentdbid= with 47098321-43d1-4b44-b997-1b0d5d1d90cb

$ vi /var/opt/oracle/creg/MYCDB1.ini

  • Remove the old .rec file with the wrong agent id and replace it with a new .rec file containing the correct agent id

$ rm /var/opt/oracle/dbaas_acfs/MYCDB1/83112625-52d2-4b39-b987-1b0d7d2d70cb.rec

$ vi /var/opt/oracle/dbaas_acfs/MYCDB1/47098321-43d1-4b44-b997-1b0d5d1d90cb.rec

{
   "agentdbid" : "47098321-43d1-4b44-b997-1b0d5d1d90cb" }   << new value

  • After the change, wait an hour or so for the control plane to sync, then verify the DB state

+-------------+-----------+------------------------------------+-----------+
| Unique-Name | charset   | id                                 | state     |
+-------------+-----------+------------------------------------+-----------+
| MYCDB1_DOM  | AL32UTF8  | ocid1.database.oc1.ca-toronto-1.xxa| AVAILABLE |
+-------------+-----------+------------------------------------+-----------+
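
The manual steps above can also be sketched as a small script. This is a minimal illustration under assumptions, not a supported procedure: replace_agent_id is a hypothetical helper, the demo runs on scratch copies instead of /var/opt/oracle, and the new id must be the one given to you by support. It would still have to be repeated on every DB node.

```shell
#!/usr/bin/env bash
set -euo pipefail

# replace_agent_id INI_FILE REC_DIR OLD_ID NEW_ID   (hypothetical helper)
replace_agent_id() {
  local ini_file="$1" rec_dir="$2" old_id="$3" new_id="$4"

  # Nothing to do if the ids already agree (also protects the rm below).
  if [ "$old_id" = "$new_id" ]; then
    echo "ids are identical, nothing changed"
    return 0
  fi

  # 1. Keep a copy of the registration file, as in the manual steps.
  cp -p "$ini_file" "$ini_file.old"

  # 2. Rewrite only the agentdbid line, reading from the backup copy.
  sed "s/^agentdbid=.*/agentdbid=$new_id/" "$ini_file.old" > "$ini_file"

  # 3. Write the new .rec file first, then drop the old one.
  printf '{\n   "agentdbid" : "%s" }\n' "$new_id" > "$rec_dir/$new_id.rec"
  rm -f "$rec_dir/$old_id.rec"
}

# Demo on scratch copies, using the ids from this post.
old_id="83112625-52d2-4b39-b987-1b0d7d2d70cb"
new_id="47098321-43d1-4b44-b997-1b0d5d1d90cb"
demo_dir="$(mktemp -d)"
printf 'archlog=yes\nagentdbid=%s\n' "$old_id" > "$demo_dir/MYCDB1.ini"
printf '{\n   "agentdbid" : "%s" }\n' "$old_id" > "$demo_dir/$old_id.rec"

replace_agent_id "$demo_dir/MYCDB1.ini" "$demo_dir" "$old_id" "$new_id"
grep '^agentdbid=' "$demo_dir/MYCDB1.ini"
ls "$demo_dir"
```

Writing the new .rec file before removing the old one means a failure midway never leaves the directory without any .rec file.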

 

Can we spot the actual agent id in OCI?

As an end user, you can't see the agent resource id in your console. It is unfortunately internal metadata of the control plane. This means you will have to open an SR each time an issue like this happens. However, I have opened an enhancement request to make the control plane agent id visible to end users.



Conclusion

  • A failed database state in the OCI console doesn’t always mean the resource is down
  • A database migrated from another platform may be prone to this behaviour
  • As of now, there is no way for you to know the agent resource id that the control plane sees
  • Hopefully, control plane metadata like the agent resource id will become visible in a future release
  • Until then, this workaround can still help those who face such behaviour

        Thank you for reading
