
Monday, November 14, 2022

ExaC@C DB state failed in OCI console while up & running in reality (fix)



Intro

Exadata Cloud@Customer has the particularity of bringing the best of both worlds, where on-premises data sovereignty meets the innovation & capabilities of the Cloud. Thanks to the control plane network that links the ExaCC servers to OCI, users can create and manage resources through the Console or any API-based cloud tooling (Terraform, OCI CLI, SDK...). Everything you do on ExaC@C is synchronized to OCI through that layer.


Issue of the day

I’ll describe a small glitch that sometimes happens to a database resource. It has no impact on the database itself, which keeps working just fine on ExaC@C. However, as the listing below shows, the database is marked as failed in OCI while it is actually up, running and accessible.

+-------------+-----------+------------------------------------+-----------+
| Unique-Name | charset   | id                                 | state     |
+-------------+-----------+------------------------------------+-----------+
| MYCDB1_DOM  | AL32UTF8  | ocid1.database.oc1.ca-toronto-1.xxa|  FAILED   |
+-------------+-----------+------------------------------------+-----------+


State

We need to be mindful of what the state column really means. It is self-explanatory right after a deployment attempt, but for an existing DB the state usually tells whether the database resource is up or down. In our case, however, OCI couldn’t detect the resource anymore, hence the “FAILED” state.
But before delving into it, let’s review how ExaCC database resources are seen & registered on the OCI side.


Database registration in ExaCC


DB registration lets you perform admin tasks on the ExaC@C database through the OCI console & cloud tooling.
Each database created in Exadata Cloud@Customer through the API or the Console is automatically registered in OCI.
There are a few exceptions, where OCI allows a manual registration instead:
   - A database that you manually created on Exadata Cloud@Customer using DBCA
   - An existing database that you migrated from another platform to Exadata Cloud@Customer
  This is done through the dbaascli registerdb function; read more in Registering a Database.

Files created after registration
Each registered database will generate a cloud registration file (DBname.ini) located under the below directory.

$ ll /var/opt/oracle/creg/*ini
MYCDB1.ini


Troubleshooting 

I first checked the workaround described in the note below.
Doc ID 2764524.1 EXACS DBs Show Wrong State (Failed) on OCI Webconsole

Cause: DBs registered in CRS with dbname in lowercase (dborcl) instead of uppercase (DBORCL).
Suggested solution: Create a symbolic link to creg db ini file to match the case for the db name registered in CRS.

Outcome: This didn’t fix my problem so I opened an SR to get to the bottom of this.  


Diagnosis

This took help from support, as they have a better view of the control plane resource metadata. Looking at the content of the cloud registration file, we can see that it contains DB information usually present in CRS plus a few parameters present in the spfile.

$ more /var/opt/oracle/creg/MYCDB1.ini

#################################################################
# This file is automatically generated by database as a service #
#################################################################
acfs_vol_dir=/var/opt/oracle/dbaas_acfs
acfs_vol_sizegb=10
agentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb
aloc=/var/opt/oracle/ocde/assistants
archlog=yes
bkup_asm_spfile=+DATA1/MYCDB1_DOM/spfilemycdb1.ora

Agent resource id
Notice the agentdbid in the .ini registration file. The agent resource id is the id that the control plane layer uses to identify & interact with the DB.
agentdbid=83112625-52d2-4b39-b987-1b0d7d2d70cb

On top of the registration file, the agent id is also written in a rec file under /var/opt/oracle/dbaas_acfs/<DBNAME>

$ more /var/opt/oracle/dbaas_acfs/MYCDB1/83112625-52d2-4b39-b98xx.rec
{
   "agentdbid" : "83112625-52d2-4b39-b987-1b0d7d2d70cb" }


Root cause

According to OCI support, the Agent Resource ID seen in the control plane UI console was, for some reason, different from the agentdbid in the corresponding *.ini file.
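Once support has communicated the Agent Resource ID they see in the control plane, the local side of the comparison can be scripted. What follows is only a minimal sketch: the function names are hypothetical helpers (not part of dbaastools), and it assumes the .ini file carries a single agentdbid=... line and the .rec file is the small JSON document shown earlier.

```shell
#!/usr/bin/env bash
# Sketch: compare the locally stored agentdbid with the ID given by support.
# Helper names are hypothetical; file layout follows the examples in this post.

# Print the agentdbid value from a creg .ini file (key=value format).
get_ini_agentdbid() {
  sed -n 's/^agentdbid=//p' "$1" | head -n 1
}

# Print the agentdbid value from a .rec file (JSON, possibly on several lines).
get_rec_agentdbid() {
  tr -d '\n' < "$1" |
    sed -n 's/.*"agentdbid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Return 0 when both files carry the expected ID, non-zero otherwise.
check_agentdbid() {
  local ini_file=$1 rec_file=$2 expected=$3
  local ini_id rec_id
  ini_id=$(get_ini_agentdbid "$ini_file")
  rec_id=$(get_rec_agentdbid "$rec_file")
  echo "ini: ${ini_id:-<none>}  rec: ${rec_id:-<none>}  expected: $expected"
  [ "$ini_id" = "$expected" ] && [ "$rec_id" = "$expected" ]
}

# Example call (paths from this post, <ID_FROM_SUPPORT> is a placeholder):
# check_agentdbid /var/opt/oracle/creg/MYCDB1.ini \
#   /var/opt/oracle/dbaas_acfs/MYCDB1/<ID>.rec <ID_FROM_SUPPORT>
```

A non-zero return code simply means that at least one of the two local files disagrees with the ID support gave you, which is the situation described above.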


Solution

Take note of the agent id communicated by the support engineer, then replace the id in both the .ini and the .rec file.

  • Take a backup of the {DBNAME}.ini file of each affected database on all DB nodes

$ sudo su - oracle
$ cd /var/opt/oracle/creg
$ cp /var/opt/oracle/creg/MYCDB1.ini /var/opt/oracle/creg/MYCDB1.ini.old

  • Replace the agentdbid in the {DBNAME}.ini file of the DB with the Agent Resource ID seen in the support console.

-- Replace the value of agentdbid= with 47098321-43d1-4b44-b997-1b0d5d1d90cb

$ vi /var/opt/oracle/creg/MYCDB1.ini

  • Remove the old rec file, which is named after the wrong agent id, and create a new rec file named after the right one

$ rm /var/opt/oracle/dbaas_acfs/MYCDB1/83112625-52d2-4b39-b987-1b0d7d2d70cb.rec

$ vi /var/opt/oracle/dbaas_acfs/MYCDB1/47098321-43d1-4b44-b997-1b0d5d1d90cb.rec

{
   "agentdbid" : "47098321-43d1-4b44-b997-1b0d5d1d90cb" }       << new value

  • After the change, wait an hour or so for the Control Plane to get in sync, then verify the DB state

+-------------+-----------+------------------------------------+-----------+
| Unique-Name | charset   | id                                 | state     |
+-------------+-----------+------------------------------------+-----------+
| MYCDB1_DOM  | AL32UTF8  | ocid1.database.oc1.ca-toronto-1.xxa| AVAILABLE |
+-------------+-----------+------------------------------------+-----------+
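The manual steps above could be wrapped in a small shell function so that they are applied the same way on every node. This is a sketch under assumptions, not a supported procedure: replace_agentdbid is a hypothetical helper, it assumes the .rec file only contains the agentdbid key as in the example above, and the new ID must be the one provided by support. Try it on a copy of the files first.

```shell
#!/usr/bin/env bash
# Sketch of the manual fix as one function (hypothetical helper).
# Arguments: creg directory, rec directory, DB name, new agent id from support.
replace_agentdbid() {
  local creg_dir=$1 rec_dir=$2 dbname=$3 new_id=$4
  local ini_file="$creg_dir/$dbname.ini"
  local old_id

  old_id=$(sed -n 's/^agentdbid=//p' "$ini_file" | head -n 1)
  [ -n "$old_id" ] || { echo "no agentdbid found in $ini_file" >&2; return 1; }

  # 1. Back up the registration file, as done by hand in the post.
  cp -p "$ini_file" "$ini_file.old" || return 1

  # 2. Rewrite the .ini from the backup with the new agentdbid value.
  #    Redirecting into the existing file keeps its owner and permissions.
  sed "s/^agentdbid=.*/agentdbid=$new_id/" "$ini_file.old" > "$ini_file" || return 1

  # 3. Create the new .rec file, then drop the one named after the old id.
  printf '{\n   "agentdbid" : "%s" }\n' "$new_id" > "$rec_dir/$new_id.rec" || return 1
  if [ "$old_id" != "$new_id" ]; then
    rm -f "$rec_dir/$old_id.rec"
  fi
}

# Example call with the values used in this post (run as oracle on each node):
# replace_agentdbid /var/opt/oracle/creg /var/opt/oracle/dbaas_acfs/MYCDB1 \
#   MYCDB1 47098321-43d1-4b44-b997-1b0d5d1d90cb
```

The backup is taken before anything is modified, so restoring MYCDB1.ini.old is enough to undo the change to the registration file.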

 

Can we spot the actual agent id in OCI ?

As an end user, you can't see the agent resource id in your console. It is unfortunately internal metadata of the control plane. This means you will have to open an SR each time an issue like this happens. However, I have opened an enhancement request to make the control plane agent id visible to end users.



Conclusion

  • A failed database state in the OCI console doesn’t always mean the resource is down
  • A database migrated from another platform could possibly lead to this phenomenon
  • As of now, there is no way for you to know the agent resource id that the control plane is seeing
  • Hopefully, control plane metadata like the agent resource id will become visible in a future release
  • Until then, this workaround can still help those who face such behaviour

        Thank you for reading

Monday, April 4, 2022

ExaCC dbaascli command with mysterious PILOT error when creating a PDB


Intro

I recently hit a silly error while running dbaascli on an ExaCC cluster, and the java output was not helpful at all. This is a super short post that shows where to find dbaascli execution logs, along with an example of what can break your command execution depending on who runs it.


My error

I was just creating a PDB the other day after provisioning a CDB using dbaasapi, but when I ran the command I got this odd error message.

  • Command output
[root@clvmd01 ~]# dbaascli pdb create --dbname MYCDB --pdbname UAT
DBAAS CLI version 21.4.1.1.0
Executing command pdb create --pdbname UAT
Job id: dbac1ebe-5a81-4bfa-b8da-a01177359abd
Loading PILOT...
[FATAL] [DBAAS-60022] Command '/var/opt/oracle/dbaastools/pilot/bin/pilot -plugin create_pdb_cloud 
PLUGIN_OPERATION_TYPE="create" DBNAME="MYCDB" PDBNAME="UAT" DBAASAPI_JOB_ID="dbac1ebe-5a81-4bfa-b8da-a01177359abd"
-logLevel FINE -logDir /var/opt/oracle/log/MYCDB/pdb/create -silent -checkpointDir /var/opt/oracle/log/pilot_checkpoints
-jreLoc /usr/java/jdk1.8.0_291-amd64/jre ' execution has failed on nodes [localnode].
ACTION: Refer application log file for more information.
*MORE DETAILS*
Result of node:localnode
ERRORS:
Exception in thread "main" java.lang.IllegalMonitorStateException     <------ what is that??
        at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:371)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
        at oracle.install.commons.pilot.JobManagementHelper.createSubmittedJobInfo(JobManagementHelper.java:179)
        at oracle.install.commons.pilot.PilotApplicationHandler.executeApp(PilotApplicationHandler.java:428)
        at oracle.install.commons.pilot.PilotApplicationHandler.performOperation(PilotApplicationHandler.java:253)
        at oracle.install.commons.pilot.PilotApplication.execute(PilotApplication.java:210)
        at oracle.install.commons.pilot.PilotApplication.startup(PilotApplication.java:142)
        at oracle.install.commons.pilot.PilotApplication.main(PilotApplication.java:428)
[Loading PILOT...]
Exit code of the operation:1

PILOT failing to load

  • IllegalMonitorStateException and the rest of the java stack meant nothing to me, and I kept getting the same error over and over.


Where to look for API execution logs in ExaCC?

When a database deployment is created on Oracle Database ExaC@C, the log files from the creation operation are stored in subdirectories of /var/opt/oracle/log. I had actually forgotten the path when I hit this problem, but a colleague reminded me of it. This log directory doesn’t only store database creation logs but much more, as shown in the dbaastools log layout.

dbaastools LOGs

Example: under the CDB that I have previously created in my vm cluster you can see the variety of subfolders

[oracle@clvmd01]$ ll /var/opt/oracle/log/MYCDB
drwxrwx--- 2 oracle oinstall 4096 Mar 23 15:12 bkup
drwxr-xr-x 2 oracle oinstall 4096 Mar 24 11:26 bkup_api_log
drwxrwx--- 2 oracle oinstall 4096 Mar 24 00:01 cleandblogs --> adrci cleanup
drwxrwx--- 2 oracle oinstall 4096 Mar 23 15:12 creg --> cloud db registration
drwxrwx--- 3 oracle oinstall   20 Mar 23 13:55 database  /create
drwxrwx--- 3 oracle oinstall   16 Mar 23 13:55 dbaasapi /db /createdb 
drwxrwx--- 2 oracle oinstall 4096 Mar 24 00:26 obkup
drwxr-xr-x 5 oracle oinstall   46 Mar 24 11:29 pdb /create /open /delete /close
drwxrwx--- 2 oracle oinstall    6 Mar 24 11:27 rman

  • As you can see, the log location holds logs for several other tools and operations. You will find log directories for dbaascli, dbaasapi, bkup_api, obkup, rman and ADR operations, which is very handy when you have several databases to manage.
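With one directory per operation, the quickest route to the relevant log is usually "the newest file in the matching subdirectory". The helpers below are a minimal sketch of that idea; latest_log and show_errors are hypothetical names, and the search patterns are just the two strings that mattered in this post.

```shell
#!/usr/bin/env bash
# Sketch: locate the newest log of an operation and show its error lines.
# Both helpers are hypothetical conveniences, not part of dbaastools.

# Print the full path of the most recently modified entry in a directory.
latest_log() {
  local log_dir=$1 name
  name=$(ls -t "$log_dir" 2>/dev/null | head -n 1)
  if [ -z "$name" ]; then
    return 1
  fi
  printf '%s/%s\n' "$log_dir" "$name"
}

# Print the lines (with line numbers) that mention an exception or a
# permission problem in the given log file.
show_errors() {
  grep -nE 'Exception|Permission denied' "$1"
}

# Example (path taken from this post):
# log=$(latest_log /var/opt/oracle/log/MYCDB/pdb/create) && show_errors "$log"
```

Sorting by modification time avoids having to guess the timestamp embedded in the log file name.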

Root cause

Now back to our small issue. Since I know where to find my dbaascli command log, let’s find out why the pilot was crashing.

[oracle@clvmd01]$ cat /var/opt/oracle/log/MYCDB/pdb/create/pilot_2022-03-23_04-26-45-PM

Refer associated stacktrace #oracle.install.commons.pilot.JobManagementHelper:
---# Begin Stacktrace #---------------------------
ID: oracle.install.commons.pilot.JobManagementHelper:78
java.io.FileNotFoundException: /var/opt/oracle/log/pilot_checkpoints/conf/jobs/lastJob.inf.lck (Permission denied)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at oracle.install.commons.pilot.util.ConcurrentFileLockUtil.acquireLock(ConcurrentFileLockUtil.java:152)
        at oracle.install.commons.pilot.JobManagementHelper.createSubmittedJobInfo(JobManagementHelper.java:129)
        at oracle.install.commons.pilot.PilotApplicationHandler.executeApp(PilotApplicationHandler.java:428)
        at oracle.install.commons.pilot.PilotApplicationHandler.performOperation(PilotApplicationHandler.java:253)
        at oracle.install.commons.pilot.PilotApplication.execute(PilotApplication.java:210)
        at oracle.install.commons.pilot.PilotApplication.startup(PilotApplication.java:142)
        at oracle.install.commons.pilot.PilotApplication.main(PilotApplication.java:428)
---# End Stacktrace #-----------------------------
INFO: [2022-03-23 16:26:45.670 CDT][Thread-2][PilotApplicationHelper$1.run:227] Shutting down the pilot application.

When I checked the job files behind the concurrent lock, I realized that the lock file had been created by root on the first run, so all subsequent runs as oracle couldn’t open it, as shown below.

[oracle@clvmd01]$ ll /var/opt/oracle/log/pilot_checkpoints/conf/jobs/
-rw-r----- 1 oracle oinstall 54 Mar 23 14:41 lastJob.inf
-rw-rw---- 1 root   root      0 Mar 23 12:53 lastJob.inf.lck


Solution

I just changed the ownership of that lock file so that the next run of the create pdb job could write to the lck file, which fixed my problem.

[root@clvmd01]# chown oracle:oinstall /var/opt/oracle/log/pilot_checkpoints/conf/jobs/lastJob.inf.lck
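To spot this kind of ownership drift before the next dbaascli run, the checkpoint directory can be scanned for entries that don't belong to the expected user. The snippet is a minimal sketch: list_foreign_owned is a hypothetical helper built on the standard find -user test, and the assumption that everything under that directory should belong to oracle comes from this single case only.

```shell
#!/usr/bin/env bash
# Sketch: list entries under a directory that are NOT owned by a given user.
# list_foreign_owned is a hypothetical helper; the user may be a name or a
# numeric UID, as accepted by find's -user test.
list_foreign_owned() {
  local dir=$1 owner=$2
  find "$dir" ! -user "$owner" -print
}

# Example (path and owner taken from this post):
# list_foreign_owned /var/opt/oracle/log/pilot_checkpoints/conf/jobs oracle
# Any path printed is a candidate for the chown shown above.
```

An empty output means every entry already belongs to the expected user.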


Conclusion

Bottom line: although the usage of new cli tools is growing rapidly, knowing the log location of these api based executions (dbaastools) is as important as knowing their syntax. You will always find more in the log directory than in the command output they show.

        Thank you for reading