Wednesday, February 1, 2023

Terraform tips: How to Recover Your Deployment From a Terraform apply Crash

-- “Because sh*t happens, be ready for when it does!” --


Intro

Infrastructure automation is a lifesaver for Ops teams in their day-to-day duties, but when it comes to giving full control to new tools and frameworks, watch out for problems that can no longer be fixed manually. It’s also a reminder that it is easier to create a mess than to clean one up.

Today, we'll show you how to recover from a failed terraform apply when half of the resources have already been provisioned but an error occurred in the middle of the deployment. This can be especially challenging if you have a large number of resources and you can't destroy them, either in the console or through terraform destroy. Not to mention all the $$ your cloud provider will keep charging for the unusable resources, yikes!


I. How Did the Terraform Apply Crash Happen?



How did I get here? It was not even my code; I was actually just minding my business, trying to spin up a Fortinet firewall in Oracle Cloud using their reference architecture (oracle-quickstart).
Here’s a link to the GitHub repo I used, but the code itself is not the point: these config errors can happen anytime.


https://github.com/oracle-quickstart/oci-fortinet/drg-ha-use-case


The scenario is simple:

  1. Clone the GitHub repository that contains the terraform configuration

  2. Adjust the authentication parameters, such as your API keys for the OCI environment, etc.

  3. Run the terraform init command

  4. Run the terraform plan command to make sure everything looks good

  • $ cd ~/forti-firewall/oci-fortinet/use-cases/drg-ha-use-case
    $ terraform init
    $ terraform -v

    Terraform v1.0.3
    + provider registry.terraform.io/oracle/oci v4.105.0
    + provider registry.terraform.io/hashicorp/template v2.2.0

    $ terraform plan

    Plan: 66 to add, 0 to change, 0 to destroy.
  • All the plan checks come back positive, we're ready to go! Let’s focus on the apply:

    $ terraform apply --auto-approve
    oci_core_drg.drg: Creating...
    oci_core_vcn.web[0]: Creating...
    oci_core_app_catalog_listing_resource_version_agreement.mp_image_agreement[0]: Creating...
    oci_core_vcn.hub[0]: Creating...
    oci_core_vcn.db[0]: Creating...
    oci_core_drg_attachment.db_drg_attachment: Creation complete after 14s
    oci_core_drg_route_distribution_statement.firewall_drg_route_distribution_statement_two: Creating...
    oci_core_drg_route_distribution_statement.firewall_drg_route_distribution_statement_one: Creation complete after 1s
    oci_core_drg_route_distribution_statement.firewall_drg_route_distribution_statement_two: Creation complete after 1s
    oci_core_drg_route_table.from_firewall_route_table: Creation complete after 23s [id=ocid1.drgroutetable.oc1.ca-toronto-1.aaaaaaaag3ohsxxxxxxxxxxxxnykq]
    Error: Invalid index

      on network.tf line 240, in resource "oci_core_subnet" "mangement_subnet":
     240: security_list_ids = [data.oci_core_security_lists.allow_all_security.security_lists[0].id]
        ├────────────────
        │ data.oci_core_security_lists.allow_all_security.security_lists is empty list of object

    The given key does not identify an element in this collection value: the collection has no elements.
    ...

       SURPRISE !!!
       Despite our successful terraform plan, our deployment was halted a few minutes later due to the above error.


Now, if I want to fix my code issue, a clean wipe-out is the only way to go (terraform destroy), but is it possible?

Not even close:

  • The deployment stopped halfway through with blocking errors

  • It’s really stuck: we can't terraform destroy to undo our changes, nor can we proceed further

  • The plan is a mess because the dependency graph is now a mess (empty data source element, etc.)


                                             The section below explains why


II. Obvious cleanup options and outcomes (how’d it go?)

  1. Terraform Destroy: doesn’t work; still getting the same data source error, so it’s stuck

    $ terraform destroy --auto-approve

    oci_core_vcn.hub[0]: Refreshing state... [id=ocid1.vcn.oc1.ca-toronto-1xxx]
    oci_core_network_security_group_security_rule.rule_ingress_all: Refreshing state... [id=F44A50]
    oci_core_network_security_group_security_rule.rule_egress_all: Refreshing state... [id=B14C98]
    ...
    Error: Invalid index

      on network.tf line 240, in resource "oci_core_subnet" "mangement_subnet":
     240: security_list_ids = [data.oci_core_security_lists.allow_all_security.security_lists[0].id]
        ├────────────────
        │ data.oci_core_security_lists.allow_all_security.security_lists is empty list of object

    …2 more occurrences of the same error on 2 other security list data sources

  2. Destroy from the Console: same story, resources can’t be destroyed no matter what order I chose.


              I used the super handy OCI Tenancy Explorer, which was suggested to me by Suraj Ramesh on Twitter

The OCI Console kept showing "Won't be deleted" and the delete button was even grayed out due to dependencies.


III. Root Cause of the Crash


All right, so the situation is clear: it’s a stalemate. But we can already infer the following conclusions:

  • Fetching an empty list from a data source in a resource block can cause a deployment to fail miserably

  • `terraform plan` could never detect such errors, because the data source result is only known during the apply

  • Terraform doesn’t have a failsafe mode that allows you to recover when such a bug happens

  • Data sources are to be used with caution :) (see the defensive sketch below)
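
As an illustration, here is a minimal defensive sketch of the failing pattern. The variable names and arguments are hypothetical, not code from the oracle-quickstart repo: the idea is to guard the lookup with length() instead of indexing element [0] directly, so an empty data source result degrades gracefully instead of blowing up mid-apply.

variable "compartment_id" {}
variable "vcn_id" {}

data "oci_core_security_lists" "allow_all_security" {
  compartment_id = var.compartment_id
  vcn_id         = var.vcn_id
}

resource "oci_core_subnet" "management_subnet" {
  # Hypothetical required arguments, for the sketch only.
  cidr_block     = "10.0.1.0/24"
  compartment_id = var.compartment_id
  vcn_id         = var.vcn_id

  # Guard the index: fall back to an empty list when the data source
  # returns no security lists, instead of failing with "Invalid index".
  security_list_ids = (
    length(data.oci_core_security_lists.allow_all_security.security_lists) > 0
    ? [data.oci_core_security_lists.allow_all_security.security_lists[0].id]
    : []
  )
}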

    -- “Time to switch to the hacking mode!” --

IV. Solution

After hours of toiling to find a way out of this jam, I understood that I wouldn't be able to fix the Terraform error while I was mired in the swamp. I had to find a way to get back to square one, which led me to the fix below.

ALL YOU CAN TAINT:

"terraform taint" is a command used to mark a resource as 'tainted' in the state file.This will force the next apply to destroy and recreate of the resource. This is cool but to taint all the resources created so I came up with a bulk command using state list to do that (to taint 33 resources.)

$  terraform state list | grep -v ^data |  xargs -n1 terraform taint
Resource instance oci_core_drg.drg has been marked as tainted.
Resource instance oci_core_drg_attachment.db_drg_att has been marked as tainted.

Resource instance oci_core_vcn.db[0] has been marked as tainted.
Resource instance oci_core_vcn.hub[0] has been marked as tainted.
Resource instance oci_core_vcn.web[0] has been marked as tainted.
Resource instance oci_core_volume.vm_volume-a[0] has been marked as tainted.
Resource instance oci_core_volume.vm_volume-b[0] has been marked as tainted.
...33 resources tainted in total

This bulk taint allows us to clean up all the created resources through terraform destroy, after an implicit refresh.

$ terraform destroy --auto-approve
oci_core_drg.drg: Refreshing state... [id=ocid1.xxx]
oci_core_vcn.hub[0]: Refreshing state... [id=ocid1.vcn.oc1.ca-toronto-1.xxx]
...
Plan: 0 to add, 0 to change, 33 to destroy.
oci_core_route_table.ha_route_table[0]: Destroying
...


Destroy complete! Resources: 33 destroyed.

With all resources wiped out by Terraform, we can now begin anew with a clean slate.




V. What If I Deployed Using Oracle Cloud Resource Manager?


Resource Manager is an OCI solution that enables users to manage, provision, and govern their cloud resources.
It offers a unified view of all deployed resources and uses Terraform under the hood. A deployment is called a stack, where we can load our Terraform configuration to automate and orchestrate deployments (plan-apply-destroy).
This makes sense if you want to keep your deployment configurations centralized in your cloud.

Deploy to Oracle Cloud

The repository I used even had a link to deploy it from OCI Resource Manager, as shown above.
Note: other cloud providers offer a similar service, but with their own IaC languages rather than Terraform.

Solution: Exporting the State and Tainting


If you were running your terraform apply from OCI Resource Manager, then you would have to:

1. Download the configuration from RM (or clone it from GitHub)


2. Import the state file directly from RM (Import State); a CLI alternative is sketched after these steps


 Important: load both the Terraform configuration (unzipped) and the state file in the same directory

3. Taint your resources: after an init and a refresh, all resources in the cloud are visible and you can start tainting them

$ cd ~/exported_location/drg-ha-use-case
$ terraform init
$ terraform refresh
$ terraform state list | grep -v ^data |  xargs -n1 terraform taint
... # all resources are now tainted
# Destroy the resources

$ terraform destroy --auto-approve
Destroy complete! Resources: 33 destroyed.
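
By the way, if you prefer the CLI over the console for steps 1 and 2, the OCI CLI can pull both artifacts from Resource Manager. This is a sketch assuming a recent OCI CLI and a placeholder stack OCID; check oci resource-manager stack --help for the exact subcommands available in your version.

# Placeholder: set your stack's OCID
$ export STACK_ID=ocid1.ormstack.oc1...

# Step 1: download the Terraform configuration as a zip and unpack it
$ oci resource-manager stack get-stack-tf-config --stack-id $STACK_ID --file config.zip
$ unzip config.zip -d ~/exported_location

# Step 2: download the current state file into the same directory
$ oci resource-manager stack get-stack-tf-state --stack-id $STACK_ID \
    --file ~/exported_location/drg-ha-use-case/terraform.tfstate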


CONCLUSION


  • We just learned how to quickly remediate a Terraform deployment that got stuck due to a blocking error.

  • Although Terraform doesn’t have a failsafe mode, we can still leverage `taint` in similar failure cases.

  • I also got to code review third-party Terraform configs :) (opened and answered 2 issues for this stack)

  • Before loading your deployment into Resource Manager, it’s important to deploy/test it locally first (better for troubleshooting, i.e. taint)

  • The logical error behind the failure? A mistake from the maintainers (wrong data source compartment)

  • `taint` is now deprecated; HashiCorp recommends using the -replace option with terraform apply

     $ terraform apply -replace="aws_instance.example[0]"
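
For the bulk recovery scenario above, a rough modern equivalent of the taint one-liner is to feed every non-data address to a single apply as -replace flags. A sketch, untested against this exact stack (addresses like oci_core_vcn.hub[0] contain brackets, which are only safe unquoted as long as no matching file names exist in the working directory):

     $ terraform apply --auto-approve $(terraform state list | grep -v '^data' | sed 's/^/-replace=/')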
    


Wednesday, December 21, 2022

ZDM troubleshooting part 4: NONCDBTOPDB_CONVERSION fails (GUID conflicts with existing PDB)

Intro

Every time I think I’m done with my ZDM troubleshooting series, a new topic pops up :). I have also learned that every migration is distinct and brings its own unique challenges, so I have decided to keep the series open-ended. In this post, we will discuss the steps you should take when a conversion of a non-CDB to a PDB fails during a ZDM migration, in order to identify the root cause and get your migration back on track. So, let's dive in!

Note: you can always explore the other ZDM troubleshooting-related posts below:
- Migration failing at ZDM_CONFIGURE_DG_SRC
- Migration failing at ZDM_SWITCHOVER_SRC, plus a hack

My ZDM environment

  • ZDM: 21.3 build

Property      Source           Target
RAC           NO               YES
Encrypted     NO               YES
CDB           NO               YES
Release       12.2             12.2
Platform      On-prem Linux    ExaCC


 

Prerequisites

All the prerequisites related to the ZDM VM, the Source and Target Database system were satisfied before running the migration.

Response file

Prepare a response file for a Physical Online Migration with the required parameters (see the excerpt below). I will just point out that ZDM 21.3 now supports Data Guard Broker configurations.

$ cat physical_online_demo.rsp | grep -v ^#
TGT_DB_UNIQUE_NAME=TGTCDB
MIGRATION_METHOD=ONLINE_PHYSICAL
DATA_TRANSFER_MEDIUM=DIRECT
PLATFORM_TYPE=EXACC
...etc

 

Run migration until the DG config – step 1

As usual, I run the migrate command with -pauseafter ZDM_CONFIGURE_DG_SRC to stop once the replication is configured, in order to resume the full migration at a later time.

$ZDM_HOME/bin/zdmcli migrate database -sourcedb SRCDB \
  -sourcenode srcHost -srcauth zdmauth \
  -srcarg1 user:zdmuser \
  -targetnode tgtNode \
  -tgtauth zdmauth \
  -tgtarg1 user:opc \
  -rsp ./physical_online_demo.rsp -ignore ALL \
  -pauseafter ZDM_CONFIGURE_DG_SRC

Resume migration – step 2

Now that the Data Guard configuration is complete, it’s time to resume the full migration to the end.

$ zdmservice resume job -jobid 1

Querying job status

As you can see, it didn’t take long to notice that the non-CDB to PDB conversion step failed.

$ zdmservice query job -jobid 1
zdmhost.domain.com: Audit ID: 39
Job ID: 1
User: zdmuser
Client: zdmhost
Job Type: "MIGRATE"
Current status: FAILED
Result file path: "/u01/app/oracle/zdmbase/chkbase/scheduled/job-1-*log" ...
Job execution elapsed time: 1 hours 25 minutes 41 seconds
ZDM_GET_SRC_INFO .............. COMPLETED
ZDM_GET_TGT_INFO .............. COMPLETED
ZDM_PRECHECKS_SRC ............. COMPLETED
ZDM_PRECHECKS_TGT ............. COMPLETED
ZDM_SETUP_SRC ................. COMPLETED
ZDM_SETUP_TGT ................. COMPLETED
ZDM_PREUSERACTIONS ............ COMPLETED
ZDM_PREUSERACTIONS_TGT ........ COMPLETED
ZDM_VALIDATE_SRC .............. COMPLETED
ZDM_VALIDATE_TGT .............. COMPLETED
ZDM_DISCOVER_SRC .............. COMPLETED
ZDM_COPYFILES ................. COMPLETED
ZDM_PREPARE_TGT ............... COMPLETED
ZDM_SETUP_TDE_TGT ............. COMPLETED
ZDM_RESTORE_TGT ............... COMPLETED
ZDM_RECOVER_TGT ............... COMPLETED
ZDM_FINALIZE_TGT .............. COMPLETED
ZDM_CONFIGURE_DG_SRC .......... COMPLETED
ZDM_SWITCHOVER_SRC ............ COMPLETED
ZDM_SWITCHOVER_TGT ............ COMPLETED
ZDM_POST_DATABASE_OPEN_TGT .... COMPLETED
ZDM_DATAPATCH_TGT ............. COMPLETED
ZDM_NONCDBTOPDB_PRECHECK ...... COMPLETED
ZDM_NONCDBTOPDB_CONVERSION .... FAILED
ZDM_POST_MIGRATE_TGT .......... PENDING
ZDM_POSTUSERACTIONS ........... PENDING
ZDM_POSTUSERACTIONS_TGT ....... PENDING
ZDM_CLEANUP_SRC ............... PENDING
ZDM_CLEANUP_TGT ............... PENDING


Troubleshooting the error

To determine the content of the error message, the best approach is to check the $ZDM_BASE logs hosted locally on the target node ($ORACLE_BASE/zdm/zdm_targetDB_$jobID/zdm/log).

-- Target node
$ cd $ORACLE_BASE/zdm/zdm_TGTCDB_1/zdm/log

$ tail ./zdm_noncdbtopdb_conversion_*.log
[jobid-1][2022-12-14][mZDM_Queries.pm:556]:[DEBUG] Output is :
SQL*Plus: Release 12.2.0.1.0 Production on Wed Dec 14 2022 ..
Connected to: Oracle Database 12c EE Extreme Perf Release 12.2.0.1.0 - 64bit Production
CREATE PLUGGABLE DATABASE zdm_aux_SRCDB using '/tmp/zdm_aux_SRCDB.xml' NOCOPY
TEMPFILE REUSE

 *  ERROR at line 1:

 ORA-65122: Pluggable database GUID conflicts with the GUID of an existing  container.

[jobid-1][2022-12-14][mZDM_convert_noncdb2pdb.pl:522]:[ERROR]
failed to create the PDB 'zdm_aux_SRCDB'

As you can see above, the issue is related to the new PDB being created in the target CDB from the auxiliary database.


What Happened


In an online physical migration from a non-CDB database to a PDB on a target container, ZDM creates an auxiliary standby database in the background to ensure replication consistency before the final switchover. After the switchover is complete, a data patch is applied and an unplug/plug operation is performed to convert the auxiliary DB into a PDB on the target container database (CDB), as sketched below.
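
For context, here is roughly what that conversion looks like when done by hand. This is a sketch of the documented non-CDB to PDB procedure, not the exact statements ZDM runs; only the manifest path is taken from the log above.

-- On the auxiliary (former non-CDB) database, opened read-only,
-- generate the XML manifest describing its data files:
exec DBMS_PDB.DESCRIBE(pdb_descr_file => '/tmp/zdm_aux_SRCDB.xml');

-- On the target CDB, plug the database in without copying data files:
CREATE PLUGGABLE DATABASE zdm_aux_SRCDB USING '/tmp/zdm_aux_SRCDB.xml'
  NOCOPY TEMPFILE REUSE;

-- Inside the new PDB, run the conversion script, then open it:
ALTER SESSION SET CONTAINER = zdm_aux_SRCDB;
@?/rdbms/admin/noncdb_to_pdb.sql
ALTER PLUGGABLE DATABASE zdm_aux_SRCDB OPEN;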


Why is ZDM failing to create the new PDB?


It turns out that ZDM attempted to construct a new PDB from the generated XML manifest but was unsuccessful, as Oracle had assigned it the GUID of another PDB in the CDB. I wouldn't say this happens for all target CDBs with existing PDBs (I have already completed such migrations in the past), but in this case, two databases had already been migrated to the same target CDB before this one.

Subsidiary question
Why did the CREATE PLUGGABLE DATABASE command use an existing GUID instead of generating a new one?
I don’t have the answer to this yet, but we chose to open an SR and see if there was a workaround for this issue.
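
If you suspect the same conflict, a quick check is to compare the GUID recorded in the XML manifest with the GUIDs already present in the target CDB (DBA_PDBS exposes a GUID column):

-- On the target CDB: list the GUIDs of the PDBs already plugged in,
-- then compare them against the <guid> element in /tmp/zdm_aux_SRCDB.xml.
SELECT pdb_name, guid FROM dba_pdbs;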


SOLUTION: ZDM Perl script rewrite

 
Force ZDM to use the clone option:

  • ZDM uses a PDB plugin script called mZDM_convert_noncdb2pdb.pl to perform the PDB conversion

  • All we need to do is update the script at the plugin section and add AS CLONE to the create pluggable database command

Location: make a copy of the file below on the ZDM host before the change

    cp $ZDM_HOME/rhp/zdm/mZDM_convert_noncdb2pdb.pl mZDM_convert_noncdb2pdb.pl.old

    The Perl script uses variables for the PDB name and XML manifest, but the error here occurred because the script issued a "NOCOPY" statement without an "AS CLONE" clause.

    To fix the issue, we just need to amend it and add the missing clone part.

      …else
      {
        @sql_stmts = (
        ...
        "CREATE PLUGGABLE DATABASE $sdb AS CLONE USING '$descfile' NOCOPY TEMPFILE REUSE");
      }
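
      A quick way to locate the statement to edit, and to verify the change afterwards (a sketch; the exact line number will vary by ZDM build):

      $ grep -n "CREATE PLUGGABLE DATABASE" $ZDM_HOME/rhp/zdm/mZDM_convert_noncdb2pdb.pl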


      Note:

      • It is only advised to perform this change in this particular case, or when asked to by Oracle Support.


      Resume the job

      That's it: ZDM will now create the PDB as a clone, implying a new GUID, once we resume the job to complete the rest of our online physical migration.

      $ zdmservice resume job -jobid 1
      $ zdmservice query job -jobid 1
      ...
      ZDM_CONFIGURE_DG_SRC .......... COMPLETED
      ZDM_SWITCHOVER_SRC ............ COMPLETED
      ZDM_SWITCHOVER_TGT ............ COMPLETED
      ZDM_POST_DATABASE_OPEN_TGT .... COMPLETED
      ZDM_DATAPATCH_TGT ............. COMPLETED
      ZDM_NONCDBTOPDB_PRECHECK ...... COMPLETED
      ZDM_NONCDBTOPDB_CONVERSION .... COMPLETED
      ZDM_POST_MIGRATE_TGT .......... PENDING
      ZDM_POSTUSERACTIONS ........... PENDING
      ZDM_POSTUSERACTIONS_TGT ....... PENDING
      ZDM_CLEANUP_SRC ............... PENDING
      ZDM_CLEANUP_TGT ............... PENDING


      Conclusion

      • We learned that ZDM may sometimes try to reuse an existing GUID while converting a non-CDB to a PDB
      • This may be fixed natively in future releases of ZDM.
      • I can’t assume this behavior would be the same in all cases, because I have already moved databases to a CDB with many PDBs without any problem in the past
      • Oracle documentation is explicit about the "AS CLONE" clause; nevertheless, I don't think the same DB was migrated to the same destination in the past:
        "Specifying AS CLONE also ensures that Oracle Database generates new identifiers (GUID, DBID) if the target CDB already contains a PDB that was created using the same set of data files"
      • You might not run into the same error, but this is the quickest fix in case it happens. 

              Thank you for reading