Friday, December 8, 2023

OCI FortiGate HA Cluster - Reference Architecture: Code Review & Fixes

Intro


OCI Quick Start repositories on GitHub are collections of Terraform scripts and configurations provided by Oracle. These repositories are designed to help organizations quickly deploy common infrastructure setups on the OCI platform.
Each Quick Start focuses on a specific use case or workload, which simplifies the process of provisioning on OCI using Terraform. In essence, each one is an IaC-based reference architecture.


Today, we will code review one of those reference architectures: a Fortinet firewall solution deployed in OCI.
Note: This article won’t discuss the architecture itself, but will rather address its Terraform code flaws and their fixes.



Why do some errors never reach your OCI Resource Manager stack?


  • Certain Terraform errors may never reach your RM stack because of how RM is designed. For instance, RM allows specific variables, such as availability domains, to be hardcoded directly in its interface. This sidesteps the need for these variables to be checked by native conditions in the Terraform code (see the sketch below).

  • Moreover, RM reads these variables from the schema.yaml file, which alters the behavior compared to a local Terraform CLI execution. As a result, certain errors are handled or bypassed within the RM environment, creating a gap between RM and standard Terraform workflows.
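
For illustration, here is a minimal sketch (the exact declaration is an assumption, not copied from the repo) of how such a variable is typically declared with no native validation, leaving RM's schema.yaml as the only guardrail:

    # Hypothetical variable declaration: no validation block, so nothing
    # stops an invalid value when running outside Resource Manager.
    variable "availability_domain_name" {
      type        = string
      description = "AD to deploy into; RM builds a dropdown for it from schema.yaml"
      default     = ""
    }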



The stack: FortiGate HA Cluster using DRG - Reference Architecture


The stack is the result of a collaboration between Oracle and Fortinet. The architecture is based on a Hub & Spoke topology, using FortiGate firewalls from the OCI Marketplace. I actually deployed it while working on one of my projects.


For details of the architecture, see Set up a hub-and-spoke network topology.


The repository


You will find this Terraform config under the main oci-fortinet GitHub repository, but not in the root directory.



The Errors


At the time of writing, the errors were still not fixed despite my opening issues and sharing the fixes; the last commit goes back two years. You will need to clone the repo and cd to the drg-ha-use-case subdirectory:

$ git clone https://github.com/oracle-quickstart/oci-fortinet.git

$ cd use-case/drg-ha-use-case

$ terraform init


1. Data source error in regions with a single AD

  

You will face this issue in a region with only one availability domain (e.g., ca-toronto-1), as the availability domain data source will fail the Terraform execution plan.


CAUSE:  See issue #8 

  • In the above error, Terraform complains that the availability domain data source has only one element.

  • This impacts 2 of the oci_core_instance resource blocks (2 web-vms, 2 db-vms).

  • Problem?

    • count.index for the data source block will always equal 0 in single-AD regions (1 element).
      See data_source.tf lines 8-10. This configuration hasn’t been tested in single-AD regions.

      $ vi data_source.tf

      # ------ Get list of availability domains
      8  data "oci_identity_availability_domains" "ADs" {
      9    compartment_id = var.tenancy_ocid
      10 }



  • Reason:

    • In Terraform, count.index always starts at 0: if you have a resource with a count of 4, count.index will be 0, 1, 2, and 3.

    • Let’s take for example the "web-vms" oci_core_instance block in compute.tf > line 235.

    • If we run through the conditional expression (paraphrased below):
      - The variable availability_domain_name is empty.
      - The ads data source length = 1 element. That means the AD name will be looked up in the
      ads data source collection with an index value of [0+1] = 1.

    • data…ads.availability_domains[1] doesn’t exist, as the collection only contains 1 element.
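
The pattern in question looks roughly like this (a paraphrase of the compute.tf expression, not the verbatim repo code):

    # Paraphrased from compute.tf line 235: the [count.index + 1] lookup
    # is the culprit, since a single-AD region only exposes index [0].
    availability_domain = var.availability_domain_name == "" ? data.oci_identity_availability_domains.ADs.availability_domains[count.index + 1].name : var.availability_domain_name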
       

Solution

Complete the full availability domain conditional expression on line 235 and line 276 (web-vms/db-vms):

  • Add the case where the data source ads.availability_domains has 1 element (the region has only one AD), as sketched below.
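
A minimal sketch of the completed expression (assuming the variable and data source names used above):

    # Fix for the web-vms/db-vms blocks: fall back to element [0]
    # when the region exposes a single availability domain.
    availability_domain = var.availability_domain_name != "" ? var.availability_domain_name : (
      length(data.oci_identity_availability_domains.ADs.availability_domains) == 1
      ? data.oci_identity_availability_domains.ADs.availability_domains[0].name
      : data.oci_identity_availability_domains.ADs.availability_domains[count.index + 1].name
    )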



Bad logic

Seeking the name of the count.index+1 availability domain is still wrong when the region has more than 1 AD; note that AD1, at index [0], is never used. A more robust alternative is sketched after this list.

  • Example: say you want to create 3 VMs and your region has 2 availability domains.

    • The first iteration [0] sets count.index+1 = 1 (2nd data source element = AD2).

    • The second iteration sets count.index+1 = 2, pointing at a 3rd data source element (AD3) that doesn’t exist.

    • The 2nd and 3rd iterations will always fail because there are only 2 ADs (index list [0,1]).
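
A more robust sketch sidesteps the off-by-one entirely by wrapping the index with modulo, spreading the VMs round-robin across however many ADs the region has (same assumed names as above):

    # count.index % AD-count always yields a valid index, whether the
    # region has 1, 2, or 3 availability domains.
    availability_domain = var.availability_domain_name != "" ? var.availability_domain_name : data.oci_identity_availability_domains.ADs.availability_domains[count.index % length(data.oci_identity_availability_domains.ADs.availability_domains)].name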



2. Wrong compartment argument in the security list data sources

  

Another issue you will run into is a failure to deploy subnets because a data source collection comes back empty (no elements).


CAUSE:  See issue #9 

  • In the above error, Terraform complains that the allow_all_security data source collection is empty.

    • This impacts all FortiGate subnet blocks in the config, as they all share the same security lists.

Reason:

  • In this configuration there are 2 compartments: one for compute and another for network resources.

  • If you take a look at the "allow_all_security" block in datasource.tf > lines 64-74,

  • You’ll notice a wrong compartment ID in the security lists data source (compute instead of network).


  

Solution

This was a silly mistake, but it took me a day to figure out while delving through a pile of new Terraform files.

All you need to do is replace the compute compartment variable with var.network_compartment_ocid.

Edit datasource.tf lines 64-74:

# ------ Get the Allow All Security Lists for Subnets in Firewall VCN
data "oci_core_security_lists" "allow_all_security" {
  compartment_id = var.network_compartment_ocid  # <-- CORRECT compartment
  vcn_id         = local.use_existing_network ? var.vcn_id : oci_core_vcn.hub.0.id
  ...


3. More code inconsistencies


I wasn’t done debugging, as I found other misplaced compartment variables in some VNIC attachment data sources.

  • See datasource.tf: lines 103-115 & 118-130; replace the compartment variable there with var.compute_compartment_ocid, as sketched below.
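
A hedged sketch of the corrected pattern (the data source and instance names are illustrative, not copied from the repo; check the actual blocks at those lines):

    # VNIC attachments live with the instances, so they must be looked up
    # in the compute compartment, not the network one.
    data "oci_core_vnic_attachments" "fortigate_a_attachments" {   # hypothetical name
      compartment_id = var.compute_compartment_ocid  # <-- compute, not network
      instance_id    = oci_core_instance.fortigate-a.id            # hypothetical resource
    }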



Conclusion & recommendations

  • This type of undetected code issue is why I never trust the first deployment in Resource Manager.
    To avoid problems in the future, especially if you decide to migrate out of RM at some point, I suggest the following workflow:

    1. Run locally and fix any code bug.

    2. Run on Resource Manager.

    3. Store in a git repo (blueprint with eventual versioning).

  • I hope this was helpful, as the issues I opened have been unsolved for over a year in their GitHub repo.


