Skip to content

0005: Federated Queries in Redshift

STATUS

Accepted

CONTEXT

Our data warehouse requires access to production data from Adgem and other systems to fully leverage our analytical capabilities. This operational data is stored in a Postgres RDS instance within the Adgem account.

Considered Options

  • Federated Queries - Use Redshifts capability to query external databases
  • ELT - Use Airflow (or similar) to load data into Redshift
  • Zero-ETL feature of Redshift

DECISION

We will use Federated Queries within Redshift to query the production Postgres database. This will allow us to compose new data sets from existing warehouse data as well as Adgem. This will avoid the complexity and maintenance of a separate ETL process.

To enable this, we will need to configure VPC peering between the data-warehouse account and the Adgem account. We will also need to create a new IAM role in the data-warehouse account that has access to a secret containing the credentials for the Adgem database. We will add an external schema to the Redshift database that points to the Adgem database. To have Redshift reach through the VPC Peering and make the connection work, we will enable Enhanced VPC routing on the Redshift cluster. Enhanced VPC routing instructs Redshift to route traffic through its VPC (it doesn't do this by default!), and requires an internet gateway on the subnet to allow access to S3 for COPY/UNLOAD operations. Finally, we need to configure a security group for the peering connection and update the Adgem DB's security group to allow a connection from the peering connection SG.

Unfortunately, at this time, dbt is unable to query from a federated query schema. To workaround this, we are able to make a materialized view from the source table. Dbt is able to read this relation and operate properly. A downside to this solution is the need to refresh the materialized view at a regular cadence. We will have a small airflow task as part of any dag that uses these tables to refresh before running dbt.

Connectivity Diagram

flowchart LR subgraph adgemaccount [Adgem Acct] subgraph adgemvpc [Adgem VPC] subgraph adgemsg [Security Group] agdb[(Adgem DB)] end end dbsecret[Secret Credentials] key[KMS Encryption Key] end redshift --external schema--> pc[peering connection] --reads--> agdb subgraph pcsg [Security Group] pc[peering connection] end subgraph warehouse [Data Warehouse Acct] subgraph warehousevpc [Warehouse VPC] redshift[(Redshift)] end role[Default Role] igw[Internet Gateway] end redshift --assumes--> role role --reads--> key role --decrypts--> dbsecret warehousevpc --> igw --copy/unload--> s3 s3[(S3)]

Sample airflow processing

flowchart LR subgraph airflow [Airflow] refresh[Refresh Federated Views]-->dbt import[Import Other Data]-->dbt dbt-->output output[(Analysis Tables)] end

Other options

Although we have the use of Airbyte well established for ingesting data from external APIs, we have not used it internally yet. Additionally, any Airbyte process we would implement would need its own Airflow Dag and all the requisite maintenance that comes with such a setup.

Zero-ETL is a very promising new feature of Redshift. This allows Redshift to create read replicas of MySql data from the db binary logs in real time. Unfortunately, this feature doesn't work with Postgres so we'll need Fed Queries for Adgem. And, worse yet, Zero-ETL does not work with Redshift Enhanced-VPC Routing. Enhanced VPC routing is required though for Federated Queries, so we can't do Zero-ETL for Cosmic.

CONSEQUENCES

All raw production data will be captured, available immediately within the data-warehouse.

Risks

These queries interact directly with the production database instance. We have seen these queries have a negative impact on Cosmic Rewards database performance. materialized views updated at a regular cadence, rather than directly querying the federated tables for adhoc purposes helps. However, the performance impacts will necessitate a larger server or a read replica to avoid all negative impacts.

NOTES

References

Original Author

  • Ron White

Approval date

Approved by

Appendix