0005: Federated Queries in Redshift
STATUS
Accepted
CONTEXT
Our data warehouse requires access to production data from Adgem and other systems to fully leverage our analytical capabilities. This operational data is stored in a Postgres RDS instance within the Adgem account.
Considered Options
- Federated Queries - Use Redshifts capability to query external databases
- ELT - Use Airflow (or similar) to load data into Redshift
Zero-ETLfeature of Redshift
DECISION
We will use Federated Queries within Redshift to query the production Postgres database. This will allow us to compose new data sets from existing warehouse data as well as Adgem. This will avoid the complexity and maintenance of a separate ETL process.
To enable this, we will need to configure VPC peering between the data-warehouse account and the Adgem account. We will also need to create a new IAM role in the data-warehouse account that has access to a secret containing the credentials for the Adgem database. We will add an external schema to the Redshift database that points to the Adgem database. To have Redshift reach through the VPC Peering and make the connection work, we will enable Enhanced VPC routing on the Redshift cluster. Enhanced VPC routing instructs Redshift to route traffic through its VPC (it doesn't do this by default!), and requires an internet gateway on the subnet to allow access to S3 for COPY/UNLOAD operations. Finally, we need to configure a security group for the peering connection and update the Adgem DB's security group to allow a connection from the peering connection SG.
Unfortunately, at this time, dbt is unable to query from a federated
query schema. To workaround this, we are able to make a materialized view
from the source table. Dbt is able to read this relation and
operate properly. A downside to this solution is the need to refresh
the materialized view at a regular cadence. We will have a small
airflow task as part of any dag that uses these tables to refresh
before running dbt.
Connectivity Diagram
Sample airflow processing
Other options
Although we have the use of Airbyte well established for ingesting data from external APIs, we have not used it internally yet. Additionally, any Airbyte process we would implement would need its own Airflow Dag and all the requisite maintenance that comes with such a setup.
Zero-ETL is a very promising new feature of Redshift. This allows Redshift to create read replicas of MySql data from the db binary logs in real time. Unfortunately, this feature doesn't work with Postgres so we'll need Fed Queries for Adgem. And, worse yet, Zero-ETL does not work with Redshift Enhanced-VPC Routing. Enhanced VPC routing is required though for Federated Queries, so we can't do Zero-ETL for Cosmic.
CONSEQUENCES
All raw production data will be captured, available immediately within the data-warehouse.
Risks
These queries interact directly with the production database
instance. We have seen these queries have a negative impact
on Cosmic Rewards database performance. materialized views
updated at a regular cadence, rather than directly querying
the federated tables for adhoc purposes helps. However,
the performance impacts will necessitate a larger server or
a read replica to avoid all negative impacts.
NOTES
References
- Enhanced VPC routing
- Getting started with using federated queries to PostgreSQL
- Set up single user rotation for AWS Secrets Manager
- Creating a secret and IAM role to use federated queries
- Access AWS Secrets Manager secrets from a different account
- PR #14: ADR-0005: Federated Queries
- PR #21: feat(mkdocs): capability to build a static site of the docs using mkdocs
- PR #127: docs: backfill PR reference links for existing ADRs
Original Author
- Ron White