0017: Federated Table Materialized Views
STATUS
Accepted
CONTEXT
As specified in 0005: Federated Queries in Redshift, we are using AWS Redshift's Federated Query capability to read from production Postgres and Mysql tables from within the Redshift data warehouse.
Once we had the tables visible in Redshift in external schemas, we
created "materialized view" copies of the tables in a prod_mirror
schema. We do this for 2 reasons:
1. Minimize the query pressure on the source database, controlling
when the tables are queried. Querying the materialized view
occurs strictly within Redshift.
2. dbt cannot read federated queries, but it can recognize
materialized views.
By and large this is working well. However, a few challenges have
arose based on our implementation that need addressing.
1. Although dbt recognizes the materialized views for use in models,
these views are considered "source" tables. There is a current
limitation in dbt unit tests where you cannot (easily) mock data
for source tables, but you can simply mock data from other
analysis models.
2. There are many uses for these app production tables in analysis
contexts. Since these materialized views are tied to the raw
databases, we've needed to grant access to the prod_mirror
schema to the reporter role. That role was intended to only
look at the analysis databases and tables.
3. The way Redshift tracks the materialized views is by creating faux
tables (tables with the prefix mv_tbl__). These tables should
not be queried directly, but because they’re in the same schema
they’re visible to the analysis tools (Metabase and Tableau).
We've seen some users erroneously querying these faux tables
instead of the intended views.
We're looking for a solution that eases our dbt unit testing problems,
while addressing the reporter permissions and visibility issues.
Considered Options
- Continue to grant elevated permissions to
reporterand don't mock these tables in dbt. - DECISION Create a new
appschema in the analysis database, and populate it with new dbt analysis model copies of the views. - Other ideas??
DECISION
We will create a new schema app to house dbt models for each
materialized view. The dag(s) we currently have to refresh the
materialized views will be modified to also run dbt and set
permissions correctly for the new schema.
Although it will be a straight copy, this is similar to the established process for salesforce and tune data copied to the analysis db.
CONSEQUENCES
Pros
- These tables will be available to analysts using Metabase and Tableau without a back door to the raw database.
- Permissions for these tables will be consistent with patterns established elsewhere in the data pipelines.
- Data engineers will be able to mock data from these sources
Cons
- We're adding an additional setup step when bringing a new
production table in for use:
- make the materialized view with liquibase
- make the dbt model in airflow-dags
- update the dag to refresh both
- We're creating a third copy of data (production app, materialized view, and now analysis table). THough this isn't different than what we're doing elsewhere.
Risks
- Minimal additional load and storage for MWAA and Redshift.
Changes
We will need to modify ag_to_aa_dag.py to run the dbt models and set permissions on the new schema.
Responsibility
Currently, we are delegating responsibility for refreshing a
materialized view to the code that runs most frequently. For example
event_ingestor_day.py
runs every 5 minutes, so we have it running a view refresh to have the
most current data at run time. All of the other materialized views
are refreshed by
ag_to_aa_dag.py
every half hour. Wherever the dbt models are run, there will need to
be permissions updates to grant access to the reporter role for the
newly refreshed dbt tables.
NOTES
References
- PR #52: docs(ADR-0022): Federated table materialized views
- PR #68: fix: Flatten indexes of docs to prep for auto-index merge
- PR #127: docs: backfill PR reference links for existing ADRs
Original Author
Ron White