0017: Federated Table Materialized Views

STATUS

Accepted

CONTEXT

As specified in 0005: Federated Queries in Redshift, we are using AWS Redshift's Federated Query capability to read from production Postgres and Mysql tables from within the Redshift data warehouse.

Once we had the tables visible in Redshift in external schemas, we created "materialized view" copies of the tables in a prod_mirror schema. We do this for 2 reasons: 1. Minimize the query pressure on the source database, controlling when the tables are queried. Querying the materialized view occurs strictly within Redshift. 2. dbt cannot read federated queries, but it can recognize materialized views.

By and large this is working well. However, a few challenges have arose based on our implementation that need addressing. 1. Although dbt recognizes the materialized views for use in models, these views are considered "source" tables. There is a current limitation in dbt unit tests where you cannot (easily) mock data for source tables, but you can simply mock data from other analysis models. 2. There are many uses for these app production tables in analysis contexts. Since these materialized views are tied to the raw databases, we've needed to grant access to the prod_mirror schema to the reporter role. That role was intended to only look at the analysis databases and tables. 3. The way Redshift tracks the materialized views is by creating faux tables (tables with the prefix mv_tbl__). These tables should not be queried directly, but because they’re in the same schema they’re visible to the analysis tools (Metabase and Tableau). We've seen some users erroneously querying these faux tables instead of the intended views.

We're looking for a solution that eases our dbt unit testing problems, while addressing the reporter permissions and visibility issues.

Considered Options

Continue to grant elevated permissions to reporter and don't mock these tables in dbt.
DECISION Create a new app schema in the analysis database, and populate it with new dbt analysis model copies of the views.
Other ideas??

DECISION

We will create a new schema app to house dbt models for each materialized view. The dag(s) we currently have to refresh the materialized views will be modified to also run dbt and set permissions correctly for the new schema.

Although it will be a straight copy, this is similar to the established process for salesforce and tune data copied to the analysis db.

CONSEQUENCES

Pros

These tables will be available to analysts using Metabase and Tableau without a back door to the raw database.
Permissions for these tables will be consistent with patterns established elsewhere in the data pipelines.
Data engineers will be able to mock data from these sources

Cons

We're adding an additional setup step when bringing a new production table in for use:
1. make the materialized view with liquibase
2. make the dbt model in airflow-dags
3. update the dag to refresh both
We're creating a third copy of data (production app, materialized view, and now analysis table). THough this isn't different than what we're doing elsewhere.

Risks

Minimal additional load and storage for MWAA and Redshift.

Changes

We will need to modify ag_to_aa_dag.py to run the dbt models and set permissions on the new schema.

Responsibility

Currently, we are delegating responsibility for refreshing a materialized view to the code that runs most frequently. For example event_ingestor_day.py runs every 5 minutes, so we have it running a view refresh to have the most current data at run time. All of the other materialized views are refreshed by ag_to_aa_dag.py every half hour. Wherever the dbt models are run, there will need to be permissions updates to grant access to the reporter role for the newly refreshed dbt tables.

0017: Federated Table Materialized Views

STATUS

CONTEXT

Considered Options

DECISION

CONSEQUENCES

Pros

Cons

Risks

Changes

Responsibility

NOTES

References

Original Author

Approval date

Approved by

Appendix