0007: EKS Cluster and Datahub provisioning
STATUS
Accepted
CONTEXT
- The data team has been exploring tools / products that would enable teams across our business to explore, understand, and derive insights from data with minimal technical guidance from the data team.
- Two notable open-source tools have been identified: Datahub (ready to adopt) and Metabase (under evaluation).
- The TIGR and DevOps teams would collaborate to provision the EKS clusters following recommended best practices for efficient day-2 ops.
- This ADR documents all decisions (related to EKS and Datahub) discussed and agreed upon by both teams and the VP of Engineering.
Considered Options
- Data Cataloguing Tools
- Acryl, a managed cloud version of Datahub
- Alation, an enterprise cataloguing tool
- Offerings within Tableau and Sigma
- k8s
- EKS
- CI/CD
- Github Actions
- ArgoCD
- CodePipeline
DECISION
OSS vs Managed Datahub
- Datahub OSS deployment recommends using Kubernetes in production.
- Datahub also offers a managed service called Acryl, which was considered to avoid the burden of managing Kubernetes clusters. Acryl is priced at $60,000 per year for up to 20 monthly active users.
- Proposal: Utilize the OSS version and self-manage via EKS on AA's AWS
- From a cost perspective, it would be more efficient to deploy and manage Datahub on AA's AWS.
- Since our tech team would manage these clusters, we could use them for:
- Deploying other OSS tools that require Kubernetes, such as Metabase and Airbyte.
- Supporting other scalability objectives we may have for our tech products or internal tools.
Centralized vs Decentralized Clusters
After extensive discussions, we opted for a hybrid approach. We created a new account within the production OU and provisioned all resources in that account. This strategy aims to isolate all resources into a single account while allowing flexibility to adopt either a centralized or decentralized approach as our requirements evolve. A centralized approach would have complicated our CDK project, whereas a decentralized approach would have restricted our clusters to the data team. The original research is preserved in Appendix D for future reference.
The architecture diagram is maintained on this Miro board.
EKS Pod Identities vs IAM Roles for Service Accounts
We chose IAM Roles for Service Accounts (IRSA) because our team is more familiar with this option than with EKS Pod Identities. Additionally, we use Kubernetes RBAC, assigning Permission Sets provisioned through code with privileges on the Kubernetes clusters to perform operations via the CLI for any debugging or maintenance activities.
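To make the IRSA wiring concrete, below is a minimal CDK sketch (assuming Python as the CDK language) of binding an IAM role to a Kubernetes service account through the cluster's OIDC provider. The construct IDs, namespace, and example grants are hypothetical rather than taken from our actual stacks.

```python
from aws_cdk import aws_eks as eks
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3
from constructs import Construct


def add_datahub_service_account(scope: Construct, cluster: eks.Cluster) -> None:
    # add_service_account creates both the Kubernetes ServiceAccount and an
    # IAM role trusted through the cluster's OIDC identity provider (IRSA).
    sa = cluster.add_service_account(
        "DatahubServiceAccount",
        name="datahub",
        namespace="datahub",
    )

    # Example grant: let pods running under this service account read an
    # ingestion bucket (hypothetical bucket, for illustration only).
    bucket = s3.Bucket(scope, "IngestionArtifacts")
    bucket.grant_read(sa)

    # Arbitrary IAM statements can also be attached to the role directly.
    sa.add_to_principal_policy(
        iam.PolicyStatement(
            actions=["glue:GetDatabase", "glue:GetTable"],
            resources=["*"],
        )
    )
```

Pods scheduled under the `datahub` service account then receive these AWS permissions without any node-level credentials.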
Managed Services for Datahub
- Datahub recommends using AWS managed services for storage services (MySQL, Elasticsearch and Kafka) that are required for operating Datahub.
- We're currently utilizing RDS running a MySQL database; use of Elasticsearch Service and Managed Kafka is deferred until our adoption grows and Kubernetes costs increase due to the need for more capacity, which would become evident through degraded Datahub performance. A minimal sketch of the RDS piece follows this list.
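As referenced above, a minimal CDK sketch of the RDS piece, assuming Python as the CDK language and an existing VPC; instance sizing, construct IDs, and credential handling are illustrative only.

```python
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_rds as rds
from constructs import Construct


def create_datahub_mysql(scope: Construct, vpc: ec2.IVpc) -> rds.DatabaseInstance:
    # Datahub uses MySQL as its primary metadata store; a small instance is
    # enough until adoption grows (per the deferral decision above).
    return rds.DatabaseInstance(
        scope,
        "DatahubMetadataDb",
        engine=rds.DatabaseInstanceEngine.mysql(
            version=rds.MysqlEngineVersion.VER_8_0
        ),
        vpc=vpc,
        vpc_subnets=ec2.SubnetSelection(
            subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS
        ),
        instance_type=ec2.InstanceType.of(
            ec2.InstanceClass.BURSTABLE3, ec2.InstanceSize.SMALL
        ),
        # Generates and stores credentials in Secrets Manager.
        credentials=rds.Credentials.from_generated_secret("datahub"),
        allocated_storage=20,
    )
```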
Managing Access to Datahub
We recommend managing access to Datahub through an Identity Provider (IDP), with Google being the optimal choice for our organization since it allows access to all members. We set up Single Sign-On (SSO) using OpenID Connect and disabled authentication via username and password.
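The sketch below shows one plausible way to wire these settings into the datahub-frontend pods through the Helm chart, using the environment variables named in Datahub's OIDC guide. The extraEnvs layout, hostname, and inline secret are assumptions to verify against the chart version in use; in practice the client secret should come from a secret store, not plain values.

```python
from aws_cdk import aws_eks as eks


def add_datahub_chart(cluster: eks.Cluster, client_id: str, client_secret: str) -> None:
    cluster.add_helm_chart(
        "Datahub",
        chart="datahub",
        repository="https://helm.datahubproject.io",
        namespace="datahub",
        values={
            "datahub-frontend": {
                "extraEnvs": [
                    {"name": "AUTH_OIDC_ENABLED", "value": "true"},
                    {"name": "AUTH_OIDC_CLIENT_ID", "value": client_id},
                    {"name": "AUTH_OIDC_CLIENT_SECRET", "value": client_secret},
                    {
                        # Google's standard OIDC discovery document.
                        "name": "AUTH_OIDC_DISCOVERY_URI",
                        "value": "https://accounts.google.com/.well-known/openid-configuration",
                    },
                    {
                        # Placeholder hostname for the Datahub frontend.
                        "name": "AUTH_OIDC_BASE_URL",
                        "value": "https://datahub.example.internal",
                    },
                    # Disable username/password login so SSO is the only path.
                    {"name": "AUTH_JAAS_ENABLED", "value": "false"},
                ],
            },
        },
    )
```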
GitHub Repository Structure
We have implemented a monorepo strategy where the cluster, supporting services, and workloads to be deployed are all contained within a single GitHub repository, managed through distinct CDK stacks. However, as our adoption scales and we transition to a centralized EKS cluster model, we may consider the approach detailed in Appendix C.
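Illustratively, the monorepo's CDK app wires the distinct stacks together roughly as follows (Python assumed; all stack names and configuration are hypothetical simplifications of the real project):

```python
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from constructs import Construct


class NetworkStack(cdk.Stack):
    def __init__(self, scope: Construct, id: str) -> None:
        super().__init__(scope, id)
        self.vpc = ec2.Vpc(self, "ClusterVpc", max_azs=2)


class EksClusterStack(cdk.Stack):
    def __init__(self, scope: Construct, id: str, vpc: ec2.IVpc) -> None:
        super().__init__(scope, id)
        # Version pinned purely for illustration.
        self.cluster = eks.Cluster(
            self, "Cluster", version=eks.KubernetesVersion.V1_29, vpc=vpc
        )


class DatahubWorkloadStack(cdk.Stack):
    def __init__(self, scope: Construct, id: str, cluster: eks.Cluster) -> None:
        super().__init__(scope, id)
        # Prepackaged Helm chart deployed from its own stack (see Pipeline Setup).
        eks.HelmChart(
            self,
            "Datahub",
            cluster=cluster,
            chart="datahub",
            repository="https://helm.datahubproject.io",
            namespace="datahub",
        )


app = cdk.App()
network = NetworkStack(app, "DatahubNetwork")
cluster_stack = EksClusterStack(app, "DatahubEksCluster", vpc=network.vpc)
DatahubWorkloadStack(app, "DatahubWorkload", cluster=cluster_stack.cluster)
app.synth()
```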
Pipeline Setup
We evaluated two methods for setting up the CI/CD pipeline and chose to adopt GitHub Actions. This decision was driven by the complexity of each option relative to our Kubernetes adoption scope. After thorough discussion and debate, we concluded that we only require a pipeline to deploy CDK code for provisioning EKS clusters and deploying workloads using prepackaged Helm Charts. The simplest solution was to use GitHub Actions to deploy code via cdk deploy. If we later decide to deploy custom applications on Kubernetes, ArgoCD would be the preferred best practice.
The original research is preserved in Appendix A for future reference.
CONSEQUENCES
- Data team can begin to utilize the data cataloguing tool
- AA's tech teams would have EKS clusters available as a service to utilize for scalability needs. We've already identified Airbyte as a potential candidate.
- Increased workload for the DevOps team: since k8s is newly adopted at AA, our team would have to up-skill for day-1 and day-2 ops.
Risks
- Datahub adoption success relies on following the best practices outlined for DBT and Airflow. It also demands increased documentation, which adds to the load on the team; we believe this investment would reduce overall cycle times for the data team. See Appendix B for Datahub best practices for adoption success.
- This would be the first EKS cluster provisioned and managed by our tech team which implies we have a learning curve to go through.
- Good test coverage for our CDK code and robust pipelines would greatly aid the process.
- A steady approach where development/maintenance goes hand-in-hand with research would greatly aid the process.
NOTES
References
- Multi Account Strategy
- Granting AWS Identity and Access Management permissions to workloads on Amazon Elastic Kubernetes Service clusters
- AWS setup guide, Datahub
- Configure OIDC Authentication on Datahub
- Deploy Helm Chart to EKS Cluster Using CI/CD Pipeline with GitHub Actions
- GitOps: CI/CD Automation Workflow Using GitHub Actions, ArgoCD, and Helm Charts
- ArgoCD Documentation
- PR #19: docs: ADR-0007 EKS and Datahub provisioning
- PR #33: fix: move the eks adr
- PR #127: docs: backfill PR reference links for existing ADRs
Original Authors
Nishanth Kaladharan nkaladharan@adaction.com
Fabian Leon fleon@adaction.com
Approval date
Approved by
Appendices
Appendix A - CI/CD Options
1. Imperative Approach Using GitHub Actions and Helm
In this approach, the CI/CD pipeline is defined explicitly through automation scripts, and each step is executed in a controlled manner.
Direct Management
The deployment process for Helm charts is managed directly within the CI/CD pipeline. GitHub Actions workflows handle the entire process, from building Docker images to deploying them on the EKS cluster.
Customization and Control
Each deployment action is explicitly specified. This includes choosing specific Helm chart versions, setting environment-specific parameters, and manually triggering deployments as needed.
Reactive Process
Deployments are triggered based on changes detected in the codebase or configuration files.
Pros:
- Compatibility: This approach can be more straightforward to integrate with existing tools and processes.
- Fine-grained Control: Detailed customization of each deployment step allows for precise control.
- Immediate Feedback: Provides quick feedback on changes, allowing for rapid iteration.
Cons:
- Maintenance Overhead: Requires frequent updates and management of CI/CD scripts and configurations.
- Complexity: Each change or addition requires manual updates to the workflow, increasing the risk of errors.
- Security Concerns: To allow GitHub Actions to reach the cluster, the production API server endpoint must be public or otherwise reachable from GitHub, which is not recommended; exposing the cluster to the internet enlarges its attack surface.
2. Declarative Approach Using GitOps with ArgoCD
This approach utilizes GitOps principles, where the desired state of the system, including Helm charts, is stored in a Git repository. ArgoCD automates the synchronization of the cluster with this desired state.
Using ArgoCD Involves:
- Desired State Model: The system's desired state, including all Kubernetes resources and Helm chart configurations, is declaratively defined in the Git repository. This model specifies what the system should look like at any given time.
- Continuous Synchronization: ArgoCD continuously monitors the Git repository. When changes are detected, it automatically synchronizes the EKS cluster to match the desired state, ensuring that the deployed state aligns with the specified configuration.
- Pull-based Automation: Rather than manually triggering deployments, ArgoCD automatically pulls changes from the Git repository and applies them, reducing the need for manual oversight and intervention.
Pros:
- Automation and Consistency: Ensures the cluster state is consistently aligned with the declared state in the Git repository, reducing manual intervention.
- Reduced Operational Overhead: Automates much of the deployment and management process, reducing the need for manual updates.
- Rollbacks: Easier to implement, as reverting to a previous state can be as simple as reverting a commit in the Git repository.
- Enhanced Security: This approach does not require the cluster to be exposed to the internet, as all changes are pulled from a secure, internal Git repository, minimizing potential attack surfaces.
Cons:
- Initial Setup Complexity: Setting up ArgoCD and configuring GitOps workflows can be complex and require a deeper understanding of Kubernetes and GitOps principles.
- Learning Curve: We will need to invest time in learning how to effectively use ArgoCD and GitOps practices.
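To make the pull-based model concrete, the sketch below declares an ArgoCD Application that tracks a Helm chart in Git, expressed as a Kubernetes manifest applied through CDK. It assumes ArgoCD is already installed on the cluster; the repository URL, path, and namespaces are placeholders.

```python
from aws_cdk import aws_eks as eks


def add_datahub_argo_app(cluster: eks.Cluster) -> None:
    cluster.add_manifest(
        "DatahubArgoApp",
        {
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"name": "datahub", "namespace": "argocd"},
            "spec": {
                "project": "default",
                # Desired state lives in Git; ArgoCD pulls it, so the cluster
                # API server never has to be reachable from GitHub.
                "source": {
                    "repoURL": "https://github.com/AdAction/datahub.git",
                    "path": "charts/datahub",
                    "targetRevision": "main",
                },
                "destination": {
                    "server": "https://kubernetes.default.svc",
                    "namespace": "datahub",
                },
                "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
            },
        },
    )
```

With this in place, merging a change to the chart in Git is the deployment; no push access to the cluster is required.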
Appendix B - Datahub Best Practices and Adoption Strategies
Adoption Strategies
- Identifying Business Users / Stakeholders that give feedback
- Documentation initiatives die when a value add is not evident.
- The Datahub team recommends that we identify stakeholders / domains within a certain impact radius and get feedback from them as we add metadata to our data products.
- Not only does this drive better adoption, it would also enable us to identify the current challenges stakeholders face and address them with targeted documentation, rather than trying to add documentation for everything.
- Identifying Domains and Classifying Datasets:
- Similar to the above, identify domains that would benefit from adopting Datahub.
- Classify assets that belong to that domain and get feedback from teams that work in that domain.
- A blog on governance written by one of the Acryl engineers: Data Governance, but Make It a Team Sport
Ingestion Automations
- Out-of-the-Box: Datahub's DBT connector is mature and automatically picks up the meta mappings from the yaml files out of the box. We should utilize a linter to mandate the meta information we need from DBT.
- Custom Automations: Another great tool is transformers; a great use case could be marking a dataset with a status that denotes whether it is active or being deprecated. Any additional meta information we need to add during ingestion can be handled the same way (a minimal transformer sketch follows this list).
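A minimal sketch of such a transformer, following the structure of Datahub's custom-transformer guide; the config shape, deny-list approach, and note text are our own assumptions, and the class paths should be verified against the installed acryl-datahub version.

```python
from typing import List, Optional

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import (
    BaseTransformer,
    SingleAspectTransformer,
)
from datahub.metadata.schema_classes import DeprecationClass


class DatasetDeprecationConfig(ConfigModel):
    # URNs the data team has flagged as deprecated (assumed to be
    # maintained alongside the ingestion recipe).
    deprecated_urns: List[str] = []


class MarkDatasetDeprecation(BaseTransformer, SingleAspectTransformer):
    """Stamps a Deprecation aspect on flagged datasets during ingestion."""

    def __init__(self, config: DatasetDeprecationConfig, ctx: PipelineContext):
        super().__init__()
        self.ctx = ctx
        self.config = config

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "MarkDatasetDeprecation":
        return cls(DatasetDeprecationConfig.parse_obj(config_dict), ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        return "deprecation"

    def transform_aspect(
        self, entity_urn: str, aspect_name: str, aspect: Optional[DeprecationClass]
    ) -> Optional[DeprecationClass]:
        if entity_urn not in self.config.deprecated_urns:
            return aspect  # leave active datasets untouched
        return DeprecationClass(
            deprecated=True,
            note="Deprecated by the data team; see runbook.",  # placeholder note
            actor="urn:li:corpuser:datahub",
        )
```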
Metadata Automations
- The Datahub Actions Framework is for automations on metadata post-ingestion, e.g. sending a notification when a dataset's columns, status, or owner change.
- We can develop new custom Actions via Python; a skeleton follows this list.
- This set of automations is primarily driven by our business processes and the opportunities we see to optimize them.
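A skeleton for such an Action, modeled on the interface in the datahub-actions hello-world example; the event-type check and print-based notifier are placeholders for whatever notification channel we adopt.

```python
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext


class OwnerChangeNotifier(Action):
    """Flags ownership/status changes on datasets after ingestion."""

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        return cls(ctx)

    def __init__(self, ctx: PipelineContext):
        self.ctx = ctx

    def act(self, event: EventEnvelope) -> None:
        # Entity change events carry a category such as OWNER or DEPRECATION;
        # check the exact payload fields against the events reference.
        if event.event_type == "EntityChangeEvent_v1":
            print(f"Change detected: {event.event}")  # swap for a real notifier

    def close(self) -> None:
        pass
```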
Appendix C - GitHub Repository Structure for a Centralized EKS Cluster
- AdAction/central-eks: Used to version control CDK code that provisions the cluster VPC and RAM sharing, its subnets, the EKS cluster, and its worker nodes (EC2 resources) in the cluster account.
- AdAction/datahub: Used to version control CDK code that provisions Datahub on the EKS cluster and the managed services needed to run it, such as RDS.
- AdAction/<custom-application>: A custom application that runs on the cluster as a Pod should go into its own GitHub repository.
Appendix D - Centralized vs Decentralized Clusters and Network Configuration Options
Centralized vs Decentralized
- Centralized
  - The primary benefit here is cost and simplicity. This shared cluster account would be utilized by the data team as well as by other workloads that may be needed for Cosmic or AdGem.
  - In this approach, we would create a new Cluster Account which would contain the cluster VPC, its subnets, the EKS cluster, and its worker nodes (EC2 resources).
  - The resources necessary for running a specific workload (for instance, RDS and Elasticsearch for Datahub) would be deployed in the respective team/product workload accounts.
  - Source: Centralized EKS Cluster, Multi Account Strategy, EKS Best Practices Guides
- Decentralized
  - In this approach, we'd deploy EKS clusters to the respective workload accounts. In our context, this would mean deploying the EKS cluster to the data-engineering-prod workload account.
- Proposal: Spinning up clusters in different workload accounts would increase costs, so we recommend the centralized cluster approach.
RAM vs Peering Connections
- Resource Access Manager (RAM)
  - RAM enables sharing VPC subnets from the Cluster account to workload accounts. This implies the resources needed for a workload can be deployed in the same VPC as the EKS cluster.
- Peering Connections
  - This approach would be similar to how we manage peering connections for the warehouse, i.e. the Cluster VPC is the peer requester and the workload VPCs accept the requests.
- Proposal: Utilize RAM, since it is the simpler, more cost-effective, and recommended option; peering connections would need to be set up between each workload VPC and the cluster VPC. A CDK sketch of the RAM share follows.
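As referenced above, a sketch of the RAM share in CDK (Python assumed), using the L1 CfnResourceShare construct to share the cluster VPC's private subnets with a workload account; account IDs and names are placeholders.

```python
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ram as ram
from constructs import Construct


def share_cluster_subnets(
    scope: Construct, vpc: ec2.IVpc, workload_account_id: str
) -> None:
    stack = cdk.Stack.of(scope)
    # Subnet ARNs built from subnet IDs; RAM shares resources by ARN.
    subnet_arns = [
        stack.format_arn(
            service="ec2", resource="subnet", resource_name=subnet.subnet_id
        )
        for subnet in vpc.private_subnets
    ]
    ram.CfnResourceShare(
        scope,
        "ClusterSubnetShare",
        name="eks-cluster-subnets",
        # Share only the private subnets that workload resources (e.g. RDS)
        # need to live in; the workload account sees them as its own subnets.
        resource_arns=subnet_arns,
        principals=[workload_account_id],
        allow_external_principals=False,
    )
```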