0044: Data Retention Policy for Personal and Indirect PII
STATUS
Accepted
CONTEXT
As a performance marketing firm processing user data for rewarded ads, we collect and process personal data and indirect PII (IP addresses, IDFA, GAID, device identifiers) throughout our attribution and billing pipeline. We are subject to multiple data protection regulations including GDPR, CCPA/CPRA, and other US state privacy laws, as well as COPPA considerations for age-gated content.
Currently, we have no formalized data retention or deletion policies in place. Data persists indefinitely in our systems (Redshift, S3 Firehose archives, production Postgres databases) without automated cleanup, creating compliance risks and unnecessary storage costs.
Business Requirements
Our data serves multiple operational purposes with varying retention needs:
- Fraud detection: Requires 90 days of detailed transaction data
- Attribution: Most campaigns use 90-day attribution windows; some extend to 180 days or lifetime
- Billing & dispute resolution: Requires 12 months for chargebacks, disputes, and audit purposes
- Performance analytics: Requires 1 year of transactional data; 2+ years of aggregated/anonymized data
Given the complexity of managing different retention windows for different campaign types, we will implement a uniform 12-month retention period for all PII-containing operational data, which satisfies all business requirements while simplifying implementation.
Technical Context
Our data architecture includes:
- S3 Firehose → Redshift: Primary analytics warehouse (5-15 min latency). Archives contain raw PII including IP addresses
- Production Postgres: Operational databases with active campaign data, user accounts, real-time operations
- Data synchronization: Postgres → Redshift via federated schemas and materialized views (5-30 min refresh intervals)
- Third-party constraints: TUNE retains transactional details for 123 days, then only summary data
We have existing tooling (Airflow/MWAA, dbt) that could be leveraged for automated deletion and archiving workflows.
Backup and Disaster Recovery Context
- RDS automated backups: 7-day retention for all production Postgres databases
- Redshift automated snapshots: 1-day retention (AWS default, configurable up to 35 days)
- Redshift manual snapshots: Currently retained indefinitely; will be updated to align with 12-month policy
Legal Context
- Publishers are typically data controllers; we act as processor (though controller-controller relationships are emerging)
- Standard contract language states data retained "only as long as needed for this agreement, disputes and billing"
- DPAs in place with publishers, advertisers, and TUNE
- COPPA compliance: We do not knowingly collect data from children under 13; publishers are responsible for age-gating
Considered Options
Option 1: Indefinite Retention (Status Quo)
- Retain all data indefinitely across all systems
- Pros: Maximum data availability for analysis and dispute resolution
- Cons: High compliance risk, unnecessary storage costs, violates GDPR data minimization principle
- Rejected: Not compliant with GDPR, CCPA, or contractual obligations
Option 2: Aggressive Deletion (90-Day Retention)
- Delete all PII after 90 days to align with fraud detection and most attribution windows
- Pros: Minimal compliance risk, lowest storage costs
- Cons: Cannot support 12-month billing disputes/chargebacks or long-term analytics; doesn't accommodate campaigns with longer attribution windows
- Rejected: Insufficient for business operations
Option 3: Uniform 12-Month Retention with Anonymization
- Retain detailed transactional data with PII for 12 months, then transition to anonymized/aggregated data for analytics
- Pros: Balances compliance and operational needs; simpler to implement than tiered retention; satisfies all business requirements including longest attribution windows and billing disputes
- Cons: Retains PII longer than strictly necessary for fraud detection (over-retention for some purposes)
- Selected: Best balance of compliance, business needs, and implementation simplicity
DECISION
We will implement a uniform 12-month retention policy with anonymization (Option 3) as follows:
Retention Periods by Purpose
- Primary Operational Data (Redshift, S3 Firehose):
  - 12 months: Full detailed data including indirect PII for all operational purposes (fraud detection, attribution, billing disputes, chargebacks, and audit requirements)
  - After 12 months: Anonymize/aggregate or delete PII; retain anonymized summary data indefinitely for historical analytics
- Production Postgres Databases:
  - Active campaigns + 12 months: Retain full operational data
  - After campaign end + 12 months: Archive or anonymize historical data (subject to technical impact assessment)
- Application Logs (Solarwinds/Papertrail):
  - Retain per vendor default policies (unlikely to contain PII)
- Aggregated/Summary Data:
  - Retain indefinitely (no PII, used for company performance analytics)
- Database Backups:
  - RDS automated backups: 7 days (existing AWS configuration)
  - Redshift automated snapshots: 1 day (existing AWS configuration, sufficient for disaster recovery)
  - Redshift manual snapshots: Maximum 365 days to align with 12-month retention policy
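The two retention rules above (event time + 12 months for the warehouse, campaign end + 12 months for Postgres) can be sketched as simple date checks. This is a minimal illustration, not the actual job logic: the function names are hypothetical, and treating "12 months" as 365 days is an assumption borrowed from the manual snapshot limit.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the 12-month window is approximated as 365 days.
RETENTION = timedelta(days=365)


def warehouse_record_expired(event_time: datetime, now: datetime) -> bool:
    """Redshift / S3 Firehose rule: PII expires 12 months after the event."""
    return now - event_time > RETENTION


def postgres_record_expired(campaign_end: Optional[datetime], now: datetime) -> bool:
    """Postgres rule: keep data while the campaign is active, then 12 more months."""
    if campaign_end is None:
        # No end date recorded: the campaign is treated as still active.
        return False
    return now - campaign_end > RETENTION
```

Passing `now` explicitly instead of calling `datetime.now()` inside the functions keeps the checks deterministic and easy to test in a scheduled job.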
Data Subject to Retention Policy
The following data types are considered indirect PII and subject to retention limits:
- IP addresses
- Device identifiers (IDFA, GAID, OAID, Android ID)
- Geographic data at city/postal code level
- User agent strings
- Any other identifiers that could reasonably identify an individual
Anonymization Requirements
After retention periods expire, data must be either:
- Deleted entirely, OR
- Anonymized by removing/hashing all indirect PII fields, OR
- Aggregated such that individual users cannot be identified
Anonymized/aggregated data may be retained indefinitely for business analytics.
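As a rough sketch of the "removing/hashing" option, the function below drops some indirect PII fields and replaces device identifiers with a keyed hash. The field names come from the data inventory in Appendix A; which fields are dropped versus hashed, the function name, and the use of HMAC-SHA256 are all assumptions for illustration. Keyed hashing is generally regarded as pseudonymization rather than anonymization for as long as the key exists, so whether it is sufficient remains subject to the legal review listed under Open Questions.

```python
import hashlib
import hmac

# Assumption: these fields are removed outright once the retention period expires.
DROP_FIELDS = {"ip", "ip_address", "useragent", "user_agent", "city", "postal_code", "lat", "lon"}
# Assumption: device identifiers are replaced by a keyed hash instead of removed.
HASH_FIELDS = {"gaid", "idfa", "oaid", "android_id", "adid"}


def anonymize_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` with indirect PII dropped or hashed."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # remove the field entirely
        if field in HASH_FIELDS and value is not None:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()
        else:
            cleaned[field] = value  # non-PII fields pass through unchanged
    return cleaned
```

Destroying the key after the transformation makes the hashed values much harder to link back to a device, at the cost of no longer being able to join new events against old ones.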
Implementation Approach
- Use existing Airflow/MWAA and dbt infrastructure for automated workflows
- Implement deletion/anonymization as scheduled jobs in data warehouse
- Apply same retention policies to both Redshift and S3 Firehose archives
- Coordinate with platform engineering on database backup retention policies
- Assess operational impact before implementing Postgres archiving/deletion
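For the S3 Firehose archives, one possible alternative to a scheduled deletion job is an S3 lifecycle rule that expires objects after the retention period. The sketch below only builds the rule; it is not the decided mechanism, and the prefix, rule ID, and bucket name are hypothetical.

```python
def build_lifecycle_configuration(prefix: str, retention_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "Rules": [
            {
                "ID": f"expire-pii-after-{retention_days}-days",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": retention_days},
            }
        ]
    }


# Applying the rule would look roughly like this (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-firehose-archive",  # hypothetical bucket name
#       LifecycleConfiguration=build_lifecycle_configuration("events/"),
#   )
```

Lifecycle expiration counts days from object creation, not from the event timestamps inside the object, and on a versioned bucket it leaves noncurrent versions in place unless a separate noncurrent-version rule is added.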
CONSEQUENCES
Positive Outcomes
- Regulatory compliance: Aligns with GDPR data minimization principles, CCPA requirements, and contractual obligations
- Reduced legal risk: Clear, documented retention policy defensible to regulators and in disputes
- Implementation simplicity: Uniform 12-month retention period eliminates complexity of tracking different campaign-specific retention windows
- Cost savings: Reduced storage costs for historical data
- Customer trust: Demonstrates commitment to data privacy
- Operational clarity: Clear guidelines for data lifecycle management
Negative Outcomes
- Implementation complexity: Requires development of automated deletion/anonymization workflows
- Over-retention for some purposes: Uniform 12-month retention keeps data longer than strictly necessary for fraud detection (90 days), which technically violates strict data minimization, though justified by implementation simplicity and business needs
- Historical analysis limitations: Some detailed historical analysis may no longer be possible after data is anonymized
- Potential operational disruption: Deletion from Postgres databases may impact service operations if not carefully implemented
- Recovery limitations: Cannot recover deleted/anonymized data if needed for unforeseen legal proceedings beyond 12 months
Impact on Architecture
- New scheduled jobs: Airflow/MWAA DAGs for automated deletion/anonymization
- dbt transformations: Logic to anonymize PII fields in aging data
- Monitoring requirements: Alerts and dashboards to track retention policy compliance
- Documentation updates: Data dictionaries and schema documentation must identify PII fields
- Application changes: Will likely require Laravel scheduled tasks for Postgres archiving/deletion
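The scheduled anonymization jobs listed above would ultimately issue SQL against Redshift. The helper below is a hedged sketch of how such a statement might be generated; the function name, table, and column names are hypothetical, and nulling out PII columns is only one of the options this ADR allows. It relies on Redshift's `DATEADD` and `GETDATE` functions.

```python
import re
from typing import Sequence

# Accepts plain or schema-qualified identifiers, e.g. "events" or "analytics.events".
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_anonymize_sql(
    table: str,
    ts_column: str,
    pii_columns: Sequence[str],
    retention_days: int = 365,
) -> str:
    """Build an UPDATE that nulls PII columns on rows older than the retention window."""
    if not pii_columns:
        raise ValueError("at least one PII column is required")
    for name in (table, ts_column, *pii_columns):
        # Identifiers cannot be bound as query parameters, so validate them strictly.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    assignments = ", ".join(f"{column} = NULL" for column in pii_columns)
    return (
        f"UPDATE {table} SET {assignments} "
        f"WHERE {ts_column} < DATEADD(day, -{int(retention_days)}, GETDATE())"
    )
```

For example, `build_anonymize_sql("events", "created_at", ["ip", "gaid"])` yields `UPDATE events SET ip = NULL, gaid = NULL WHERE created_at < DATEADD(day, -365, GETDATE())`. An Airflow task or dbt post-hook could run the resulting statement; large backfills would likely need batching and a subsequent vacuum.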
Impact on Operations
- Dispute resolution: Must complete investigations within 12-month window
- Audit preparation: Must rely on summary data for audits beyond 12 months
- Analytics workflows: Analysts must understand data availability windows
- Database maintenance: Regular archiving/deletion jobs will impact database performance during execution
Risks
- Data deletion errors: Risk of accidentally deleting data still needed for active disputes
  - Mitigation: Implement hold/flag system for data involved in active disputes; maintain audit trail of deletions
- Incomplete anonymization: Risk of re-identification through quasi-identifiers
  - Mitigation: Use industry-standard anonymization techniques; periodic privacy reviews
- Operational impact on Postgres: Deletion/archiving may degrade service performance
  - Mitigation: Thorough testing in staging; implement during low-traffic windows; assess impact before full rollout
- Compliance gaps during transition: Period between policy adoption and full implementation
  - Mitigation: Prioritize highest-risk data (longest retention) for initial cleanup; document interim state
- Evolving regulations: Future regulations may require different retention periods
  - Mitigation: Design flexible system that can adapt retention periods; annual policy review
- Third-party constraints: TUNE and other platforms may limit our ability to delete data
  - Mitigation: Document constraints; ensure contracts allow for data deletion requests
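The dispute hold and audit trail mitigations above could work along these lines. This is a minimal sketch under stated assumptions: the `Record` type, the `held_ids` set, and the audit entry format are hypothetical, and the function only selects records; it performs no deletion itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Record:
    record_id: str
    event_time: datetime


def select_for_deletion(
    records: Iterable[Record],
    held_ids: Set[str],
    now: datetime,
    retention_days: int = 365,
) -> Tuple[List[str], List[dict]]:
    """Return IDs past retention that are not on hold, plus an audit trail."""
    cutoff = now - timedelta(days=retention_days)
    to_delete: List[str] = []
    audit: List[dict] = []
    for record in records:
        if record.event_time >= cutoff:
            continue  # still inside the retention window, nothing to log
        if record.record_id in held_ids:
            action = "skipped_active_hold"  # expired, but flagged for a dispute
        else:
            action = "scheduled_for_deletion"
            to_delete.append(record.record_id)
        audit.append({"record_id": record.record_id, "action": action, "at": now.isoformat()})
    return to_delete, audit
```

Logging held records alongside scheduled ones gives auditors a way to confirm that over-retained data is explained by an active hold rather than by a failed job.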
NOTES
Implementation Phases
Phase 1: Policy Documentation & Planning (Months 1-2)
- Finalize retention periods with legal/compliance review
- Document PII fields across all systems
- Design anonymization/deletion workflows
- Assess Postgres archiving impact with platform engineering
Phase 2: Infrastructure Setup (Months 3-4)
- Implement Airflow/MWAA jobs for Redshift and S3
- Develop dbt anonymization transformations
- Create monitoring and alerting
- Implement dispute hold/flag system
Phase 3: Initial Cleanup (Month 5)
- Backfill deletion of data beyond 12-month retention
- Verify data integrity post-deletion
- Address any operational issues
Phase 4: Postgres Implementation (Months 6-7)
- Implement Laravel scheduled tasks for archiving/deletion (likely approach)
- Test operational impact in staging
- Gradual rollout to production
Phase 5: Ongoing Operations
- Scheduled jobs run automatically
- Quarterly compliance audits
- Annual policy review
Open Questions
- Specific retention requirements for CPA platform data (Rakuten, etc.)
- Legal review of anonymization techniques to ensure adequate de-identification
- Coordination with publishers/advertisers on updated DPA terms
References
- Data Inventory & Purposes Questionnaire
- GDPR Article 5(1)(e) - Storage Limitation Principle
- CCPA/CPRA Section 1798.100 - Consumer Rights
- [Internal] TUNE Data Processing Agreement
- PR #83: docs: add data retention policy ADR for PII management
- PR #127: docs: backfill PR reference links for existing ADRs
Original Author
Ron White (@ronco)
Approval date
[To be determined]
Approved by
[To be determined - likely Legal, Engineering Leadership, Compliance Officer]
Appendix
A: Data Inventory Summary
At ad impression and click:
- event, adgem_uid, app_id, campaign_id, carrier_name, city, connection_type, country, device identifiers (gaid, idfa), ip, isp, lat/lon, os_version, platform, player_id, postal_code, publisher_id, useragent, and related metadata
At install and post-install events:
- activity_kind fields including device identifiers (adid, android_id, idfa, oaid), ip_address, user_agent, device information, geographic data, and conversion events
B: System Architecture Overview
Standard Data Flow: MMP → Postback Processor → TUNE → Kinesis → Firehose → S3 → Redshift
CPA Campaign Flow: Cron job queries CPA platforms (e.g., Rakuten) → posts to TUNE → follows standard flow
Database Synchronization: Production Postgres → Redshift (via federated schemas + materialized views, 5-30 min refresh)
C: Compliance Requirements Matrix
| Regulation | Key Requirements | Retention Impact |
|---|---|---|
| GDPR | Data minimization, purpose limitation | Must justify retention beyond operational need |
| CCPA/CPRA | Right to deletion, disclosure requirements | Must implement deletion within 45 days of request |
| State Privacy Laws | Similar to CCPA across multiple states | GDPR compliance generally sufficient |
| COPPA | No data collection from children under 13 | Publishers handle age-gating; our SDK complies |
| Contractual | "Only as long as needed for agreement, disputes, billing" | 12-month outer limit for most contractual purposes |