0044: Data Retention Policy for Personal and Indirect PII

STATUS

Accepted

CONTEXT

As a performance marketing firm processing user data for rewarded ads, we collect and process personal data and indirect PII (IP addresses, IDFA, GAID, device identifiers) throughout our attribution and billing pipeline. We are subject to multiple data protection regulations including GDPR, CCPA/CPRA, and other US state privacy laws, as well as COPPA considerations for age-gated content.

Currently, we have no formalized data retention or deletion policies in place. Data persists indefinitely in our systems (Redshift, S3 Firehose archives, production Postgres databases) without automated cleanup, creating compliance risks and unnecessary storage costs.

Business Requirements

Our data serves multiple operational purposes with varying retention needs:

  • Fraud detection: Requires 90 days of detailed transaction data
  • Attribution: Most campaigns use 90-day attribution windows; some extend to 180 days or lifetime
  • Billing & dispute resolution: Requires 12 months for chargebacks, disputes, and audit purposes
  • Performance analytics: Requires 1 year of transactional data; 2+ years of aggregated/anonymized data

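Given the complexity of managing different retention windows for different campaign types, we will implement a uniform 12-month retention period for all PII-containing operational data. This covers every finite window listed above (fraud detection, attribution up to 180 days, billing disputes, and detailed analytics) while simplifying implementation. Campaigns with lifetime attribution windows are the exception: their PII-level data will not be available beyond 12 months.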
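As a rough check of this reasoning, the sketch below compares each stated window with the uniform period. It assumes 12 months means 365 days and models lifetime attribution as having no finite window. The dictionary keys and the `coverage` helper are illustrative names, not identifiers from our systems.

```python
# Retention needs in days, taken from the business requirements list.
# Lifetime attribution has no finite window, so it is modeled as None.
REQUIREMENTS_DAYS = {
    "fraud_detection": 90,
    "attribution_standard": 90,
    "attribution_extended": 180,
    "attribution_lifetime": None,
    "billing_disputes": 365,
    "performance_analytics_detailed": 365,
}

UNIFORM_RETENTION_DAYS = 365  # assumption: 12 months treated as 365 days


def coverage(requirements, retention_days):
    """Split requirements into those a fixed retention period covers and those it cannot."""
    covered, uncovered = [], []
    for name, days in requirements.items():
        if days is not None and days <= retention_days:
            covered.append(name)
        else:
            uncovered.append(name)
    return covered, uncovered


covered, uncovered = coverage(REQUIREMENTS_DAYS, UNIFORM_RETENTION_DAYS)
print("covered:", covered)
print("not covered by a fixed window:", uncovered)
```

Under these assumptions only the lifetime attribution case falls outside the 12-month window. After that point it would have to rely on anonymized or aggregated data.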

Technical Context

Our data architecture includes:

  • S3 Firehose → Redshift: Primary analytics warehouse (5-15 min latency). Archives contain raw PII including IP addresses
  • Production Postgres: Operational databases with active campaign data, user accounts, real-time operations
  • Data synchronization: Postgres → Redshift via federated schemas and materialized views (5-30 min refresh intervals)
  • Third-party constraints: TUNE retains transactional details for 123 days, then only summary data

We have existing tooling (Airflow/MWAA, dbt) that could be leveraged for automated deletion and archiving workflows.

Backup and Disaster Recovery Context

  • RDS automated backups: 7-day retention for all production Postgres databases
  • Redshift automated snapshots: 1-day retention (AWS default, configurable up to 35 days)
  • Redshift manual snapshots: Currently retained indefinitely; will be updated to align with 12-month policy

Legal and Contractual Context

  • Publishers are typically data controllers; we act as processor (though controller-controller relationships are emerging)
  • Standard contract language states data retained "only as long as needed for this agreement, disputes and billing"
  • DPAs in place with publishers, advertisers, and TUNE
  • COPPA compliance: We do not knowingly collect data from children under 13; publishers are responsible for age-gating

Considered Options

Option 1: Indefinite Retention (Status Quo)

  • Retain all data indefinitely across all systems
  • Pros: Maximum data availability for analysis and dispute resolution
  • Cons: High compliance risk, unnecessary storage costs, violates GDPR data minimization principle
  • Rejected: Not compliant with GDPR, CCPA, or contractual obligations

Option 2: Aggressive Deletion (90-Day Retention)

  • Delete all PII after 90 days to align with fraud detection and most attribution windows
  • Pros: Minimal compliance risk, lowest storage costs
  • Cons: Cannot support 12-month billing disputes/chargebacks or long-term analytics; doesn't accommodate campaigns with longer attribution windows
  • Rejected: Insufficient for business operations

Option 3: Uniform 12-Month Retention with Anonymization

  • Retain detailed transactional data with PII for 12 months, then transition to anonymized/aggregated data for analytics
  • Pros: Balances compliance and operational needs; simpler to implement than tiered retention; satisfies the listed business requirements, including 180-day attribution windows and 12-month billing disputes (lifetime attribution windows excepted)
  • Cons: Retains PII longer than strictly necessary for fraud detection (over-retention for some purposes)
  • Selected: Best balance of compliance, business needs, and implementation simplicity

DECISION

We will implement a uniform 12-month retention policy with anonymization (Option 3) as follows:

Retention Periods by Purpose

  1. Primary Operational Data (Redshift, S3 Firehose):
       • 12 months: Full detailed data including indirect PII for all operational purposes (fraud detection, attribution, billing disputes, chargebacks, and audit requirements)
       • After 12 months: Anonymize/aggregate or delete PII; retain anonymized summary data indefinitely for historical analytics

  2. Production Postgres Databases:
       • Active campaigns + 12 months: Retain full operational data
       • After campaign end + 12 months: Archive or anonymize historical data (subject to technical impact assessment)

  3. Application Logs (Solarwinds/Papertrail):
       • Retain per vendor default policies (unlikely to contain PII)

  4. Aggregated/Summary Data:
       • Retain indefinitely (no PII, used for company performance analytics)

  5. Database Backups:
       • RDS automated backups: 7 days (existing AWS configuration)
       • Redshift automated snapshots: 1 day (existing AWS configuration, sufficient for disaster recovery)
       • Redshift manual snapshots: Maximum 365 days to align with 12-month retention policy
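The cutoffs implied by these periods reduce to simple date arithmetic. The sketch below is illustrative only. It assumes 12 months is treated as 365 days (matching the manual snapshot limit) and that a campaign with no end date is still active. The function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

RETENTION_DAYS = 365  # assumption: 12 months treated as 365 days


def event_expired(event_date: date, today: date) -> bool:
    """Warehouse rule: an event is past retention once it is older than the cutoff."""
    return event_date < today - timedelta(days=RETENTION_DAYS)


def campaign_expired(campaign_end: Optional[date], today: date) -> bool:
    """Postgres rule: the retention clock starts only when the campaign ends."""
    if campaign_end is None:  # campaign still active, so nothing expires
        return False
    return campaign_end < today - timedelta(days=RETENTION_DAYS)


today = date(2025, 6, 1)
print(event_expired(date(2024, 5, 31), today))     # older than the cutoff
print(event_expired(date(2024, 6, 1), today))      # exactly on the cutoff, still retained
print(campaign_expired(None, today))               # active campaign
print(campaign_expired(date(2023, 12, 31), today)) # ended more than 12 months ago
```

Whether the boundary day itself is retained or deleted is a policy detail this sketch settles arbitrarily. It keeps data that is exactly 365 days old.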

Data Subject to Retention Policy

The following data types are considered indirect PII and subject to retention limits:

  • IP addresses
  • Device identifiers (IDFA, GAID, OAID, Android ID)
  • Geographic data at city/postal code level
  • User agent strings
  • Any other identifiers that could reasonably identify an individual

Anonymization Requirements

After retention periods expire, data must be either:

  1. Deleted entirely, OR
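  2. Anonymized by removing or irreversibly transforming all indirect PII fields (hashing alone is generally treated as pseudonymization rather than anonymization, so any hashing scheme is subject to the legal review listed under Open Questions), OR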
  3. Aggregated such that individual users cannot be identified

Anonymized/aggregated data may be retained indefinitely for business analytics.
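A minimal sketch of the second and third options follows. It drops PII fields from a record and aggregates records into counts, suppressing small groups. The field names are loosely based on Appendix A and may not match actual column names. The `min_group_size` threshold of 10 is an illustrative assumption, not a value set by this policy.

```python
from collections import Counter

# Assumed PII field names, loosely based on Appendix A; real columns may differ.
PII_FIELDS = {
    "ip", "ip_address", "gaid", "idfa", "oaid", "android_id", "adid",
    "city", "postal_code", "lat", "lon", "useragent", "user_agent",
}


def strip_pii(record: dict) -> dict:
    """Return a copy of the record with every assumed PII field removed."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


def aggregate(records, keys=("campaign_id", "country"), min_group_size=10):
    """Count records per group and drop groups too small to hide an individual."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {group: n for group, n in counts.items() if n >= min_group_size}


record = {"campaign_id": 7, "country": "US", "ip": "203.0.113.5", "idfa": "X", "event": "click"}
print(strip_pii(record))
```

Removing fields outright avoids the re-identification concerns that come with hashed identifiers. The small-group suppression is one simple guard against quasi-identifier re-identification in aggregates.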

Implementation Approach

  • Use existing Airflow/MWAA and dbt infrastructure for automated workflows
  • Implement deletion/anonymization as scheduled jobs in data warehouse
  • Apply same retention policies to both Redshift and S3 Firehose archives
  • Coordinate with platform engineering on database backup retention policies
  • Assess operational impact before implementing Postgres archiving/deletion
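As one illustration of a scheduled cleanup step, the sketch below selects S3 archive partitions that have aged past retention. It assumes objects sit under Firehose's default UTC `YYYY/MM/DD/HH/` prefix layout with no custom prefix, and that 12 months means 365 days. `expired_prefixes` is a hypothetical helper, and the actual deletion call is left out.

```python
from datetime import datetime, timedelta, timezone


def expired_prefixes(prefixes, now, retention_days=365):
    """Return the hourly prefixes whose timestamp is older than the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for prefix in prefixes:
        try:
            ts = datetime.strptime(prefix.strip("/"), "%Y/%m/%d/%H")
        except ValueError:
            continue  # skip anything that does not match the assumed layout
        if ts.replace(tzinfo=timezone.utc) < cutoff:
            expired.append(prefix)
    return expired


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
candidates = ["2024/05/31/23/", "2024/06/01/00/", "2025/01/01/00/", "misc/"]
print(expired_prefixes(candidates, now))
```

A job built this way would pass the returned prefixes to whatever deletion or anonymization step is chosen. Unrecognized prefixes are skipped rather than deleted, which errs on the side of keeping data.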

CONSEQUENCES

Positive Outcomes

  • Regulatory compliance: Aligns with GDPR data minimization principles, CCPA requirements, and contractual obligations
  • Reduced legal risk: Clear, documented retention policy defensible to regulators and in disputes
  • Implementation simplicity: Uniform 12-month retention period eliminates complexity of tracking different campaign-specific retention windows
  • Cost savings: Reduced storage costs for historical data
  • Customer trust: Demonstrates commitment to data privacy
  • Operational clarity: Clear guidelines for data lifecycle management

Negative Outcomes

  • Implementation complexity: Requires development of automated deletion/anonymization workflows
  • Over-retention for some purposes: Uniform 12-month retention keeps data longer than strictly necessary for fraud detection (90 days), which technically violates strict data minimization, though justified by implementation simplicity and business needs
  • Historical analysis limitations: Some detailed historical analysis may no longer be possible after data is anonymized
  • Potential operational disruption: Deletion from Postgres databases may impact service operations if not carefully implemented
  • Recovery limitations: Cannot recover deleted/anonymized data if needed for unforeseen legal proceedings beyond 12 months

Impact on Architecture

  • New scheduled jobs: Airflow/MWAA DAGs for automated deletion/anonymization
  • dbt transformations: Logic to anonymize PII fields in aging data
  • Monitoring requirements: Alerts and dashboards to track retention policy compliance
  • Documentation updates: Data dictionaries and schema documentation must identify PII fields
  • Application changes: Will likely require Laravel scheduled tasks for Postgres archiving/deletion
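The check behind the monitoring alerts could be as simple as comparing the oldest PII-bearing record in each system against the retention limit. The sketch below assumes a 365-day limit plus a one-day grace period to allow for daily job runs. The system names and the `retention_violations` helper are illustrative.

```python
from datetime import date, timedelta


def retention_violations(oldest_pii_date, today, retention_days=365, grace_days=1):
    """Return the systems whose oldest PII record is older than retention plus grace."""
    limit = today - timedelta(days=retention_days + grace_days)
    return sorted(name for name, oldest in oldest_pii_date.items() if oldest < limit)


oldest = {
    "redshift": date(2023, 1, 1),
    "s3_firehose": date(2024, 6, 1),
    "postgres": date(2024, 5, 30),
}
print(retention_violations(oldest, today=date(2025, 6, 1)))
```

In practice the input dates would come from a per-system query for the minimum event timestamp. That query is outside the scope of this sketch.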

Impact on Operations

  • Dispute resolution: Must complete investigations within 12-month window
  • Audit preparation: Must rely on summary data for audits beyond 12 months
  • Analytics workflows: Analysts must understand data availability windows
  • Database maintenance: Regular archiving/deletion jobs will impact database performance during execution

Risks

  1. Data deletion errors: Risk of accidentally deleting data still needed for active disputes
       • Mitigation: Implement hold/flag system for data involved in active disputes; maintain audit trail of deletions

  2. Incomplete anonymization: Risk of re-identification through quasi-identifiers
       • Mitigation: Use industry-standard anonymization techniques; periodic privacy reviews

  3. Operational impact on Postgres: Deletion/archiving may degrade service performance
       • Mitigation: Thorough testing in staging; implement during low-traffic windows; assess impact before full rollout

  4. Compliance gaps during transition: Period between policy adoption and full implementation
       • Mitigation: Prioritize highest-risk data (longest retention) for initial cleanup; document interim state

  5. Evolving regulations: Future regulations may require different retention periods
       • Mitigation: Design flexible system that can adapt retention periods; annual policy review

  6. Third-party constraints: TUNE and other platforms may limit our ability to delete data
       • Mitigation: Document constraints; ensure contracts allow for data deletion requests
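The dispute hold and audit trail mitigation for the first risk can be sketched as a planning step that runs before any deletion. The `Record` type, the `held_ids` set, and the audit entry shape are all hypothetical. A real implementation would read holds from wherever disputes are tracked.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    record_id: str
    event_date: date


def plan_deletion(records, held_ids, today, retention_days=365):
    """Pick expired records for deletion, skip held ones, and log every decision."""
    cutoff = today - timedelta(days=retention_days)
    to_delete, audit = [], []
    for r in records:
        if r.event_date >= cutoff:
            continue  # still inside the retention window, nothing to decide
        action = "held" if r.record_id in held_ids else "delete"
        if action == "delete":
            to_delete.append(r.record_id)
        audit.append({"record_id": r.record_id, "action": action, "cutoff": cutoff.isoformat()})
    return to_delete, audit


records = [
    Record("a", date(2024, 1, 1)),
    Record("b", date(2024, 1, 2)),  # expired, but part of an active dispute
    Record("c", date(2025, 5, 1)),
]
to_delete, audit = plan_deletion(records, held_ids={"b"}, today=date(2025, 6, 1))
print(to_delete)
print(audit)
```

Logging held records alongside deleted ones gives reviewers a way to confirm that a hold was honored, not only that a deletion happened.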

NOTES

Implementation Phases

Phase 1: Policy Documentation & Planning (Months 1-2)

  • Finalize retention periods with legal/compliance review
  • Document PII fields across all systems
  • Design anonymization/deletion workflows
  • Assess Postgres archiving impact with platform engineering

Phase 2: Infrastructure Setup (Months 3-4)

  • Implement Airflow/MWAA jobs for Redshift and S3
  • Develop dbt anonymization transformations
  • Create monitoring and alerting
  • Implement dispute hold/flag system

Phase 3: Initial Cleanup (Month 5)

  • Backfill deletion of data beyond 12-month retention
  • Verify data integrity post-deletion
  • Address any operational issues

Phase 4: Postgres Implementation (Months 6-7)

  • Implement Laravel scheduled tasks for archiving/deletion (likely approach)
  • Test operational impact in staging
  • Gradual rollout to production

Phase 5: Ongoing Operations

  • Scheduled jobs run automatically
  • Quarterly compliance audits
  • Annual policy review

Open Questions

  1. Specific retention requirements for CPA platform data (Rakuten, etc.)
  2. Legal review of anonymization techniques to ensure adequate de-identification
  3. Coordination with publishers/advertisers on updated DPA terms

References

Original Author

Ron White (@ronco)

Approval date

[To be determined]

Approved by

[To be determined - likely Legal, Engineering Leadership, Compliance Officer]

Appendix

A: Data Inventory Summary

At ad impression and click:

  • event, adgem_uid, app_id, campaign_id, carrier_name, city, connection_type, country, device identifiers (gaid, idfa), ip, isp, lat/lon, os_version, platform, player_id, postal_code, publisher_id, useragent, and related metadata

At install and post-install events:

  • activity_kind fields including device identifiers (adid, android_id, idfa, oaid), ip_address, user_agent, device information, geographic data, and conversion events

B: System Architecture Overview

Standard Data Flow: MMP → Postback Processor → Tune → Kinesis → Firehose → S3 → Redshift

CPA Campaign Flow: Cron job queries CPA platforms (e.g., Rakuten) → posts to Tune → follows standard flow

Database Synchronization: Production Postgres → Redshift (via federated schemas + materialized views, 5-30 min refresh)

C: Compliance Requirements Matrix

| Regulation | Key Requirements | Retention Impact |
|---|---|---|
| GDPR | Data minimization, purpose limitation | Must justify retention beyond operational need |
| CCPA/CPRA | Right to deletion, disclosure requirements | Must implement deletion within 45 days of request |
| State Privacy Laws | Similar to CCPA across multiple states | GDPR compliance generally sufficient |
| COPPA | No data collection from children under 13 | Publishers handle age-gating; our SDK complies |
| Contractual | "Only as long as needed for agreement, disputes, billing" | 12-month outer limit for most contractual purposes |