0009: Postback Processor Phase 2 Architecture

STATUS

Accepted

CONTEXT

Currently, when an MMP conversion triggers a postback, the MMP sends that postback directly to Tune. This has several shortcomings that cause data loss: Tune has no capability to process fraudulent postbacks, and it ignores any postback that arrives for a paused campaign.

To address these shortcomings, we would like to directly capture these postbacks ourselves to have a more complete picture of the MMP activities. This cleaner and more comprehensive feed of data will be used for: more sophisticated fraud detection, elimination of breakage, and improved cost reporting and invoicing.

Beyond collecting this data, Tune will still need to receive postbacks for upstream processing by our systems and publishers. As a result, the new system must make postback calls to Tune.

Considered Options

    1. An asynchronous queue-based system for receiving and forwarding postbacks. The exposed API would be API Gateway, and the backing queue system would be:
  • a. Kinesis -> high throughput, no guaranteed delivery.
  • b. SQS -> very high throughput, guaranteed at-least-once delivery. - DECISION
  • c. SQS FIFO -> lower throughput, guaranteed exactly-once delivery.
    2. A synchronous API Gateway Lambda to store and forward postbacks.

DECISION

The postback processor will consist of API Gateway backed by an SQS queue to receive MMP postbacks. The queued postback events will be handled by Lambdas responsible for both sending postbacks to Tune and sending the data to Data Firehose for storage in Redshift.

Sequence

sequenceDiagram
    MMP->>API Gateway: Postback sent
    API Gateway-->>SQS: Event queued
    SQS--)Lambda handler: Event consumed
    Lambda handler->>Tune: Postback sent
    Lambda handler->>Firehose: Data sent
    AWS Airflow-->>Firehose: Consumes
    AWS Airflow->>Redshift: Saves
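To make the Lambda handler's two responsibilities concrete, here is a minimal sketch in Python using only the standard library. The payload shape (`event`, `query`, `timestamp` keys in the SQS message body) and field names are assumptions for illustration, not the real contract; the actual Tune HTTP call and `put_record_batch` call to Firehose are noted in comments rather than implemented.

```python
import json
from urllib.parse import parse_qsl


def to_event_row(sqs_record: dict) -> dict:
    """Map one SQS record (queued by API Gateway) to the RAW EVENT row shape.

    The body is assumed to carry the original postback's query string and an
    event name; these key names are illustrative placeholders.
    """
    body = json.loads(sqs_record["body"])
    return {
        "event": body.get("event", "unknown")[:255],      # varchar(255) column
        "data": dict(parse_qsl(body.get("query", ""))),   # lands in the SUPER column
        "event_time": body.get("timestamp"),
    }


def handler(event: dict, context=None) -> list[dict]:
    """Lambda entry point: one SQS batch in, RAW rows out.

    In the real handler each row would also be forwarded to Tune (an HTTP
    request) and written to Firehose via put_record_batch; both external
    calls are omitted so this sketch stays self-contained.
    """
    return [to_event_row(r) for r in event.get("Records", [])]
```

Because SQS delivers in batches, the handler maps over `event["Records"]`; a partial-batch failure strategy (e.g. `ReportBatchItemFailures`) would be layered on top of this in the real implementation.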

Architecture

The full C4 System Context and Container Context diagrams for phases 1, 2, and 3 are available in this Miro doc. Please see the frame titled "Phase 2 Container Context Option C (High throughput, at-least-once delivery, order not guaranteed)" for this particular solution. Provided as an image here for convenience:

Data Load

This system uses Redshift's COPY command, run from an Airflow DAG, to load the Firehose data into Redshift. This setup closely mirrors the one detailed in Analytical Events Architecture.
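As a sketch of what the Airflow task would execute, the helper below renders a Redshift COPY statement for the Firehose output. The table name, S3 prefix, and IAM role in the usage example are placeholders, and the format options are assumptions; the real DAG follows the Analytical Events setup referenced above.

```python
def build_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Render the Redshift COPY statement an Airflow task would run.

    Firehose writes newline-delimited JSON to S3, so FORMAT AS JSON 'auto'
    is assumed here; all identifiers are illustrative placeholders.
    """
    return (
        f"COPY {table} "
        f"FROM 's3://{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto' "
        "TIMEFORMAT 'auto';"
    )
```

In the DAG this string would be handed to a Redshift operator or a plain SQL hook; keeping it as a pure function makes the statement easy to unit-test.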

Raw Schema

The preliminary schema is very basic, relying on a single semi-structured field to capture all of the request parameters. Request parameters are not standardized across the different MMPs, so we have not yet attempted to extract and normalize them.

erDiagram
    EVENT {
        varchar(255) event
        SUPER data
        TIMESTAMP event_time
    }

Other options

A synchronous implementation was ruled out due to the high risk of data loss. With a queue (and a connected dead-letter queue), we can replay messages if any system or code issue arises.

Kinesis was ruled out due to its lack of a delivery guarantee; it provides only at-most-once assurances. Additionally, it does not provide a dead-letter queue.

Simple Queue Service provides both a dead-letter queue and guaranteed delivery. SQS FIFO was appealing for its exactly-once delivery and guaranteed ordering. However, it is more expensive and has much lower throughput than standard SQS (300 messages/sec vs. effectively unlimited). Additionally, Tune can handle duplicate postbacks, making exactly-once delivery more of a luxury than a must-have.

Alternate queue systems were dismissed in favor of AWS-managed infrastructure.

We briefly explored a high-availability solution for API Gateway and SQS. This would use region redundancy and DNS failover for API Gateway, and distributed queues for SQS. Forgoing HA is deemed an acceptable risk for the following reasons:

 * Most (if not all) MMPs will attempt several retries over a
   period of time if their postbacks fail.
 * The data can still be accessed via MMP reports or APIs.
 * The SLA of the services is quite high (99.9% for SQS and 99.95%
   for API Gateway).
 * The potential revenue loss for downtimes connected to those SLA
   does not justify the added expense for a HA solution.

CONSEQUENCES

  • Positive outcomes
      • Robust, fault-tolerant solution
      • Comprehensive dataset for analysis
  • Negative outcomes
      • No high availability: if SQS or API Gateway has an outage, the system will be unavailable.
      • Duplicates are possible in Firehose due to SQS at-least-once delivery. While not ideal, this can be mitigated during dbt analysis model generation.
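The duplicate mitigation mentioned above amounts to keep-first deduplication per message key. Expressed in Python for clarity (the `message_id` key is an assumption; in the actual dbt model this would be a ROW_NUMBER() window over whichever key identifies a postback):

```python
def dedupe(rows: list[dict], key: str = "message_id") -> list[dict]:
    """Drop duplicate postbacks produced by SQS at-least-once delivery.

    Keeps the first occurrence per key, mirroring the keep-one-row-per-
    partition pattern the dbt analysis models would apply in SQL.
    """
    seen: set = set()
    out: list[dict] = []
    for row in rows:
        k = row.get(key)
        if k in seen:
            continue
        seen.add(k)
        out.append(row)
    return out
```

Deferring deduplication to the analysis layer keeps the ingest path simple and lossless: raw duplicates remain in Redshift for auditing, and only the derived models are deduplicated.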

Risks

  • An extended AWS outage (in either SQS or API Gateway) could lead to lost postbacks and thus lost data and revenue.

NOTES

References

Original Author

Ron White

Approval date

18-Sep-24

Approved by

Appendix