Introducing Artie Transfer

Jan 17, 2023
Author: Robin Tang
As an avid data warehouse (DWH) user in various roles, I have always run into limitations (and frustrations!) around the latency between the data in our DWH and the data sitting in our online transactional (OLTP) databases. Depending on where I worked, the lag ranged from multiple hours to days, and larger companies tended to be on the slower end.

What are some use cases that are hindered by this replication lag?

Given that the DWH is a platform, there are a multitude of use cases that can be enabled or strengthened by having shorter data lag. I’ll go over some examples below:

Operations heavy companies

Ops-heavy companies typically have requirements and business processes that are constantly changing. As a result, the engineering team struggles to keep up, which is why no-code and low-code tools like Zapier, Typeform, Retool, and Tinybird have become part of the standard toolkit for these types of companies.
These tools can stack on top of each other and reference DWH data, and their effectiveness is determined by the data replication lag. Not every table needs to be blazing fast (that would be nice to have), but most companies do have a handful of critical tables where it matters. For example, a food delivery company may set up a Retool application as a custom Zendesk app so that it can pull up a customer's most recent orders and interactions.
In addition to the previous examples, companies in these industries would care about reducing data lag for the following tables (what we call "critical tables"):
  • Food delivery: orders
  • Real estate: inventory, orders
  • Rideshare: rides
  • Support: tickets, emails

Lifecycle and Paid Marketing Use Cases

Within Lifecycle Marketing, it is common to purchase marketing automation tools like Iterable, Braze, and Intercom. Each of these tools has its own version of what the user model and events should look like, so that marketers can create email templates like: Hi {{first_name}}!
A couple of examples of where additional user attributes come into play:
  1. Paid marketing: Once the customer requests a ride, we'd like to send as many customer traits as possible to destinations like Google and Facebook so they can optimize their algorithms and find more look-alikes.
  2. Nurture campaigns: When the customer signs up on our website, we'd like to drop them into a drip campaign that welcomes and onboards them. We'd like to reference dynamic fields like product interactions (Did they do more than just sign up? Did they view an article already? Have they requested a ride yet?) and other customer attributes.

So how do we send this data today?

Typically, teams will have different pipelines that build custom schemas to send to the various destinations. This type of setup has several drawbacks:
  1. Changes are blocked by engineering.
  2. User fields are almost never in one service. As a result, we need to call other services (which means error handling and retries), and backfills may generate enough traffic to DDoS internal services that aren't equipped for the load.
  3. It requires maintenance and ongoing support. If a field is added incorrectly and doesn't reference an indexed column, it can slow the whole pipeline down.
  4. The data is usually not in the DWH, so it's hard to do segmentation and reporting.
This could be solved by building materialized views, especially if we could use dbt (a more expressive framework) to create these views. However, these solutions are not mature enough to handle most marketing use cases.

Why are dbt or materialized views not sufficient?

When creating a materialized view, we need to specify how often the view is regenerated by defining a schedule (for example, Snowflake tasks). To visualize the resulting latency, we can draw out the following equation:
View staleness = MAX(referenced table staleness) + view regeneration interval
For example, we can have the materialized view run every 5 minutes, but that doesn't really matter if the referenced table is only updated every 6 hours. The resulting view is still 6 hours behind.
This is why we built Artie Transfer. Artie Transfer alleviates the first part of the equation, MAX(referenced table staleness), by removing the raw OLTP table latency. From there, getting fresher views is just a matter of increasing the frequency of view regeneration.

How are companies solving the data lag problem today and why isn’t it a great fix?

The traditional process for updating DWH data looks something like this:
  • Take a snapshot of the Postgres table (as CSV), e.g. via pg_dump
  • Parse the output and format the data for DWH consumption
  • Upload the data to the DWH
And repeat... though how often can you realistically run this each day?
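To make that loop concrete, here is a minimal sketch of the batch approach in Python. It assumes a Postgres source, an S3 bucket as a staging area, and a warehouse that can load CSVs from S3; the table, bucket, and connection details are placeholders.

```python
import psycopg2  # Postgres driver
import boto3     # S3 client used to stage the CSV for the warehouse

# 1. Snapshot the table to a local CSV file.
conn = psycopg2.connect("dbname=app user=replicator host=db.internal")  # placeholder DSN
with conn.cursor() as cur, open("orders.csv", "w") as f:
    cur.copy_expert("COPY orders TO STDOUT WITH CSV HEADER", f)
conn.close()

# 2. Stage the file somewhere the warehouse can read it (S3 here).
boto3.client("s3").upload_file("orders.csv", "dwh-staging-bucket", "snapshots/orders.csv")

# 3. Run a COPY/LOAD statement in the warehouse to replace the table from the staged file,
#    then schedule this whole script with cron and hope each run finishes before the next starts.
```

Even with incremental tweaks, this is still a batch job: the freshness of the warehouse is bounded by how often you can afford to run it.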
Savvy companies that are experiencing this pain typically purchase Fivetran for Databases or Fivetran Teleport. However, this is not always a viable option, for reasons such as cost and missing features or integrations.
Engineering teams can and have tried to build this out themselves, but the solution is tricky and hard to scale. Why?
  • Change data capture (CDC) streams do not handle schema changes (DDL) such as adding or dropping columns on their own
  • Companies typically use more than one database type, and each type requires a different parser and potentially a separate pipeline
  • Reliability is feature zero. Faster data replication is great, but only if the results are reliable. If an in-house process misses a CDC event or processes rows out of order, the data is no longer eventually consistent and we end up with the wrong view (a minimal guard against this is sketched below)
  • DWHs are designed for bulk COPY commands rather than high-QPS (queries per second) row-level mutations, so handling updates and deletes requires a workaround
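To illustrate the ordering point above, a common guard is to track a per-row version (for example, the source log sequence number) and only apply an event if it is newer than what was last applied. This is a simplified, hypothetical sketch of the technique, not Artie Transfer's actual implementation; apply_to_warehouse is a stand-in for a real merge.

```python
# Track the latest applied version (e.g. a Postgres LSN) per primary key.
applied_versions: dict[str, int] = {}

def apply_to_warehouse(event: dict) -> None:
    """Stand-in for an actual MERGE/upsert against the warehouse."""

def handle_cdc_event(event: dict) -> None:
    """Apply a CDC event only if it is newer than what was already applied."""
    pk, version = event["primary_key"], event["source_lsn"]
    if applied_versions.get(pk, -1) >= version:
        return  # stale or duplicate event: skipping it keeps the pipeline idempotent
    apply_to_warehouse(event)
    applied_versions[pk] = version
```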

Enter Artie Transfer

Artie Transfer continuously streams OLTP data (via CDC) and replays it onto the specified DWH, shrinking the replication lag from hours or days down to seconds. We believe this will allow the whole ecosystem to generate better insights and unlock additional use cases that were previously unattainable due to this constraint.
To deliver on this promise, here is how Artie Transfer works under the hood.
The end-to-end flow is broken up into two sections:
#1 - Capture the CDC events
We will be running a connector to read the database logs and publish them onto Kafka. Our preference is to run Debezium whenever available.
  • There will be one topic per table so we can independently scale the downstream workloads depending on table throughput.
  • The Kafka message's partition key will be the table's primary key, so all changes to the same row land in the same partition and stay in order.
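For illustration, registering a Debezium Postgres connector against a Kafka Connect cluster could look roughly like this. The host names, credentials, and table list are placeholders, and the exact property names vary across Debezium versions; by default, Debezium keys each message by the table's primary key, which lines up with the partitioning described above.

```python
import requests  # Kafka Connect exposes a REST API for managing connectors

# Illustrative Debezium Postgres connector config (property names differ slightly by version).
connector = {
    "name": "app-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "db.internal",     # placeholder
        "database.port": "5432",
        "database.user": "replicator",          # placeholder
        "database.password": "********",
        "database.dbname": "app",
        "table.include.list": "public.orders",  # each table gets its own topic
        "topic.prefix": "cdc",                  # e.g. topic "cdc.public.orders"
    },
}

resp = requests.post("http://connect.internal:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```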
#2 - Artie Transfer
Artie Transfer then subscribes to the Kafka topics and builds an in-memory representation of the data in each topic (one Artie Transfer consumer can subscribe to one or more Kafka topics). It flushes whenever the flush interval has elapsed (default is 10s) or the in-memory store is full, whichever comes first. On each flush, it issues an optimized merge query against the downstream data warehouse, applying micro-batches of changed data.
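As a rough picture of that consume, buffer, and flush loop, here is a simplified sketch. It is an assumed approximation rather than Artie Transfer's actual code: the buffer limit is made up, and merge_into_warehouse stands in for the merge query issued against the destination.

```python
import time
from kafka import KafkaConsumer  # kafka-python

FLUSH_INTERVAL_SECS = 10      # matches the default flush interval mentioned above
MAX_BUFFERED_ROWS = 50_000    # stand-in for "the in-memory database is full"

def merge_into_warehouse(table: str, rows: dict) -> None:
    """Stand-in for a single micro-batched MERGE into the destination table."""

consumer = KafkaConsumer(
    "cdc.public.orders",
    bootstrap_servers="kafka.internal:9092",  # placeholder
    group_id="artie-transfer",
    enable_auto_commit=False,  # only commit offsets after a successful merge
)

buffer: dict[bytes, bytes] = {}  # latest event per primary key (the Kafka message key)
last_flush = time.monotonic()

while True:
    for _, messages in consumer.poll(timeout_ms=1000).items():
        for msg in messages:
            buffer[msg.key] = msg.value  # later events for the same key overwrite earlier ones

    flush_due = time.monotonic() - last_flush >= FLUSH_INTERVAL_SECS
    if buffer and (flush_due or len(buffer) >= MAX_BUFFERED_ROWS):
        merge_into_warehouse("orders", buffer)
        consumer.commit()  # acknowledge only after the merge succeeds
        buffer.clear()
        last_flush = time.monotonic()
```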
To support this workload, Artie Transfer has the following features:
  1. Automatic retries & idempotency. We take reliability seriously; it's feature zero. Latency reduction is nice, but it doesn't matter if the data is wrong. We provide automatic retries and idempotency so that we always achieve eventual consistency.
  2. Automatic table creation. Transfer will create the table in the designated database if it doesn't exist.
  3. Error reporting. Provide your Sentry API key and errors from data processing will appear in your Sentry project.
  4. Schema detection. Transfer will automatically detect column changes and apply them to the destination (a simplified sketch of this idea follows this list).
  5. Scalable architecture. Transfer's architecture stays the same whether we're dealing with 1 GB or 100+ TB of data.
  6. Sub-minute latency. Transfer is built on a consumer framework and is constantly streaming messages in the background. Say goodbye to schedulers!
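To give a feel for what schema detection involves, one simplified way to think about it is diffing the columns seen in an incoming event against the destination table and generating ALTER TABLE statements for anything new. This is a hedged sketch of the general idea, not Transfer's internal logic, and the type mapping is deliberately crude.

```python
def missing_column_ddl(table: str, event_row: dict, destination_columns: set[str]) -> list[str]:
    """Return ALTER TABLE statements for columns present in the event but missing downstream."""
    type_map = {int: "BIGINT", float: "FLOAT", bool: "BOOLEAN", str: "VARCHAR"}  # crude mapping
    statements = []
    for column, value in event_row.items():
        if column not in destination_columns:
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {column} {type_map.get(type(value), 'VARCHAR')}"
            )
    return statements

# Example: the CDC event carries a new "loyalty_tier" column the warehouse hasn't seen yet.
print(missing_column_ddl("orders", {"id": 1, "loyalty_tier": "gold"}, {"id", "amount"}))
```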
In addition to this offering, we will also be providing a paid managed version that includes:
  1. Setting up your database to enable CDC and providing connectors to read and publish the change events onto Kafka
  2. Providing a managed version of Artie Transfer
Lastly, we believe that approaching zero replication lag between OLTP and OLAP databases should be the norm and that this should be widely accessible. As such, we have open-sourced Artie Transfer!
If you are interested in learning more, check out Artie Transfer's GitHub repo, schedule a demo here, or drop us a note at [email protected]! To stay updated on our progress, follow our LinkedIn and Twitter pages!