Change Data Capture Explained: How CDC Streams Database Changes in Real-Time
Change Data Capture (CDC) explained: real-time database replication, log-based streaming, push vs. pull methods, and Kafka CDC implementation patterns.

Your analytics dashboard shows yesterday's data. Your data warehouse is 6 hours behind. Your downstream systems make decisions on stale information.
Change Data Capture (CDC) solves this. It detects every INSERT, UPDATE, and DELETE in your source database and streams those changes to targets in real time.
How CDC Works: Transaction Log Monitoring
CDC monitors the transaction logs of your source database. Every major database maintains these logs for crash recovery (PostgreSQL's write-ahead log, MySQL's binlog, Oracle's redo log), recording every modification before it is applied to the data files. CDC software reads these logs and extracts the changes.
This log-based approach is non-invasive. No triggers on tables. No polling queries hitting production. The database already writes these logs; CDC just reads them.
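To make this concrete: PostgreSQL exposes its WAL through logical decoding, which you can query directly. The following is a minimal sketch, assuming a Postgres instance with wal_level=logical and the built-in test_decoding output plugin; the slot name, database, and credentials are placeholders.

```python
# Minimal sketch: reading change events out of PostgreSQL's write-ahead log
# via logical decoding. Assumes wal_level=logical and a user with the
# REPLICATION privilege; all names here are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=shop user=cdc_reader")  # hypothetical connection
conn.autocommit = True
cur = conn.cursor()

# Create a replication slot once; Postgres retains WAL from this point forward.
cur.execute("SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');")

# Each call drains the changes committed since the last call and advances the slot.
cur.execute("SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);")
for lsn, xid, data in cur.fetchall():
    # 'data' is a textual change description, e.g.:
    # table public.orders: UPDATE: id[integer]:42 status[text]:'paid'
    print(lsn, xid, data)
```

Production CDC tools do essentially this, but through each database's streaming replication protocol rather than repeated queries.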
Push vs. Pull: Two CDC Architectures
Pull-based CDC: The target system periodically reads the source's transaction logs to find changes. Simple to implement but introduces latency. The target must poll continuously, and changes only propagate at poll intervals.
Push-based CDC: The source database pushes changes to targets as they happen. Lower latency but requires infrastructure to receive and buffer changes. If the target goes offline, you lose changes unless you implement a queue.
The push model with a durable queue (like Kafka) is the standard for production systems. Changes stream continuously, targets can catch up if they fall behind, and you get at-least-once delivery out of the box, with exactly-once semantics achievable through Kafka transactions and idempotent consumers.
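As a sketch of the push side, here is what publishing a change event to a durable Kafka topic might look like with the kafka-python client; the topic name and event shape are illustrative, not a standard CDC format.

```python
# Minimal sketch of push-based CDC with a durable queue: change events land
# in a Kafka topic, so offline targets can replay from their last offset.
# Broker address, topic, and event shape are assumptions for illustration.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,  # key by primary key so a row's changes stay ordered
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # a change counts as delivered only once the brokers persist it
)

change_event = {
    "op": "u",  # update
    "table": "orders",
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "paid"},
    "ts_ms": 1700000000000,
}
producer.send("cdc.orders", key="order-42", value=change_event)
producer.flush()
```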
Why CDC Beats Traditional Replication
Traditional replication copies entire tables. CDC copies only what changed.
Selective replication: Replicate specific tables or columns, not the whole database. Copy only customer orders to the analytics warehouse while keeping PII in the source.
Multi-source aggregation: Stream changes from MySQL, PostgreSQL, and MongoDB into a single data warehouse. Each source's CDC feeds into Kafka, and consumers build unified views.
Audit trails: Every change is captured with timestamp and operation type. Compliance teams can reconstruct who changed what and when.
Four CDC Mechanisms
Row versioning: Each row has a version number that increments on every change. Version 35 means 35 modifications since creation. Simple but requires tracking the last-seen version per row.
Update timestamps: A last-modified timestamp column is overwritten on every change. Query for rows where timestamp > last_sync_time, as in the sketch after this list. Simpler than versioning, but it loses intermediate change history and misses deletes entirely, since a deleted row leaves nothing to timestamp.
Publish/subscribe queues: Changes push to a message queue. The source and target are decoupled. The queue buffers spikes and enables horizontal scaling of consumers.
Database log scanners: Install a scanner that reads the write-ahead log (WAL) or redo log. Changes trigger immediate replication. This is what Debezium, Oracle GoldenGate, and SQL Server CDC use internally.
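Here is the pull-style sketch promised above, using update timestamps. It runs standalone against SQLite; the table, columns, and poll interval are all assumptions.

```python
# Minimal sketch of timestamp-based CDC: poll for rows whose last-modified
# timestamp is newer than the last sync. Table and columns are made up.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'paid', ?)", (time.time(),))
conn.commit()

last_sync_time = 0.0
for _ in range(3):  # a real syncer would loop forever
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_sync_time,),
    ).fetchall()
    for row_id, status, updated_at in rows:
        print(f"changed: order {row_id} -> {status}")
        last_sync_time = max(last_sync_time, updated_at)
    time.sleep(5)  # changes propagate only at this poll interval
```

The sketch shows both weaknesses in miniature: a row updated twice between polls surfaces only once, and a deleted row never surfaces at all.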
CDC Implementation Options
Native database tooling handles CDC for single-vendor environments. SQL Server ships Change Data Capture as a built-in feature, and Oracle sells GoldenGate as a separately licensed replication product for its databases.
For heterogeneous environments, Debezium is the standard. It's an open-source CDC platform built on Kafka Connect that supports MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, and others. Debezium reads each database's transaction log format and produces standardized change events to Kafka topics.
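Registering a connector is a single REST call to Kafka Connect. Below is a minimal sketch using Debezium 2.x-style PostgreSQL config keys; the connector name, hosts, credentials, and topic prefix are placeholders.

```python
# Minimal sketch: registering a Debezium PostgreSQL connector with the
# Kafka Connect REST API. All names and hosts below are placeholders.
import requests  # pip install requests

connector = {
    "name": "orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",  # change events land on shop.<schema>.<table>
        "table.include.list": "public.orders",
    },
}

resp = requests.post("http://connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
```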
CDC Tradeoffs
Resource consumption: Reading transaction logs and streaming changes consumes CPU and network bandwidth on both source and target. High-volume OLTP systems may need dedicated CDC infrastructure.
Operational complexity: You're adding another distributed system to monitor. Schema changes in the source must propagate to CDC configurations. A failed CDC job means growing replication lag, and data gaps if the source purges its logs before the job recovers.
Initial snapshot: CDC captures changes, not existing data. Most CDC tools support initial snapshots, but snapshotting a large table while capturing concurrent changes requires careful coordination.
CDC with Kafka
Kafka is the natural fit for CDC. Debezium produces change events to Kafka topics, and from there consumers can populate Elasticsearch for search, feed a data warehouse for analytics, trigger microservice workflows, or maintain materialized views.
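A consumer then reads the change envelope and routes on the operation type. A minimal sketch with the kafka-python client follows; the topic, group id, and warehouse functions are hypothetical stand-ins, while the op codes (c, u, d, plus r for snapshot reads) follow Debezium's documented event format.

```python
# Minimal sketch: consuming Debezium change events and routing by operation.
# Topic, group id, and the warehouse functions are stand-ins for real targets.
import json
from kafka import KafkaConsumer  # pip install kafka-python

def upsert_into_warehouse(row):
    print("upsert:", row)  # stand-in for a real warehouse write

def delete_from_warehouse(row):
    print("delete:", row)

consumer = KafkaConsumer(
    "shop.public.orders",
    bootstrap_servers="localhost:9092",
    group_id="warehouse-loader",
    # Debezium emits null-valued tombstones after deletes; pass them through as None.
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    if message.value is None:
        continue  # skip tombstones
    payload = message.value.get("payload", message.value)  # schemas on or off
    op = payload["op"]  # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "u", "r"):
        upsert_into_warehouse(payload["after"])
    elif op == "d":
        delete_from_warehouse(payload["before"])
```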
Managing Kafka-based CDC pipelines requires visibility into topics, consumer lag, and message contents. Conduktor provides a UI for managing CDC topics, monitoring replication lag, and debugging change events across your Kafka ecosystem.
