Five Kafka Scaling Problems That Hit Enterprises Hardest

Schema registries won't save you. Real Kafka challenges at scale: zombie topics, unclear ownership, legacy integrations.

James White · April 15, 2025

Apache Kafka is now core infrastructure for data-driven organizations. But integrating it into large, diverse environments creates complexity that catches teams off guard.

I recently met with architects, product managers, and tech leads from five Conduktor customers across finance, retail, and logistics. Despite their differences, the same patterns emerged: unexpected costs, data quality gaps, governance concerns, and friction between people and technology.

Schema Registries Don't Guarantee Data Quality

Many teams equate data quality with schema enforcement, but issues often live within the data itself. At one European postal service, inconsistent formats and data types cause friction between producers and consumers. Schema enforcement ensures type correctness, not semantic correctness.
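The gap can be sketched in a few lines. This is a hypothetical illustration, not any customer's system: `SCHEMA`, the field names, and the business rules are all invented for the example. The point is that a record can satisfy every type constraint a schema registry enforces while still being nonsense to the business.

```python
# Illustrative only: a record that passes type validation but fails
# semantic validation. All names and rules here are invented.

SCHEMA = {"order_id": str, "amount": float, "currency": str}

def schema_valid(record: dict) -> bool:
    """Type-level check, roughly what schema enforcement gives you."""
    return (record.keys() == SCHEMA.keys()
            and all(isinstance(record[k], t) for k, t in SCHEMA.items()))

def semantically_valid(record: dict) -> bool:
    """Business rules a type schema cannot express."""
    return record["amount"] > 0 and record["currency"] in {"EUR", "USD", "GBP"}

msg = {"order_id": "A-1", "amount": -10.0, "currency": "XXX"}
print(schema_valid(msg))        # True: every field has the right type
print(semantically_valid(msg))  # False: negative amount, unknown currency
```

Semantic rules like these have to live somewhere explicit, whether in producers, a validation proxy, or consumer-side checks; the registry alone never sees them.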

Another team admitted they only notice data quality issues when they directly impact the business. Limited observability into the data itself means problems rarely surface through technical dashboards, which focus on system metrics rather than anomalies like distribution shifts or outliers.
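As a rough sketch of what data-level observability could look like, the check below flags a window of values whose mean drifts from a historical baseline. The threshold and the z-score approach are assumptions for illustration; real pipelines would use richer statistics, but even this crude test looks at the data itself rather than broker metrics.

```python
# Crude distribution-shift check: flag a recent window whose mean
# drifts more than z standard errors from a historical baseline.
# The threshold and method are illustrative assumptions.
from statistics import mean, stdev

def shifted(baseline: list[float], window: list[float], z: float = 3.0) -> bool:
    mu, sigma = mean(baseline), stdev(baseline)
    standard_error = sigma / len(window) ** 0.5
    return abs(mean(window) - mu) > z * standard_error

baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
print(shifted(baseline, [100.0, 101.0, 99.0]))   # False: consistent with history
print(shifted(baseline, [150.0, 155.0, 148.0]))  # True: distribution has moved
```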

These issues are organizational as much as technical. One attendee described "blurry borders" around ownership of data quality, schema governance, and retention. Software developers are nominally responsible for quality, but expectations lack consistency across the organization.

A major multinational retailer uses Kafka as a key component of their order system. Their data platform lead found that despite implementing schema registries in 2019, there was no validation of data within messages. Teams lacked the ability to monitor live streaming data in Kafka, a notoriously difficult problem for real-time observability.

This team only discovered issues when they surfaced in downstream applications and business KPIs. Without automated alerting, finding and fixing problems was slow: mean time to detection (MTTD) and mean time to resolution (MTTR) suffered, and the approach remained reactive.

Self-Service Data Access Takes Weeks, Not Minutes

Even when data is clean and reliable, teams struggle to find, understand, or use it, especially when operational and analytical layers are disconnected.

At one logistics service, requesting operational data to be persisted for data scientists and analysts takes several weeks. Analysts file tickets with the platform team, who set up Connectors and S3 buckets. The need to validate internal permissions to consume the data prevents full automation.

In contrast, another retailer built a one-click system to move Kafka data into their analytical estate. Data owners can land data into BigQuery through a fully automated process. Users browse and request data via catalogs, subject to owner approval.

This approach enabled nearly 700 connectors and synced nearly 2,000 topics. By making data landing opt-in and requiring owners to push only data needed for broader use, platform teams standardized the process while accelerating implementation. Clear ownership lines made persisted data easily available for consumers.

Flink Is Powerful, but Nobody Agrees Who Should Run It

Flink is powerful, but attendees were uncertain how to adopt it within a Platform-as-a-Service operating model.

One platform team at a large retailer considered running Flink as a centralized service. Application teams would use the Flink service for stream processing but not manage infrastructure. They rejected this model due to ambiguous ownership, unclear incident response responsibilities, and difficulty aligning platform capabilities with application team needs. They now stick with Kafka Streams, keeping logic inside the application domain.

Another attendee felt Flink could work for specific use cases if wholly owned by her team. Because her team would write, deploy, and operate the stream processing code, they could mitigate concerns about poorly implemented logic or processing failures. Her focus, though, was rapidly persisting data for analysts rather than stream processing.

Zombie Topics and Unchecked Retention Drive Up Costs

As Kafka usage grows, so do inefficiencies: misused partitions, excessive retention, and unused topics and schemas generating unnecessary expenses.

Kafka topics are partitioned to support scaling and parallel processing, but too many partitions for low-volume topics drive up storage and compute costs. One company set a limit of 10 partitions per self-service user, requiring manual override for additional partitions.
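A guardrail like that limit is simple to express. The sketch below is a minimal illustration, not the company's actual tooling; the function name and response strings are invented, and in practice the check would sit inside a self-service provisioning workflow.

```python
# Illustrative self-service guardrail: auto-approve topic requests
# within the partition limit, route larger ones to manual review.
MAX_SELF_SERVICE_PARTITIONS = 10  # the limit cited in the example

def review_topic_request(partitions: int, override_approved: bool = False) -> str:
    if partitions <= MAX_SELF_SERVICE_PARTITIONS:
        return "approved"
    return "approved-via-override" if override_approved else "needs-manual-review"

print(review_topic_request(6))                           # approved
print(review_topic_request(24))                          # needs-manual-review
print(review_topic_request(24, override_approved=True))  # approved-via-override
```

The design choice worth copying is the default: small requests flow through automatically, and only the expensive exceptions cost a human review.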

Excessively long retention policies are another common issue. In one example, data was retained for almost a decade because no one reviewed the original legacy configuration. Storage costs ballooned.
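A periodic audit catches configurations like that before the bill does. The sketch below assumes you have already fetched each topic's `retention.ms` (a real Kafka topic config) into a dict, for example via an admin client; the 30-day ceiling is an invented policy for illustration.

```python
# Illustrative retention audit: flag topics whose retention.ms
# exceeds a policy ceiling. Assumes configs were already fetched.
WEEK_MS = 7 * 24 * 3600 * 1000
DEFAULT_CEILING_MS = 30 * 24 * 3600 * 1000  # assumed 30-day policy

def audit_retention(topic_configs: dict[str, int],
                    max_ms: int = DEFAULT_CEILING_MS) -> list[str]:
    return [topic for topic, ms in topic_configs.items() if ms > max_ms]

configs = {
    "orders": WEEK_MS,                              # 7 days: within policy
    "legacy-events": 10 * 365 * 24 * 3600 * 1000,   # ~10 years: flagged
}
print(audit_retention(configs))  # ['legacy-events']
```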

Zombie topics and schemas see no traffic but continue to exist, most often in non-production or legacy environments. One team applied a seven-day cleanup policy in dev environments, automatically flagging unused assets for deletion. Combined with better forecasting and cost visualization tools, this encouraged better practices without blocking teams.
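The flagging step of such a policy can be sketched as a pure function. This is an assumption-laden illustration: it takes a pre-computed map of each topic's last observed activity (which in practice you would derive from broker metrics or consumer offsets) and applies the seven-day window from the example.

```python
# Illustrative zombie-topic check: flag topics idle longer than the
# cleanup window. last_activity would come from metrics in practice.
from datetime import datetime, timedelta, timezone

def flag_zombies(last_activity: dict[str, datetime],
                 max_idle_days: int = 7) -> list[str]:
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_idle_days)
    return sorted(t for t, seen in last_activity.items() if seen < cutoff)

now = datetime.now(timezone.utc)
activity = {
    "dev.orders": now - timedelta(hours=3),    # recently active
    "dev.scratch": now - timedelta(days=30),   # idle for a month
}
print(flag_zombies(activity))  # ['dev.scratch']
```

Flagging for deletion rather than deleting outright, as the team above did, keeps the policy safe: owners get a chance to object before anything disappears.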

Legacy Systems Make Migration Impossible, So Kafka Becomes the Bridge

Kafka is now more than a streaming layer. It's becoming the backbone connecting legacy systems, modern microservices, and external partners. This evolution brings new obstacles.

Many organizations still run legacy systems like MQ, MFT (Managed File Transfer), and SOAP/XML. Hidden dependencies make removing these older services risky.

Teams use Kafka as their integration solution instead. One bank, constrained by regulations, cannot fully migrate to the cloud. They created a single team to handle both MQ-based infrastructure and on-premise Kafka. Another organization replaced MQ with Kafka but still uses file- and REST-based sharing with smaller partners not ready for Kafka.

Sharing Kafka data with external partners adds complications. Organizations need to expose Kafka data externally without compromising security or adding operational overhead. One company supports 30+ external integrations via REST Proxy but acknowledges this won't scale due to security and governance concerns. They're exploring ways to modernize as demand grows for real-time data to power personalized experiences and AI-driven insights.

Many organizations are adopting OpenAPI and AsyncAPI to standardize specifications across teams and reduce vendor lock-in. These specifications provide machine-readable contracts for how services expose and consume data, making discovery, integration, and governance more transparent.
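A minimal AsyncAPI document gives a feel for what such a contract looks like. The sketch below uses AsyncAPI 2.6; the channel name and payload fields are invented for illustration.

```yaml
# Minimal illustrative AsyncAPI contract for a Kafka topic.
asyncapi: '2.6.0'
info:
  title: Order Events
  version: '1.0.0'
channels:
  orders.created:
    subscribe:
      message:
        contentType: application/json
        payload:
          type: object
          properties:
            orderId:
              type: string
            amount:
              type: number
```

Because the contract is machine-readable, catalogs and governance tooling can index it, and consumers can discover what a topic carries without reading producer source code.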

None of these challenges are unique. They're common to any organization scaling Kafka. But they demonstrate that not every problem is technical. Some stem from governance, culture, and processes.

Enterprises that succeed will centralize guardrails while enabling developer autonomy. To learn how Conduktor helps teams get there, sign up for a demo.