For the last 15 years, Kafka has been one of the most important pieces of infrastructure at LinkedIn. In fact, Kafka was originally created at LinkedIn to solve a problem that almost every large-scale distributed system faces: how do you reliably move massive amounts of data between services?

At LinkedIn, thousands of services constantly generate data. Some services publish data, while others consume it. These consumers don't just need the latest updates, they often need access to the entire history of data.

Why? Because software isn't perfect.

A consumer may contain a bug. After fixing that bug, engineers often need to replay historical data to verify that the fix works correctly. Without access to historical events, recovering from these situations becomes extremely difficult.

Kafka solved this problem by acting as a centralized log system that stored data in a durable, replayable, and fault-tolerant manner. Over time, it became the backbone of LinkedIn's infrastructure, supporting everything from user activity tracking and metrics collection to stream processing, AI workloads, data lake integrations, application messaging, and even database replication.

But eventually, LinkedIn reached a point where Kafka itself became difficult to scale.

And that is what led to Northguard.

In case you have no clue what Kafka is, please checkout our blog: What is Kafka?

When Success Becomes the Problem

Back in 2010, LinkedIn had around 90 million members. Today, the platform serves more than 1.2 billion members. That growth dramatically increased the amount of data flowing through Kafka.

LinkedIn's Kafka infrastructure eventually grew to:

More than 32 trillion records per day
Around 17 petabytes of data daily
400,000 topics
More than 10,000 machines
Over 150 clusters

At this scale, several challenges started becoming increasingly difficult to manage.

Scalability Challenges: Adding new use cases meant more than simply storing additional data. Every new workload introduced more traffic, more metadata, more brokers and more clusters. Over time, metadata management itself started becoming a bottleneck.
Operational Challenges: Running more than a hundred Kafka clusters isn't simple. Keeping load balanced across those clusters required additional operational systems and tooling. The infrastructure required to manage Kafka was becoming increasingly complex.
Availability and Consistency Trade-offs: Kafka uses partitions as its unit of replication. At LinkedIn's scale, this introduced trade-offs. When failures occurred, maintaining availability often meant sacrificing consistency because partitions were relatively heavyweight replication units.
Durability Limitations: For some of LinkedIn's most critical applications, Kafka's durability guarantees were no longer strong enough. The company needed stronger guarantees without sacrificing performance.

Defining the Requirements for a New System

LinkedIn wasn't looking for a small improvement.

They needed a system that could:

Scale data storage
Scale metadata management
Scale cluster size
Provide strong consistency
Maintain high availability
Deliver low latency
Support high throughput
Reduce operational overhead
Remain cost efficient
Work across different hardware environments

The result was Northguard.

Introducing Northguard

Northguard is LinkedIn's next-generation log storage system designed specifically around two goals: Scalability and Operability. To achieve this, Northguard introduces several fundamental architectural changes.

Instead of centralizing state, Northguard:

Shards data
Shards metadata
Minimizes global state
Uses decentralized membership management
Balances load automatically

At its core, Northguard runs as a cluster of brokers that communicate with clients and other brokers. To understand how it works, we need to look at its data model.

The Building Blocks of Northguard

Northguard organizes data into four main layers: Record → Segment → Range → Topic

Records (The Smallest Unit)

A record is the most granular piece of data in Northguard.

Each record contains:

A key
A value
User-defined headers

Everything is stored as bytes.

Segments (The Unit of Replication)

Multiple records are grouped into a segment. A segment is extremely important because it becomes Northguard's unit of replication.

Segments can be Active (accepting writes) and Sealed (immutable).

A segment is sealed when:

It reaches 1 GB
It remains active for over an hour
A replica failure occurs

Once sealed, it can no longer be modified.

Ranges (Northguard's Log Abstraction)

A range is a collection of segments that represents a portion of the keyspace. Instead of treating an entire log as one large object, Northguard divides logs into ranges.

Ranges can also be:

Active
Sealed

And they can evolve over time through splitting and merging.

Topics (Collections of Ranges)

A topic consists of multiple ranges that collectively cover the entire keyspace. As traffic grows, ranges can split. As traffic decreases, ranges can merge.

This allows topics to evolve dynamically instead of remaining fixed forever.

The Idea That Makes Northguard Different: Log Striping

One of the most important innovations in Northguard is something LinkedIn calls log striping. To understand why this matters, consider how replication works in traditional systems.

When large logs are replicated as single units:

Some brokers end up carrying significantly more load than others.
New brokers often remain underutilized.
Rebalancing requires moving large amounts of existing data.

This creates operational pain. Northguard solves this by breaking logs into smaller pieces.

Instead of replicating an entire log:

Individual segments have their own replica sets.
New segments can naturally be assigned to new brokers.
Load distributes itself over time.

As new segments are continuously created, the cluster automatically balances without needing large-scale data movement. This significantly improves operability.

Why LinkedIn Chose Ranges Instead of Partitions

At first glance, ranges may seem similar to Kafka partitions. But LinkedIn had specific reasons for choosing ranges.

They wanted:

Correct data placement
Minimal disruption during scaling
Ordering guarantees
Better support for stream processing

When a range splits, only producers writing to that specific range are affected. The rest of the system continues operating normally. This avoids the "stop-the-world" coordination that alternative approaches might require.

Ranges also provide useful ordering guarantees and naturally align across topics, making stream-processing joins much more efficient. In many cases, expensive shuffle operations can be avoided entirely.

Scaling Metadata Without Central Bottlenecks

Scaling data is only half the problem. Metadata can become a bottleneck too. Northguard addresses this through a distributed metadata architecture built around vnodes. Each vnode stores a shard of metadata and is backed by the Raft consensus protocol.

A vnode leader, called a coordinator, manages metadata for:

Topics
Ranges
Segments

The system uses a Dynamically-Sharded Replicated State Machine (DS-RSM) that distributes metadata using consistent hashing. This prevents hotspots and allows metadata to scale horizontally.

Decentralized Cluster Membership

Northguard also eliminates centralized membership management. Instead, it uses the SWIM protocol.

SWIM uses gossip-based communication to:

Detect failures
Propagate membership changes
Share cluster information

This allows cluster state to scale much more efficiently than centralized approaches.

Storage, Replication, and Testing

Northguard includes several additional improvements:

Write-ahead logging (WAL)
Direct I/O
RocksDB-based indexing
Streaming protocols for produce and consume operations
Self-healing replication

But perhaps the most interesting aspect is testing. LinkedIn runs Northguard under deterministic simulation. They simulate years of cluster activity every day while injecting failures such as:

Broker crashes
Network partitions
Disk corruption
Packet loss
Configuration deployments

This allows engineers to reproduce and debug failures long before they occur in production.

The Migration Problem

Building Northguard was only half the challenge. The harder problem was migration.

LinkedIn had:

Hundreds of clusters
Hundreds of thousands of topics
Thousands of applications

Downtime was not acceptable. And asking every application team to migrate manually was unrealistic. This led to the creation of Xinfra.

Introducing Xinfra

Xinfra is a virtualization layer that sits above both Kafka and Northguard. Its goal is simple: Provide a unified Pub/Sub experience regardless of the underlying infrastructure.

With Xinfra:

Topics are virtualized.
Topics are no longer tied to a single physical cluster.
Applications interact with a consistent API.

This abstraction allows LinkedIn to migrate topics between Kafka and Northguard without requiring application changes.

How Migration Works

Xinfra supports migration through topic epochs. A topic can have: One epoch in Kafka and Another epoch in Northguard.

During migration:

A new epoch is created in Northguard.
Producers begin dual-writing.
Consumers gradually migrate.
Rollback remains possible.
Dual writes are eventually disabled.

The entire process remains transparent to users. Applications continue operating throughout the migration.

Where Things Stand Today

Today, more than 90% of LinkedIn applications use Xinfra clients. LinkedIn has already migrated thousands of topics from Kafka to Northguard, representing trillions of records per day.

The company plans to continue expanding adoption while adding capabilities such as:

Automatic topic scaling
Improved fault tolerance
Enhanced virtualization features

Takeaway

Kafka served LinkedIn well for years, but its scale eventually exposed challenges around metadata management, availability, durability, and operations.
Northguard solves these problems through log striping, distributed metadata, self-healing replication, and decentralized cluster management, making the system easier to scale and operate.
Xinfra adds a virtualization layer that allows applications to work across both Kafka and Northguard, enabling seamless migrations without downtime.
The biggest lesson is that at massive scale, it's not just data that needs to scale metadata, operations, and migration strategies must scale too.

Official blog from LinkedIn: Introducing Northguard and Xinfra: scalable log storage at LinkedIn

By now, you must have had a clear idea of, How LinkedIn Built Northguard and Xinfra to Move Beyond Kafka? In a nutshell, As Kafka reached its limits at LinkedIn's massive scale, the company built Northguard to provide scalable, self-balancing, highly durable log storage and Xinfra to enable seamless migration and Pub/Sub virtualization. Together, they solve challenges around scalability, metadata management, availability, consistency, and operational complexity while supporting trillions of records daily.

Congratulations! You've just advanced another step in your tech journey. Keep progressing!