Core concepts

Storage Engine

In Kafka, when a topic is created, each partition is assigned to a broker acting as leader, with zero or more following brokers. Message durability is achieved by replicating data to the followers: in a durable configuration, when a client produces a message, an acknowledgment is sent only after the message has been replicated to the followers. If the leader or one of the followers fails, the affected topic partitions must be replicated again to a replacement broker to ensure that messages are not lost.
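For example, with the standard Java client a producer requests this behaviour with acks=all: the send does not complete until the in-sync replicas have the record. A minimal sketch; the broker address, topic, and key are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader acknowledges only once all in-sync replicas have
        // the record (in combination with the topic's min.insync.replicas),
        // trading latency for durability.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata metadata = producer
                    .send(new ProducerRecord<>("orders", "customer-42", "payload")) // hypothetical topic/key
                    .get(); // blocks until the replicated write is acknowledged
            System.out.printf("written to partition %d at offset %d%n",
                    metadata.partition(), metadata.offset());
        }
    }
}
```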

Ensuring an even distribution of load when adding brokers requires reassigning topic partitions to the new brokers. Similarly, when removing brokers, the affected topic partitions must be reassigned to the remaining brokers. These processes usually require planning to ensure that data is not lost and that the additional replication load does not impact operational performance, as sketched below.
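Since Kafka 2.4 such a reassignment can be driven programmatically through the Admin API. A minimal sketch, assuming a hypothetical topic and broker IDs:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            // Move partition 0 of "orders" onto brokers 4 and 5 (hypothetical IDs),
            // e.g. after adding those brokers to the cluster. The controller then
            // re-replicates the partition's data onto the new replicas, which is
            // exactly the background load that has to be planned for.
            admin.alterPartitionReassignments(Map.of(
                            new TopicPartition("orders", 0),
                            Optional.of(new NewPartitionReassignment(List.of(4, 5)))))
                    .all()
                    .get();
        }
    }
}
```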

A topic partition can become "hot" for a variety of reasons: 80% of the traffic might use only 20% of the partitions. This might be due to an insufficient number of partitions chosen at design time (what is the "right" number of partitions for tomorrow, in 6 months, or in 5 years?), or just plain bad luck with the partitioning algorithm: our two biggest customers hash to the same partition. With the leader/follower model, the leader can become a bottleneck. While this can be mitigated by fetching from followers, the load is still not evenly distributed over the whole cluster.
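The collision risk is easy to see from the default partitioning scheme for keyed records, which hashes the key with murmur2 and takes it modulo the partition count. A small sketch using Kafka's own hash utilities; the keys and partition count are hypothetical:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class HotPartitionDemo {
    // Kafka's scheme for keyed records: murmur2 hash of the key,
    // modulo the number of partitions in the topic.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 12; // hypothetical topic size
        // Two hypothetical big customers: if their keys land on the same
        // partition, that partition's leader carries both workloads.
        for (String key : new String[] {"customer-acme", "customer-globex"}) {
            System.out.printf("%s -> partition %d%n", key, partitionFor(key, numPartitions));
        }
    }
}
```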

It is also common for the underlying storage to be replicated already (e.g., RAID, or networked storage such as a SAN or NAS), so that Kafka is effectively duplicating that replication; many virtualized environments, for example, back each broker with shared replicated storage. In these cases, broker replication serves only to maintain availability when a broker is unavailable.

In highly available and durable cloud environments, it is typical for brokers to be distributed across Availability Zones, with each partition replicated from one zone to at least one other. The inter-zone network charges for this replication, together with the additional storage charges for the replicated data, can be prohibitive.
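To make this concrete with purely illustrative numbers: a topic ingesting 1 TB per day with two followers in other zones pushes 2 TB per day across zone boundaries. At a nominal $0.02 per GB of inter-zone transfer, that is roughly $40 per day, around $1,200 per month, in network charges alone, before the cost of storing every byte three times.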


In Tansu, storage is separated from the broker and is assumed to be durable:

In Tansu, brokers are stateless, with no replicated storage to manage. A cluster can be scaled up or down without any operational planning. Keeping data durable, responding to hardware failures, and scaling storage to demand are the responsibility of the storage provider. The result is simpler brokers and a clear separation of responsibilities.
