
Kafka Log Compaction

This tutorial demonstrates how log compaction works in Apache Kafka. I will use .NET and a local Docker image to run Kafka. The example shows how to store distributed settings in a Kafka topic, which is similar to how Kafka Connect stores its connector configuration (if I understood that correctly).

GitHub

GitHub: Kafka Log Compaction

Getting Started

There is a startup.ps1 script that starts a Kafka Docker image. It essentially runs a Docker Compose command with an environment variables file. You will need to fill in the values in the .env file.

From the root folder you can either run startup.ps1 or run this command to start the Kafka Docker image:

docker-compose -f docker/docker-compose.yml --env-file ./.env up -d

Create a Kafka Topic with Log Compaction

We will open a terminal inside the Docker container (docker exec) so that we can use the Kafka binaries to create a topic with log compaction enabled.

Note: Apache ZooKeeper is being removed from Kafka, so the --zookeeper flag on the administrative tools is deprecated.
KIP-604: Remove ZooKeeper Flags from the Administrative Tools

Create Topic
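The original post shows this step as a screenshot, so here is a hedged reconstruction of the create command. It assumes a Confluent-style image where the broker container is named kafka and the binary is kafka-topics (your image may use kafka-topics.sh), plus a single-broker setup, hence one partition and a replication factor of 1:

docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 \
  --create --topic app.settings --partitions 1 --replication-factor 1 \
  --config cleanup.policy=compact \
  --config delete.retention.ms=100 \
  --config segment.ms=100 \
  --config min.cleanable.dirty.ratio=0.01

The four --config overrides are explained below.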

Topic Configuration

I recommend reading Javier Holguera’s blog post “Kafka quirks: tombstones that refuse to disappear”; it helped me a lot. Note that these settings are not optimal for a topic that continuously receives high volume: configured like this, log compaction runs very frequently, which will hurt the topic’s performance.

Confluent: Topic Configuration

cleanup.policy=compact

This setting is rather straightforward. Setting the cleanup policy to “compact” enables log compaction. However, on its own it is not enough for an optimal configuration.

delete.retention.ms=100

The delete retention setting is the amount of time to retain tombstone markers on a log-compacted topic; once it elapses, the tombstones themselves become eligible for removal.

segment.ms=100

The segment setting controls the period of time after which Kafka will force the log to roll, even if the segment is not full, which makes the closed segment eligible for compaction.

min.cleanable.dirty.ratio=0.01

This setting controls how often the log compactor will attempt to clean the log: it is the minimum ratio of dirty (uncompacted) records to total records before a partition is cleaned, so 0.01 makes the compactor run almost constantly.
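To confirm that the overrides took effect, you can describe the topic configuration from inside the container (same assumptions about the container and binary names as above):

docker exec -it kafka kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name app.settings --describe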

.NET App

The .NET App will create and read distributed settings from the Kafka topic app.settings.

Consuming Data

On initial startup, the application reads the previously processed settings values from the Kafka topic.

The call var consumeResult = consumer.Consume(TimeSpan.FromSeconds(5)); passes a timeout of 5 seconds: the consumer polls for that long and, if there are no offsets to consume, returns null and the application moves on to its next task. This is common during the initial run.

I also set setting.LastProcessed = consumeResult.Message.Timestamp.UtcDateTime; so that, in theory, I could filter for the most recent setting based on the timestamp.
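Putting those pieces together, here is a minimal sketch of the consume loop using the Confluent.Kafka client. The Setting class, group id, and broker address are illustrative stand-ins, not the exact code from the repository:

using System;
using System.Collections.Generic;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",       // assumed broker address
    GroupId = "app-settings-reader",           // hypothetical group id
    AutoOffsetReset = AutoOffsetReset.Earliest // read the topic from the start
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("app.settings");

var settings = new Dictionary<string, Setting>();
while (true)
{
    // Returns null if nothing arrives within 5 seconds (normal on the initial run).
    var consumeResult = consumer.Consume(TimeSpan.FromSeconds(5));
    if (consumeResult == null) break;

    var setting = new Setting { Value = consumeResult.Message.Value };
    setting.LastProcessed = consumeResult.Message.Timestamp.UtcDateTime;
    settings[consumeResult.Message.Key] = setting; // last value per key wins
}
consumer.Close();

class Setting
{
    public string Value { get; set; }
    public DateTime LastProcessed { get; set; }
}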

Producing Data

As the application runs, it updates the last-processed time every 20 seconds. The expectation is that the topic will end up with a single record per unique key.

The key thing to point out here is that the GetProducer() method is typed as <string, string>, where the first type parameter is the key and the second is the value.
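For illustration, here is a minimal sketch of the producing side, again with Confluent.Kafka. GetProducer() is a stand-in for the repository’s helper, and the key name is hypothetical; the point is that reusing the same key is what lets compaction collapse the history down to the latest value:

using System;
using System.Threading.Tasks;
using Confluent.Kafka;

static IProducer<string, string> GetProducer() =>
    new ProducerBuilder<string, string>(
        new ProducerConfig { BootstrapServers = "localhost:9092" }).Build();

using var producer = GetProducer();

while (true)
{
    // Same key on every write, so compaction keeps only the newest record.
    await producer.ProduceAsync("app.settings", new Message<string, string>
    {
        Key = "last-processed",                 // hypothetical settings key
        Value = DateTime.UtcNow.ToString("O")
    });

    await Task.Delay(TimeSpan.FromSeconds(20)); // update every 20 seconds
}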

Verifying Compaction using Kafka Tool

We will use Kafka Tool to inspect the data in the Kafka log and make sure there are no duplicates, only a single value that keeps being updated.

You can see that log compaction happens periodically and will reduce records based on the “Key”. Unfortunately, I could not get it to compact down to a single record. That was rather disappointing.
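If you do not have Kafka Tool handy, a rough CLI alternative (same container-name assumption as before) is to read the topic from the beginning with the keys printed; after compaction runs, you should only see the latest value per key, plus whatever is still sitting in the active segment:

docker exec -it kafka kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic app.settings --from-beginning --property print.key=true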

BUG: Tombstones can Survive Forever

Kafka is still young and has a lot of growing to do. It appears there is a bug where tombstoned records can persist even after they have been marked for deletion and their retention period has passed.

It’s also important to know that tombstoned records come back to the consumer as records with NULL values.
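To make that concrete: a tombstone is just a record with an existing key and a NULL value, so the consumer needs to guard against it. A sketch, under the same assumptions as the snippets above:

// Producing a tombstone: same key, null value, which marks the key for deletion.
await producer.ProduceAsync("app.settings",
    new Message<string, string> { Key = "last-processed", Value = null });

// On the consumer side, treat a null value as a deletion rather than a setting.
if (consumeResult.Message.Value == null)
{
    settings.Remove(consumeResult.Message.Key);
}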

Apache.org: Tombstones can survive forever

Further Reading

Apache Kafka Documentation: https://kafka.apache.org/documentation/
Kafka quirks: tombstones that refuse to disappear
Confluent: Topic Configuration