A Few Practical Apache Kafka Tips
Apache Kafka is a conceptually simple and elegant messaging system. But, like most things in life, the devil lies in the details. There are quite a few things that might not be obvious and will bite you unless you are aware of them. Here is a list of things I have found useful while developing with Kafka and running a moderately sized cluster in production.
Null keys
Often this is a bug in the producer, but Kafka will happily chug along, distributing null-keyed records across partitions on a round-robin basis. If your consumers rely on keys for per-key ordering, you will likely end up with corrupted data. Make sure your consumers handle missing keys, at the very least by raising an exception, as in the sketch below.
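A minimal consumer-side guard, assuming String keys, an already-configured consumer, and a hypothetical process() handler:

```java
for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
    if (record.key() == null) {
        // Fail fast rather than silently processing a record with no ordering key
        throw new IllegalStateException("null key at " + record.topic()
                + "-" + record.partition() + "@" + record.offset());
    }
    process(record); // hypothetical application handler
}
```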
Consumer liveness
If your consumer starts to crash on a particular message (most likely a bug in application code), you will get stuck in a poll loop: the consumer picks up the same message, crashes before committing, and repeats indefinitely. Make sure consumers are robust by handling unexpected errors with a top-level try/catch/finally block, as in the sketch below.
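A sketch of a poll loop that won't crash-loop on a bad record; `running`, `consumer`, and the dead-letter handling are assumptions standing in for your application:

```java
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            try {
                process(record); // hypothetical application handler
            } catch (Exception e) {
                // Decide deliberately: skip, park on a dead-letter topic, or halt.
                // Anything is better than crashing and re-polling the same record forever.
                sendToDeadLetter(record, e); // hypothetical
            }
        }
    }
} finally {
    consumer.close(); // always release group membership cleanly
}
```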
Adding partitions dynamically (1)
This is tricky. Adding a partition dynamically on a running cluster might break the (one key, one partition) guarantee. This happens because the partition count is used to map a key’s hash onto a partition, so once the count changes the same key lands on a different partition. Make sure to ‘freeze’ your publishers and drain your consumers while this is being done.
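To see why, here is a small demo that mirrors the default partitioner’s logic for non-null keys (murmur2 hash modulo partition count); the key itself is made up:

```java
import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class PartitionMappingDemo {
    public static void main(String[] args) {
        byte[] key = "order-42".getBytes(StandardCharsets.UTF_8);
        int hash = Utils.toPositive(Utils.murmur2(key));
        // The same key maps to a different partition once the count changes:
        System.out.println("with 8 partitions:  " + hash % 8);
        System.out.println("with 12 partitions: " + hash % 12);
    }
}
```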
Adding partitions dynamically (2)
Just because you added a new partition doesn’t mean old data will automagically be redistributed to it. Only new records go to the new partition, which means the same key might be present in two partitions after partitions are added. Again, your consumers will need to deal with this scenario.
Hot ‘keys’
If one of your keys runs hot, it will start to saturate the corresponding consumer. Since a consumer can serve multiple partitions/keys, this might ‘starve’ the other keys. One strategy is to use a custom partitioner to send that key to a specific partition and run a beefy consumer against it, as sketched below.
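A sketch of such a partitioner; the hot key is assumed to be known up front, the class and field names are made up, and at least two partitions are assumed:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;
import java.util.Map;

public class HotKeyPartitioner implements Partitioner {
    private static final String HOT_KEY = "hot-customer-id"; // assumption: known in advance

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (HOT_KEY.equals(key)) {
            return numPartitions - 1; // dedicated partition, served by a beefy consumer
        }
        // Everyone else hashes into the remaining partitions (null keys not handled here)
        return Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}
```

Register it with the partitioner.class producer config.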
Auto Commit
- Auto commit happens inside poll() and close() only. If you exit the poll loop and don’t call consumer.close(), there won’t be a final commit. This has obvious implications: duplicate messages and growing consumer lag.
- If you strictly follow the (read, process, poll) pattern in one thread, auto commit is fine. Remember that auto commit doesn’t happen in the background but in the same thread calling poll(), so as long as you stick to this serial (read, process, poll) pattern, you are fine with auto commit (see the next point, though).
- You’d still need to handle duplicate messages. Assume the consumer crashes after reading and processing a record: since the offset wasn’t committed, the message will be redelivered on the next poll. In general, with Kafka, you should work to make all your consumers idempotent (i.e. handle duplicates gracefully).
- I prefer manual commits since they are more explicit. Even with manual commits there is a possibility of duplicate messages. On the other hand, committing too early might mean dropped messages, which is never acceptable in financial applications. The only choice in such cases is to design your application to be idempotent and tolerate duplicates; a minimal manual-commit loop is sketched below.
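A minimal sketch, assuming auto commit is disabled, the consumer is already configured and subscribed, and process() is your (idempotent) handler:

```java
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // may be redelivered after a crash, hence idempotent
    }
    consumer.commitSync(); // commit only after the whole batch is processed
}
```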
Heartbeats
Heartbeats are sent from a background thread in the consumer. Your processing thread might be stuck while heartbeats keep flowing, so the group coordinator still sees the consumer as alive; it only gets kicked out once max.poll.interval.ms elapses without a call to poll(), while session.timeout.ms catches hard crashes. Don’t assume a consumer or producer is healthy simply because it’s sending heartbeats! Actually monitor each process individually for liveness.
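The two timeouts sit side by side in the consumer config; the values below are illustrative:

```java
Properties props = new Properties();
// session.timeout.ms: how long the coordinator waits for heartbeats
// (sent from the background thread) before declaring the consumer dead
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
// max.poll.interval.ms: the maximum gap between poll() calls before the
// consumer is kicked out of the group -- this catches a stuck processing loop
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
```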
Assignments
We used to be able to call consumer.poll(0) to retrieve assignments. This doesn’t work with newer clients, where poll(Duration.ZERO) may return before an assignment has been received. Instead, use a ConsumerRebalanceListener to keep an up-to-date copy of assignments, as sketched below.
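A sketch that keeps a live view of assignments; the topic name is illustrative and the consumer is assumed to be configured already:

```java
Set<TopicPartition> assigned = ConcurrentHashMap.newKeySet();

consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        assigned.addAll(partitions); // called after a rebalance completes
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        assigned.removeAll(partitions); // called before partitions are taken away
    }
});
```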
Commit(A)Sync
I prefer to use commitSync as much as possible. Unless you’re hard-pressed for every nanosecond, it is better than the added complexity of async commits. Remember that commitAsync does not retry failures, so you are on the hook for handling them in a callback, making life even more difficult.
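If you do reach for commitAsync, a common compromise is async commits in the loop plus one final synchronous commit on shutdown. A sketch, assuming a configured consumer, a `running` flag, and a hypothetical process() handler:

```java
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        process(records); // hypothetical batch handler
        consumer.commitAsync((offsets, exception) -> {
            if (exception != null) {
                // commitAsync does NOT retry -- at minimum, log and alert
                System.err.println("commit failed for " + offsets + ": " + exception);
            }
        });
    }
} finally {
    try {
        consumer.commitSync(); // final commit, retried until it succeeds or fails hard
    } finally {
        consumer.close();
    }
}
```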
Concurrent Consumers
Kafka wants you to scale horizontally by adding more partitions and more consumer processes. If you want to process multiple records concurrently within one process, understand that the Kafka consumer is not thread safe. Follow these guidelines for multithreaded consumption (a sketch follows the list):
- Disable auto commit.
- Group your records (output of poll) by partition.
- Submit each partition group as one task to your thread pool, so its records are processed serially by one thread.
- Pause the corresponding partition by calling consumer.pause(). This means the next poll will not bring back records from that partition.
- Once a partition group is finished, explicitly commit the offsets for that partition, then call consumer.resume().
- Make sure to provide rebalance listeners and ‘drain’ your thread pool before proceeding with reassignments.
- Don’t forget to drain the pool and close() the consumer if you are exiting the poll loop.
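A minimal sketch of the pause/process/resume pattern, assuming auto commit is disabled, the consumer is already subscribed, and process() stands in for your per-record logic (rebalance and shutdown handling omitted for brevity):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.*;

public class PausedPartitionWorker {
    static void run(KafkaConsumer<String, String> consumer, ExecutorService pool) throws Exception {
        Map<TopicPartition, Future<Long>> inFlight = new HashMap<>();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));

            // One task per partition, so records within a partition stay serial
            for (TopicPartition tp : records.partitions()) {
                List<ConsumerRecord<String, String>> batch = records.records(tp);
                consumer.pause(Collections.singleton(tp)); // stop fetching until finished
                inFlight.put(tp, pool.submit(() -> {
                    for (ConsumerRecord<String, String> r : batch) process(r);
                    return batch.get(batch.size() - 1).offset();
                }));
            }

            // Reap finished partitions: commit their offsets, then resume fetching
            Iterator<Map.Entry<TopicPartition, Future<Long>>> it = inFlight.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<TopicPartition, Future<Long>> e = it.next();
                if (e.getValue().isDone()) {
                    long lastOffset = e.getValue().get(); // rethrows task failures
                    consumer.commitSync(Collections.singletonMap(
                            e.getKey(), new OffsetAndMetadata(lastOffset + 1)));
                    consumer.resume(Collections.singleton(e.getKey()));
                    it.remove();
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // hypothetical application logic
    }
}
```

Note that pause(), resume(), and commitSync() are all called from the single polling thread; only the record processing is handed to the pool.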
Strong consistency
Strong consistency is possible as long as your transaction spans only Kafka topics. You can’t have transactionally safe writes across topics and your database (or file system, or any other non-Kafka resource).
So it’s quite possible to commit a processed message to an output topic yet not have it persist in your database, leading to inconsistencies between the two. There is simply no way around this, and there are no simple, magical solutions. Your only recourse is to design your application to tolerate such occurrences.
There really are only two possibilities. Depending on the order of offset commits, you will either lose messages (i.e. never get them into your database) or receive duplicates. There is no ‘recovering’ from message loss: you can’t recover what you can’t see. The only realistic option is to accept duplicate messages and design your application to handle them.
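Within Kafka itself, though, transactions do give you atomic multi-topic writes. A minimal sketch; the topic names and transactional id are made up, and note that a database write placed inside the transaction would not be covered:

```java
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class TransactionalWriteDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "key-1", "order payload"));
                producer.send(new ProducerRecord<>("audit", "key-1", "audit entry"));
                producer.commitTransaction(); // both records become visible, or neither
            } catch (Exception e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```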
You can follow these guidelines to detect duplicates:
- For RDBMSes, create a unique index on a ‘key’ column and let duplicate insertions fail against it (see the sketch after this list).
- If that becomes too slow, cache keys in memory. Look into building a consistent in-memory cache of keys using Debezium. If cache size becomes a bottleneck, you might need to build Bloom filters and feed them via Debezium.
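A sketch of the unique-index approach over JDBC; the table and column names are made up, with event_key carrying a unique index:

```java
import java.sql.*;

public class DedupInsert {
    // Returns true the first time a key is seen; false on a duplicate delivery
    static boolean insertOnce(Connection conn, String key, String payload) throws SQLException {
        String sql = "INSERT INTO events (event_key, payload) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, key);
            ps.setString(2, payload);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException dup) {
            return false; // the unique index rejected the duplicate -- safe to skip
        }
    }
}
```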
Monitor Continuously
- Don’t underestimate the utility of a simple end-to-end test: a synthetic message making it from a publisher all the way to a consumer.
- Under-replicated partitions are the bare-minimum metric to watch. This is easy to lose track of, since the system keeps running even with under-replicated partitions, so it has to be monitored proactively (brokers expose it via JMX as kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions), much like CPU and disk approaching capacity. It might also mean a broker is down.
- At a minimum, monitor that all broker processes are running. Again, a dead broker might go undetected unless you have alerts configured.
- Watch out for unclean leader election. This happens when a replica that is not fully caught up with the leader gets elected leader anyway, because it’s the best Kafka has at that instant. When this happens, you will lose data. It’s best to disable unclean leader election.
- Watch for split-brain controllers. Due to a network partition, two nodes might each take on the controller role. A Kafka cluster is designed to have only one controller; more than one leads to all kinds of unexpected behavior. The only way to deal with this situation is to restart both controllers (or simply delete the controller node in ZooKeeper and let the cluster elect a new one).
- As always, watch for disks getting full. An overworked CPU probably means brokers are overloaded due to unbalanced load.
- Consumer lag. Increasing consumer lag could mean your consumers are either overloaded or simply stuck. Since adding a new partition is not trivial, you’d be forced to scale your consumers vertically (more threads) and/or throttle your publishers if possible.
- Audit messages by keeping simple counts over a time window: how many messages were sent between times t0 and t1, compared with the number consumed in the same window. This can be done by implementing your own producer and consumer interceptors, as sketched below.
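A sketch of the producer side (registered via the interceptor.classes producer config); a matching ConsumerInterceptor would count the consuming side, and how the counts are exported and compared is left open:

```java
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class CountingProducerInterceptor implements ProducerInterceptor<String, String> {
    private static final AtomicLong sent = new AtomicLong();

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        sent.incrementAndGet(); // count every message handed to the producer
        return record;          // pass the record through unchanged
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {}

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```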
Kafka is conceptually simple, but like any distributed computing system it is quite complex in its implementation. Unfortunately, many of the implementation complexities get pushed into application code, since it’s not possible to provide a one-size-fits-all solution (short of a centralized, RDBMS-based one).
Hope you find this list useful!