Learning Kafka with a simple cluster over Docker
While installing Kafka on your machine is not very difficult, getting a cluster of nodes ready to play with, is somewhat involved. If you don't want to spend money on AWS or GCP for your learning experiments, here are few simple steps to get a cluster up and running on your Linux laptop or desktop. All you need is a desktop with Ubuntu and at least 32GB RAM.
This setup tries to mimic a real cluster with multiple servers.The instructions are for an Ubuntu host but should be generally applicable to other Linux variants.
It creates three separate docker containers on the same host. The containers are on a separate docker network - most likely that is the case for a real world cluster.
- Install docker engine on your ubuntu host.
- git clone this project:
git clone https://github.com/praveenray/kafka-docker-cluster.git
- Download and save jdk11 tar.gz file in the data directory:
curl -o data/jdk11.tar.gz “https://download.java.net/java/GA/jdk11/13/GPL/openjdk-11.0.1_linux-x64_bin.tar.gz"
- Download and save confluent platform tar.gz file the data directory :
curl -o data/confluent.tar.gz “http://packages.confluent.io/archive/6.1/confluent-6.1.1.tar.gz"
- Build your docker image which embeds JDK and confluent platform downloaded in last step:
docker build -t ubuntu-kafka .
- Run three node cluster:
BROKER_COUNT=3 ./run.sh
This will create a kafka cluster of 3 docker nodes called server-0, server-1 and server-2 on a network called my-kafka-cluster-net. Examine if docker servers are running:
docker ps
- Create a new topic:
docker exec server-1 /opt/software/confluent/bin/kafka-topics --bootstrap-server server-1:9092 --create --topic my-first-topic
- See topic details:
docker exec server-1 /opt/software/confluent/bin/kafka-topics --bootstrap-server server-1:9092 --describe --topic my-first-topic
- Shutdown cluster:
docker kill $(docker ps -a -q) && docker network rm my-kafka-cluster-net
Listeners and Advertised Listeners
Kafka distributes your data across many nodes. Nodes are kept 'in-sync' by an underlying software called Apache Zookeeper. Writing data to the cluster is done by 'kafka producer' library and reading from the cluster is done by ‘Kafka Consumer Library’. Your (java) code must utilize these two libraries to interact with the Cluster. There are libraries available for other languages.
Each node must be configured with two parameters called listeners and advertised.listeners (among others).
You can find excellent information on listeners and advertised listeners here.
Port numbers and host names in listener configurations can get confusing. In order to understand these configs, remember two basic but non-obvious facts:
- Everything starts with port number. If you contact a broker node on port 9000, the advertised listener which is configured for port 9000 (on that node) is returned to you.
- The Kafka Cluster is not completely self managed. Broker/Leader information is exposed to the clients. This is different from other clusters you might be familiar with where the client can send it's requests to ANY of the nodes and distribution of data is taken care of without client knowing anything about rebalancing, leadership status etc.
This implies your client needs to be aware of how many nodes, Leadership status of a node, Port numbers on which to reach each node. Although, these are taken care of by Kafka Client library, the fact remains that this information is exposed to the client. This means each node's advertised port must be reachable through firewall/load-balancers etc
For example, let's assume following setup:
- Broker1: (server-1:9091,server-1:9091), (server-1:10091, external-name-broker-1.com:10091)
- Broker2: (server-2:9091,server-2:9091), (server-2:10092, external-name-broker-2.com:10092)
- Broker3: (server-3:9091,server-3:9091),(server-3:10093, external-name-broker-3.com:10093)
So, two pairs are configured for each host. First Element in a pair is telling the broker to listen on a particular port. Second is the corresponding advertised listener.
This configuration will be used as follows.
A client comes calling to broker1 at port number 9091. Broker1 responds with second item in Pair1 - server-1:9091. All further correspondence between this client and broker1 must use (server-1:9091). This means client must be able to make a connection with the name server-1 at port 9091.
Another client comes calling to broker1 at port number 10091. Broker1 responds with second item in Pair2 : external-name-broker-1.com:10091. All further correspondence between this client and broker1 must use (external-name-broker-1.com:10091). This means client must be able to make a connection with the hostname external-name-broker-1.com at port 10091. This is probably the case with Clients outside of the Kafka LAN.
This also implies that LoadBalancer/Firewall between client and the broker node must allow traffic through port 10091.
Since a client is allowed to connect to any node in the cluster, ALL externally exposed ports must be allowed through the Firewall. In this example, the firewall must allow ports 10091, 10092, 10093.
Other useful fun facts :
- Kafka Topics have partition Leaders. ALL writes for a topic partition go through same Leader node. Leader is the single point of failure and yes, it might become a hot spot if you are unlucky and not watching over your cluster. It's possible for a single node to end up a designated leader for multiple partitions even though other nodes are available to fulfill this role See Leader Skew
- In the same vein, due to unlucky turn of events (node failures and node restarts), it's possible for an uneven partition distribution where a Broker ends up with more than average number of partitions See Broker Skew
- Kafka cluster has a controller Node which has certain operations going through it (and only through it). Another potential single point of failure and hotspot candidate
Good news being, these single points of failures are not clients' responsibilities. Fortunately Kafka (and underlying zookeeper) watch over failing Leaders and work to elect new leaders without any manual intervention. Although, mortals like us must watch for Leader Skews and Broker Skews.
How data is distributed
Data distribution is achieved via breaking a topic into multiple partitions and distributing incoming messages across partitions by keys. As long as your keys are evenly distributed, your data will be evenly distributed across brokers. Once again, the responsibility has been pushed up to the clients.
How data Consumers are distributed
One consumer process is assigned to each partition - parallelizing the processing of data. There are lots of nuances regarding what to do when a process dies but in effect, it boils down to watching for death of a consumer (and birth of new consumers) and dynamically reassigning partitions to newly born consumers or remaining consumers in case of a consumer failure.
This is done by the controller node automatically - however the Kafka Admins must watch over for excessive rebalances.