The idea is to replicate messages from Apache Kafka to Azure Event Hubs using Kafka’s MirrorMaker.
Apache Kafka has become the most popular open-source streaming and messaging tool. Many organizations run it on premises or in a public cloud, and many are content with Kafka's performance and hesitant to migrate to a Kafka-like service in the cloud. One such service in Azure is Event Hubs, a "simple, secure and scalable real-time data ingestion" service.
LinkedIn developed Kafka and donated it to the open-source community in 2011. Microsoft cannot deny the popularity Kafka has gained and therefore offers the possibility to use the Kafka API to work against Event Hubs. This allows Kafka developers to keep using the Kafka APIs without disruption; only the configuration changes.
MirrorMaker is run on the consumer side (Kafka) in this case. However, it is advised to run MirrorMaker on the producer side, which in this setup would be a server in Azure.
For the sake of simplicity, five Kafka topics are defined: prod.test[1-5]. The Kafka and MirrorMaker configurations are standard.
Consumer side: Kafka
First, a working Kafka is needed.
In the GitHub repository, cloud and local alternatives are available:
- provisioned in AWS using Terraform, which creates a Kafka cluster. The README.md files give more details about provisioning Kafka in AWS.
- created as a Docker container. This gives you a Kafka service suitable for development and testing on your local computer. Keep in mind that Kafka needs to be started manually inside the container by executing the script /opt/startall.sh.
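For the Docker alternative, the manual start described above might look like the following. The container name kafka is an assumption; use whatever name the repository's instructions give.

```shell
# Open a shell inside the running container (container name is assumed)
docker exec -it kafka /bin/bash

# Inside the container: start ZooKeeper and Kafka
/opt/startall.sh
```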
If Kafka in a Docker container is used, the server.properties file is very standard:
broker.id=0
log.dirs=/var/logs/kafka-logs
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
listeners=PLAINTEXT://localhost:9092
advertised.listeners=PLAINTEXT://localhost:9092
offsets.topic.replication.factor=1
advertised.host.name=127.0.0.1
With the above configuration, the value for the bootstrap-server parameter is localhost:9092. Alternatively, 127.0.0.1 can be used, since the advertised.host.name parameter is defined in the file.
Once either Kafka alternative is up and running, test messages can be produced.
Message producer: Scala
Kafka should be up and running; the DNS name of the Kafka server(s), or localhost, is the input parameter when initializing an instance of the class.
The Scala script is executed from a client and produces numerous messages that are randomly assigned to the topics mentioned above. The client can be any server with a Scala installation, since DNS names are used to communicate with Kafka; all that is needed is network access to Kafka's DNS names.
The script produces messages to one of the topics at random. Ten is the default number of messages produced, but this can be adjusted since it is an input parameter of the method.
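The repository contains the actual implementation; as a rough sketch (assuming the kafka-clients library on the classpath and simplified error handling), the class could look like this. The constructor signature and method name match the calls used below, but everything else is illustrative.

```scala
import java.util.Properties
import scala.util.Random
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Sketch of a random-message producer; the real RandomMessage.scala in the
// repository may differ in details.
class RandomMessage(brokerDns: String, topicBase: String) {

  private val props = new Properties()
  props.put("bootstrap.servers", s"$brokerDns:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  // Sends `count` messages, each to a randomly chosen topic prod.<topicBase>[1-5]
  def CreateMessages(count: Int = 10): Unit = {
    val producer = new KafkaProducer[String, String](props)
    try {
      for (i <- 1 to count) {
        val topic = s"prod.$topicBase${Random.nextInt(5) + 1}"
        val value = s"This is message number $i"
        producer.send(new ProducerRecord[String, String](topic, value))
        println(s"Topic: $topic. Message: $value")
      }
    } finally {
      producer.close() // flushes any pending messages
    }
  }
}
```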
I am generating messages using the Scala CLI. Step into the folder where RandomMessage.scala is located and start Scala, then execute the following commands:
:load RandomMessage.scala
val rm = new RandomMessage("localhost", "test") // second parameter is the topic name, which gets prefix prod. and suffix [1-5]
rm.CreateMessages(10000) // 10 is the default parameter
Output should be something like the following:
Topic: prod.test1. Message: This is message number 9701
Topic: prod.test1. Message: This is message number 9702
Topic: prod.test2. Message: This is message number 9703
Topic: prod.test3. Message: This is message number 9704
Topic: prod.test1. Message: This is message number 9705
Topic: prod.test4. Message: This is message number 9706
Topic: prod.test4. Message: This is message number 9707
Topic: prod.test5. Message: This is message number 9708
Topic: prod.test1. Message: This is message number 9709
Topic: prod.test4. Message: This is message number 9710
This should create 5 topics in Kafka and add some messages to the topics.
As soon as the script has executed, it is possible to check in Kafka whether the topics were created:
$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server 127.0.0.1:9092 --list
should return a list of topics available:
Reading from one topic:
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic prod.test3 --from-beginning
should return lines like these:
This is message number 6207
This is message number 6215
This is message number 6217
This is message number 6219
This is message number 6221
The topics and the messages are now “sitting” in Kafka. The next step is to get them into Azure Event Hubs.
Producer side: Event Hubs
Kafka's MirrorMaker replicates the messages from Kafka to Event Hubs. This post does not go into the details of MirrorMaker configuration; it just prepares the configuration files to produce a working example.
In this repository, Terraform is used to provision Event Hubs. If you already have an existing Event Hubs namespace, that works too; just make sure it has no event hubs whose names match the topics used in this proof of concept.
There are two files needed for MirrorMaker to work: one for consumer and one for producer side (for better illustration, check the graphic on top).
Below is an example of consumer.config file for Kafka running locally:
bootstrap.servers=localhost:9092
group.id=example-mirrormaker-group
exclude.internal.topics=true
client.id=mirror_maker_consumer
partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
auto.offset.reset=earliest
The last parameter makes MirrorMaker fetch all messages from Kafka, starting from the earliest offset. This is convenient for testing, because you do not need a live producer to Kafka to see that it works. It does, however, follow the offset rules like any other Kafka consumer.
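Because MirrorMaker consumes through a regular consumer group (example-mirrormaker-group in the config above), its progress can be inspected like that of any other group. A quick way to check, once MirrorMaker has run:

```shell
# Show committed offsets and lag for the MirrorMaker consumer group
$KAFKA_HOME/bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group example-mirrormaker-group
```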
To configure producer.config, what is needed from Event Hubs is the Connection string–primary key, which can be found in the settings of the Event Hubs namespace, under Shared access policies. Clicking on a policy opens the connection strings on the right side. Copy the string. Below is an example of the file.
Replace NAMESPACE with the unique namespace of your choice. Replace CONNECTION_STRING_PRIMARY_KEY with the string from Event Hubs.
bootstrap.servers=NAMESPACE.servicebus.windows.net:9093
client.id=mirror_maker_producer
sasl.mechanism=PLAIN
security.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="CONNECTION_STRING_PRIMARY_KEY";
Now that the configuration files are in place, the MirrorMaker can be started:
$KAFKA_HOME/bin/kafka-mirror-maker.sh \
  --consumer.config $KAFKA_HOME/config/consumer.config \
  --num.streams 1 \
  --producer.config $KAFKA_HOME/config/producer.config \
  --whitelist="prod.*"
The last parameter defines which topics should be replicated to Event Hubs. The Scala example above generated 10,000 messages to Kafka, so 10,000 messages are also expected in Event Hubs.
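One way to verify this without the Azure portal is to read back from Event Hubs with the Kafka console consumer, reusing the same SASL settings as in producer.config. Here, client.properties is an assumed file containing the sasl.mechanism, security.protocol and sasl.jaas.config lines shown above, and NAMESPACE is the same placeholder:

```shell
# Consume a replicated topic directly from Event Hubs via its Kafka endpoint
$KAFKA_HOME/bin/kafka-console-consumer.sh \
  --bootstrap-server NAMESPACE.servicebus.windows.net:9093 \
  --consumer.config client.properties \
  --topic prod.test1 --from-beginning
```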
In the Azure portal, it is obvious that all messages have been consumed by Event Hubs (the blue line in the metrics chart). The chart can be interpreted as: all messages were consumed within the same minute.
Note: the green colour represents messages that were captured, i.e. saved to some storage available in Azure. This is out of scope for this post.
The messages are now in Event Hubs. How long they remain available there is governed by the retention parameter. In the repository above, the value is set to one day.
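In Terraform terms, retention is the message_retention argument on the event hub resource. A minimal sketch, assuming the azurerm provider; the resource and namespace names here are illustrative, not necessarily those used in the repository:

```hcl
# Illustrative event hub with one-day retention; names are assumptions
resource "azurerm_eventhub" "prod_test1" {
  name                = "prod.test1"
  namespace_name      = azurerm_eventhub_namespace.example.name
  resource_group_name = azurerm_resource_group.example.name
  partition_count     = 2
  message_retention   = 1 # days
}
```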
It would make sense to save them to permanent storage so that they can be used for analysis. This is covered in a separate blog post.