Streaming messages from Kafka to EventHub with MirrorMaker

Idea

The idea is to replicate messages from Apache Kafka to Azure Event Hubs using Kafka’s MirrorMaker.

Introduction

Apache Kafka has become the most popular open-source streaming and messaging tool. Many organizations have implemented it on premises or in a public cloud, and many are content with Kafka’s performance and are hesitant to migrate to a Kafka-like service in the cloud. One such service in Azure is Event Hubs, a “simple, secure and scalable real-time data ingestion” service.

LinkedIn developed Kafka and donated it to the community in 2011. Microsoft cannot deny the popularity Kafka has gained and therefore offers the possibility of using the Kafka API to work against Event Hubs. This allows Kafka developers to continue using the Kafka APIs without disruption; only the configuration changes.

Architecture

MirrorMaker is run on the consumer side – Kafka, in this case. However, it is advised to run MirrorMaker on the producer side, which on this occasion would be a server in Azure.

For the sake of simplicity, 5 Kafka topics are defined: prod.test[1-5]. Kafka and MirrorMaker configurations are standard.

Consumer side: Kafka

First, a working Kafka is needed.

In the GitHub repository, both cloud and local alternatives are available:

  • Provisioned in AWS using Terraform. This creates a Kafka cluster; the README.md files give more details about provisioning Kafka in AWS.
  • Created using a Docker container. This gives a Kafka service suitable for development and testing on a local computer. Keep in mind that Kafka needs to be started manually inside the container by executing the script /opt/startall.sh, as sketched right after this list.
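
For the Docker alternative, starting Kafka inside the running container could look like the following (the container name kafka is an assumption; use the name the container was actually given):

docker exec -it kafka /opt/startall.sh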

Configuring server.properties

If Kafka in a Docker container is used, the server.properties file is very standard:

broker.id=0
log.dirs=/var/logs/kafka-logs
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
listeners=PLAINTEXT://localhost:9092
advertised.listeners=PLAINTEXT://localhost:9092
offsets.topic.replication.factor=1
advertised.host.name=127.0.0.1

Using the above configuration, the value for the parameter bootstrap-server is localhost:9092. Alternatively, 127.0.0.1:9092 can be used, since the parameter advertised.host.name is defined in the file.
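
As a quick sanity check (an addition to the original setup), the broker can be asked for its supported API versions on that address:

$KAFKA_HOME/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092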

Once either of the Kafka alternatives is up and running, test messages can be produced.

Message producer: Scala

Kafka should be up and running, and the DNS name of the Kafka server(s), or localhost, is the input parameter when initializing an instance of the class.

The Scala script is executed from the client and produces numerous messages that are randomly assigned to the topics mentioned above. The client can be any server with a Scala installation, since DNS names are used to communicate with Kafka; all that is needed is to be able to reach Kafka’s DNS names.

The script produces messages to one of the topics at random. Ten is the default number of messages produced, but this can be adjusted since it is the method’s input parameter.
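
The actual RandomMessage.scala lives in the repository; below is a minimal sketch of what such a class could look like, assuming it simply wraps a KafkaProducer with string keys and values (constructor arguments and message format follow the usage and output shown below):

import java.util.Properties
import scala.util.Random
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical reconstruction of RandomMessage.scala; the version in the repository is authoritative.
class RandomMessage(brokerDns: String, topicName: String) {

  private val props = new Properties()
  props.put("bootstrap.servers", s"$brokerDns:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  // Produce `count` numbered messages (default 10), each sent to topic prod.<topicName>[1-5] at random
  def CreateMessages(count: Int = 10): Unit = {
    for (i <- 1 to count) {
      val topic = s"prod.$topicName${Random.nextInt(5) + 1}"
      val message = s"This is message number $i"
      producer.send(new ProducerRecord[String, String](topic, message))
      println(s"Topic: $topic. Message: $message")
    }
    producer.flush()
  }
}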

I am generating messages from the Scala REPL. Step into the folder where RandomMessage.scala is located, start Scala, and execute the following commands:

:load RandomMessage.scala
val rm = new RandomMessage("localhost", "test") // second parameter is the topic name, which gets prefix prod. and suffix [1-5]
rm.CreateMessages(10000) // default is 10 messages

Output should be something like the following:

Topic: prod.test1. Message: This is message number 9701
Topic: prod.test1. Message: This is message number 9702
Topic: prod.test2. Message: This is message number 9703
Topic: prod.test3. Message: This is message number 9704
Topic: prod.test1. Message: This is message number 9705
Topic: prod.test4. Message: This is message number 9706
Topic: prod.test4. Message: This is message number 9707
Topic: prod.test5. Message: This is message number 9708
Topic: prod.test1. Message: This is message number 9709
Topic: prod.test4. Message: This is message number 9710

This should create 5 topics in Kafka and add some messages to the topics.

As soon as the script has executed, it is possible to check in Kafka whether the topics were created:

$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server 127.0.0.1:9092 --list

should return a list of topics available:

__consumer_offsets
prod.test1
prod.test2
prod.test3
prod.test4
prod.test5

Reading from one topic:

$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic prod.test3 --from-beginning

should return lines like these:

This is message number 6207
This is message number 6215
This is message number 6217
This is message number 6219
This is message number 6221

The topics and the messages are now “sitting” in Kafka. The next step is to get them into Azure Event Hubs.

Producer side: Event Hubs

Kafka’s MirrorMaker replicates the messages from Kafka to Event Hubs. This post does not go into the details of MirrorMaker configuration; it just prepares the configuration files needed to produce a working example.

In this repository, Terraform is used to provision Event Hubs. If you already have an existing Event Hubs namespace, that works too; just make sure it does not contain event hubs whose names match the topics used in this proof of concept.

Configuration files

There are two files needed for MirrorMaker to work: one for the consumer side and one for the producer side (for better illustration, check the graphic at the top).

Below is an example of the consumer.config file for Kafka running locally:

bootstrap.servers=localhost:9092
group.id=example-mirrormaker-group
exclude.internal.topics=true
client.id=mirror_maker_consumer
partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
auto.offset.reset=earliest

The last parameter makes the consumer fetch all messages from the beginning of each Kafka topic. This is good for testing purposes because you avoid having to run a live producer against Kafka to see that replication works. It does, however, follow the offset rules like any other Kafka consumer.
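
Once MirrorMaker is running (it is started further below), replication progress can also be followed on the Kafka side by describing the consumer group defined by group.id above; this check is an addition to the original walkthrough:

$KAFKA_HOME/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group example-mirrormaker-group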

To configure producer.config, what is needed from Event Hubs is the Connection string–primary key, which can be found in the settings of the Event Hubs namespace, under Shared access policies. Clicking on the policy opens the connection strings on the right side. Copy the string. Below is an example of the file.

Replace NAMESPACE with the unique namespace of your choice. Replace CONNECTION_STRING_PRIMARY_KEY with the string from Event Hubs.

bootstrap.servers=NAMESPACE.servicebus.windows.net:9093
client.id=mirror_maker_producer
sasl.mechanism=PLAIN
security.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="CONNECTION_STRING_PRIMARY_KEY";

Now that the configuration files are in place, MirrorMaker can be started:

$KAFKA_HOME/bin/kafka-mirror-maker.sh --consumer.config $KAFKA_HOME/config/consumer.config --num.streams 1 --producer.config $KAFKA_HOME/config/producer.config --whitelist="prod.*"

The last parameter defines which topics should be replicated to Event Hubs. The Scala example above generated 10,000 messages to Kafka, so 10,000 messages are expected to arrive in Event Hubs as well.
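
To double-check from a client (an addition to the original walkthrough), the replicated messages can be read back through the Event Hubs Kafka endpoint using the standard console consumer. Assuming a small client config file, here named eventhubs-client.config, that reuses the SASL settings from producer.config:

security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="CONNECTION_STRING_PRIMARY_KEY";

the console consumer can then be pointed at the namespace:

$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server NAMESPACE.servicebus.windows.net:9093 --topic prod.test1 --from-beginning --consumer.config $KAFKA_HOME/config/eventhubs-client.config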

In Azure, it is obvious that all messages have been consumed by Event Hubs – the blue line in the chart. The chart can be interpreted as all messages having been consumed within the same minute.

Note: the green colour represents messages that were captured – saved to some storage available in Azure. This is out of scope for this post.

The messages are now in Event Hubs. How long they remain available there depends on the retention parameter. In the repository above, the value is set to one – one day.

Next step

The messages are in Event Hubs now. It would make sense to save them to permanent storage so that they can be used for analysis. This is covered in a separate blog post.