The idea is to replicate messages from Apache Kafka to Azure Event Hubs using Kafka’s MirrorMaker.
Apache Kafka has become the most popular open-source streaming and messaging tool. Many organizations have implemented it on premises or in a public cloud, are content with Kafka's performance, and are hesitant to migrate to a Kafka-like service in the cloud. One such service in Azure is Event Hubs, a "simple, secure and scalable real-time data ingestion" service.
LinkedIn developed Kafka and donated it to the community in 2011. Microsoft cannot deny the popularity Kafka has gained, and therefore offers the possibility of using the Kafka API to work against Event Hubs. This allows Kafka developers to keep using the Kafka APIs without disruption; only the configuration changes.
MirrorMaker is run on the consumer side, which is Kafka in this case. However, it is advised to run MirrorMaker on the producer side, which on this occasion would be a server in Azure.
For the sake of simplicity, 5 Kafka topics are defined: prod.test[1-5]. Kafka and MirrorMaker configurations are standard.
Consumer side: Kafka
First, a working Kafka is needed. In the GitHub repository, cloud and local alternatives are available:
- provisioned in AWS using Terraform. This way you can create a Kafka cluster. The README.md files give more details about provisioning Kafka in AWS.
- created as a Docker container. This gives you a Kafka service suitable for development and testing on your local computer. Keep in mind that Kafka needs to be started manually inside the container by executing the script /opt/startall.sh.
If the Kafka Docker container is used, the server.properties file is very standard:
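A minimal sketch of such a file might look like the following; the exact values in the container may differ, and the log and ZooKeeper settings below are my assumptions:

```properties
# Minimal single-broker setup (sketch, not the exact file from the container)
broker.id=0
listeners=PLAINTEXT://:9092
advertised.host.name=127.0.0.1
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181
```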
Using the above configuration, the value for the bootstrap-server parameter is localhost:9092. Alternatively, 127.0.0.1 can be used, since the advertised.host.name parameter is defined in the server.properties file.
Once either of the Kafka alternatives is up and running, test messages can be produced.
Message producer: Scala
Kafka should now be up and running; the DNS name of the Kafka server(s), or localhost, is the input parameter when initializing an instance of the class.
The Scala script is executed from a client and produces numerous messages that are randomly assigned to the topics mentioned above. The client can be any server with a Scala installation, since DNS names are used to communicate with Kafka; all you need is to be able to reach the Kafka servers' DNS names.
The script produces messages to one of the topics at random. Ten is the default number of messages produced, but this can be adjusted, since it is the method's input parameter.
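The class itself lives in the repository as RandomMessage.scala. Purely as an illustration, a simplified sketch of its core logic (random topic selection and message numbering) could look like the code below; the constructor parameter names are my assumption, and the real class wraps this logic around a KafkaProducer that sends each message to the broker:

```scala
import scala.util.Random

// Illustrative sketch only: the repository version produces each
// (topic, message) pair to Kafka via a KafkaProducer.
class RandomMessage(brokers: String, topicBase: String) {
  private val rnd = new Random()

  // Pick one of the five topics at random: prod.<topicBase>1 .. prod.<topicBase>5
  private def randomTopic(): String = s"prod.$topicBase${rnd.nextInt(5) + 1}"

  // Build n (topic, message) pairs; ten is the default
  def CreateMessages(n: Int = 10): Seq[(String, String)] =
    (1 to n).map(i => (randomTopic(), s"This is message number $i"))
}
```

With this sketch, new RandomMessage("localhost", "test").CreateMessages(10000) yields 10,000 pairs spread randomly over prod.test1 to prod.test5.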
I am generating messages using the Scala CLI. Step into the folder where RandomMessage.scala is located, start Scala, and execute the following commands:
val rm = new RandomMessage("localhost", "test") //second parameter is name of topic with prefix prod. and suffix [1-5]
rm.CreateMessages(10000) //the default parameter value is 10
Output should be something like the following:
Topic: prod.test1. Message: This is message number 9701
Topic: prod.test1. Message: This is message number 9702
Topic: prod.test2. Message: This is message number 9703
Topic: prod.test3. Message: This is message number 9704
Topic: prod.test1. Message: This is message number 9705
Topic: prod.test4. Message: This is message number 9706
Topic: prod.test4. Message: This is message number 9707
Topic: prod.test5. Message: This is message number 9708
Topic: prod.test1. Message: This is message number 9709
Topic: prod.test4. Message: This is message number 9710
This should create 5 topics in Kafka and add some messages to the topics.
As soon as the script has executed, it is possible to check in Kafka whether the topics were created:
$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server 127.0.0.1:9092 --list
should return the list of topics available, including the five created by the script:
prod.test1
prod.test2
prod.test3
prod.test4
prod.test5
Reading from one topic:
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic prod.test3 --from-beginning
should return lines like these:
This is message number 6207
This is message number 6215
This is message number 6217
This is message number 6219
This is message number 6221
The topics and the messages are now "sitting" in Kafka. The next step is to get them into Azure Event Hubs.
Producer side: Event Hubs
Kafka's MirrorMaker replicates the messages from Kafka to Event Hubs. This post does not go into the details of configuring MirrorMaker; it just prepares the configuration files needed to produce a working example.
In this repository, Terraform is used to provision Event Hubs. If you already have an existing Event Hubs namespace, that works too; just make sure it has no topics matching the names of the topics used in this proof of concept.
There are two files needed for MirrorMaker to work: one for the consumer side and one for the producer side (for better illustration, check the graphic on top).
Below is an example of the consumer.config file for Kafka running locally:
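A minimal consumer.config along these lines works against a local broker; the group.id value here is a placeholder of my choosing:

```properties
bootstrap.servers=localhost:9092
exclude.internal.topics=true
group.id=mirror-maker-group
auto.offset.reset=earliest
```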
The last parameter will fetch all messages from Kafka. This is good for testing purposes, because you avoid needing a live producer to Kafka in order to see that it works. It does, however, follow the offset rules like any other Kafka consumer.
To configure producer.config, what is needed from Event Hubs is the Connection string–primary key, which can be found in the settings of the Event Hubs namespace, under Shared access policies. Clicking on a policy opens the connection strings on the right side. Copy the string. Below is an example of the file (the first three lines are the standard Event Hubs Kafka-endpoint settings). Replace NAMESPACE with the unique namespace of your choice and CONNECTION_STRING_PRIMARY_KEY with the string copied from Event Hubs.
bootstrap.servers=NAMESPACE.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="CONNECTION_STRING_PRIMARY_KEY";
Now that the configuration files are in place, MirrorMaker can be started:
$KAFKA_HOME/bin/kafka-mirror-maker.sh --consumer.config $KAFKA_HOME/config/consumer.config --num.streams 1 --producer.config $KAFKA_HOME/config/producer.config --whitelist="prod.*"
The last parameter defines which topics should be replicated to Event Hubs. The Scala example above generated 10,000 messages to Kafka, so the same 10,000 messages are expected to arrive in Event Hubs.
In Azure, it is obvious that all messages have been consumed by Event Hubs (the blue line in the chart). The chart can be interpreted as: all messages were consumed within the same minute. The green colour presents messages that were captured, i.e. saved to some storage available in Azure; this is out of scope for this post.
The messages are now in Event Hubs. How long they remain available there is up to the retention parameter. In the repository above, the value is set to one, meaning one day.
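In the Terraform provisioning, the retention is controlled by the message_retention argument of the azurerm_eventhub resource. A fragment along these lines could set it (the resource and reference names here are illustrative, not taken from the repository):

```terraform
resource "azurerm_eventhub" "prod_test1" {
  name                = "prod.test1"
  namespace_name      = azurerm_eventhub_namespace.this.name
  resource_group_name = azurerm_resource_group.this.name
  partition_count     = 1
  message_retention   = 1 # days the messages remain available
}
```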
It would make sense to save the messages to permanent storage so that they can be used for analysis. This is covered in this blog post.