Streaming with Storm – simple example with HDFS bolt

This post describes a simple Storm topology – random words are written to HDFS. The topology is uploaded on the cluster from the client node. Nimbus is on the cluster’s NameNode. I have 4 DataNodes and on each of them a Supervisor is installed. More on how I installed and configured Storm can be found here.

Services used

I am using Hortonworks 2.4, Hadoop is version 2.7.1, Storm is version 0.10.0. All services were installed through Ambari.

Preparing development environment

Create a new maven project. How to install maven is explained here.

mvn archetype:generate -DgroupId=org.package -DartifactId=storm-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

When the project is created, step into the directory (in this case it is storm-project) where the pom.xml file is also located.

In the org.package (./src/main/java/org.package), create folder spout. The App.java can be deleted.

There are 3 files important for this topology: pom.xml, the spout file and the topology file.

Prepare pom.xml

The pom file for this case includes Storm dependencies, with scope provided. Storm jars are not packed together with the topology! It is important to match the versions.

maven-shade-plugin

Add build node with the plugin

    <build>
        <sourceDirectory>src/</sourceDirectory>
        <resources>
            <resource>
                <directory>${basedir}</directory>
                <includes>
                    <include>*</include>
                </includes>
            </resource>
        </resources>
        <outputDirectory>classes/</outputDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.4</version>
                <configuration>
                    <createDependencyReducedPom>true</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

clojure

Add clojure in the dependencies node. Be sure to check for newer version

<dependency>
    <groupId>org.clojure</groupId>
    <artifactId>clojure</artifactId>
    <version>1.8.0</version>
</dependency>

storm-core

Make sure the version matches Storm installation

<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>0.10.0</version>
    <!-- keep storm out of the jar-with-dependencies -->
    <scope>provided</scope>
</dependency>

hadoop-client

Hadoop client XML node. Make sure the version matches your Hadoop installation. org.slf4j is omitted otherwise messages about multiple version of the package are appearing

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-client</artifactId>
	<version>2.7.1</version>
	<exclusions>
		<exclusion>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
		</exclusion>
	</exclusions>
</dependency>

hadoop-hdfs

Hadoop hdfs XML node. Make sure the version matches your Hadoop installation. org.slf4j is again omitted

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-hdfs</artifactId>
	<version>2.7.1</version>
	<exclusions>
		<exclusion>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
		</exclusion>
	</exclusions>
</dependency>

storm-hdfs

<dependency>
	<groupId>org.apache.storm</groupId>
	<artifactId>storm-hdfs</artifactId>
	<version>0.10.1</version>
</dependency>

Now that the pom.xml is in order, you can package the project to see if pom.xml is valid

mvn package

Build success should appear. If not, the pom.xml is invalid and should be taken care of.

Click on the next page for Spout.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s