The environment for this project build was: Ubuntu 14.04 on an AWS EC2 instance, sbt version 0.13.13 (how to install it) and Apache Spark 2.0.1 in local mode (the same procedure has also been done, and worked, on a Hortonworks Hadoop cluster with Spark 2.0).
The Scala example file creates a SparkSession (if you are using an Apache Spark version older than 2.0, check how to create all the contexts needed to run the example, or upgrade to Spark 2.0!), reads a CSV file into a DataFrame and prints the DataFrame to the command line.
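For readers stuck on Spark 1.x, the setup would roughly look like the sketch below: before SparkSession existed, a SparkContext and an SQLContext were created separately. This is only an illustration (the app name mirrors the example further down), and on 1.x you would additionally need the external spark-csv package to read CSV files.

```scala
// Spark 1.x sketch: create the contexts manually instead of a SparkSession.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ne {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Scala-Northern-E")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // On 1.x, CSV reading needs the spark-csv package, e.g.:
    // sqlContext.read.format("com.databricks.spark.csv").load(path)
  }
}
```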
Create a new project folder and step into it
mkdir scala-ne
cd scala-ne
Create a data folder, step into it and create a test data file
mkdir data
cd data
vi n-europe.csv
1,Oslo,Norway
2,Stockholm,Sweden
3,Helsinki,Finland
4,Copenhagen,Denmark
5,Reykjavik,Iceland
Save the data file and exit vi.
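Since the file has no header row, Spark will auto-name the columns _c0, _c1 and _c2 when it reads them. The row structure itself is simple, as this small plain-Scala illustration shows (the sample line is taken from the file above):

```scala
// Each data line holds three comma-separated fields:
// an id, a city and a country.
val line = "1,Oslo,Norway"
val fields = line.split(",")
println(fields.mkString(" | "))  // prints: 1 | Oslo | Norway
```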
Create the Scala file
vi spark-ne.scala
Copy the following code into the Scala file (make sure the path to the CSV file is valid). If you are doing this on a node that is part of an HDFS cluster, be sure to add file:// at the beginning of the file path string.
import org.apache.spark.sql.SparkSession

object ne {
  def main(args: Array[String]) {
    val fil = "/SPARK-NE_PROJECT/data/n-europe.csv"
    val spark = SparkSession
      .builder
      .appName("Scala-Northern-E")
      .getOrCreate()
    val neDF = spark.read.csv(fil)
    neDF.show()
  }
}
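If you would rather see named columns in the output instead of the default _c0, _c1, _c2, the read can be extended with toDF. This is a sketch of a drop-in replacement for the two read/show lines inside main; the names id, city and country are my own choice for illustration, not part of the original example.

```scala
// Rename the default columns of the headerless CSV read.
val neDF = spark.read.csv(fil).toDF("id", "city", "country")
neDF.show()
```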
Create the build file build.sbt
vi build.sbt
And type the following lines (make sure to leave an empty line between each line). Adjust the Scala and spark-sql versions accordingly!
name := "Spark-ne"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
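Since the jar will be run with spark-submit, the Spark libraries are already present at runtime, so the dependency can optionally be marked "provided" to keep Spark classes out of any assembled artifact. This is a common sbt convention, not something this example requires:

```scala
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
```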
Run the following command to build the project
sbt package
The last three lines of the output
[info] Packaging /user/marko/scala-ne/target/scala-2.11/spark-ne_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 69 s, completed Jan 7, 2017 8:22:38 PM
Running the sbt package command for the first time takes longer because the dependency jar files have to be downloaded. Maven users should feel at home here.
If you make changes to the Scala file and run the command again, it takes less time. Below is an example of a subsequent build call
[info] Packaging /user/marko/scala-ne/target/scala-2.11/spark-ne_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 8 s, completed Jan 7, 2017 8:29:12 PM
The code can now be submitted to Spark. Adjust the path to the JAR file accordingly.
$SPARK_HOME/bin/spark-submit --class ne /user/marko/scala-ne/target/scala-2.11/spark-ne_2.11-1.0.jar
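The submit command can also state the master explicitly. The variant below is a sketch: local[2] means local mode with two worker threads and is just one possible choice.

```shell
# Run in local mode with 2 worker threads; adjust the jar path as before.
$SPARK_HOME/bin/spark-submit \
  --class ne \
  --master "local[2]" \
  /user/marko/scala-ne/target/scala-2.11/spark-ne_2.11-1.0.jar
```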
If everything went as it should, the table of Northern European cities and countries appears in the output.