Apache Spark and Swarm S3 Installation

The following guide is for Ubuntu 16.04 LTS with Apache Spark 2.3.0.

Important

This guide outlines the steps for a standalone demo or test configuration. For production, set up at minimum a three-node cluster: one master and two workers.
  1. First, update all of the OS packages. 

    apt-get update
    apt-get upgrade
  2. Download the latest version of Spark, prebuilt for Hadoop, from Apache.
    Download spark-2.3.0-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html
  3. Java should already be installed on your Ubuntu OS, but you also need Scala. Install it, then extract the Spark archive and move it into place:

    apt-get install scala
    
    tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
    mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
  4. Add the following to ~/.bashrc so that the Spark binaries can be found on the PATH:

    export SPARK_HOME=/usr/local/spark
    export PATH=$SPARK_HOME/bin:$PATH
  5. Run source ~/.bashrc to load the changes, or log out and log back in.
  6. To start the Spark Scala shell with the hadoop-aws package (which provides the S3A connector), run the following:

    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.5
  7. You will then see the following:

    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
    
    Type in expressions to have them evaluated.
    
    Type :help for more information.
    
    scala>
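
    As an optional sanity check, you can verify that the hadoop-aws package (which provides the S3A filesystem) resolved onto the classpath; the call throws a ClassNotFoundException if it did not:

    scala> // sanity check: resolves only if hadoop-aws (S3A) is on the classpath
    scala> Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")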
  8. Using the Content UI, create a domain such as "s3.demo.example.local" and a bucket called "sparktest".
  9. Open the sparktest bucket and upload a test text file called castor_license.txt.
  10. Create an S3 token for this domain and pass the resulting access key and secret to Spark, as follows:

    scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
    scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
    scala> sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")
  11. To test, run a command that reads a file from Swarm via S3 and counts the lines in it:

    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    
    res5: Long = 39

    Note

    s3a URI syntax is s3a://<BUCKET>/<FOLDER>/<FILE> or s3a://<BUCKET>/<FILE>
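
    To inspect the contents of the object rather than just count its lines, the same RDD API can be used; for example, a quick preview of the first few lines (any small number works):

    scala> sc.textFile("s3a://sparktest/castor_license.txt").take(5).foreach(println)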

  12. These three commands run the word-count application against the text file in S3 and write the results of the reduce operation back to S3 (a standalone-application sketch of the same job follows these steps).

    scala> val sonnets = sc.textFile("s3a://sparktest/castor_license.txt")
    sonnets: org.apache.spark.rdd.RDD[String] = s3a://sparktest/castor_license.txt MapPartitionsRDD[3] at textFile at <console>:24
    
    scala> val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:25
    
    scala> counts.saveAsTextFile("s3a://sparktest/output.txt")
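
    Note that saveAsTextFile writes a set of part-* files (plus an empty _SUCCESS marker) under the output.txt prefix rather than a single object. To confirm the round trip, you can read the results back from Swarm and preview a few of the (word, count) pairs; a minimal check:

    scala> sc.textFile("s3a://sparktest/output.txt").take(10).foreach(println)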
  13. Browse to see the Spark UI jobs and stages (see UI examples below): 

    http://<IP>:4040/jobs/
    http://<IP>:4040/stages/stage/?id=2&attempt=0
  14. It also works with SSL (even self-signed certificates): 

    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true");
    
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.demo.sales.local")
    
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    res8: Long = 39
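
To run the same word count outside the interactive shell (for example with spark-submit), the logic above translates directly into a small standalone application. The following is a sketch only, reusing the example credentials, endpoint, and bucket from this guide; the class name WordCountS3 and the output prefix wordcount-output are placeholders:

    import org.apache.spark.sql.SparkSession

    object WordCountS3 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCountS3").getOrCreate()
        val sc = spark.sparkContext

        // Same S3A settings as used in the spark-shell steps above.
        sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
        sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
        sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
        sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
        sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")

        // Word count against the test object in the sparktest bucket.
        val counts = sc.textFile("s3a://sparktest/castor_license.txt")
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Write the (word, count) pairs back to Swarm via S3.
        counts.saveAsTextFile("s3a://sparktest/wordcount-output")

        spark.stop()
      }
    }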

Spark Web UI Examples
