Apache Spark and Swarm S3 Installation

The following guide is for Ubuntu 16.04 LTS with Apache Spark 2.3.0.

Important

This guide outlines the steps for a standalone demo or test configuration. For production, set up at minimum a three-node cluster: one master and two workers.
  1. First, update all of the OS packages. 

    apt-get update
    apt-get upgrade
  2. Download the latest version of Spark, prebuilt for Hadoop, from Apache.
    Download spark-2.3.0-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html
  3. Java should already be installed on your Ubuntu OS, but you also need Scala. Install it, then extract the Spark archive and move it into place:

    apt-get install scala
    
    tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
    mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
  4. Add the following to ~/.bashrc so that the Spark binaries can be found on the PATH:

    export SPARK_HOME=/usr/local/spark
    export PATH=$SPARK_HOME/bin:$PATH
  5. Run source ~/.bashrc to load the changes, or log out and log back in.
  6. To start the Spark Scala shell with the hadoop-aws package (which provides the S3A connector), run the following:

    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.5
  7. You will then see the following:

    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
    
    Type in expressions to have them evaluated.
    
    Type :help for more information.
    
    scala>
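
    As an optional sanity check, you can verify that the hadoop-aws package (which provides the S3A filesystem) resolved onto the classpath; the call throws a ClassNotFoundException if it did not:

    scala> // sanity check: resolves only if hadoop-aws (S3A) is on the classpath
    scala> Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")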
  8. Using the Content UI, create a domain such as "s3.demo.example.local" and a bucket called "sparktest".
  9. Open the sparktest bucket and upload a test text file called castor_license.txt.
  10. Create an S3 token for this domain and pass the resulting access key and secret to Spark, as follows:

    scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
    scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
    scala> sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")
  11. To test, run a command that reads a file from Swarm via S3 and counts the lines in it:

    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    
    res5: Long = 39

    Note

    s3a URI syntax is s3a://<BUCKET>/<FOLDER>/<FILE> or s3a://<BUCKET>/<FILE>
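
    To inspect the contents of the object rather than just count its lines, the same RDD API can be used; for example, a quick preview of the first few lines (any small number works):

    scala> sc.textFile("s3a://sparktest/castor_license.txt").take(5).foreach(println)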

  12. These three commands run the word-count application against the text file in S3 and write the results of the reduce operation back to S3 (a standalone-application sketch of the same job follows these steps).

    scala> val sonnets = sc.textFile("s3a://sparktest/castor_license.txt")
    sonnets: org.apache.spark.rdd.RDD[String] = s3a://sparktest/castor_license.txt MapPartitionsRDD[3] at textFile at <console>:24
    
    scala> val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:25
    
    scala> counts.saveAsTextFile("s3a://sparktest/output.txt")
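
    Note that saveAsTextFile writes a set of part-* files (plus an empty _SUCCESS marker) under the output.txt prefix rather than a single object. To confirm the round trip, you can read the results back from Swarm and preview a few of the (word, count) pairs; a minimal check:

    scala> sc.textFile("s3a://sparktest/output.txt").take(10).foreach(println)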
  13. Browse to see the Spark UI jobs and stages (see UI examples below): 

    http://<IP>:4040/jobs/
    http://<IP>:4040/stages/stage/?id=2&attempt=0
  14. It also works with SSL (even self-signed certificates): 

    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true");
    
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.demo.sales.local")
    
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    res8: Long = 39
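
To run the same word count outside the interactive shell (for example with spark-submit), the logic above translates directly into a small standalone application. The following is a sketch only, reusing the example credentials, endpoint, and bucket from this guide; the class name WordCountS3 and the output prefix wordcount-output are placeholders:

    import org.apache.spark.sql.SparkSession

    object WordCountS3 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCountS3").getOrCreate()
        val sc = spark.sparkContext

        // Same S3A settings as used in the spark-shell steps above.
        sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
        sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
        sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
        sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
        sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")

        // Word count against the test object in the sparktest bucket.
        val counts = sc.textFile("s3a://sparktest/castor_license.txt")
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Write the (word, count) pairs back to Swarm via S3.
        counts.saveAsTextFile("s3a://sparktest/wordcount-output")

        spark.stop()
      }
    }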

Spark Web UI Examples
