spark-deployer alternatives and similar packages
Based on the "Big Data" category.
Alternatively, view spark-deployer alternatives based on common mentions on social networks and blogs.
- Deeplearning4J: Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
- Reactive-kafka: Alpakka Kafka connector. Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
- Schemer: Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and a GraphQL API.
- GridScale: Scala library for accessing various file and batch systems, job schedulers and grid middlewares.
- Spark Utils: Basic framework utilities to quickly start writing production-ready Apache Spark applications.
README
spark-deployer
- A Scala tool that helps you deploy an Apache Spark standalone cluster on EC2 and submit jobs to it.
- Currently supports Spark 2.0.0+.
- There are two modes when using spark-deployer: SBT plugin mode and embedded mode.
SBT plugin mode
Here are the basic steps to run a Spark job (all the sbt commands support TAB-completion):
- Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (a small sanity-check sketch is given after this list).
- Prepare a project with a structure like below:
project-root
├── build.sbt
├── project
│   └── plugins.sbt
└── src
    └── main
        └── scala
            └── mypackage
                └── Main.scala
- Add one line in project/plugins.sbt:
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "3.0.2")
- Write your Spark project's build.sbt (here is a simple example):
name := "my-project-name"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
)
- Write your job's algorithm in src/main/scala/mypackage/Main.scala:
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // set up Spark
    val sc = new SparkContext(new SparkConf())
    // your algorithm
    val n = 10000000
    val count = sc.parallelize(1 to n).map { i =>
      val x = scala.math.random
      val y = scala.math.random
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
- Enter sbt and build a config by:
> sparkBuildConfig
(Most settings have default values; just hit Enter to go through them.)
- Create a cluster with 1 master and 2 workers by:
> sparkCreateCluster 2
- See your cluster's status by:
> sparkShowMachines
- Submit your job by:
> sparkSubmit
- When your job is done, destroy your cluster with:
> sparkDestroyCluster
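As referenced in the first step, spark-deployer reads the AWS credentials from the environment of the shell that runs sbt, so a quick check can save a failed sparkCreateCluster. Below is a minimal sketch of a custom sbt task you could add to build.sbt; checkAwsEnv is a hypothetical task name, not something provided by spark-deployer:
// build.sbt (sketch): a hypothetical helper task that fails fast
// when the AWS credential variables are missing.
lazy val checkAwsEnv = taskKey[Unit]("Verify AWS credential environment variables are set")

checkAwsEnv := {
  val required = Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
  val missing  = required.filterNot(sys.env.contains)
  require(missing.isEmpty, s"Missing environment variables: ${missing.mkString(", ")}")
}
You can then run checkAwsEnv in the sbt shell before sparkCreateCluster.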
Advanced functions
- To build a config with a different name, or build a config based on an old one:
> sparkBuildConfig <new-config-name>
> sparkBuildConfig <new-config-name> from <old-config-name>
All the configs are stored as .deployer.json files in the conf/ folder. You can modify them if you know what you're doing.
- To change the current config:
> sparkChangeConfig <config-name>
- To submit a job with arguments or with a main class (a sketch of how the arguments reach your job is given after this list):
> sparkSubmit <args>
> sparkSubmitMain mypackage.Main <args>
- To add or remove worker machines dynamically:
> sparkAddWorkers <num-of-workers>
> sparkRemoveWorkers <num-of-workers>
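As mentioned above, arguments passed via sparkSubmit <args> or sparkSubmitMain arrive as the args array of your job's main method. Here is a minimal sketch; the input path and partition count are made-up example arguments, not anything spark-deployer requires:
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // e.g. `sparkSubmit s3://my-bucket/input 100` would give
    // args = Array("s3://my-bucket/input", "100") here
    val inputPath  = args(0)
    val partitions = args(1).toInt

    val sc  = new SparkContext(new SparkConf())
    val rdd = sc.textFile(inputPath, partitions)
    println(s"Line count: ${rdd.count()}")
    sc.stop()
  }
}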
Embedded mode
If you don't want to use sbt, or if you would like to trigger the cluster creation from within your Scala application, you can include the spark-deployer library directly:
libraryDependencies += "net.pishen" %% "spark-deployer-core" % "3.0.2"
Then, from your Scala code, you can do something like this:
import sparkdeployer._
import java.io.File
// build a ClusterConf
val clusterConf = ClusterConf.build()
// save and load ClusterConf
clusterConf.save("path/to/conf.deployer.json")
val clusterConfReloaded = ClusterConf.load("path/to/conf.deployer.json")
// create cluster and submit job
val sparkDeployer = new SparkDeployer()(clusterConf)
val workers = 2
sparkDeployer.createCluster(workers)
val jar = new File("path/to/job.jar")
val mainClass = "mypackage.Main"
val args = Seq("arg0", "arg1")
sparkDeployer.submit(jar, mainClass, args)
sparkDeployer.destroyCluster()
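Since createCluster launches real EC2 instances, you may want to guarantee teardown even when the submission fails. Here is a minimal sketch that wraps the same calls shown above in try/finally (it assumes you always want the cluster destroyed once the job finishes):
import sparkdeployer._
import java.io.File

object RunJob {
  def main(args: Array[String]): Unit = {
    val clusterConf   = ClusterConf.load("path/to/conf.deployer.json")
    val sparkDeployer = new SparkDeployer()(clusterConf)
    sparkDeployer.createCluster(2)
    try {
      sparkDeployer.submit(new File("path/to/job.jar"), "mypackage.Main", Seq("arg0", "arg1"))
    } finally {
      // tear down the EC2 machines even if the submission throws
      sparkDeployer.destroyCluster()
    }
  }
}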
- Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should also be set.
- You may prepare the job.jar with sbt-assembly from another sbt project that depends on Spark (a minimal assembly setup is sketched at the end of this section).
- For other available functions, check SparkDeployer.scala in our source code.
spark-deployer uses slf4j; remember to add your own logging backend to see the logs. For example, to print the log on screen, add:
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.14"
FAQ
Could I use another AMI?
Yes, just specify the AMI id when running sparkBuildConfig. The image should be HVM EBS-backed with Java 7+ installed. You can also run some commands on each machine before Spark starts by editing the preStartCommands in the json config. For example:
"preStartCommands": [
"sudo bash -c \"echo -e 'LC_ALL=en_US.UTF-8\\nLANG=en_US.UTF-8' >> /etc/environment\"",
"sudo apt-get -qq install openjdk-8-jre",
"cd spark/conf/ && cp log4j.properties.template log4j.properties && echo 'log4j.rootCategory=WARN, console' >> log4j.properties"
]
When using a custom AMI, the root device should be your root volume's name (/dev/sda1 for Ubuntu), which can then be enlarged by the disk size settings for master and workers.
Could I use a custom Spark tarball?
Yes, just change the tgz URL when running sparkBuildConfig; the tgz will be extracted into a spark/ folder in each machine's home directory.
What rules should I set on my security group?
Assuming your security group id is sg-abcde123, the basic settings are:

| Type | Protocol | Port Range | Source |
|---|---|---|---|
| All traffic | All | All | sg-abcde123 |
| SSH | TCP | 22 | <your-allowed-ip> |
| Custom TCP Rule | TCP | 8080-8081 | <your-allowed-ip> |
| Custom TCP Rule | TCP | 4040 | <your-allowed-ip> |
How do I upgrade the config to a new version of spark-deployer?
Change to the config you want to upgrade, and run sparkUpgradeConfig to build a new config based on the settings in the old one. If this doesn't work, or if you don't mind rebuilding one from scratch, it's recommended to directly create a new config with sparkBuildConfig.
Could I change the directory where configurations are saved?
You can change it by adding the following line to your build.sbt:
sparkConfigDir := "path/to/my-config-dir"
How to contribute
- Please report an issue or ask on Gitter if you run into any problems.
- Pull requests are welcome.