spark-deployer alternatives and similar packages
Based on the "Big Data" category.
Alternatively, view spark-deployer alternatives based on common mentions on social networks and blogs.
- Deeplearning4J: Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
- Reactive-kafka: Alpakka Kafka connector. Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
- Schemer: Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and a GraphQL API.
- GridScale: Scala library for accessing various file and batch systems, job schedulers and grid middlewares.
- Spark Utils: Basic framework utilities to quickly start writing production-ready Apache Spark applications.
README
spark-deployer
- A Scala tool that helps you deploy an Apache Spark standalone cluster on EC2 and submit jobs to it.
- Currently supports Spark 2.0.0+.
- There are two modes when using spark-deployer: SBT plugin mode and embedded mode.
SBT plugin mode
Here are the basic steps to run a Spark job (all the sbt commands support TAB-completion):
- Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (a small sanity-check sketch is given after this list).
- Prepare a project with a structure like below:
project-root
├── build.sbt
├── project
│   └── plugins.sbt
└── src
    └── main
        └── scala
            └── mypackage
                └── Main.scala
- Add one line in project/plugins.sbt:
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "3.0.2")
- Write your Spark project's build.sbt (here is a simple example):
name := "my-project-name"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
)
- Write your job's algorithm in src/main/scala/mypackage/Main.scala:
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // set up Spark
    val sc = new SparkContext(new SparkConf())
    // your algorithm
    val n = 10000000
    val count = sc.parallelize(1 to n).map { i =>
      val x = scala.math.random
      val y = scala.math.random
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
- Enter sbt and build a config by:
> sparkBuildConfig
(Most settings have default values; just hit Enter to go through them.)
- Create a cluster with 1 master and 2 workers by:
> sparkCreateCluster 2
- See your cluster's status by:
> sparkShowMachines
- Submit your job by:
> sparkSubmit
- When your job is done, destroy your cluster with:
> sparkDestroyCluster
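As referenced in the first step, spark-deployer reads the AWS credentials from the environment of the shell that runs sbt, so a quick check can save a failed sparkCreateCluster. Below is a minimal sketch of a custom sbt task you could add to build.sbt; checkAwsEnv is a hypothetical task name, not something provided by spark-deployer:
// build.sbt (sketch): a hypothetical helper task that fails fast
// when the AWS credential variables are missing.
lazy val checkAwsEnv = taskKey[Unit]("Verify AWS credential environment variables are set")

checkAwsEnv := {
  val required = Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
  val missing  = required.filterNot(sys.env.contains)
  require(missing.isEmpty, s"Missing environment variables: ${missing.mkString(", ")}")
}
You can then run checkAwsEnv in the sbt shell before sparkCreateCluster.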
Advanced functions
- To build a config with a different name, or build a config based on an old one:
> sparkBuildConfig <new-config-name>
> sparkBuildConfig <new-config-name> from <old-config-name>
All the configs are stored as .deployer.json files in the conf/ folder. You can modify them if you know what you're doing.
- To change the current config:
> sparkChangeConfig <config-name>
- To submit a job with arguments or with a main class (a sketch of how the arguments reach your job is given after this list):
> sparkSubmit <args>
> sparkSubmitMain mypackage.Main <args>
- To add or remove worker machines dynamically:
> sparkAddWorkers <num-of-workers>
> sparkRemoveWorkers <num-of-workers>
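As mentioned above, arguments passed via sparkSubmit <args> or sparkSubmitMain arrive as the args array of your job's main method. Here is a minimal sketch; the input path and partition count are made-up example arguments, not anything spark-deployer requires:
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // e.g. `sparkSubmit s3://my-bucket/input 100` would give
    // args = Array("s3://my-bucket/input", "100") here
    val inputPath  = args(0)
    val partitions = args(1).toInt

    val sc  = new SparkContext(new SparkConf())
    val rdd = sc.textFile(inputPath, partitions)
    println(s"Line count: ${rdd.count()}")
    sc.stop()
  }
}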
Embedded mode
If you don't want to use sbt, or if you would like to trigger the cluster creation from within your Scala application, you can include the spark-deployer library directly:
libraryDependencies += "net.pishen" %% "spark-deployer-core" % "3.0.2"
Then, from your Scala code, you can do something like this:
import sparkdeployer._
import java.io.File
// build a ClusterConf
val clusterConf = ClusterConf.build()
// save and load ClusterConf
clusterConf.save("path/to/conf.deployer.json")
val clusterConfReloaded = ClusterConf.load("path/to/conf.deployer.json")
// create cluster and submit job
val sparkDeployer = new SparkDeployer()(clusterConf)
val workers = 2
sparkDeployer.createCluster(workers)
val jar = new File("path/to/job.jar")
val mainClass = "mypackage.Main"
val args = Seq("arg0", "arg1")
sparkDeployer.submit(jar, mainClass, args)
sparkDeployer.destroyCluster()
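Since createCluster launches real EC2 instances, you may want to guarantee teardown even when the submission fails. Here is a minimal sketch that wraps the same calls shown above in try/finally (it assumes you always want the cluster destroyed once the job finishes):
import sparkdeployer._
import java.io.File

object RunJob {
  def main(args: Array[String]): Unit = {
    val clusterConf   = ClusterConf.load("path/to/conf.deployer.json")
    val sparkDeployer = new SparkDeployer()(clusterConf)
    sparkDeployer.createCluster(2)
    try {
      sparkDeployer.submit(new File("path/to/job.jar"), "mypackage.Main", Seq("arg0", "arg1"))
    } finally {
      // tear down the EC2 machines even if the submission throws
      sparkDeployer.destroyCluster()
    }
  }
}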
- Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should also be set.
- You may prepare the job.jar with sbt-assembly from another sbt project that depends on Spark (a minimal assembly setup is sketched at the end of this section).
- For other available functions, check SparkDeployer.scala in our source code.
spark-deployer uses slf4j; remember to add your own logging backend to see the logs. For example, to print the log on screen, add:
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.14"
FAQ
Could I use another AMI?
Yes, just specify the AMI id when running sparkBuildConfig. The image should be HVM EBS-backed with Java 7+ installed. You can also run some commands on each machine before Spark starts by editing the preStartCommands in the json config. For example:
"preStartCommands": [
"sudo bash -c \"echo -e 'LC_ALL=en_US.UTF-8\\nLANG=en_US.UTF-8' >> /etc/environment\"",
"sudo apt-get -qq install openjdk-8-jre",
"cd spark/conf/ && cp log4j.properties.template log4j.properties && echo 'log4j.rootCategory=WARN, console' >> log4j.properties"
]
When using a custom AMI, the root device should be your root volume's name (/dev/sda1 for Ubuntu), which can then be enlarged by the disk size settings for master and workers.
Could I use a custom Spark tarball?
Yes, just change the tgz URL when running sparkBuildConfig; the tgz will be extracted into a spark/ folder in each machine's home directory.
What rules should I set on my security group?
Assuming your security group id is sg-abcde123, the basic settings are:

| Type | Protocol | Port Range | Source |
|---|---|---|---|
| All traffic | All | All | sg-abcde123 |
| SSH | TCP | 22 | <your-allowed-ip> |
| Custom TCP Rule | TCP | 8080-8081 | <your-allowed-ip> |
| Custom TCP Rule | TCP | 4040 | <your-allowed-ip> |
How do I upgrade the config to a new version of spark-deployer?
Change to the config you want to upgrade, and run sparkUpgradeConfig to build a new config based on the settings in the old one. If this doesn't work, or if you don't mind rebuilding one from scratch, it's recommended to directly create a new config with sparkBuildConfig.
Could I change the directory where configurations are saved?
You can change it by adding the following line to your build.sbt:
sparkConfigDir := "path/to/my-config-dir"
How to contribute
- Please report an issue or ask on Gitter if you run into any problems.
- Pull requests are welcome.