Popularity

2.0

Declining

Activity

3.8

Declining

Stars 35

Watchers 5

Forks 6

Last Commit 14 days ago

Description

One of the biggest challenges after taking the first steps into the world of writing Apache Spark applications in Scala is taking them to production.

An application of any kind needs to be easy to run and easy to configure.

This project is trying to help developers write Spark applications focusing mainly on the application logic rather than the details of configuring the application and setting up the Spark context.

This project is also trying to create and encourage a friendly yet professional environment for developers to help each other, so please do not be shy and join through Gitter, Twitter, issue reports or pull requests.

Programming language: Scala

License: MIT License

Tags: Big Data Spark Utilities Scala

Latest version: v0.6

Spark Utils alternatives and similar packages

Based on the "Big Data" category.
Alternatively, view Spark Utils alternatives based on common mentions on social networks and blogs.

Kafka

10.0 9.9 L2 Spark Utils VS Kafka

Mirror of Apache Kafka
Apache Spark

10.0 10.0 Spark Utils VS Apache Spark

Apache Spark - A unified analytics engine for large-scale data processing

Flink

9.9 9.9 L2 Spark Utils VS Flink

Apache Flink
Deeplearning4J

9.9 6.5 L1 Spark Utils VS Deeplearning4J

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation.
Scalding

9.6 2.5 Spark Utils VS Scalding

A Scala API for Cascading
Scio

9.3 9.6 Spark Utils VS Scio

A Scala API for Apache Beam and Google Cloud Dataflow.
Summingbird

9.3 1.7 Spark Utils VS Summingbird

Streaming MapReduce with Scalding and Storm
Reactive-kafka

8.9 8.2 Spark Utils VS Reactive-kafka

Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
Jupyter Scala

8.7 9.0 Spark Utils VS Jupyter Scala

A Scala kernel for Jupyter
Hail

8.3 9.8 Spark Utils VS Hail

Cloud-native genomic dataframes and batch computing
BIDMach

8.3 0.0 Spark Utils VS BIDMach

CPU and GPU-accelerated Machine Learning Library
Gearpump

8.1 0.0 Spark Utils VS Gearpump

Lightweight real-time big data streaming engine over Akka
Sparkta

8.0 0.0 Spark Utils VS Sparkta

Real Time Analytics and Data Pipelines based on Spark Streaming
Vegas

7.5 0.0 Spark Utils VS Vegas

The missing MatPlotLib for Scala + Spark
metorikku

7.4 2.4 Spark Utils VS metorikku

A simplified, lightweight ETL Framework based on Apache Spark
Scoobi

6.8 0.0 Spark Utils VS Scoobi

A Scala productivity framework for Hadoop.
Scrunch

5.1 1.4 L3 Spark Utils VS Scrunch

Mirror of Apache Crunch (Incubating)
Scoozie

4.7 0.0 Spark Utils VS Scoozie

Scala DSL on top of Oozie XML
Schemer

3.5 0.0 Spark Utils VS Schemer

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
spark-deployer

3.4 0.0 Spark Utils VS spark-deployer

Deploy Spark cluster in an easy way.
GridScale

2.2 6.6 Spark Utils VS GridScale

Scala library for accessing various file, batch systems, job schedulers and grid middlewares.
raster-frames

2.1 0.0 Spark Utils VS raster-frames

Spark DataFrames for earth observation data
Sparkplug

1.8 0.0 Spark Utils VS Sparkplug

Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌
Shadoop

1.3 0.0 Spark Utils VS Shadoop

A wrapper for Hadoop in Scala
Spark Tools

1.0 0.0 Spark Utils VS Spark Tools

Executable Apache Spark Tools: Format Converter & SQL Processor

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.

README

Spark Utils

Motivation

One of the biggest challenges after taking the first steps into the world of writing Apache Spark applications in Scala is taking them to production.

An application of any kind needs to be easy to run and easy to configure.

This project is trying to help developers write Spark applications focusing mainly on the application logic rather than the details of configuring the application and setting up the Spark context.

This project is also trying to create and encourage a friendly yet professional environment for developers to help each other, so please do not be shy and join through Gitter, Twitter, issue reports or pull requests.

Description

This project contains some basic utilities that help set up a Spark application project.

The main goal is to keep Spark applications simple to write by focusing on the logic, while making configuration and argument passing easy.

The code sample below shows how easy it can be to write a file format converter from any acceptable type, with any acceptable parsing configuration options, to any acceptable format.

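import com.typesafe.config.Config
import org.apache.spark.sql.{ DataFrame, SparkSession }
import scala.util.Try
// The spark-utils imports (SparkApp, the IO configuration types and implicits) are omitted here.

object FormatConverterExample extends SparkApp[FormatConverterContext, DataFrame] {
  // Build the typed application context from the raw configuration
  override def createContext(config: Config) = FormatConverterContext(config)
  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = {
    val inputData = spark.source(context.input).read
    inputData.sink(context.output).write
  }
}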
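
The separation used above, where one method builds a typed context from configuration and another runs the logic against it, can be sketched without Spark at all. The following is only an illustration of the pattern, not the library's implementation: `MiniApp`, `ConverterContext`, `MiniConverter` and the `Map[String, String]` configuration are hypothetical stand-ins for `SparkApp`, `FormatConverterContext` and `Config`.

```scala
import scala.util.{ Failure, Success, Try }

// Hypothetical stand-in for SparkApp: context creation is kept apart from the logic.
trait MiniApp[Context, Result] {
  // Build a typed context from raw configuration; this step may fail.
  def createContext(config: Map[String, String]): Try[Context]
  // The application logic, which only ever sees a valid context.
  def run(context: Context): Try[Result]
  // Wire the two steps together.
  def execute(config: Map[String, String]): Try[Result] =
    createContext(config).flatMap(run)
}

// Hypothetical context holding only the two settings this sketch needs.
final case class ConverterContext(input: String, output: String)

object MiniConverter extends MiniApp[ConverterContext, String] {
  private def required(config: Map[String, String], key: String): Try[String] =
    config.get(key) match {
      case Some(value) => Success(value)
      case None        => Failure(new NoSuchElementException(s"Missing configuration key: $key"))
    }

  override def createContext(config: Map[String, String]): Try[ConverterContext] =
    for {
      in  <- required(config, "input")
      out <- required(config, "output")
    } yield ConverterContext(in, out)

  // Stands in for the read/write logic; it only describes what would happen.
  override def run(context: ConverterContext): Try[String] =
    Success(s"convert ${context.input} -> ${context.output}")
}
```

Calling `MiniConverter.execute(Map("input" -> "a.csv"))` fails before `run` is ever reached, which is the point of the split: the logic never has to deal with a half-configured application.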

Creating the configuration can be as simple as defining a case class to hold the configuration and a factory that helps extract simple and complex data types, like input sources and output sinks.

case class FormatConverterContext(input: FormatAwareDataSourceConfiguration,
                                  output: FormatAwareDataSinkConfiguration)

object FormatConverterContext extends Configurator[FormatConverterContext] {
  import com.typesafe.config.Config
  import scalaz.ValidationNel

  def validationNel(config: Config): ValidationNel[Throwable, FormatConverterContext] = {
    import scalaz.syntax.applicative._
    config.extract[FormatAwareDataSourceConfiguration]("input") |@|
      config.extract[FormatAwareDataSinkConfiguration]("output") apply
      FormatConverterContext.apply
  }
}
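
The `|@|` operator above combines two independent validations, so a failure in `input` does not hide a failure in `output`. A rough, dependency-free sketch of that idea, using `Either[List[String], A]` from the standard library, might look like the following. All names here (`ValidationSketch`, `extract`, `both`, `Ctx`) are hypothetical; the real code relies on scalaz's `ValidationNel`.

```scala
object ValidationSketch {
  // Errors accumulate in a list rather than stopping at the first failure.
  type Validated[A] = Either[List[String], A]

  // Hypothetical extractor: look up a key or report that it is missing.
  def extract(config: Map[String, String], key: String): Validated[String] =
    config.get(key).toRight(List(s"Missing key: $key"))

  // Combine two independent results, keeping the errors from both sides.
  def both[A, B, C](a: Validated[A], b: Validated[B])(f: (A, B) => C): Validated[C] =
    (a, b) match {
      case (Right(x), Right(y)) => Right(f(x, y))
      case (Left(e1), Left(e2)) => Left(e1 ++ e2)
      case (Left(e), _)         => Left(e)
      case (_, Left(e))         => Left(e)
    }

  final case class Ctx(input: String, output: String)

  def validate(config: Map[String, String]): Validated[Ctx] =
    both(extract(config, "input"), extract(config, "output"))(Ctx(_, _))
}
```

With an empty configuration, `ValidationSketch.validate` reports both missing keys at once instead of only the first one.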

Optionally, SparkFun can be used instead of SparkApp to make the code even more concise.

object FormatConverterExample extends 
          SparkFun[FormatConverterContext, DataFrame](FormatConverterContext(_).get) {
  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = 
    spark.source(context.input).read.sink(context.output).write
}

For structured streaming applications the format converter might look like this:

object StreamingFormatConverterExample extends SparkApp[StreamingFormatConverterContext, DataFrame] {
  override def createContext(config: Config) = StreamingFormatConverterContext(config).get
  override def run(implicit spark: SparkSession, context: StreamingFormatConverterContext): Try[DataFrame] = {
    val inputData = spark.source(context.input).read
    inputData.streamingSink(context.output).write.awaitTermination()
  }
}

The streaming configuration can be as simple as the following:

case class StreamingFormatConverterContext(input: FormatAwareStreamingSourceConfiguration, 
                                           output: FormatAwareStreamingSinkConfiguration)

object StreamingFormatConverterContext extends Configurator[StreamingFormatConverterContext] {
  import com.typesafe.config.Config
  import scalaz.ValidationNel

  def validationNel(config: Config): ValidationNel[Throwable, StreamingFormatConverterContext] = {
    // Brings the |@| applicative builder syntax into scope
    import scalaz.syntax.applicative._
    config.extract[FormatAwareStreamingSourceConfiguration]("input") |@|
      config.extract[FormatAwareStreamingSinkConfiguration]("output") apply
      StreamingFormatConverterContext.apply
  }
}

The [SparkRunnable](docs/spark-runnable.md) and [SparkApp](docs/spark-app.md) or [SparkFun](docs/spark-fun.md) together with the configuration framework provide for easy Spark application creation with configuration that can be managed through configuration files or application parameters.
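
As a rough illustration of configuration that comes from both files and application parameters, the sketch below overlays command-line arguments on a set of defaults. It rests on assumptions: the `key=value` argument form, the `ArgsSketch` object and the key names are all hypothetical, and the library's actual resolution rules are described in the linked documents.

```scala
object ArgsSketch {
  // Parse arguments of the assumed form key=value; anything else is ignored.
  def parseArgs(args: Seq[String]): Map[String, String] =
    args.flatMap { arg =>
      arg.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _                                 => None
      }
    }.toMap

  // Application parameters take precedence over the defaults (e.g. from a file).
  def resolve(defaults: Map[String, String], args: Seq[String]): Map[String, String] =
    defaults ++ parseArgs(args)
}
```

Here `defaults` stands in for values loaded from a configuration file, so the same application can be run unchanged with a different input or output simply by passing different parameters.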

The IO frameworks for [reading](docs/data-source.md) and [writing](docs/data-sink.md) data frames add extra convenience for setting up batch and structured streaming jobs that transform various types of files and streams.

Last but not least, there are many utility functions that provide convenience for loading resources, dealing with schemas and so on.

Most of the common features are also implemented as decorators for the main Spark classes, such as SparkContext, DataFrame and StructType, and they are conveniently available through the org.tupol.spark.implicits._ import.
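
For readers unfamiliar with decorators in Scala, they are usually written as implicit classes that add methods to an existing type once an import brings them into scope. The sketch below shows the mechanism on a plain `Seq[String]`; `DecoratorSketch`, `RichLines` and `cleaned` are hypothetical names and are not part of Spark Utils.

```scala
object DecoratorSketch {
  // Hypothetical decorator adding a convenience method to an existing type.
  implicit class RichLines(lines: Seq[String]) {
    // Trim every line and drop the ones that end up empty.
    def cleaned: Seq[String] = lines.map(_.trim).filter(_.nonEmpty)
  }
}
```

After `import DecoratorSketch._`, an expression such as `Seq("  a ", "", "b").cleaned` compiles as if `cleaned` were defined on `Seq` itself, which is the same convenience the library's implicits aim to provide for Spark classes.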

Documentation

The documentation for the main utilities and frameworks available:

[SparkApp](docs/spark-app.md), [SparkFun](docs/spark-fun.md) and [SparkRunnable](docs/spark-runnable.md)
[DataSource Framework](docs/data-source.md) for both batch and structured streaming applications
[DataSink Framework](docs/data-sink.md) for both batch and structured streaming applications

Latest stable API documentation is available here.

An extensive tutorial and walk-through can be found here. Extensive samples and demos can be found here.

A nice example on how this library can be used can be found in the spark-tools project, through the implementation of a generic format converter and a SQL processor for both batch and structured streams.

Prerequisites

Java 8 or higher
Scala 2.12
Apache Spark 3.0.X

Getting Spark Utils

Spark Utils is published to Maven Central and Spark Packages:

Group id / organization: org.tupol
Artifact id / name: spark-utils
Latest stable versions:
- Spark 2.4: 0.4.2
- Spark 3.0: 0.6.1

To use Spark Utils with SBT, add a dependency on the latest version to your sbt build definition file:

libraryDependencies += "org.tupol" %% "spark-utils" % "0.6.2"

Include this package in your Spark applications using spark-shell or spark-submit:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-utils_2.12:0.4.2

Starting a New `spark-utils` Project

The simplest way to start a new spark-utils project is to use the spark-apps.seed.g8 template project.

To fill in the project options manually, run:

g8 tupol/spark-apps.seed.g8

The default options look like the following:

name [My Project]:
appname [My First App]:
organization [my.org]:
version [0.0.1-SNAPSHOT]:
package [my.org.my_project]:
classname [MyFirstApp]:
scriptname [my-first-app]:
scalaVersion [2.11.12]:
sparkVersion [2.4.0]:
sparkUtilsVersion [0.4.0]:

To fill in the options in advance, run:

g8 tupol/spark-apps.seed.g8 --name="My Project" --appname="My App" --organization="my.org" --force

What's new?

0.6.2

Fixed core dependency to scala-utils; now using scala-utils-core
Refactored the core/implicits package to make the implicits a little more explicit

For previous versions please consult the [release notes](RELEASE-NOTES.md).

License

This code is open source software licensed under the [MIT License](LICENSE).

*Note that all licence references and agreements mentioned in the Spark Utils README section above are relevant to that project's source code only.

Spark Utils

Basic framework utilities to quickly start writing production ready Apache Spark applications