Dataproc Java Submitter

Java library for easy job submission to Google Cloud Dataproc

License	The Apache Software License, Version 2.0
Categories	Java, Programming Languages, Data
Group	com.spotify
Artifact ID	dataproc-java-submitter
Latest version	0.1.2
Date	Oct 28, 2016
Type	jar
Description	Java library for easy job submission to Google Cloud Dataproc
Website	https://github.com/spotify/dataproc-java-submitter
Source control	https://github.com/spotify/dataproc-java-submitter

Download dataproc-java-submitter

File name	Size
dataproc-java-submitter-0.1.2.pom
dataproc-java-submitter-0.1.2.jar	30 KB
dataproc-java-submitter-0.1.2-sources.jar	19 KB
dataproc-java-submitter-0.1.2-javadoc.jar	77 KB
Overview

How to add the latest version

Apache Maven

<!-- https://jarcasting.com/artifacts/com.spotify/dataproc-java-submitter/ -->
<dependency>
    <groupId>com.spotify</groupId>
    <artifactId>dataproc-java-submitter</artifactId>
    <version>0.1.2</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.spotify/dataproc-java-submitter/
implementation 'com.spotify:dataproc-java-submitter:0.1.2'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.spotify/dataproc-java-submitter/
implementation ("com.spotify:dataproc-java-submitter:0.1.2")

Apache Buildr

'com.spotify:dataproc-java-submitter:jar:0.1.2'

Apache Ivy

<dependency org="com.spotify" name="dataproc-java-submitter" rev="0.1.2">
  <artifact name="dataproc-java-submitter" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.spotify', module='dataproc-java-submitter', version='0.1.2')
)

Scala SBT

libraryDependencies += "com.spotify" % "dataproc-java-submitter" % "0.1.2"

Leiningen

[com.spotify/dataproc-java-submitter "0.1.2"]

Dependencies

compile (5)

Library ID	Type	Version
org.jhades : jhades	jar	1.0.4
com.google.apis : google-api-services-dataproc	jar	v1-rev8-1.22.0
com.google.apis : google-api-services-storage	jar	v1-rev71-1.22.0
com.google.guava : guava	jar	19.0
org.slf4j : slf4j-api	jar	1.7.21

provided (1)

Library ID	Type	Version
com.google.auto.value : auto-value	jar	1.1

test (3)

Library ID	Type	Version
junit : junit	jar	4.12
org.hamcrest : hamcrest-all	jar	1.3
org.mockito : mockito-all	jar	1.10.19

Project Modules

This project has no modules.

dataproc-java-submitter

A small Java library for submitting Hadoop jobs to Google Cloud Dataproc from Java.

Why?

In many real-world uses of Hadoop, jobs are parameterized to some degree. Parameters can be anything from job configuration to input paths. It is common to resolve these parameters in a workflow tool that eventually puts them on a command line passed to the Hadoop job. On the job side, the arguments then have to be parsed again, using tools that are more or less standard.

However, if the arguments are resolved in a JVM, dropping down to a shell and invoking a command line is complicated and roundabout. It also limits what can be passed to the job. It is not uncommon to take more structured data, store it in some serialized format, stage the files, and have custom logic in the job to deserialize it.

This library aims to bridge more seamlessly between a local JVM instance and the Hadoop application entrypoint.

Usage

Maven dependency

<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>dataproc-java-submitter</artifactId>
  <version>0.1.2</version> <!-- latest version listed on this page -->
</dependency>

Example usage

String project = "gcp-project-id";
String cluster = "dataproc-cluster-id";

DataprocHadoopRunner hadoopRunner = DataprocHadoopRunner.builder(project, cluster).build();
DataprocLambdaRunner lambdaRunner = DataprocLambdaRunner.forDataproc(hadoopRunner);

// Use any structured type that is Java Serializable
MyStructuredJobArguments arguments = resolveArgumentsInLocalJvm();

lambdaRunner.runOnCluster(() -> {

  // This lambda, including its closure, will run on the Dataproc cluster
  System.out.println("Running on the cluster, with " + arguments.inputPaths());

  return 42; // rfc: is it worth supporting a return value from the job?
});

The DataprocLambdaRunner will take care of configuring the Dataproc job so that it can run your lambda function. It will scan your local classpath and ensure that the loaded jars are staged and configured for the Dataproc job. It will also take care of serializing, staging and deserializing the lambda closure that is to be invoked on the cluster.
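
To illustrate the serialization step described above, here is a minimal sketch of how a Java lambda and the values it captures can be written to bytes and restored with standard Java serialization. It is not the library's implementation: ClosureRoundTrip and SerializableCallable are hypothetical names, the gs:// paths are made up, and on a real cluster the bytes would be read by a different JVM that has the same classes on its classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch; not part of dataproc-java-submitter. */
final class ClosureRoundTrip {

  /** A Callable that is also Serializable, so a lambda assigned to it can be serialized. */
  interface SerializableCallable<T> extends Callable<T>, Serializable {}

  /** Writes the lambda, together with the values it captured, to a byte array. */
  static byte[] toBytes(SerializableCallable<?> fn) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(fn);
    }
    return bytes.toByteArray();
  }

  /** Restores the lambda; the class that defined it must be loadable by the reading JVM. */
  @SuppressWarnings("unchecked")
  static <T> SerializableCallable<T> fromBytes(byte[] data)
      throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (SerializableCallable<T>) in.readObject();
    }
  }

  /** Captures the list in a lambda, round-trips the lambda, and invokes the restored copy. */
  static int countAfterRoundTrip(List<String> inputPaths) throws Exception {
    List<String> captured = new ArrayList<>(inputPaths); // ArrayList is Serializable
    SerializableCallable<Integer> fn = () -> captured.size();
    SerializableCallable<Integer> restored = fromBytes(toBytes(fn));
    return restored.call();
  }

  public static void main(String[] args) throws Exception {
    // Made-up input paths, standing in for arguments resolved in the local JVM.
    System.out.println(countAfterRoundTrip(Arrays.asList("gs://bucket/a", "gs://bucket/b")));
  }
}
```

The byte array returned by toBytes stands in for what would be staged alongside the job; how the library actually stages and loads it is not shown here.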

Note that anything referenced from the lambda has to implement java.io.Serializable.
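
As a sketch of what that requirement means in practice, the example below defines a hypothetical version of the MyStructuredJobArguments type used above and checks whether a value can pass through Java serialization. The field layout, the isSerializable helper, and the SerializableArgsDemo wrapper class are assumptions made for illustration; the source does not show the real argument type.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not part of dataproc-java-submitter. */
final class SerializableArgsDemo {

  /** Hypothetical argument type, standing in for MyStructuredJobArguments. */
  static final class MyStructuredJobArguments implements Serializable {
    private static final long serialVersionUID = 1L;

    // ArrayList is itself Serializable, so the whole object graph can be written.
    private final ArrayList<String> inputPaths;

    MyStructuredJobArguments(List<String> inputPaths) {
      this.inputPaths = new ArrayList<>(inputPaths);
    }

    List<String> inputPaths() {
      return inputPaths;
    }
  }

  /** Returns true if Java serialization accepts the object graph rooted at value. */
  static boolean isSerializable(Object value) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(value);
      return true;
    } catch (NotSerializableException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    MyStructuredJobArguments arguments =
        new MyStructuredJobArguments(Arrays.asList("gs://bucket/in")); // made-up path
    System.out.println(isSerializable(arguments));    // prints true
    System.out.println(isSerializable(new Object())); // prints false
  }
}
```

A check like this can be run locally before submitting, so that a non-serializable field is caught in the local JVM instead of surfacing as a failed job.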

Low level usage

This library can also be used to configure the Dataproc job directly.

String project = "gcp-project-id";
String cluster = "dataproc-cluster-id";

DataprocHadoopRunner hadoopRunner = DataprocHadoopRunner.builder(project, cluster).build();

Job job = Job.builder()
    .setMainClass(...)
    .setArgs(...)
    .setProperties(...)
    .setShippedJars(...)
    .setShippedFiles(...)
    .createJob();


hadoopRunner.submit(job);
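
For the low-level route, the job side needs an ordinary main class that receives the submitted arguments as strings. The sketch below shows one possible shape for such an entrypoint. The class name ExampleJobMain and the --key=value convention are assumptions made for illustration; the library does not prescribe how arguments are encoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical job entrypoint; name and argument convention are assumptions. */
final class ExampleJobMain {

  /** Parses arguments of the form --key=value into a map that keeps argument order. */
  static Map<String, String> parseArgs(String[] args) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      // Require the "--" prefix and at least one key character before '='.
      if (!arg.startsWith("--") || eq < 3) {
        throw new IllegalArgumentException("Expected --key=value, got: " + arg);
      }
      parsed.put(arg.substring(2, eq), arg.substring(eq + 1));
    }
    return parsed;
  }

  public static void main(String[] args) {
    Map<String, String> parsed = parseArgs(args);
    System.out.println("Job arguments: " + parsed);
  }
}
```

This is the string-based parsing that the lambda runner is meant to make unnecessary: with the low-level API, structured values have to be flattened to strings on the submitting side and rebuilt here.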


Library versions

Version	Date
0.1.2	Oct 28, 2016
0.1.1	Sep 28, 2016
