core

GroupId: com.github.benfradet
ArtifactId: struct-type-encoder_2.12
Last Version: 0.6.0
Type: jar
Description: core
Project URL: https://github.com/BenFradet/struct-type-encoder
Project Organization: com.github.benfradet
Source Code Management: https://github.com/BenFradet/struct-type-encoder

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.github.benfradet/struct-type-encoder_2.12/ -->
<dependency>
    <groupId>com.github.benfradet</groupId>
    <artifactId>struct-type-encoder_2.12</artifactId>
    <version>0.6.0</version>
</dependency>

Gradle (Groovy DSL):

implementation 'com.github.benfradet:struct-type-encoder_2.12:0.6.0'

Gradle (Kotlin DSL):

implementation("com.github.benfradet:struct-type-encoder_2.12:0.6.0")

Buildr:

'com.github.benfradet:struct-type-encoder_2.12:jar:0.6.0'

Ivy:

<dependency org="com.github.benfradet" name="struct-type-encoder_2.12" rev="0.6.0">
  <artifact name="struct-type-encoder_2.12" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
  @Grab(group='com.github.benfradet', module='struct-type-encoder_2.12', version='0.6.0')
)

SBT:

libraryDependencies += "com.github.benfradet" % "struct-type-encoder_2.12" % "0.6.0"

Leiningen:

[com.github.benfradet/struct-type-encoder_2.12 "0.6.0"]

Dependencies

compile (2)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.12.13
com.chuusai : shapeless_2.12 jar 2.3.3

provided (1)

Group / Artifact Type Version
org.apache.spark : spark-sql_2.12 jar 3.1.0

test (1)

Group / Artifact Type Version
org.scalatest : scalatest_2.12 jar 3.2.3

Project Modules

There are no modules declared in this project.

struct-type-encoder


Deriving Spark DataFrame schemas from case classes.

Installation

struct-type-encoder is available on Maven Central with the following coordinates:

"com.github.benfradet" %% "struct-type-encoder" % "0.6.0"

Motivation

When reading a DataFrame/Dataset from a data source, the schema of the data has to be inferred. In practice, this means looking at every record in all the files and coming up with a schema that satisfies every one of those records, as shown here for JSON.

As you can guess, this can be a very time-consuming task, especially if you already know the schema of your data in advance. A common pattern is the following:

case class MyCaseClass(a: Int, b: String, c: Double)
val inferred = spark
  .read
  .json("/some/dir/*.json")
  .as[MyCaseClass]

In this case, there is no need to spend time inferring the schema, since the DataFrame is converted directly to a Dataset of MyCaseClass. However, bypassing the inference by specifying your own schema involves a lot of boilerplate:

import org.apache.spark.sql.types._
val schema = StructType(
  StructField("a", IntegerType) ::
  StructField("b", StringType) ::
  StructField("c", DoubleType) :: Nil
)
val specified = spark
  .read
  .schema(schema)
  .json("/some/dir/*.json")
  .as[MyCaseClass]

struct-type-encoder derives instances of StructType (how Spark represents a schema) from your case class automatically:

import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val derived = spark
  .read
  .schema(StructTypeEncoder[MyCaseClass].encode)
  .json("/some/dir/*.json")
  .as[MyCaseClass]

No inference, no boilerplate!
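To see what gets derived, you can print the schema (a minimal sketch; treeString is Spark's built-in pretty-printer for StructType):

// prints the derived schema as an indented field tree
println(StructTypeEncoder[MyCaseClass].encode.treeString)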

Additional features

Spark Metadata support

It is possible to add Metadata information to StructFields with the Meta annotation:

import org.apache.spark.sql.types._
import ste._

val metadata = new MetadataBuilder()
  .putLong("foo", 10)
  .putString("bar", "baz")
  .build()

case class Foo(a: String, @Meta(metadata) b: Int)
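Encoding Foo then attaches the metadata to the annotated field. A sketch of the expected result, with the non-nullable defaults also visible in the Flatten example below:

StructTypeEncoder[Foo].encode
// StructType(
//   StructField("a", StringType, false) ::
//   StructField("b", IntegerType, false, metadata) :: Nil
// )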

Flattening schemas

Using the ste.Flatten annotation, we can eliminate repetition from case class definitions: nested structures are flattened into dotted top-level columns. Take the following example:

import ste._
case class Foo(a: String, b: Int)
case class Bar(
  @Flatten(2) a: Seq[Foo],
  @Flatten(1, Seq("x", "y")) b: Map[String, Foo],
  @Flatten c: Foo
)

StructTypeEncoder[Bar].encode

The derived schema is the following:

StructType(
  StructField("a.0.a", StringType, false) ::
  StructField("a.0.b", IntegerType, false) ::
  StructField("a.1.a", StringType, false) ::
  StructField("a.1.b", IntegerType, false) ::
  StructField("b.x.a", StringType, false) ::
  StructField("b.x.b", IntegerType, false) ::
  StructField("b.y.a", StringType, false) ::
  StructField("b.y.b", IntegerType, false) ::
  StructField("c.a", StringType, false) ::
  StructField("c.b", IntegerType, false) :: Nil
)
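A CSV file matching this flat schema would then carry one column per dotted field, e.g. a header line like:

a.0.a,a.0.b,a.1.a,a.1.b,b.x.a,b.x.b,b.y.a,b.y.b,c.a,c.b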

Now we want to read our data source with a flat schema:

import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val df = spark
  .read
  .schema(StructTypeEncoder[Bar].encode)
  .csv("/some/dir/*.csv")

struct-type-encoder can derive the nested projection of a DataFrame and convert it to a Dataset when provided with the target class:

import org.apache.spark.sql.Dataset
import ste.StructTypeSelector._

val ds: Dataset[Bar] = df.asNested[Bar]
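This round trip is particularly useful for flat formats such as CSV, which cannot represent nested structures natively: the data is stored as dotted columns on disk but manipulated as nested case classes in code.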

Benchmarks

This project includes JMH benchmarks showing how expensive it is to infer a schema satisfying all records. They compare the average time spent parsing a thousand files, each containing a hundred rows, when the schema is inferred by Spark versus derived by struct-type-encoder.

      derived           inferred
CSV   5.936 ± 0.035 s   6.494 ± 0.209 s
JSON  5.092 ± 0.048 s   6.019 ± 0.049 s

We see that deriving the schema cuts reading time by about 15.4% for JSON ((6.019 - 5.092) / 6.019) and about 8.6% for CSV ((6.494 - 5.936) / 6.494).

Versions

0.6.0
0.5.0
0.4.0