Corc Test

Corc test commons

License	Apache License, Version 2.0
Group	com.hotels
Artifact ID	corc-test
Latest version	3.0.0
Date	Jan 3, 2020
Type	jar
Description	Corc test commons
Organization	Hotels.com (Data Platform Team)

Download corc-test

File name	Size
corc-test-3.0.0.pom
corc-test-3.0.0.jar	4 KB
corc-test-3.0.0-sources.jar	3 KB
corc-test-3.0.0-javadoc.jar	45 KB
Overview

How to add the latest version

Apache Maven

<!-- https://jarcasting.com/artifacts/com.hotels/corc-test/ -->
<dependency>
    <groupId>com.hotels</groupId>
    <artifactId>corc-test</artifactId>
    <version>3.0.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.hotels/corc-test/
implementation 'com.hotels:corc-test:3.0.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.hotels/corc-test/
implementation ("com.hotels:corc-test:3.0.0")

Apache Buildr

'com.hotels:corc-test:jar:3.0.0'

Apache Ivy

<dependency org="com.hotels" name="corc-test" rev="3.0.0">
  <artifact name="corc-test" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.hotels', module='corc-test', version='3.0.0')
)

Scala SBT

libraryDependencies += "com.hotels" % "corc-test" % "3.0.0"

Leiningen

[com.hotels/corc-test "3.0.0"]

Dependencies

provided (12)

Library identifier	Type	Version
org.apache.hadoop : hadoop-common	jar	2.6.0
org.apache.hadoop : hadoop-mapreduce-client-common	jar	2.6.0
org.apache.hadoop : hadoop-mapreduce-client-core	jar	2.6.0
org.apache.hadoop : hadoop-yarn-api	jar	2.6.0
org.apache.hive : hive-common	jar	2.3.4
org.apache.hive : hive-exec	jar	2.3.4
org.apache.hive : hive-serde	jar	2.3.4
com.esotericsoftware.kryo : kryo	jar	2.22
com.google.protobuf : protobuf-java	jar	2.5.0
org.slf4j : slf4j-api	jar	1.7.9
junit : junit	jar	4.11
org.hamcrest : hamcrest-core	jar	1.3

test (2)

Library identifier	Type	Version
org.mockito : mockito-core	jar	1.9.5
xerces : xercesImpl	jar	2.11.0

Project modules

This project has no modules.

   O~~~   O~~    O~ O~~~   O~~~
 O~~    O~~  O~~  O~~    O~~   
O~~    O~~    O~~ O~~   O~~    
 O~~    O~~  O~~  O~~    O~~   
   O~~~   O~~    O~~~      O~~~

Use corc to read and write data in the Optimized Row Columnar (ORC) file format in your Cascading applications. The reading of ACID datasets is also supported.

Start using

You can obtain corc from Maven Central.

Cascading Dependencies

Corc has been built and tested against Cascading 3.3.0.

Hive Dependencies

Corc is built with Hive 2.3.4. Several dependencies will need to be included when using Corc:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.4</version>
  <classifier>core</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-serde</artifactId>
  <version>2.3.4</version>
</dependency>
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo</artifactId>
  <version>2.22</version>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.5.0</version>
</dependency>

Overview

Supported types

Hive	Cascading/Java
STRING	String
BOOLEAN	Boolean
TINYINT	Byte
SMALLINT	Short
INT	Integer
BIGINT	Long
FLOAT	Float
DOUBLE	Double
TIMESTAMP	java.sql.Timestamp
DATE	java.sql.Date
BINARY	byte[]
CHAR	String (HiveChar)
VARCHAR	String (HiveVarchar)
DECIMAL	BigDecimal (HiveDecimal)
ARRAY	List<Object>
MAP	Map<Object, Object>
STRUCT	List<Object>
UNIONTYPE	Sub-type

Constructing an `OrcFile` instance

OrcFile provides two public constructors: one for sourcing and one for sinking. These exist mainly to give flexibility to anyone who wishes to extend the class. For normal use, construct an instance via the SourceBuilder and SinkBuilder classes.

SourceBuilder

Create a builder:

SourceBuilder builder = OrcFile.source();

Specify the fields that should be read. If the declared schema is a subset of the complete schema, then column projection will occur:

builder.declaredFields(fields);
// or
builder.columns(structTypeInfo);
// or
builder.columns(structTypeInfoString);

Specify the complete schema of the underlying ORC Files. This is only required for reading ORC Files that back a transactional Hive table. The default behaviour should be to obtain the schema from the ORC Files being read:

builder.schemaFromFile();
// or
builder.schema(fields);
// or
builder.schema(structTypeInfo);
// or
builder.schema(structTypeInfoString);

ORC Files support predicate pushdown. This allows whole row groups to be skipped if they do not contain any rows that match the given SearchArgument:

Fields message = new Fields("message", String.class);
SearchArgument searchArgument = SearchArgumentFactory.newBuilder()
    .startAnd()
    .equals(message, "hello")
    .end()
    .build();

builder.searchArgument(searchArgument);
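
To illustrate why a SearchArgument lets whole row groups be skipped, here is a minimal sketch in plain Java. It does not use the Corc or ORC APIs: the RowGroupStats class and the groupsToRead method are hypothetical names invented for this example, and it assumes each row group carries minimum and maximum statistics for the column being filtered. A group whose range cannot contain the searched value never needs to be read.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupSkipSketch {

    /** Hypothetical per-row-group statistics for one string column (not an ORC or Corc class). */
    static final class RowGroupStats {
        final String min;
        final String max;

        RowGroupStats(String min, String max) {
            this.min = min;
            this.max = max;
        }

        /** A group can contain the value only if min <= value <= max. */
        boolean mightContain(String value) {
            return min.compareTo(value) <= 0 && max.compareTo(value) >= 0;
        }
    }

    /** Returns the indexes of the row groups that still have to be read for an equality predicate. */
    static List<Integer> groupsToRead(List<RowGroupStats> groups, String value) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            if (groups.get(i).mightContain(value)) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<RowGroupStats> groups = new ArrayList<>();
        groups.add(new RowGroupStats("apple", "banana"));
        groups.add(new RowGroupStats("grape", "kiwi"));
        groups.add(new RowGroupStats("lemon", "peach"));

        // Only the second group (index 1) has a range that can hold "hello".
        System.out.println(groupsToRead(groups, "hello")); // prints [1]
    }
}
```

The statistics only rule groups out. A selected group may still contain no matching rows, so the rows that are read may need filtering again downstream.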

When passing objects to the SearchArgument.Builder, care should be taken to choose the correct type:

Hive	Java
STRING	String
BOOLEAN	Boolean
TINYINT	Byte
SMALLINT	Short
INT	Integer
BIGINT	Long
FLOAT	Float
DOUBLE	Double
TIMESTAMP	java.sql.Timestamp
DATE	org.apache.hadoop.hive.serde2.io.DateWritable
CHAR	String (HiveChar)
VARCHAR	String (HiveVarchar)
DECIMAL	BigDecimal

When reading ORC Files that back a transactional Hive table, include the VirtualColumn#ROWID ("ROW__ID") virtual column. The column will be prepended to the record's Fields:

builder.prependRowId();

Finally, build the OrcFile:

OrcFile orcFile = builder.build();

SinkBuilder

OrcFile orcFile = OrcFile.sink()
    .schema(schema)
    .build();

The schema parameter can be one of Fields, StructTypeInfo or the String representation of the StructTypeInfo. When providing a Fields instance, care must be taken when deciding how best to specify the types as there is no one-to-one bidirectional mapping between Cascading types and Hive types. The TypeInfo is able to represent richer, more complex types. Consider your ORC File schema and the mappings to Fields types carefully.
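
The lack of a one-to-one mapping can be seen with a small sketch in plain Java. It copies a few rows from the supported-types table into a map and looks the mapping up in reverse. The class and method names are hypothetical and are not part of Corc.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TypeMappingSketch {

    /** Hive type name -> Java class, copied from a few rows of the supported-types table. */
    static Map<String, Class<?>> hiveToJava() {
        Map<String, Class<?>> mapping = new LinkedHashMap<>();
        mapping.put("STRING", String.class);
        mapping.put("CHAR", String.class);
        mapping.put("VARCHAR", String.class);
        mapping.put("INT", Integer.class);
        mapping.put("BIGINT", Long.class);
        return mapping;
    }

    /** Lists every Hive type whose Java representation is the given class. */
    static List<String> hiveCandidates(Class<?> javaType) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Class<?>> entry : hiveToJava().entrySet()) {
            if (entry.getValue().equals(javaType)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // A String field could correspond to three different Hive types.
        System.out.println(hiveCandidates(String.class)); // prints [STRING, CHAR, VARCHAR]
        // A Long field has a single candidate among the rows used here.
        System.out.println(hiveCandidates(Long.class));   // prints [BIGINT]
    }
}
```

When the Hive type matters, for example CHAR versus VARCHAR, a StructTypeInfo or its String representation states it explicitly, whereas a Fields instance carries only the Java type.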

Constructing a `StructTypeInfo` instance

List<String> names = new ArrayList<>();
names.add("col0");
names.add("col1");

List<TypeInfo> typeInfos = new ArrayList<>();
typeInfos.add(TypeInfoFactory.stringTypeInfo);
typeInfos.add(TypeInfoFactory.longTypeInfo);

StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(names, typeInfos);

or...

String typeString = "struct<col0:string,col1:bigint>";

StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoUtils.getTypeInfoFromTypeString(typeString);

or, via the convenience builder...

StructTypeInfo structTypeInfo = new StructTypeInfoBuilder()
    .add("col0", TypeInfoFactory.stringTypeInfo)
    .add("col1", TypeInfoFactory.longTypeInfo)
    .build();
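
If the column names and Hive type names are only known at run time, the type string form can be assembled with plain Java before it is handed to TypeInfoUtils or to one of the builder methods that accept a String. The sketch below is a hypothetical helper, not part of Corc or Hive, and it assumes the column names and type names are already valid Hive identifiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class StructTypeStringSketch {

    /** Builds a Hive struct type string from ordered column name -> Hive type name pairs. */
    static String toStructTypeString(Map<String, String> columns) {
        StringJoiner joiner = new StringJoiner(",", "struct<", ">");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            joiner.add(column.getKey() + ":" + column.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the column order is preserved.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("col0", "string");
        columns.put("col1", "bigint");

        System.out.println(toStructTypeString(columns)); // prints struct<col0:string,col1:bigint>
    }
}
```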

Reading transactional Hive tables

Corc also supports the reading of ACID datasets that underpin transactional Hive tables. However, for this to work effectively with an active Hive table you must provide your own lock management. We intend to make this functionality available in the cascading-hive project. When reading the data you may optionally include the virtual RecordIdentifier column, also known as the ROW__ID column, with one of the following approaches:

Add a field named 'ROW__ID' to your Fields definition. This must be of type org.apache.hadoop.hive.ql.io.RecordIdentifier. For convenience you can use the constant OrcFile#ROW__ID with some fields arithmetic: Fields fieldsWithRowId = Fields.join(OrcFile.ROW__ID, myFields);
Use the OrcFile.source().prependRowId() option. Be sure to exclude the RecordIdentifier column from your typeInfo instance. The ROW__ID field will be added to your tuple stream automatically.

Usage

OrcFile can be used with Hfs, just like TextDelimited.

OrcFile orcFile = ...
String path = ...
Hfs hfs = new Hfs(orcFile, path);

Credits

Created by Dave Maughan & Elliot West, with thanks to: Patrick Duin, James Grant & Adrian Woodhead.

Legal

This project is available under the Apache 2.0 License.

Hotels.com

Hotels.com open source contributions

Library versions

Version	Date
3.0.0	Jan 3, 2020
2.0.3	Feb 17, 2016
2.0.2	Dec 4, 2015
2.0.1	Dec 2, 2015
2.0.0	Oct 27, 2015
1.1.0	Aug 28, 2015
1.0.0	May 15, 2015