DKPro C4CorpusTools - Hadoop

Parent POM for DKPro projects. This provides some basic configuration for several Maven plugins as well as useful build profiles.

Categories: DKPro, Application Libraries, Machine Learning, Natural Language Processing
Group: org.dkpro.c4corpus
Artifact ID: dkpro-c4corpus-hadoop
Latest version: 1.0.0
Type: jar

How to include the latest version

Maven:

<!-- https://jarcasting.com/artifacts/org.dkpro.c4corpus/dkpro-c4corpus-hadoop/ -->
<dependency>
    <groupId>org.dkpro.c4corpus</groupId>
    <artifactId>dkpro-c4corpus-hadoop</artifactId>
    <version>1.0.0</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/org.dkpro.c4corpus/dkpro-c4corpus-hadoop/
implementation 'org.dkpro.c4corpus:dkpro-c4corpus-hadoop:1.0.0'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/org.dkpro.c4corpus/dkpro-c4corpus-hadoop/
implementation("org.dkpro.c4corpus:dkpro-c4corpus-hadoop:1.0.0")

Buildr:

'org.dkpro.c4corpus:dkpro-c4corpus-hadoop:jar:1.0.0'

Ivy:

<dependency org="org.dkpro.c4corpus" name="dkpro-c4corpus-hadoop" rev="1.0.0">
  <artifact name="dkpro-c4corpus-hadoop" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='org.dkpro.c4corpus', module='dkpro-c4corpus-hadoop', version='1.0.0')
)

sbt:

libraryDependencies += "org.dkpro.c4corpus" % "dkpro-c4corpus-hadoop" % "1.0.0"

Leiningen:

[org.dkpro.c4corpus/dkpro-c4corpus-hadoop "1.0.0"]

Dependencies

compile (5)

Library identifier                                 Type  Version
org.dkpro.c4corpus : dkpro-c4corpus-boilerplate    jar   1.0.0
org.dkpro.c4corpus : dkpro-c4corpus-deduplication  jar   1.0.0
org.dkpro.c4corpus : dkpro-c4corpus-language       jar   1.0.0
org.dkpro.c4corpus : dkpro-c4corpus-license        jar   1.0.0
org.dkpro.c4corpus : dkpro-c4corpus-warc-io        jar   1.0.0

provided (2)

Library identifier                 Type  Version
org.apache.hadoop : hadoop-client  jar   2.6.0
org.apache.hadoop : hadoop-common  jar   2.6.0

test (3)

Library identifier          Type  Version
org.apache.mrunit : mrunit  jar   1.1.0
org.easymock : easymock     jar   3.4
junit : junit               jar   4.12

Project Modules

This project has no modules.

DKPro C4CorpusTools

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

  • DKPro C4CorpusTools (or C4CorpusTools) refers to the project source code
  • C4Corpus refers to the preprocessed CommonCrawl data set (C4 = Creative Commons from Common Crawl)
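
To give a feel for what near-duplicate removal, one of the steps named above, involves, here is a minimal, self-contained sketch of a SimHash-style fingerprint comparison. It is an illustration only: this page does not say which algorithm C4CorpusTools uses, and the class name `SimHashSketch`, the FNV-1a token hash, and the whitespace tokenization are all assumptions made for the example, not part of the project's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration; not taken from the C4CorpusTools code base.
class SimHashSketch {

    // 64-bit FNV-1a hash of a single token (assumed hash choice).
    static long hash64(String token) {
        long h = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Builds a 64-bit fingerprint: every token votes +1/-1 on each bit,
    // and a bit of the fingerprint is set when its vote total is positive.
    static long simHash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            long h = hash64(token);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) {
                fingerprint |= (1L << i);
            }
        }
        return fingerprint;
    }

    // Number of differing bits between two fingerprints.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simHash("Creative Commons licensed pages from Common Crawl");
        long b = simHash("Creative Commons licensed pages taken from Common Crawl");
        long c = simHash("An unrelated sentence about build tools and plugins");
        System.out.println("similar texts:   " + hammingDistance(a, b) + " differing bits");
        System.out.println("unrelated texts: " + hammingDistance(a, c) + " differing bits");
    }
}
```

In a sketch like this, two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold itself is a tuning decision and is not specified here.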

Please use the following citation if you use C4Corpus or C4CorpusTools:

@InProceedings{Habernal.et.al.2016.LREC,
  author    = {Habernal, Ivan and Zayed, Omnia and Gurevych, Iryna},
  title     = {{C4Corpus: Multilingual Web-size Corpus with Free License}},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources
               and Evaluation (LREC 2016)},
  pages     = {914--922},
  month     = {May},
  year      = {2016},
  address   = {Portoro\v{z}, Slovenia},
  publisher = {European Language Resources Association (ELRA)},
  editor    = {Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Marko Grobelnik
               and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk
               and Stelios Piperidis},
  isbn      = {978-2-9517408-9-1},
  url       = {http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf}
}

The full LREC article is available at the UKP website.

Consult the official C4CorpusTools documentation, which contains:

  • C4Corpus User's Guide
    • How to access C4Corpus at S3
    • Running boilerplate removal outside Hadoop
    • Examples of simple search in C4Corpus
  • C4Corpus Developer's Guide
    • How to run the full processing pipeline on CommonCrawl
  • Corpus statistics reported in the LREC article

As of May 2017, thanks to CommonCrawl, the C4Corpus is hosted in their S3 bucket. This makes it much easier to access the data over HTTP (see the documentation).

Library versions

Version
1.0.0