Terrier Spark (Scala)

A Spark wrapper for Terrier

Group: org.terrier
Artifact ID: terrier-spark
Latest version: 0.0.1
Type: jar
Website: https://github.com/terrier-org/terrier-spark/
Developer organization: University of Glasgow
Version control: https://github.com/terrier-org/terrier-spark

Download terrier-spark

How to add the latest version

Maven:
<!-- https://jarcasting.com/artifacts/org.terrier/terrier-spark/ -->
<dependency>
    <groupId>org.terrier</groupId>
    <artifactId>terrier-spark</artifactId>
    <version>0.0.1</version>
</dependency>

Gradle (Groovy):
// https://jarcasting.com/artifacts/org.terrier/terrier-spark/
implementation 'org.terrier:terrier-spark:0.0.1'

Gradle (Kotlin):
// https://jarcasting.com/artifacts/org.terrier/terrier-spark/
implementation ("org.terrier:terrier-spark:0.0.1")

Buildr:
'org.terrier:terrier-spark:jar:0.0.1'

Ivy:
<dependency org="org.terrier" name="terrier-spark" rev="0.0.1">
  <artifact name="terrier-spark" type="jar" />
</dependency>

Grape:
@Grapes(
    @Grab(group='org.terrier', module='terrier-spark', version='0.0.1')
)

SBT:
libraryDependencies += "org.terrier" % "terrier-spark" % "0.0.1"

Leiningen:
[org.terrier/terrier-spark "0.0.1"]

Dependencies

compile (8)

Library identifier                              Type  Version
org.apache.commons : commons-collections4       jar   4.1
org.apache.spark : spark-mllib_2.11             jar   2.1.0
org.terrier : terrier-rest-client               jar   5.0
org.terrier : terrier-core                      jar   5.0
org.terrier : terrier-learning                  jar   5.0
org.apache.commons : commons-math3              jar   3.4.1
com.github.bruneli.scalaopt : scalaopt-core_2.11  jar   0.2
org.terrier : terrier-concurrent                jar   5.0

test (1)

Library identifier                              Type  Version
org.scalatest : scalatest_2.11                  jar   2.2.1

Project Modules

This project has no modules.

Terrier-Spark

Terrier-Spark is a Scala library for Apache Spark that allows the Terrier.org information retrieval platform to be used from within Spark.

To use within a notebook, this requires Apache Toree to be installed and working.

Requirements:

  • Terrier 5.0
  • Apache Spark version 2.0 or newer
  • Jupyter & Apache Toree (optional)

Functionality

  • Retrieving a run from a Terrier index (local or remote)
  • Evaluating a run
  • Optimising the parameter of a retrieval run on a local index
  • Grid searching the parameter of a retrieval run on a local index
  • Learning a model using learning-to-rank

For known improvements/issues, see TODO.md
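
To make the grid-search item above concrete, here is a minimal sketch in plain Scala of what grid searching a single retrieval parameter amounts to. It does not use the terrier-spark API: gridSearch and toyEvaluate are hypothetical names introduced only for illustration, and toyEvaluate stands in for "run retrieval with this parameter value and compute an evaluation measure".

```scala
object GridSearchSketch {
  // Evaluate every candidate value and return the best value with its score.
  // `evaluate` is a hypothetical stand-in for running retrieval and scoring it.
  def gridSearch(candidates: Seq[Double])(evaluate: Double => Double): (Double, Double) = {
    require(candidates.nonEmpty, "need at least one candidate value")
    candidates.map(c => (c, evaluate(c))).maxBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Toy objective that peaks at 0.4; a real objective would be something
    // like mean NDCG over a set of topics for a given parameter value.
    val toyEvaluate: Double => Double = v => 1.0 - math.pow(v - 0.4, 2)
    val grid = (0 to 10).map(_ / 10.0) // 0.0, 0.1, ..., 1.0
    val (best, score) = gridSearch(grid)(toyEvaluate)
    println(s"best value = $best, score = $score")
  }
}
```

Grid search is exhaustive, so its cost grows with the number of candidate values; each evaluation here would correspond to a full retrieval run over the topics.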

Example

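// terrierHome, topicsFile, qrelsFile and model are assumed to be defined beforehand.

// Reference to an existing Terrier index
val indexref = IndexRef.of("/path/to/index/data.properties")

// Terrier properties shared by the components below
val props = Map(
    "terrier.home" -> terrierHome)

TopicSource.configureTerrier(props)
// Load the TREC topics into a DataFrame with columns "qid" and "query"
val topics = TopicSource.extractTRECTopics(topicsFile)
    .toList.toDF("qid", "query")

// Transformer that runs each query against the index
val queryTransform = new QueryingTransformer()
    .setTerrierProperties(props)
    .setIndexReference(indexref)
    .setSampleModel(model)

val r1 = queryTransform.transform(topics)
// r1 is a DataFrame with results for the queries in topics

// Transformer that attaches relevance labels from the qrels file
val qrelTransform = new QrelTransformer()
    .setQrelsFile(qrelsFile)

val r2 = qrelTransform.transform(r1)
// r2 is a DataFrame like r1, but also includes a label column

// Evaluate NDCG per query (the second argument, 20, is presumably the rank cutoff)
val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toList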
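
For readers unfamiliar with the measure used in the last step, the following is a minimal plain-Scala sketch of NDCG at a rank cutoff for a single query. It is not the library's implementation: it assumes the linear-gain form of DCG and, as a simplification, computes the ideal ranking only from the labels of the retrieved documents, whereas an evaluator would normally use all judged documents for the query. The names NdcgSketch, dcg and ndcg are introduced here for illustration.

```scala
object NdcgSketch {
  // Discounted cumulative gain over the first k labels, in ranked order.
  // Linear-gain form: label / log2(rank + 1), with ranks starting at 1.
  def dcg(labels: Seq[Int], k: Int): Double =
    labels.take(k).zipWithIndex.map { case (label, i) =>
      label / (math.log(i + 2.0) / math.log(2.0))
    }.sum

  // NDCG@k: DCG of the ranking divided by the DCG of the ideal (descending)
  // ordering of the same labels. Returns 0.0 when no label is positive.
  def ndcg(labels: Seq[Int], k: Int): Double = {
    val ideal = dcg(labels.sortBy(l => -l), k)
    if (ideal == 0.0) 0.0 else dcg(labels, k) / ideal
  }

  def main(args: Array[String]): Unit = {
    // Relevance labels of one query's results, in the order they were ranked
    val ranked = Seq(0, 2, 1, 0, 1)
    println(f"NDCG at cutoff 20 = ${ndcg(ranked, 20)}%.4f")
  }
}
```

A ranking that already lists its documents in descending label order scores 1.0 under this definition; placing relevant documents lower in the ranking reduces the score.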

More examples are provided in the example notebooks, or in our SIGIR 2018 demo paper [1].

Use from the Spark Shell

$ spark-shell --packages org.terrier:terrier-spark:0.0.1-SNAPSHOT

Use within a Jupyter Notebook

Firstly, make sure you have a working installation of Toree. Next, import Terrier and terrier-spark using some %AddDeps "magic":

%AddDeps org.terrier terrier-core 5.0 --transitive --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive

You can then use the terrier-spark code directly in your Scala notebooks.

We have provided several example notebooks.

Bibliography

If you use this software, please cite one of:

  1. Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines. Craig Macdonald. In Proceedings of SIGIR 2018.

  2. Agile Information Retrieval Experimentation with Terrier Notebooks. Craig Macdonald, Richard McCreadie, Iadh Ounis. In Proceedings of DESIRES 2018.

Credits

Developed by Craig Macdonald, University of Glasgow

Library versions

Version
0.0.1