Format: ALTO

Document Format plugin to support reading ALTO XML files

License

Categories

ORM Data
Group

uk.ac.gate.plugins
Identifier

format-alto
Latest version

1.1
Date

Type

jar
Description

Format: ALTO
Document Format plugin to support reading ALTO XML files
Developer organization

GATE
Version control system

https://github.com/GateNLP/gateplugin-Format_ALTO

Download format-alto

How to add the latest version

<!-- https://jarcasting.com/artifacts/uk.ac.gate.plugins/format-alto/ -->
<dependency>
    <groupId>uk.ac.gate.plugins</groupId>
    <artifactId>format-alto</artifactId>
    <version>1.1</version>
</dependency>
// https://jarcasting.com/artifacts/uk.ac.gate.plugins/format-alto/
implementation 'uk.ac.gate.plugins:format-alto:1.1'
// https://jarcasting.com/artifacts/uk.ac.gate.plugins/format-alto/
implementation ("uk.ac.gate.plugins:format-alto:1.1")
'uk.ac.gate.plugins:format-alto:jar:1.1'
<dependency org="uk.ac.gate.plugins" name="format-alto" rev="1.1">
  <artifact name="format-alto" type="jar" />
</dependency>
@Grapes(
@Grab(group='uk.ac.gate.plugins', module='format-alto', version='1.1')
)
libraryDependencies += "uk.ac.gate.plugins" % "format-alto" % "1.1"
[uk.ac.gate.plugins/format-alto "1.1"]

Dependencies

provided (1)

Library identifier Type Version
uk.ac.gate : gate-core jar 8.6

test (1)

Library identifier Type Version
uk.ac.gate : gate-plugin-test-utils jar 8.6

Project modules

This project has no modules.

GATE Support for ALTO XML documents

This plugin provides support for reading documents stored as ALTO XML. The format is usually used to store OCR-based transcriptions of documents, so it records the position of the text within the page as well as the text itself. It is popular among libraries and museums as a way of providing digital copies of scanned documents and manuscripts. For example, the British Library offers a number of collections of digitised books in this format.

The code provided by this plugin focuses purely on the text content of ALTO XML files and completely ignores the positional information. Specifically, it reads the String elements that appear within TextBlock elements that are within the PrintSpace of each page. This means that text in the header, footer, and margins is ignored. This is based on previous experience with processing multi-page formats (such as PDFs), where the header and footer make the processing of text that flows across pages exceptionally problematic. This may change in future versions.
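
The selection rule described above can be illustrated with a minimal sketch that uses only the standard Java XML APIs. This is not the plugin's actual implementation: the class name AltoTextSketch, the sample document, and the choice to join words with single spaces are assumptions made purely for illustration. The sketch collects the CONTENT attribute of every String element found inside a TextBlock inside a PrintSpace, so text in the margins is skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the plugin.
class AltoTextSketch {

    // Tiny hand-written ALTO-like sample: one word in the top margin,
    // two words in the print space.
    static final String SAMPLE =
        "<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v2#\"><Layout><Page>"
        + "<TopMargin><TextBlock><TextLine>"
        + "<String CONTENT=\"Header\"/>"
        + "</TextLine></TextBlock></TopMargin>"
        + "<PrintSpace><TextBlock><TextLine>"
        + "<String CONTENT=\"Hello\"/><SP/><String CONTENT=\"world\"/>"
        + "</TextLine></TextBlock></PrintSpace>"
        + "</Page></Layout></alto>";

    // String elements inside TextBlock elements inside PrintSpace.
    // local-name() is used so the ALTO namespace version does not matter.
    private static final String STRINGS_IN_PRINT_SPACE =
        "//*[local-name()='PrintSpace']"
        + "//*[local-name()='TextBlock']"
        + "//*[local-name()='String']";

    static String extractText(String altoXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(altoXml)));

            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(STRINGS_IN_PRINT_SPACE, doc, XPathConstants.NODESET);

            List<String> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // ALTO stores each recognised word in the CONTENT attribute.
                words.add(((Element) nodes.item(i)).getAttribute("CONTENT"));
            }
            // Assumption: words are joined with single spaces.
            return String.join(" ", words);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read ALTO XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText(SAMPLE));
    }
}
```

In this sketch the word in TopMargin never reaches the output because the XPath expression only matches descendants of PrintSpace. The real plugin may treat line breaks, hyphenation, and spacing differently.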

To activate the plugin once it is loaded, set the MIME type to application/xml+alto when loading documents.

uk.ac.gate.plugins

GateNLP

GATE - General Architecture for Text Engineering

Library versions

Version
1.1
1.0