PDF Extractor

Extract data and metadata from PDF files in a hierarchical JSON format.

License	The Apache License, Version 2.0
Categories	PDF Data
Group	com.beehyv
Artifact ID	pdf-extractor
Latest version	1.0
Date	Jun 12, 2020
Type	jar
Description	PDF Extractor: extract data and metadata from PDF files in a hierarchical JSON format.
Developer organization	BeeHyv Software Solutions Pvt Ltd

Download pdf-extractor

File name	Size
pdf-extractor-1.0.pom
pdf-extractor-1.0.jar	223 KB
pdf-extractor-1.0-sources.jar	182 KB
pdf-extractor-1.0-javadoc.jar	398 KB

How to add the latest version

Apache Maven

<!-- https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/ -->
<dependency>
    <groupId>com.beehyv</groupId>
    <artifactId>pdf-extractor</artifactId>
    <version>1.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
implementation 'com.beehyv:pdf-extractor:1.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
implementation ("com.beehyv:pdf-extractor:1.0")

Apache Buildr

'com.beehyv:pdf-extractor:jar:1.0'

Apache Ivy

<dependency org="com.beehyv" name="pdf-extractor" rev="1.0">
  <artifact name="pdf-extractor" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.beehyv', module='pdf-extractor', version='1.0')
)

Scala SBT

libraryDependencies += "com.beehyv" % "pdf-extractor" % "1.0"

Leiningen

[com.beehyv/pdf-extractor "1.0"]

Dependencies

compile (12)

Library identifier	Type	Version
commons-collections : commons-collections	jar	3.2.2
org.apache.commons : commons-lang3	jar	3.9
org.apache.pdfbox : pdfbox	jar	2.0.5
org.slf4j : slf4j-api	jar	2.0.0-alpha1
org.apache.httpcomponents : httpcore	jar	4.4.13
commons-io : commons-io	jar	2.6
org.codehaus.jackson : jackson-mapper-asl	jar	1.9.13
com.google.apis : google-api-services-vision	jar	v1-rev30-1.22.0
commons-configuration : commons-configuration	jar	1.10
com.google.guava : guava	jar	18.0
technology.tabula : tabula	jar	1.0.0
com.beehyv.munchbot : ingestion-model	jar	0.0.1

test (1)

Library identifier	Type	Version
junit : junit	jar	4.12

Project Modules

This project has no modules.

PDF Extractor

Powered by BeeHyv Software Solutions Pvt Ltd and distributed under the Apache 2.0 License

Overview

Extracting meaningful data from documents is a long-standing problem, and the many attempts made to date have had only partial success. The goal of this project is to extract data and metadata in a structured manner from any given PDF document.

Features

Table of contents: A TOC generally provides an overview of the content within the document. A PDF may or may not have a table of contents. This code uses a heuristic-based approach to extract a TOC from a PDF that doesn't have one.
Text: Entire text, or text from a particular page.
Sections: Splitting PDF content (text, images, tables) into sections can help to extract more relevant content.
Font information: Color, font type, font weight, font size, etc.
Tables: Table heading, rows, cells.
Images: Image files, text inside images.
Metadata of PDF: Author info, creation date, size, etc.
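
The README does not describe how the TOC heuristic works. As a rough illustration of what a heuristic of this kind can look like, the sketch below treats lines set in a larger font than the most common (body) font size as heading candidates. It is not the approach used by pdf-extractor; `HeadingHeuristic` and `TextLine` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT the heuristic implemented by pdf-extractor.
final class HeadingHeuristic {

    // Hypothetical stand-in for one extracted line of text and its font size.
    static final class TextLine {
        final String text;
        final float fontSize;

        TextLine(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Assumption: the most frequent font size in the document is the body text size.
    static float bodyFontSize(List<TextLine> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        float best = 0f;
        int bestCount = 0;
        for (TextLine line : lines) {
            int count = counts.merge(line.fontSize, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = line.fontSize;
            }
        }
        return best;
    }

    // Non-blank lines set in a larger font than the body text become heading candidates.
    static List<String> headingCandidates(List<TextLine> lines) {
        float body = bodyFontSize(lines);
        List<String> headings = new ArrayList<>();
        for (TextLine line : lines) {
            String text = line.text.trim();
            if (line.fontSize > body && !text.isEmpty()) {
                headings.add(text);
            }
        }
        return headings;
    }
}
```

A real document would need more signals (font weight, numbering patterns, position on the page) than this single font-size rule.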

Technologies Used

This library uses the PDF content stream engine of the Apache PDFBox® library to process the PDF file.
Tabula 1.2.1 (an open-source library) is used for table extraction.

Installation Instructions

Pre-requisites

Java (>1.6)
Maven

Installation and Setup

From Source

Clone the project.
Add an environment variable for the Tabula jar (used for table extraction and unit tests):
TABULA_JAR_LOCATION={Project-dir}/lib/tabula/tabula-0.9.1-jar-with-dependencies.jar
Run mvn clean install to install the library in your local Maven repository. The build can take some time (~15 minutes) because the project contains ~400 unit tests. To skip the tests, run with -DskipTests.
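
Because table extraction and the unit tests depend on TABULA_JAR_LOCATION, it can be useful to confirm the variable before starting a long build. The following is a minimal sketch of such a check; `TabulaJarCheck` is a hypothetical helper written for this example and is not part of pdf-extractor.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of pdf-extractor: verifies that
// TABULA_JAR_LOCATION is set and points to an existing file.
final class TabulaJarCheck {

    static final String ENV_NAME = "TABULA_JAR_LOCATION";

    // Returns the jar path only if the variable is set, non-blank, and names a regular file.
    static Optional<Path> resolve(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        Path jar = Paths.get(value.trim());
        return Files.isRegularFile(jar) ? Optional.of(jar) : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> jar = resolve(System.getenv());
        if (jar.isPresent()) {
            System.out.println(ENV_NAME + " -> " + jar.get());
        } else {
            System.out.println(ENV_NAME + " is unset or does not point to a file");
        }
    }
}
```

Passing the environment in as a map keeps the check easy to exercise without changing the real process environment.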

Import the pdf-extractor dependency to your project

Adding the Maven dependency

 <dependency>
    <groupId>com.beehyv</groupId>
    <artifactId>pdf-extractor</artifactId>
    <version>1.0</version>
 </dependency>

Adding the jar to the classpath
The pdf-extractor.jar file for the project can be found under {Project-dir}/target.

Run Extraction

Create a document object

HolmesPdfDocument pdfDocument = new HolmesPdfDocument(file);

Create an extractor instance

PdfBoxExtractor pdfBoxExtractor = new PdfBoxExtractor();

Extract text

String text = pdfBoxExtractor.getText(pdfDocument,startPage,endPage);

Extract images

pdfBoxExtractor.getImages(pdfDocument,startPage,endPage);

Extract tables

pdfBoxExtractor.getTabularData(pdfDocument,startPage,endPage);

Extract Structured Text

With this feature you can extract data in a structured manner. The data is extracted in sections, and the hierarchy of the sections is kept intact. All text, images, tables, and paragraphs are assigned to their respective sections, which gives the extracted data a structure and makes it more meaningful.

All this information resides in an InfoNode model.

InfoNode infoNode = pdfBoxExtractor.getStructuredText(pdfDocument);
- Sections infoNode.getSections()
- Paragraphs infoNode.getParagraphs()
- Content infoNode.getContent()
- Section Heading infoNode.getHeading()
- Section Images infoNode.getImageSections()
- Lines infoNode.getContentLineObjects()
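
A section hierarchy like this is usually consumed with a recursive, depth-first walk. The sketch below shows that pattern on a stand-in class: `SectionNode` and `OutlinePrinter` are hypothetical, and only the accessor names getHeading() and getSections() are borrowed from the list above. The README does not state the return types of the real InfoNode accessors, so the types used here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a section tree; it is NOT the library's InfoNode.
final class SectionNode {
    private final String heading;
    private final List<SectionNode> sections = new ArrayList<>();

    SectionNode(String heading) {
        this.heading = heading;
    }

    // Adds a child section and returns this node so calls can be chained.
    SectionNode add(SectionNode child) {
        sections.add(child);
        return this;
    }

    String getHeading() {
        return heading;
    }

    List<SectionNode> getSections() {
        return sections;
    }
}

final class OutlinePrinter {

    // Depth-first walk: each heading is indented two spaces per nesting level.
    static List<String> outline(SectionNode root) {
        List<String> lines = new ArrayList<>();
        collect(root, 0, lines);
        return lines;
    }

    private static void collect(SectionNode node, int depth, List<String> out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        out.add(indent + node.getHeading());
        for (SectionNode child : node.getSections()) {
            collect(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        SectionNode doc = new SectionNode("Report")
            .add(new SectionNode("Introduction").add(new SectionNode("Scope")))
            .add(new SectionNode("Results"));
        for (String line : outline(doc)) {
            System.out.println(line);
        }
    }
}
```

The same walk could collect paragraphs or images per section instead of headings, assuming the corresponding accessors return collections.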

Feature Request

Please use the GitHub Issues page to raise tickets for bugs as well as for feature requests and enhancements. The community can then take up the work as needed.

BeeHyv Software Solutions Pvt Ltd

Library versions

Version
1.0	Jun 12, 2020
