PDF Extractor

Extract data and metadata from PDF files in a hierarchical JSON format.

Categories: PDF Data
Group: com.beehyv
Artifact ID: pdf-extractor
Latest version: 1.0
Type: jar
Organization: BeeHyv Software Solutions Pvt Ltd

How to add the latest version

Maven:

    <!-- https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/ -->
    <dependency>
        <groupId>com.beehyv</groupId>
        <artifactId>pdf-extractor</artifactId>
        <version>1.0</version>
    </dependency>

Gradle (Groovy DSL):

    // https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
    implementation 'com.beehyv:pdf-extractor:1.0'

Gradle (Kotlin DSL):

    // https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
    implementation ("com.beehyv:pdf-extractor:1.0")

Buildr:

    'com.beehyv:pdf-extractor:jar:1.0'

Ivy:

    <dependency org="com.beehyv" name="pdf-extractor" rev="1.0">
      <artifact name="pdf-extractor" type="jar" />
    </dependency>

Grape:

    @Grapes(
    @Grab(group='com.beehyv', module='pdf-extractor', version='1.0')
    )

SBT:

    libraryDependencies += "com.beehyv" % "pdf-extractor" % "1.0"

Leiningen:

    [com.beehyv/pdf-extractor "1.0"]

Dependencies

compile (12)

Library identifier                             Type  Version
commons-collections : commons-collections      jar   3.2.2
org.apache.commons : commons-lang3             jar   3.9
org.apache.pdfbox : pdfbox                     jar   2.0.5
org.slf4j : slf4j-api                          jar   2.0.0-alpha1
org.apache.httpcomponents : httpcore           jar   4.4.13
commons-io : commons-io                        jar   2.6
org.codehaus.jackson : jackson-mapper-asl      jar   1.9.13
com.google.apis : google-api-services-vision   jar   v1-rev30-1.22.0
commons-configuration : commons-configuration  jar   1.10
com.google.guava : guava                       jar   18.0
technology.tabula : tabula                     jar   1.0.0
com.beehyv.munchbot : ingestion-model          jar   0.0.1

test (1)

Library identifier                             Type  Version
junit : junit                                  jar   4.12

Project Modules

This project has no modules.

PDF Extractor

BeeHyv Software Solutions Pvt Ltd

Powered by BeeHyv Software Solutions Pvt Ltd and distributed under the Apache 2.0 License

Overview

Extracting meaningful data from documents is a long-standing problem, and the many attempts made to date have had only partial success. The goal of this project is to extract data and metadata in a structured manner from any given PDF document.

Features

  • Table of contents : A TOC generally provides an overview of the content within a document. A PDF may or may not have one. For a PDF without a TOC, this library builds one using a heuristic-based approach.
  • Text : The entire text, or the text from a particular page
  • Sections : Splitting PDF content (text, images, tables) into sections helps to extract more relevant content.
  • Font information : Color, font type, font weight, font size, etc.
  • Tables : Table heading, rows, cells
  • Images : Image files, text inside images
  • Metadata of PDF : Author info, creation date, size, etc.
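
The README does not specify how the heuristic TOC extraction works. As a rough illustration of what a font-based heuristic can look like, the sketch below treats the most frequent font size as body text and flags any larger line as a heading candidate, with level 1 assigned to the largest size. The `TextLine` and `HeadingHeuristic` classes are hypothetical stand-ins and are not part of pdf-extractor; the library's actual heuristic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for a line of text with its font size; not a pdf-extractor class.
final class TextLine {
    final String text;
    final float fontSize;

    TextLine(String text, float fontSize) {
        this.text = text;
        this.fontSize = fontSize;
    }
}

// Illustrative font-size heuristic; an assumption, not the library's actual algorithm.
final class HeadingHeuristic {

    // Returns the font size that occurs on the most lines, assumed to be body text.
    static float bodyFontSize(List<TextLine> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        float best = 0f;
        int bestCount = 0;
        for (TextLine line : lines) {
            int count = counts.merge(line.fontSize, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = line.fontSize;
            }
        }
        return best;
    }

    // Builds entries such as "1: Introduction" for lines set larger than body text.
    // Level 1 is the largest font size, level 2 the next largest, and so on.
    static List<String> outline(List<TextLine> lines) {
        float body = bodyFontSize(lines);
        TreeSet<Float> headingSizes = new TreeSet<>();
        for (TextLine line : lines) {
            if (line.fontSize > body) {
                headingSizes.add(line.fontSize);
            }
        }
        List<Float> ranked = new ArrayList<>(headingSizes.descendingSet());
        List<String> entries = new ArrayList<>();
        for (TextLine line : lines) {
            int level = ranked.indexOf(line.fontSize) + 1;
            if (level > 0) {
                entries.add(level + ": " + line.text);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<TextLine> lines = Arrays.asList(
                new TextLine("Introduction", 18f),
                new TextLine("This report covers ...", 11f),
                new TextLine("Background", 14f),
                new TextLine("Earlier work ...", 11f),
                new TextLine("More body text ...", 11f));
        for (String entry : outline(lines)) {
            System.out.println(entry);
        }
    }
}
```

A real document would need more signals than font size alone (font weight, numbering patterns, position on the page), which is presumably why the project describes its approach as heuristic.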

Technologies Used

  • This library uses the Apache PDFBox® library's PDF content stream engine to stream the PDF file.
  • Tabula 1.2.1 (an open-source library) is used for table extraction.

Installation Instructions

Pre-requisites

  • Java (>1.6)
  • Maven

Installation and Setup

From Source

  • Clone the project.
  • Add an environment variable for the Tabula jar (used for table extraction and unit tests):
    TABULA_JAR_LOCATION={Project-dir}/lib/tabula/tabula-0.9.1-jar-with-dependencies.jar
  • Run mvn clean install to install it in your local environment. This may take some time (~15 minutes), as the project has ~400 unit tests. To skip the tests, run with -DskipTests.
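
Since the build and the table-extraction tests rely on TABULA_JAR_LOCATION, it can help to confirm the variable before starting a long build. The following is a minimal sketch using only the JDK; the `TabulaJarCheck` class and its messages are illustrative assumptions and are not part of the project.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check; not part of pdf-extractor.
final class TabulaJarCheck {

    // Returns null when the value points at a readable file,
    // otherwise a short description of the problem.
    static String problemWith(String location) {
        if (location == null || location.trim().isEmpty()) {
            return "TABULA_JAR_LOCATION is not set";
        }
        Path jar = Paths.get(location.trim());
        if (!Files.isRegularFile(jar)) {
            return "No file found at " + jar;
        }
        if (!Files.isReadable(jar)) {
            return "File is not readable: " + jar;
        }
        return null;
    }

    public static void main(String[] args) {
        // Reads the same variable name the installation steps ask you to define.
        String problem = problemWith(System.getenv("TABULA_JAR_LOCATION"));
        System.out.println(problem == null ? "Tabula jar found" : problem);
    }
}
```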

Import the pdf-extractor dependency to your project

  • Adding the Maven dependency
    <dependency>
        <groupId>com.beehyv</groupId>
        <artifactId>pdf-extractor</artifactId>
        <version>1.0</version>
    </dependency>
  • Adding the jar to the classpath
    The pdf-extractor.jar file for the project can be found under {Project-dir}/target.

Run Extraction

  • Create a document object

    HolmesPdfDocument pdfDocument = new HolmesPdfDocument(file);
  • Create an extractor instance

    PdfBoxExtractor pdfBoxExtractor = new PdfBoxExtractor();
  • Extract text

    String text = pdfBoxExtractor.getText(pdfDocument, startPage, endPage);
  • Extract images

    pdfBoxExtractor.getImages(pdfDocument, startPage, endPage);
  • Extract tables

    pdfBoxExtractor.getTabularData(pdfDocument, startPage, endPage);
  • Extract Structured Text

    This feature extracts data in a structured manner. The data is extracted in sections, and the hierarchy of the sections is preserved. Text, images, tables, and paragraphs are all assigned to their respective sections, which gives the extracted data structure and makes it more meaningful.

    All this information resides in an InfoNode model.

    InfoNode infoNode = pdfBoxExtractor.getStructuredText(pdfDocument);

    • Sections : infoNode.getSections()
    • Paragraphs : infoNode.getParagraphs()
    • Content : infoNode.getContent()
    • Section Heading : infoNode.getHeading()
    • Section Images : infoNode.getImageSections()
    • Lines : infoNode.getContentLineObjects()
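
A section hierarchy like this lends itself to a recursive walk. The README does not show the return types of the InfoNode getters, so the sketch below uses a hypothetical stand-in class, `SectionNode`, holding only a heading, content, and child sections, to show how a nested section tree can be flattened into an indented outline. With the real library, the analogous calls would presumably be getHeading() and getSections(), but that mapping is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a section in the hierarchy; not the library's InfoNode.
final class SectionNode {
    final String heading;
    final String content;
    final List<SectionNode> sections = new ArrayList<>();

    SectionNode(String heading, String content) {
        this.heading = heading;
        this.content = content;
    }

    // Adds a child section and returns this node so calls can be chained.
    SectionNode add(SectionNode child) {
        sections.add(child);
        return this;
    }
}

final class OutlinePrinter {

    // Collects one line per section, indented two spaces per nesting level.
    static List<String> outline(SectionNode root) {
        List<String> lines = new ArrayList<>();
        collect(root, 0, lines);
        return lines;
    }

    // Depth-first walk: record this heading, then recurse into child sections.
    private static void collect(SectionNode node, int depth, List<String> lines) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        lines.add(indent + node.heading);
        for (SectionNode child : node.sections) {
            collect(child, depth + 1, lines);
        }
    }

    public static void main(String[] args) {
        SectionNode root = new SectionNode("Report", "")
                .add(new SectionNode("Introduction", "Why this matters.")
                        .add(new SectionNode("Scope", "What is covered.")))
                .add(new SectionNode("Results", "What we found."));
        for (String line : outline(root)) {
            System.out.println(line);
        }
    }
}
```

The same depth-first walk could be used to serialize the tree, for example into the hierarchical JSON mentioned in the project description, by emitting each node's fields instead of an indented heading.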

Feature Request

For new feature requests, please use the GitHub Issues page, where tickets can be raised for bugs as well as enhancements. The community can then take up the work as needed.

Library Versions

Version
1.0