pdfbox-text-extractor

A framework for extracting text from different types of files, e.g. PDFs, images, office documents, text files, ...

License	The Apache License, Version 2.0
Categories	Network PDFBox Application libraries PDF Data
Group	net.dankito.text.extraction
Artifact ID	pdfbox-text-extractor
Latest version	0.6.0
Date	Nov 6, 2020
Type	module
Description	A framework for extracting text from different types of files, e.g. PDFs, images, office documents, text files, ...
Website	https://github.com/dankito/TextExtraction
Version control	https://github.com/dankito/TextExtraction

Download pdfbox-text-extractor

File name	Size
pdfbox-text-extractor-0.6.0.pom
pdfbox-text-extractor-0.6.0.module	4 KB
pdfbox-text-extractor-0.6.0-sources.jar	2 KB
pdfbox-text-extractor-0.6.0-javadoc.jar	261 bytes

Dependencies

compile (1)

Library identifier	Type	Version
net.dankito.text.extraction : text-extractor-common	jar	0.6.0

runtime (2)

Library identifier	Type	Version
org.apache.pdfbox : pdfbox	jar	2.0.19
io.github.jonathanlink : PDFLayoutTextStripper	jar	2.2.3

Project modules

This project has no modules.

Text Extraction

A modular framework for extracting text from many different sources (websites, PDFs, images).

Text Extractors comparison

PDF

There are two types of PDF:

"Image only" PDFs embed only (scanned) images and contain no selectable, and therefore no extractable, text. To get at the text, the images first have to be extracted from the PDF and then run through OCR. See section Images.
Searchable PDFs: if you open them in a PDF viewer, you can select their text or search it. The following libraries help to extract text from this type of PDF:

Searchable PDFs

Extractor	Permissive License	Runs on Android	Advantages	Disadvantages
pdftotext	✔️	❌	Best PDF extraction result so far	User has to install Poppler Utils; does not run on Android
iText 2	✔️	✔️	Also works with PDFs with disordered layouts; best PDF extraction result of any Java library I found; works on older Androids (at least on Android 4.1); almost the same text extraction quality as the newer (and non-free) iText 7
iText	❌	✔️	Also works with PDFs with disordered layouts; best PDF extraction result of any Java library I found; works on older Androids (at least on Android 4.1)	Not free (AGPL / commercial license)
OpenPDF	✔️	( ✔️ )	Free; quite good and fast	Does not work on PDFs with disordered layouts; does not run on older Androids (uses Java 8 features such as Optional; works on Android 6 but not on Android 4.1, others not tested)
PDFBox (not added yet)	✔️	❌
PdfBox-Android (not added yet)	✔️	✔️

iText 2 and iText 7

iText 2 is the older, permissively licensed version of iText, which later turned commercial. Because the last free iText version, 2.1.7, has security flaws, I used version 2.1.7.js7 from JasperReports, which fixes them. It's slower than iText 7, but in terms of text extraction quality I cannot see any difference between iText 7 and iText 2.

OpenPdf

OpenPdf took the last permissively licensed commit of iText and developed it further. In my experience, however, its text extraction is worse than that of iText 7 and iText 2.

Do not add OpenPdfPdfTextExtractor and iText2PdfTextExtractor to the classpath at the same time: both underlying libraries use the same package and class names but with different method and class signatures, so one of them will crash at runtime.
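
The classpath conflict described above can be checked for at startup. The following is a minimal sketch, not part of the framework: the class name DuplicateClassCheck is hypothetical, and com.lowagie.text.pdf.PdfReader is assumed to be one of the classes that both iText 2 and OpenPdf ship. It asks the class loader for every location that provides a given class and warns if there is more than one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of TextExtraction.
final class DuplicateClassCheck {

    // Converts "com.example.Foo" to the resource path "com/example/Foo.class".
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns every classpath location that provides the given class.
    static List<URL> findCopies(String className) {
        ClassLoader loader = DuplicateClassCheck.class.getClassLoader();
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: this class exists in both iText 2 and OpenPdf.
        List<URL> copies = findCopies("com.lowagie.text.pdf.PdfReader");
        if (copies.size() > 1) {
            System.err.println("Conflicting copies on the classpath: " + copies);
        } else {
            System.out.println("No conflict detected (" + copies.size() + " copy found).");
        }
    }
}
```

Such a check only reports the conflict; the fix is still to keep just one of the two extractors on the classpath.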

(Very opinionated) Recommendation

If you can use pdftotext (Poppler), use pdftotext. It yields the best results both in terms of text extraction quality and speed.

Otherwise use the security-fixed version of iText 2. It's slower than the commercial (and really amazingly good) iText 7, but in terms of text extraction quality I cannot see any difference between iText 2 and iText 7.

I don't know why, but OpenPdf cannot extract any text at all from some PDFs.
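
To try the pdftotext recommendation from the JVM without this framework, the Poppler command line tool can be started as a child process. This is only a sketch under assumptions: pdftotext must be installed and on the PATH, the class name PdftotextSketch is made up for illustration, and it is not how the framework's own pdftotext extractor is necessarily implemented. It needs Java 9 or newer because of InputStream.readAllBytes().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TextExtraction.
final class PdftotextSketch {

    // "-layout" keeps the physical layout, "-enc UTF-8" sets the output encoding,
    // and "-" as the text file makes pdftotext write to stdout.
    static List<String> buildCommand(Path pdf) {
        return Arrays.asList("pdftotext", "-layout", "-enc", "UTF-8", pdf.toString(), "-");
    }

    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show pdftotext errors on our stderr
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: PdftotextSketch <file.pdf>");
            return;
        }
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```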

How to distinguish between Searchable and "Image only" PDFs?

Kurt Pfeifle gave a superb hint (https://stackoverflow.com/a/3108531): check how many fonts a PDF uses. If it uses fonts, it contains searchable text. If it uses no fonts at all, it contains only images.

I added IPdfTypeDetector implementations for Poppler / pdffonts and ...
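
The font-count heuristic can be sketched with Poppler's pdffonts tool. The code below is an illustration, not the framework's IPdfTypeDetector implementation: the class name PdfFontsSketch is hypothetical, pdffonts is assumed to be on the PATH, and the parser assumes the usual pdffonts output of a column header, a dashed separator line and then one line per font. It needs Java 9 or newer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of TextExtraction.
final class PdfFontsSketch {

    // Counts the non-empty lines that follow the dashed separator line.
    static int countFonts(String pdffontsOutput) {
        int count = 0;
        boolean pastHeader = false;
        for (String line : pdffontsOutput.split("\\R")) {
            if (!pastHeader) {
                pastHeader = line.startsWith("---");
            } else if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // A PDF that uses at least one font is treated as searchable.
    static boolean isSearchable(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdf.toString())
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("pdffonts failed for " + pdf);
        }
        return countFonts(output) > 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: PdfFontsSketch <file.pdf>");
            return;
        }
        System.out.println(isSearchable(Paths.get(args[0])) ? "searchable" : "image only");
    }
}
```

Note that the heuristic can misclassify mixed documents, for example a scanned PDF with a single stamped text line.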

Images

(All variants with Tesseract 4 have the same extraction quality, which is quite good but not the best.)

Extractor	Advantages	Disadvantages
tess4j	Uses Tesseract 4	User has to install Tesseract; extraction result depends a lot on image quality; does not run on Android
Tesseract 4 over JNI (e.g. from Bytedeco)	Uses Tesseract 4	An exception in native code crashes the whole application (JNI); user has to install Tesseract; extraction result depends a lot on image quality; does not run on Android
Tesseract4Android	Uses Tesseract 4	Very slow, took 2 minutes to recognize a single image (0.5 MB); extraction result depends a lot on image quality
Tess4Android	Uses Tesseract 4	Couldn't get it to compile
TextFairy (not added yet)		Uses Tesseract 3; quite slow; extraction result depends a lot on image quality
Microsoft Cloud Computer Vision API OCR (not implemented yet)	Best image extraction result I found so far	Requires registration (credit card required; every single user has to do this themselves); costs $1.50 per 1000 images (see); data protection insanity, stores all your images and recognized text for years
Google Cloud Vision OCR (neither implemented nor tested yet)		Requires registration (credit card required; every single user has to do this themselves); 1000 images per month are free, more have to be paid for (see); data protection insanity, stores all your images and recognized text for years
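
For "image only" PDFs, the two steps described in the PDF section (extract the images, then apply OCR) can be sketched on the desktop with command line tools. The sketch assumes Poppler's pdfimages and the tesseract executable are installed and on the PATH; the class name ImageOnlyPdfOcrSketch and the "page" file prefix are made up for illustration, and this is not how the extractors in the table are implemented. It needs Java 9 or newer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of TextExtraction.
final class ImageOnlyPdfOcrSketch {

    // pdfimages writes the embedded images as <prefix>-000.png, <prefix>-001.png, ...
    static List<String> pdfimagesCommand(Path pdf, Path outputPrefix) {
        return Arrays.asList("pdfimages", "-png", pdf.toString(), outputPrefix.toString());
    }

    // "stdout" as the output base makes tesseract print the recognized text.
    static List<String> tesseractCommand(Path image) {
        return Arrays.asList("tesseract", image.toString(), "stdout");
    }

    // Runs a command and returns what it printed to stdout.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("command failed: " + command);
        }
        return output;
    }

    static String ocrPdf(Path pdf) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("pdf-images");
        run(pdfimagesCommand(pdf, dir.resolve("page")));
        List<Path> images;
        try (Stream<Path> files = Files.list(dir)) {
            images = files.sorted().collect(Collectors.toList()); // keep page order
        }
        StringBuilder text = new StringBuilder();
        for (Path image : images) {
            text.append(run(tesseractCommand(image)));
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ImageOnlyPdfOcrSketch <scan.pdf>");
            return;
        }
        System.out.println(ocrPdf(Paths.get(args[0])));
    }
}
```

As the table notes for the Tesseract-based extractors, the result depends a lot on image quality, so the extracted images may need preprocessing before OCR.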

License

If not stated otherwise, all code is licensed under the Apache License, Version 2.0.

Notice: Some libraries, like iText, have different, partially commercial licenses.

Library versions

Version
0.6.0	Nov 6, 2020
