CommonCrawl Document Download

Common utilities I find useful in many of my projects.

Group: org.dstadler
Artifact ID: commoncrawldownload
Latest version: 1.0.0.7
Type: jar
Website: https://github.com/centic9/CommonCrawlDocumentDownload
Source control: https://github.com/centic9/CommonCrawlDocumentDownload

Download commoncrawldownload

How to add the latest version

Maven:

<!-- https://jarcasting.com/artifacts/org.dstadler/commoncrawldownload/ -->
<dependency>
    <groupId>org.dstadler</groupId>
    <artifactId>commoncrawldownload</artifactId>
    <version>1.0.0.7</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/org.dstadler/commoncrawldownload/
implementation 'org.dstadler:commoncrawldownload:1.0.0.7'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/org.dstadler/commoncrawldownload/
implementation("org.dstadler:commoncrawldownload:1.0.0.7")

Buildr:

'org.dstadler:commoncrawldownload:jar:1.0.0.7'

Ivy:

<dependency org="org.dstadler" name="commoncrawldownload" rev="1.0.0.7">
  <artifact name="commoncrawldownload" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='org.dstadler', module='commoncrawldownload', version='1.0.0.7')
)

SBT:

libraryDependencies += "org.dstadler" % "commoncrawldownload" % "1.0.0.7"

Leiningen:

[org.dstadler/commoncrawldownload "1.0.0.7"]

Dependencies

compile (7)

Library identifier Type Version
org.dstadler : commons-dost jar 1.0.0.27
org.apache.httpcomponents : httpclient jar 4.5.7
org.netpreserve.commons : webarchive-commons jar 1.1.8
com.fasterxml.jackson.core : jackson-core jar 2.10.0
log4j : log4j jar 1.2.17
org.jsoup : jsoup jar 1.11.3
com.google.code.findbugs : jsr305 jar 3.0.1

test (2)

Library identifier Type Version
junit : junit jar 4.12
org.dstadler : commons-test jar 1.0.0.18

Project modules

This project has no modules.


This is a small tool to find matching URLs and download the corresponding binary data from the CommonCrawl indexes.

The newer URL Index (http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/) is supported. The older URL Index, as described at https://github.com/trivio/common_crawl_index and http://blog.commoncrawl.org/2013/01/common-crawl-url-index/, is still available in the "oldindex" package.

Please note that a full run usually finds a huge number of files and thus downloading will require a large amount of time and lots of disk-space if the data is stored locally!

Getting started

Grab it

git clone https://github.com/centic9/CommonCrawlDocumentDownload

Build it and create the distribution files

cd CommonCrawlDocumentDownload
./gradlew check

Run it

./gradlew lookupURLs

Reads the current Common Crawl URL index data, extracts all URLs with interesting mime-types or file extensions, and stores them in a file called commoncrawl-CC-MAIN-<year>-<crawl>.txt.

./gradlew downloadDocuments

Uses the URLs listed in commoncrawl-CC-MAIN-<year>-<crawl>.txt to download the documents from the Common Crawl.

./gradlew downloadOldIndex

Starts downloading the URL index files from the old index and looks at each URL, downloading binary data from the common crawl archives.
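As a rough illustration of what fetching a single document from the Common Crawl archives involves: an index entry typically names an archive file plus a byte offset and length, and the record can then be requested with an HTTP Range header. The sketch below only builds such a request and does not send it. The base URL, the class name and the example file path are assumptions made for illustration, and it uses the JDK's java.net.http classes rather than the Apache HttpClient that the project depends on.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RangeRequestSketch {
    // Assumption: public HTTP endpoint for Common Crawl data, adjust if it differs
    static final String BASE_URL = "https://data.commoncrawl.org/";

    /** Builds the value of an HTTP Range header that covers exactly one record. */
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        // HTTP byte ranges are inclusive on both ends
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Creates (but does not send) a GET request for one record inside an archive file. */
    static HttpRequest buildRequest(String filename, long offset, long length) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + filename))
                .header("Range", rangeHeader(offset, length))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical archive path and offsets, real values come from the URL index
        HttpRequest request = buildRequest("crawl-data/example/example.warc.gz", 1000, 250);
        System.out.println(request.uri());
        System.out.println(request.headers().firstValue("Range").orElse(""));
    }
}
```

Requesting only the needed byte range avoids downloading a whole archive file for the sake of one document.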

The longer stuff

Change it

Create matching Eclipse project files

./gradlew eclipse

Run unit tests

./gradlew check jacocoTestReport

Adjust which files are found

There are a few things that you can tweak:

  • The file-extensions that are detected as download-able files are handled in the class Extensions.
  • The mime-types that are detected as download-able files are handled in the class MimeTypes.
  • Adjust the name of the list of found files in DownloadURLIndex.COMMON_CRAWL_FILE.
  • Adjust the location where files are downloaded to in Utils.DOWNLOAD_DIR.
  • The starting file-index (of the approximately 300 cdx-files) is currently set as a constant in class org.dstadler.commoncrawl.index.DownloadURLIndex. Changing it also lets you re-start a download that was interrupted before.
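To make the first two tweaks more concrete, here is a minimal sketch of how matching by file extension or mime-type could work. It is not the code of the project's Extensions or MimeTypes classes: the class name, the method names and the example lists below are hypothetical and only illustrate the idea.

```java
import java.util.Locale;
import java.util.Set;

class DownloadFilterSketch {
    // Hypothetical example values, the project keeps its real lists in Extensions and MimeTypes
    static final Set<String> EXTENSIONS = Set.of(".doc", ".docx", ".xls", ".xlsx", ".pdf");
    static final Set<String> MIME_TYPES =
            Set.of("application/msword", "application/pdf", "application/vnd.ms-excel");

    /** Checks whether the path of the URL ends with one of the extensions. */
    static boolean matchesExtension(String url) {
        String path = url.toLowerCase(Locale.ROOT);
        // Ignore query string and fragment so "file.pdf?download=1" still matches
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        int fragment = path.indexOf('#');
        if (fragment >= 0) {
            path = path.substring(0, fragment);
        }
        for (String extension : EXTENSIONS) {
            if (path.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }

    /** Checks the mime-type after dropping parameters such as "; charset=utf-8". */
    static boolean matchesMimeType(String mime) {
        if (mime == null) {
            return false;
        }
        int semicolon = mime.indexOf(';');
        String base = semicolon >= 0 ? mime.substring(0, semicolon) : mime;
        return MIME_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    /** A URL is considered interesting if either check matches. */
    static boolean isDownloadable(String url, String mime) {
        return matchesExtension(url) || matchesMimeType(mime);
    }

    public static void main(String[] args) {
        System.out.println(isDownloadable("http://example.com/Report.PDF?download=1", "text/html"));
        System.out.println(isDownloadable("http://example.com/index.html", "text/html"));
    }
}
```

Accepting a match on either criterion keeps documents that are served with a generic mime-type but a telling file name, and the other way round.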

Ideas

  • Old Index: By adding a new implementation of BlockProcesser (likely re-using existing stuff by deriving from one of the available implementations), you can do things like streaming processing of the file instead of storing the file locally, which will avoid using too much disk-space
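The streaming idea can be sketched as follows. The DocumentHandler interface below is hypothetical and is not the project's BlockProcesser interface. It only shows the general pattern of consuming document content from a stream with a small fixed buffer, so that nothing is written to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamingSketch {
    /** Hypothetical callback, not part of the project's API. */
    interface DocumentHandler {
        void handle(String url, InputStream content) throws IOException;
    }

    /** Gathers simple statistics while reading, keeping only a small buffer in memory. */
    static class SizeCountingHandler implements DocumentHandler {
        long documents;
        long totalBytes;

        @Override
        public void handle(String url, InputStream content) throws IOException {
            byte[] buffer = new byte[8192];
            int read;
            // Read until the end of the stream, the bytes are inspected but never stored
            while ((read = content.read(buffer)) != -1) {
                totalBytes += read;
            }
            documents++;
        }
    }

    public static void main(String[] args) throws IOException {
        SizeCountingHandler handler = new SizeCountingHandler();
        // In-memory stand-in for a downloaded document
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        handler.handle("http://example.com/a.doc", new ByteArrayInputStream(data));
        System.out.println(handler.documents + " documents, " + handler.totalBytes + " bytes");
    }
}
```

A real handler would replace the byte counting with whatever analysis is needed, for example text extraction or hashing.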

Estimates (based on Old Index)

  • Size of overall URL Index is 233689120776, i.e. 217GB
  • Header: 6 Bytes
  • Index-Blocks: 2644
  • Block-Size: 65536
  • => Data-Blocks: 3563169
  • Approx. files per block: 2.421275
  • Resulting approx. number of files: 8627412
  • Avg. size per file: 221613
  • Needed storage: 1911954989425 bytes = 1.7TB!
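The arithmetic behind these estimates can be retraced with a short sketch. It assumes that the number of data blocks is what remains of the index after subtracting the header and the index blocks, which is one reading that reproduces the listed figure, and that GB and TB are meant as binary units. Because the listed average values are rounded, the computed storage comes out slightly below the number given above. The class and method names are made up for this example.

```java
class StorageEstimateSketch {
    /** Blocks left for data after removing the header and the index blocks. */
    static long dataBlocks(long indexSize, long headerSize, long indexBlocks, long blockSize) {
        return (indexSize - headerSize - indexBlocks * blockSize) / blockSize;
    }

    /** Estimated number of files, rounded to the nearest whole file. */
    static long estimatedFiles(long dataBlocks, double filesPerBlock) {
        return Math.round(dataBlocks * filesPerBlock);
    }

    /** Estimated storage in bytes for the given number of files. */
    static long estimatedStorage(long files, long averageFileSize) {
        return files * averageFileSize;
    }

    public static void main(String[] args) {
        long indexSize = 233689120776L;   // size of the overall URL index in bytes
        long blocks = dataBlocks(indexSize, 6, 2644, 65536);
        long files = estimatedFiles(blocks, 2.421275);
        long storage = estimatedStorage(files, 221613);

        System.out.printf("Index size: %.1f GB%n", indexSize / Math.pow(1024, 3));
        System.out.println("Data blocks: " + blocks);
        System.out.println("Estimated files: " + files);
        System.out.printf("Estimated storage: %d bytes, about %.1f TB%n",
                storage, storage / Math.pow(1024, 4));
    }
}
```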

Related projects/pages

Release it

./gradlew --console=plain release && ./gradlew closeAndReleaseRepository
  • This should automatically release the new version on Maven Central
  • Afterwards go to the GitHub releases page and add release-notes

Support this project

If you find this library useful and would like to support it, you can sponsor the author.

Licensing

Library versions

Version
1.0.0.7
1.0.0.6
1.0.0.5
1.0.0.4