Crawling Framework

Framework to simplify news crawling

Group: lt.tokenmill.crawling
Artifact ID: crawling-framework
Latest version: 0.2.0
Type: pom
Description: Crawling Framework. Framework to simplify news crawling
Website: https://github.com/tokenmill/crawling-framework
Version control: https://github.com/tokenmill/crawling-framework/tree/master

Download crawling-framework

File name                       Size
crawling-framework-0.2.0.pom    10 KB

How to add the latest version

Maven:
<!-- https://jarcasting.com/artifacts/lt.tokenmill.crawling/crawling-framework/ -->
<dependency>
    <groupId>lt.tokenmill.crawling</groupId>
    <artifactId>crawling-framework</artifactId>
    <version>0.2.0</version>
    <type>pom</type>
</dependency>

Gradle (Groovy):
// https://jarcasting.com/artifacts/lt.tokenmill.crawling/crawling-framework/
implementation 'lt.tokenmill.crawling:crawling-framework:0.2.0'

Gradle (Kotlin):
// https://jarcasting.com/artifacts/lt.tokenmill.crawling/crawling-framework/
implementation ("lt.tokenmill.crawling:crawling-framework:0.2.0")

Buildr:
'lt.tokenmill.crawling:crawling-framework:pom:0.2.0'

Ivy:
<dependency org="lt.tokenmill.crawling" name="crawling-framework" rev="0.2.0">
  <artifact name="crawling-framework" type="pom" />
</dependency>

Grape:
@Grapes(
@Grab(group='lt.tokenmill.crawling', module='crawling-framework', version='0.2.0')
)

SBT:
libraryDependencies += "lt.tokenmill.crawling" % "crawling-framework" % "0.2.0"

Leiningen:
[lt.tokenmill.crawling/crawling-framework "0.2.0"]

Dependencies

The library has no dependencies. It is a self-contained application that does not depend on any other libraries.

Project Modules

  • data-model
  • elasticsearch
  • parser
  • page-analyzer
  • crawler
  • administration-ui
  • analysis-ui
  • ui-commons

Crawling Framework

Crawling Framework provides tools to configure and run a crawler based on Storm Crawler. It is mainly intended to ease the crawling of sites that publish article content, such as news portals and blogs. With the GUI tool that Crawling Framework provides, you can:

  1. Specify which sites to crawl.
  2. Configure URL inclusion and exclusion filters, thus controlling which sections of the site will be fetched.
  3. Specify which elements of the page provide information about article publication name, its title and main body.
  4. Define tests which validate that extraction rules are working.

Once configuration is done, Crawling Framework runs a Storm Crawler based crawl that follows the rules specified in the configuration.
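To make the idea of URL inclusion and exclusion filters (step 2 above) more concrete, here is a minimal sketch in plain Java. It is not the framework's API: the class name UrlFilterSketch, the example.com patterns, and the rule "a URL is fetched when it matches at least one inclusion pattern and no exclusion pattern" are all assumptions made for illustration. The real filter semantics are whatever the Crawling Framework configuration defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of the Crawling Framework API. */
public class UrlFilterSketch {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    public UrlFilterSketch(List<String> includes, List<String> excludes) {
        this.includes = compile(includes);
        this.excludes = compile(excludes);
    }

    private static List<Pattern> compile(List<String> regexes) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        return compiled;
    }

    /** Assumed rule: exclusions win; otherwise at least one inclusion must match. */
    public boolean accepts(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        for (Pattern p : includes) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
                Arrays.asList("^https://example\\.com/news/"),
                Arrays.asList("/tag/", "\\.pdf$"));
        System.out.println(filter.accepts("https://example.com/news/2019/story.html")); // true
        System.out.println(filter.accepts("https://example.com/news/tag/politics"));    // false
        System.out.println(filter.accepts("https://example.com/about"));                // false
    }
}
```

Letting exclusions take precedence keeps the inclusion patterns broad (a whole news section) while still cutting out listing pages or binary files inside that section.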

Introduction

We have recorded a video, "Crawling Framework Intro", on how to set up and use Crawling Framework. It is available on YouTube.

Requirements

The framework writes its configuration and stores crawled data in Elasticsearch. Before starting a crawl project, install Elasticsearch (Crawling Framework is tested to work with Elasticsearch v7.x).

Crawling Framework is a Java library that has to be extended to run a Storm Crawler topology, so a Java toolchain (JDK 8, Maven) is required.

Using password-protected Elasticsearch

Some providers put Elasticsearch behind an authentication step (which makes sense). Just set the environment variables ES_USERNAME and ES_PASSWORD accordingly; everything else can remain the same. Authentication is performed implicitly when proper credentials are present.
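As a rough picture of what "implicit" authentication could look like, the sketch below reads ES_USERNAME and ES_PASSWORD from the environment and builds an HTTP Basic Authorization header value only when both are set. This is an assumption for illustration: the source does not say which authentication scheme the framework uses internally, and the class EsCredentialsSketch and its method basicAuthHeader are hypothetical names, not part of the framework.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

/** Hypothetical illustration only; not part of the Crawling Framework API. */
public class EsCredentialsSketch {

    /**
     * Returns an HTTP Basic Authorization header value when both credentials
     * are present, or an empty Optional when either one is missing.
     */
    public static Optional<String> basicAuthHeader(String username, String password) {
        if (username == null || username.isEmpty()
                || password == null || password.isEmpty()) {
            return Optional.empty();
        }
        String token = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(token.getBytes(StandardCharsets.UTF_8));
        return Optional.of("Basic " + encoded);
    }

    public static void main(String[] args) {
        // The variable names come from the text above; Basic auth is an assumption.
        Optional<String> header = basicAuthHeader(
                System.getenv("ES_USERNAME"),
                System.getenv("ES_PASSWORD"));
        System.out.println(header.isPresent()
                ? "Credentials found; an Authorization header would be sent"
                : "No credentials set; connecting without authentication");
    }
}
```

Returning an empty Optional when a variable is missing mirrors the behaviour described above: with no credentials in the environment, nothing else about the setup has to change.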

Configuring and Running a crawl

See Crawling Framework Example project's documentation.

License

Copyright © 2017-2019 TokenMill UAB.

Distributed under the Apache License, Version 2.0.

lt.tokenmill.crawling

TokenMill

We can help you with your natural language generation and processing projects

Library versions

Version
0.2.0