Gecco

Easy to use lightweight web crawler

License: MIT
Group: com.geccocrawler
Artifact ID: gecco
Latest version: 1.3.21
Type: jar
Description: Gecco, an easy-to-use lightweight web crawler
Website: https://github.com/xtuhcy/gecco
Source control: https://github.com/xtuhcy/gecco

Download gecco

How to add the latest version

Maven:

<!-- https://jarcasting.com/artifacts/com.geccocrawler/gecco/ -->
<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>1.3.21</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.geccocrawler/gecco/
implementation 'com.geccocrawler:gecco:1.3.21'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.geccocrawler/gecco/
implementation("com.geccocrawler:gecco:1.3.21")

Buildr:

'com.geccocrawler:gecco:jar:1.3.21'

Ivy:

<dependency org="com.geccocrawler" name="gecco" rev="1.3.21">
  <artifact name="gecco" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='com.geccocrawler', module='gecco', version='1.3.21')
)

sbt:

libraryDependencies += "com.geccocrawler" % "gecco" % "1.3.21"

Leiningen:

[com.geccocrawler/gecco "1.3.21"]

Dependencies

compile (10)

Library identifier                       Type  Version
org.apache.httpcomponents : httpclient   jar   4.5.12
org.jsoup : jsoup                        jar   1.13.1
org.reflections : reflections            jar   0.9.11
com.alibaba : fastjson                   jar   1.2.72
log4j : log4j                            jar   1.2.17
cglib : cglib                            jar   3.3.0
org.apache.commons : commons-lang3       jar   3.8.1
org.mozilla : rhino                      jar   1.7.10
org.weakref : jmxutils                   jar   1.19
com.google.guava : guava                 jar   27.0.1-jre

Project modules

This project has no modules.


What is Gecco

Gecco is an easy-to-use, lightweight web crawler written in Java. It integrates excellent frameworks such as jsoup, HttpClient, fastjson, Spring, HtmlUnit, and Redisson, so you can write a crawler quickly by configuring just a few jQuery-style selectors. The framework is highly extensible and follows the open/closed principle: closed for modification, open for extension. Gecco is released under the permissive MIT open source license. Whether you are a user or a developer who wants to help improve Gecco, pull requests are welcome. If you like this crawler framework, please star or fork it!

Main features

  • Easy to use: elements are extracted with jQuery-style selectors
  • Supports asynchronous Ajax requests in the page
  • Supports extraction of JavaScript variables from the page
  • Distributed crawling with Redis (see gecco-redis)
  • Business logic can be developed with Spring (see gecco-spring)
  • HtmlUnit extension (see gecco-htmlunit)
  • Supports an extension mechanism
  • Random selection of the download UserAgent
  • Random selection of the download proxy server
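
The last two features in the list, random UserAgent and random proxy selection, come down to picking one entry from a configured pool for each download. The sketch below shows that idea using only the JDK. It is not Gecco's implementation: the class name `RandomPoolSketch`, its method `next`, and the placeholder pool values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, not Gecco's implementation: picks one entry at
// random from a fixed pool, as a crawler might do before each download.
class RandomPoolSketch {

    private final List<String> pool;
    private final Random random;

    RandomPoolSketch(List<String> pool, Random random) {
        if (pool.isEmpty()) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        // Defensive copy so later changes to the caller's list have no effect.
        this.pool = new ArrayList<>(pool);
        this.random = random;
    }

    // Returns a uniformly chosen entry; repeated calls may return the same one.
    String next() {
        return pool.get(random.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        // Placeholder values; a real pool would hold full UserAgent strings
        // or proxy addresses.
        RandomPoolSketch userAgents = new RandomPoolSketch(
            Arrays.asList("example-agent-a", "example-agent-b", "example-agent-c"),
            new Random());
        for (int i = 0; i < 3; i++) {
            System.out.println(userAgents.next());
        }
    }
}
```

Passing the `Random` instance in from outside keeps the selection reproducible in tests, where a fixed seed can be supplied.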

Framework overview

[Figure: architecture diagram]

Download

Download via Maven

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>x.x.x</version>
</dependency>


Dependencies

httpclient, jsoup, fastjson, reflections, cglib, rhino, log4j, jmxutils, commons-lang3

Quick start

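// The package must match the one passed to GeccoEngine.classpath(...) below.
package com.geccocrawler.gecco.demo;

import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Html;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

// Matches GitHub project pages; extracted beans are passed to the pipeline named "consolePipeline".
@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {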

    private static final long serialVersionUID = -7127412585200687225L;

    @RequestParameter("user")
    private String user;

    @RequestParameter("project")
    private String project;

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(2) .social-count")
    private String star;

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(3) .social-count")
    private String fork;

    @Html
    @HtmlField(cssPath=".entry-content")
    private String readme;

    public String getReadme() {
        return readme;
    }

    public void setReadme(String readme) {
        this.readme = readme;
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getProject() {
        return project;
    }

    public void setProject(String project) {
        this.project = project;
    }

    public String getStar() {
        return star;
    }

    public void setStar(String star) {
        this.star = star;
    }

    public String getFork() {
        return fork;
    }

    public void setFork(String fork) {
        this.fork = fork;
    }

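    public static void main(String[] args) {
        GeccoEngine.create()
            // Package to scan for annotated beans
            .classpath("com.geccocrawler.gecco.demo")
            // URL of the first page to crawl
            .start("https://github.com/xtuhcy/gecco")
            // Number of crawler threads
            .thread(1)
            // Pause in milliseconds after each request of a single crawler
            .interval(2000)
            // Crawl in a loop
            .loop(true)
            // Use a desktop rather than a mobile UserAgent
            .mobile(false)
            // Run without blocking the calling thread
            .start();
    }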
}
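
The `matchUrl` pattern in the example above binds `{user}` and `{project}` to the fields annotated with `@RequestParameter`. The following is a minimal standalone sketch of how such a template could be matched against a URL using only the JDK. It is not Gecco's actual matching code: the class `UrlTemplateSketch`, the method `match`, and the rule that a placeholder covers exactly one path segment are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Gecco: matches a URL against a template
// such as "https://github.com/{user}/{project}".
class UrlTemplateSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Returns placeholder name -> value, or null when the URL does not match.
    static Map<String, String> match(String template, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Quote literal text so characters like '.' are not regex operators.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a placeholder matches one path segment (no '/').
            regex.append("([^/]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), urlMatcher.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = match(
            "https://github.com/{user}/{project}",
            "https://github.com/xtuhcy/gecco");
        System.out.println(params); // prints {user=xtuhcy, project=gecco}
    }
}
```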

DynamicGecco

DynamicGecco lets you configure crawl rules at runtime without defining a SpiderBean class. Under the hood it uses bytecode generation to create the SpiderBean dynamically, and a custom GeccoClassLoader makes hot deployment of rules possible. A simple demo follows; more complex examples can be found in the package com.geccocrawler.gecco.demo.dynamic.

The following code implements the runtime configuration of the crawl rule:

DynamicGecco.html()
.gecco("https://github.com/{user}/{project}", "consolePipeline")
.requestField("request").request().build()
.stringField("user").requestParameter("user").build()
.stringField("project").requestParameter().build()
.stringField("star").csspath(".pagehead-actions li:nth-child(2) .social-count").text(false).build()
.stringField("fork").csspath(".pagehead-actions li:nth-child(3) .social-count").text().build()
.stringField("contributors").csspath("ul.numbers-summary > li:nth-child(4) > a").href().build()
.register();

GeccoEngine.create()
.classpath("com.geccocrawler.gecco.demo")
.start("https://github.com/xtuhcy/gecco")
.run();

Compared with the traditional annotation approach, DynamicGecco needs far less code. Better still, it supports defining and modifying rules while the crawler is running.
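
To make the idea of rules defined and replaced at runtime more concrete, here is a small JDK-only sketch of a fluent builder that stores rules as plain data in a registry. It does not reproduce DynamicGecco's bytecode generation or its API: `RuntimeRulesSketch`, `rule`, `fieldsFor`, and the builder methods are hypothetical names, and only the chained style mirrors the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration, not Gecco's API: crawl rules are kept as data,
// so registering a rule again for the same URL pattern replaces the old one.
class RuntimeRulesSketch {

    // matchUrl -> (field name -> CSS selector)
    private final Map<String, Map<String, String>> rules = new ConcurrentHashMap<>();

    Builder rule(String matchUrl) {
        return new Builder(matchUrl);
    }

    // Returns the fields registered for the pattern, or null if there are none.
    Map<String, String> fieldsFor(String matchUrl) {
        return rules.get(matchUrl);
    }

    class Builder {
        private final String matchUrl;
        private final Map<String, String> fields = new LinkedHashMap<>();

        Builder(String matchUrl) {
            this.matchUrl = matchUrl;
        }

        Builder stringField(String name, String cssPath) {
            fields.put(name, cssPath);
            return this;
        }

        // Publishes a copy of the collected fields, replacing any earlier rule.
        void register() {
            rules.put(matchUrl, new LinkedHashMap<>(fields));
        }
    }

    public static void main(String[] args) {
        RuntimeRulesSketch registry = new RuntimeRulesSketch();
        String url = "https://github.com/{user}/{project}";

        registry.rule(url)
            .stringField("star", ".pagehead-actions li:nth-child(2) .social-count")
            .register();

        // Later, while the program is still running, the rule is redefined.
        registry.rule(url)
            .stringField("star", ".pagehead-actions li:nth-child(2) .social-count")
            .stringField("fork", ".pagehead-actions li:nth-child(3) .social-count")
            .register();

        System.out.println(registry.fieldsFor(url).keySet()); // prints [star, fork]
    }
}
```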

Demo

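How to crawl all JD product information with the Java crawler Gecco (Part 1)

How to crawl all JD product information with the Java crawler Gecco (Part 2)

How to crawl all JD product information with the Java crawler Gecco (Part 3)

Downloading pages with the HtmlUnit integration

Monitoring the crawler

A complete example: pagination handling, Spring integration, and storing results in MySQL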

Similar Tool Comparison

A list of similar tools and how they compare is available here:

Web Archiving Software Comparison

Contact and communication

Buy the author a coffee

Gecco's development depends on everyone's support. Scan the code to buy the author a coffee.

[Image: Alipay]

License

Gecco is released under the MIT open source license; please follow its terms.

Library versions

Version
1.3.21
1.3.2
1.3.0
1.2.10
1.2.9
1.2.8
1.2.7
1.2.6
1.2.5
1.2.4
1.2.3
1.2.2
1.2.1
1.2.0
1.1.3
1.1.2
1.1.1
1.1.0
1.0.9
1.0.8
1.0.7
1.0.6
1.0.5
1.0.4
1.0.3
1.0.2
1.0.1
1.0.0