A flexible nodejs crawler library — x-crawl

This content originally appeared on DEV Community and was authored by CoderHXL

x-crawl

x-crawl is a flexible Node.js crawler library. It can crawl pages, control pages, batch network requests, batch-download file resources, and crawl on a polling schedule. It supports both asynchronous and synchronous crawling modes. It runs on Node.js, is flexible and simple to use, and is friendly to JS/TS developers.

If you find x-crawl useful, you can give the x-crawl repository a star to support it; your star motivates me to keep updating it.

Features

  • Crawl data in asynchronous or synchronous mode.
  • Flexible usage: request configuration can be written, and crawl results obtained, in multiple ways.
  • Flexible crawl interval: no interval, a fixed interval, or a random interval; you decide whether to use or avoid high-concurrency crawling.
  • With simple configuration, crawl pages, batch network requests, batch-download file resources, crawl on a polling schedule, and more.
  • Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR", server-side rendering), then parse the content with the jsdom library or parse it yourself.
  • Form submissions, keystrokes, event actions, screenshots of generated pages, and more.
  • Capture and record crawl successes and failures, with highlighted reminders.
  • Written in TypeScript, with type definitions and generics.
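
The three interval modes in the list above (no interval, fixed interval, random interval) can be pictured with a small sketch. This is not x-crawl's internal code: `resolveInterval`, `runWithInterval`, and the `IntervalTime` type are hypothetical names made up for illustration, and the sketch assumes the interval is given in milliseconds, either as a single number or as a `{ max, min }` range.

```typescript
// Illustrative sketch only; NOT x-crawl's implementation.
// IntervalTime, resolveInterval and runWithInterval are hypothetical names.
type IntervalTime = number | { max: number; min?: number }

// Resolve how long (in ms) to wait before the next request.
function resolveInterval(
  intervalTime?: IntervalTime,
  random: () => number = Math.random
): number {
  if (intervalTime === undefined) return 0 // no interval
  if (typeof intervalTime === 'number') return intervalTime // fixed interval
  const min = intervalTime.min ?? 0
  // Random interval in the half-open range [min, max)
  return Math.floor(min + random() * (intervalTime.max - min))
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Run tasks one after another, pausing between consecutive tasks.
async function runWithInterval<T>(
  tasks: Array<() => Promise<T>>,
  intervalTime?: IntervalTime
): Promise<T[]> {
  const results: T[] = []
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await sleep(resolveInterval(intervalTime))
    results.push(await tasks[i]())
  }
  return results
}
```

Running the tasks sequentially with a pause is what keeps the request rate low; omitting the interval and starting every task at once would be the high-concurrency end of the trade-off.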

Example

Scheduled crawling: as an example, automatically fetch the cover images of Airbnb Plus listings once a day:

// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout (ms)
  intervalTime: { max: 3000, min: 2000 } // random crawl interval between min and max (ms)
})

// 3. Set the crawling task
/* 
  Call the startPolling API to start polling;
  with { d: 1 } the callback function is called once a day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { jsdom } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Get the cover image elements for Plus listings
  // (typed as HTMLImageElement so that item.src type-checks)
  const imgEls = jsdom.window.document
    .querySelector('.a1stauiv')
    ?.querySelectorAll<HTMLImageElement>('picture img')

  // Build the request configuration from the image URLs
  const requestConfig: string[] = []
  imgEls?.forEach((item) => requestConfig.push(item.src))

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
})
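
To make the `{ d: 1 }` argument and the `(count, stopPolling)` callback more concrete, here is a minimal sketch of how such a polling helper could work. It is not x-crawl's implementation: `PollingConfig`, `pollingIntervalMs`, and `startPollingSketch` are hypothetical names, the `h` and `m` fields are assumed by analogy with `d`, and running the callback once immediately is a choice made for this sketch only.

```typescript
// Illustrative sketch only; NOT x-crawl's implementation.
// The h (hours) and m (minutes) fields are assumptions; the example above only uses d (days).
interface PollingConfig {
  d?: number
  h?: number
  m?: number
}

// Convert a polling config into a period in milliseconds.
function pollingIntervalMs({ d = 0, h = 0, m = 0 }: PollingConfig): number {
  const MINUTE = 60 * 1000
  const HOUR = 60 * MINUTE
  const DAY = 24 * HOUR
  return d * DAY + h * HOUR + m * MINUTE
}

// Call `callback` now and then once per period until stopPolling is called.
// Note: setInterval cannot represent delays above 2147483647 ms (about 24.8 days).
function startPollingSketch(
  config: PollingConfig,
  callback: (count: number, stopPolling: () => void) => void
): void {
  let count = 0
  const stopPolling = () => clearInterval(timer)
  const tick = () => callback(++count, stopPolling)
  const timer = setInterval(tick, pollingIntervalMs(config))
  tick() // first run happens immediately (a choice of this sketch)
}

console.log(pollingIntervalMs({ d: 1 })) // 86400000
```

The `count` argument lets the callback know which run it is on, and `stopPolling` gives it a way to end the schedule from inside, for example after a fixed number of runs.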

Note: Do not crawl indiscriminately; check the site's robots.txt before crawling. This example only demonstrates how to use x-crawl.

More

For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl

