A flexible Node.js multifunctional crawler library: x-crawl

This content originally appeared on DEV Community and was authored by CoderHXL

x-crawl

x-crawl is a flexible, multifunctional crawler library for Node.js. It can crawl pages, interfaces (APIs), and files, and it can repeat crawls on a polling schedule.

If you like x-crawl, consider giving the x-crawl repository a star. Thank you for your support!

Features

  • 🔥 Async/Sync - Switch between asynchronous and synchronous crawling just by changing the mode attribute value.
  • ⚙️ Multiple functions - Crawl pages, interfaces, and files, or run polling crawls; each supports single or multiple targets.
  • 🖋️ Flexible writing style - Simple target configuration, detailed target configuration, mixed target array configuration, and advanced configuration: the same crawling API adapts to all of them.
  • 👀 Device Fingerprinting - Zero configuration or custom configuration, to avoid being identified and tracked across locations through fingerprinting.
  • ⏱️ Interval Crawling - No interval, a fixed interval, or a random interval, to allow or avoid highly concurrent crawling.
  • 🔄 Retry on failure - Configurable globally, locally, or per target, to avoid crawl failures caused by temporary problems.
  • 🚀 Priority Queue - A single crawling target can be crawled ahead of other targets according to its priority.
  • ☁️ Crawl SPA - Crawl SPAs (Single Page Applications) to generate pre-rendered content (aka "SSR", Server Side Rendering).
  • ⚒️ Controlling Pages - Headless browsers can submit forms, send keystrokes, trigger events, take screenshots of pages, and so on.
  • 🧾 Capture Record - Captures and records crawling results and other information, with highlighted reminders on the console.
  • 🦾 TypeScript - Ships its own types and implements complete typing through generics.
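
To make the interval, retry, and priority features more concrete, here is a minimal standalone sketch of the ideas behind them. It does not use x-crawl and is not how x-crawl implements them internally: the names nextDelay, withRetry, orderByPriority, and the Target type are hypothetical, and "a larger number means a higher priority" is an assumption made for illustration.

```typescript
// Hypothetical types for illustration; not x-crawl's own definitions.
type IntervalTime = number | { min: number; max: number }

interface Target {
  url: string
  priority?: number // assumption: a larger value is crawled earlier
}

// Return the delay in milliseconds to wait before the next request.
// `random` is injectable so the result can be checked deterministically.
function nextDelay(
  interval: IntervalTime | undefined,
  random: () => number = Math.random
): number {
  if (interval === undefined) return 0 // no interval
  if (typeof interval === 'number') return interval // fixed interval
  const { min, max } = interval
  return Math.floor(min + random() * (max - min)) // random interval in [min, max)
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Run `task`; on failure, retry up to `maxRetry` more times,
// pausing between attempts. Rethrows the last error if all attempts fail.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetry: number,
  interval?: IntervalTime
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      if (attempt < maxRetry) await sleep(nextDelay(interval))
    }
  }
  throw lastError
}

// Return a copy of the targets with higher-priority targets first.
// Targets without a priority are treated as priority 0.
function orderByPriority(targets: Target[]): Target[] {
  return [...targets].sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0))
}
```

With a random interval such as { min: 2000, max: 3000 }, each wait falls somewhere between two and three seconds, which spreads requests out instead of sending them all at once.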

Example

As an example, the following automatically fetches pictures from Airbnb's Hawaii experiences and Plus listings every day:

// 1.Import module ES/CJS
import xCrawl from 'x-crawl'

// 2.Create a crawler instance
const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })

// 3.Set the crawling task
/*
  Call the startPolling API to start the polling function;
  with { d: 1 }, the callback function is called once a day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const res = await myXCrawl.crawlPage([
    'https://zh.airbnb.com/s/hawaii/experiences',
    'https://zh.airbnb.com/s/hawaii/plus_homes'
  ])

  // Collect the image URLs in targets
  const targets = []
  const elSelectorMap = ['.c14whb16', '.a1stauiv']
  for (const item of res) {
    const { id } = item
    const { page } = item.data

    // Get the URLs of the images in the page's carousel element
    const boxHandle = await page.$(elSelectorMap[id - 1])
    const urls = await boxHandle!.$$eval('picture img', (imgEls) => {
      return imgEls.map((item) => item.src)
    })
    targets.push(...urls)

    // Close page
    page.close()
  }

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ targets, storeDir: './upload' })
})
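
To illustrate what a polling configuration like { d: 1 } amounts to, here is a rough standalone sketch of a polling loop. It is not x-crawl's implementation: only the d field appears in the example above, so the h and m fields, the function names, and the choice to run the callback once immediately are all assumptions.

```typescript
// Hypothetical shape: only `d` (days) is shown in the example;
// `h` (hours) and `m` (minutes) are assumed for illustration.
interface PollingConfig {
  d?: number
  h?: number
  m?: number
}

// Convert a polling configuration into an interval in milliseconds.
function pollingIntervalMs({ d = 0, h = 0, m = 0 }: PollingConfig): number {
  const minuteMs = 60 * 1000
  return (d * 24 * 60 + h * 60 + m) * minuteMs
}

// Simplified polling loop: run the callback once right away (an assumption),
// then once per interval, until `stop` is called.
function startPollingSketch(
  config: PollingConfig,
  callback: (count: number, stop: () => void) => void
): () => void {
  let count = 0
  const run = () => callback(++count, stop)
  const timer = setInterval(run, pollingIntervalMs(config))
  function stop(): void {
    clearInterval(timer)
  }
  run()
  return stop
}
```

Under these assumptions, { d: 1 } corresponds to an interval of 86,400,000 ms, which is one day.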

Note: Do not crawl indiscriminately; check a site's robots.txt before crawling it. This example only demonstrates how to use x-crawl.
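
As a rough idea of what such a check could look like, here is a simplified sketch that tests a path against the text of a robots.txt file. It is not part of x-crawl, and the function name isPathAllowed is hypothetical. It only handles User-agent, Allow, and Disallow lines with plain prefix matching (no wildcards or "$" anchors), and it merges the "*" group with a group naming the given user agent instead of letting the more specific group take precedence, so treat it as an illustration rather than a complete parser.

```typescript
// Simplified robots.txt check (illustration only, see the caveats above).
// Among matching rules, the longest path prefix wins; on a tie, Allow wins.
function isPathAllowed(
  robotsTxt: string,
  path: string,
  userAgent = '*'
): boolean {
  let applies = false // does the current group apply to our user agent?
  let inAgentLines = false // was the previous rule line a User-agent line?
  let best = { length: -1, allow: true } // best matching rule so far

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim() // strip comments
    const sep = line.indexOf(':')
    if (sep === -1) continue

    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      const matches =
        value === '*' || value.toLowerCase() === userAgent.toLowerCase()
      // Consecutive User-agent lines belong to the same group.
      applies = inAgentLines ? applies || matches : matches
      inAgentLines = true
      continue
    }
    inAgentLines = false

    // Skip rules for other agents; an empty value matches nothing.
    if (!applies || value === '') continue
    if (field !== 'allow' && field !== 'disallow') continue
    if (!path.startsWith(value)) continue

    const isAllow = field === 'allow'
    if (
      value.length > best.length ||
      (value.length === best.length && isAllow)
    ) {
      best = { length: value.length, allow: isAllow }
    }
  }
  return best.allow // no matching rule means the path is allowed
}
```

A crawler could fetch a site's /robots.txt once, pass its text to a check like this for each target path, and skip the targets that are disallowed.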

More

For more detailed documentation, see: https://github.com/coder-hxl/x-crawl


CoderHXL | Sciencx (2023-04-21T03:04:10+00:00) A flexible Node.js multifunctional crawler library —— x-crawl. Retrieved from https://www.scien.cx/2023/04/21/a-flexible-node-js-multifunctional-crawler-library-x-crawl/
