How to build a serverless web crawler

Click for: original source

James Beswick wrote this piece about using serverless to scale an old concept for the modern era. It describes a client project that required crawling a large media site to generate a list of URLs and site assets.

For Node users, there’s a package called Website Scraper that does this elegantly. It is mostly configuration-driven, with plenty of options for controlling what gets crawled and saved, plus a lot of other features.
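As a rough illustration of that configuration-driven style, a minimal sketch with the website-scraper package might look like the following (the URL and output directory are placeholders, and option names should be checked against the docs for the installed version):

```js
const scrape = require('website-scraper');

// Everything about the crawl is expressed as configuration.
const options = {
  urls: ['https://example.com/'],          // placeholder start URL
  directory: './site-copy',                // placeholder output directory
  recursive: true,                         // follow links on downloaded pages
  maxRecursiveDepth: 2,                    // limit how deep the crawl goes
  urlFilter: (url) => url.startsWith('https://example.com'), // stay on one site
};

scrape(options).then((resources) => {
  console.log(`Saved ${resources.length} resources`);
});
```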

The article is split into:

  • Let’s crawl before we run
  • Serverless Crawler — Version 1.0
  • Serverless Web Crawler 2.0 — now slower!
  • Show me the code!
  • Serverless Web Crawler 3.0

There are many reasons to crawl a website, and crawling is different from scraping. The article describes the evolution of an AWS Lambda-based solution, with detailed experience and lessons learned along the way. Great!
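The article's own Lambda code isn't reproduced here, but the general shape of one serverless crawl step, a handler that fetches a page, extracts links, and queues new URLs for later steps, might look roughly like this sketch (the queue URL, environment variable, and link extraction are illustrative assumptions, not taken from the article):

```js
// Sketch only: one crawl step as an AWS Lambda handler triggered by SQS.
const https = require('https');
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL; // assumed: queue of URLs still to crawl

// Fetch a page body as a string.
const fetchPage = (url) =>
  new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => resolve(body));
      })
      .on('error', reject);
  });

exports.handler = async (event) => {
  // Each SQS record carries one URL to crawl.
  for (const record of event.Records) {
    const html = await fetchPage(record.body);

    // Naive link extraction; a real crawler would use a proper HTML parser.
    const links = [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)].map((m) => m[1]);

    // Queue each discovered link for a later crawl step.
    for (const link of links) {
      await sqs.send(new SendMessageCommand({ QueueUrl: QUEUE_URL, MessageBody: link }));
    }
  }
};
```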

[Read More]

Tags serverless containers cicd web-development