8 most popular Python HTML web scraping packages with benchmarks


This blog post compares Python web scraping packages on speed and ease of use, drawing on the author’s own investigations. It does not cover what web scraping is or how parsers work. By Dmitriy Zub.

The article’s recommendations:

  • If you need to scrape data from a dynamic page that requires JavaScript rendering but no clicking, scrolling, or similar interaction, try requests-html. It uses pure XPath, as lxml does, and should be faster than the two browser automation tools (playwright and selenium).

  • If you need to do complex page manipulation on a dynamic page, try playwright or selenium.

  • If you are scraping non-dynamic pages (not rendered via JavaScript), try selectolax over bs4, lxml, or parsel. It’s a lot faster, uses less memory, and has almost identical syntax to parsel or bs4. A hidden gem, I would say.

  • If you need to use XPath in your parser, try either lxml or parsel. parsel is built on top of lxml, translates every CSS query to XPath, and can combine (chain) CSS and XPath queries. However, lxml is faster.
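
To make the XPath recommendation more concrete, here is a minimal sketch of extracting data from static HTML with lxml. The sample markup, the class names, and the `extract_products` helper are all assumptions made up for illustration; they do not come from the article or its benchmarks.

```python
from lxml import html

# Hypothetical sample markup (an assumption for this sketch, not from the article).
SAMPLE_HTML = """
<html><body>
  <div class="product">
    <a class="title" href="/item/1">First item</a>
    <span class="price">10.50</span>
  </div>
  <div class="product">
    <a class="title" href="/item/2">Second item</a>
    <span class="price">7.25</span>
  </div>
</body></html>
"""


def extract_products(markup):
    """Parse static HTML and pull title, link and price out of each product block."""
    tree = html.fromstring(markup)
    products = []
    # Select every product container, then query relative to each one.
    for node in tree.xpath('//div[@class="product"]'):
        products.append({
            # string(...) returns the text content of the first matching node.
            "title": node.xpath('string(.//a[@class="title"])'),
            # Attribute queries return a list, so take the first match.
            "link": node.xpath('.//a[@class="title"]/@href')[0],
            "price": float(node.xpath('string(.//span[@class="price"])')),
        })
    return products


products = extract_products(SAMPLE_HTML)
print(products)
```

This only works for pages whose content is already present in the HTML response. For pages that need JavaScript rendering, the recommendations above point to requests-html, playwright, or selenium instead.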

Excellent read with charts and code to complement the comparison of each package!


Tags python programming web-development app-development performance