WebMage: Web Scraping Tool – Office of Digital Humanities

Description

Our office handles data frequently, and web scraping is a useful way to gather it. Web scraping is the use of a computer script to retrieve web pages, usually many of them, and extract data from them. Python has several modules that are useful for web scraping, such as requests, BeautifulSoup, and selenium. However, none of them provides convenient built-in methods for common interactions with dynamic webpages.

WebMage is a Python module currently in development to add useful methods for web scraping. Some of these methods include:

Infinite Scrolling

Some webpages load more content when you scroll to the bottom of the page. An infinite scroll method keeps scrolling to the bottom until the page stops loading more content.
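The idea can be sketched with plain Selenium calls. This is not WebMage's own API: the function name scroll_to_bottom and its pause and max_rounds parameters are hypothetical, and the sketch assumes a page whose document height grows whenever new content loads.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver (anything with an
    `execute_script` method works). Returns how many scrolls loaded
    new content. The name and parameters are illustrative only.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):  # hard cap so a truly endless feed cannot loop forever
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render more content
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # height unchanged: nothing more was loaded
        last_height = new_height
        loaded += 1
    return loaded
```

With a real browser this might be called as scroll_to_bottom(webdriver.Chrome()) after loading a page with driver.get. A fixed pause is the simplest choice; slow pages may need a longer pause or an explicit wait.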

Opening Content in Another Tab

While scrolling, you might want to open a hyperlink temporarily in another tab. This comes in handy when you don’t want to lose your place on the page.
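One way to get this behavior with Selenium 4 is a context manager that opens a tab, yields control, and then returns to the original tab. This is a sketch under assumptions, not WebMage's API: temporary_tab is a hypothetical name, and switch_to.new_window requires Selenium 4 or later.

```python
from contextlib import contextmanager


@contextmanager
def temporary_tab(driver, url):
    """Open `url` in a new tab, then close it and return to the original tab.

    `driver` is assumed to be a Selenium 4 WebDriver. The helper name is
    illustrative only.
    """
    original = driver.current_window_handle  # remember where we were
    driver.switch_to.new_window("tab")       # open a new tab and focus it
    try:
        driver.get(url)
        yield driver
    finally:
        driver.close()                       # closes only the new tab
        driver.switch_to.window(original)    # the original tab keeps its scroll position
```

Usage would look like `with temporary_tab(driver, link_url) as tab:` followed by whatever extraction is needed inside the block. The try/finally ensures the original tab is restored even if loading or parsing the linked page raises an error.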

Taking a Screenshot of an Element

It is usually better to download an image than to take a screenshot of it, but when downloading is not practical, an alternative is to take a screenshot of just that part of the page.
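Selenium itself can capture a single element, so a minimal sketch needs only two calls. The function name screenshot_element and its parameters are hypothetical and do not represent WebMage's API.

```python
def screenshot_element(driver, css_selector, path):
    """Save a PNG screenshot of the first element matching `css_selector`.

    `driver` is assumed to be a Selenium WebDriver. The helper name is
    illustrative only. Returns True if the file was written.
    """
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant,
    # used here so the sketch has no import-time dependency.
    element = driver.find_element("css selector", css_selector)
    # WebElement.screenshot writes a PNG of just this element; `path` should end in .png.
    return element.screenshot(path)
```

For example, screenshot_element(driver, "img.chart", "chart.png") would capture the first element matching that selector. In regular Selenium code, passing By.CSS_SELECTOR from selenium.webdriver.common.by is the more readable choice.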

Check out the WebMage documentation for information on how to use WebMage. Like other Python modules, WebMage is free to install, and you can use it for your own web scraping needs.

WebMage is currently being used for projects within the Office of Digital Humanities, WordCruncher, and Digital Humanities 260.

Project Type: Service

Project URL: pypi.org/project/webmage/

Status: Development