September 22, 2018

downlink: scrape documents linked from a web page

For my work as a data editor I often have to download all of the files linked from a particular web page. That’s why I made downlink, a command-line utility and a Python library.

The main product is the command-line utility, called downlink. The secondary product is the library of classes used to build it. The classes are reasonably extensible, so you could build them into more powerful tools.

Quickstart

$ pip install downlink
$ mkdir output
$ downlink "https://www.ct.gov/doh/cwp/view.asp?a=4513&q=530462" output

I haven’t put in any error handling yet, so if something goes wrong (for example, the web address you give it is invalid, or the output folder doesn’t exist or isn’t writable), it’ll just crash. That’s on my TODO list.
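Until that is fixed, the two failure cases mentioned above can be checked before downlink is ever called. This is only a rough sketch and not part of downlink itself: check_inputs and safe_downlink are made-up names, and the only thing it assumes about downlink is the `downlink url dst` command line shown in the Quickstart.

```python
import os
import subprocess
import sys
from urllib.parse import urlparse


def check_inputs(url, dst):
    """Return a list of problems with the inputs; an empty list means they look OK."""
    problems = []
    parts = urlparse(url)
    # A usable web address needs an http(s) scheme and a host name.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("not an http(s) URL: %r" % url)
    if not os.path.isdir(dst):
        problems.append("output directory does not exist: %r" % dst)
    elif not os.access(dst, os.W_OK):
        problems.append("output directory is not writable: %r" % dst)
    return problems


def safe_downlink(url, dst):
    """Hypothetical wrapper: run downlink only if the inputs pass the checks."""
    problems = check_inputs(url, dst)
    if problems:
        for problem in problems:
            print(problem, file=sys.stderr)
        return 1
    # Hand off to the real command-line tool and pass its exit code through.
    return subprocess.call(["downlink", url, dst])
```

This does not catch everything (the server can still be down, for instance), but it turns the two most likely crashes into readable messages.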

Here’s the output from downlink --help:

  usage: downlink [-h] [--ext EXT] url dst

  Download all the links of a given file extension from a web page.

  positional arguments:
    url         A valid URL to download links from
    dst         A path to a directory in which to save the files

  optional arguments:
    -h, --help  show this help message and exit
    --ext EXT   the file extension/type of file to download
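Help text in this shape is what Python's argparse module generates. As a guess at how such an interface could be declared (not a copy of downlink's actual source), the sketch below produces an equivalent parser. The build_parser name and the None default for --ext are assumptions, since the help text does not show the real default.

```python
import argparse


def build_parser():
    """Build a parser matching the usage line: downlink [-h] [--ext EXT] url dst"""
    parser = argparse.ArgumentParser(
        prog="downlink",
        description="Download all the links of a given file extension "
                    "from a web page.",
    )
    parser.add_argument("url", help="A valid URL to download links from")
    parser.add_argument(
        "dst", help="A path to a directory in which to save the files"
    )
    # Assumption: the real default extension is not shown in the help output,
    # so None is used here as a placeholder.
    parser.add_argument(
        "--ext", default=None,
        help="the file extension/type of file to download",
    )
    return parser
```

Calling build_parser().parse_args(["https://example.com/", "output", "--ext", "pdf"]) gives an object with url, dst and ext attributes. On Python 3.10 and later, argparse titles the second group "options:" rather than "optional arguments:".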

Abstraction outline

In the GitHub repo, the two “library” files are linkscraper.py and document_linkscraper.py, which contain the classes LinkScraper and DocumentLinkScraper, respectively.

The idea with this abstraction is that LinkScraper fetches an HTML document and pulls out all of the links (the a tags). It has a filter_links method, but the default is to leave them all in.

DocumentLinkScraper subclasses LinkScraper and overrides the filter_links method so that it returns True only for links to documents with a given file extension (specified when instantiating the object).
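To make the relationship between the two classes concrete, here is a stripped-down sketch of the pattern using only the standard library. It is an illustration and not the code in the repo: the class names and filter_links come from the description above, while the constructor arguments and the fetch, links_from_html and scrape methods are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class _AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class LinkScraper:
    def __init__(self, url):
        self.url = url

    def fetch(self):
        # Hypothetical method: download the page and return its HTML as text.
        with urlopen(self.url) as response:
            return response.read().decode("utf-8", errors="replace")

    def filter_links(self, link):
        # Default behaviour: keep every link.
        return True

    def links_from_html(self, html):
        collector = _AnchorCollector()
        collector.feed(html)
        # Resolve relative hrefs against the page URL before filtering.
        absolute = [urljoin(self.url, href) for href in collector.hrefs]
        return [link for link in absolute if self.filter_links(link)]

    def scrape(self):
        return self.links_from_html(self.fetch())


class DocumentLinkScraper(LinkScraper):
    def __init__(self, url, ext):
        super().__init__(url)
        # Normalise "pdf", ".pdf" and "PDF" to the same suffix.
        self.ext = "." + ext.lower().lstrip(".")

    def filter_links(self, link):
        # Look at the URL path only, so a query string does not hide the extension.
        return urlparse(link).path.lower().endswith(self.ext)
```

With this layout, DocumentLinkScraper("https://example.com/page/", "pdf").scrape() would return only the PDF links on that page, and the subclass changes nothing except the filter.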

This abstraction could be useful if you want to do something else with other types of links, perhaps links that contain onclick attributes.
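Filtering on something like onclick means the filter has to see the whole tag, not just its href. The sketch below shows one way that could look. Both classes are hypothetical and written for this example: AttrLinkScraper stands in for a base class that passes each a tag's attributes to filter_links as a dict, which may not be how the real LinkScraper represents links.

```python
from html.parser import HTMLParser


class AttrLinkScraper(HTMLParser):
    """Hypothetical base: keeps each <a> tag's attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def filter_links(self, attrs):
        # Default behaviour: keep every link.
        return True

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_dict = dict(attrs)
            if self.filter_links(attr_dict):
                self.links.append(attr_dict)


class OnclickLinkScraper(AttrLinkScraper):
    def filter_links(self, attrs):
        # Keep only the links that carry an onclick handler.
        return "onclick" in attrs


scraper = OnclickLinkScraper()
scraper.feed('<a href="a.html">a</a><a href="#" onclick="openDoc(7)">b</a>')
print(scraper.links)
```

Only the second link survives the filter, because it is the only one with an onclick attribute.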

One thing that bugs me about this design is that LinkScraper fetches remote HTML. I feel like that functionality belongs in a different object, but I did not want to over-engineer this.
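One lightweight way to pull the fetching out, without adding much machinery, would be to pass the fetch function in as an argument. The sketch below shows the idea. It is not how downlink is written today, and fetch_html and LinkExtractor are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class _HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class LinkExtractor:
    """Hypothetical: knows how to find links, not how to download pages."""

    def __init__(self, fetch=fetch_html):
        # Any callable that maps a URL to an HTML string will do.
        self.fetch = fetch

    def links(self, url):
        collector = _HrefCollector()
        collector.feed(self.fetch(url))
        return collector.hrefs
```

The practical payoff is testing: LinkExtractor(fetch=lambda url: '<a href="one.pdf">1</a>').links("https://example.com/") runs without touching the network.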