jake kara, software engineer

January 27, 2017

scrapetools.py - for pulling down all linked files from a page

I wrote some Python functions to help download every file linked to on a web
page.

Here’s the repo

It’s called scrape_tools.py, and it has a modest four functions:

  • makedir(directory) - create a directory if it doesn’t exist. Useful for
    setting up your output directory structure programmatically.

  • get(url) - wraps a requests.get() call and raises an exception when the
    response status != 200. Otherwise it returns the content.

  • download_bin(url, output_file) - downloads a file, again using requests,
    and saves it to output_file (which is a path string, not a file handle).

  • get_files(html, base_url=lambda x: x, match_term=".csv", fname=lambda x: x)
    - get all files linked to in html whose links contain match_term, which
    defaults to ".csv". base_url is a function that takes a url and generates
    a base url, and fname is a function that takes a url and generates the
    filename to save the file as on the local machine. Both default to
    identity functions that return their input unchanged.
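
To make the base_url and fname hooks a bit more concrete, here is a minimal sketch of how a function with the get_files signature could pick out links and map them to local filenames. It is not the code from the repo: the plan_downloads name, the use of the standard library's html.parser, and the guess that base_url is applied to each matching href are all assumptions for illustration. It only returns (url, filename) pairs and downloads nothing.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def plan_downloads(html, base_url=lambda x: x, match_term=".csv",
                   fname=lambda x: x):
    """Hypothetical helper mirroring the get_files signature.

    Returns a list of (url, local_filename) pairs for every link whose
    href contains match_term, without downloading anything.
    """
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()

    plan = []
    for href in parser.hrefs:
        if match_term in href:
            # Assumption: base_url turns the href into the URL to fetch.
            url = base_url(href)
            plan.append((url, fname(url)))
    return plan


page = '<a href="/data/receipts.csv">Receipts</a> <a href="/about.html">About</a>'

plan = plan_downloads(
    page,
    # Prefix relative links with a placeholder host.
    base_url=lambda href: "https://example.com" + href,
    # Keep only the last path segment as the local filename.
    fname=lambda url: url.rsplit("/", 1)[-1],
)
```

With the defaults left in place, both hooks pass the href through untouched, so the local filename would simply be the href itself.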

example: seec.py

The example seec.py demonstrates how to use the file to download all of the
CSV files linked to on the Connecticut State Elections Enforcement
Commission’s disbursement and receipt data page.
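
A script along the lines of seec.py might wire the pieces together roughly like this. This is a sketch under stated assumptions, not the actual seec.py: the page URL is a placeholder rather than the real SEEC address, the download_all name is hypothetical, and the fetch and save steps are passed in as functions so the flow can be tried offline before swapping in requests-based helpers such as get and download_bin.

```python
import os
import re
import tempfile
from urllib.parse import urljoin

# Placeholder URL, not the real SEEC data page.
PAGE_URL = "https://example.com/seec/data.html"


def download_all(page_url, out_dir, fetch, save, match_term=".csv"):
    """Save every link on page_url that contains match_term into out_dir.

    fetch(url) returns the page HTML as text; save(url, path) writes one
    file. Returns the list of local paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)  # same role as makedir(directory)
    html = fetch(page_url)

    saved = []
    # A simple href regex: enough for a sketch, not a full HTML parser.
    for href in re.findall(r'href="([^"]+)"', html):
        if match_term not in href:
            continue
        url = urljoin(page_url, href)  # resolve relative links
        path = os.path.join(out_dir, os.path.basename(href))
        save(url, path)
        saved.append(path)
    return saved


# Offline dry run with stand-in fetch and save functions.
def fake_fetch(url):
    return '<a href="files/receipts.csv">r</a> <a href="notes.pdf">n</a>'


def fake_save(url, path):
    # Record the source URL instead of downloading real bytes.
    with open(path, "w") as f:
        f.write(url)


out_dir = os.path.join(tempfile.mkdtemp(), "seec")
saved = download_all(PAGE_URL, out_dir, fake_fetch, fake_save)
```

Passing fetch and save in as arguments keeps the link-matching logic separate from the network calls, which makes it easy to check what would be downloaded before hitting a real site.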