scrapetools.py - for pulling down all linked files from a page
I wrote some Python functions to help download every file linked to on a web page.
Here’s the repo
It’s called scrapetools.py, and it has a modest four functions:
- makedir(directory) - creates a directory if it doesn’t already exist. Useful for setting up your output directory structure programmatically.
- get(url) - wraps a requests.get() call and raises an exception when the response status != 200; otherwise returns the response content.
- download_bin(url, output_file) - downloads a file, again using requests, and saves it to output_file (which is a path string, not a file handle).
- get_files(html, base_url=lambda x: x, match_term=".csv", fname=lambda x: x) - downloads every file linked to in html whose URL contains match_term, which defaults to ".csv". base_url is a function that takes a matched URL and expands it into a full URL, and fname is a function that takes a URL and generates the local filename to save the file as on the local machine. Both default to the identity function, i.e. they do nothing.
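To make the four functions concrete, here is a minimal sketch of what the module might look like. The bodies are my reading of the descriptions above, not the actual repo code - in particular, the href regex and the bytes-decoding guard in get_files are assumptions.

```python
# Sketch of scrapetools.py based on the descriptions above; function
# bodies are assumptions, not the repo's actual implementation.
import os
import re
import requests

def makedir(directory):
    """Create directory if it doesn't already exist."""
    if not os.path.exists(directory):
        os.makedirs(directory)

def get(url):
    """GET url; raise when the response status != 200, else return content."""
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("GET %s returned status %d" % (url, response.status_code))
    return response.content

def download_bin(url, output_file):
    """Download url and save it to the path output_file."""
    with open(output_file, "wb") as f:
        f.write(get(url))

def get_files(html, base_url=lambda x: x, match_term=".csv", fname=lambda x: x):
    """Download every link in html whose URL contains match_term.

    base_url maps each matched href to a full URL; fname maps it to a
    local filename. Both default to the identity function.
    """
    if isinstance(html, bytes):  # assumption: tolerate get()'s raw bytes
        html = html.decode("utf-8", "replace")
    for href in re.findall(r'href="([^"]+)"', html):
        if match_term in href:
            download_bin(base_url(href), fname(href))
```

The two callback parameters are the design's main trick: get_files stays generic, and each scraping job supplies small functions that know that site's URL layout.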
example: seec.py
The example seec.py demonstrates how to use the module to download all of the CSV files linked to on the Connecticut State Elections Enforcement Commission’s disbursement and receipt data page.
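In a script like seec.py, the interesting work lives in the base_url and fname hooks passed to get_files. Here is a hypothetical pair; the page URL is a placeholder, not the real SEEC data page, and the helper names are mine, not the repo's.

```python
# Hypothetical hooks for a seec.py-style script. PAGE_URL is a
# placeholder, not the real SEEC disbursement/receipt page.
import os
import posixpath

PAGE_URL = "http://www.example.com/seec/data.html"  # placeholder URL

def seec_base_url(href):
    """Resolve a relative href against the page's directory."""
    return posixpath.join(posixpath.dirname(PAGE_URL), href)

def seec_fname(href):
    """Save each CSV under a local csv/ directory, keeping its basename."""
    return os.path.join("csv", posixpath.basename(href))

# With those hooks, the whole scrape reduces to three calls:
#   makedir("csv")
#   html = get(PAGE_URL)
#   get_files(html, base_url=seec_base_url, fname=seec_fname)
```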