Goals:

An introduction to web scraping, or how one can efficiently collect large amounts of information from the Internet.

Software:

  • wget (a free command-line utility for downloading files from the web)

Class:

  • practical examples of working with wget
  • single link download
  • batch download
    • web-page analysis
    • extraction of links with regular expressions
    • modification of links with regular expressions (a Python sketch of these two steps follows the sample commands below)

Sample commands

wget link
wget -i file_with_links.txt
wget -i links.txt -P ./folderYouWantToSaveTo/ -nc 

Where:

  • -i is an input-file parameter, which instructs wget to read the links to download from the given text file, one link per line.
  • -P is a directory-prefix parameter, which instructs wget where you want to store the downloaded files (optional).
  • -nc is a no-clobber parameter, which instructs wget to skip files that already exist (optional).

NB: there are many other parameters with which you can adjust wget to your needs.
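
To illustrate the batch-download steps from the class outline (analyse the page, extract the links, modify them, then hand the list to wget -i), here is a minimal Python sketch. The page address, the regular expression, and the ".html" filter are illustrative assumptions, not part of the lesson; adapt them to the page you are actually scraping.

# A minimal sketch of extracting and modifying links with regular expressions.
# The URL, the pattern, and the filter below are placeholders for illustration only.
import re
import urllib.request

page_url = "http://www.example.com/collection.html"   # hypothetical page to analyse
html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")

# Extraction: pull every href value out of the downloaded HTML.
links = re.findall(r'href="([^"]+)"', html)

# Modification: turn relative links into absolute ones and keep only the pages we need.
cleaned = []
for link in links:
    if link.startswith("/"):                           # relative link -> absolute link
        link = "http://www.example.com" + link
    if link.endswith(".html"):                         # example filter: keep only .html pages
        cleaned.append(link)

# One link per line, ready for: wget -i links.txt -P ./folderYouWantToSaveTo/ -nc
with open("links.txt", "w") as f:
    f.write("\n".join(cleaned) + "\n")

The resulting links.txt is exactly the kind of file_with_links.txt that the sample commands above expect.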

Examples for Downloading

Practice 1: very easy

Practice 2: easy-ish

Practice 3 (aka Homework): a tiny bit tricky

Reference Materials:

Homework:

  1. Scraping the “Dispatch”: download the issues of the “Richmond Times Dispatch” (years 1860-1865 only!) that are available at http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
  2. Publish a step-by-step explanation of what you have done as a blog post on your website.
  3. Codecademy’s Learn Python, Units 7-8.
  4. GitHub: publish the confirmation screenshot as a post on your new site.