Goals:
Introduction to web scraping, or how to efficiently collect large amounts of information from the Internet.
Software:
- wget (https://www.gnu.org/software/wget/), a free software package for retrieving files via HTTP, HTTPS, FTP, and FTPS, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
- NB: on installing wget, see Ian Milligan’s “Automated Downloading with Wget”, https://programminghistorian.org/lessons/automated-downloading-with-wget
- Alternatively, for Windows: https://builtvisible.com/download-your-website-with-wget/
- On Mac (and, possibly, Linux): brew install wget
Class:
- practical examples of working with wget
- single link download
- batch download
- web-page analysis
- extraction of links with regular expressions
- modification of links with regular expressions
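The two regular-expression steps in the outline (extraction and modification of links) can be sketched in Python as an alternative to shell tools. The HTML fragment and the example.org root below are invented for illustration; a real page's markup will differ:

```python
import re

# A made-up HTML fragment standing in for a downloaded index page
html = """
<a href="/hopper/text?doc=1860-01-02">Jan 2, 1860</a>
<a href="/hopper/text?doc=1860-01-03">Jan 3, 1860</a>
<a href="/about.html">About</a>
"""

# 1) Extraction: pull every href value out of the page
links = re.findall(r'href="([^"]+)"', html)

# 2) Modification: keep only issue links and prepend a (hypothetical) site root
issue_links = ["http://example.org" + l
               for l in links if re.search(r'doc=\d{4}', l)]

for l in issue_links:
    print(l)
```

The resulting list of absolute URLs is exactly what `wget -i` expects: one link per line in a text file.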
Sample commands
wget link
wget -i file_with_links.txt
wget -i links.txt -P ./folderYouWantToSaveTo/ -nc
Where:
-P is a folder parameter, which instructs wget where to store the downloaded files (optional).
-nc is a no-clobber parameter, which instructs wget to skip files if they already exist (optional).
NB: there are many other parameters with which you can adjust wget to your needs.
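For the batch command above, file_with_links.txt can be written by hand or generated programmatically. A minimal Python sketch, assuming a made-up article URL pattern (the example.org addresses are placeholders, not real download targets):

```python
# Generate a list of (hypothetical) article URLs, one per line,
# suitable for: wget -i links.txt -P ./folderYouWantToSaveTo/ -nc
base = "http://example.org/articles/article_{:02d}.html"  # invented pattern
urls = [base.format(n) for n in range(1, 16)]  # Article 01 .. Article 15

with open("links.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

print(len(urls), "links written")
```

Running `wget -i links.txt -P ./downloads/ -nc` would then fetch each listed URL into ./downloads/, skipping any file already present.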
Examples for Downloading
Practice 1: very easy
- Article 01
- Article 02
- Article 03
- Article 04
- Article 05
- Article 06
- Article 07
- Article 08
- Article 09
- Article 10
- Article 11
- Article 12
- Article 13
- Article 14
- Article 15
Practice 2: easy-ish
- Article 16
- Article 17
- Article 18
- Article 19
- Article 20
- Article 21
- Article 22
- Article 23
- Article 24
- Article 25
- Article 26
- Article 27
- Article 28
- Article 29
- Article 30
- Article 31
- Article 32
- Article 33
- Article 34
- Article 35
- Article 36
- Article 37
- Article 38
- Article 39
Practice 3 (aka Homework): a tiny-bit tricky
- download issues of the “Richmond Times Dispatch” (years 1860-1865 only!), which are available at: http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
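One way to approach this task is to save the collection page, extract the issue links with a regular expression, filter them by year, and feed the result to wget. The sketch below assumes a simplified href format and uses a stand-in HTML snippet; the actual markup on perseus.tufts.edu will need to be inspected first:

```python
import re

# Stand-in for the downloaded collection page; the real href format on
# perseus.tufts.edu may differ, so treat this fragment as an assumption
html = """
<a href="text?doc=Perseus:text:2006.05.0001">1860-11-09</a>
<a href="text?doc=Perseus:text:2006.05.0002">1866-03-01</a>
"""

root = "http://www.perseus.tufts.edu/hopper/"

# Pair each relative link with its visible date, keep only 1860-1865 issues
pairs = re.findall(r'href="([^"]+)">(\d{4})-\d{2}-\d{2}<', html)
wanted = [root + href for href, year in pairs if 1860 <= int(year) <= 1865]

with open("dispatch_links.txt", "w") as f:
    f.write("\n".join(wanted) + "\n")
# The file can then be fed to wget, e.g.:
#   wget -i dispatch_links.txt -P ./dispatch/ -nc -w 2
# (-w 2 pauses two seconds between requests, to be polite to the server)
```

The year filter is the "tiny-bit tricky" part: the collection also contains post-1865 material, so the links must be filtered, not just extracted.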
Reference Materials:
- Milligan, Ian. 2012. “Automated Downloading with Wget.” Programming Historian, June. https://programminghistorian.org/lessons/automated-downloading-with-wget.
- Kurschinski, Kellen. 2013. “Applied Archival Downloading with Wget.” Programming Historian, September. https://programminghistorian.org/lessons/applied-archival-downloading-with-wget.
- Alternatively, this operation can be done with a Python script: Turkel, William J., and Adam Crymble. 2012. “Downloading Web Pages with Python.” Programming Historian, July. https://programminghistorian.org/lessons/working-with-web-pages.
Homework:
- Scraping the “Dispatch”: download issues of the “Richmond Times Dispatch” (years 1860-1865 only!), which are available at: http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
- Publish a step-by-step explanation of what you have done as a blog post on your website.
- Codecademy’s Learn Python, Unit 7-8.
- GitHub: publish the confirmation screenshot as a post on your new site.