23. Scraping & Display

Overview

The objective of this assignment is to introduce basic tools and techniques for automatic content extraction, parsing and display, such as are commonly encountered in data-intensive research projects. We first consider Ben Fry’s overview of the process of data science, from his doctoral thesis:

Screen Shot 2014-01-23 at 6.43.44 AM

In this assignment, you are asked to locate an interesting online source of data, scrape some of it (with code written in Python) and display it in the course WordPress (with Javascript and D3).

Learning Objectives
Per Ben Fry’s formulation: Upon completion of this assignment, you will be able to acquire, parse, and represent data using popular web-scraping and data presentation libraries.


Scraping, Ethics, and Law

Here we enter a gray area. Is web scraping illegal? It is no exaggeration to say that people have certainly lost their liberty and even their lives for scraping data. In class we will discuss the ethics and tactics of data scraping. Before we begin, please read about the brilliant and creative Aaron Swartz, who was arrested for downloading too many publicly-funded academic journal articles, with tragic consequences. Read:

Carl Malamud, “rogue archivist”, presented this eulogy for Aaron Swartz:

Another challenging example is that of the hacker Weev, who faces 5 years in jail for scraping improperly secured personal email addresses from AT&T. To avoid this sort of situation in our class, we will establish the policy that you may not scrape personal data (e.g. contact, personal or financial data about customers, credit card information, etc.) for anyone other than yourself.


The Work.

You are asked to do the following:

  • Read the three articles:
    1. Taxonomy of Data Science tasks, by Hilary Mason and Chris Wiggins, with special attention to their sections on obtain and scrub.
    2. Also read the Guerilla Open Access Manifesto by Aaron Swartz.
    3. Also read Art and the API by Jer Thorp.
  • Identify an interesting online website, public data API, or database that manifests in scrape-able HTML. This can be anything you wish — for example, you might fetch public civic data, or possibly a source of data relevant to your upcoming Quantified Selfie project. Or you might fetch something weird and funny, such as the real names of hip-hop artists.
  • Create a scraper to automate the extraction and downloading of this data. NEW! You have four choices for doing so:
    1. create a custom scraper written from scratch (we recommend the Beautiful Soup python library);
    2. write code using a public API of your choice;
    3. use Temboo, which provides interfaces from 7 languages to more than 100 API’s; or
    4. use the KimonoLabs tool.

    The enthusiastic recommendation of a wide consensus of data scientists is that the Beautiful Soup Python library is the premier tool for creating a custom data scraper from scratch. As an alternative to Beautiful Soup — I have decided that you may alternatively use any public API you please. Consider using the new KimonoLabs tool for automated data scraping, or the Temboo library for API access in your favorite language.

  • Scrape some data. For heaven’s sake, don’t scrape it all. A thousand data points oughta be sufficient for the purposes of this assignment. And please: don’t forget to put in a couple seconds’ delay between queries — otherwise, you may get banned, or worse.
  • Display your data, using D3, in a blog post categorized 23-scraping-and-display.  Here is a page full of resources I have prepared for you, to help you do so. I have also prepared this page about Sheetsee, a cool tool that lets you use Google spreadsheets as a datasource for D3 visualizations. Keep your display simple and perfunctory. You are asked to create the display in order to complete a technical workflow. Although lovely, informative and insightful diagrams are always appreciated, your display will not be critiqued on these terms for this assignment.
  • Also in the blog post, discuss your project in 100-200 words. What did you learn? What challenges did you overcome?
  • Also in the blog post, include a screen capture (JPG or PNG) of your display, just in case the D3 stops working someday.
  • Link to the Github repository that contains all of your code for this project.

Optional helpful readings: