Andrew Russell

11 Feb 2014

For the scraping assignment, I decided to figure out which words were most commonly used in XKCD’s alt-text. Being a big fan of Python, I decided to use Python and Beautiful Soup to scrape the data.

XKCD uses a very simple URL scheme (http://xkcd.com/<comic_number>). I simply figured out what was the most recent comic number manually, then, in a python loop, went from 1-1328 and downloaded the html of each page. Once that was done, I loaded each page in Beautiful Soup to extract the alt-text from the comic. Each comic page contains a <div id=”comic”>…</div> which has an <img> inside that. The <img> tag contains the ‘title=”…”‘ attribute which is the alt-text.  I then created a file where each line contained a single comic’s alt-text.

I then created a second Python script to parse the alt-text data. This script loaded the file I had created earlier, removed all punctuation, and counted every single word.  It then output a csv file of each word and their count.

I then loaded the csv file into d3 and, using d3.layout.cloud.js, created a word cloud of the alt-text words.  To get the word cloud to look nicer, I removed all common English words (eg, the, and) and removed all words less than length two.  The result is shown below.

cloud

I did not imbed a running example of the d3 code since it takes a few minutes to parse the csv file.  However, all code, data, and instructions on how to run it can be found on my Github.