XKCD Alt Text Word Cloud | Interactive Art & Computational Design, Spring 2014

For the scraping assignment, I decided to find out which words appear most often in XKCD’s alt-text. Being a big fan of Python, I used Python and Beautiful Soup to scrape the data.

XKCD uses a very simple URL scheme (http://xkcd.com/<comic_number>). I looked up the most recent comic number by hand, then, in a Python loop, went from 1 to 1328 and downloaded the HTML of each page. Once that was done, I loaded each page into Beautiful Soup to extract the alt-text from the comic. Each comic page contains a <div id="comic">…</div> with an <img> inside it. The <img> tag carries a title="…" attribute, which holds the alt-text. I then created a file in which each line contains a single comic’s alt-text.
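The extraction step described above can be sketched roughly as follows. This is not the original script: to keep the example free of third-party installs it uses the standard library's html.parser as a stand-in for Beautiful Soup, where the roughly equivalent lookup would be soup.find("div", id="comic").img["title"]. The names ComicTitleParser, extract_alt_text, comic_url and scrape are made up for illustration, and skipping pages that fail to download is an assumption.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser


class ComicTitleParser(HTMLParser):
    """Collect the title attribute of the first <img> inside <div id="comic">."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # <div> nesting depth inside the comic div (0 = outside)
        self.alt_text = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif attrs.get("id") == "comic":
                self.depth = 1
        elif tag == "img" and self.depth > 0 and self.alt_text is None:
            self.alt_text = attrs.get("title")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_alt_text(html):
    """Return the comic's title text, or None if the page has no comic image."""
    parser = ComicTitleParser()
    parser.feed(html)
    return parser.alt_text


def comic_url(number):
    return f"http://xkcd.com/{number}"


def scrape(first, last, out_path):
    """Download comics first..last and write one alt-text per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for number in range(first, last + 1):
            try:
                with urllib.request.urlopen(comic_url(number)) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except urllib.error.HTTPError:
                continue  # assumption: silently skip numbers that have no page
            alt = extract_alt_text(html)
            if alt:
                # Collapse whitespace so each comic stays on a single line.
                out.write(" ".join(alt.split()) + "\n")
```

Calling scrape(1, 1328, "alt_text.txt") would cover the range mentioned above; the output filename is a placeholder.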

I then wrote a second Python script to parse the alt-text data. This script loaded the file I had created earlier, removed all punctuation, and counted the occurrences of every word. It then wrote out a CSV file listing each word and its count.
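A minimal sketch of such a counting script might look like the following. The function names, the "word"/"count" header row, and the lowercasing of words are assumptions that the description above does not confirm. Note also that string.punctuation only covers ASCII punctuation, so curly quotes would survive this pass.

```python
import csv
import string
from collections import Counter


def count_words(lines):
    """Count words across lines after stripping ASCII punctuation."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        # Lowercasing is an assumption; it merges "The" and "the".
        counts.update(line.lower().translate(strip_punct).split())
    return counts


def write_counts(counts, csv_path):
    """Write word,count rows to a CSV file, most frequent word first."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common())
```

With the one-alt-text-per-line file open as f, write_counts(count_words(f), "counts.csv") would produce the CSV; both filenames are placeholders.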

I then loaded the CSV file into d3 and, using d3.layout.cloud.js, created a word cloud of the alt-text words. To make the word cloud look nicer, I removed all common English words (e.g., the, and) and all words shorter than two characters. The result is shown below.
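The filtering could just as well happen before the data reaches d3. As a hedged illustration, here is what that rule looks like in Python; the stop-word list is a tiny hypothetical stand-in for whatever list of common English words was actually used, and filter_counts is an invented name.

```python
# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}


def filter_counts(counts, stop_words=STOP_WORDS, min_length=2):
    """Drop common words and words shorter than min_length characters."""
    return {
        word: n
        for word, n in counts.items()
        if len(word) >= min_length and word.lower() not in stop_words
    }
```

For example, filter_counts({"the": 50, "i": 30, "raptor": 4}) keeps only "raptor".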

I did not embed a running example of the d3 code, since it takes a few minutes to parse the CSV file. However, all the code, data, and instructions for running it can be found on my GitHub.



An Advanced Studio in Arts Engineering and Freestyle Computing // Prof. Golan Levin, Carnegie Mellon University

Andrew Russell

11 Feb 2014