John Mars: Data Scraping

Popular music has a reputation for being repetitive: it seems like almost every song uses the same chord progression. Is this really the case? Has it always been like this?

This project took place in two parts: scraping the Billboard Hot 100 for every year since 1940, and then using that data to get each song’s chord progression.

The Billboard information was gathered from this website with a custom Python scraper. I was lucky that the site had a consistent URL scheme for finding the pages, and all but a few of the pages had a consistent table layout from which to scrape the data. A couple of years were inconsistent, so I had to download those pages as HTML files, do some manual touch-up with multiple cursors in Sublime Text, and send them back through the scraper as strings instead of actual requests. By the end of that phase, my data structure looked like this, but with 5,400 entries:

{
  "songs": [
    {
      "title": "White Christmas",
      "artist": "Bing Crosby",
      "rank": 1,
      "year": 1940
    },
    {
      "title": "The Christmas Song",
      "artist": "Nat \"King\" Cole",
      "rank": 2,
      "year": 1940
    },
    …
    {
      "title": "Somethin' Bad",
      "artist": "Miranda Lambert and Carrie Underwood",
      "rank": 99,
      "year": 2014
    },
    {
      "title": "Adore You",
      "artist": "Miley Cyrus",
      "rank": 100,
      "year": 2014
    }
  ]
}
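To make the string-based path concrete, here is a minimal sketch of turning one chart page, already loaded as a string, into entries of the shape shown above. It is not the project's actual scraper: the `ChartTableParser` and `parse_chart` names, the sample markup, and the assumed column order (rank, artist, title) are all hypothetical, since the real page layout is not shown here.

```python
from html.parser import HTMLParser


class ChartTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with only <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None


def parse_chart(html, year):
    """Turn one year's chart page (as a string) into a list of song dicts."""
    parser = ChartTableParser()
    parser.feed(html)
    parser.close()
    songs = []
    for cells in parser.rows:
        if len(cells) < 3 or not cells[0].isdigit():
            continue  # not a chart row
        rank, artist, title = cells[:3]  # assumed column order
        songs.append(
            {"title": title, "artist": artist, "rank": int(rank), "year": year}
        )
    return songs


# Hypothetical markup standing in for a downloaded, hand-cleaned page.
sample = """
<table>
  <tr><th>Rank</th><th>Artist</th><th>Title</th></tr>
  <tr><td>1</td><td>Bing Crosby</td><td>White Christmas</td></tr>
  <tr><td>2</td><td>Nat "King" Cole</td><td>The Christmas Song</td></tr>
</table>
"""
songs_1940 = parse_chart(sample, 1940)
```

Because `parse_chart` only needs a string, the same function would work whether the HTML came from a live request or from a file that was touched up by hand first.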

Phase two used that data to scrape chord information from a guitar tab website. Again, I was extremely lucky that the site used a consistent URL naming scheme. Unfortunately, not all of my data fit that scheme, so my final data has holes: songs without a matching page are left with empty "chords" and "chord_url" values. It’s entirely possible that I could find the missing data points by traversing search results, but that’s outside the scope of this project for now. After extracting the chord information from the website, again with a custom scraper, my structure looked like this (the full file is 200,679 lines long):

{
  "songs": [
    {
      "chords": [
        "G",
        "C",
        "D",
        "D7",
        "C",
        "D",
        "G",
        "G",
        "B",
        "Em",
        "G",
        "G",
        "Em",
        "D",
        "G",
        "C",
        "D",
        "D7",
        "C",
        "D",
        "G",
        "C",
        "Em",
        "C",
        "G",
        "D",
        "G",
        "G",
        "C",
        "D",
        "D7",
        "C",
        "D",
        "G",
        "C",
        "Em",
        "C",
        "G",
        "D",
        "G"
      ],
      "chord_url": "http://tabs.ultimate-guitar.com/b/Bing_Crosby/White_Christmas_crd.htm",
      "title": "White Christmas",
      "rank": 1,
      "year": 1940,
      "artist": "Bing Crosby"
    },
    {
      "chords": "",
      "chord_url": "",
      "title": "The Christmas Song",
      "rank": 2,
      "year": 1940,
      "artist": "Nat \"King\" Cole"
    },
    {
      "chords": [
        "Ebmaj7",
        "Abmaj7",
        "Ebmaj7",
        "Abmaj7",
        "Eb7",
        "Eb7",
        "Abmaj7",
        "Ebmaj7",
        "F7",
        "C7",
        "Bb7",
        "Ebmaj7",
        "Abmaj7",
        "Ebmaj7",
        "Abmaj7",
        "Ebmaj7",
        "Abmaj7",
        "Eb7",
        "Eb7",
        "Abmaj7",
        "Ebmaj7",
        "F7",
        "C7",
        "Bb7",
        "Ebmaj7",
        "Abmaj7",
        "Ebmaj7",
        "Ab7",
        "G7",
        "D7",
        "G7",
        "C7",
        "Bb7",
        "Ebmaj7",
        "Abmaj7",
        "Ebmaj7",
        "Abmaj7",
        "Eb7",
        "Eb7",
        "Abmaj7",
        "Ebmaj7"
      ],
      "chord_url": "http://tabs.ultimate-guitar.com/b/Billie_Holiday/God_Bless_The_Child_crd.htm",
      "title": "God Bless The Child",
      "rank": 3,
      "year": 1940,
      "artist": "Billie Holiday"
    },
    {
      "chords": "",
      "chord_url": "",
      "title": "Take The \"A\" Train",
      "rank": 4,
      "year": 1940,
      "artist": "Duke Ellington"
    },
    …
    {
      "chords": [
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "Am",
        "C",
        "F",
        "Am",
        "F",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "F",
        "C",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C",
        "F",
        "Am",
        "C"
      ],
      "chord_url": "http://tabs.ultimate-guitar.com/m/Miley_Cyrus/Adore_You_crd.htm",
      "title": "Adore You",
      "rank": 100,
      "year": 2014,
      "artist": "Miley Cyrus"
    }
  ]
}
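The URL scheme can be read off the three "chord_url" values above: the lowercased first letter of the artist, then the artist and the title with spaces turned into underscores, then a `_crd.htm` suffix. The sketch below rebuilds that pattern and shows how a miss could end up as a hole in the data. The function names and the `fetch_chords` callable are hypothetical, and how the real scraper handled punctuation in names is not known, so this version leaves it untouched.

```python
BASE = "http://tabs.ultimate-guitar.com"


def slug(text):
    """Join the whitespace-separated words of a name with underscores."""
    return "_".join(text.split())


def chord_url(artist, title):
    """Guess a song's tab page URL from the pattern seen in the data."""
    artist_slug = slug(artist)
    if not artist_slug:
        raise ValueError("artist must not be empty")
    first_letter = artist_slug[0].lower()
    return f"{BASE}/{first_letter}/{artist_slug}/{slug(title)}_crd.htm"


def attach_chords(song, fetch_chords):
    """Return a copy of `song` with chord data, or empty values on a miss.

    `fetch_chords` is a hypothetical callable that takes a URL and returns
    a list of chord names, or None when the page does not exist.
    """
    url = chord_url(song["artist"], song["title"])
    chords = fetch_chords(url)
    if chords:
        return {**song, "chords": chords, "chord_url": url}
    # Same convention as the data above: holes are stored as empty strings.
    return {**song, "chords": "", "chord_url": ""}


# Stand-in for the network: one known page, with a truncated chord list.
known_pages = {
    chord_url("Bing Crosby", "White Christmas"): ["G", "C", "D", "D7"],
}


def fake_fetch(url):
    return known_pages.get(url)


found = attach_chords(
    {"title": "White Christmas", "artist": "Bing Crosby", "rank": 1, "year": 1940},
    fake_fetch,
)
missing = attach_chords(
    {"title": "The Christmas Song", "artist": 'Nat "King" Cole', "rank": 2, "year": 1940},
    fake_fetch,
)
```

Passing the fetcher in as an argument keeps the URL logic testable without touching the network, and a search-based fallback could later be swapped in for the songs that come back empty.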

For visualization or representation, the natural fit is something sonic. Someone has already made an application that explores related chords in popular music, which is pretty cool. Building something similar with time as a main dimension might be a good avenue: a “what does this era sound like” type of thing, or maybe album covers sortable by progression.
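One possible starting point for the time-based idea is to count short chord runs per decade. The sketch below is only an assumption about how that could be done with the structure above, not something the project already does. The `progressions_by_decade` name and the window length are hypothetical, and chords are compared by their written names without transposing songs to a common key, which a real comparison would probably need.

```python
from collections import Counter, defaultdict


def progressions_by_decade(songs, length=4):
    """Count every run of `length` consecutive chords, grouped by decade."""
    counts = defaultdict(Counter)
    for song in songs:
        chords = song.get("chords")
        if not chords:  # holes in the data are stored as ""
            continue
        decade = song["year"] // 10 * 10
        for start in range(len(chords) - length + 1):
            run = tuple(chords[start:start + length])
            counts[decade][run] += 1
    return counts


# Two entries in the shape shown above; the chord list is truncated
# to the first seven chords of the "Adore You" entry.
sample_songs = [
    {"title": "The Christmas Song", "year": 1940, "chords": ""},
    {
        "title": "Adore You",
        "year": 2014,
        "chords": ["C", "F", "Am", "C", "F", "Am", "C"],
    },
]
counts = progressions_by_decade(sample_songs, length=3)
top_2010s = counts[2010].most_common(3)
```

Songs with no chord data are skipped rather than counted as empty, so decades with many holes will simply have less evidence behind them.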

One of the awesome things about this data set and the way I’ve constructed it is that it’s easily expandable. Want tempos for each song? Easy. Album cover? Okay. Genre? Done.

The project source is available here, and the full JSON data is here (it’s only about 5MB).