Emily Schwartzman – Lyrical Similarity

by ecschwar @ 4:30 am 10 May 2011

Lyrical Similarity 

What can you tell about an artist through their song lyrics? Typically songs and musical artists are analyzed on metrics based on the music, not on the song lyrics. This project explores patterns and similarities of musical artists based on song lyrics. Numerous studies and visualizations were created with various text analysis tools and techniques to look at a dataset of up to 100 song lyrics each for 95 artists, selected from Rolling Stone’s “Top 100 Artists of All Time.”


For this project I wanted to explore text-based visualizations. I decided to focus on song lyrics as my data source. Typically music is analyzed on the quality of the sound or the music itself, not from the perspective of the lyrics. I thought this would be an interesting opportunity to look at what the song lyrics might communicate or say about songs.

My initial inspiration came from other work done in the text visualization space. A couple of examples, shown below, are Stefanie Posavec’s Literary Organism project and Ben Fry’s On the Origin of Species project.

Another interesting project is Stephan Thiel’s Understanding Shakespeare. What I appreciate about these projects is the breadth of views that they looked at to examine and understand the data.

Fernanda Viegas and Martin Wattenberg created a visualization of song lyrics that looks at an interesting angle: parts of the body. This piece was part of their Fleshmap project and visualizes genres of songs based on how frequently different body parts are mentioned in the lyrics.


I downloaded song lyrics using the musixmatch API. Initially I intended to look at song lyrics over time, analyzing collections of individual songs from different artists by decade. However, I ran into an issue when I was acquiring the data from musixmatch. I was only able to get 30% of the lyrics for each song, and this did not feel like a good representation of individual songs. I put the lyrics that I had for each decade into one .txt file and ran them through an online tag cloud generator, just to get an initial impression of the data. There were no surprises here, but the bigger issue was that I did not feel good about the data. (Shown below: tag clouds for Billboard top 100 songs from 1980 and 2010.)

Instead, I decided to look at collections of songs by artists. Using Rolling Stone’s “Top 100 Artists of All Time” as a starting point, I downloaded up to 100 songs for each artist on the list, which again returned only 30% of the lyrics for each song. These song lyrics were stored in one .txt file for each artist. All of the lyrics were acquired using the musixmatch API via Processing.



Once I had all of the song lyrics, I had to figure out how to analyze them to prepare the data for a visualization. This involved exploring many different avenues, starting with different text-analysis libraries and software. The process was a back and forth between deciding what angle to take on the data and working out how best to visualize it. Below are some sketches from throughout the process that look at possible ways to visualize the data.

AntConc, one of the tools that I installed and used to look at the data, provided word frequencies, word pairing frequencies, and concordance plots of where words appear in each txt file. Below is a concordance plot created with AntConc and a view of the interface showing frequent word clusters.
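As a rough illustration of the kind of counts described here, word frequencies and word-pair frequencies can be sketched in a few lines of Python. This is not AntConc's implementation; the function names and the sample text are hypothetical, and the tokenizer is a deliberately simple assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word appears."""
    return Counter(tokenize(text))


def pair_frequencies(text):
    """Count how often each pair of adjacent words appears."""
    words = tokenize(text)
    return Counter(zip(words, words[1:]))


# Hypothetical sample text standing in for one artist's lyrics file.
sample = "love me do you know I love you love me do"
word_counts = word_frequencies(sample)
pair_counts = pair_frequencies(sample)
```

Calling `word_counts.most_common(10)` or `pair_counts.most_common(10)` would then list the most frequent words or word pairs for that artist.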

LIWC (Linguistic Inquiry and Word Count) was another tool that I came across, which “…can determine the degree any text uses positive or negative emotions, self-references, causal words, and 70 other language dimensions.” Unfortunately I did not have access to the full program, but on the LIWC website a sample set of data could be returned for an individual text file. The metrics included in this set were self references, social words, positive emotions, negative emotions, cognitive words and big words (words greater than 6 letters long). With help from Mauricio, who kindly gave me a PHP script to scrape this data from the LIWC site, I was able to upload each of the song lyric files and store the LIWC metrics for each artist in one txt file.
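To make the metrics more concrete, here is a minimal sketch of how two of them could be approximated, assuming each metric is expressed as a percentage of the total words in a file. The "big words" rule follows the definition above (more than 6 letters). The self-reference word list is a tiny hypothetical stand-in, not the LIWC dictionary, and the function names are made up for illustration.

```python
import re

# Hypothetical, tiny word list for illustration only; the real LIWC
# dictionaries are much larger and are not reproduced here.
SELF_REFERENCES = {"i", "me", "my", "mine", "myself"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def percent(count, total):
    """Express a count as a percentage of the total, guarding against empty input."""
    return 100.0 * count / total if total else 0.0


def lyric_metrics(text):
    """Return approximate 'big words' and 'self references' percentages."""
    words = tokenize(text)
    total = len(words)
    # Big words: more than 6 letters, ignoring apostrophes.
    big = sum(1 for w in words if len(w.replace("'", "")) > 6)
    selfref = sum(1 for w in words if w in SELF_REFERENCES)
    return {
        "big_words": percent(big, total),
        "self_references": percent(selfref, total),
    }


# Hypothetical sample line standing in for a lyrics file.
sample = "I remember everything you told me yesterday"
metrics = lyric_metrics(sample)
```

Running this over each artist's file would give one row of numbers per artist, which is the shape of data the later plots rely on.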


Visualization Studies

I began with a simple plot of the LIWC data to get a sense of what it looked like. To see where each artist ranked, I created a series of vertical plots, one for each metric, to map out the name of each artist. Some interesting groupings were visible, in particular the outliers, but the rest of the artists were not legible. For example, one immediate surprise was discovering that Simon and Garfunkel ranked the highest for negative emotions, which I was not expecting.

I took this visualization a step further by making it interactive, so that only the selected artist would be white and the other artist names would fall back at a lighter opacity. If you roll over one artist, that artist is highlighted in each plot so you can see where they fall for each of the metrics. As an attempt to add another layer of information, I accessed the last.fm API to get genre information for each artist. I added a list of the genres to the left of the visualization, so that when you roll over a genre, all artists that fall under that category are highlighted. Although this allows for an interesting exploration, the data still feels disjointed in this visualization.

In another study, I created 2-dimensional plots for each of the metrics, to look for where the most significant relationships might be. For example, one plot looked at positive emotions vs negative emotions, or self references vs social words. I printed the artist names for this study instead of plotting points, which actually detracted from the legibility of the charts. However, some relationships were still visible. Shown below are sketches from this part of the visualization process, and a view of the 2-dimensional charts. Download PDF of charts

To better consolidate the data into a more meaningful visualization, Golan recommended looking into SVD (Singular Value Decomposition), which can be used to reduce multidimensional data to 2 dimensions. A 2-dimensional plot would map out the similarity of artists based on the LIWC analysis of the song lyrics, so relationships between artists would be more visible. I did some research to look for existing tools to do this and came across an SVD library in LingPipe, which is a “…tool kit for processing text using computational linguistics.” I was able to get this running on my computer with Eclipse, but the output it returned did not quite translate into data that could be plotted in a 2-dimensional chart. I attempted to understand the math behind it, and looked into other libraries as well, but unfortunately was unable to figure this out on my own. Golan kindly shared code that would take my data and do SVD, returning 2 values that could be plotted in Processing. The initial results were difficult to read, since many of the artists fell in the same region.
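For readers who want a feel for this reduction step, the following is a minimal pure-Python sketch, not the code Golan shared. It assumes the columns are centered first (which makes the result equivalent to PCA) and finds the top two right singular vectors by power iteration on X^T X, then projects each row onto them. The function names and the sample numbers are hypothetical. In practice a numerical library's SVD routine would be the usual choice.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every metric is centered on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram_matrix(rows):
    """Compute X^T X, whose eigenvectors are the right singular vectors of X."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def top_eigenvector(matrix, iterations=500):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    size = len(matrix)
    vec = [1.0 / (i + 1) for i in range(size)]  # arbitrary non-zero start
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # nothing left to find (all-zero matrix)
        vec = [v / norm for v in nxt]
    return vec


def project_to_2d(rows):
    """Return one (x, y) pair per row using the top two singular directions."""
    centered = center_columns(rows)
    gram = gram_matrix(centered)
    size = len(gram)
    axes = []
    for _ in range(2):
        vec = top_eigenvector(gram)
        value = sum(vec[i] * gram[i][j] * vec[j]
                    for i in range(size) for j in range(size))
        axes.append(vec)
        # Deflate: remove the component just found so the next pass finds the second one.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return [[sum(r * a for r, a in zip(row, axis)) for axis in axes]
            for row in centered]


# Hypothetical metric rows (one per artist, one column per metric).
data = [
    [2.0, 1.0, 5.0],
    [4.0, 3.0, 5.0],
    [6.0, 2.0, 5.0],
    [8.0, 6.0, 5.0],
]
points = project_to_2d(data)
```

Each entry of `points` is an (x, y) pair that could be plotted directly, with artists whose metric rows are similar landing close together.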

To help with this, I used Golan’s Polynomial Shaping Functions to scale the data and improve legibility. The result after applying the shaping functions is shown below.
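The general idea of a shaping function can be sketched as follows: normalize each coordinate to the range 0 to 1, then pass it through a polynomial curve that stretches the crowded part of the range. The curve below is a simple double-quadratic sigmoid chosen purely for illustration, assuming the points are crowded near the middle. It is not necessarily one of the functions used for this project, and the sample coordinates are hypothetical.

```python
def normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def double_quadratic_sigmoid(x):
    """Map [0, 1] to [0, 1] with a steeper slope around x = 0.5."""
    if x <= 0.5:
        return 2.0 * x * x
    return 1.0 - 2.0 * (1.0 - x) * (1.0 - x)


def reshape(values):
    """Normalize the values, then spread out the ones crowded near the middle."""
    return [double_quadratic_sigmoid(x) for x in normalize(values)]


# Hypothetical x-coordinates crowded near the middle of the range.
xs = [0.0, 4.5, 5.0, 5.5, 10.0]
shaped = reshape(xs)
```

The endpoints stay fixed while the three middle values move further apart, which is the effect needed when many labels overlap in one region of the plot.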


Final Visualization

For the final visualization I ended up with a stylized version of the SVD results, which was formatted at poster size. Ideally I wanted to take this a step further by synthesizing the data more and adding a more meaningful layer of interpretation. I discussed the results with several people with a range of perspectives and opinions on music, and got different interpretations of the significance of the artist groupings. However, this in combination with my own interpretation did not leave me feeling confident enough about adding this to the final piece. Despite not adding this information, the piece still invites discussion and reflection on what the viewer thinks the groupings mean. Download PDF of Final Poster



I was somewhat disappointed with where I ended up for the final poster. From a design standpoint, the legibility was unfortunately compromised in my initial attempt to stylize the results, although this could easily be resolved with another iteration on this particular design. Beyond that, I would ideally like to take this visualization a step further and add an extra layer of information that would provide some insightful interpretation of why certain artists have similar lyrics. I think that part of my struggle to do this was in my selection of artists to analyze. I chose Rolling Stone’s “Top 100 Artists of All Time” as a starting point because I thought it was a credible source that would provide a good range of artists from different decades and genres. However, some of these artists fell outside of my own knowledge of music, which made it difficult to fully interpret the results on my own. I would be interested to create a similar visualization with the 100 most frequently listened to artists from my own musical library, to see if I can draw some more meaningful conclusions from the findings.

Despite these frustrations, I learned a lot of new processes and techniques for scraping, analyzing, and visualizing data, which I think will be valuable for future information visualization projects. I realize a significant part of the process of creating an information visualization involves finding, analyzing and prepping the data. From a technical standpoint, that was my biggest challenge to overcome on this project. However, another significant part of creating a successful visualization is having an interesting question or angle to look at the data from, which I also struggled to settle on early in the process. After working through this project, I can appreciate how challenging it is to create a meaningful and interesting text visualization, given the complexity of language and the difficulty of finding patterns and meaning within a body or bodies of text.


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2018 Interactive Art & Computational Design / Spring 2011