I recently analyzed over one million lines of rap lyrics and compiled a list of the best lyrics in rap. I reveal them at the end of the article.

This week's post defines the central problem, namely why assessing rap quality through flow is non-trivial, and outlines the interdisciplinary approach that I used to solve it. Next week, I'll explain the details of how everything works.

But first, how do we assess rap quality and why do we care?

An introduction to flow, and why I undertook the project

Flow, as defined by musicologist and literary expert Paul Edwards in his widely influential book How to Rap, is a combination of many elements that together create an artistic effect. The main components are usually rhythm and rhyme, but a deeper analysis brings in a vast array of elements from music theory: timbre, dynamics, pitch, etc.

In my eyes, the centerpiece of flow is abstraction, an idea in computer science that one doesn't need to understand the inner workings of something to interact with it. Indeed, it's easy to tell when something has good or bad flow, and to remember rap lyrics with well-executed flows, but difficult to fully explain why.

I wasn't satisfied with merely appreciating rap. I wanted to see through the layers of abstraction, and to quantify it. I wanted to push the frontiers of data mining in a highly cultural and subjective playing field. I wanted, at my fingertips, the best of rap.

The reality, though, is that flow is really hard to even define, much less quantify. When well done, it's easy to appreciate. Breaking it down, however, is challenging to say the least.

Instead of doing a full analysis of all the elements of particular songs, akin to Pandora's Music Genome Project, I decided to focus on only rhythm and rhyme. Even this, however, became very difficult very quickly.

An illustrative example

To illustrate why finding seemingly simple rhyme patterns in rap quickly becomes convoluted, here's Eminem's verse from Renegade (2001):

Now who's the king of these rude ludicrous lucrative lyrics? Who could inherit the title, put the youth in hysterics?
Usin' his music to steer it, sharin' his views and his merits
But there's a huge interference - they're sayin' you shouldn't hear it
Maybe it's hatred I spew, maybe it's food for the spirit
Maybe it's beautiful music I made for you to just cherish

Pulled from a lecture from the University of Waterloo, this verse is an excellent starting point for seeing why interdisciplinary thinking is necessary to overcome the obstacles in analyzing flow.

So... let's think about the rhymes in this verse. The obvious first step is to match the last syllables of each line.

But it doesn't quite work, because many of these words only approximately rhyme. "Merits" and "hear it" don't have the same consonant ending, and "lyrics" and "hysterics" have different stressed vowels, so neither pair is a perfect rhyme.

Equally important, the rhymes seem to extend beyond the last words.

These "extended rhymes" are often known as multis in rap lingo, or multisyllabic rhymes in academic circles. "Lucrative lyrics" and "youth in hysterics", as well as "food for the spirit" and "you to just cherish", span five or six syllables and add a layer of complexity.

But that's not all. The rhymes at the end of each line appear to flow onto the next line, creating a cascade of internal rhymes. Getting confusing yet?

Furthermore, rhyming behaves like a transitive relation from set theory, meaning that if A and B are related, and B and C are related, then so are A and C. It's not long before the series of rhymes becomes a graph full of nodes and edges. Not so fun.
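To make the transitivity point concrete, here's a minimal sketch (not the code behind this project) that merges pairwise rhyme links into groups using a union-find structure. The function name rhyme_groups is hypothetical, and the links are hand-picked from the verse above purely for illustration, not the output of any rhyme detector.

```python
from collections import defaultdict

def rhyme_groups(pairs):
    """Merge pairwise rhyme links into groups, treating rhyme as transitive."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if unseen.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = defaultdict(set)
    for word in list(parent):
        groups[find(word)].add(word)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical, hand-picked links (assumption: each pair "sort of" rhymes).
links = [
    ("lyrics", "hysterics"),
    ("hysterics", "merits"),
    ("merits", "hear it"),
    ("spew", "food"),
]
print(rhyme_groups(links))
```

Three pairwise links are enough to pull "lyrics", "hysterics", "merits" and "hear it" into one group, even though "lyrics" and "hear it" were never compared directly. That is exactly how a handful of loose matches can snowball into a large graph.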

But this isn't all! Given that all the songs in my collection are written in English, we have yet another problem.

The Problem with English Phonetics

The tricky part about English is that many words are not spelled the way they sound. Eric Malmi of Aalto University pioneered some work on the analysis of Finnish rap lyrics. In his blog, he presents Raplyzer, a program that finds multisyllabic rhymes by traversing backwards through series of two lines in rap.

The algorithm is fairly straightforward in Finnish, because there is a predictable, deterministic mapping from the way words look to the way that they sound. He admits to certain challenges in applying it to English. His algorithm, verbatim from his website, is as follows:

  1. Go through a song word by word.
  2. For each word, find the longest matching vowel sequence that ends with one of the 15 previous words (start with the last vowels of two words, if they’re the same, proceed to the second to last vowels, third to last, and so on. Proceed ignoring word boundaries until the first non-matching vowels have been encountered).
  3. Compute the average rhyme length by averaging the lengths of the longest matching vowel sequences of all words.
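The three steps above can be sketched roughly as follows. This is a minimal reading of the description, not Raplyzer's actual code: the function name average_rhyme_length is hypothetical, vowels are taken straight from spelling (an assumption that suits Finnish), words without vowels are skipped, and a match is not allowed to run into the earlier word it is being compared against.

```python
VOWELS = set("aeiouyäö")  # assumption: vowels read directly off the spelling

def average_rhyme_length(text, window=15):
    """Average length of the longest vowel match ending at each word."""
    flat = []  # every vowel in the song, in order, ignoring word boundaries
    ends = []  # ends[i] = index in flat of the last vowel of word i
    for word in text.lower().split():
        vowels = [ch for ch in word if ch in VOWELS]
        if vowels:  # assumption: vowel-less tokens are ignored entirely
            flat.extend(vowels)
            ends.append(len(flat) - 1)

    if not ends:
        return 0.0

    longest = []
    for i, end_i in enumerate(ends):
        best = 0
        # Compare against the endings of up to `window` previous words.
        for end_j in ends[max(0, i - window):i]:
            k = 0
            # Walk backwards from both endings until the vowels differ,
            # the text runs out, or the two sequences would overlap.
            while (end_j - k >= 0 and end_i - k > end_j
                   and flat[end_i - k] == flat[end_j - k]):
                k += 1
            best = max(best, k)
        longest.append(best)
    return sum(longest) / len(longest)

print(average_rhyme_length("kala pala"))         # two words sharing "a-a"
print(average_rhyme_length("lyrics hysterics"))  # spelling-based English
```

For "kala pala" the second word matches two vowels and the first has nothing to match, so the average is 1.0. For "lyrics hysterics" only the final "i" lines up by spelling, giving 0.5, even though the ear hears a two-syllable rhyme.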

For English, the downside is that a) we cannot detect imperfect rhymes and b) vowel matching is not necessarily even indicative of rhyming. That is, words can share the same vowel letters without rhyming, and words with different vowel letters can still rhyme. "Lyrics" and "hysterics", from the Renegade verse above, are a good example of the latter.

A good algorithm, therefore, cannot rely on the spelling of the lyrics alone.

Stay Tuned for the Interdisciplinary Solution

So how did I do it? The solution is actually quite elegant from an interdisciplinary point of view. It involves linguistics and machine learning, of course, but the crux of the solution lies in bioinformatics. The key insight is that sequences of phonemes are no different from sequences of amino acids in proteins or nucleotides in DNA. Thanks to the wealth of genomic analysis tools out there, a whole new world of tools has opened up to help us.
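The details are for next week, but as a rough illustration of the analogy (and only that, not the actual pipeline), here's a minimal sketch of Smith-Waterman local alignment, a standard bioinformatics technique, applied to phonemes instead of nucleotides. The function name, the scoring values, and the hand-written ARPAbet-style transcriptions (stress marks omitted) are all assumptions made for this example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two sequences."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: a score never drops below zero.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

# Assumed phoneme transcriptions, written by hand for illustration.
lyrics = ["L", "IH", "R", "IH", "K", "S"]
hysterics = ["HH", "IH", "S", "T", "EH", "R", "IH", "K", "S"]

print(local_alignment_score(lyrics, hysterics))
```

With these scores the result is 8, which comes from the shared tail R IH K S (four matches at 2 points each). Unlike letter matching, the alignment works on sounds, and mismatches and gaps are penalized rather than fatal, which is what leaves room for imperfect rhymes.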

And with that realization, the problem becomes much simpler. I'll write a detailed explanation next week. But I hate to leave you hanging, so here's a preliminary version of the site.


You can search by artist/album/year, as well as upvote/downvote lyrics. Stay tuned.