Leave comments or discuss the blog on my GitHub Blog page!

1 Introduction to My AFI Project

by Caleb J. Picker, created using Quarto in RStudio, currently drinking Starbucks Veranda at-home brew

Sing the Sorrow is my favorite album of all time. I’ve listened to it 1000s of times over the last 20 years. If you’re like me, and you love AFI as much as I do, then you also love the rest of their discography. And if you’re really like me, then you also love data science, research, and statistics.

As the 20th Anniversary of the Sing the Sorrow concert looms on the horizon, I was inspired to combine both of my passions to analyze the content of AFI’s lyrics from their major album releases (and to upskill my data science skillset in the realm of natural language processing). Note that each of the tables in this blog is interactive and searchable!

So let’s see what modern statistics can reveal about the themes (clandestine or not) behind AFI’s lyrics!

2 AFI’s Most Frequently Used Words

To begin, I start by looking at the most frequently used words in several ways. I used the Lyrics Genius API wrapper in the geniusr package to download AFI’s entire catalog of songs (following https://www.r-bloggers.com/2021/01/scraping-analysing-and-visualising-lyrics-in-r/).

2.1 Songs Removed from Analysis

After downloading the lyrics, I filtered to only those songs in their major album release. This means I removed almost 100 songs from this analysis, as AFI’s catalog contains about 240 songs. See below for list of removed songs.

2.2 Word Frequency

To get to a more meaningful output, I pre-processed the lyrics. I first removed lines of words that repeated (e.g., choruses), and then I removed stop words (e.g., “I”, “you”, “the”, “a”). So you can see for yourself, this section shows word frequency both with stop words (Table 2-left) and without stop words (Table 3-right).

As an aside, an analysis with stop words would definitely be interesting. Davey uses pronouns and direct objects in unique ways, and that could be an entire analysis of its own: The stop words could be contextualized, for example, using ngrams.

Without further ado, let’s reveal AFI’s most frequently used words! Looking at Table 3 (right), it seems like the top six words are:

feel (75); love (70); eyes (60); time (55); life (51); and heart (49).


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

Joining, by = "word"

Word frequency across AFI’s major album releases.

2.3 Wordcloud

Following the tables, I present an alternative visualization of word frequency (based on local and global weightings, see Technical Details for more information): a wordcloud!

Take a look! The top two words appear to be feel and love. What else caught your eye?

Loading required package: RColorBrewer

2.4 Top 10 Most Frequently Used Words (by Album)

To round out the word frequency topic, I created a plot of AFI’s top 10 most frequently used words faceted by major album release (following https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets). I themed up the image to use Sing the Sorrow colors and by selecting colors from each album to represent the data as bars.

Following on the Sing the Sorrow theme, let’s look at Sing the Sorrow (the blood-red album in the second row, second column). The top six words are:

grey, tonight, dance, step, lay, inside, and heart

Grey likely comes from This Celluloid Dream (“All the colors (all grey) upon leaving (all grey) all will turn to grey.”)

Feel free to look at the rest of the albums! It’s pretty interesting! Let me know what else you discover in a comment on the Discussion page!

Figure 2: Word frequency by each of AFI’s albums

3 Latent Semantic Analysis

In this section, I followed Gefen, Endicott, Fresneda, Miller, and Larsen’s (2017) process for a latent semantic analysis. The goal was to reveal how the words are related and to discover hidden themes.

3.1 Technical Details

If you’re not into technical details, feel free to skip this section and go straight to Cosine Similarity. Gefen et al.’s (2017) process allowed me to locally and globally weight the raw frequencies. The local weighting algorithm weights words more heavily that appear more often within a song (presumably because they’re more important); the global weighting algorithm weights words less heavily if words appear more often across all songs (presumably because they’re less important, like the word “the”). These weightings were then multiplied together.

To start the process, I split up all the songs from the major album releases into separate text files. After importing them, I created a Term-Document-Matrix and first, counted the raw frequencies and second, locally and globally weighted those frequencies. The rest of the analyses for this post rely on these weighted frequencies.

While importing the files, I set the minimum global frequency to 2. There are 698 vocabulary words and 147 songs. After that, I created a cosine similarity matrix to examine how related certain words are together. Finally, I examined the hidden themes by creating a matrix with word loadings (similar to factor analysis).

3.2 Cosine Similarity

Similar to concepts like correlation, cosine similarity also measures similarity by measuring the distance between two vectors in semantic space. Cosine similarity ranges between -1 and +1:

closer to -1 means the words are more opposite,
closer +1 means the words are more similar, and
closer to 0 means the words are unrelated.

3.2.1 Related Words

In this interactive table, I’m interested in sorting the values based on a column. In this case, I’ll use the word eye. I click the column header twice, and I find the top three words (cosine similarity) are:

stare (+.53); recal (+.49); and meet (+.47)

It looks like the use of the word eye is closely related to staring, as well as to recalling (a memory). Interesting!

Warning in instance$preRenderHook(instance): It seems your data is too
big for client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

3.2.2 Is `eye` more closely related to sight or to memory?

The follow up question is whether eye is more closely related to sight or to memory words.

[1] "Eye and sight have cosine similarity rating of 0.2"

[1] "Eye and recal have cosine similarity rating of 0.48"

[1] "Eye and recogn have cosine similarity rating of 0.22"

[1] "Eye and rememb have cosine similarity rating of 0.13"

As it turns out, my hypothesis that eye, when projected onto the semantic space, is more closely related to memory words like recal, recogn, rememb is somewhat supported!

What else would you like to see? Tell me in the comments on the Discussion page!

3.3 Hidden Themes

One of the key features of latent semantic analysis is that it can reveal hidden themes that one otherwise may not have noticed. This is demonstrated in the word loadings matrix. The full set is here for you to play with (Table 5), but I’ve also extracted the top five words for all dimensions in the Table 6.

Here, I will present all factors from the analysis, but I will only analyze three of the hidden themes. The approach I took was to find hidden themes that were easier for me to interpret to better demonstrate the power of this analysis. The hidden themes are labeled V1 to V50, so feel free to explore to your heart’s content. In my estimation, the hidden themes aren’t so straightforward, so some help interpreting them would be welcomed! Feel free to post your findings in the Discussion forum!

Table 6 extracts the top five words (and their loadings) for each of the themes to make interpretation easier.

3.3.1 Hidden Theme 1: Love is Red Blood

For hidden theme 1, I analyzed (V3):

Top 5: drop (+18.59); red (+15.88); love (+15.27); tast (+11.61); life (+7.45)

The first hidden theme seems to relate to red drops that signify love and tasting life. Two songs come to mind: She Speaks the Language with the line “Red, red drops upon my cuff let me know this must be love”.

So the use of these top 5 words make the connection between love and blood, so every drop of blood spilled is love spilling out of his body.

3.3.2 Hidden Theme 2: Forever Crawling We Rewind

For hidden theme 2 (V26), I find the top 5 words:

Top 5: crawl (+11.78); forever (+11.40); float (+11.29); dream (+10.71); and meet (+9.06)

The second hidden theme seems to relate to forever crawling and floating in a dream. This reminds me of the song Rewind: “feel it crawl up on your chest…stay awake as its creeping close”, “Let the ghost that meets your eyes still haunt you, remind you”, “Your selfish words must float denied like your crime, I float now”, and “Forever crawling, ever crawling”.

3.3.3 Hidden Theme 3: Bathed in Your Radiance I Melt

For hidden theme 3 (V6), I find:

Top 5: twist (+12.34); melt (+12.14); gonna (+11.17); light (+10.76); and star (+10.06)

The third hidden theme seems to relate to twist and melting by the light of a star. This Celluloid Dream comes to mind with the lines: “Just like a memory, it twists me”; “bathed in your radiance I melt”; and “in the shadow of a star in static pallor”. So this could mean, at least within the context of this song, Davey compares his lack of radiance (static pallor) to his twisted memory of the radiance of his star who radiates light that melts him.

4 Conclusion

AFI is my favorite band of all time, and Sing the Sorrow is my favorite album of all time. Combined with my love for statistics, I decided, for my first blog post tracking my data science journey, to analyze the content of AFI’s lyrics.

In addition to my goal of learning new data science skills for my CV, I wanted to gain a deeper understanding of the hidden themes and word usage patterns in Davey’s lyrics. Although not every interpretation was straightforward, I did find that the most frequently used words were feel and love. I also found that eyes, when compared to sight, were more closely related to memory words, such as recal, rememb, and recogn.

I also found a hidden metaphors: such as love is blood (Hidden Theme 1: Love is Red Blood). I found a memory forever crawling and begging to be rewound (Hidden Theme 2: Forever Crawling We Rewind). Finally, I found how his memories of his star twist and melt him (Hidden Theme 3: Bathed in Your Radiance I Melt).

Overall, this was a really fun exercise! It’s my first time doing something like this, and I certainly learned a lot! I know I still have a lot more to learn, especially regarding best software development practices.

I can hardly wait to attend the 20th Anniversary Sing the Sorrow event! For real. I’ve been in The Despair Faction fan club for 20 years. This is a real treat. You could say I’ve been waiting 20 years for this!

Leave comments or discuss the blog on my GitHub Blog page!

--- title: "A Text Mining Analysis to Uncover the Hidden Themes behind AFI's (A Fire Inside's) Lyrics" listing: contents: posts sort: "date desc" type: default categories: true sort-ui: false filter-ui: false page-layout: full date: "2023-01-23" title-block-banner: true format: html: code-fold: true code-tools: true code-line-numbers: true execute: echo: false include: false number-sections: true toc: true toc-depth: 3 toc-title: Contents theme: simplex repo: calebjpicker/AFI image: wordcloud.png --- Leave comments or discuss the blog on my [GitHub Blog](https://github.com/calebjpicker/Blog/discussions/2) page! # Introduction to My AFI Project by [Caleb J. Picker](https://calebjpicker.quarto.pub/ "Caleb J. Picker's Curriculum Vitae"), created using Quarto in RStudio, currently drinking Starbucks Veranda at-home brew [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=ntAoYywYSq-SrfJCGL9t3g) is my favorite album of all time. I've listened to it 1000s of times over the last 20 years. If you're like me, and you love AFI as much as I do, then you also love the rest of their discography. And if you're really like me, then you also love data science, research, and statistics. As the [20th Anniversary of the Sing the Sorrow](https://www.youtube.com/watch?v=k6x3kf5rvak&t=4s) concert looms on the horizon, I was inspired to combine both of my passions to analyze the content of AFI's lyrics from their major album releases (and to upskill my data science skillset in the realm of natural language processing). **Note** that each of the tables in this blog is [interactive]{.underline} and [searchable]{.underline}! So let's see what modern statistics can reveal about the themes (clandestine or not) behind AFI's lyrics! # AFI's Most Frequently Used Words To begin, I start by looking at the most frequently used words in several ways. I used the Lyrics Genius API wrapper in the `geniusr` package to download AFI's entire catalog of songs (following <https://www.r-bloggers.com/2021/01/scraping-analysing-and-visualising-lyrics-in-r/>). ```{r} #| echo: false #| include: false # Set working directory dir = 'C:/Users/caleb/OneDrive/Caleb/Personal/Financials/Curriculum Vitae/Portfolio' setwd(dir) ``` ```{r} #| echo: false #| cache: true #| include: false #| output: false library(jsonlite) library(tidyverse) library(geniusr) library(tidytext) library(ggplot2) library(quanteda) # library(spacyr) # library(openNLP) # library(rJava) library(glue) library(gplots) library(RColorBrewer) library(ComplexHeatmap) library(wordcloud) library(rjson) ``` ```{r} #| echo: false #| output: false #| cache: true #| include: false library(geniusr)# find artist ID # genius token API gen_token="CqypOkltZXuxwnwuEvLm6OQB_uonxqSJ-w-rWIBamPSAg0sJdEia2g1iCZSNwWhD" # NEED TO HIDE THIS SOMEHOW # Find the artist ID. search_artist("AFI",access_token=gen_token) #36520 # Save Artist ID as variable. AFI_id<-36520 # Get all songs based on Artist ID variable. songs<-get_artist_songs_df(AFI_id,access_token = gen_token) # Copy songs as new dataframe (to preserve original import) songs_df<-songs # Get song ids ``` ```{r} #| cache: true #| include: false #| output: false ids <- c(as.character(songs_df$song_id)) # remove songs that are repeats or live versions or not on a main album # create list of songs by song id to remove songs_to_rm<- c(1795101, # 100 Words 1955285, # 17 Crimes (LA Riots Remix) 775819, # 3.5 1984466, # A Single Second - Live From Long Beach Arena 774610, # A Winter's Tale 4046936, # Back into the Sun 794933, # Born In The US Of A 4046935, # Break Angels 2389822, # Breathing Towers to Heaven 794998, # But Home Is Nowhere (Demo Version) 1795120, # But Home Is Nowhere (Demo Version) [from "the Leaving Song Pt.Ii" (Part 2) Single]chicken Song 775075, # Carcinogen Crush 7250231, # Caught 1707354, # Demonomania 1178936, # Don't Change 795088, # Dream Of Waking 1966520, # Endlessly, She Said - Live From Long Beach Arena 785517, # Ether 2117778, # Fainting Spells 1323191, # Fall Children 785146, # Fallen Like the Sky 4045137, # Get Dark 1946768, # Girl's Not Grey - Live From Long Beach Arena 2009937, # God Called In Sick Today - Live From Long Beach Arena 782859, # Halloween 785634, # Head Like a Hole 794623, # Hearts Frozen Solid, Thawed Once More by the Spring of Rage, Despair and Hopelessness 2160131, # I Hope You Suffer 2323968, # Initation 2118067, # Intro/Prelude 12/21 794676, # I Wanna Mohawk 970901, # Jack the Ripper 779434, # Key Lime Pie 1955497, # Kill Caustic - Live From Long Beach Arena 775594, # Kung-Fu Devil 1772410, # Love Is A Many Splendored Thing 784805, # Love Is a Many Splendored Thing [vinyl only] (Ft.Â BrettÂ Reed & TimÂ Timebomb) 1979083, # Love Like Winter - Live From Long Beach Arena 484287, # Lower It 794800, # Lower Your Head And Take It In The Body 784844, # Malleus Malleficarum 775186, # Man in a Suitcase 785056, # Mini Trucks Suck 4663303, # MissedCallz [interlude] 1945711, # Miss Murder - Live From Long Beach Arena 4965495, # Miss Murder (UK Radio Edit) 774660, # My Michelle 775479, # Now the World 775962, # On the Arrow 779513, # Over Exposure 1855407, # Paper Airplanes 785140, # Paper Airplanes (Makeshift Wings) (Demo Version) 784382, # Porphyria 1994239, # Prelude 12/21 - Live From Long Beach Arena 795301, # Rabbits Are Roadkill On Route 37 774434, # Rabbits Are Roadkill on Rt. 37 775770, # Red Hat 774620, # Reiver's Music 775217, # Rolling Balls 784209, # Self Pity 1964228, # Silver And Cold - Live From Long Beach Arena 1981611, # Summer Shudder - Live From Long Beach Arena 795228, # Synesthesia 1795242, # Synesthesia (Demo Version) 785775, # The Boy Who Destroyed the World 794969, # The Chicken Song 2000230, # The Days of the Phoenix - Live From Long Beach Arena 785608, # The Great Disappointment (Demo Version) 742667, # The Hanging Garden 776173, # The Leaving Song (Demo Version) 784751, # The Leaving Song Part II 1986158, # The Leaving Song Pt. II - Live From Long Beach Arena 1935590, # The Missing Frame - Live From Long Beach Arena 4046937, # The Missing Man 250706, # The Spoken Word 2835040, # The View From Here 1748926, # This Celluloid Dream (Demo Version) 1940476, # This Time Imperfect - Live From Long Beach Arena 2389819, # Too Late for Gods 785951, # Totalimmortal 785386, # Transference 4046934, # Trash Bat 967979, # Values Here 775159, # We Bite 2389817, # Weâ€™ve Got the Knife 775743, # Whatever I Do 2128444, # Where We Used to Play 795523, # Who Knew? 784581, # Who Said You Could Touch Me? 775274, # Who said you could touch me? [vinyl only] 1059885, # YÃ¼rf Rendenmein 1795153 # Ziggy Stardust ) songs_to_rm_names<-c( "100 Words", "17 Crimes (LA Riots Remix)", "3.5", "A Single Second - Live From Long Beach Arena", "A Winter's Tale", "Back into the Sun", "Born In The US Of A", "Break Angels", "Breathing Towers to Heaven", "But Home Is Nowhere (Demo Version)", "But Home Is Nowhere (Demo Version) [from the Leaving Song Pt.Ii (Part 2) Single]chicken Song", "Carcinogen Crush", "Caught", "Demonomania", "Don't Change", "Dream Of Waking", "Endlessly, She Said - Live From Long Beach Arena", "Ether", "Fainting Spells", "Fall Children", "Fallen Like the Sky", "Get Dark", "Girl's Not Grey - Live From Long Beach Arena", "God Called In Sick Today - Live From Long Beach Arena", "Halloween", "Head Like a Hole", "Hearts Frozen Solid, Thawed Once More by the Spring of Rage, Despair and Hopelessness", "I Hope You Suffer", "Initation", "Intro/Prelude 12/21", "I Wanna Mohawk", "Jack the Ripper", "Key Lime Pie", "Kill Caustic - Live From Long Beach Arena", "Kung-Fu Devil", "Love Is A Many Splendored Thing", "Love Is a Many Splendored Thing [vinyl only] (Ft.Bretteed & TimTimebomb)", "Love Like Winter - Live From Long Beach Arena", "Lower It", "Lower Your Head And Take It In The Body", "Malleus Malleficarum", "Man in a Suitcase", "Mini Trucks Suck", "MissedCallz [interlude]", "Miss Murder - Live From Long Beach Arena", "Miss Murder (UK Radio Edit)", "My Michelle", "Now the World", "On the Arrow", "Over Exposure", "Paper Airplanes", "Paper Airplanes (Makeshift Wings) (Demo Version)", "Porphyria", "Prelude 12/21 - Live From Long Beach Arena", "Rabbits Are Roadkill On Route 37", "Rabbits Are Roadkill on Rt. 37", "Red Hat", "Reiver's Music", "Rolling Balls", "Self Pity", "Silver And Cold - Live From Long Beach Arena", "Summer Shudder - Live From Long Beach Arena", "Synesthesia", "Synesthesia (Demo Version)", "The Boy Who Destroyed the World", "The Chicken Song", "The Days of the Phoenix - Live From Long Beach Arena", "The Great Disappointment (Demo Version)", "The Hanging Garden", "The Leaving Song (Demo Version)", "The Leaving Song Part II", "The Leaving Song Pt. II - Live From Long Beach Arena", "The Missing Frame - Live From Long Beach Arena", "The Missing Man", "The Spoken Word", "The View From Here", "This Celluloid Dream (Demo Version)", "This Time Imperfect - Live From Long Beach Arena", "Too Late for Gods", "Totalimmortal", "Transference", "Trash Bat", "Values Here", "We Bite", "Weâ€™ve Got the Knife", "Whatever I Do", "Where We Used to Play", "Who Knew?", "Who Said You Could Touch Me?", "Who said you could touch me? [vinyl only]", "YÃ¼rf Rendenmein", "Ziggy Stardust" ) # remove songs from songs_to_rm list songs_df<-songs_df[!(songs_df$song_id %in% songs_to_rm),] # get all song ids (should now exclude songs on songs_to_rm) ids <- as.character(songs_df$song_id) ``` ```{r} #| echo: false #| freeze: true #| include: false #| cache: true # Create empty dataframe to house the lyrics allLyrics<-data.frame() # add lyrics to df while (length(ids) > 0) { for (id in ids) { tryCatch({ allLyrics <- rbind(get_lyrics_id(id,access_token = gen_token), allLyrics) successful <- unique(allLyrics$song_id) ids <- ids[!ids %in% successful] print(paste("done - ", id)) print(paste("New length is ", length(ids))) }, error = function(e){}) } } # create copy of all Lyrics allLyrics_orig<-allLyrics # create dtaframe with lyrics for all songs and add empty column for album info allIds <- data.frame(song_id = unique(allLyrics$song_id)) allIds$album <- "" ``` ```{r} #| echo: false #| freeze: true #| include: false #| cache: true # get all album info and join it to the allLyrics df for (song in allIds$song_id) { allIds[match(song,allIds$song_id),2] <- get_song_df(song,access_token = gen_token)[12] print(allIds[match(song,allIds$song_id),]) } # add album info for Yurf Rendenmein allIds[(allIds$song_id==1817423),2]<-"Very Proud of Ya" ``` ```{r} #| freeze: true #| include: false #| cache: true # remove repeated lines within allLyrics (e.g., repeated choruses) allLyrics_unique<-allLyrics %>% distinct(allLyrics,line,keep.all=TRUE) # keep only major release albums # allIds<-allIds[(allIds$album %in% album_major_release),] # join allIds dataframe with allLyrics_unique dataframe allLyrics_unique <- full_join(allIds, allLyrics_unique) # keep only major release albums (in case full join added them back in, but they shouldn't have) # allLyrics_unique<-allLyrics_unique[(allLyrics_unique$album %in% album_major_release),]# keep only major release albums allLyrics2 <- full_join(allLyrics_unique, allIds) ``` ## Songs Removed from Analysis After downloading the lyrics, I filtered to only those songs in their major album release. This means I removed almost 100 songs from this analysis, as AFI's catalog contains about 240 songs. See below for list of removed songs. ```{r} #| echo: false #| include: true #| code-fold: true DT::datatable(as.matrix(songs_to_rm_names), caption="Table 1: Songs removed from analysis.", colnames = "Removed Song") ``` ```{r} #| freeze: false #| include: false library(tidyverse) library(tidytext) allLyricsTokenised <- allLyrics2 %>% unnest_tokens(word, line,token="words") # Words to remove words_to_rm<-c("ah", "cajz", "cap'n", "calvin", "de", "hey", "ho", "l.a", "ohh", "the", "uh", "um", "whoa", "whoah", "ya", "yeah") # remove words allLyricsTokenised<-allLyricsTokenised[!(allLyricsTokenised$word %in% words_to_rm),] # allLyricsTokenised<-distinct(allLyricsTokenised,word,song,keep.all=TRUE) # dim(allLyricsTokenised) ``` ## Word Frequency {#word-frequency-with-and-without-stop-words} To get to a more meaningful output, I pre-processed the lyrics. I first removed lines of words that repeated (e.g., choruses), and then I removed stop words (e.g., "I", "you", "the", "a"). So you can see for yourself, this section shows word frequency both with stop words (Table 2-left) and without stop words (Table 3-right). As an aside, an analysis with stop words would definitely be interesting. Davey uses pronouns and direct objects in unique ways, and that could be an entire analysis of its own: The stop words could be contextualized, for example, using [ngrams](https://en.wikipedia.org/wiki/N-gram). Without further ado, let's reveal AFI's most frequently used words! Looking at Table 3 (right), it seems like the top six words are: - `feel` (75); `love` (70); `eyes` (60); `time` (55); `life` (51); and `heart` (49). ```{r} #| echo: false #| include: true #| output: true #| freeze: false #| layout-ncol: 2 #| column: page #| tbl-cap: "Word frequency across AFI's major album releases." #| tbl-subcap: #| - "Stop words present" #| - "Stop words removed" # library(jsonlite) # Count each word allLyricscount<-allLyricsTokenised %>% count(word, sort = TRUE) DT::datatable(allLyricscount, caption = "Table 2: Word frequency across all of AFI's major album releases (with stop-words).", colnames=c("Word","Frequency") # options = list( # initComplete = JS( # "function(settings, json) {", # "$(this.api().table().header()).css({'background-color': '#92141c', 'color': '#767975'});", # "}") ) # library(rjson) library(jsonlite) # library(rjson) # Remove stopwords tidyLyrics <- allLyricsTokenised %>% anti_join(stop_words) tidyLyricscount<-tidyLyrics %>% count(word, sort = TRUE) tidyLyricscount1<- tidyLyrics %>% count(song_name,word, sort = TRUE) DT::datatable(tidyLyricscount, caption = "Table 3: Word frequency across all of AFI's major album releases (with stop-words removed).", colnames=c("Word","Frequency")) # options = list( # initComplete = JS( # "function(settings, json) {", # "$(this.api().table().header()).css({'background-color': '#92141c', 'color': '#767975'});", # "}") ``` ## Wordcloud {#sec-wordcloud} Following the tables, I present an alternative visualization of word frequency (based on local and global weightings, see [Technical Details] for more information): a wordcloud! Take a look! The top two words appear to be `feel` and `love`. What else caught your `eye`? ```{r} #| echo: false #| freeze: true #| cache: true #| include: false library(LSAfun) library(lsa) # Import list of default stop words in English data(stopwords_en) # stopwords_en # Create the TDM from all the Words tidyLyrics_select<-tidyLyrics %>% select(album,song_name,word) # song name and word # DT::datatable(tidyLyrics_select) # for troubleshooting # split the tidyLyrics into multiple dataframes based on song AFI_songs_df_split<- split(tidyLyrics_select,tidyLyrics_select$song_name) # change name of Prelude 12/21 list to remove slash '/' both the name of hte list and within the list names(AFI_songs_df_split)[89]<-"Prelude 12 21" AFI_songs_df_split[['Prelude 12 21']][1]<-"Prelude 12 21" ``` ```{r} #| echo: false #| freeze: true #| include: false #| cache: true # # Create function to remove song_name column within each list element # AFI_songs_df_split<-lapply(AFI_songs_df_split, function(x) x[!(names(x) %in% c("song_name"))]) dir<-"C:/Users/Caleb/OneDrive/Caleb/Personal/Financials/Curriculum Vitae/Blog" # # Create function to export as txt files lapply(1:length(AFI_songs_df_split),function(x){ songname<-as.character(names(AFI_songs_df_split[x])) outfile<-paste0(dir,"/AFIsongs/",songname,".txt") write.table(AFI_songs_df_split[[x]]["word"], file = outfile, row.names=FALSE, col.names=FALSE, # remove the word "word" from all documents quote=FALSE) }) ``` ```{r} #| freeze: false #| echo: false #| include: false #| cache: true # Next I will follow a more practical example so I can learn how to interpret the output better, besides a wordcloud # Also, make sure that all from major albumreleases are included and others are excluded (didn't seem to be the case for some reason) # Load required code libraries library(cluster) library(tm) library(LSAfun) library(lsa) dir = 'C:/Users/caleb/OneDrive/Caleb/Personal/Financials/Curriculum Vitae/Blog' setwd(dir) # Create text matrix using tm's DirSource. # Create corpus in memory # There's some mistyped code that doesn't work in this paper # raw_corpus<-VCorpus(source_dir,readerControl-list(language='en')) #I'll copy and paste from previous example to create a raw corpus # Establish source directory for all text files. (one .txt file = 1 song) source_dir<-paste0(dir,"/AFIsongs/") data(stopwords_en) # Create TDM (text document matrix) with each txt file equal to one document TDM<-textmatrix(source_dir,stopwords=c(stopwords_en),stemming=TRUE, removeNumber=F,minGlobFreq=1.75) # TDM # view TDM TDM_summary<-summary.textmatrix(TDM) TDM_summary_terms<-t(TDM_summary["vocabulary"]) TDM_summary_docs<-t(TDM_summary["documents"]) # Before running the SVM, create a weighted matrix TDM2. # TDM2 is term frequency x inverse document frequency (this is a standard method) # lw = local weighting, more importance to terms that appear more times within a single document # gw = global weighting, less importance to w9rds that appear in more document (same reasoning as removing stop words) TDM2 <-lw_tf(TDM)*gw_idf(TDM) # TDM2 ``` ```{r} #| include: false #| echo: false library(cluster) library(tm) library(LSAfun) library(lsa) # Run LSA on wegihted matrix TDM2. # Number of dimensions chosen by default with dimcalc_share() # lsa transformed TDM2 into three matrices and placed them all into the miniLSAspace object # tk = term matrix, dk = document matrix, and sk = singular value matrix miniLSAspace<-lsa(TDM2,dims=dimcalc_share()) # miniLSAspace<-lsa(TDM2,dims=5) # View matrix by transforming SVD matrix into text matrix as.textmatrix(miniLSAspace) # Doesn't work. Needs other non-working script. # findFreqTerms(TDM2) # these are all unrotated "loadings", like PCA or factor analysis # This command will show the value-weighted matrix of Terms # factoring of terms, # factors = 40, so there are 40 factors with loadings for each term tk2<-t(miniLSAspace$sk*t(miniLSAspace$tk)) tk2_numfactors<-ncol(tk2) ``` ```{r} #| echo: false #| include: true #| cache: false #| tbl-cap: "Figure 1: Wordcloud of the lyrics to each of AFI's major album releases." library(wordcloud) # plot wordcloud using lsa results!!! Term_count<-apply(TDM2,1,sum) TCT<-t(Term_count) myTerms2<-rownames(tk2) suppressWarnings(wordcloud(myTerms2,TCT,min.freq=3,random.order=FALSE,color=brewer.pal(8,"Dark2") ,scale=c(3.5,.1))) ``` ## Top 10 Most Frequently Used Words (by Album) To round out the word frequency topic, I created a plot of AFI's top 10 most frequently used words faceted by major album release (following <https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets>). I themed up the image to use [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=B4jg7hxoQtWYO2K-ITYDBQ) colors and by selecting colors from each album to represent the data as bars. Following on the [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=B4jg7hxoQtWYO2K-ITYDBQ) theme, let's look at [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=B4jg7hxoQtWYO2K-ITYDBQ) (the blood-red album in the second row, second column). The top six words are: - `grey`, `tonight`, `dance`, `step`, `lay`, `inside`, and `heart` `Grey` likely comes from [This Celluloid Dream](https://open.spotify.com/track/46Y9yh2KxmtodypW4bCp6v?si=172d233902574ce7) ("All the colors (all grey) upon leaving (all grey) all will turn to grey.") Feel free to look at the rest of the albums! It's pretty interesting! Let me know what else you discover in a comment on the [Discussion page](https://github.com/calebjpicker/Blog/discussions/2)! ```{r} #| echo: false #| freeze: true #| output: false #| include: false library(tidyverse) topFew <- tidyLyrics %>% group_by(album, word) %>% mutate(n = row_number()) %>% ungroup() # Take only top 10 max for each word by album topFew <- topFew %>% group_by(album, word) %>% summarise(n = max(n))%>% mutate(total=sum(n)) %>% top_n(10,n) %>% ungroup() %>% arrange(album,n) %>% # 3. add order column of row numbers mutate(order=row_number()) topFew <- topFew %>% mutate(total = sum(n)) %>% group_by(word) %>% # top_n(10,abs(n)) %>% # filter(total>=40) %>% # 1. remove grouping ungroup() %>% # 2. Arrange by # i. facet group # ii. bar height arrange(album,total) %>% # 3. add order column of row numbers mutate(order=row_number()) ``` ```{r} #| echo: false #| cache: false # Create a facet for each album and list the most frequented words for that album # https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets library(plotly) # package to make plot interactive library(jpeg) library(grid) # afi_sts_bg<-jpeg::readJPEG("afi-sing-sorrow.jpg") # Prepare only albums for major release album_major_release<-c("Answer That and Stay Fashionable", "Very Proud of Ya", "Shut Your Mouth and Open Your Eyes", "Black Sails in the Sunset", "The Art of Drowning", "Sing the Sorrow", "DECEMBERUNDERGROUND", "Crash Love", "Burials", "AFI (The Blood Album)", "Bodies") # colours for each album albumCol <- c("#c9a95a", # ATASF "#a0480f", # VPYa "#dc4d33", #SYMAOYE "#e86722", # Black Sails in the Sunset "#9f62ad", # The Art of Drowning "#92141c", # Sing the Sorrow "#d6dee5", # DU "#f3c937", # Crash Love "#7c7c7c", # Burials "#3b0404", # AFI (The Blood Album) "#595957") # Bodies names(albumCol) <- album_major_release # This ensures bars are stacked in order of release date topFew$album <- factor(topFew$album, levels = album_major_release) wordsPlot <- ggplot(topFew) + geom_bar(aes(x = order, y = n, fill = as.factor(album)), colour = "#141517", stat = "identity", show.legend = FALSE) + # removes legend coord_flip() + facet_wrap(~ album ,scales="free") + labs(title = "AFI's Top 10 most frequently used words by album", caption = "Source: genius.com | by @CalebPicker", y = "Frequency", x = "Word", fill = "Album") + # annotation_custom(rasterGrob(afi_sts_bg, # width = unit(1,"npc"), # height = unit(1,"npc")), # -Inf,Inf,-Inf,Inf) + scale_fill_manual(values = albumCol) + # add categories to axis scale_x_continuous( breaks = topFew$order, labels = topFew$word, expand = c(0,0) ) + theme(title = element_text(face = "bold", size = 20), strip.background=element_rect(fill='#141517'), strip.text.x = element_text(color="#d51020"), # This impacted teh titles for each facet grid strip.text.y = element_text(color="#d51020"), strip.text = element_text(size=12,color="#d51020"), panel.border = element_rect(colour = "#141517", fill=NA, size=1), panel.background = element_rect(colour = "#767975", fill = "white"), panel.grid.major.x = element_line(colour="#d51020",size = 1, linetype = 4), axis.title = element_text(face = "bold",size = 20, colour = "#d51020"), axis.ticks.length = unit(5, units = "pt"), legend.position="none", # legend.background = element_rect(color='#141517', fill='#141517'), # legend.title = element_text(color='#d51020'), # legend.position = "top", # legend.key.size = unit(12,"pt"), # legend.box.spacing = unit(3,"pt"), # legend.text = element_text(size = 12,color='#d51020'), axis.text.y = element_text(size = 12,face="bold"), axis.text.x = element_text(size = 12,face="bold"), plot.background=element_rect(fill='#141517',color='#141517'), plot.title=element_text(color='#d51020'), plot.subtitle=element_text(color='#d51020'), ) ``` ```{r} #| echo: false #| include: true #| cache: false #| fig-height: 12 #| fig-width: 14 #| fig-cap: "Figure 2: Word frequency by each of AFI's albums" # plotly::ggplotly(wordsPlot,show.legend=FALSE) wordsPlot # ggsave(filename = "/Albums/AFIwordsbyalbum.png", plot = wordsPlot, width = 30, height = 24, units = "cm", # type = "cairo") ``` # Latent Semantic Analysis In this section, I followed [Gefen, Endicott, Fresneda, Miller, and Larsen's (2017)](https://aisel.aisnet.org/cais/vol41/iss1/21/ "Latent Semantic Analysis"){#lsa} process for a latent semantic analysis. The goal was to reveal how the words are related and to discover hidden themes. ## Technical Details If you're not into technical details, feel free to skip this section and go straight to [Cosine Similarity](#cosine-similarity-getting-more-interesting). Gefen et al.'s (2017) process allowed me to locally and globally weight the raw frequencies. The local weighting algorithm weights words more heavily that appear more often *within* a song (presumably because they're more important); the global weighting algorithm weights words less heavily if words appear more often *across* all songs (presumably because they're less important, like the word "the"). These weightings were then multiplied together. To start the process, I split up all the songs from the major album releases into separate text files. After importing them, I created a Term-Document-Matrix and first, counted the raw frequencies and second, locally and globally weighted those frequencies. The rest of the analyses for this post rely on these weighted frequencies. While importing the files, I set the minimum global frequency to 2. There are `r TDM_summary_terms` vocabulary words and `r TDM_summary_docs` songs. After that, I created a cosine similarity matrix to examine how related certain words are together. Finally, I examined the hidden themes by creating a matrix with word loadings (similar to factor analysis). ## Cosine Similarity {#cosine-similarity-getting-more-interesting} Similar to concepts like [correlation](https://en.wikipedia.org/wiki/Correlation), [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) also measures similarity by measuring the distance between two vectors in semantic space. Cosine similarity ranges between -1 and +1: - closer to -1 means the words are more opposite, - closer +1 means the words are more similar, and - closer to 0 means the words are unrelated. ### Related Words In this interactive table, I'm interested in sorting the values based on a column. In this case, I'll use the word `eye`. I click the column header twice, and I find the top three words (cosine similarity) are: - `stare` (+.53); `recal` (+.49); and `meet` (+.47) It looks like the use of the word `eye` is closely related to `star`ing, as well as to `recal`ling (a memory). Interesting! ```{r} #| echo: false #| freeze: false #| include: true #| cache: false myCosineSpace2<-multicos(myTerms2,tvectors=tk2) # removed breakdown=TRUE arg, and it worked suppressWarnings(DT::datatable(round(myCosineSpace2,2), options=list(scrollX=TRUE), caption="Table 4: Cosine similarity for relationships between individual words.")) # library(LSAfun) # coherence(myTerms2,tvectors=tk2) ``` ### Is `eye` more closely related to sight or to memory? The follow up question is whether `eye` is more closely related to sight or to memory words. ```{r} #| include: true print(paste0("Eye and sight have cosine similarity rating of ", round(costring("eye","sight" ,tvectors=tk2[,1:ncol(tk2)]),2))) print(paste0("Eye and recal have cosine similarity rating of ", round(costring("eye",'recal' ,tvectors=tk2[,1:ncol(tk2)]),2))) print(paste0("Eye and recogn have cosine similarity rating of ", round(costring("eye",'recogn' ,tvectors=tk2[,1:ncol(tk2)]),2))) print(paste0("Eye and rememb have cosine similarity rating of ", round(costring("eye",'rememb' ,tvectors=tk2[,1:ncol(tk2)]),2))) ``` As it turns out, my hypothesis that `eye`, when projected onto the semantic space, is more closely related to memory words like `recal`, `recogn`, `rememb` is somewhat supported! What else would you like to see? Tell me in the comments on the [Discussion page](https://github.com/calebjpicker/Blog/discussions/2)! ## Hidden Themes One of the key features of latent semantic analysis is that it can reveal hidden themes that one otherwise may not have noticed. This is demonstrated in the word loadings matrix. The full set is here for you to play with (Table 5), but I've also extracted the top five words for all dimensions in the Table 6. Here, I will present all factors from the analysis, but I will only analyze three of the hidden themes. The approach I took was to find hidden themes that were easier for me to interpret to better demonstrate the power of this analysis. The hidden themes are labeled V1 to V50, so feel free to explore to your heart's content. In my estimation, the hidden themes aren't so straightforward, so some help interpreting them would be welcomed! Feel free to post your findings in the [Discussion forum](https://github.com/calebjpicker/Blog/discussions/2)! ```{r} #| include: true #| echo: false #| cache: false # Display all factor loadings for all words tk2_round_df<-round(as.data.frame(tk2),2) DT::datatable(tk2_round_df, options=list(scrollX=TRUE), caption="Table 5: All words and their loading onto factors (hidden themes).") ``` Table 6 extracts the top five words (and their loadings) for each of the themes to make interpretation easier. ```{r} #| include: true #| echo: false #| cache: false # Top 5 of each factor top_words<-sapply(tk2_round_df,function(x) head(row.names(tk2_round_df)[order(x,decreasing=TRUE)],5)) top_value<-sapply(tk2_round_df,function(x) head(x[order(x,decreasing=TRUE)],5)) top_themes<-data.frame() top_themes<-cbind(top_words,top_value) DT::datatable(top_themes, options=list(scrollX=TRUE,scrollY=TRUE), caption="Table 6: Top 5 words loading onto their factors (hidden themes).") ``` ### Hidden Theme 1: Love is Red Blood {data-link="Hidden Theme 1: Love is Red Blood"} For hidden theme 1, I analyzed (V3): - Top 5: `drop` (+18.59); `red` (+15.88); `love` (+15.27); `tast` (+11.61); `life` (+7.45) The first hidden theme seems to relate to `red drops` that signify `love` and `tast`ing `life`. Two songs come to mind: [She Speaks the Language](https://open.spotify.com/track/2b8Ajd7NgbjdLHzY2f8gex?si=def37951aa884707) with the line "Red, red drops upon my cuff let me know this must be love". So the use of these top 5 words make the connection between love and blood, so every drop of blood spilled is love spilling out of his body. ### Hidden Theme 2: Forever Crawling We Rewind {#hidden-theme-2-forever-crawling-we-rewind data-link="Hidden Theme 2: Forever Crawling We Rewind"} For hidden theme 2 (V26), I find the top 5 words: - Top 5: `crawl` (+11.78); `forever` (+11.40); `float` (+11.29); `dream` (+10.71); and `meet` (+9.06) The second hidden theme seems to relate to forever crawling and floating in a dream. This reminds me of the song [Rewind](https://open.spotify.com/track/06fPxg3oOPbOUaZVVhghfd?si=5f462d1f9fac46da): "feel it crawl up on your chest...stay awake as its creeping close", "Let the ghost that meets your eyes still haunt you, remind you", "Your selfish words must float denied like your crime, I float now", and "Forever crawling, ever crawling". ### Hidden Theme 3: Bathed in Your Radiance I Melt {data-link="Hidden Theme 3: Bathed in Your Radiance I Melt"} For hidden theme 3 (V6), I find: - Top 5: `twist` (+12.34); `melt` (+12.14); `gonna` (+11.17); `light` (+10.76); and `star` (+10.06) The third hidden theme seems to relate to `twist` and `melt`ing by the `light` of a `star`. [This Celluloid Dream](https://open.spotify.com/track/46Y9yh2KxmtodypW4bCp6v?si=ce45b9dc38fc4505) comes to mind with the lines: "Just like a memory, it twists me"; "bathed in your radiance I melt"; and "in the shadow of a star in static pallor". So this could mean, at least within the context of this song, Davey compares his lack of radiance (static pallor) to his twisted memory of the radiance of his star who radiates light that `melt`s him. ```{r} #| include: false #| echo: false # This will show the matrix of documents # factoring of documents, # factors = 40, so there are 40 factors with loadings for each documents dk2 = miniLSAspace$dk # round(dk2,2) #DT::datatable(round(dk2,2), # fillContainer = TRUE) ``` ```{r} #| include: false #| echo: false # The sk matrix of singular values connects tk and dk matrices to reproduce the the original TDM2 # Because the sk matrix only has diagonal values, R stores it as a numeric vector sk2 = miniLSAspace$sk # round(sk2,2) ``` # Conclusion AFI is my favorite band of all time, and [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=ntAoYywYSq-SrfJCGL9t3g) is my favorite album of all time. Combined with my love for statistics, I decided, for my first blog post tracking my data science journey, to analyze the content of AFI's lyrics. In addition to my goal of learning new data science skills for my [CV](https://calebjpicker.quarto.pub/cvr/), I wanted to gain a deeper understanding of the hidden themes and word usage patterns in Davey's lyrics. Although not every interpretation was straightforward, I did find that the most frequently used words were `feel` and `love`. I also found that `eye`s, when compared to `sight`, were more closely related to memory words, such as `recal`, `rememb`, and `recogn`. I also found a hidden metaphors: such as love is blood ([Hidden Theme 1: Love is Red Blood]). I found a memory forever crawling and begging to be rewound ([Hidden Theme 2: Forever Crawling We Rewind](#hidden-theme-2-forever-crawling-we-rewind)). Finally, I found how his memories of his star twist and melt him ([Hidden Theme 3: Bathed in Your Radiance I Melt]). Overall, this was a really fun exercise! It's my first time doing something like this, and I certainly learned a lot! I know I still have a lot more to learn, especially regarding best software development practices. I can hardly wait to attend the 20th Anniversary Sing the Sorrow event! For real. I've been in The Despair Faction fan club for 20 years. This is a real treat. You could say I've been waiting 20 years for this! Leave comments or discuss the blog on my [GitHub Blog](https://github.com/calebjpicker/Blog/discussions/2) page!

A Text Mining Analysis to Uncover the Hidden Themes behind AFI’s (A Fire Inside’s) Lyrics

Categories

1 Introduction to My AFI Project

2 AFI’s Most Frequently Used Words

2.1 Songs Removed from Analysis

2.2 Word Frequency

2.3 Wordcloud

2.4 Top 10 Most Frequently Used Words (by Album)

3 Latent Semantic Analysis

3.1 Technical Details

3.2 Cosine Similarity

3.3 Hidden Themes

3.3.1 Hidden Theme 1: Love is Red Blood

3.3.2 Hidden Theme 2: Forever Crawling We Rewind

3.3.3 Hidden Theme 3: Bathed in Your Radiance I Melt

4 Conclusion

Categories

1 Introduction to My AFI Project

2 AFI’s Most Frequently Used Words

2.1 Songs Removed from Analysis

2.2 Word Frequency

2.3 Wordcloud

2.4 Top 10 Most Frequently Used Words (by Album)

3 Latent Semantic Analysis

3.1 Technical Details

3.2 Cosine Similarity

3.2.1 Related Words

3.2.2 Is eye more closely related to sight or to memory?

3.3 Hidden Themes

3.3.1 Hidden Theme 1: Love is Red Blood

3.3.2 Hidden Theme 2: Forever Crawling We Rewind

3.3.3 Hidden Theme 3: Bathed in Your Radiance I Melt

4 Conclusion

3.2.2 Is `eye` more closely related to sight or to memory?