A Text Mining Analysis to Uncover the Hidden Themes behind AFI’s (A Fire Inside’s) Lyrics
Published
January 23, 2023
Categories
All (0)
Leave comments or discuss the blog on my GitHub Blog page!
1 Introduction to My AFI Project
by Caleb J. Picker, created using Quarto in RStudio, currently drinking Starbucks Veranda at-home brew
Sing the Sorrow is my favorite album of all time. I’ve listened to it 1000s of times over the last 20 years. If you’re like me, and you love AFI as much as I do, then you also love the rest of their discography. And if you’re really like me, then you also love data science, research, and statistics.
As the 20th Anniversary of the Sing the Sorrow concert looms on the horizon, I was inspired to combine both of my passions to analyze the content of AFI’s lyrics from their major album releases (and to upskill my data science skillset in the realm of natural language processing). Note that each of the tables in this blog is interactive and searchable!
So let’s see what modern statistics can reveal about the themes (clandestine or not) behind AFI’s lyrics!
After downloading the lyrics, I filtered to only those songs in their major album release. This means I removed almost 100 songs from this analysis, as AFI’s catalog contains about 240 songs. See below for list of removed songs.
2.2 Word Frequency
To get to a more meaningful output, I pre-processed the lyrics. I first removed lines of words that repeated (e.g., choruses), and then I removed stop words (e.g., “I”, “you”, “the”, “a”). So you can see for yourself, this section shows word frequency both with stop words (Table 2-left) and without stop words (Table 3-right).
As an aside, an analysis with stop words would definitely be interesting. Davey uses pronouns and direct objects in unique ways, and that could be an entire analysis of its own: The stop words could be contextualized, for example, using ngrams.
Without further ado, let’s reveal AFI’s most frequently used words! Looking at Table 3 (right), it seems like the top six words are:
feel (75); love (70); eyes (60); time (55); life (51); and heart (49).
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
Joining, by = "word"
Word frequency across AFI’s major album releases.
2.3 Wordcloud
Following the tables, I present an alternative visualization of word frequency (based on local and global weightings, see Technical Details for more information): a wordcloud!
Take a look! The top two words appear to be feel and love. What else caught your eye?
If you’re not into technical details, feel free to skip this section and go straight to Cosine Similarity. Gefen et al.’s (2017) process allowed me to locally and globally weight the raw frequencies. The local weighting algorithm weights words more heavily that appear more often within a song (presumably because they’re more important); the global weighting algorithm weights words less heavily if words appear more often across all songs (presumably because they’re less important, like the word “the”). These weightings were then multiplied together.
To start the process, I split up all the songs from the major album releases into separate text files. After importing them, I created a Term-Document-Matrix and first, counted the raw frequencies and second, locally and globally weighted those frequencies. The rest of the analyses for this post rely on these weighted frequencies.
While importing the files, I set the minimum global frequency to 2. There are 698 vocabulary words and 147 songs. After that, I created a cosine similarity matrix to examine how related certain words are together. Finally, I examined the hidden themes by creating a matrix with word loadings (similar to factor analysis).
3.2 Cosine Similarity
Similar to concepts like correlation, cosine similarity also measures similarity by measuring the distance between two vectors in semantic space. Cosine similarity ranges between -1 and +1:
closer to -1 means the words are more opposite,
closer +1 means the words are more similar, and
closer to 0 means the words are unrelated.
3.2.1 Related Words
In this interactive table, I’m interested in sorting the values based on a column. In this case, I’ll use the word eye. I click the column header twice, and I find the top three words (cosine similarity) are:
stare (+.53); recal (+.49); and meet (+.47)
It looks like the use of the word eye is closely related to staring, as well as to recalling (a memory). Interesting!
Warning in instance$preRenderHook(instance): It seems your data is too
big for client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
3.2.2 Is eye more closely related to sight or to memory?
The follow up question is whether eye is more closely related to sight or to memory words.
[1] "Eye and sight have cosine similarity rating of 0.2"
[1] "Eye and recal have cosine similarity rating of 0.48"
[1] "Eye and recogn have cosine similarity rating of 0.22"
[1] "Eye and rememb have cosine similarity rating of 0.13"
As it turns out, my hypothesis that eye, when projected onto the semantic space, is more closely related to memory words like recal, recogn, rememb is somewhat supported!
What else would you like to see? Tell me in the comments on the Discussion page!
3.3 Hidden Themes
One of the key features of latent semantic analysis is that it can reveal hidden themes that one otherwise may not have noticed. This is demonstrated in the word loadings matrix. The full set is here for you to play with (Table 5), but I’ve also extracted the top five words for all dimensions in the Table 6.
Here, I will present all factors from the analysis, but I will only analyze three of the hidden themes. The approach I took was to find hidden themes that were easier for me to interpret to better demonstrate the power of this analysis. The hidden themes are labeled V1 to V50, so feel free to explore to your heart’s content. In my estimation, the hidden themes aren’t so straightforward, so some help interpreting them would be welcomed! Feel free to post your findings in the Discussion forum!
Table 6 extracts the top five words (and their loadings) for each of the themes to make interpretation easier.
3.3.1 Hidden Theme 1: Love is Red Blood
For hidden theme 1, I analyzed (V3):
Top 5: drop (+18.59); red (+15.88); love (+15.27); tast (+11.61); life (+7.45)
The first hidden theme seems to relate to red drops that signify love and tasting life. Two songs come to mind: She Speaks the Language with the line “Red, red drops upon my cuff let me know this must be love”.
So the use of these top 5 words make the connection between love and blood, so every drop of blood spilled is love spilling out of his body.
3.3.2 Hidden Theme 2: Forever Crawling We Rewind
For hidden theme 2 (V26), I find the top 5 words:
Top 5: crawl (+11.78); forever (+11.40); float (+11.29); dream (+10.71); and meet (+9.06)
The second hidden theme seems to relate to forever crawling and floating in a dream. This reminds me of the song Rewind: “feel it crawl up on your chest…stay awake as its creeping close”, “Let the ghost that meets your eyes still haunt you, remind you”, “Your selfish words must float denied like your crime, I float now”, and “Forever crawling, ever crawling”.
3.3.3 Hidden Theme 3: Bathed in Your Radiance I Melt
For hidden theme 3 (V6), I find:
Top 5: twist (+12.34); melt (+12.14); gonna (+11.17); light (+10.76); and star (+10.06)
The third hidden theme seems to relate to twist and melting by the light of a star. This Celluloid Dream comes to mind with the lines: “Just like a memory, it twists me”; “bathed in your radiance I melt”; and “in the shadow of a star in static pallor”. So this could mean, at least within the context of this song, Davey compares his lack of radiance (static pallor) to his twisted memory of the radiance of his star who radiates light that melts him.
4 Conclusion
AFI is my favorite band of all time, and Sing the Sorrow is my favorite album of all time. Combined with my love for statistics, I decided, for my first blog post tracking my data science journey, to analyze the content of AFI’s lyrics.
In addition to my goal of learning new data science skills for my CV, I wanted to gain a deeper understanding of the hidden themes and word usage patterns in Davey’s lyrics. Although not every interpretation was straightforward, I did find that the most frequently used words were feel and love. I also found that eyes, when compared to sight, were more closely related to memory words, such as recal, rememb, and recogn.
Overall, this was a really fun exercise! It’s my first time doing something like this, and I certainly learned a lot! I know I still have a lot more to learn, especially regarding best software development practices.
I can hardly wait to attend the 20th Anniversary Sing the Sorrow event! For real. I’ve been in The Despair Faction fan club for 20 years. This is a real treat. You could say I’ve been waiting 20 years for this!
Leave comments or discuss the blog on my GitHub Blog page!
No matching items
Source Code
---title: "A Text Mining Analysis to Uncover the Hidden Themes behind AFI's (A Fire Inside's) Lyrics"listing: contents: posts sort: "date desc" type: default categories: true sort-ui: false filter-ui: falsepage-layout: fulldate: "2023-01-23"title-block-banner: trueformat: html: code-fold: true code-tools: true code-line-numbers: trueexecute: echo: false include: falsenumber-sections: truetoc: truetoc-depth: 3toc-title: Contentstheme: simplexrepo: calebjpicker/AFIimage: wordcloud.png---Leave comments or discuss the blog on my [GitHub Blog](https://github.com/calebjpicker/Blog/discussions/2) page!# Introduction to My AFI Projectby [Caleb J. Picker](https://calebjpicker.quarto.pub/ "Caleb J. Picker's Curriculum Vitae"), created using Quarto in RStudio, currently drinking Starbucks Veranda at-home brew[Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=ntAoYywYSq-SrfJCGL9t3g) is my favorite album of all time. I've listened to it 1000s of times over the last 20 years. If you're like me, and you love AFI as much as I do, then you also love the rest of their discography. And if you're really like me, then you also love data science, research, and statistics.As the [20th Anniversary of the Sing the Sorrow](https://www.youtube.com/watch?v=k6x3kf5rvak&t=4s) concert looms on the horizon, I was inspired to combine both of my passions to analyze the content of AFI's lyrics from their major album releases (and to upskill my data science skillset in the realm of natural language processing). **Note** that each of the tables in this blog is [interactive]{.underline} and [searchable]{.underline}!So let's see what modern statistics can reveal about the themes (clandestine or not) behind AFI's lyrics!# AFI's Most Frequently Used WordsTo begin, I start by looking at the most frequently used words in several ways. I used the Lyrics Genius API wrapper in the `geniusr` package to download AFI's entire catalog of songs (following <https://www.r-bloggers.com/2021/01/scraping-analysing-and-visualising-lyrics-in-r/>).```{r}#| echo: false#| include: false# Set working directorydir ='C:/Users/caleb/OneDrive/Caleb/Personal/Financials/Curriculum Vitae/Portfolio'setwd(dir)``````{r}#| echo: false#| cache: true#| include: false#| output: falselibrary(jsonlite)library(tidyverse)library(geniusr)library(tidytext)library(ggplot2)library(quanteda)# library(spacyr)# library(openNLP)# library(rJava)library(glue)library(gplots)library(RColorBrewer)library(ComplexHeatmap)library(wordcloud)library(rjson)``````{r}#| echo: false#| output: false#| cache: true#| include: falselibrary(geniusr)# find artist ID# genius token APIgen_token="CqypOkltZXuxwnwuEvLm6OQB_uonxqSJ-w-rWIBamPSAg0sJdEia2g1iCZSNwWhD"# NEED TO HIDE THIS SOMEHOW# Find the artist ID.search_artist("AFI",access_token=gen_token) #36520# Save Artist ID as variable.AFI_id<-36520# Get all songs based on Artist ID variable.songs<-get_artist_songs_df(AFI_id,access_token = gen_token)# Copy songs as new dataframe (to preserve original import)songs_df<-songs# Get song ids``````{r}#| cache: true#| include: false#| output: falseids <-c(as.character(songs_df$song_id))# remove songs that are repeats or live versions or not on a main album# create list of songs by song id to removesongs_to_rm<-c(1795101, # 100 Words1955285, # 17 Crimes (LA Riots Remix)775819, # 3.51984466, # A Single Second - Live From Long Beach Arena774610, # A Winter's Tale4046936, # Back into the Sun794933, # Born In The US Of A4046935, # Break Angels2389822, # Breathing Towers to Heaven794998, # But Home Is Nowhere (Demo Version)1795120, # But Home Is Nowhere (Demo Version) [from "the Leaving Song Pt.Ii" (Part 2) Single]chicken Song775075, # Carcinogen Crush7250231, # Caught1707354, # Demonomania1178936, # Don't Change795088, # Dream Of Waking1966520, # Endlessly, She Said - Live From Long Beach Arena785517, # Ether2117778, # Fainting Spells1323191, # Fall Children785146, # Fallen Like the Sky4045137, # Get Dark1946768, # Girl's Not Grey - Live From Long Beach Arena2009937, # God Called In Sick Today - Live From Long Beach Arena782859, # Halloween785634, # Head Like a Hole794623, # Hearts Frozen Solid, Thawed Once More by the Spring of Rage, Despair and Hopelessness2160131, # I Hope You Suffer2323968, # Initation2118067, # Intro/Prelude 12/21794676, # I Wanna Mohawk970901, # Jack the Ripper779434, # Key Lime Pie1955497, # Kill Caustic - Live From Long Beach Arena775594, # Kung-Fu Devil1772410, # Love Is A Many Splendored Thing784805, # Love Is a Many Splendored Thing [vinyl only] (Ft. Brett Reed & Tim Timebomb)1979083, # Love Like Winter - Live From Long Beach Arena484287, # Lower It794800, # Lower Your Head And Take It In The Body784844, # Malleus Malleficarum775186, # Man in a Suitcase785056, # Mini Trucks Suck4663303, # MissedCallz [interlude]1945711, # Miss Murder - Live From Long Beach Arena4965495, # Miss Murder (UK Radio Edit)774660, # My Michelle775479, # Now the World775962, # On the Arrow779513, # Over Exposure1855407, # Paper Airplanes785140, # Paper Airplanes (Makeshift Wings) (Demo Version)784382, # Porphyria1994239, # Prelude 12/21 - Live From Long Beach Arena795301, # Rabbits Are Roadkill On Route 37774434, # Rabbits Are Roadkill on Rt. 37775770, # Red Hat774620, # Reiver's Music775217, # Rolling Balls784209, # Self Pity1964228, # Silver And Cold - Live From Long Beach Arena1981611, # Summer Shudder - Live From Long Beach Arena795228, # Synesthesia1795242, # Synesthesia (Demo Version)785775, # The Boy Who Destroyed the World794969, # The Chicken Song2000230, # The Days of the Phoenix - Live From Long Beach Arena785608, # The Great Disappointment (Demo Version)742667, # The Hanging Garden776173, # The Leaving Song (Demo Version)784751, # The Leaving Song Part II1986158, # The Leaving Song Pt. II - Live From Long Beach Arena1935590, # The Missing Frame - Live From Long Beach Arena4046937, # The Missing Man250706, # The Spoken Word2835040, # The View From Here1748926, # This Celluloid Dream (Demo Version)1940476, # This Time Imperfect - Live From Long Beach Arena2389819, # Too Late for Gods785951, # Totalimmortal785386, # Transference4046934, # Trash Bat967979, # Values Here775159, # We Bite2389817, # We’ve Got the Knife775743, # Whatever I Do2128444, # Where We Used to Play795523, # Who Knew?784581, # Who Said You Could Touch Me?775274, # Who said you could touch me? [vinyl only]1059885, # Yürf Rendenmein1795153# Ziggy Stardust)songs_to_rm_names<-c("100 Words","17 Crimes (LA Riots Remix)","3.5","A Single Second - Live From Long Beach Arena","A Winter's Tale","Back into the Sun","Born In The US Of A","Break Angels","Breathing Towers to Heaven","But Home Is Nowhere (Demo Version)","But Home Is Nowhere (Demo Version) [from the Leaving Song Pt.Ii (Part 2) Single]chicken Song","Carcinogen Crush","Caught","Demonomania","Don't Change","Dream Of Waking","Endlessly, She Said - Live From Long Beach Arena","Ether","Fainting Spells","Fall Children","Fallen Like the Sky","Get Dark","Girl's Not Grey - Live From Long Beach Arena","God Called In Sick Today - Live From Long Beach Arena","Halloween","Head Like a Hole","Hearts Frozen Solid, Thawed Once More by the Spring of Rage, Despair and Hopelessness","I Hope You Suffer","Initation","Intro/Prelude 12/21","I Wanna Mohawk","Jack the Ripper","Key Lime Pie","Kill Caustic - Live From Long Beach Arena","Kung-Fu Devil","Love Is A Many Splendored Thing","Love Is a Many Splendored Thing [vinyl only] (Ft.Bretteed & TimTimebomb)","Love Like Winter - Live From Long Beach Arena","Lower It","Lower Your Head And Take It In The Body","Malleus Malleficarum","Man in a Suitcase","Mini Trucks Suck","MissedCallz [interlude]","Miss Murder - Live From Long Beach Arena","Miss Murder (UK Radio Edit)","My Michelle","Now the World","On the Arrow","Over Exposure","Paper Airplanes","Paper Airplanes (Makeshift Wings) (Demo Version)","Porphyria","Prelude 12/21 - Live From Long Beach Arena","Rabbits Are Roadkill On Route 37","Rabbits Are Roadkill on Rt. 37","Red Hat","Reiver's Music","Rolling Balls","Self Pity","Silver And Cold - Live From Long Beach Arena","Summer Shudder - Live From Long Beach Arena","Synesthesia","Synesthesia (Demo Version)","The Boy Who Destroyed the World","The Chicken Song","The Days of the Phoenix - Live From Long Beach Arena","The Great Disappointment (Demo Version)","The Hanging Garden","The Leaving Song (Demo Version)","The Leaving Song Part II","The Leaving Song Pt. II - Live From Long Beach Arena","The Missing Frame - Live From Long Beach Arena","The Missing Man","The Spoken Word","The View From Here","This Celluloid Dream (Demo Version)","This Time Imperfect - Live From Long Beach Arena","Too Late for Gods","Totalimmortal","Transference","Trash Bat","Values Here","We Bite","We’ve Got the Knife","Whatever I Do","Where We Used to Play","Who Knew?","Who Said You Could Touch Me?","Who said you could touch me? [vinyl only]","Yürf Rendenmein","Ziggy Stardust")# remove songs from songs_to_rm listsongs_df<-songs_df[!(songs_df$song_id %in% songs_to_rm),]# get all song ids (should now exclude songs on songs_to_rm)ids <-as.character(songs_df$song_id)``````{r}#| echo: false#| freeze: true#| include: false#| cache: true# Create empty dataframe to house the lyricsallLyrics<-data.frame()# add lyrics to dfwhile (length(ids) >0) {for (id in ids) {tryCatch({ allLyrics <-rbind(get_lyrics_id(id,access_token = gen_token), allLyrics) successful <-unique(allLyrics$song_id) ids <- ids[!ids %in% successful]print(paste("done - ", id))print(paste("New length is ", length(ids))) }, error =function(e){}) }}# create copy of all LyricsallLyrics_orig<-allLyrics# create dtaframe with lyrics for all songs and add empty column for album infoallIds <-data.frame(song_id =unique(allLyrics$song_id))allIds$album <-""``````{r}#| echo: false#| freeze: true#| include: false#| cache: true# get all album info and join it to the allLyrics dffor (song in allIds$song_id) { allIds[match(song,allIds$song_id),2] <-get_song_df(song,access_token = gen_token)[12]print(allIds[match(song,allIds$song_id),])}# add album info for Yurf RendenmeinallIds[(allIds$song_id==1817423),2]<-"Very Proud of Ya"``````{r}#| freeze: true#| include: false#| cache: true# remove repeated lines within allLyrics (e.g., repeated choruses)allLyrics_unique<-allLyrics %>%distinct(allLyrics,line,keep.all=TRUE)# keep only major release albums# allIds<-allIds[(allIds$album %in% album_major_release),] # join allIds dataframe with allLyrics_unique dataframeallLyrics_unique <-full_join(allIds, allLyrics_unique)# keep only major release albums (in case full join added them back in, but they shouldn't have)# allLyrics_unique<-allLyrics_unique[(allLyrics_unique$album %in% album_major_release),]# keep only major release albumsallLyrics2 <-full_join(allLyrics_unique, allIds)```## Songs Removed from AnalysisAfter downloading the lyrics, I filtered to only those songs in their major album release. This means I removed almost 100 songs from this analysis, as AFI's catalog contains about 240 songs. See below for list of removed songs.```{r}#| echo: false#| include: true#| code-fold: trueDT::datatable(as.matrix(songs_to_rm_names),caption="Table 1: Songs removed from analysis.",colnames ="Removed Song")``````{r}#| freeze: false#| include: falselibrary(tidyverse)library(tidytext)allLyricsTokenised <- allLyrics2 %>%unnest_tokens(word, line,token="words")# Words to removewords_to_rm<-c("ah","cajz","cap'n","calvin","de","hey","ho","l.a","ohh","the","uh","um","whoa","whoah","ya","yeah")# remove words allLyricsTokenised<-allLyricsTokenised[!(allLyricsTokenised$word %in% words_to_rm),]# allLyricsTokenised<-distinct(allLyricsTokenised,word,song,keep.all=TRUE)# dim(allLyricsTokenised)```## Word Frequency {#word-frequency-with-and-without-stop-words}To get to a more meaningful output, I pre-processed the lyrics. I first removed lines of words that repeated (e.g., choruses), and then I removed stop words (e.g., "I", "you", "the", "a"). So you can see for yourself, this section shows word frequency both with stop words (Table 2-left) and without stop words (Table 3-right).As an aside, an analysis with stop words would definitely be interesting. Davey uses pronouns and direct objects in unique ways, and that could be an entire analysis of its own: The stop words could be contextualized, for example, using [ngrams](https://en.wikipedia.org/wiki/N-gram).Without further ado, let's reveal AFI's most frequently used words! Looking at Table 3 (right), it seems like the top six words are:- `feel` (75); `love` (70); `eyes` (60); `time` (55); `life` (51); and `heart` (49).```{r}#| echo: false#| include: true#| output: true#| freeze: false#| layout-ncol: 2#| column: page#| tbl-cap: "Word frequency across AFI's major album releases."#| tbl-subcap: #| - "Stop words present"#| - "Stop words removed"# library(jsonlite)# Count each wordallLyricscount<-allLyricsTokenised %>%count(word, sort =TRUE)DT::datatable(allLyricscount,caption ="Table 2: Word frequency across all of AFI's major album releases (with stop-words).",colnames=c("Word","Frequency")# options = list(# initComplete = JS(# "function(settings, json) {",# "$(this.api().table().header()).css({'background-color': '#92141c', 'color': '#767975'});",# "}") )# library(rjson)library(jsonlite)# library(rjson)# Remove stopwordstidyLyrics <- allLyricsTokenised %>%anti_join(stop_words)tidyLyricscount<-tidyLyrics %>%count(word, sort =TRUE)tidyLyricscount1<- tidyLyrics %>%count(song_name,word, sort =TRUE)DT::datatable(tidyLyricscount,caption ="Table 3: Word frequency across all of AFI's major album releases (with stop-words removed).",colnames=c("Word","Frequency"))# options = list(# initComplete = JS(# "function(settings, json) {",# "$(this.api().table().header()).css({'background-color': '#92141c', 'color': '#767975'});",# "}")```## Wordcloud {#sec-wordcloud}Following the tables, I present an alternative visualization of word frequency (based on local and global weightings, see [Technical Details] for more information): a wordcloud!Take a look! The top two words appear to be `feel` and `love`. What else caught your `eye`?```{r}#| echo: false#| freeze: true#| cache: true#| include: falselibrary(LSAfun)library(lsa)# Import list of default stop words in Englishdata(stopwords_en)# stopwords_en# Create the TDM from all the WordstidyLyrics_select<-tidyLyrics %>%select(album,song_name,word) # song name and word# DT::datatable(tidyLyrics_select) # for troubleshooting# split the tidyLyrics into multiple dataframes based on songAFI_songs_df_split<-split(tidyLyrics_select,tidyLyrics_select$song_name)# change name of Prelude 12/21 list to remove slash '/' both the name of hte list and within the listnames(AFI_songs_df_split)[89]<-"Prelude 12 21"AFI_songs_df_split[['Prelude 12 21']][1]<-"Prelude 12 21"``````{r}#| echo: false#| freeze: true#| include: false#| cache: true# # Create function to remove song_name column within each list element# AFI_songs_df_split<-lapply(AFI_songs_df_split, function(x) x[!(names(x) %in% c("song_name"))])dir<-"C:/Users/Caleb/OneDrive/Caleb/Personal/Financials/Curriculum Vitae/Blog"# # Create function to export as txt fileslapply(1:length(AFI_songs_df_split),function(x){ songname<-as.character(names(AFI_songs_df_split[x])) outfile<-paste0(dir,"/AFIsongs/",songname,".txt")write.table(AFI_songs_df_split[[x]]["word"],file = outfile,row.names=FALSE,col.names=FALSE, # remove the word "word" from all documentsquote=FALSE)})``````{r}#| freeze: false#| echo: false#| include: false#| cache: true# Next I will follow a more practical example so I can learn how to interpret the output better, besides a wordcloud# Also, make sure that all from major albumreleases are included and others are excluded (didn't seem to be the case for some reason)# Load required code librarieslibrary(cluster)library(tm)library(LSAfun)library(lsa)dir ='C:/Users/caleb/OneDrive/Caleb/Personal/Financials/Curriculum Vitae/Blog'setwd(dir)# Create text matrix using tm's DirSource.# Create corpus in memory # There's some mistyped code that doesn't work in this paper# raw_corpus<-VCorpus(source_dir,readerControl-list(language='en'))#I'll copy and paste from previous example to create a raw corpus# Establish source directory for all text files. (one .txt file = 1 song)source_dir<-paste0(dir,"/AFIsongs/")data(stopwords_en)# Create TDM (text document matrix) with each txt file equal to one documentTDM<-textmatrix(source_dir,stopwords=c(stopwords_en),stemming=TRUE,removeNumber=F,minGlobFreq=1.75)# TDM # view TDMTDM_summary<-summary.textmatrix(TDM)TDM_summary_terms<-t(TDM_summary["vocabulary"])TDM_summary_docs<-t(TDM_summary["documents"])# Before running the SVM, create a weighted matrix TDM2.# TDM2 is term frequency x inverse document frequency (this is a standard method)# lw = local weighting, more importance to terms that appear more times within a single document# gw = global weighting, less importance to w9rds that appear in more document (same reasoning as removing stop words)TDM2 <-lw_tf(TDM)*gw_idf(TDM)# TDM2``````{r}#| include: false#| echo: falselibrary(cluster)library(tm)library(LSAfun)library(lsa)# Run LSA on wegihted matrix TDM2.# Number of dimensions chosen by default with dimcalc_share()# lsa transformed TDM2 into three matrices and placed them all into the miniLSAspace object# tk = term matrix, dk = document matrix, and sk = singular value matrixminiLSAspace<-lsa(TDM2,dims=dimcalc_share())# miniLSAspace<-lsa(TDM2,dims=5)# View matrix by transforming SVD matrix into text matrixas.textmatrix(miniLSAspace)# Doesn't work. Needs other non-working script.# findFreqTerms(TDM2)# these are all unrotated "loadings", like PCA or factor analysis# This command will show the value-weighted matrix of Terms# factoring of terms, # factors = 40, so there are 40 factors with loadings for each termtk2<-t(miniLSAspace$sk*t(miniLSAspace$tk))tk2_numfactors<-ncol(tk2)``````{r}#| echo: false#| include: true#| cache: false#| tbl-cap: "Figure 1: Wordcloud of the lyrics to each of AFI's major album releases."library(wordcloud)# plot wordcloud using lsa results!!!Term_count<-apply(TDM2,1,sum)TCT<-t(Term_count)myTerms2<-rownames(tk2)suppressWarnings(wordcloud(myTerms2,TCT,min.freq=3,random.order=FALSE,color=brewer.pal(8,"Dark2") ,scale=c(3.5,.1)))```## Top 10 Most Frequently Used Words (by Album)To round out the word frequency topic, I created a plot of AFI's top 10 most frequently used words faceted by major album release (following <https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets>). I themed up the image to use [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=B4jg7hxoQtWYO2K-ITYDBQ) colors and by selecting colors from each album to represent the data as bars.Following on the [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=B4jg7hxoQtWYO2K-ITYDBQ) theme, let's look at [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=B4jg7hxoQtWYO2K-ITYDBQ) (the blood-red album in the second row, second column). The top six words are:- `grey`, `tonight`, `dance`, `step`, `lay`, `inside`, and `heart``Grey` likely comes from [This Celluloid Dream](https://open.spotify.com/track/46Y9yh2KxmtodypW4bCp6v?si=172d233902574ce7) ("All the colors (all grey) upon leaving (all grey) all will turn to grey.")Feel free to look at the rest of the albums! It's pretty interesting! Let me know what else you discover in a comment on the [Discussion page](https://github.com/calebjpicker/Blog/discussions/2)!```{r}#| echo: false#| freeze: true#| output: false#| include: falselibrary(tidyverse)topFew <- tidyLyrics %>%group_by(album, word) %>%mutate(n =row_number()) %>%ungroup()# Take only top 10 max for each word by albumtopFew <- topFew %>%group_by(album, word) %>%summarise(n =max(n))%>%mutate(total=sum(n)) %>%top_n(10,n) %>%ungroup() %>%arrange(album,n) %>%# 3. add order column of row numbersmutate(order=row_number())topFew <- topFew %>%mutate(total =sum(n)) %>%group_by(word) %>%# top_n(10,abs(n)) %>%# filter(total>=40) %>%# 1. remove groupingungroup() %>%# 2. Arrange by# i. facet group# ii. bar heightarrange(album,total) %>%# 3. add order column of row numbersmutate(order=row_number())``````{r}#| echo: false#| cache: false# Create a facet for each album and list the most frequented words for that album# https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facetslibrary(plotly) # package to make plot interactivelibrary(jpeg)library(grid)# afi_sts_bg<-jpeg::readJPEG("afi-sing-sorrow.jpg")# Prepare only albums for major releasealbum_major_release<-c("Answer That and Stay Fashionable","Very Proud of Ya","Shut Your Mouth and Open Your Eyes","Black Sails in the Sunset","The Art of Drowning","Sing the Sorrow","DECEMBERUNDERGROUND","Crash Love","Burials","AFI (The Blood Album)","Bodies")# colours for each album albumCol <-c("#c9a95a", # ATASF "#a0480f", # VPYa "#dc4d33", #SYMAOYE "#e86722", # Black Sails in the Sunset "#9f62ad", # The Art of Drowning "#92141c", # Sing the Sorrow "#d6dee5", # DU"#f3c937", # Crash Love "#7c7c7c", # Burials "#3b0404", # AFI (The Blood Album) "#595957") # Bodiesnames(albumCol) <- album_major_release# This ensures bars are stacked in order of release datetopFew$album <-factor(topFew$album, levels = album_major_release)wordsPlot <-ggplot(topFew) +geom_bar(aes(x = order, y = n,fill =as.factor(album)),colour ="#141517",stat ="identity",show.legend =FALSE) +# removes legendcoord_flip() +facet_wrap(~ album ,scales="free") +labs(title ="AFI's Top 10 most frequently used words by album",caption ="Source: genius.com | by @CalebPicker",y ="Frequency",x ="Word",fill ="Album") +# annotation_custom(rasterGrob(afi_sts_bg,# width = unit(1,"npc"),# height = unit(1,"npc")),# -Inf,Inf,-Inf,Inf) +scale_fill_manual(values = albumCol) +# add categories to axisscale_x_continuous(breaks = topFew$order,labels = topFew$word,expand =c(0,0) ) +theme(title =element_text(face ="bold", size =20), strip.background=element_rect(fill='#141517'),strip.text.x =element_text(color="#d51020"), # This impacted teh titles for each facet gridstrip.text.y =element_text(color="#d51020"),strip.text =element_text(size=12,color="#d51020"),panel.border =element_rect(colour ="#141517", fill=NA, size=1),panel.background =element_rect(colour ="#767975", fill ="white"),panel.grid.major.x =element_line(colour="#d51020",size =1, linetype =4),axis.title =element_text(face ="bold",size =20, colour ="#d51020"),axis.ticks.length =unit(5, units ="pt"),legend.position="none",# legend.background = element_rect(color='#141517', fill='#141517'),# legend.title = element_text(color='#d51020'),# legend.position = "top",# legend.key.size = unit(12,"pt"),# legend.box.spacing = unit(3,"pt"),# legend.text = element_text(size = 12,color='#d51020'),axis.text.y =element_text(size =12,face="bold"),axis.text.x =element_text(size =12,face="bold"),plot.background=element_rect(fill='#141517',color='#141517'),plot.title=element_text(color='#d51020'),plot.subtitle=element_text(color='#d51020'), )``````{r}#| echo: false#| include: true#| cache: false#| fig-height: 12#| fig-width: 14#| fig-cap: "Figure 2: Word frequency by each of AFI's albums"# plotly::ggplotly(wordsPlot,show.legend=FALSE)wordsPlot# ggsave(filename = "/Albums/AFIwordsbyalbum.png", plot = wordsPlot, width = 30, height = 24, units = "cm",# type = "cairo")```# Latent Semantic AnalysisIn this section, I followed [Gefen, Endicott, Fresneda, Miller, and Larsen's (2017)](https://aisel.aisnet.org/cais/vol41/iss1/21/ "Latent Semantic Analysis"){#lsa} process for a latent semantic analysis. The goal was to reveal how the words are related and to discover hidden themes.## Technical DetailsIf you're not into technical details, feel free to skip this section and go straight to [Cosine Similarity](#cosine-similarity-getting-more-interesting). Gefen et al.'s (2017) process allowed me to locally and globally weight the raw frequencies. The local weighting algorithm weights words more heavily that appear more often *within* a song (presumably because they're more important); the global weighting algorithm weights words less heavily if words appear more often *across* all songs (presumably because they're less important, like the word "the"). These weightings were then multiplied together.To start the process, I split up all the songs from the major album releases into separate text files. After importing them, I created a Term-Document-Matrix and first, counted the raw frequencies and second, locally and globally weighted those frequencies. The rest of the analyses for this post rely on these weighted frequencies.While importing the files, I set the minimum global frequency to 2. There are `r TDM_summary_terms` vocabulary words and `r TDM_summary_docs` songs. After that, I created a cosine similarity matrix to examine how related certain words are together. Finally, I examined the hidden themes by creating a matrix with word loadings (similar to factor analysis).## Cosine Similarity {#cosine-similarity-getting-more-interesting}Similar to concepts like [correlation](https://en.wikipedia.org/wiki/Correlation), [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) also measures similarity by measuring the distance between two vectors in semantic space. Cosine similarity ranges between -1 and +1:- closer to -1 means the words are more opposite,- closer +1 means the words are more similar, and- closer to 0 means the words are unrelated.### Related WordsIn this interactive table, I'm interested in sorting the values based on a column. In this case, I'll use the word `eye`. I click the column header twice, and I find the top three words (cosine similarity) are:- `stare` (+.53); `recal` (+.49); and `meet` (+.47)It looks like the use of the word `eye` is closely related to `star`ing, as well as to `recal`ling (a memory). Interesting!```{r}#| echo: false#| freeze: false#| include: true#| cache: falsemyCosineSpace2<-multicos(myTerms2,tvectors=tk2) # removed breakdown=TRUE arg, and it workedsuppressWarnings(DT::datatable(round(myCosineSpace2,2),options=list(scrollX=TRUE),caption="Table 4: Cosine similarity for relationships between individual words."))# library(LSAfun)# coherence(myTerms2,tvectors=tk2)```### Is `eye` more closely related to sight or to memory?The follow up question is whether `eye` is more closely related to sight or to memory words.```{r}#| include: trueprint(paste0("Eye and sight have cosine similarity rating of ", round(costring("eye","sight" ,tvectors=tk2[,1:ncol(tk2)]),2)))print(paste0("Eye and recal have cosine similarity rating of ", round(costring("eye",'recal' ,tvectors=tk2[,1:ncol(tk2)]),2)))print(paste0("Eye and recogn have cosine similarity rating of ", round(costring("eye",'recogn' ,tvectors=tk2[,1:ncol(tk2)]),2)))print(paste0("Eye and rememb have cosine similarity rating of ", round(costring("eye",'rememb' ,tvectors=tk2[,1:ncol(tk2)]),2)))```As it turns out, my hypothesis that `eye`, when projected onto the semantic space, is more closely related to memory words like `recal`, `recogn`, `rememb` is somewhat supported!What else would you like to see? Tell me in the comments on the [Discussion page](https://github.com/calebjpicker/Blog/discussions/2)!## Hidden ThemesOne of the key features of latent semantic analysis is that it can reveal hidden themes that one otherwise may not have noticed. This is demonstrated in the word loadings matrix. The full set is here for you to play with (Table 5), but I've also extracted the top five words for all dimensions in the Table 6.Here, I will present all factors from the analysis, but I will only analyze three of the hidden themes. The approach I took was to find hidden themes that were easier for me to interpret to better demonstrate the power of this analysis. The hidden themes are labeled V1 to V50, so feel free to explore to your heart's content. In my estimation, the hidden themes aren't so straightforward, so some help interpreting them would be welcomed! Feel free to post your findings in the [Discussion forum](https://github.com/calebjpicker/Blog/discussions/2)!```{r}#| include: true#| echo: false#| cache: false# Display all factor loadings for all wordstk2_round_df<-round(as.data.frame(tk2),2)DT::datatable(tk2_round_df,options=list(scrollX=TRUE),caption="Table 5: All words and their loading onto factors (hidden themes).")```Table 6 extracts the top five words (and their loadings) for each of the themes to make interpretation easier.```{r}#| include: true#| echo: false#| cache: false# Top 5 of each factortop_words<-sapply(tk2_round_df,function(x) head(row.names(tk2_round_df)[order(x,decreasing=TRUE)],5))top_value<-sapply(tk2_round_df,function(x) head(x[order(x,decreasing=TRUE)],5)) top_themes<-data.frame() top_themes<-cbind(top_words,top_value) DT::datatable(top_themes,options=list(scrollX=TRUE,scrollY=TRUE),caption="Table 6: Top 5 words loading onto their factors (hidden themes).")```### Hidden Theme 1: Love is Red Blood {data-link="Hidden Theme 1: Love is Red Blood"}For hidden theme 1, I analyzed (V3):- Top 5: `drop` (+18.59); `red` (+15.88); `love` (+15.27); `tast` (+11.61); `life` (+7.45)The first hidden theme seems to relate to `red drops` that signify `love` and `tast`ing `life`. Two songs come to mind: [She Speaks the Language](https://open.spotify.com/track/2b8Ajd7NgbjdLHzY2f8gex?si=def37951aa884707) with the line "Red, red drops upon my cuff let me know this must be love".So the use of these top 5 words make the connection between love and blood, so every drop of blood spilled is love spilling out of his body.### Hidden Theme 2: Forever Crawling We Rewind {#hidden-theme-2-forever-crawling-we-rewind data-link="Hidden Theme 2: Forever Crawling We Rewind"}For hidden theme 2 (V26), I find the top 5 words:- Top 5: `crawl` (+11.78); `forever` (+11.40); `float` (+11.29); `dream` (+10.71); and `meet` (+9.06)The second hidden theme seems to relate to forever crawling and floating in a dream. This reminds me of the song [Rewind](https://open.spotify.com/track/06fPxg3oOPbOUaZVVhghfd?si=5f462d1f9fac46da): "feel it crawl up on your chest...stay awake as its creeping close", "Let the ghost that meets your eyes still haunt you, remind you", "Your selfish words must float denied like your crime, I float now", and "Forever crawling, ever crawling".### Hidden Theme 3: Bathed in Your Radiance I Melt {data-link="Hidden Theme 3: Bathed in Your Radiance I Melt"}For hidden theme 3 (V6), I find:- Top 5: `twist` (+12.34); `melt` (+12.14); `gonna` (+11.17); `light` (+10.76); and `star` (+10.06)The third hidden theme seems to relate to `twist` and `melt`ing by the `light` of a `star`. [This Celluloid Dream](https://open.spotify.com/track/46Y9yh2KxmtodypW4bCp6v?si=ce45b9dc38fc4505) comes to mind with the lines: "Just like a memory, it twists me"; "bathed in your radiance I melt"; and "in the shadow of a star in static pallor". So this could mean, at least within the context of this song, Davey compares his lack of radiance (static pallor) to his twisted memory of the radiance of his star who radiates light that `melt`s him.```{r}#| include: false#| echo: false# This will show the matrix of documents# factoring of documents, # factors = 40, so there are 40 factors with loadings for each documentsdk2 = miniLSAspace$dk# round(dk2,2)#DT::datatable(round(dk2,2),# fillContainer = TRUE)``````{r}#| include: false#| echo: false# The sk matrix of singular values connects tk and dk matrices to reproduce the the original TDM2# Because the sk matrix only has diagonal values, R stores it as a numeric vectorsk2 = miniLSAspace$sk# round(sk2,2)```# ConclusionAFI is my favorite band of all time, and [Sing the Sorrow](https://open.spotify.com/album/1eIzVBHA5NvX0wo2nLACew?si=ntAoYywYSq-SrfJCGL9t3g) is my favorite album of all time. Combined with my love for statistics, I decided, for my first blog post tracking my data science journey, to analyze the content of AFI's lyrics.In addition to my goal of learning new data science skills for my [CV](https://calebjpicker.quarto.pub/cvr/), I wanted to gain a deeper understanding of the hidden themes and word usage patterns in Davey's lyrics. Although not every interpretation was straightforward, I did find that the most frequently used words were `feel` and `love`. I also found that `eye`s, when compared to `sight`, were more closely related to memory words, such as `recal`, `rememb`, and `recogn`.I also found a hidden metaphors: such as love is blood ([Hidden Theme 1: Love is Red Blood]). I found a memory forever crawling and begging to be rewound ([Hidden Theme 2: Forever Crawling We Rewind](#hidden-theme-2-forever-crawling-we-rewind)). Finally, I found how his memories of his star twist and melt him ([Hidden Theme 3: Bathed in Your Radiance I Melt]).Overall, this was a really fun exercise! It's my first time doing something like this, and I certainly learned a lot! I know I still have a lot more to learn, especially regarding best software development practices.I can hardly wait to attend the 20th Anniversary Sing the Sorrow event! For real. I've been in The Despair Faction fan club for 20 years. This is a real treat. You could say I've been waiting 20 years for this!Leave comments or discuss the blog on my [GitHub Blog](https://github.com/calebjpicker/Blog/discussions/2) page!