Tidy Spice
This post is my reproduction of Julia Silge’s blogpost Topic Modeling for #TidyTuesday Spice Girls Lyrics, with some added inspiration from a blogpost by Ariane Aumaitre called Tutorial: Text analysis and data visualization with Taylor Swift songs.

Setup
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.0.5
library(tidytext)
library(stm)
library(scales) #will be needed for percentage scales in ggplot
knitr::opts_chunk$set(error=FALSE,
message= FALSE,
warning=FALSE)
Load Data
lyrics <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-14/lyrics.csv")
Explore
First five rows of data
Album and Song Names
# Show 5 Rows
head(lyrics, 5)
## # A tibble: 5 × 9
## artist_name album_name track_number song_id song_name line_number section_name
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Spice Girls Spice 1 89740 Wannabe 1 Intro
## 2 Spice Girls Spice 1 89740 Wannabe 2 Intro
## 3 Spice Girls Spice 1 89740 Wannabe 3 Intro
## 4 Spice Girls Spice 1 89740 Wannabe 4 Intro
## 5 Spice Girls Spice 1 89740 Wannabe 5 Intro
## # … with 2 more variables: line <chr>, section_artist <chr>
lyrics %>% distinct(album_name, song_name)
## # A tibble: 31 × 2
## album_name song_name
## <chr> <chr>
## 1 Spice "Wannabe"
## 2 Spice "Say You\x92ll Be There"
## 3 Spice "2 Become 1"
## 4 Spice "Love Thing"
## 5 Spice "Last Time Lover"
## 6 Spice "Mama"
## 7 Spice "Who Do You Think You Are"
## 8 Spice "Something Kinda Funny"
## 9 Spice "Naked"
## 10 Spice "If U Can\x92t Dance"
## # … with 21 more rows
Tidy
Change
\x92to ’Split words in line column into tokens
Anti-Join each word into it’s own row
tidy_lyrics <-
lyrics %>%
mutate(song_name = str_replace_all(song_name, "\x92", "'")) %>%
unnest_tokens(word, line) %>%
anti_join(get_stopwords())
tidy_lyrics
## # A tibble: 6,663 × 9
## artist_name album_name track_number song_id song_name line_number
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Spice Girls Spice 1 89740 Wannabe 1
## 2 Spice Girls Spice 1 89740 Wannabe 2
## 3 Spice Girls Spice 1 89740 Wannabe 2
## 4 Spice Girls Spice 1 89740 Wannabe 2
## 5 Spice Girls Spice 1 89740 Wannabe 2
## 6 Spice Girls Spice 1 89740 Wannabe 2
## 7 Spice Girls Spice 1 89740 Wannabe 2
## 8 Spice Girls Spice 1 89740 Wannabe 3
## 9 Spice Girls Spice 1 89740 Wannabe 3
## 10 Spice Girls Spice 1 89740 Wannabe 3
## # … with 6,653 more rows, and 3 more variables: section_name <chr>,
## # section_artist <chr>, word <chr>
Most Common Words
Long Way
tidy_lyrics %>%
group_by(word) %>%
summarise(n = n()) %>%
arrange(-n)
## # A tibble: 979 × 2
## word n
## <chr> <int>
## 1 get 153
## 2 love 137
## 3 know 124
## 4 time 106
## 5 wanna 102
## 6 never 101
## 7 oh 88
## 8 yeah 88
## 9 la 85
## 10 got 82
## # … with 969 more rows
Short Way
tidy_lyrics %>%
count(word, sort = TRUE)
## # A tibble: 979 × 2
## word n
## <chr> <int>
## 1 get 153
## 2 love 137
## 3 know 124
## 4 time 106
## 5 wanna 102
## 6 never 101
## 7 oh 88
## 8 yeah 88
## 9 la 85
## 10 got 82
## # … with 969 more rows
tidy_lyrics %>%
count(word, sort = TRUE) %>%
filter(n > 80,
word != "la",
word != "oh") %>%
ggplot(aes(x = n, y = reorder(word, n), fill = word)) +
geom_col() +
labs(y = "",
x = "Number of Times Mentioned",
title = "Most Frequent Words in Spice Girls Lyrics")

Most Common Word Per Song
# Most Common Word per Song
tidy_lyrics %>%
count(song_name, word, sort = TRUE)
## # A tibble: 2,206 × 3
## song_name word n
## <chr> <chr> <int>
## 1 Saturday Night Divas get 91
## 2 Spice Up Your Life la 64
## 3 If U Can't Dance dance 60
## 4 Holler holler 48
## 5 Never Give Up on the Good Times never 47
## 6 Move Over generation 41
## 7 Saturday Night Divas deeper 41
## 8 Move Over yeah 39
## 9 Something Kinda Funny got 39
## 10 Never Give Up on the Good Times give 38
## # … with 2,196 more rows
tidy_lyrics %>%
count(song_name, word, sort = TRUE) %>%
filter(n >40,
word != "la") %>%
ggplot(aes(x = n, y = reorder(word, n), fill = word)) +
geom_col() +
labs(y = "",
x = "Number of Times Mentioned",
title = "Most Frequent Words in Spice Girls Lyrics in a Single Song")

Train a topic model
(Analyze text data to determine cluster words)
lyrics_sparse <-
tidy_lyrics %>%
count(song_name, word) %>%
cast_sparse(song_name, word, n)
dim(lyrics_sparse)
## [1] 31 979
This means there are 31 songs (i.e. documents) and 979 different tokens (i.e. terms or words) in our dataset for modeling.
“The most important parameter when training a topic modeling is K, the number of topics. This is like k in k-means in that it is a hyperparamter of the model and we must choose this value ahead of time. We could try multiple different values to find the best value for K, but this is a very small dataset so let’s just stick with K = 4.”
set.seed(123) # random number generator
topic_model <- stm(lyrics_sparse, K = 4, verbose = FALSE)
summary(topic_model)
## A topic model with 4 topics, 31 documents and a 979 word dictionary.
## Topic 1 Top Words:
## Highest Prob: get, wanna, deeper, right, night, come, gotta
## FREX: deeper, saturday, get, comin, back, night, ya
## Lift: jump, party's, body, another, anyway, blameless, breaking
## Score: deeper, saturday, get, night, comin, arms, wanna
## Topic 2 Top Words:
## Highest Prob: dance, yeah, know, generation, next, love, naked
## FREX: next, naked, denying, foolin, nobody, wants, lead
## Lift: foolin, nobody, question, next, admit, bein, check
## Score: next, dance, naked, generation, denying, colour, foolin
## Topic 3 Top Words:
## Highest Prob: got, holler, make, love, wanna, oh, time
## FREX: holler, kinda, swing, funny, yay, use, driving
## Lift: anyone, driving, fantasy, oller, blow, nudge, unwind
## Score: holler, swing, kinda, funny, yay, ashamed, loving
## Topic 4 Top Words:
## Highest Prob: la, never, love, give, time, know, way
## FREX: times, swear, la, bring, promise, viva, tried
## Lift: aggravation, angel, dreamt, heaven, letting, revelation, sent
## Score: la, times, aha, swear, chicas, front, havin
Explore topic model results
word_topics <- tidy(topic_model, matrix = "beta")
word_topics
## # A tibble: 3,916 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 achieve 1.94e- 3
## 2 2 achieve 8.51e-29
## 3 3 achieve 1.00e-25
## 4 4 achieve 9.51e-19
## 5 1 baby 1.38e- 2
## 6 2 baby 1.44e- 2
## 7 3 baby 1.28e- 3
## 8 4 baby 4.16e- 3
## 9 1 back 2.31e- 2
## 10 2 back 5.44e- 4
## # … with 3,906 more rows
Visualization
word_topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
mutate(topic = paste("Topic", topic)) %>%
ggplot(aes(beta, reorder_within(term, beta, topic), fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(topic), scales = "free_y") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_reordered() +
labs(x = expression(beta), y = NULL)

Gamma Matrix
song_topics <- tidy(topic_model,
matrix = "gamma",
document_names = rownames(lyrics_sparse)
)
song_topics
## # A tibble: 124 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 2 Become 1 1 0.932
## 2 Denying 1 0.00154
## 3 Do It 1 0.996
## 4 Get Down With Me 1 0.300
## 5 Goodbye 1 0.000971
## 6 Holler 1 0.00155
## 7 If U Can't Dance 1 0.000896
## 8 If You Wanna Have Some Fun 1 0.0171
## 9 Last Time Lover 1 0.140
## 10 Let Love Lead the Way 1 0.00178
## # … with 114 more rows
song_topics %>%
mutate(
song_name = fct_reorder(document, gamma),
topic = factor(topic)
) %>%
ggplot(aes(gamma, topic, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(song_name), ncol = 4) +
scale_x_continuous(expand = c(0, 0)) +
labs(x = expression(gamma), y = "Topic")

Estimate Regression
effects <-
estimateEffect(
1:4 ~ album_name,
topic_model,
tidy_lyrics %>% distinct(song_name, album_name) %>% arrange(song_name)
)
summary(effects)
##
## Call:
## estimateEffect(formula = 1:4 ~ album_name, stmobj = topic_model,
## metadata = tidy_lyrics %>% distinct(song_name, album_name) %>%
## arrange(song_name))
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14061 0.12301 1.143 0.263
## album_nameSpice 0.09258 0.17701 0.523 0.605
## album_nameSpiceworld 0.15105 0.17327 0.872 0.391
##
##
## Topic 2:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1471 0.1327 1.109 0.277
## album_nameSpice 0.1327 0.1887 0.703 0.488
## album_nameSpiceworld 0.1472 0.1851 0.795 0.433
##
##
## Topic 3:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29480 0.12114 2.434 0.0216 *
## album_nameSpice 0.07952 0.17174 0.463 0.6470
## album_nameSpiceworld -0.28112 0.16919 -1.662 0.1078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 4:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.41760 0.13938 2.996 0.00567 **
## album_nameSpice -0.30293 0.19793 -1.531 0.13711
## album_nameSpiceworld -0.01948 0.19352 -0.101 0.92053
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1