K-means Clustering, Hierarchical Clustering, and PCA for Top 50 Spotify Songs - 2019

Mehmet Emre CETIN
Sep 16, 2023
Updated: Nov 14, 2023

PCA (Principal Component Analysis), K-means, and Hierarchical Clustering in data science - made by DALL-E

Introduction

K-means Clustering and Hierarchical Clustering are applied to the same continuous variables, so those variables are described once for both methods. After that description, each method is analyzed individually, and the two are compared in the third phase. Beats per minute and Danceability were chosen as the continuous variables because a considerable body of research deals with them (Herremans, Martens & Sörensen, 2014; Goel et al., 2014; Jamdar et al., 2015; Elías Alonso, 2016; Howlin & Rooney, 2020, 2021; Kim, Aiello & Quercia, 2020; Krishna et al., n.d.). Those researchers also work in different academic fields. Some of these articles are directly related to K-means (Kim, Aiello & Quercia, 2020) and to Hierarchical Clustering (Kusama & Itoh, 2011, 2014; King & Imbrasaitė, 2015).

Finally, Principal Component Analysis (PCA) will be applied. Since academic references were given for the other methods, PCA should not be left without them. In one study, music genres were examined on the basis of PCA. The authors suggested that the artists within the genres should specifically raise the index value to make their works more popular (Long, Hu & Jin, 2021). In other research, Luo (2018) divided audio features into sets that are linearly uncorrelated with each other. She collected her data from Kaggle.com, the Spotify Web API, and the Discogs APIs, and she used PCA to divide the audio features. In another study, PCA, K-means, and Hierarchical Clustering were used together. One of the PCA findings in that article is that projections onto 2D images showed that the use of PCA with 20 components (maintaining around 80% variance) considerably improved the clustering models (Langensiepen, Cripps & Cant, 2018). All in all, this study aims to shed some light on K-means, Hierarchical Clustering, and Principal Component Analysis.

K-means Clustering

Output of K-means

Comments

The proportion of the total variance explained by the clustering is 68.9%. The cluster means are close to each other for Danceability, while the cluster means for BPM vary widely. One possible reason is scale: BPM can reach 200, whereas the Danceability score only reaches 100. Another reason for the difference between BPM and Danceability is that the list covers the 50 songs most listened to on Spotify worldwide, so the song categories or genres reflect listeners' choices. For example, it would not be fair to compare Electronic music (Hörschläger et al., 2015; Alspach, 2020) with Romantic music (Pérez Lloret et al., 2014) in terms of BPM. One suggestion here is that normalization could be applied (Dodge & Commenges, 2003), because the difference in scale is not limited to BPM and Danceability. Some variables in this dataset (Loudness dB) have scores below zero, and others (Length) have scores above 300.
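The percentage reported above can be read as the between-cluster sum of squares divided by the total sum of squares. The following is a minimal Python sketch of that ratio, assuming this is how the figure was obtained. The function name `variance_explained` and the six (BPM, Danceability) pairs are hypothetical and are not taken from the dataset.

```python
from statistics import fmean


def variance_explained(points, labels):
    """Return between-cluster SS divided by total SS for labelled points."""

    def centroid(pts):
        # Mean of each coordinate across the given points.
        return tuple(fmean(coord) for coord in zip(*pts))

    def sq_dist(p, q):
        # Squared Euclidean distance between two points.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)

    within_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = centroid(members)
        within_ss += sum(sq_dist(p, center) for p in members)

    # Between-cluster SS is whatever the clusters do not leave unexplained.
    return (total_ss - within_ss) / total_ss


# Hypothetical (BPM, Danceability) pairs, not taken from the dataset.
songs = [(95, 70), (100, 75), (105, 68), (170, 60), (176, 72), (190, 40)]
labels = [0, 0, 0, 1, 1, 1]

ratio = variance_explained(songs, labels)
print(f"Share of variance explained by the clustering: {ratio:.1%}")
```

Because BPM is on a larger scale than Danceability in this toy data, most of the total sum of squares comes from BPM, which is the scale effect the comments above describe.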

GGPlot

Comments

The first observation is that 125 beats per minute is the dividing point between the clusters. The previous interpretations can be seen here visually. Two observations stand out. One of them has 190 BPM, yet its Danceability score is 40. The other one has 85 BPM, but its Danceability is 29. The first one might be explained by its BPM, because beyond a certain tempo people might not want to dance. The second one is stranger: even at 85 BPM, a Danceability of 29 is low, since people dance to romantic songs too. A related question is how two songs listed in Spotify's top 50 ended up with the lowest Danceability scores. The definition of Danceability might help here. The dataset's author extracted the data from OrganizeYourMusic, and the website describes Danceability as follows: the higher the value, the easier it is to dance to this song. This explanation does not help either, so the dataset itself can be examined. The song that has 190 BPM belongs to Ariana Grande, and its category is dance-pop. The other song belongs to The Chainsmokers, and its category is "EDM", which stands for electronic dance music. So this song has 85 BPM and a Danceability of 29, and yet it was categorized as electronic dance music. Whoever categorized those songs deserves criticism. In fairness to the Danceability score, one should not expect people to dance to every song they listen to.

Since normalization is applied before performing Hierarchical Clustering, the same process will be applied to K-means Clustering too. This keeps the comparison with Hierarchical Clustering fair.
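The article does not state which normalization was used. As a hedged illustration, the sketch below assumes z-score standardization with the sample standard deviation. The helper name `standardize` and the values are hypothetical and are not taken from the dataset.

```python
from statistics import fmean, stdev


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mu = fmean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


# Hypothetical scores, not taken from the dataset.
bpm = [95, 100, 105, 170, 176, 190]
danceability = [70, 75, 68, 60, 72, 40]

bpm_z = standardize(bpm)
dance_z = standardize(danceability)

print([round(z, 2) for z in bpm_z])
print([round(z, 2) for z in dance_z])
```

After this step both variables are expressed in standard deviations from their own mean, so neither one outweighs the other in a Euclidean distance merely because of its units.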

Normalization of K-means

The proportion of the total variance explained by the clustering is 39.7%.

GGPlot for Normalization

Normalization of K-means Clustering ggplot

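The graph obtained from K-means Clustering after normalization looks almost identical to the previous one. The scaling does not seem to have changed the clustering visibly, even though the proportion of variance explained fell from 68.9% to 39.7%.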

Hierarchical Clustering

Before applying Hierarchical Clustering to this dataset, variables are normalized.

Plots

Comments

Single Linkage is confusing in terms of how the observations are clustered. The same can be said of Average Linkage. Complete Linkage, on the other hand, seems to fit better than the other two. To clarify this comment, the songs' numbers and their BPM/Danceability scores need to be examined. It looks like the left side of the dendrogram is for Danceability, and the right side of the dendrogram is for BPM. Number 36 is Martin Garrix, and its Danceability is 66. Number 21 is Martin Garrix, and its Danceability is 66. Number 26 is Shawn Mendes, and its Danceability is 69. Number 10 is Billie Eilish, and its Danceability is 70. Number 25 is Billie Eilish, and its Danceability is 67. The examples from the other side are as follows. Number 3 is Ariana Grande, and its BPM is 190. Number 14 is Sech, and its BPM is 176. Number 7 is Lil Tecca, and its BPM is 180. Number 17 is J Balvin, and its BPM is 176. Number 37 is Sech, and its BPM is 176.

So, Complete Linkage divides the songs in terms of some of the highest Danceability and BPM scores. On the left side, there is an interesting pattern involving songs by the same artists. Martin Garrix and Billie Eilish each have two songs side by side. How did Hierarchical Clustering group those two artists' work without knowing any other information about the songs? Could that be a coincidence? On the right side, the accuracy of Hierarchical Clustering is at its peak. At the very least, it can be stated that Hierarchical Clustering ordered the songs almost by their BPM.

The examination started from the left side. 36, 21, 26, 10, and 25 are K-means Clustering. 3, 14, 7, 17, and 37 are Hierarchical Clustering.

Cutting Tree for 2 Clusters

Complete Linkage

Cluster 1 is 28 and Cluster 2 is 22.

Average Linkage

Cluster 1 is 49 and Cluster 2 is 1.

Single Linkage

Cluster 1 is 49 and Cluster 2 is 1.

The interpretation given above can be seen when cutting the trees. Complete Linkage separated the observations by their highest scores. To obtain a more sensible answer, 4 clusters are examined next.

Cutting Tree for 4 Clusters

Complete Linkage

Cluster 1 is 28, Cluster 2 is 11, Cluster 3 is 10, and Cluster 4 is 1.

Average Linkage

Cluster 1 is 30, Cluster 2 is 1, Cluster 3 is 16, and Cluster 4 is 3.

Single Linkage

Cluster 1 is 47, Cluster 2 is 1, Cluster 3 is 1, and Cluster 4 is 1.

To grasp the output given above (cutting the tree), the Complete, Average, and Single linkages are drawn for 4 clusters.

Complete, Average, and Single Linkage 4 Clusters Plot

After a few tries, it is clear that Complete Linkage separated the observations by their highest scores. On the other hand, finding a common basis for the categorization made by Average Linkage and Single Linkage is more than difficult. Complete Linkage also splits the observations into more balanced clusters than the other two.
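The difference between the three linkages can be reproduced on a toy example. The sketch below is a naive agglomerative clustering written for illustration only. It is not the routine used for the plots above, and the function name `cluster_sizes` and the one-dimensional values are hypothetical. The values form a slowly widening chain plus one distant point, which is one situation where Single and Average Linkage may isolate a single observation while Complete Linkage gives a more balanced split.

```python
from itertools import combinations


def cluster_sizes(values, n_clusters, linkage):
    """Merge 1-D values bottom-up and return the sorted final cluster sizes."""
    combine = {
        "single": min,                        # closest pair between clusters
        "complete": max,                      # farthest pair between clusters
        "average": lambda d: sum(d) / len(d), # mean of all pairwise distances
    }[linkage]

    clusters = [[v] for v in values]  # every observation starts alone
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dists = [abs(a - b) for a in clusters[i] for b in clusters[j]]
            score = combine(dists)
            if best is None or score < best[0]:
                best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i stays valid
    return sorted(len(c) for c in clusters)


# Hypothetical values: a chain with growing gaps and one distant point.
toy = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0, 9.5]

for method in ("single", "average", "complete"):
    print(method, cluster_sizes(toy, 2, method))
```

On this toy data the two-cluster cut isolates the distant point under Single and Average Linkage, whereas Complete Linkage splits the chain itself. The same kind of effect may lie behind the 49/1 splits reported above, although that cannot be confirmed from the cluster sizes alone.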

Comparison of K-means Clustering and Hierarchical Clustering

In terms of reading the observations from the graphics, Hierarchical Clustering has advantages over K-means Clustering. The most important advantage is that the song numbers are shown on the dendrogram. This opens many paths for the analyst. For example, finding Martin Garrix (21 and 36) and Billie Eilish (10 and 25) in Hierarchical Clustering can clarify many details. Another example is that Hierarchical Clustering starts from the higher values of BPM and Danceability. A related critique can be raised here: some people may argue that starting from the highest scores is not the expected approach. Speaking of the highest scores, K-means only divided this dataset into two parts, and this division makes interpretation difficult. For that reason, HC is more accurate when it comes to interpretation, meaning that one can interpret the results more accurately when using HC. In fairness to both K-means Clustering and Hierarchical Clustering, Long, Hu, and Jin (2021) conclude their article with these words: "the higher the energy, valance, tempo, count, and liveness indexes are, the more the song fits the characteristics of the Pop/Rock genre and the more popular it is by loyal fans in the field." The point is that only two variables were used in this test. K-means Clustering and Hierarchical Clustering might have needed more indexes to cluster these songs.

PCA

Mean Values for the Indexes

Beats.Per.Minute: 120.06
Energy: 64.06
Danceability: 71.38
Loudness..dB..: -5.66
Liveness: 14.66
Valence.: 54.6
Length.: 200.96
Acousticness..: 22.16
Speechiness.: 12.48
Popularity: 87.5

The mean values of the variables are wide-ranging. For example, the mean of Loudness is below zero, while the mean of Length is about 201.

Variance Values for the Indexes

Beats.Per.Minute: 954.710612244898
Energy: 202.547346938776
Danceability: 142.322040816327
Loudness..dB..: 4.22897959183673
Liveness: 123.616734693878
Valence.: 498.897959183673
Length.: 1532.24326530612
Acousticness..: 360.831020408163
Speechiness.: 124.581224489796
Popularity: 20.1734693877551

The same interpretation can be made here. The variances of Length and Beats Per Minute are very high, so they can dominate the other variables.
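How strongly those two variables would dominate can be estimated from the variances listed above. The sketch below is a small worked calculation under the assumption that PCA were run on the unscaled data, where each variable contributes its raw variance to the total. The dictionary simply copies the reported values.

```python
# Variances as reported above, keyed by the dataset's column names.
variances = {
    "Beats.Per.Minute": 954.710612244898,
    "Energy": 202.547346938776,
    "Danceability": 142.322040816327,
    "Loudness..dB..": 4.22897959183673,
    "Liveness": 123.616734693878,
    "Valence.": 498.897959183673,
    "Length.": 1532.24326530612,
    "Acousticness..": 360.831020408163,
    "Speechiness.": 124.581224489796,
    "Popularity": 20.1734693877551,
}

total = sum(variances.values())
shares = {name: value / total for name, value in variances.items()}

# Combined share of the two largest-variance variables.
dominant = shares["Length."] + shares["Beats.Per.Minute"]

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {share:6.1%}")
print(f"Length + BPM together: {dominant:.1%}")
```

Length and Beats Per Minute together account for well over half of the summed variance, while Loudness contributes almost nothing, which is why standardizing before PCA matters for this dataset.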

As stated above, the variables are standardized first, and then PCA is performed.

Output of PCA

Comments on the Components

In PC1, seven loadings are on the negative side, and it seems that they could be used to label PC1. On a closer look, however, the loadings of BPM, Acousticness, and Popularity are close to each other. Also, some of the highest scores on each of these variables belong to songs in the top 50 list. For instance, number 13 is by Lewis Capaldi: the song's BPM is 110, its Acousticness is 75, and its Popularity is 88. Number 27 is by Tones and I: the song's BPM is 98, its Acousticness is 69, and its Popularity is 83. Number 11 is by Bad Bunny: the song's BPM is 176, its Acousticness is 60, and its Popularity is 93. To put these examples in context, the highest score for BPM is 190, for Acousticness 75, and for Popularity 95.

In PC2, again, seven loadings are on the negative side. BPM and Speechiness can be used to label this component because they have high negative loadings. Some of the highest scores in BPM are also some of the highest scores in Speechiness, meaning that the highest scores on BPM and Speechiness move in parallel. For example, number 3 is by Ariana Grande: the song's BPM is 190, and its Speechiness is 46. Number 11 is by J Balvin: the song's BPM is 176, and its Speechiness is 34. Number 7 is by Lil Tecca: the song's BPM is 180, and its Speechiness is 29.

In PC3, the loadings are divided equally: there are 5 positive loadings and 5 negative loadings. Danceability and Valence have the highest positive loadings. Valence needs an introduction here. The first source consulted was Organize Your Music, because the dataset's author extracted the data from this website. It explains Valence as follows: the higher the value, the more positive the mood of the song. In one study, Valence is described as covering the space between unpleasant (e.g. sad, stressed) and delightful (e.g. cheerful, elated) (Shahnaz & Hasan, 2016). One of the methods in that research is also PCA, which makes this definition of Valence more relevant here. As in the first two components, the selected loadings correspond to related high scores. Number 35 is sung by ROSALÍA: the song's Danceability is 88, and its Valence is 75. Number 9 is sung by Lil Nas X: the song's Danceability is 88, and its Valence is 64. Number 39 is sung by Jonas Brothers: the song's Danceability is 84, and its Valence is 95. A further argument for labeling PC3 with Danceability and Valence is that both Organize Your Music and Shahnaz and Hasan (2016) describe higher Valence with positive words. From that explanation, it can be stated that people dance to songs that affect them positively.

In PC4, again, seven loadings are on the negative side. Selecting loadings for this component is a bit tricky. Beats Per Minute and Liveness can be selected here because their loading scores are close to each other. When analyzing the loadings, however, one can see that this component produces an index that reflects Acousticness more.

Four components seemed enough. The reason for stopping at the fourth component is explained at the end of the paper.

BiPlot

According to the graphic, Beats Per Minute, Speechiness, and Popularity go in the same direction for PC1, while Acousticness goes in another direction. In the previous topic, it was stated that BPM, Acousticness, and Popularity can be selected for labeling the first component. However, the result of the biplot is different: Speechiness took the place of Acousticness.

In the previous topic, it was stated that BPM and Speechiness can be selected for labeling the second component. The biplot validates that suggestion. One interpretation can be added according to the biplot. Loudness and Energy can be labeling factors for PC2. On the other hand, BPM and Speechiness have higher negative loadings than Loudness and Energy.

Proportion of Variance Explained

PC1: 0.231143413332697
PC2: 0.164303321557439
PC3: 0.137106109520043
PC4: 0.105283062503717
PC5: 0.0977312018040011
PC6: 0.0836911427025456
PC7: 0.0716513481690877
PC8: 0.0547537154600018
PC9: 0.0319672217213592
PC10: 0.0223694632291085

The first principal component explains 23.11% of the variance. The second principal component explains 16.43% of the variance. The third one explains 13.71%, and the fourth one explains 10.53%. The four components collectively account for 63.78% of the total variance (Carlson et al., 2017).
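The cumulative figure can be checked directly from the proportions listed above. The following is a short worked calculation in Python that copies those ten values and accumulates them.

```python
from itertools import accumulate

# Proportion of variance explained by PC1..PC10, as reported above.
pve = [
    0.231143413332697, 0.164303321557439, 0.137106109520043,
    0.105283062503717, 0.0977312018040011, 0.0836911427025456,
    0.0716513481690877, 0.0547537154600018, 0.0319672217213592,
    0.0223694632291085,
]

cumulative = list(accumulate(pve))  # running total after each component

for k, (p, c) in enumerate(zip(pve, cumulative), start=1):
    print(f"PC{k:<2d} proportion {p:6.2%}   cumulative {c:7.2%}")
```

The running total after the fourth component is the 63.78% quoted above, and the ten proportions sum to 1 as expected.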

Plots and Comments on the Results

The first principal component explains 23.11% of the variance.

Cumulative Proportion of Variance Explained - PCA

As stated above, four components seemed enough, and the reason is explained here at the end of the paper. Holland (2008) asked how many PCs should be ignored. He stated the following criteria:

" 1) One common criteria is to ignore principal components at the point at which the next PC offers little increase in the total variance explained.

2) A second criteria is to include all those PCs up to a predetermined total percent variance explained, such as 90%.

3) A third standard is to ignore components whose variance explained is less than 1 when a correlation matrix is used or less than the average variance explained when a covariance matrix is used, with the idea being that such a PC offers less than one variable’s worth of information.

4) A fourth standard is to ignore the last PCs whose variance explained is all roughly equal."

Among the criteria he stated, the most feasible one for this study is the fourth. PC5 is 0.09773120, PC6 is 0.08369114, PC7 is 0.07165135, PC8 is 0.05475372, PC9 is 0.03196722, and finally, PC10 is 0.02236946. PC5 and the components that follow are not selected, because the first four components are not roughly equal to each other, whereas PC5, PC6, PC7, and so forth are very close to being equal. To sum up, four components are selected. One suggestion might be added here for combining the loadings: the BPM, Speechiness, Popularity, Danceability, Valence, and Acousticness indexes might be the factors for entering the top music lists (Long, Hu & Jin, 2021).
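Criterion 3 offers a quick cross-check on that choice. Assuming the PCA was run on standardized variables (which is equivalent to using a correlation matrix), the eigenvalues sum to the number of variables, so each eigenvalue can be recovered as the proportion of variance multiplied by 10. The sketch below applies that rule to the reported proportions and also prints the drop between consecutive components, which is the quantity criterion 4 looks at. The variable names are chosen for illustration.

```python
# Proportion of variance explained by PC1..PC10, as reported above.
pve = [
    0.231143413332697, 0.164303321557439, 0.137106109520043,
    0.105283062503717, 0.0977312018040011, 0.0836911427025456,
    0.0716513481690877, 0.0547537154600018, 0.0319672217213592,
    0.0223694632291085,
]

n_variables = len(pve)  # ten standardized variables, so total variance is 10

# Assumption: correlation-matrix PCA, so eigenvalue = proportion * n_variables.
eigenvalues = [p * n_variables for p in pve]

# Criterion 3: keep components worth at least one variable of information.
kept = [k for k, ev in enumerate(eigenvalues, start=1) if ev >= 1.0]

# Criterion 4: inspect how much each component drops relative to the previous.
drops = [pve[k] - pve[k + 1] for k in range(n_variables - 1)]

print("Eigenvalues:", [round(ev, 3) for ev in eigenvalues])
print("Kept under criterion 3:", kept)
print("Drops between consecutive PCs:", [round(d, 4) for d in drops])
```

Under this assumption, criterion 3 keeps exactly the first four components, which agrees with the number selected in this study.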

Credit for the dataset: Leonardo Henrique

Bibliography

Alspach, G. (2020). Electronic Music Subgenres for Music Providers.

Carlson, E., Saari, P., Burger, B., & Toiviainen, P. (2017). Personality and musical preference using social-tagging in excerpt-selection. Psychomusicology: Music, Mind, and Brain, 27(3), 203.

Dodge, Y., & Commenges, D. (Eds.). (2006). The Oxford dictionary of statistical terms. Oxford University Press on Demand.

Elías Alonso, G. (2016). Implementation of a real-time dance ability for mini maggie (Bachelor's thesis).

Goel, A., Sheezan, M., Masood, S., & Saleem, A. (2014, September). Genre classification of songs using neural network. In 2014 International Conference on Computer and Communication Technology (ICCCT) (pp. 285-289). IEEE.

Jamdar, A., Abraham, J., Khanna, K., & Dubey, R. (2015). Emotion analysis of songs based on lyrical and audio features. arXiv preprint arXiv:1506.05012.

Herremans, D., Martens, D., & Sörensen, K. (2014). Dance hit song prediction. Journal of New Music Research, 43(3), 291-302.

Holland, S. M. (2008). Principal components analysis (PCA). Department of Geology, University of Georgia, Athens, GA, 30602-2501.

Howlin, C., & Rooney, B. (2020). Patients choose music with high energy, danceability, and lyrics in analgesic music listening interventions. Psychology of Music, 0305735620907155.

Howlin, C., & Rooney, B. (2021). Cognitive agency in music interventions: Increased perceived control of music predicts increased pain tolerance. European Journal of Pain.

Hörschläger, F., Vogl, R., Böck, S., & Knees, P. (2015). Addressing tempo estimation octave errors in electronic music by incorporating style information extracted from Wikipedia. In Proceedings of the Sound and Music Computing Conference (SMC), Maynooth, Ireland.

Kim, Y., Aiello, L. M., & Quercia, D. (2020). PepMusic: motivational qualities of songs for daily activities. EPJ Data Science, 9(1), 13.

King, J., & Imbrasaitė, V. (2015, March). Generating music playlists with hierarchical clustering and Q-learning. In European Conference on Information Retrieval (pp. 315-326). Springer, Cham.

Krishna, A. G., Raju, C. G., Rathore, D. S., & Singh, N. S. Music Mood Visualizer using Pipelines.

Kusama, K., & Itoh, T. (2011, March). Muscat: a music browser featuring abstract pictures and zooming user interface. In Proceedings of the 2011 ACM Symposium on Applied Computing (pp. 1222-1228).

Kusama, K., & Itoh, T. (2014). Abstract picture generation and zooming user interface for intuitive music browsing. Multimedia tools and applications, 73(2), 995-1010.

Long, M., Hu, L., & Jin, F. (2021, March). Analysis of Main Characteristics of Music Genre Based on PCA Algorithm. In 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE) (pp. 101-105). IEEE.

Luo, K. (2018). Machine Learning Approach for Genre Prediction on Spotify Top Ranking Songs.

Langensiepen, C., Cripps, A., & Cant, R. (2018, March). Using PCA and K-Means to Predict Likeable Songs from Playlist Information. In 2018 UKSim-AMSS 20th International Conference on Computer Modelling and Simulation (UKSim) (pp. 26-31). IEEE.

Pérez Lloret, S., Diez, J. J., Domé, M. N., Alvarez Delvenne, A., Braidot, N., Cardinali, D. P., & Vigo, D. E. (2014). Effects of different "relaxing" music styles on the autonomic nervous system.

Shahnaz, C., & Hasan, S. S. (2016, November). Emotion recognition based on wavelet analysis of Empirical Mode Decomposed EEG signals responsive to music videos. In 2016 IEEE Region 10 Conference (TENCON) (pp. 424-427). IEEE.