Saturday, January 30, 2010

Similarity in application features - Cosine Similarity

Let's say you are planning to write a rating system for songs,

and you are interested in the following queries:
- average rating of a given song
- average rating a user gives to a song
- which songs are most similar
- which users rate similar songs
- which users have similar tastes

Here we have two dimensions: user and song rating. The averages can be read straight off a rating matrix, and cosine similarity answers the similarity queries. Let's do this step by step.

Step 1
Create a two-dimensional matrix of song ratings vs. users

           A      B      C      Avg
song1      3      2      1      2
song2      4      2      3      3
song3      2      4      5      11/3
average    3      8/3    3      26/9


Here the columns A, B and C hold the ratings given by users A, B and C.
The last column is the average rating of a given song, and the last row is the average rating a given user hands out.
Each of the other rows represents a song and the ratings given by different users for that song.
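The two average queries can be answered directly from this matrix. Here is a minimal sketch in Python, assuming the ratings are held in a plain dictionary; the names `ratings`, `song_averages` and `user_averages` are made up for illustration and are not part of any library.

```python
from fractions import Fraction

# Ratings from the table above: one row per song, columns are users A, B, C.
ratings = {
    "song1": [3, 2, 1],
    "song2": [4, 2, 3],
    "song3": [2, 4, 5],
}
users = ["A", "B", "C"]


def song_averages(ratings):
    """Average rating of each song, i.e. the mean of its row."""
    return {song: Fraction(sum(row), len(row)) for song, row in ratings.items()}


def user_averages(ratings, users):
    """Average rating each user gives, i.e. the mean of each column."""
    columns = zip(*ratings.values())
    return {user: Fraction(sum(col), len(col)) for user, col in zip(users, columns)}


print(song_averages(ratings))
print(user_averages(ratings, users))
```

Fractions are used only so that values such as 11/3 and 8/3 come out exactly as in the table; plain floats would work just as well.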

Step 2
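Normalize each row of the matrix to unit length. Because ratings are never negative, the normalized values lie between 0 and 1.

Divide each value in a row by the square root of the sum of the squares of all the values in that row. E.g. to normalize the value 3 in the song1 row, divide 3 by
sqrt(3^2 + 2^2 + 1^2)

The matrix to normalize is the rating matrix without the averages:

           A      B      C
song1      3      2      1
song2      4      2      3
song3      2      4      5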
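The normalization rule can be sketched in a few lines of Python. This is only an illustration; the function name `normalize` is made up, and treating an all-zero row as "leave it as zeros" is an assumption the post does not cover.

```python
import math


def normalize(row):
    """Scale a row of ratings to unit length (divide by its Euclidean norm)."""
    norm = math.sqrt(sum(value ** 2 for value in row))
    if norm == 0:
        # Assumption: a row with no non-zero ratings is returned as all zeros.
        return [0.0 for _ in row]
    return [value / norm for value in row]


# song1 has ratings 3, 2, 1, so every value is divided by sqrt(14).
print([round(v, 4) for v in normalize([3, 2, 1])])  # [0.8018, 0.5345, 0.2673]
```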


Step 4
Obtain a new table by applying the normalization rule to every row

           A        B        C
song1      0.8018   0.5345   0.2673
song2      0.7428   0.3714   0.5571
song3      0.2981   0.5963   0.7454


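Now multiply each row with every other row to obtain a similarity matrix.
Each multiplication is the dot product of the two rows. Since every row has unit length, that dot product is the cosine of the angle between the two rows.

For example, the dot product of song1 and song2 is
0.8018*0.7428 + 0.5345*0.3714 + 0.2673*0.5571 = 0.943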
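Normalizing first and then taking the dot product is the same as dividing the dot product of the raw rows by the product of their lengths, so both steps can be folded into one function. A minimal sketch, with `cosine_similarity` as an illustrative name:

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


song1 = [3, 2, 1]
song2 = [4, 2, 3]
print(round(cosine_similarity(song1, song2), 3))  # 0.943
```

This sketch assumes neither row is all zeros; a song nobody has rated would need separate handling.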

Step 5
Use the dot product rule on every pair of songs to obtain the similarity matrix

           song1    song2    song3
song1      1        0.943    0.757
song2      0.943    1        0.858
song3      0.757    0.858    1


From this matrix it's easy to see that song1 is more similar to song2 (0.943) than to song3 (0.757).
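Building the whole matrix and reading off the closest match can be sketched as follows. The helper names `similarity_matrix` and `most_similar` are hypothetical, and the matrix is stored as a nested dictionary purely for readability.

```python
import math

ratings = {
    "song1": [3, 2, 1],
    "song2": [4, 2, 3],
    "song3": [2, 4, 5],
}


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(ratings):
    """Cosine similarity of every row against every other row."""
    names = list(ratings)
    return {
        a: {b: cosine_similarity(ratings[a], ratings[b]) for b in names}
        for a in names
    }


def most_similar(name, matrix):
    """Name of the row with the highest similarity to `name`, excluding itself."""
    others = {other: score for other, score in matrix[name].items() if other != name}
    return max(others, key=others.get)


matrix = similarity_matrix(ratings)
for name, row in matrix.items():
    print(name, [round(score, 3) for score in row.values()])
print(most_similar("song1", matrix))  # song2
```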

Step 6
Finding similar users works the same way as finding similar songs: just transpose the initial matrix, so that the users become the rows and the song ratings become the columns.

Then apply Step 2 through Step 5 in order to obtain the similarity matrix for users as well.
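A sketch of the user version, assuming the same song-by-user ratings as in Step 1. The `transpose` helper is illustrative; with these numbers, users B and C come out as the closest pair.

```python
import math

song_ratings = {
    "song1": [3, 2, 1],
    "song2": [4, 2, 3],
    "song3": [2, 4, 5],
}
users = ["A", "B", "C"]


def transpose(ratings, users):
    """Turn song rows into user rows: each user gets their rating of every song."""
    columns = zip(*ratings.values())
    return {user: list(col) for user, col in zip(users, columns)}


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


user_ratings = transpose(song_ratings, users)  # e.g. "A" -> [3, 4, 2]
user_similarity = {
    a: {b: cosine_similarity(user_ratings[a], user_ratings[b]) for b in users}
    for a in users
}
for user, row in user_similarity.items():
    print(user, [round(score, 3) for score in row.values()])
```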

1 comment:

rizwan said...

Thanks, it helped a lot.
Wondering if some song (say) song1 is to be suggested to a user (say) User C, how should we proceed after calculating this similarity matrix.