Wednesday, September 09, 2009

Finding Document Similarity

If you need to find how similar two documents are, you can use vector algebra to your advantage. If we have two documents d1 and d2, then

Let D1 be a vector of the word-frequency of the document d1
Let D2 be a vector of the word-frequency of the document d2

then the deviation between the two documents is the angle theta between D1 and D2, which can be calculated using the fact that

cos(theta) = (D1 * D2)/(N(D1) * N(D2))

where D1 * D2 is the dot product of the two vectors and N(D) is the norm (length) of the vector D, given by

N(D) = sqrt(D * D)
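
As a rough illustration, the formulas above could be sketched in Python as follows. The function names (word_frequency, dot, norm, angle) and the tokenization (lowercasing and splitting on whitespace) are assumptions made for this example, not part of the method itself.

```python
import math
from collections import Counter


def word_frequency(text):
    """Build the word-frequency vector of a document as a word -> count map."""
    # Assumption: words are lowercased and split on whitespace.
    return Counter(text.lower().split())


def dot(a, b):
    """Dot product D1 * D2; a word missing from b counts as 0."""
    return sum(count * b[word] for word, count in a.items())


def norm(d):
    """N(D) = sqrt(D * D), the length of the vector."""
    return math.sqrt(dot(d, d))


def angle(d1, d2):
    """Angle theta (in radians) between two word-frequency vectors."""
    # Note: this divides by zero if either document has no words.
    cos_theta = dot(d1, d2) / (norm(d1) * norm(d2))
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


D1 = word_frequency("the cat sat on the mat")
D2 = word_frequency("the dog sat on the log")
print(angle(D1, D2))
```

An angle close to 0 means the two documents use words in nearly the same proportions, while an angle of pi/2 means they share no words at all.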
