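To measure how similar two documents are, we can use vector algebra to our advantage. Suppose we have two documents, d1 and d2.
Let D1 be the word-frequency vector of document d1
Let D2 be the word-frequency vector of document d2
Both vectors are built over the same vocabulary, so each position counts the same word. The similarity of the two documents can then be measured by the angle theta between the two vectors,
using the fact that
cos(theta) = (D1 * D2)/(N(D1) * N(D2))
where * is the dot product and N(D) is the norm (length) of the vector D, given by
N(D) = sqrt(D * D)
A cos(theta) close to 1 means the two documents use words in nearly the same proportions. Since word frequencies are never negative, the value always lies between 0 and 1, and 0 means the documents have no words in common.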
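The formula above can be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the function names (word_frequencies, dot, norm, cosine_similarity) are made up for this example, and the tokenizer is an assumption that simply lowercases the text and splits on whitespace.

```python
import math
from collections import Counter


def word_frequencies(text):
    # Assumption: a naive tokenizer that lowercases and splits on whitespace.
    return Counter(text.lower().split())


def dot(a, b):
    # Dot product of two sparse word-frequency vectors.
    # Only words that appear in both documents contribute.
    return sum(count * b[word] for word, count in a.items() if word in b)


def norm(d):
    # N(D) = sqrt(D * D)
    return math.sqrt(dot(d, d))


def cosine_similarity(text1, text2):
    D1 = word_frequencies(text1)
    D2 = word_frequencies(text2)
    denominator = norm(D1) * norm(D2)
    if denominator == 0:
        # At least one document is empty, so the angle is undefined;
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    # cos(theta) = (D1 * D2) / (N(D1) * N(D2))
    return dot(D1, D2) / denominator


if __name__ == "__main__":
    print(cosine_similarity("the cat sat on the mat", "the cat sat on the hat"))
    print(cosine_similarity("the cat sat", "a dog ran"))
```

Storing each vector as a Counter keeps only the words that actually occur, so the vocabulary never has to be listed up front. Identical documents score approximately 1 (up to floating-point rounding), and documents with no words in common score 0.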
Wednesday, September 09, 2009