Finding the N largest files under a folder
find . -type f -print | xargs ls -l | sort -rn -k 5,5 | head -n N
Finding the N smallest files under a folder
find . -type f -print | xargs ls -l | sort -rn -k 5,5 | tail -n N
Here, replace N with the number of files you want in the output. Since the sort is descending by size (column 5), head gives the largest files and tail the smallest.
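One caveat: xargs splits on whitespace, so these one-liners misbehave on filenames containing spaces. A minimal Python sketch of a more robust alternative (largest_files is a name I made up for illustration, not a standard function):

```python
import heapq
import os

def largest_files(root, n):
    """Return the n largest files under root as (size, path) pairs."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:  # broken symlink, permission denied, etc.
                pass
    return heapq.nlargest(n, sizes)

for size, path in largest_files(".", 10):
    print(size, path)
```

Swap heapq.nlargest for heapq.nsmallest to get the N smallest files instead.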
Monday, March 01, 2010
Tuesday, February 16, 2010
Little Sugar: Bash
Removing files older than 30 days
find /home/user_dir/app_name/logs -type f -ctime +30 -exec rm {} ';'
Saturday, January 30, 2010
Similarity in application features - Cosine Similarity
Let's say you are planning to write a rating system for music songs,
and you are interested in the following queries:
- average rating of a given song
- average rating a user gives to a song
- which songs are most similar
- which users rate similar songs
- which users have similar tastes
Here we have two dimensions, users and song ratings. We can use cosine similarity to answer the above queries. Let's do this step by step.
Step 1
Create a 2-dimensional matrix of song ratings vs. users:
| | A | B | C | Avg |
|---|---|---|---|---|
| song1 | 3 | 2 | 1 | 2 |
| song2 | 4 | 2 | 3 | 3 |
| song3 | 2 | 4 | 5 | 11/3 |
| average | 3 | 8/3 | 3 | 26/9 |
Here the columns represent the ratings given by users A, B, and C.
The last column is the average rating of each song.
Each row is a song together with the ratings different users gave it.
Step 2
Normalize the values in the matrix so that they lie between 0 and 1: drop the Avg column, then divide each value by the square root of the sum of the squares of all the values in its row. For example, to normalize the value 3 in song1's row, divide 3 by sqrt(3^2 + 2^2 + 1^2).
| | A | B | C |
|---|---|---|---|
| song1 | 3 | 2 | 1 |
| song2 | 4 | 2 | 3 |
| song3 | 2 | 4 | 5 |
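As a quick sanity check, here is a one-function Python sketch of this normalization rule (normalize_row is a name I made up for illustration):

```python
import math

def normalize_row(row):
    """Divide each rating by the Euclidean norm of its row."""
    norm = math.sqrt(sum(r * r for r in row))
    return [round(r / norm, 4) for r in row]

print(normalize_row([3, 2, 1]))  # song1 -> [0.8018, 0.5345, 0.2673]
```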
Step 3
Obtain a new table by applying the normalization rule:
| | A | B | C |
|---|---|---|---|
| song1 | .8018 | .5345 | .2673 |
| song2 | .7428 | .3714 | .557 |
| song3 | .2981 | .5963 | .7454 |
Now multiply each row with every other row to obtain a similarity matrix. The multiplication is the dot product of the two rows. For example, the song1 and song2 dot product yields
.8018*.7428 + .5345*.3714 + .2673*.557 = .943
Step 4
Use the dot product rule to obtain the similarity matrix:
| | song1 | song2 | song3 |
|---|---|---|---|
| song1 | 1 | .943 | .757 |
| song2 | .943 | 1 | .858 |
| song3 | .757 | .858 | 1 |
From this matrix it's easy to see that song1 is more similar to song2 (.943) than to song3 (.757).
Step 5
Finding similar users works just like finding similar songs: make the users the rows of the initial matrix and the song ratings the columns. Then apply Steps 2 through 4 in order to obtain the similarity matrix for users as well.
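Putting Steps 1 through 4 together, here is a minimal Python sketch (the names ratings, normalize, and similarity_matrix are mine, just for illustration):

```python
import math

# Ratings by users A, B, C for each song (the Step 1 matrix).
ratings = {
    "song1": [3, 2, 1],
    "song2": [4, 2, 3],
    "song3": [2, 4, 5],
}

def normalize(row):
    """Step 2: divide each value by the row's Euclidean norm."""
    norm = math.sqrt(sum(r * r for r in row))
    return [r / norm for r in row]

def similarity_matrix(rows):
    """Steps 3-4: dot products of the normalized rows."""
    normed = {name: normalize(row) for name, row in rows.items()}
    return {
        (a, b): sum(x * y for x, y in zip(normed[a], normed[b]))
        for a in normed
        for b in normed
    }

sim = similarity_matrix(ratings)
print(round(sim[("song1", "song2")], 3))  # 0.943
print(round(sim[("song1", "song3")], 3))  # 0.757
```

For Step 5, transpose the input (users as rows, song ratings as columns) and the same code produces the user-similarity matrix.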
Tuesday, January 26, 2010
Adding feedback loop to your applications
These days everyone is talking about smart applications.
One very simple thing smart applications do is measure user interactions, learn from them, and act in a more personalized manner towards the user.
User interaction can mean different things for different applications. Here I am going to talk about one popular user-interaction metric:
"Time spent by a user on a given feature"
Assuming you have some way of measuring the time spent by a user on a given feature, you can calculate the average time a user spends and compare it with the time spent by other genuine users using some statistics.
Say you have a list of users and their corresponding times spent on the feature.
You can calculate the mean time for the list. Let's call this mean_time.
You can also calculate the standard deviation for the list. Let's call this time_range_delta.
Your actual time_range interval is then (mean_time - time_range_delta * range_factor, mean_time + time_range_delta * range_factor),
where range_factor is the amount of spurious information you are willing to tolerate; a value in the range [1, 3] is generally considered good.
Once you have this interval, you can ignore all times that fall outside it.
This gives a new list, from which you can calculate the mean and standard deviation again.
By following this process iteratively, you arrive at the list of genuine users and their corresponding times.
Calculating the mean of this final list gives the average time a genuine user spends on the feature.
You can then classify any new user against this average time.
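Here is a minimal Python sketch of this iterative filtering; trim_outliers and its default arguments are my own illustrative choices, not from the post:

```python
import statistics

def trim_outliers(times, range_factor=2.0, max_iterations=10):
    """Repeatedly drop times outside mean_time +/- time_range_delta * range_factor."""
    for _ in range(max_iterations):
        if len(times) < 2:  # stdev needs at least two samples
            break
        mean_time = statistics.mean(times)
        time_range_delta = statistics.stdev(times)
        lo = mean_time - time_range_delta * range_factor
        hi = mean_time + time_range_delta * range_factor
        kept = [t for t in times if lo <= t <= hi]
        if len(kept) == len(times):  # nothing dropped: converged
            break
        times = kept
    return times, statistics.mean(times)

# Times (in seconds) spent on the feature; 300 and 1 look spurious.
genuine, avg = trim_outliers([30, 32, 28, 35, 31, 300, 1])
print(genuine, round(avg, 1))  # [30, 32, 28, 35, 31] 31.2
```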