Monday, March 01, 2010

Little Sugar: Bash :: Finding the N largest or smallest files under a folder

Finding the N largest files under a folder

find . -type f -print | xargs ls -l | sort -r -n -k 5,5 | head -n N

Finding the N smallest files under a folder

find . -type f -print | xargs ls -l | sort -n -k 5,5 | head -n N

Replace N with the number of files you want in the output. Column 5 of the ls -l output is the file size in bytes, so sorting on it in reverse order puts the largest files first.

Tuesday, February 16, 2010

Little Sugar: Bash

Removing files older than 30 days

find /home/user_dir/app_name/logs -ctime +30 -exec rm {} ';'

Saturday, January 30, 2010

Similarity in application features - Cosine Similarity

Let's say you are planning to write a rating system for songs,

and you are interested in the following queries:
- the average rating of a given song
- the average rating a user gives to songs
- which songs are most similar
- which users rate similar songs
- which users have similar tastes

Here we have two dimensions: user and song rating. We can use cosine similarity to answer the similarity queries above. Let's do this step by step.

Step 1
Create a 2-dimensional matrix of song ratings vs. users.

         A     B     C     Avg
song1    3     2     1     2
song2    4     2     3     3
song3    2     4     5     11/3
average  3     8/3   3     26/9

Each column holds the ratings given by one of the users A, B, C.
The last column is the average rating of each song.
Each row represents a song and the ratings different users gave it; the last row is the average rating each user gives.
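
The two average queries can be read straight off the table. As a minimal sketch in Python (the dictionary layout and the function names song_average and user_average are assumptions made for illustration, they are not part of the original design):

```python
from fractions import Fraction

# Ratings from the table above: one entry per song, keyed by user A, B, C.
ratings = {
    "song1": {"A": 3, "B": 2, "C": 1},
    "song2": {"A": 4, "B": 2, "C": 3},
    "song3": {"A": 2, "B": 4, "C": 5},
}

def song_average(song):
    """Average rating of one song across all users (the Avg column)."""
    values = ratings[song].values()
    return Fraction(sum(values), len(values))

def user_average(user):
    """Average rating one user gives across all songs (the average row)."""
    values = [row[user] for row in ratings.values()]
    return Fraction(sum(values), len(values))

for song in ratings:
    print(song, song_average(song))
for user in "ABC":
    print(user, user_average(user))
```

Fractions are used only so that values such as 11/3 print exactly as they appear in the table; plain floats would work just as well.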

Step 2
Normalize each row of the matrix so that every value lies between 0 and 1 and the row has unit length.

Divide each value in a row by the square root of the sum of the squares of all the values in that row. E.g. to normalize the value 3 in the song1 row, divide 3 by
sqrt(3^2 + 2^2 + 1^2)

The matrix to normalize is the rating matrix without the averages:

         A     B     C
song1    3     2     1
song2    4     2     3
song3    2     4     5
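
The normalization rule can be sketched in a few lines of Python. The function name normalize is a placeholder chosen for this example, and the sketch assumes a row is never all zeros (that case would divide by zero):

```python
import math

def normalize(row):
    """Scale a row so that its Euclidean length becomes 1."""
    # Assumption: at least one rating in the row is non-zero.
    length = math.sqrt(sum(value * value for value in row))
    return [value / length for value in row]

song1 = [3, 2, 1]
print([round(value, 4) for value in normalize(song1)])  # [0.8018, 0.5345, 0.2673]
```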


Step 4
Obtain a new table after applying the normalization rule.

         A       B       C
song1    .8018   .5345   .2673
song2    .7428   .3714   .557
song3    .2981   .5963   .7454


Now multiply each row with every other row to obtain a similarity matrix.
The multiplication is the dot product of the two rows.

For example, the dot product of song1 and song2 yields
.8018*.7428 + .5345*.3714 + .2673*.557 = .943

Step 5
Use the dot product rule to obtain the similarity matrix.

         song1   song2   song3
song1    1       .943    .757
song2    .943    1       .858
song3    .757    .858    1


From this matrix it's easy to see that song1 is more similar to song2 (.943) than to song3 (.757).
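
As a rough sketch, the whole similarity matrix can be computed in Python. Instead of normalizing first and then taking dot products, this version divides the raw dot product by the two row lengths, which is mathematically the same thing. The names cosine and similarity_matrix are assumptions made for this example:

```python
import math

# Rows are song1, song2, song3; columns are users A, B, C.
ratings = [
    [3, 2, 1],
    [4, 2, 3],
    [2, 4, 5],
]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

def similarity_matrix(rows):
    """Compare every row with every other row (and with itself)."""
    return [[cosine(a, b) for b in rows] for a in rows]

for row in similarity_matrix(ratings):
    print([round(value, 3) for value in row])
```

The matrix is symmetric, so for a large catalogue only the entries above the diagonal need to be computed.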

Step 6
Finding similar users works the same way as finding similar songs: just arrange the users as rows of the initial matrix and the songs as its columns.

Then apply Step 2 - Step 5 in order to obtain the similarity matrix for users as well.
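
Swapping rows and columns is a transpose of the rating matrix. A small Python sketch, using the same ratings as the song example (the variable names and the cosine helper are illustrative assumptions):

```python
import math

# One row per song, columns are users A, B, C.
song_rows = [
    [3, 2, 1],  # song1
    [4, 2, 3],  # song2
    [2, 4, 5],  # song3
]
users = ["A", "B", "C"]

# Transpose: each row now holds one user's ratings for song1..song3.
user_rows = [list(column) for column in zip(*song_rows)]

def cosine(a, b):
    """Cosine similarity of two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

# Similarity for every pair of distinct users.
user_similarity = {
    (users[i], users[j]): cosine(user_rows[i], user_rows[j])
    for i in range(len(users))
    for j in range(i + 1, len(users))
}

most_similar = max(user_similarity, key=user_similarity.get)
print(most_similar, round(user_similarity[most_similar], 3))
```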

Tuesday, January 26, 2010

Adding feedback loop to your applications

These days everyone is talking about smart applications.

One very simple feature of smart applications is that they measure user interactions,
learn from them, and act in a more personalized manner towards the user.

User interaction can mean different things for different applications. Here I am going to talk about one popular user interaction metric.

"Time spent by a user on a given feature"

Assume you have some way of measuring the time a user spends on a given feature. You can then calculate the average time spent on the feature and, with some simple statistics, compare any one user's time with the time spent by other genuine users.

Given a list of users and the time each spent on the feature,
you can calculate the mean time for the list.

Let's call this mean_time.

You can also calculate the standard deviation for the list.

Let's call this time_range_delta.

Your actual time_range interval is then (mean_time - (time_range_delta * range_factor), mean_time + (time_range_delta * range_factor))

where range_factor controls how much spurious data you are willing to tolerate; a value in the range [1-3] is generally considered good.

Once you have this interval, you can ignore all times that fall outside it.

This gives a new list, from which you can calculate the mean and standard deviation again.

By repeating this process until no more times are dropped, you get the list of genuine users and their corresponding times.

The mean of this list is the average time spent by a user.

You can then classify any user by comparing their time with this average.
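
The iterative filtering described above could look roughly like this in Python. This is a sketch under a few assumptions that the post does not state: the function name trim_outliers, the sample times, the use of the population standard deviation (statistics.pstdev), stopping once a pass drops nothing, and the max_rounds safety cap.

```python
import statistics

def trim_outliers(times, range_factor=2.0, max_rounds=100):
    """Repeatedly drop times outside mean +/- range_factor * stddev."""
    current = list(times)
    for _ in range(max_rounds):
        if len(current) < 2:
            break
        mean_time = statistics.mean(current)
        # Assumption: population standard deviation; stdev() is the alternative.
        time_range_delta = statistics.pstdev(current)
        low = mean_time - time_range_delta * range_factor
        high = mean_time + time_range_delta * range_factor
        kept = [t for t in current if low <= t <= high]
        if len(kept) == len(current):
            break  # nothing was dropped, so the list is stable
        current = kept
    return current

# Hypothetical times in seconds; 600 looks like someone who left the page open.
times = [30, 32, 35, 31, 29, 33, 34, 600]
genuine = trim_outliers(times)
print(genuine)                   # [30, 32, 35, 31, 29, 33, 34]
print(statistics.mean(genuine))  # 32
```

A smaller range_factor trims more aggressively, so with values close to 1 it is worth checking how many times survive before trusting the resulting mean.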