[R] help with mysql and R: partitioning by quintile

Sun May 15 00:52:15 CEST 2011

An easy way is to just offset the quantiles by a small increment so
that boundary condition is less likely.  If you change the line

tqm <- do.call(rbind, tq) + 0.001

in my example, that should do the trick.

On Sat, May 14, 2011 at 6:09 PM, gj <gawesh at gmail.com> wrote:
> Hi,
> I think I haven't been able to explain correctly what I want. Here another
> try:
> Given that I have the following input:
> userid,track,freq
>
> 1,1,1
> 1,2,10
> 1,3,1
> 1,4,1
> 1,5,15
> 1,6,4
> 1,7,16
> 1,8,6
> 1,9,1
> 1,10,1
> 1,11,2
> 1,12,2
> 1,13,1
> 1,14,6
> 1,15,7
> 1,16,13
> 1,17,3
> 1,18,2
> 1,19,5
> 1,20,2
> 1,21,2
> 1,22,6
> 1,23,4
> 1,24,1
> 1,25,1
> 1,26,16
> 1,27,4
> 1,28,1
> 1,29,4
> 1,30,4
> 1,31,4
> 1,32,1
> 1,33,14
> 1,34,2
> 1,35,7
>
> It is a sample of the history of tracks played: userid,track and frequency.
> What I want is to convert the frequency into a rating scale (1-5) based on
> the frequency at which a user has played a track, using the following
> interquintile ranges for the cfd:
>  0%-20% = rating 1, 20%-40% = rating 2, .... ,80%-100%=rating 5
>
> Jim kindly provided the following code:
> # cheers jim holtman
>>x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
> sep=',')
>># get the quantiles for each user(we want the frequency distribution to be
> based on user)
>>tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
>># create a matrix with the rownames as the tracks to use in the
> findInterval
>>tqm <- do.call(rbind, tq)
>>#now put the ratings
>>require(data.table)
>>x.dt <- data.table(x)
>>x.new <- x.dt[,list(freq = freq,track=track,rating =
> findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) +
> 1L),by=userid]
>>head(x.new)
>
>
>     userid freq track rating
> [1,]      1    1     1      2
> [2,]      1   10     2      5
> [3,]      1    1     3      2
> [4,]      1    1     4      2
> [5,]      1   15     5      5
> [6,]      1    4     6      4
>
>
> which is almost what I wanted except that the ratings are 1 point higher for
> tracks where the frequency is at the cut-off points in the interquintile
> range.
> To illustrate the quintiles are:
>
>> tq$`1`
>  20%  40%  60%  80% 100%
>   1    2    4    7   16
>
>
>
> So, ideally I want (note the different ratings):
>
>
>     userid freq track rating
> [1,]      1    1     1      1
> [2,]      1   10     2      5
> [3,]      1    1     3      1
> [4,]      1    1     4      1
> [5,]      1   15     5      5
> [6,]      1    4     6      3
>
>
> Can anybody help me? I'm new to R (as you have probably guessed). Sorry for
> the long explanation.
>
> Regards
> Gawesh
>
> On Sat, May 14, 2011 at 7:37 PM, Dennis Murphy <djmuser at gmail.com> wrote:
>
>> Hi:
>>
>> Is this what you're after?
>>
>> tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2)))
>> ds$int <- with(ds, cut(freq, c(0, tq)))
>> with(ds, table(int))
>>
>> int
>>  (0,1]  (1,2]  (2,4]  (4,7] (7,16]
>>    10      6      7      6      6
>>
>> HTH,
>> Dennis
>>
>> On Sat, May 14, 2011 at 9:42 AM, gj <gawesh at gmail.com> wrote:
>> > Hi Jim,
>> > Thanks very much for the code. I modified it a bit because I needed to
>> > allocate the track ratings by userid (eg if user 1 plays track x once, he
>> > gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and
>> not
>> > by track (sorry if this wasn't clear in my original post).
>> >
>> > This is almost working! What I can't get right at the moment is the
>> cutoff
>> > interval for the ratings.
>> > Any help please?
>> >
>> > Sample data:
>> >
>> > userid,track,freq
>> > 1,1,1
>> > 1,2,10
>> > 1,3,1
>> > 1,4,1
>> >
>> > 1,5,15
>> > 1,6,4
>> > 1,7,16
>> > 1,8,6
>> > 1,9,1
>> > 1,10,1
>> > 1,11,2
>> > 1,12,2
>> > 1,13,1
>> > 1,14,6
>> > 1,15,7
>> > 1,16,13
>> > 1,17,3
>> > 1,18,2
>> > 1,19,5
>> > 1,20,2
>> > 1,21,2
>> > 1,22,6
>> > 1,23,4
>> > 1,24,1
>> > 1,25,1
>> > 1,26,16
>> > 1,27,4
>> > 1,28,1
>> > 1,29,4
>> > 1,30,4
>> > 1,31,4
>> > 1,32,1
>> > 1,33,14
>> > 1,34,2
>> > 1,35,7
>> >
>> >
>> >>tq
>> >
>> >
>> > $'1'
>> >
>> >
>> > 20%  40%  60%  80% 100%
>> >   1    2    4    7   16
>> >
>> >
>> > >From this distribution I would expect to distribute ratings as follows:
>> > freq: <=1  <=2  <=4  <=7  <=16
>> > rating: 1      2      3      4       5
>> >
>> > But my output is (the ratings are correct except at the cut off points):
>> > "userid" "freq" "track" "rating"
>> > 1 1 1 2
>> > 1 10 2 5
>> > 1 1 3 2
>> > 1 1 4 2
>> > 1 15 5 5
>> > 1 4 6 4
>> > 1 16 7 5
>> > 1 6 8 4
>> > 1 1 9 2
>> > 1 1 10 2
>> > 1 2 11 3
>> > 1 2 12 3
>> > 1 1 13 2
>> > 1 6 14 4
>> > 1 7 15 5
>> > 1 13 16 5
>> > 1 3 17 3
>> > 1 2 18 3
>> > 1 5 19 4
>> > 1 2 20 3
>> > 1 2 21 3
>> > 1 6 22 4
>> > 1 4 23 4
>> > 1 1 24 2
>> > 1 1 25 2
>> > 1 16 26 5
>> > 1 4 27 4
>> > 1 1 28 2
>> > 1 4 29 4
>> > 1 4 30 4
>> > 1 4 31 4
>> > 1 1 32 2
>> > 1 14 33 5
>> > 1 2 34 3
>> > 1 7 35 5
>> >
>> > This is the code:
>> > # cheers jim holtman
>> > x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
>> > sep=',')
>> > # get the quantiles for each track
>> > tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
>> > # create a matrix with the rownames as the tracks to use in the
>> findInterval
>> > tqm <- do.call(rbind, tq)
>> > #now put the ratings
>> > require(data.table)
>> > x.dt <- data.table(x)
>> > x.new <- x.dt[,list(freq = freq,track=track,rating =
>> > findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed =
>> TRUE) +
>> > 1L),by=userid]
>> >
>> > Regards
>> > Gawesh
>> >
>> > On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholtman at gmail.com> wrote:
>> >
>> >> try this:
>> >>
>> >> > # create some data
>> >> > x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
>> >> +               , track = rep(1:20, 20)
>> >> +               , freq = floor(runif(400, 10, 200))
>> >> +               , stringsAsFactors = FALSE
>> >> +               )
>> >> > # get the quantiles for each track
>> >> > tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1))
>> >> > # create a matrix with the rownames as the tracks to use in the
>> >> findInterval
>> >> > tqm <- do.call(rbind, tq)
>> >> > # now put the ratings
>> >> > require(data.table)
>> >> > x.dt <- data.table(x)
>> >> > x.new <- x.dt[,
>> >> +               list(userid = userid
>> >> +                  , freq = freq
>> >> +                  , rating = findInterval(freq
>> >> +                                        # use track as index into
>> >> quantile matrix
>> >> +                                        , tqm[as.character(track[1L]),]
>> >> +                                        , rightmost.closed = TRUE
>> >> +                                        ) + 1L
>> >> +                  )
>> >> +              , by = track]
>> >> >
>> >> > head(x.new)
>> >>     track userid freq rating
>> >> [1,]     1     u1   10      1
>> >> [2,]     1     u2   15      1
>> >> [3,]     1     u3  126      4
>> >> [4,]     1     u4  117      3
>> >> [5,]     1     u5   76      2
>> >> [6,]     1     u6  103      3
>> >> >
>> >>
>> >>
>> >> On Sun, May 8, 2011 at 2:48 PM, gj <gawesh at gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I have a mysql table with fields userid,track,frequency e.g
>> >> > u1,1,10
>> >> > u1,2,100
>> >> > u1,3,110
>> >> > u1,4,200
>> >> > u1,5,120
>> >> > u1,6,130
>> >> > .
>> >> > u2,1,23
>> >> > .
>> >> > .
>> >> > where "frequency" is the number of times a music track is played by a
>> >> > "userid"
>> >> >
>> >> > I need to turn my 'frequency' table into a rating table (it's for a
>> >> > recommender system). So, for each user, I need to categorise the
>> >> frequency
>> >> > of tracks played by quintile so that each particular track can have 5
>> >> > ratings (1-5), with the ratings allocated as follows: inter-quintile
>> >> range
>> >> > 100-80% = rating 5,   inter-quintile range 80-60% = rating 4,
>> >> > ..., inter-quintile range 20-0% = rating 1)
>> >> >
>> >> > Hence, I need to create a table with fields userid,track,rating:
>> >> > u1,1,1
>> >> > u1,2, 3
>> >> > ...
>> >> >
>> >> > Can anybody help me to do this with R?
>> >> >
>> >> > Regards
>> >> > Gawesh
>> >> >
>> >> >        [[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained, reproducible code.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jim Holtman
>> >> Data Munger Guru
>> >>
>> >> What is the problem that you are trying to solve?
>> >>
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?