[R] Help with Hmisc, cut2, split and quantile
Peter Ehlers
ehlers at ucalgary.ca
Tue Mar 9 07:41:53 CET 2010
On 2010-03-08 18:00, Guy Green wrote:
>
> Hi Peter & others,
>
> Thanks (Peter) - that gets me really close to what I was hoping for.
>
> The one problem I have is that the "cut" approach breaks the data into
> intervals based on the absolute value of the "Target" data, rather than
> their frequency. In other words, if the data ranged from 0 to 50, the data
> would be separated into 0-5, 5-10 and so on, regardless of the frequency
> within those categories. However I want to get the data into deciles.
>
> The code that does this (incorporating Peter's) is:
>
> read_data=read.table("C:/Sample table.txt", head = T)
> read_data$DEC<- with(read_data, cut(Target, breaks=10, labels=1:10))
> L<- split(read_data, read_data$DEC)
>
> This means that I can get separate data frames, such as L$'10', which comes
> out tidy, but only containing 2 data items (the sample has 63 rows, so each
> decile should have 6+ data items):
>    Actual    Target DEC
> 9   0.572 0.3778386  10
> 31  0.299 0.3546606  10
>
> If I try to adjust this to get deciles using cut2(), I can break the data
> into deciles as follows:
>
> read_data=read.table("C:/Sample table.txt", head = T)
> read_data$DEC<- with(read_data, cut2(read_data$Target, g=10), labels=1:10)
> L<- split(read_data, read_data$DEC)
>
> However this time, while the data is broken into even data frames, the
> labels for the separate data frames are unuseable, e.g.:
> $`[ 0.26477, 0.37784]`
>    Actual    Target                 DEC
> 6   0.243 0.2650960 [ 0.26477, 0.37784]
> 9   0.572 0.3778386 [ 0.26477, 0.37784]
> 10 -0.049 0.3212681 [ 0.26477, 0.37784]
> 15  0.780 0.2778518 [ 0.26477, 0.37784]
> 31  0.299 0.3546606 [ 0.26477, 0.37784]
> 33  0.105 0.2647676 [ 0.26477, 0.37784]
>
> Could anyone suggest a way of rearranging this to make the labels useable
> again? Sample data is reattached:
> http://n4.nabble.com/file/n1585427/Sample_table.txt

I think that the easiest way would be to relabel the levels of DEC:

   read_data$DEC <- factor(read_data$DEC, labels = 1:10)

or, since I would prefer letters as factor levels:

   read_data$DEC <- factor(read_data$DEC, labels = LETTERS[1:10])
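
To make the relabelling step concrete, here is a minimal sketch on toy data. The vector x and its break points are invented for illustration and are not the posted sample. It also shows levels<-() as an alternative: factor(f, labels = ...) matches the labels only to levels that actually occur, so it would complain if a bin were empty, whereas assigning to levels() keeps empty bins.

```r
# Toy data; the break points are arbitrary illustration values.
x <- c(0.10, 0.50, 0.90, 0.30, 0.70)
f <- cut(x, breaks = c(0, 0.33, 0.66, 1))
levels(f)   # interval-style labels generated by cut()

# Option 1: rebuild the factor with new labels (as in the post).
# Labels are matched to the levels that actually occur, in level order.
f1 <- factor(f, labels = c("A", "B", "C"))

# Option 2: rename the levels in place; empty levels are kept.
f2 <- f
levels(f2) <- c("A", "B", "C")

# Both give the same grouping here because no bin is empty.
all(as.character(f1) == as.character(f2))
```

With true deciles no bin should be empty, so either form should work for the DEC column.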
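
Another way would be to use cut2() with onlycuts=TRUE to get the
breaks and then use these with cut() as in my original post:

   brks <- cut2(read_data$Target, g=10, onlycuts=TRUE)
   # include.lowest/right match cut2's left-closed intervals;
   # without them the minimum Target would become NA
   read_data$DEC <- with(read_data,
        cut(Target, breaks=brks, labels=1:10,
            include.lowest=TRUE, right=FALSE))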
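
When the breaks run from the minimum to the maximum of the data, cut() needs include.lowest = TRUE or the smallest value ends up as NA. The sketch below illustrates this in base R only. It swaps in quantile() for cut2(..., onlycuts=TRUE), so the cut points may differ slightly from Hmisc's, and the Target vector is synthetic rather than the posted sample.

```r
set.seed(1)                       # synthetic stand-in for the Target column
Target <- runif(63)

# Eleven break points: minimum, nine interior deciles, maximum.
brks <- quantile(Target, probs = seq(0, 1, by = 0.1))

# Default cut(): intervals are (a, b], so the minimum falls outside.
dec_default <- cut(Target, breaks = brks, labels = 1:10)
sum(is.na(dec_default))

# include.lowest = TRUE closes the first interval so the minimum is kept.
dec <- cut(Target, breaks = brks, labels = 1:10, include.lowest = TRUE)
table(dec)
```

Note that this sketch uses cut()'s default right-closed intervals, while cut2() labels its groups as left-closed; adding right = FALSE would follow cut2's convention more closely.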

But I still don't see why you want a list of separate data
frames. For most analyses, it's more convenient to just use the
factor variable to subset the data as needed.

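
As a rough illustration of working from the factor directly, the sketch below computes a per-decile summary and extracts one decile on demand. The data frame is synthetic: only the column names Actual, Target and DEC come from the thread, and the deciles are built with base quantile() rather than cut2().

```r
set.seed(1)   # synthetic stand-in for the posted sample (63 rows)
dat <- data.frame(Actual = rnorm(63), Target = runif(63))
brks <- quantile(dat$Target, probs = seq(0, 1, by = 0.1))
dat$DEC <- cut(dat$Target, breaks = brks, labels = 1:10,
               include.lowest = TRUE)

# Per-decile summary without building ten data frames.
means <- aggregate(Actual ~ DEC, data = dat, FUN = mean)

# Pull out a single decile only when it is needed.
top <- subset(dat, DEC == "10")

# The same rows that split() would store under the name "10".
L <- split(dat, dat$DEC)
all.equal(top, L[["10"]], check.attributes = FALSE)
```
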
-Peter Ehlers
>
> Thanks,
> Guy
>
>
>
> Peter Ehlers wrote:
>>
>> On 2010-03-08 8:47, Guy Green wrote:
>>>
>>> Hello,
>>> I have a set of data with two columns: "Target" and "Actual". A sample
>>> table (http://n4.nabble.com/file/n1584647/Sample_table.txt) is attached,
>>> but the data looks like this:
>>>
>>> Actual Target
>>> -0.125 0.016124906
>>> 0.135 0.120799865
>>> ... ...
>>> ... ...
>>>
>>> I want to be able to break the data into tables based on quantiles in the
>>> "Target" column. I can see (using cut2, and also quantile) how to get the
>>> barrier points between the different quantiles, and I can see how I would
>>> achieve this if I was just looking to split up a vector. However I am
>>> trying to break up the whole table based on those quantiles, not just the
>>> vector.
>>>
>>> However I would like to be able to break the table into ten separate
>>> tables, each with both "Actual" and "Target" data, based on the "Target"
>>> data deciles:
>>>
>>> top_decile = ...(top decile of "read_data", based on Target data)
>>> next_decile = ...and so on...
>>> bottom_decile = ...
>>
>> I would just add a factor variable indicating to which decile
>> a particular observation belongs:
>>
>> dat$DEC<- with(dat, cut(Target, breaks=10, labels=1:10))
>>
>> If you really want to have separate data frames you can then
>> split on the decile:
>>
>> L<- split(dat, dat$DEC)
>>
>> -Peter Ehlers
>> --
>> Peter Ehlers
>> University of Calgary
>>
>>
>
--
Peter Ehlers
University of Calgary