[R] Efficient way to create new column based on comparison with another dataframe
Gaius Augustus
gaiusjaugustus at gmail.com
Sat Jan 30 19:48:27 CET 2016
I'll look into the Intervals idea. The data.table code posted might not
work (because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed), however, it did make me think about
possibly assigning based on values...
*SOLUTION*
mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position =
c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End
= c(5000, 10000), key = "Chr")
for(i in 1:nrow(Chr.Arms)){
cur.row <- Chr.Arms[i, ]
mapfile$Arm[ mapfile$Chr == cur.row$Chr & mapfile$Position >=
cur.row$Start & mapfile$Position <= cur.row$End] <- cur.row$Arm
}
This took out the need for the intermediate table/vector. This worked for
me, and was VERY fast. Took <5 minutes on a dataframe with 35 million rows.
Thanks for the help,
Gaius
On Sat, Jan 30, 2016 at 10:50 AM, Gaius Augustus <gaiusjaugustus at gmail.com>
wrote:
> I'll look into the Intervals idea. The data.table code posted might not
> work (because I don't believe it would put the rows in the correct order if
> the chromosomes are interspersed), however, it did make me think about
> possibly assigning based on values...
>
> Something like:
> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position =
> c(3000, 6000, 1000), key = "Chr")
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End
> = c(5000, 10000), key = "Chr")
>
> for(i in 1:nrow(Chr.Arms)){
> cur.row <- Chr.Arms[i, ]
> mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start & Position <=
> cur.row$End] <- Chr.Arms$Arm
> }
>
> This might take out the need for the intermediate table/vector. Not sure
> yet if it'll work, but we'll see. I'm interested to know if anyone else
> has any ideas, too.
>
> Thanks,
> Gaius
>
> On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com>
> wrote:
>
>> Hi Gaius,
>>
>> Could you use data.table and loop over the small Chr.arms?
>>
>> library(data.table)
>> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position =
>> c(3000, 6000, 1000), key = "Chr")
>> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
>> End = c(5000, 10000), key = "Chr")
>>
>> Arms <- data.table()
>> for(i in 1:nrow(Chr.Arms)){
>> cur.row <- Chr.Arms[i, ]
>> Arm <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>> Arm <- Arm[ , Arm:=cur.row$Arm][]
>> Arms <- rbind(Arms, Arm)
>> }
>>
>> # Or use plyr to loop over each possible arm
>> library(plyr)
>> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
>> mapfile <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>> mapfile <- mapfile[ , Arm:=cur.row$Arm][]
>> return(mapfile)
>> }, mapfile = mapfile)
>>
>> I have just started to use the data.table and I have the feeling the code
>> above can be greatly improved - maybe the loop can be dropped entirely?
>>
>> Hope this helps
>> Ulrik
>>
>> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
>> wrote:
>>
>>> I have two dataframes. One has chromosome arm information, and the other
>>> has SNP position information. I am trying to assign each SNP an arm
>>> identity. I'd like to create this new column based on comparing it to
>>> the
>>> reference file.
>>>
>>> *1) Mapfile (has millions of rows)*
>>>
>>> Name Chr Position
>>> S1 1 3000
>>> S2 1 6000
>>> S3 1 1000
>>>
>>> *2) Chr.Arms file (has 39 rows)*
>>>
>>> Chr Arm Start End
>>> 1 p 0 5000
>>> 1 q 5001 10000
>>>
>>>
>>> *R Script that works, but slow:*
>>> Arms <- c()
>>> for (line in 1:nrow(Mapfile)){
>>> Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>>> Mapfile$Position[line] > Chr.Arms$Start & Mapfile$Position[line] <
>>> Chr.Arms$End]}
>>> }
>>> Mapfile$Arm <- Arms
>>>
>>>
>>> *Output Table:*
>>>
>>> Name Chr Position Arm
>>> S1 1 3000 p
>>> S2 1 6000 q
>>> S3 1 1000 p
>>>
>>>
>>> In words: I want each line to look up the location ( 1) find the right
>>> Chr,
>>> 2) find the line where the START < POSITION < END), then get the ARM
>>> information and place it in a new column.
>>>
>>> This R script works, but surely there is a more time/processing efficient
>>> way to do it.
>>>
>>> Thanks in advance for any help,
>>> Gaius
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list