[R] Efficient way to create new column based on comparison with another dataframe
Gaius Augustus
gaiusjaugustus at gmail.com
Sat Jan 30 18:50:05 CET 2016
I'll look into the Intervals idea. The data.table code posted might not
work, because I don't believe it would put the rows back in the correct
order if the chromosomes are interspersed. However, it did make me think
about possibly assigning based on values...
Something like:
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
for (i in seq_len(nrow(Chr.Arms))) {
  cur.row <- Chr.Arms[i, ]
  # Assign by reference: only the matching rows change, row order is kept
  mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
            Position <= cur.row$End, Arm := cur.row$Arm]
}
This might take out the need for the intermediate table/vector. Not sure
yet if it'll work, but we'll see. I'm interested to know if anyone else
has any ideas, too.
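
Another way to avoid the intermediate table, as a rough sketch in base R:
findInterval() can look up all positions on one chromosome at once, so the
loop runs over chromosomes instead of over millions of rows. The helper name
assign_arms() is made up for illustration. The sketch assumes arms do not
overlap within a chromosome and treats Start and End as inclusive.

```r
# Toy data from the thread; column names follow the original post.
Mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# assign_arms() is a hypothetical helper, not part of any package.
# Assumption: arms within one chromosome do not overlap.
assign_arms <- function(map, arms) {
  out <- rep(NA_character_, nrow(map))
  for (chr in unique(arms$Chr)) {
    a <- arms[arms$Chr == chr, ]
    a <- a[order(a$Start), ]          # findInterval() needs sorted breaks
    rows <- which(map$Chr == chr)
    pos <- map$Position[rows]
    # Index of the last arm whose Start is <= Position (0 if none)
    idx <- findInterval(pos, a$Start)
    ok <- !is.na(idx) & idx > 0
    # Keep a match only if Position is also at or before that arm's End
    ok[ok] <- pos[ok] <= a$End[idx[ok]]
    out[rows[ok]] <- a$Arm[idx[ok]]
  }
  out
}

Mapfile$Arm <- assign_arms(Mapfile, Chr.Arms)
print(Mapfile)
```

Rows that fall outside every arm are left as NA, and the original row order
of Mapfile is untouched because the result is written back by row index.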
Thanks,
Gaius
On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com>
wrote:
> Hi Gaius,
>
> Could you use data.table and loop over the small Chr.arms?
>
> library(data.table)
> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position =
> c(3000, 6000, 1000), key = "Chr")
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End
> = c(5000, 10000), key = "Chr")
>
> Arms <- data.table()
> for(i in 1:nrow(Chr.Arms)){
> cur.row <- Chr.Arms[i, ]
> Arm <- mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start &
> Position <= cur.row$End]
> Arm <- Arm[ , Arm:=cur.row$Arm][]
> Arms <- rbind(Arms, Arm)
> }
>
> # Or use plyr to loop over each possible arm
> library(plyr)
> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
> mapfile <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
> mapfile <- mapfile[ , Arm:=cur.row$Arm][]
> return(mapfile)
> }, mapfile = mapfile)
>
> I have just started to use the data.table and I have the feeling the code
> above can be greatly improved - maybe the loop can be dropped entirely?
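
One possible way to drop the loop, sketched under assumptions: a non-equi
join that updates mapfile by reference. This needs data.table 1.9.8 or later
(non-equi joins are not available in earlier releases) and assumes that arms
on the same chromosome do not overlap. The key is left out here so that
mapfile keeps its original row order.

```r
library(data.table)

mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# Update join: for every row of Chr.Arms, find the mapfile rows on the same
# chromosome with Start <= Position <= End and write that arm into them.
# i.Arm refers to the Arm column of Chr.Arms (the table passed as i).
mapfile[Chr.Arms,
        on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]

print(mapfile)
```

Because := modifies mapfile in place, no intermediate table is built and the
rows stay where they were. SNPs that match no arm end up with NA in Arm.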
>
> Hope this helps
> Ulrik
>
> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
> wrote:
>
>> I have two dataframes. One has chromosome arm information, and the other
>> has SNP position information. I am trying to assign each SNP an arm
>> identity. I'd like to create this new column based on comparing it to the
>> reference file.
>>
>> *1) Mapfile (has millions of rows)*
>>
>> Name  Chr  Position
>> S1    1    3000
>> S2    1    6000
>> S3    1    1000
>>
>> *2) Chr.Arms file (has 39 rows)*
>>
>> Chr  Arm  Start  End
>> 1    p    0      5000
>> 1    q    5001   10000
>>
>>
>> *R Script that works, but slow:*
>> Arms <- c()
>> for (line in 1:nrow(Mapfile)){
>> Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>> Mapfile$Position[line] > Chr.Arms$Start & Mapfile$Position[line] <
>> Chr.Arms$End]
>> }
>> Mapfile$Arm <- Arms
>>
>>
>> *Output Table:*
>>
>> Name  Chr  Position  Arm
>> S1    1    3000      p
>> S2    1    6000      q
>> S3    1    1000      p
>>
>>
>> In words: I want each line to look up its location: 1) find the right
>> Chr, 2) find the row where START < POSITION < END, then get the ARM
>> information and place it in a new column.
>>
>> This R script works, but surely there is a more time/processing efficient
>> way to do it.
>>
>> Thanks in advance for any help,
>> Gaius