[R] Compare data in two rows and replace objects in data frame

jim holtman jholtman at gmail.com
Mon Aug 4 21:56:29 CEST 2014


Here is another way of doing it, using 'tidyr' and 'dplyr':


> x <- read.table(text = "CloneID    genotype2001    genotype2002    genotype2003
+ 2471250    1    1    1
+ 2471250    0    0    0
+ 2433062    0    0    0
+ 2433062    1    1    1
+ 100021605    1    1    0
+ 100021605    1    0    1
+ 100005599    1    1    0
+ 100005599    1    1    1
+ 100002798    1    1    0
+ 100002798    1    1    1", header = TRUE, as.is = TRUE)
> # translation key
> keyTrans <- c(`11` = 'HT'
+       , `10` = "A"
+       , `01` = "B"
+       , `1-` = "Aht"
+       , `-1` = "Bht"
+       )
> require(dplyr)
> require(tidyr)
> x %>%
+     gather(key, val, -CloneID) %>%  # 'melt' the data
+     group_by(CloneID, key) %>%  # group by CloneID and genotype column
+     summarise(newKey = paste0(val, collapse = '')) %>%  # concatenate the two rows
+     mutate(newVal = keyTrans[newKey]) %>%  # add the new value
+     select(-newKey) %>%  # remove newKey for output
+     spread(key, newVal)
Source: local data frame [5 x 4]

    CloneID genotype2001 genotype2002 genotype2003
1   2433062            B            B            B
2   2471250            A            A            A
3 100002798           HT           HT            B
4 100005599           HT           HT            B
5 100021605           HT            A            B
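
If the packages are not available, the same lookup can be sketched in base R. This is only a sketch under a few assumptions of mine: the rows are already ordered so that each CloneID sits on two consecutive rows, every column is read as character so that '-' survives, and the last clone (CloneID 1) is a made-up row added just to exercise the '1-' and '-1' keys. The names 'odd', 'even', 'geno', 'codes' and 'result' are mine, not from the thread.

```r
# Sample rows from the thread, plus one hypothetical clone (CloneID 1)
# that contains '-' values; colClasses keeps every column as character.
x <- read.table(text = "CloneID genotype2001 genotype2002 genotype2003
2471250 1 1 1
2471250 0 0 0
2433062 0 0 0
2433062 1 1 1
100021605 1 1 0
100021605 1 0 1
1 1 - 1
1 - 1 1", header = TRUE, colClasses = "character")

# translation key: first character is row 1 of the pair, second is row 2
keyTrans <- c(`11` = "HT", `10` = "A", `01` = "B", `1-` = "Aht", `-1` = "Bht")

# split into first and second row of each pair (assumes adjacent pairs)
odd  <- x[seq(1, nrow(x), by = 2), ]
even <- x[seq(2, nrow(x), by = 2), ]
stopifnot(identical(odd$CloneID, even$CloneID))

# paste the paired values together and look them up in one vectorised step
geno  <- setdiff(names(x), "CloneID")
codes <- paste0(as.matrix(odd[geno]), as.matrix(even[geno]))
result <- data.frame(
    CloneID = odd$CloneID,
    matrix(unname(keyTrans[codes]), nrow = nrow(odd),
           dimnames = list(NULL, geno)),
    stringsAsFactors = FALSE
)
print(result)
```

Any pair that is missing from keyTrans (for example '00') comes back as NA, so a check such as anyNA(result) is worth running on the real data.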

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Mon, Aug 4, 2014 at 2:21 PM, John McKown
<john.archie.mckown at gmail.com> wrote:
> On Mon, Aug 4, 2014 at 4:53 AM, raz <barvazduck at gmail.com> wrote:
>> Dear all,
>>
>> I have a data frame 144 x 20000 values.
>> I need to take every value in the first row and compare to the second row,
>> and the same for rows 3-4 and 5-6 and so on.
>> the output should be one line for each two-row comparison.
>> the comparison is:
>> if row1==1 and row2==1 <-'HT'
>> if row1==1 and row2==0 <-'A'
>> if row1==0 and row2==1 <-'B'
>> if row1==1 and row2=='-' <-'Aht'
>> if row1=='-' and row2==1 <-'Bht'
>>
>> for example:
>> if the data is:
>> CloneID    genotype 2001    genotype 2002    genotype 2003
>> 2471250    1    1    1
>> 2471250    0    0    0
>> 2433062    0    0    0
>> 2433062    1    1    1
>> 100021605    1    1    0
>> 100021605    1    0    1
>> 100005599    1    1    0
>> 100005599    1    1    1
>> 100002798    1    1    0
>> 100002798    1    1    1
>>
>> then the output should be:
>> CloneID    genotype 2001    genotype 2002    genotype 2003
>> 2471250    A    A    A
>> 2433062    B    B    B
>> 100021605    HT    A    B
>> 100005599    HT    HT    B
>> 100002798    HT    HT    B
>>
>> I tried this for the whole data, but it's so slow:
>>
>> AX <- data.frame(lapply(AX, as.character), stringsAsFactors=FALSE)
>>
>>
>> for (i in seq(1,nrow(AX),by=2)){
>> for (j in 6:144){
>> if (AX[i,j]==1 & AX[i+1,j]==0){
>> AX[i,j]<-'A'
>> }
>> if (AX[i,j]==0 & AX[i+1,j]==1){
>> AX[i,j]<-'B'
>> }
>> if (AX[i,j]==1 & AX[i+1,j]==1){
>> AX[i,j]<-'HT'
>> }
>> if (AX[i,j]==1 & AX[i+1,j]=="-"){
>> AX[i,j]<-'Aht'
>> }
>> if (AX[i,j]=="-" & AX[i+1,j]==1){
>> AX[i,j]<-'Bht'
>> }
>> }
>> }
>>
>> AX1<-AX[!duplicated(AX[,3]),]
>> AX2<-AX[duplicated(AX[,3]),]
>>
>> Thanks for any help,
>>
>> Raz
>
> I don't know if you've received a solution as yet. Below is my generic
> solution. I don't know how fast it will be, but it does _NOT_ do any
> looping. It does do a few if functions. The result is in the variable
> new_data. The variables data_odd and data_even are temporaries which
> can be removed. Or you can wrap the code up in a function which
> returns new_data and they will simply "go away" when the function
> ends.
>
> #
> # Read in the data
> data <- read.csv(file="data.csv",header=TRUE,stringsAsFactors=FALSE);
> #
> # The criteria
> #if row1==1 and row2==1 <-'HT'
> #if row1==1 and row2==0 <-'A'
> #if row1==0 and row2==1 <-'B'
> #if row1==1 and row2=='-' <-'Aht'
> #if row1=='-' and row2==1 <-'Bht'
> #
> # The following assumes that data is properly ordered!
> data$rowNumber <- seq_len(nrow(data));
> data_odd <-data[data$rowNumber %% 2 == 1,];
> data_even <-data[data$rowNumber %% 2 == 0,];
> #
> # You really need to make sure that
> # the CloneID values are correct in data_odd
> # and data_even. Something like:
> stopifnot(data_odd$CloneID == data_even$CloneID);
> CloneIDs <- data_even[,1]; # Get the list of CloneIDs
> #data_even[,1] <- NULL; # Remove CloneIDs from even data
> #data_odd[,1] <- NULL;  # And also from odd data
> #
> # Initialize new_data - make everything NA so
> # it will stick out later!
> new_data <- data_even;
> new_data[,colnames(data_even)] <- NA;
> #
> new_data[data_odd == 1 & data_even == 1] <- 'HT';
> new_data[data_odd == 1 & data_even == 0] <- 'A';
> new_data[data_odd == 0 & data_even == 1] <- 'B';
> new_data[data_odd == 1 & data_even == '-'] <- 'Aht';
> new_data[data_odd == '-' & data_even == 1] <- 'Bht';
> new_data$CloneID <- CloneIDs;
> new_data$rowNumber<-NULL;
> #
> #stopifnot( !is.na(new_data)); # Make sure no NAs left
>
>
>
>
> --
> There is nothing more pleasant than traveling and meeting new people!
> Genghis Khan
>
> Maranatha! <><
> John McKown
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
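
For completeness, the logical-mask idea from John's reply can be sketched on the sample data as well. Again this is only a sketch, assuming that paired rows are adjacent and that all columns are read as character; 'dat', 'odd_rows', 'odd', 'even' and 'out' are names I picked for illustration.

```r
dat <- read.table(text = "CloneID genotype2001 genotype2002 genotype2003
2471250 1 1 1
2471250 0 0 0
2433062 0 0 0
2433062 1 1 1
100021605 1 1 0
100021605 1 0 1", header = TRUE, colClasses = "character")

# first row of each pair; the partner row is the one directly below it
odd_rows <- seq(1, nrow(dat), by = 2)
stopifnot(identical(dat$CloneID[odd_rows], dat$CloneID[odd_rows + 1]))
odd  <- as.matrix(dat[odd_rows, -1])
even <- as.matrix(dat[odd_rows + 1, -1])

# start with NA everywhere so unmatched combinations stick out
out <- matrix(NA_character_, nrow = nrow(odd), ncol = ncol(odd),
              dimnames = list(dat$CloneID[odd_rows], colnames(odd)))
out[odd == "1" & even == "1"] <- "HT"
out[odd == "1" & even == "0"] <- "A"
out[odd == "0" & even == "1"] <- "B"
out[odd == "1" & even == "-"] <- "Aht"
out[odd == "-" & even == "1"] <- "Bht"

stopifnot(!anyNA(out))  # every pair matched one of the five rules
print(out)
```

Working on character matrices rather than on the data frames keeps the CloneID column out of the comparisons.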


