[R] Help with isolating and comparing data from two files.

jim holtman jholtman at gmail.com
Mon May 23 14:23:29 CEST 2011


Is this what you are after?

> pos
   V1   V2 V3 V4 V5 V6
1 c22 1445  - CG  1  4
2 c22 1542  + CG  2  3
3 c22 1678  + CG 13 15
> reg
   V1   V2   V3   V4 V5 V6     V7
1 c22 1440 1500 cpg: 44 56 ......
2 c22 1520 1700 cpg: 56 87 ......
3 c22 1800 1900 cpg: 58 90 ......
> # iterate through 'reg', printing out the matching 'pos' entries
> result <- lapply(seq_len(nrow(reg)), function(i){
+     # logical index of 'pos' rows whose position lies within region i
+     indx <- (pos$V2 >= reg$V2[i]) & (pos$V2 <= reg$V3[i])
+     if (!any(indx)) return(NULL)  # no match
+     # create new dataframe
+     cbind(reg[rep(i, sum(indx)), 1:3], pos[indx, ])
+ })
> do.call(rbind, result)
     V1   V2   V3  V1   V2 V3 V4 V5 V6
1   c22 1440 1500 c22 1445  - CG  1  4
2   c22 1520 1700 c22 1542  + CG  2  3
2.1 c22 1520 1700 c22 1678  + CG 13 15
>
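
For pasting into a script rather than retyping the console session, here is a self-contained sketch of the same approach. The column names (chr, start, end, position, strand, n1, n2 and so on) are labels I made up for illustration, and only the first six columns of reg.txt are included because the rest were not shown. The extra check that the chromosome matches is an assumption about what is wanted once the data covers more than c22.

```r
# Sample data inlined with read.table(text = ...); with the real files use
# read.table("pos.txt") and read.table("reg.txt") instead.
pos <- read.table(text = "
c22 1445 - CG 1 4
c22 1542 + CG 2 3
c22 1678 + CG 13 15
", col.names = c("chr", "position", "strand", "context", "n1", "n2"),
   stringsAsFactors = FALSE)

reg <- read.table(text = "
c22 1440 1500 cpg: 44 56
c22 1520 1700 cpg: 56 87
c22 1800 1900 cpg: 58 90
", col.names = c("chr", "start", "end", "V4", "V5", "V6"),
   stringsAsFactors = FALSE)

# Hypothetical helper: for each region, collect the 'pos' rows inside it.
find_regions <- function(pos, reg) {
  hits <- lapply(seq_len(nrow(reg)), function(i) {
    # assumption: a position only counts if it is on the same chromosome
    inside <- pos$chr == reg$chr[i] &
      pos$position >= reg$start[i] &
      pos$position <= reg$end[i]
    if (!any(inside)) return(NULL)  # no match for this region
    # repeat the region row once per matching position
    cbind(reg[rep(i, sum(inside)), c("chr", "start", "end")],
          pos[inside, c("position", "strand", "n1", "n2")])
  })
  out <- do.call(rbind, hits)
  if (!is.null(out)) rownames(out) <- NULL
  out
}

result <- find_regions(pos, reg)
print(result)
```

To get a text file like the one described in the question, something along the lines of write.table(result, "out.txt", quote = FALSE, row.names = FALSE, col.names = FALSE) should do; the file name is only a placeholder.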


On Mon, May 23, 2011 at 12:00 AM, ajn21 <ajn21 at case.edu> wrote:
> Hello,
>
> I was hoping that someone would be able to help me or at least point me in
> the right direction regarding a problem I am having. I am a new R user, and
> I've been trying to read tutorials but they haven't been much help to me so
> far.
>
> The problem is relatively simple as I've already created working solutions
> in Java and Perl, but I need a solution in R as well.
>
> I have two text files, say pos.txt and reg.txt. In pos.txt, the data is
> listed as follows, for example:
>
> c22 1445  - CG 1 4
> c22 1542 + CG 2 3
> c22 1678 + CG 13 15
> ...
>
> etc. for thousands of lines. The most important column is column 2, which
> lists "position" (e.g. 1445, 1542, 1678). In reg.txt, data is listed as:
>
> c22 1440 1500 cpg: 44 56 ......
> c22 1520 1700 cpg: 56 87 ......
> c22 1800 1900 cpg: 58 90 ......
> ...
>
> where the values in column 2 are the "start" positions and the values in
> column 3 are the "end" positions. There are 10 columns in total, but I have
> listed only the first few. Also, the two text files have different lengths.
>
>
> Essentially, I need to take each position listed in column 2 of pos.txt and
> find the region in reg.txt (based on its start and end positions) that
> contains it. Then I need to print:
>
> c22 "start" "end" "position" + 1 5
>
> where the last 3 columns also come from pos.txt (i.e. the lines do not all
> end in + 1 5, but in the corresponding values from pos.txt). Also, the
> position needs to lie between the start and end positions.
>
> So far I've been able to use read.table to create a data frame for each text
> file, and I've also named each column (e.g. reg.data$end) and I can output
> each column individually. However, the problem I keep facing is how to
> compare the numbers for "position" in pos.txt to the numbers for "start" and
> "end" in reg.txt. I tried to use:
>
> if ((pos >= start) | (pos <= end))..
>
> but an error comes up that says the files aren't the same length.
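
On that error: if() wants a single TRUE/FALSE, whereas pos >= start compares two whole columns element by element and recycles the shorter one, which is where the complaint about lengths comes from. The test also needs & rather than |, since a position has to be at or past the start and at or before the end. One way to spell out the all-against-all comparison is outer(). This is only a sketch on the three sample rows, and the vector names are my own:

```r
# Hypothetical vectors holding the sample columns from the two files
pos_v   <- c(1445, 1542, 1678)
start_v <- c(1440, 1520, 1800)
end_v   <- c(1500, 1700, 1900)

# inside[i, j] is TRUE when position i lies within region j
inside <- outer(pos_v, start_v, ">=") & outer(pos_v, end_v, "<=")

# row = index into pos_v, col = index into start_v / end_v
hits <- which(inside, arr.ind = TRUE)

matched <- data.frame(start    = start_v[hits[, "col"]],
                      end      = end_v[hits[, "col"]],
                      position = pos_v[hits[, "row"]])
print(matched)
```

Note that outer() builds a full positions-by-regions matrix, so for very large files the per-region loop may be the more economical choice.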
>
> In Java and Perl I used nested loops to cycle through each element in one
> file, and compare it to every element in the other file, and then printed to
> a new text file. As such, I was trying to learn a bit more about arrays in
> R, but if you know of a better way in R to do this then please let me know.
>
> Any help is greatly appreciated.
>
> Thank you,
> AJ



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


