[R] is there a way to extract lines that are common to 3 files, based on one column?

Tue Jun 2 03:48:30 CEST 2020

Hi Ana,
If I add another 6 rows to neu1, 2 to nep1 and one to ret1 and modify
the "Marker" field so that there is one more match, I get the result I
expect, so I think that the program logic is correct. I can't say why
ret2 has fewer lines. If there aren't too many mismatches, maybe
checking the rows that were dropped from each data frame will help:

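# rows of each data frame whose Marker is NOT among the common markers
neu3<-neu1[!(neu1$Marker %in% Marker3),]
nep3<-nep1[!(nep1$Marker %in% Marker3),]
ret3<-ret1[!(ret1$Marker %in% Marker3),]
# print the unmatched rows for inspection
neu3
nep3
ret3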
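
A further check, offered as a guess rather than a diagnosis of your data:
if every Marker were unique within each data frame, the three subsets
would necessarily have the same number of rows, so unequal counts suggest
that some Marker values occur more than once in at least one data frame.
The sketch below uses small made-up data frames (a, b, c3 and all their
values are hypothetical, not taken from your files) to show how to count
duplicated markers and, only if dropping duplicates is acceptable for
your analysis, keep one row per marker.

```r
# Hypothetical toy data: "m2" appears twice in a and b, but once in c3
a  <- data.frame(Marker = c("m1", "m2", "m2", "m3"),
                 pValue = c(0.1, 0.2, 0.3, 0.4), stringsAsFactors = FALSE)
b  <- data.frame(Marker = c("m1", "m2", "m2", "m4"),
                 pValue = c(0.5, 0.6, 0.7, 0.8), stringsAsFactors = FALSE)
c3 <- data.frame(Marker = c("m1", "m2", "m5"),
                 pValue = c(0.9, 0.8, 0.7), stringsAsFactors = FALSE)

# markers present in all three data frames (intersect() drops duplicates)
common <- Reduce(intersect, list(a$Marker, b$Marker, c3$Marker))

# %in% keeps every row with a common marker, including repeated markers,
# so the subsets can end up with different numbers of rows
a2 <- a[a$Marker %in% common, ]
b2 <- b[b$Marker %in% common, ]
c2 <- c3[c3$Marker %in% common, ]
sapply(list(a = a2, b = b2, c = c2), nrow)

# number of repeated markers in each subset
sapply(list(a = a2, b = b2, c = c2), function(d) sum(duplicated(d$Marker)))

# keep only the first row for each marker (an assumption about what you want)
a2u <- a2[!duplicated(a2$Marker), ]
b2u <- b2[!duplicated(b2$Marker), ]
c2u <- c2[!duplicated(c2$Marker), ]
```

On your data the same check would be
sum(duplicated(neu2$Marker)) and likewise for nep2 and ret2.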

Jim

On Tue, Jun 2, 2020 at 10:40 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>
> Hi Jim,
>
> thank you so much for getting back to me. I tried your code and this is what I get:
> > dim(neu2)
> [1] 3740988       9
> > dim(nep2)
> [1] 3740988       9
> > dim(ret2)
> [1] 3740001       9
>
> I think I would need to have the same number of lines in all 3 data frames.
>
> Can you please advise?
>
> Cheers
> Ana
>
> On Mon, Jun 1, 2020 at 7:31 PM Jim Lemon <drjimlemon using gmail.com> wrote:
>>
>> Hi Ana,
>> Not too hard, but your example has all the "marker" fields in common.
>> So using a sample that will show the expected result:
>>
>> neu1<-read.table(text="Chr BP Marker  MAF A1 A2 Direction  pValue N
>>  1 100000012 1:100000012:G:T 0.229925  T  G  + 0.650403 1594
>>  1 100000827 1:100000827:C:T 0.287014  T  C  + 0.955449 1594
>>  1 100002713 1:100002713:C:T 0.097867  T  C  - 0.290455 1594
>>  1 100002882 1:100002882:T:G 0.287014  G  T  + 0.955449 1594
>>  1 100002991 1:100002991:G:A 0.097867  A  G  - 0.290455 1594
>>  1 100004726 1:100004726:G:A 0.132058  A  G  + 0.115005 1594",
>>  header=TRUE,stringsAsFactors=FALSE)
>>
>> nep1<-read.table(text="Chr BP Marker MAF A1 A2 Direction    pValue N
>>  1 100000012 1:100000012:G:T 0.2300430 T  G - 0.1420030 1641
>>  1 100000827 1:100000827:C:T 0.2867150 T  C - 0.2045580 1641
>>  1 100002713 1:100002713:C:T 0.0975015 T  C - 0.0555507 1641
>>  1 100002882 1:100002882:T:G 0.2867150 G  T - 0.2045580 1641
>>  1 100002991 1:100002991:G:A 0.0975015 A  G - 0.0555507 1641
>>  1 100004726 1:100004727:G:A 0.1325410 A  G - 0.8725660 1641",
>>  header=TRUE,stringsAsFactors=FALSE)
>>
>> ret1<-read.table(text="Chr BP Marker MAF A1 A2 Direction   pValue N
>>  1 100000012 1:100000012:G:T 0.2322760 T  G - 0.230383 1608
>>  1 100000827 1:100000827:C:T 0.2882460 T  C - 0.120356 1608
>>  1 100002713 1:100002713:C:T 0.0982587 T  C - 0.272936 1608
>>  1 100002882 1:100002882:T:G 0.2882460 G  T - 0.120356 1608
>>  1 100002991 1:100002992:G:A 0.0982587 A  G - 0.272936 1608
>>  1 100004726 1:100004727:G:A 0.1340170 A  G - 0.594538 1608",
>> header=TRUE,stringsAsFactors=FALSE)
>>
>> # merge the three data frames on "Marker"
>> nn1<-merge(neu1,nep1,by="Marker")
>> nn2<-merge(nn1,ret1,by="Marker")
>> # get the common "Marker" strings
>> Marker3<-nn2$Marker
>> # subset all three data frames on Marker3
>> neu2<-neu1[neu1$Marker %in% Marker3,]
>> nep2<-nep1[nep1$Marker %in% Marker3,]
>> ret2<-ret1[ret1$Marker %in% Marker3,]
>>
>> Jim
>>
>> On Tue, Jun 2, 2020 at 7:50 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I have 3 data frames which have about 3.4 million lines (but they don't have
>> > exactly the same number of lines)...they look like this:
>> > ...
>> > Is there a way to create another 3 data frames, say neu2, nep2, ret2,
>> > which would only contain lines that have the same entries in the Marker
>> > column for all 3 data frames?
>> >
>> > Thanks
>> > Ana