[R] Merging data frames on two conditions
David Winsemius
dwinsemius at comcast.net
Tue Apr 6 23:01:19 CEST 2010
OK, so the SNPs are not the problem. Look at the "chr" columns instead. I will bet that you get 0
when you try:
length(intersect(data_lane6_snps$chr, data_lane6_snps_rsid$chr))
... since one is using a format of "chrNN" and the other is using just
"NN". You need to get the chromosome naming convention straightened out.
--
David.
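
The chromosome-naming fix suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two small data frames (snps, rsid) are made-up stand-ins for data_lane6_snps and data_lane6_snps_rsid, and it assumes the only difference between the two conventions is a leading "chr" prefix. Any chromosome label that exists in only one of the real data frames would still fail to match.

```r
# Toy stand-ins for the two data frames in this thread (values are made up).
snps <- data.frame(
  chr   = factor(c("chr1", "chr1", "chr11", "chrX")),
  SNP   = c(100L, 101L, 68143872L, 100L),
  depth = c(1L, 1L, 2L, 2L)
)
rsid <- data.frame(
  chr  = factor(c("11", "1", "X")),
  SNP  = c(68143872L, 100L, 555L),
  rsid = c("rs10", "rs20", "rs30")
)

# Before the fix: "chrNN" labels never equal "NN" labels.
n_before <- length(intersect(as.character(snps$chr), as.character(rsid$chr)))

# Assumption: stripping a leading "chr" is enough to align the conventions.
# Converting to character also avoids comparing factors with different levels.
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)

n_after <- length(intersect(snps$chr, rsid$chr))

# Merge on both keys, so position 100 on X is not paired with position 100 on 1.
merged <- merge(snps, rsid, by = c("chr", "SNP"))
print(merged)
```

On the full 7.7 million row data frame it may be worth checking unique(snps$chr) and unique(rsid$chr) after the conversion, to confirm that the two sets of labels really do line up before running the merge.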
On Apr 6, 2010, at 4:53 PM, Abhishek Pratap wrote:
> Just so you know
>
> length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
> 796120
>
> I just need to include the chr condition now where I am stuck.
>
> -Abhi
>
> On Tue, Apr 6, 2010 at 4:51 PM, Abhishek Pratap <abhishek.vit at gmail.com> wrote:
> Hi David
>
> I understand that, looking at the SNP values shown, the two columns
> appear to hold different values, which would explain the empty merge
> result. However, the columns still have ~700K SNPs in common. What I am
> looking for is a merge where both SNP and chr match. If I match only
> on the SNP column I get partially correct results, since two
> chromosomes can have a SNP at the same bp location, so the merge needs
> to take both SNP position and chromosome into account.
>
> Thanks!
> -Abhi
>
>
> On Tue, Apr 6, 2010 at 4:42 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Apr 6, 2010, at 4:03 PM, Abhishek Pratap wrote:
>
> Hi David
>
> Here it is. You can ignore the bio jargon if it sounds confusing.
>
> Sometimes it is essential to have domain details.
>
>
> The data types of the columns (SNP, chr) on which I am
> applying merge are the same in both data frames.
>
> merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP", "chr"))
>
>
> str(data_lane6_snps)
> 'data.frame': 7724462 obs. of 10 variables:
> $ chr : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 1
> 1 1 1 1 1 ...
> $ SNP : int 100 101 103 108 179 180 191 197 218 222 ...
> $ reference : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 2
> 5 2 2 1 5 ...
> $ genotype : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 2
> 2 3 8 2 2 ...
> $ consensus_qual: int 0 0 0 4 33 33 19 19 19 19 ...
> $ snp_qual : int 0 0 0 4 0 33 19 19 19 19 ...
> $ rms_qual : int 0 0 0 0 21 21 21 21 21 21 ...
> $ depth : int 1 1 1 1 2 2 2 2 2 2 ...
> $ bases : Factor w/ 453774 levels "^!,","^!,^!,",..: 5 5 5
> 410998 49793 155731 284998 416878 133393 133393 ...
> $ base_quality : Factor w/ 555104 levels "`","``","```",..: 359
> 359 359 54813 92856 92856 92856 92856 92539 55424 ...
>
> > str(data_lane6_snps_rsid)
> 'data.frame': 797807 obs. of 4 variables:
> $ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
> $ SNP : int 68143872 11071026 69423434 12394791 1302846 95330693
> 3921381 57122299 41899656 76990037 ...
>
> Looking at this line and the line for "SNP" in the above dataframe, I
> am not seeing much similarity in range between the two.
> There are 10 times fewer observations. What was your plan for the
> non-matching cases? Did you really mean that you wanted a right outer
> join?
>
> You might get information by trying:
>
> length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
>
> That would tell you how many potential matches you might have on the
> basis of SNP numbers, although an SNP match might or might not be a
> full match given the chr matching that is also being specified.
>
>
>
> $ end : int 68143872 11071026 69423434 12394791 1302846 95330693
> 3921381 57122299 41899656 76990037 ...
> $ rsid: Factor w/ 797807 levels "rs10","rs10000010",..: 100229
> 685690 505395 470219 780326 29342 29263 327909 434159 723152 ...
>
>
> On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:
>
> Hi Guys
>
> I have two data frames which I would like to merge on two conditions.
>
> I am doing the following (abstract form)
>
> new.data.frame <- merge(df1,df2, by=c("Col1","Col2"))
>
> So I am guessing that you really wanted just this:
>
> new.data.frame <- merge(df1,df2)
>
> ?merge
>
> Since the default for merge is: by = intersect(names(x), names(y)),
> this would have been equivalent to
>
> new.data.frame <- merge(df1,df2, by=c("chr", "SNP") )
>
> See above regarding the possibility that you have non-congruent SNP
> labeling problems.
>
>
>
>
>
> What does
>
> str(df1) ; str(df2)
>
> ... show?
>
>
>
> It is giving me a null result.
>
> Basically I need to apply two conditions.
>
> I also tried sqldf but it is running forever. Will indexing help ?
>
> temp <- sqldf("select
>   a.chr, a.SNP, a.snp_qual, a.rms_qual, a.depth, b.rsid
>   FROM data_lane6_snps a, data_lane6_snps_rsid b
>   WHERE a.SNP = b.SNP
>   AND a.chr = b.chr
> ")
>
> Thanks!
> -Abhi
>
>
> David Winsemius, MD
> West Hartford, CT
David Winsemius, MD
West Hartford, CT