[BioC] merging GRange objects
Nair, Murlidharan T
mnair at iusb.edu
Tue Jun 25 16:14:03 CEST 2013
Thanks Hervé, that helps a lot. Indeed I was getting a lot of duplicates as you pointed out. I shall use your approach.
Cheers../Murli
-----Original Message-----
From: Hervé Pagès [mailto:hpages at fhcrc.org]
Sent: Tuesday, June 25, 2013 2:50 AM
To: Nair, Murlidharan T
Cc: Murli [guest]; bioconductor at r-project.org
Subject: Re: [BioC] merging GRange objects
Murli,
On 06/24/2013 06:00 PM, Nair, Murlidharan T wrote:
> Hi Hervé ,
> I am annotating paired end reads and code is for mate1, mate2 is the same. As you can see I am using the genomic coordinates and trying to annotate them using the UCSC known genes table. I need to ultimately make an association with the coordinates, a reason why I am trying to merge the outputs. I am converting them into data frames and merging them because select returns a data frame, so I have to convert the GRanges object to a data frame to merge them. I want to make sure is that I am not messing my data when I am merging them. Would the following lines correctly combine them?
>
>> mrg.data1=merge(as.data.frame(trans.names),
>> as.data.frame(trans.info), by.x="ENTREZID", by.y="GENEID")
>
>> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID")
>
> When I reviewed the first few lines and they seemed ok, but there could always be exceptions. If there is a better way please let me know. I am very new to Bioconductor.
You didn't send us code we can reproduce but here is code that tries to achieve something similar:
library(GenomicRanges)
mate.range <- GRanges("chr1", IRanges(c(39800000, 90400000,
39800066),width=500))
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
codingRegions <- refLocsToLocalLocs(mate.range, txdb)
trans.info <- select(txdb, key=mcols(codingRegions)$TXID,
cols=c("GENEID","TXNAME"), keytype="TXID")
library(org.Hs.eg.db)
trans.names <- select(org.Hs.eg.db, trans.info$GENEID,
c("GENENAME", "SYMBOL"))
mrg.data1 <- merge(trans.names, trans.info,
by.x="ENTREZID", by.y="GENEID")
mrg.data2 <- merge(mrg.data1, as.data.frame(codingRegions),
by.x="TXID", by.y="TXID")
> mrg.data2
TXID ENTREZID GENENAME SYMBOL
1 963 23499 microtubule-actin crosslinking factor 1 MACF1
2 963 23499 microtubule-actin crosslinking factor 1 MACF1
3 963 23499 microtubule-actin crosslinking factor 1 MACF1
4 963 23499 microtubule-actin crosslinking factor 1 MACF1
5 963 23499 microtubule-actin crosslinking factor 1 MACF1
6 963 23499 microtubule-actin crosslinking factor 1 MACF1
7 963 23499 microtubule-actin crosslinking factor 1 MACF1
8 963 23499 microtubule-actin crosslinking factor 1 MACF1
9 1687 55144 leucine rich repeat containing 8 family, member D LRRC8D
10 1687 55144 leucine rich repeat containing 8 family, member D LRRC8D
11 1688 55144 leucine rich repeat containing 8 family, member D LRRC8D
12 1688 55144 leucine rich repeat containing 8 family, member D LRRC8D
TXNAME seqnames start end width strand CDSLOC.start
CDSLOC.end
1 uc021olw.1 chr1 39800000 39800499 500 + 3060
3559
2 uc021olw.1 chr1 39800066 39800565 500 + 3126
3625
3 uc021olw.1 chr1 39800000 39800499 500 + 3060
3559
4 uc021olw.1 chr1 39800066 39800565 500 + 3126
3625
5 uc021olw.1 chr1 39800000 39800499 500 + 3060
3559
6 uc021olw.1 chr1 39800066 39800565 500 + 3126
3625
7 uc021olw.1 chr1 39800000 39800499 500 + 3060
3559
8 uc021olw.1 chr1 39800066 39800565 500 + 3126
3625
9 uc001dnm.3 chr1 90400000 90400499 500 + 1373
1872
10 uc001dnm.3 chr1 90400000 90400499 500 + 1373
1872
11 uc001dnn.3 chr1 90400000 90400499 500 + 1373
1872
12 uc001dnn.3 chr1 90400000 90400499 500 + 1373
1872
CDSLOC.width PROTEINLOC QUERYID CDSID
1 500 1020, 1187 1 2948
2 500 1042, 1209 3 2948
3 500 1020, 1187 1 2948
4 500 1042, 1209 3 2948
5 500 1020, 1187 1 2948
6 500 1042, 1209 3 2948
7 500 1020, 1187 1 2948
8 500 1042, 1209 3 2948
9 500 458, 624 2 5132
10 500 458, 624 2 5132
11 500 458, 624 2 5132
12 500 458, 624 2 5132
At first glance it looks like the merging worked. But there are so many duplicated rows in the final data.frame that I don't find this approach very appealing:
> duplicated(mrg.data2)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
Querying the Homo.sapiens package allows you to retrieve all the annotations you want in one single call:
> library(Homo.sapiens)
> mrg.data1b <- select(Homo.sapiens, keys=mcols(codingRegions)$TXID,
cols=c("GENEID", "TXNAME", "ENTREZID", "GENENAME", "SYMBOL"),
keytype="TXID")
> mrg.data2b <- merge(mrg.data1b, as.data.frame(codingRegions),
by.x="TXID", by.y="TXID")
> mrg.data2b
TXID GENEID TXNAME SYMBOL
1 963 23499 uc021olw.1 MACF1
2 963 23499 uc021olw.1 MACF1
3 1687 55144 uc001dnm.3 LRRC8D
4 1688 55144 uc001dnn.3 LRRC8D
5 1689 <NA> uc021opq.1 <NA>
GENENAME ENTREZID seqnames
start
1 microtubule-actin crosslinking factor 1 23499 chr1
39800000
2 microtubule-actin crosslinking factor 1 23499 chr1
39800066
3 leucine rich repeat containing 8 family, member D 55144 chr1
90400000
4 leucine rich repeat containing 8 family, member D 55144 chr1
90400000
5 <NA> <NA> chr1
90400000
end width strand CDSLOC.start CDSLOC.end CDSLOC.width PROTEINLOC QUERYID
1 39800499 500 + 3060 3559 500 1020,
1187 1
2 39800565 500 + 3126 3625 500 1042,
1209 3
3 90400499 500 + 1373 1872 500 458,
624 2
4 90400499 500 + 1373 1872 500 458,
624 2
5 90400499 500 + 1373 1872 500 458,
624 2
CDSID
1 2948
2 2948
3 5132
4 5132
5 5132
Note that this solution does not produce all the duplicated rows and it also preserves those rows that correspond to transcripts not linked to a gene id (e.g. transcript uc021opq.1).
Personally, I prefer to merge the annotations to the metadata columns of the GRanges object:
m <- match(mcols(codingRegions)$TXID, mrg.data1b$TXID)
mcols(codingRegions) <- cbind(mcols(codingRegions),
DataFrame(mrg.data1b)[m, -1])
> codingRegions
GRanges with 5 ranges and 10 metadata columns:
seqnames ranges strand | CDSLOC PROTEINLOC
<Rle> <IRanges> <Rle> | <IRanges> <IntegerList>
[1] chr1 [39800000, 39800499] + | [3060, 3559] 1020,1187
[2] chr1 [90400000, 90400499] + | [1373, 1872] 458,624
[3] chr1 [90400000, 90400499] + | [1373, 1872] 458,624
[4] chr1 [90400000, 90400499] + | [1373, 1872] 458,624
[5] chr1 [39800066, 39800565] + | [3126, 3625] 1042,1209
QUERYID TXID CDSID GENEID TXNAME SYMBOL
<integer> <character> <integer> <character> <character> <character>
[1] 1 963 2948 23499 uc021olw.1 MACF1
[2] 2 1687 5132 55144 uc001dnm.3 LRRC8D
[3] 2 1688 5132 55144 uc001dnn.3 LRRC8D
[4] 2 1689 5132 <NA> uc021opq.1 <NA>
[5] 3 963 2948 23499 uc021olw.1 MACF1
GENENAME ENTREZID
<character> <factor>
[1] microtubule-actin crosslinking factor 1 23499
[2] leucine rich repeat containing 8 family, member D 55144
[3] leucine rich repeat containing 8 family, member D 55144
[4] <NA> <NA>
[5] microtubule-actin crosslinking factor 1 23499
---
seqlengths:
chr1
NA
Maybe this is what you wanted i.e. add the GENEID, TXNAME, SYMBOL, GENENAME, and ENTREZID metadata cols to the GRanges object 'codingRegions'?
H.
>
> Thanks for your help.
>
> Cheers../Murli
>
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Monday, June 24, 2013 8:29 PM
> To: Murli [guest]
> Cc: bioconductor at r-project.org; Nair, Murlidharan T
> Subject: Re: [BioC] merging GRange objects
>
> Hi Murli,
>
> On 06/24/2013 04:55 PM, Murli [guest] wrote:
>>
>> Hi,
>>
>> I would appreciate if you could tell me if the way I am merging the GappedAlignments object and GRanges objects is correct. mate1 and mate2 are GappedAlignment objects. I am merging them in order to associate my reads with the annotation.
>>
>> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> mate.range=
>> GRanges(seqnames(mate1[isSameCzome]),IRanges(start(mate1[isSameCzome]
>> )
>> -offset,start(mate1[isSameCzome])+offset))
>>
>> codingRegions = refLocsToLocalLocs(mate.range, txdb)
>>
>> trans.info=select(txdb, key=values(codingRegions)$TXID,
>> cols=c("GENEID","TXNAME"), keytype="TXID")
>>
>> trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME",
>> "SYMBOL"))
>>
>> mrg.data1=merge(as.data.frame(trans.names),
>> as.data.frame(trans.info), by.x="ENTREZID", by.y="GENEID")
>>
>> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID",
>> by.y="TXID")
>
> It looks like you are merging data.frames, not GappedAlignments or GRanges objects. Also you say that 'mate1' and 'mate2' are GappedAlignments objects but I only see 'mate1' in the above code.
>
> The exact meaning of "merging" depends on the objects involved.
> Sometimes people use the term "merging" when they actually want to combine or bind objects together with c(), rbind() or cbind().
> Note that since GRanges and GappedAlignments objects are conceptually
> vector-like objects of dimension 1, only c() works on them. That is,
> rbind(), cbind(), and merge() (which are typically operating on 2-D
> objects) are not supported on those objects.
>
> Cheers,
> H.
>
>>
>> Thanks ../Murli
>>
>>
>>
>> -- output of sessionInfo():
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] biomaRt_2.16.0
>> [2] org.Hs.eg.db_2.9.0
>> [3] RSQLite_0.11.4
>> [4] DBI_0.2-7
>> [5] VariantAnnotation_1.6.6
>> [6] Rsamtools_1.12.3
>> [7] BSgenome.Hsapiens.UCSC.hg19_1.3.19
>> [8] BSgenome_1.28.0
>> [9] Biostrings_2.28.0
>> [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2
>> [11] GenomicFeatures_1.12.2
>> [12] AnnotationDbi_1.22.6
>> [13] Biobase_2.20.0
>> [14] GenomicRanges_1.12.4
>> [15] IRanges_1.18.1
>> [16] BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5 RCurl_1.95-4.1 rtracklayer_1.20.2 stats4_3.0.1
>> [5] tcltk_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list