[BioC] Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs
Steve Lianoglou
mailinglist.honeypot at gmail.com
Sun Jan 13 11:36:44 CET 2013
Hi Fong,
On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at alumni.ubc.ca> wrote:
> Hi,
>
> I was hoping someone could explain this weird situation to me when I am
> trying to extract exon information using the GenomicFeatures Bioconductor
> package. Here is what I've done:
>
> txdb <- loadFeatures(txdbFile)
> allExons <- exons(txdb, columns = c('gene_id', 'exon_id'))
> annotDf <- values(allExons)
> dim(annotDf)
>
> [1] 541825 2
>
> According to this code, I've extracted a total of 541825 exons which are
> associated with x number of genes. Now when I do the following:
>
> length(unlist(annotDf[, 'gene_id']))
>
> [1] 546664
>
> How is it possible that there are suddenly an additional 4839 genes added
> to the 'gene_id' column?
The first clue is that the gene_id column is stored as a *List type,
which you correctly realized because you are `unlisting` it it to get
its length. The reason that it's a list is so that one could store
more than one element per index, otherwise they would have just used a
character (or factor) vector.
So, what you can do is to see what's going on here -- I am using the
latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene"
package, so things might not be exactly the same for you, but I said
up my vars to mimc yours, so:
R> gene.ids <- annotDf$gene_id
R> counts <- elementLengths(gene.ids) ## number of elements per bin
R> table(counts)
counts
1 2 3 4 5 6 9 15 22
282939 3788 101 7 6 1 4 3 3
So we see that several gene ids can be assigned to one exon. You might
as why, and I reckon the answer is that the exons are defined as
"unique" simply by their chr,strand,start,end combo and several genes
(or transcripts) might share the same exon (in "genomic space"). Let's
see:
R> allExons[elementLengths(gene.ids) > 1]
GRanges with 3913 ranges and 2 metadata columns:
seqnames ranges strand |
gene_id exon_id
<Rle> <IRanges> <Rle> |
<CompressedCharacterList> <integer>
[1] chr1 [ 324288, 324345] + |
100132287,100133331 12
[2] chr1 [10490159, 10490625] + |
100526739,378708 897
[3] chr1 [10490804, 10491458] + |
100526739,378708 898
[4] chr1 [10493899, 10494022] + |
100526739,378708 899
[5] chr1 [10494714, 10494747] + |
100526739,378708 900
[6] chr1 [10500404, 10500470] + |
100526739,378708 901
[7] chr1 [10511434, 10512060] + |
100526739,1325 904
[8] chr1 [16355621, 16355794] + |
1187,1188 1504
[9] chr1 [19923471, 19923603] + |
100532736,440574 1788
... ... ... ... ...
... ...
[3905] chrY [26944570, 26944641] - |
57054,57135 274737
[3906] chrY [26946960, 26947031] - |
57054,57135 274738
[3907] chrY [26949403, 26949474] - |
57054,57135 274739
[3908] chrY [26950875, 26951014] - |
57054,57135 274740
[3909] chrY [26951104, 26951167] - |
57054,57135 274741
[3910] chrY [26951604, 26951655] - |
57054,57135 274742
[3911] chrY [26952216, 26952307] - |
57054,57135 274743
[3912] chrY [26952582, 26952728] - |
57054,57135 274744
[3913] chrY [26959330, 26959639] - |
57054,57135 274745
If you hop to any of these regions in your favorite genome browser,
you should find overlapping gene annotations there that share exact
exon boundaries across transcripts.
> The reason this is a problem is that I would like
> to create a 3 column output where I use the command:
>
> data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[,
> 'exon_id'], exonCounts = exonCounts)
>
> Where exonCounts is generated with the countOverlaps() function using
> allExons and reads from a bam file. This command is failing because it
> seems that the number of gene_ids is not pairing up with the exon_ids
> properly. Is there something I am missing here?
Hopefully this clarification helps?
And also:
> I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built
> using the command:
You should upgrade to the latest version of R and bioconductor
packages to get the best functionality (and help).
Hope that helps,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list