[BioC] findOverlaps question from a newbie...
vobencha at fhcrc.org
Tue May 29 18:03:59 CEST 2012
Here is one approach,
(1) create the GRanges or GRangesList from gff
Using a gff file in GenomicFeatures as an example,
> gffFile <- system.file("extdata","a.gff3",package="GenomicFeatures")
> txdb <- makeTranscriptDbFromGFF(file=gffFile,
+ dataSource="partial gtf file for Tomatoes for testing",
+ species="Solanum lycopersicum",
If you are interested in genes either exons by genes or transcripts by
genes might be useful,
exbygene <- exonsBy(txdb, "gene")
txbygene <- transcriptsBy(txdb, "gene")
The gene names are associated with each outer list element of the
GRangesList of length 488:
GRanges with 2 ranges and 2 elementMetadata cols:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <integer> <character>
 SL2.40ch00 [16437, 17275] + | 1
 SL2.40ch00 [17336, 18189] + | 2
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | exon_id
 SL2.40ch00 [68062, 68211] + | 3
 SL2.40ch00 [68344, 68568] + | 4
 SL2.40ch00 [68654, 68764] + | 5
For the sake of example, I'll take a subset of one of the GRangesLists
and overlap it with the full list,
subset <- exbygene[c(2,4,6)]
olap <- findOverlaps(subset, exbygene)
Hits of length 3
1 1 2
2 2 4
3 3 6
The information in the 'Hits' object can be used to map back to the
query and subject used in findOverlaps().
See ?queryHits and ?subjectHIts.
Next identify which subjects were hit by the queries and extract the
(3) reconsiling gene sets
Take a look at the 'type' argument to findOverlaps(). To find exact
matches you'll want to find ranges with type='equal', to find any
overlaps use type='any' (default).
You may also want to look at summarizeOverlaps(). It is an extension of
findOverlaps() that counts a read a maximum of once (i.e., no double
counting). If a read hits multiple features different rule sets are
applied to assign the read to a single feature.
On 05/24/2012 10:05 AM, gowtham wrote:
> Hi Everyone,
> This is my first attempt in R/bioconductor outside of plot()/hist()
> function. Thanks to Martin Morgan, I was impressed with what
> bioconductor can do for me ( after attending bioc workshop last week).
> When I try to findOverlap on two GFF files I get information on which
> feature is overlapping with another feature. But, instead of using "feature
> names" it just indicates 1,2, 3...etc. Is there a way, i can make it output
> gene names. (my GRanges objects are read from gff files using Rtracklayer&
> later have names (geneids) assigned to them using names() function).
>> findOverlaps(au, am)
> Hits of length 1740
> queryLength: 790
> subjectLength: 1054
> queryHits subjectHits
> <integer> <integer>
> 1 1 1
> 2 1 528
> 3 2 13
> 4 2 540
> 5 3 136
> And here is the background of what i am trying to do. In case, if any of
> you have a better suggestion:
> I run 3 de novo gene prediction algorithm on our newly assembled genome.
> And I would like to reconcile the gene sets from all of them to genes (a)
> predicted with exact same boundaries in all three (b) boundaries not same
> but overlap (c) predicted by only one or two of them.
> And use reduce in case of (b) to obtain outer most boundaries of
> overlapping prediction.
> As a novice users its bit overwhelming for me.
> Thanks very much,
More information about the Bioconductor