[BioC] Finding my un-counted reads

Thu Aug 29 00:38:59 CEST 2013

After a second read I think I may have missed what you're really after. 
Specifically the question about 'where did the reads go'?

A simple check of whether or not the mapped reads hit any part of the 
annotation would be to overlap them with the 'gff0' GRanges.

     olap <- findOverlaps(gff0, reads)

The result is a 'Hits' object that tells you what query overlapped what 
subject. See ?Hits for details.

     ## number of hits for each row of gff0
     countQueryHits(olap)
     ## number of hits for each read
     countSubjectHits(olap)

Valerie

On 08/28/2013 03:15 PM, Valerie Obenchain wrote:
> Hi Sam,
>
> ## counting reads
>
> If you haven't done this already I'd confirm counts on a single file. A
> bam can be read in with readGAlignments(). If you have paired-end data
> you can use ?readGAlignmentPairs or ?readGAlignmentsList.
>
>      bf <- BamFile("pathToOneBam")
>      reads <- readGAlignments(bf) ## devel
>      reads <- readGappedAlignments(bf) ## release
>
> Take a look at the makeTranscriptDbFromGFF() function in
> GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The
> advantage of working with a TxDb is that you get the accessors such as
> cdsBy(..., "gene"), etc. for free (you may know this, just an fyi).
>
> Get an idea of how many reads hit each list element of the GRangesList.
> This method does double count so it's possible for the sum of 'olap' to
> be greater than the length of 'reads'.
>
>      olap <- countOverlaps(cds_gene, reads)
>
> If the numbers in 'olap' are very small (and this is unexpected) you may
> have a problem with how you've created the 'cds_gene' annotation. If the
> numbers make sense then try summarizeOverlaps.
>
>      se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty")
>
> If the number of hits in assays(se)$counts are greatly reduced you may
> have many reads overlapping each other or a single feature in
> 'cds_gene'. This can be checked by looking at the numbers in 'olap'. If
> you are using the devel version (GenomicRanges >= 1.13.9) the
> 'inter.feature' argument allows reads to be counted for each feature
> they overlap.
>
>
> ## mapped vs unmapped reads
>
> readGAlignments() only imports mapped reads; unmapped are dropped. If
> you are interested in the number of mapped vs unmapped and some details
> about them you can use scanBam(). If your file is large you can use the
> 'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for
> param details.
>
> All reads in the file, mapped and unmapped.
>      scn <- scanBam(bf)
>      names(scn[[1]])
>
> Unmapped only.
>      flag1 <- scanBamFlag(isUnmappedQuery=TRUE)
>      param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1)
>      unmapped <- (bf, param1)
>
> Mapped only.
>      flag2 <- scanBamFlag(isUnmappedQuery=FALSE)
>      param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2)
>      mapped <- (bf, param2)
>
>
> Valerie
>
> On 08/28/2013 01:50 PM, Sam McInturf wrote:
>> Bioconductors,
>>      I am working on an Arabidopsis RNA sequencing experiment and have
>> run
>> into some trouble.  I found that I was able to map ~94% of my ~35 million
>> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp
>> single end reads).  After I counted my mapped reads I found that only
>> about
>> 47% of my mapped reads were counted. I found out there is a strong
>> possibility that there may be genomic DNA contamination, and I would like
>> to verify this hypothesis.
>> I used GenomicRanges and rtracklayer to do my read counting:
>>
>> library(GenomicRanges)
>> library(rtracklayer)
>> library(Rsamtools)
>> setwd("/myDir")
>> myBamFiles <- list.files(pattern = ".bam$")
>> myBamIndex <- list.files(pattern = ".bai$")
>>
>> bfl <- BamFileList(myBamFiles, index = myBamIndex)
>> gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf",
>> asRangedData = FALSE)
>> cds <- subset(gff0, type == "CDS")
>> cds_gene <- reduce(split(cds, cds$gene_id))
>> olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty")
>>
>> My intention with subsetting for the CDS and then splitting on gene_id is
>> to create a GRangesList where each element is a unique AGI number and has
>> all of the features.
>> To extract the number of reads counted I use:
>>
>> colSums(assays(olap)$counts)
>>
>> The output of colSums(...) is how I have found that less than half my
>> reads
>> mapped to transcripts.  The first question becomes "Is this how to count
>> the number of reads mapped?"  The second question becomes "how do I
>> figure
>> out where the reads went?"
>>
>> My initial inclination would be to make a GRangesList that contained
>> intervals between genes and then do the same thing as before and see if I
>> can find my other reads.  Something in the vein of:
>>
>> maxLimits <- t(as.dataframe(lapply(cds_gene, function(x)
>> c(min(start(ranges(x))),
>>     max(end(  ranges(x)))
>>     )
>> )
>> ))
>> which is just an elaborate lapply call to get the smallest start
>> coordinate
>> and the largest end coordinate for each unique gene_id and then transform
>> it to a pretty data frame.  From there I could find a way to get the
>> 'in-betweens' of these features, but I am not sure that is the most
>> prudent
>> thing to do.  I looked at some of the ShortRead and GenomicRanges
>> functions
>> for a guiding hand and found the disjointBins() function may be useful in
>> establishing features to map over, but am at a bit of a loss as I don't
>> have the expertise to figure this out straight away.
>>    Any suggestions or resources to find mapped but uncounted reads
>> would be
>> helpful!
>>
>>
>> Thanks
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor