[BioC] pileup coverage from BAM file

Fri Mar 1 22:15:48 CET 2013

Hi Chris --

On 02/28/2013 09:30 AM, Chris Cabanski wrote:
> Hi,
>
> I have a bam file and I am trying to calculate the coverage at each exon
> position (base) within a gene.  First, I find the exon boundaries for the
> gene of interest:
>
> library(org.Hs.eg.db)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> genesym <- c("CDKN2A")
> geneid <- select(org.Hs.eg.db, keys=genesym,
> keytype="SYMBOL",cols="ENTREZID")
>
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> ex <- exonsBy(txdb, "gene")
> ex1 <- ex[[geneid$ENTREZID[1]]]
>> ex1
> GRanges with 6 ranges and 2 metadata columns:
>        seqnames               ranges strand |   exon_id   exon_name
>           <Rle>            <IRanges>  <Rle> | <integer> <character>
>    [1]     chr9 [21967751, 21968241]      - |    127318        <NA>
>    [2]     chr9 [21968574, 21968770]      - |    127319        <NA>
>    [3]     chr9 [21970901, 21971207]      - |    127320        <NA>
>    [4]     chr9 [21974403, 21974826]      - |    127321        <NA>
>    [5]     chr9 [21974677, 21975132]      - |    127322        <NA>
>    [6]     chr9 [21994138, 21994490]      - |    127323        <NA>
>    ---
>    seqlengths:
>                      chr1                  chr2 ...        chrUn_gl000249
>                 249250621             243199373 ...                 38502
>
> Next, I would like to find the coverage at each of these positions and
> this is where I am getting stuck.  I have tried to follow
> example(applyPileups) but I get an error and am not sure that I am
> calculating coverage at the base-level.

depending on what you're interested in, you might find

   coverage(bamFile, param=ScanBamParam(which=ex1))

to be what you're looking for -- this is an 'Rle' (run length encoding) that 
describes how many reads overlap each position, without providing information on 
what the nucleotide or quality in the read was.

When looking a coverage on a large number of regions, it is sometimes more 
efficient to calculate the coverage of the whole bam file (this takes some but 
not huge amounts of memory, say 4Gb) and then take a 'Views' on the Rle that 
reflect your regions of interest.

The GenomicRanges and IRanges vignettes

   http://bioconductor.org/packages/2.11/bioc/html/GenomicRanges.html
   http://bioconductor.org/packages/2.11/bioc/html/IRanges.html

are good resources for insight into working with these objects.

>
> library(Rsamtools)
>
> # create character list of BAM files (only use first file for now)
> fls <- scan("SortedBams_tumor.txt",what='character')
>
> calcInfo <- function(x) {
>    ## information at each pile-up position
>    info <- apply(x[["seq"]], 2, function(y) {
>      y <- y[c("A", "C", "G", "T"),]

probably what's happening here is that y[,] is subsetting a matrix, but there is 
only one (or no) columns, so it is being coerced to a vector, and then...

>      y <- y + 1L
>      # continuity
>      cvg <- colSums(y)

the equivalent of...

 > colSums(1:5)
Error in colSums(1:5) : 'x' must be an array of at least two dimensions

so the immediate solution is likely

   y <- y[c("A", "C", "G", "T"), , drop=FALSE]

and...

>      p <- y / cvg[col(y)]
>      h <- -colSums(p * log(p))
>      ifelse(cvg == 4L, NA, h)
>    })
>    list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info)
> }
>
> which <- ex1
> param <- PileupParam(which=which, what=c("seq","qual"),
>      maxDepth=1000L, yieldBy="position", yieldAll=TRUE)
> fl <- PileupFiles(fls[1],param=param)
> res <- applyPileups(fl,calcInfo,param=param)
>>   Error: applyPileups: 'x' must be an array of at least two dimensions
>
> Is this correct route that I should be taking, or should I be going about
> this differently?

The function gmapR::bam_tally is an alternative to applyPileups.

Hope that helps, Martin

>
> Thanks in advance,
> Chris
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] Rsamtools_1.10.2
>   [2] Biostrings_2.26.3
>   [3] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
>   [4] GenomicFeatures_1.10.1
>   [5] GenomicRanges_1.10.6
>   [6] IRanges_1.16.4
>   [7] org.Hs.eg.db_2.8.0
>   [8] RSQLite_0.9-4
>   [9] DBI_0.2-5
> [10] AnnotationDbi_1.20.3
> [11] Biobase_2.18.0
> [12] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
>   [1] BSgenome_1.26.1    RCurl_1.5-0        XML_3.2-0          biomaRt_2.14.0
>   [5] bitops_1.0-4.1     parallel_2.15.2    rtracklayer_1.18.2 stats4_2.15.2
>   [9] tools_2.15.2       zlibbioc_1.4.0
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793