[BioC] pileup coverage from BAM file
Martin Morgan
mtmorgan at fhcrc.org
Fri Mar 1 22:15:48 CET 2013
Hi Chris --
On 02/28/2013 09:30 AM, Chris Cabanski wrote:
> Hi,
>
> I have a bam file and I am trying to calculate the coverage at each exon
> position (base) within a gene. First, I find the exon boundaries for the
> gene of interest:
>
> library(org.Hs.eg.db)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> genesym <- c("CDKN2A")
> geneid <- select(org.Hs.eg.db, keys=genesym,
> keytype="SYMBOL",cols="ENTREZID")
>
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> ex <- exonsBy(txdb, "gene")
> ex1 <- ex[[geneid$ENTREZID[1]]]
>> ex1
> GRanges with 6 ranges and 2 metadata columns:
> seqnames ranges strand | exon_id exon_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] chr9 [21967751, 21968241] - | 127318 <NA>
> [2] chr9 [21968574, 21968770] - | 127319 <NA>
> [3] chr9 [21970901, 21971207] - | 127320 <NA>
> [4] chr9 [21974403, 21974826] - | 127321 <NA>
> [5] chr9 [21974677, 21975132] - | 127322 <NA>
> [6] chr9 [21994138, 21994490] - | 127323 <NA>
> ---
> seqlengths:
> chr1 chr2 ... chrUn_gl000249
> 249250621 243199373 ... 38502
>
> Next, I would like to find the coverage at each of these positions and
> this is where I am getting stuck. I have tried to follow
> example(applyPileups) but I get an error and am not sure that I am
> calculating coverage at the base-level.
depending on what you're interested in, you might find
coverage(bamFile, param=ScanBamParam(which=ex1))
to be what you're looking for -- this is an 'Rle' (run length encoding) that
describes how many reads overlap each position, without providing information on
what the nucleotide or quality in the read was.
When looking a coverage on a large number of regions, it is sometimes more
efficient to calculate the coverage of the whole bam file (this takes some but
not huge amounts of memory, say 4Gb) and then take a 'Views' on the Rle that
reflect your regions of interest.
The GenomicRanges and IRanges vignettes
http://bioconductor.org/packages/2.11/bioc/html/GenomicRanges.html
http://bioconductor.org/packages/2.11/bioc/html/IRanges.html
are good resources for insight into working with these objects.
>
> library(Rsamtools)
>
> # create character list of BAM files (only use first file for now)
> fls <- scan("SortedBams_tumor.txt",what='character')
>
> calcInfo <- function(x) {
> ## information at each pile-up position
> info <- apply(x[["seq"]], 2, function(y) {
> y <- y[c("A", "C", "G", "T"),]
probably what's happening here is that y[,] is subsetting a matrix, but there is
only one (or no) columns, so it is being coerced to a vector, and then...
> y <- y + 1L
> # continuity
> cvg <- colSums(y)
the equivalent of...
> colSums(1:5)
Error in colSums(1:5) : 'x' must be an array of at least two dimensions
so the immediate solution is likely
y <- y[c("A", "C", "G", "T"), , drop=FALSE]
and...
> p <- y / cvg[col(y)]
> h <- -colSums(p * log(p))
> ifelse(cvg == 4L, NA, h)
> })
> list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info)
> }
>
> which <- ex1
> param <- PileupParam(which=which, what=c("seq","qual"),
> maxDepth=1000L, yieldBy="position", yieldAll=TRUE)
> fl <- PileupFiles(fls[1],param=param)
> res <- applyPileups(fl,calcInfo,param=param)
>> Error: applyPileups: 'x' must be an array of at least two dimensions
>
> Is this correct route that I should be taking, or should I be going about
> this differently?
The function gmapR::bam_tally is an alternative to applyPileups.
Hope that helps, Martin
>
> Thanks in advance,
> Chris
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Rsamtools_1.10.2
> [2] Biostrings_2.26.3
> [3] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
> [4] GenomicFeatures_1.10.1
> [5] GenomicRanges_1.10.6
> [6] IRanges_1.16.4
> [7] org.Hs.eg.db_2.8.0
> [8] RSQLite_0.9-4
> [9] DBI_0.2-5
> [10] AnnotationDbi_1.20.3
> [11] Biobase_2.18.0
> [12] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.26.1 RCurl_1.5-0 XML_3.2-0 biomaRt_2.14.0
> [5] bitops_1.0-4.1 parallel_2.15.2 rtracklayer_1.18.2 stats4_2.15.2
> [9] tools_2.15.2 zlibbioc_1.4.0
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list