[Bioc-sig-seq] readAligned and coverage methods of the short read package

Tue Aug 25 08:04:36 CEST 2009

Hi Sirisha --

It's important to mention sessionInfo(). Until recently coverage()
returned a 'GenomeData' object, more recently it returns a
SimpleRleList, which simplifies working with other parts of the
Bioconductor tools.

For my setup (based on the development version of R and current
Bioconductor packages) I can do

 library(ShortRead)
 example(readAligned) # sets up next line, which is the last example
 aln = readAligned(sp, "s_2_export.txt", filter=filt)
 cvg = coverage(aln)

I then have

> cvg
SimpleRleList: 5 elements
names(5): chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa
> cvg[["chr1.fa"]]
  'integer' Rle instance of length 189122529 with 44 runs
  Lengths:  3393024 35 9077058 35 12148400 35 1576458 35 52888550 35 ...
  Values :  0 1 0 1 0 1 0 1 0 1 ...

You can think of 'coverage' like a list, with one element for each
'chromosome'. The elements of the list are 'Rle' objects, meaning that
they are a run-length encoded representation of coverage. So above there
are 3393024 0's (i.e., uncovered nucleotides), followed by 35 1's (i.e.,
nucleotides covered by one read), and so on. The Rle ends at the end of
the last read, or if you'd provided coverage with an appropriate 'end'
argument (see ?'coverage,AlignedRead-method') an arbitrarily determined
nucleotide, e.g., the last nucleotide on the chromosome, so padding with
0 coverage from the end of the last read to the end of the chromosome.

GenomeData has a very similar structure.

Some more below...

Sirisha Sunkara wrote:
> Hello All,
> 
> I am looking at read coverage bias in a few samples (some libraries
> constructed under different PCR amplification cycles), and would like to
> look at the coverage plots as % read count distribution/lane  over the 
> chromosome/reference coordinates.

Because SimpleRleList is a list, you can do something like

  lapply(cvg, sum)

to get the total coverage on a per-chromosome basis, and

  laneCnt = sum(unlist(lapply(cvg, sum)))

to get the total coverage per lane (assuming you'd read only one lane
into aln with readAligned; this will often be a good strategy anyway,
because the aligned reads are very large but coverage is trivially
small). You can normalize by lane with something along the lines of

  cvg1 = cvg / laneCnt

(if normalizing by lane is appropriate...). In R 2.9.1 the operations
are quite similar, e.g., lapply(cvg, "/", laneCnt) gives a simple list
of coverage Rle's.

> Right now reading the export.txt files with the readAligned and the
> coverage methods, I get the raw counts. Is there something that does the
> conversion to % over the total read count/lane before plotting?
> 
> Sorry, if I am missing something obvious..
> 
> Also, I was wondering if the coverage method retains information of
> regions with "zero" coverage? This would be particularly useful when
> comparing sequencing bias in different samples and/or different library
> construction methods.

yes, especially if you provide the 'end' argument to coverage. These are
the zeros in the Rle. You can do fun things like

  sum(cvg[[1]] == 0)

to find out the total number of uncovered nucleotides, or

  slice(cvg, upper=0)

to get a 'view' of all uncovered nucleotides.

I'm not sure if that's what you're asking, or whether you had something
else in mind?

Martin

> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-21 r49350)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.3.27   lattice_0.17-25    BSgenome_1.13.10
Biostrings_2.13.33
[5] IRanges_1.3.59

loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1

> 
> Thanks a lot for any pointers,
> Sirisha
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing