[Bioc-sig-seq] readAligned and coverage methods of the short read package

Thu Aug 27 04:02:47 CEST 2009

Hi Martin,

This is great, exactly what I was asking for - Thanks a lot!

Sorry, I didn't realize how frequently the changes are being made at-least to the ShortRead Package. In my installation, the coverage function still seems to return the "GenomeData" object.

Is the svn Changelog ( http://fgc.lsi.umich.edu/cgi-bin/blosxom.cgi) the best place to look up all the updates?

Thanks Again.

> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-19 r49321)
x86_64-unknown-linux-gnu

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.3.25   lattice_0.17-25    BSgenome_1.13.10   Biostrings_2.13.32
[5] IRanges_1.3.56

loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0

--sirisha

----- Original Message -----
From: Martin Morgan <mtmorgan at fhcrc.org>
Date: Monday, August 24, 2009 10:56 pm
Subject: Re: [Bioc-sig-seq] readAligned and coverage methods of the short read package
To: SSunkara at lbl.gov
Cc: bioc-sig-sequencing at r-project.org

> Hi Sirisha --
> 
> It's important to mention sessionInfo(). Until recently coverage()
> returned a 'GenomeData' object, more recently it returns a
> SimpleRleList, which simplifies working with other parts of the
> Bioconductor tools.
> 
> For my setup (based on the development version of R and current
> Bioconductor packages) I can do
> 
> library(ShortRead)
> example(readAligned) # sets up next line, which is the last example
> aln = readAligned(sp, "s_2_export.txt", filter=filt)
> cvg = coverage(aln)
> 
> I then have
> 
> > cvg
> SimpleRleList: 5 elements
> names(5): chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa
> > cvg[["chr1.fa"]]
>  'integer' Rle instance of length 189122529 with 44 runs
>  Lengths:  3393024 35 9077058 35 12148400 35 1576458 35 52888550 
> 35 ...
>  Values :  0 1 0 1 0 1 0 1 0 1 ...
> 
> You can think of 'coverage' like a list, with one element for each
> 'chromosome'. The elements of the list are 'Rle' objects, meaning that
> they are a run-length encoded representation of coverage. So above 
> thereare 3393024 0's (i.e., uncovered nucleotides), followed by 35 
> 1's (i.e.,
> nucleotides covered by one read), and so on. The Rle ends at the 
> end of
> the last read, or if you'd provided coverage with an appropriate 'end'
> argument (see ?'coverage,AlignedRead-method') an arbitrarily 
> determinednucleotide, e.g., the last nucleotide on the chromosome, 
> so padding with
> 0 coverage from the end of the last read to the end of the chromosome.
> 
> GenomeData has a very similar structure.
> 
> Some more below...
> 
> Sirisha Sunkara wrote:
> > Hello All,
> > 
> > I am looking at read coverage bias in a few samples (some libraries
> > constructed under different PCR amplification cycles), and would 
> like to
> > look at the coverage plots as % read count distribution/lane  
> over the 
> > chromosome/reference coordinates.
> 
> Because SimpleRleList is a list, you can do something like
> 
>  lapply(cvg, sum)
> 
> to get the total coverage on a per-chromosome basis, and
> 
>  laneCnt = sum(unlist(lapply(cvg, sum)))
> 
> to get the total coverage per lane (assuming you'd read only one lane
> into aln with readAligned; this will often be a good strategy anyway,
> because the aligned reads are very large but coverage is trivially
> small). You can normalize by lane with something along the lines of
> 
>  cvg1 = cvg / laneCnt
> 
> (if normalizing by lane is appropriate...). In R 2.9.1 the operations
> are quite similar, e.g., lapply(cvg, "/", laneCnt) gives a simple list
> of coverage Rle's.
> 
> > Right now reading the export.txt files with the readAligned and the
> > coverage methods, I get the raw counts. Is there something that 
> does the
> > conversion to % over the total read count/lane before plotting?
> > 
> > Sorry, if I am missing something obvious..
> > 
> > Also, I was wondering if the coverage method retains information of
> > regions with "zero" coverage? This would be particularly useful when
> > comparing sequencing bias in different samples and/or different 
> library> construction methods.
> 
> yes, especially if you provide the 'end' argument to coverage. 
> These are
> the zeros in the Rle. You can do fun things like
> 
>  sum(cvg[[1]] == 0)
> 
> to find out the total number of uncovered nucleotides, or
> 
>  slice(cvg, upper=0)
> 
> to get a 'view' of all uncovered nucleotides.
> 
> I'm not sure if that's what you're asking, or whether you had 
> somethingelse in mind?
> 
> Martin
> 
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-08-21 r49350)
> x86_64-unknown-linux-gnu
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] ShortRead_1.3.27   lattice_0.17-25    BSgenome_1.13.10
> Biostrings_2.13.33
> [5] IRanges_1.3.59
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1
> 
> 
> > 
> > Thanks a lot for any pointers,
> > Sirisha
> > 
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 
>