[Bioc-sig-seq] Bioc short read directions

Wed Apr 2 17:48:31 CEST 2008

Hi Stephen,

Stephen Henderson wrote:
> Hi
> Sorry Im not at work so cant download
>  
> I think for many applications such as ChIP-seq you want to have multiple 
> reads as you are effectively scoring the hits to a given location. They 
> are not duplicates as such but enrichment scores for that point.

yep.

> On a related point. Do you have some functionality that deals say with 
> reads that match multiple locations in the reference (ie repetitive 
> regions)? Again this would help make sense of ChIP experiments.

matchPDict returns information on all matches satisfying the criterion 
implied by the trusted band and number of mismatches. Collating these in 
a way that would be informative for ChIP-seq is probably 'easy' to do in 
R (along the lines of table(cut(unlist(endIndex(matched)), ...)); I'm 
not sure how efficient this would be (and one would want to be clever 
about read & match quality, and probably directly describing ChIP-seq 
data based on distribution of match locations rather than bins).

ChIP-seq is definitely an area where R can provide some very powerful 
tools, and I think there is effort already along those directions. It 
would be great (for the list) to hear from those developing approaches...

Martin

> Stephen Henderson
> ------------------------------------------------------------------------
> *From:* bioc-sig-sequencing-bounces at r-project.org on behalf of 
> mtmorgan at fhcrc.org
> *Sent:* Wed 02/04/2008 14:49
> *To:* Loyal Goff
> *Cc:* Bioc-sig-sequencing at r-project.org
> *Subject:* Re: [Bioc-sig-seq] Bioc short read directions
> 
> Hi Loyal --
> 
> 
> Quoting Loyal Goff <lgoff at rci.rutgers.edu>:
> 
>  > This is a great start...thanks to both Martin and Herve. The speed is 
>  > indeed impressive! I do have one question.  Would it be advantageous 
>  > to reduce the data to a unique list of read sequences, and in doing so 
>  > both retain counts in a separate slot and reduce the matrix size? It 
>  > seems to me this would speed everything along as well. (ie. only 
>  > attempt to align a unique sequence once).  Does anyone have a need to 
>  > retain independent reads after a quality score cutoff?
> 
> In terms of matching, PDict does the right thing and it makes (little)
> difference in terms of performance whether the reads have been made unique,
> while introducing a bookkeeping nightmare for the user. Hopefully other
> algorithms will be implemented to cope with duplicates effectively, if
> appropriate.
> 
> I think the bookkeeping argument also weighs quite heavily in favor of 
> dealing
> with the data 'as-is'. I suspect also that quality scores will play an
> increasingly important part in algorithms, so a simple cut-off will not be a
> good solution. And probably the time / space 'savings' is only a 
> fraction (10%
> duplicate reads?? Maybe someone on the list has experience with this?) 
> of the
> requirements, so doesn't really change things that much.
> 
> My first inclination is to leave the data accessible in all its glory.
> 
> Martin
> 
>  >
>  > Loyal
>  >
>  > Loyal A. Goff
>  >
>  > Rutgers Stem Cell Research Center
>  > Rutgers: The State University of New Jersey
>  > Nelson Biology Labs D-251
>  > 604 Allison Rd,
>  > Piscataway, NJ 08854
>  > lgoff at rci.rutgers.edu
>  >
>  > On Apr 1, 2008, at 11:20 PM, Martin Morgan wrote:
>  >
>  > > Short-readers!
>  > >
>  > > I want to take this opportunity to provide an update on the directions
>  > > we've initiated for short reads. It would be great to get your 
>  > > feedback,
>  > > and to hear of your own efforts.
>  > >
>  > > WARNING: This software is in development, and is far from final. In
>  > > particular, functions in the BiostringsCinterfaceDemo package are NOT
>  > > meant to be final or for use in other packages; they're here to
>  > > demonstrate 'proof-of-concept' and to illustrate how users can access
>  > > the Biostrings package from C code. I'll indicate which functions are
>  > > from BiostringsCinterfaceDemo below. Expect a short-read 'base' 
>  > > package
>  > > to materialize in the devel branch of Bioconductor in the not too
>  > > distant future.
>  > >
>  > > WARNING: The data used to illustrate functionality are not meant to be
>  > > indiciative of the quality of data produced by Solexa; they are
>  > > generally 'first runs' that present a number of interesting challenges
>  > > for interpretation.
>  > >
>  > > HARDWARE AND SOFTWARE: The following include timing and object size
>  > > measurements. The machine being used is fast, but we're not doing
>  > > anything fancy to, e.g., exploit multiple processors. The machine 
>  > > has a
>  > > very large amount of memory; we used about 10 GB below, looking at 
>  > > three
>  > > different data sets. The following uses the R-2-7-branch (this is
>  > > different from R-devel). The Biostrings and BiostringsCinterfaceDemo
>  > > packages are updated very regularly, so be prepared for broken or
>  > > outdated functions.
>  > >
>  > > Herve Pages is responsible for the clever code; I am just a scribe.
>  > >
>  > > Ok, first a convenience function to print out 'size' in megabytes,
>  > > 'cause objects are large!
>  > >
>  > >> mb <- function(sz) noquote(sprintf("%.1f MB", sz / (1024^2)))
>  > >
>  > >
>  > > * Starters...
>  > >
>  > > We load the BiostringsCinterfaceDemo, which requires Biostrings. 
>  > > Both of
>  > > these need to be from the 'development' branch of Bioconductor. Both 
>  > > are
>  > > changing rapidly, and should be obtained from svn and updated 
>  > > regularly
>  > > (http://wiki.fhcrc.org/bioc/DeveloperPage,
>  > > http://wiki.fhcrc.org/bioc/SvnHowTo).
>  > >
>  > >> library(BiostringsCinterfaceDemo)
>  > >
>  > >
>  > > * I/O, DNAStringSet, and alphabetFrequency
>  > >
>  > > We next read in a fasta file derived from a lane of solexa reads. 
>  > > Here's
>  > > what the data looks like:
>  > >
>  > >> readLines(fastaFile, 10)
>  > >  [1] ">5_1_102_368"
>  > >  [2] "TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT"
>  > >  [3] ">5_1_120_254"
>  > >  [4] "TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT"
>  > >  [5] ">5_1_110_385"
>  > >  [6] "GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC"
>  > >  [7] ">5_1_118_88"
>  > >  [8] "GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC"
>  > >  [9] ">5_1_113_327"
>  > > [10] "GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC"
>  > >
>  > > This is a single lane from a Solexa training run; the data are not
>  > > filtered, and the run is not meant to be representative in terms of
>  > > quality or other characteristics. The DNA used for the reads is from
>  > > phage phiX-174. Here's how we read it in (countLines and 
>  > > readSolexaFastA
>  > > are in BiostringsCinterfaceDemo)
>  > >
>  > >> countLines(fastaFile)
>  > > s_5.fasta
>  > >  18955056
>  > >> system.time({
>  > > +     seqa <- readSolexaFastA(fastaFile)
>  > > + }, gcFirst=TRUE)
>  > >    user  system elapsed
>  > >   67.48    2.08   69.68
>  > >> mb(object.size(seqa))
>  > > [1] 397.7 MB
>  > >> seqa
>  > >   A DNAStringSet instance of length 9477528
>  > >           width seq
>  > >       [1]    36 TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT
>  > >       [2]    36 TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT
>  > >       [3]    36 GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC
>  > >       [4]    36 GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC
>  > >       [5]    36 GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC
>  > >       [6]    36 TGACCCTCAGCAATCTTAAACTTCTTAGACGAATCA
>  > >       [7]    36 GCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGC
>  > >       [8]    36 TTTAGGTGTCTGTAAAACAGGTGCCGAAGAAGCTGG
>  > >       [9]    36 GGTCTGTTGAACACGACCAGAAAACTGGCCTAACGA
>  > >       ...   ... ...
>  > > [9477520]    36 TACGCAGTTTTGCCGTATACTCGTTGTTCTGACTCT
>  > > [9477521]    36 TATACCCCCCCTCCTACTTGTGCTGTTTCTCATGTT
>  > > [9477522]    36 CAGGTTGTTTCTGTTGGTGCTGATATTTCTTTTTTT
>  > > [9477523]    36 GTCTTCCTTGCTTGTCAGATTGGTCGTCTTATTACC
>  > > [9477524]    36 ATACGAAAGACCAGGTATATGCACAAAATGAGTTGC
>  > > [9477525]    36 ACCACAAACGCGCTCGTTTATGCTTGCCTCTATTAC
>  > > [9477526]    36 ------------------------------------
>  > > [9477527]    36 CCAGCAAGGAAGCCAAGATGGGAAAGGTCATGCGGC
>  > > [9477528]    36 CATTGTTGACCACCTACATACCACAGACGAGCACCT
>  > >
>  > > It takes just over a minute to read in the nearly 9.5 million reads. 
>  > > The
>  > > reads are stored efficiently, without the overhead of R character
>  > > strings. The data structure (a DNAStringSet, from Biostrings and
>  > > therefore stable) will not copy the large data, but instead contains
>  > > 'views' into it.
>  > >
>  > > A basic question is about the nucleotides present in the reads.
>  > > alphabetFrequency (from Biostrings) scans all the sequences and 
>  > > tallies
>  > > nucleotides
>  > >
>  > >> system.time({
>  > > +     alf1 <- alphabetFrequency(seqa, collapse=TRUE, freq=TRUE)
>  > > + }, gcFirst=TRUE)
>  > >    user  system elapsed
>  > >   0.612   0.000   0.612
>  > >> alf1
>  > >      A      C      G      T      M      R      W      S      Y      K
>  > > 0.2449 0.2119 0.2201 0.3030 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
>  > >      V      H      D      B      N      -
>  > > 0.0000 0.0000 0.0000 0.0000 0.0000 0.0201
>  > >
>  > > (bases are recored with IUPAC symbols, see ?IUPAC_CODE_MAP). This
>  > > executes very efficiently. A variant produces a matrix with rows
>  > > corresponding to reads and columns to bases.
>  > >
>  > >> alf <- alphabetFrequency(seqa, baseOnly=TRUE)
>  > >> dim(alf)
>  > > [1] 9477528       5
>  > >> head(alf)
>  > >       A  C  G  T other
>  > > [1,]  9  4  6 17     0
>  > > [2,] 11  6  5 14     0
>  > > [3,] 10  9  4 13     0
>  > > [4,]  4  9 12 11     0
>  > > [5,]  6  6 11 13     0
>  > > [6,] 12 10  4 10     0
>  > >
>  > > This can be remarkably useful. For instance, to select just the 
>  > > 'clean'
>  > > sequences (those without ambiguous base calls), one can
>  > >
>  > >> cleanSeqs <- seqa[alf[,"other"]==0]
>  > >> length(seqa)
>  > > [1] 9477528
>  > >> length(cleanSeqs)
>  > > [1] 9207292
>  > >
>  > > This creates a new DNAStringSet with just the clean sequences. It
>  > > executes very quickly, because the DNAStringSet is a view into the
>  > > original. The memory associated with the reads themselves is not 
>  > > copied.
>  > > here is the alphabetFrequency of the 'clean' reads.
>  > >
>  > >> cleanAlf <- alphabetFrequency(cleanSeqs, baseOnly=TRUE)
>  > >
>  > > Again this is very useful, for instance the distribution of GC content
>  > > among clean reads is
>  > >
>  > >> plot(density(rowSums(cleanAlf[,c("G", "C")]) / rowSums(cleanAlf)))
>  > >
>  > >
>  > > * PDict, countPDict, matchPDict
>  > >
>  > > A 'PDict' (defined in Biostrings) is a dictionary-like structure that
>  > > can be used for very efficient exact- and partially-exact matching
>  > > algorithms. To illustrate, we'll use data from about a million reads 
>  > > of
>  > > the Solexa BAC cloning vector. These reads again come from an early 
>  > > run
>  > > on the Solexa instrumentation here, and results should not be taken to
>  > > be representative of performance.
>  > >
>  > > We read and clean the sequences as above, resulting in
>  > >
>  > >> length(cleanSeqs)
>  > > [1] 923680
>  > >
>  > > We then create a PDict from our DNAStringSet with
>  > >
>  > >> system.time({
>  > > +     pDict <- PDict(cleanSeqs)
>  > > + }, gcFirst=TRUE)
>  > >    user  system elapsed
>  > >    1.09    0.00    1.10
>  > >> pDict
>  > > 923680-pattern constant width PDict object of width 25 (patterns 
>  > > have no
>  > > names)
>  > >> mb(object.size(pDict))
>  > > [1] 160.4 MB
>  > >
>  > > This is created quickly. It is a larger object, but the size allows 
>  > > fast
>  > > searches. Here we'll use Biostrings readFASTA to read in the 
>  > > sequence to
>  > > which the data are to be aligned.
>  > >
>  > >> bac <- read.DNAStringSet(bacFile, "fasta")[[1]]
>  > > Read 2479 items
>  > >> length(bac)
>  > > [1] 173427
>  > >
>  > > This is a BAC clone. We'll match our pDict to the BAC subject, finding
>  > > all EXACT matches;
>  > >
>  > >> system.time({
>  > > +     counts <- countPDict(pDict, bac)
>  > > + }, gcFirst=FALSE)
>  > >    user  system elapsed
>  > >   0.200   0.048   0.268
>  > >> length(counts)
>  > > [1] 923680
>  > >> table(counts)[1:5]/sum(table(counts))
>  > > counts
>  > >       0       1       2       3       4
>  > > 0.53954 0.42738 0.01136 0.00528 0.00334
>  > >
>  > > This is very fast, partly because the subject against which the 
>  > > PDict is
>  > > being matched is short. A more realistic use case is to match 
>  > > against a
>  > > genome. The integration with BSgenome packages is very smooth. To 
>  > > match
>  > > our pDict against human chromosome 6 (where the BAC cloning vector 
>  > > comes
>  > > from), we can load the appropriate BSgenome package and chromosome
>  > >
>  > >> library(BSgenome.Hsapiens.UCSC.hg18)
>  > >
>  > >> Hsapiens[["chr6"]]
>  > >   170899992-letter "DNAString" instance
>  > > seq:
>  > > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>  > > ...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>  > >
>  > > (The N's represent the chromsome telomeres, whose seuqences is not
>  > > available). We look for exact matches
>  > >
>  > >> system.time({
>  > > +     hcount <- countPDict(pDict, Hsapiens[["chr6"]])
>  > > + }, gcFirst=TRUE)
>  > >    user  system elapsed
>  > >    24.2     0.0    24.2
>  > >> table(hcount)[1:5] / sum(table(hcount))
>  > > hcount
>  > >       0       1       2       3       4
>  > > 0.50466 0.32996 0.02043 0.00959 0.00761
>  > >
>  > > About 1/2 the sequences exactly match one or more locations on
>  > > chromosome 6. Some sequences match many times, though the reason (in
>  > > this case, anyway) is not too surprising:
>  > >
>  > >> max(hcount)
>  > > [1] 8286
>  > >> maxIdx = which(hcount==max(hcount))
>  > >> cleanSeqs[maxIdx]
>  > >   A DNAStringSet instance of length 70
>  > >      width seq
>  > >  [1]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [2]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [3]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [4]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [5]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [6]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [7]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [8]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  [9]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >  ...   ... ...
>  > > [62]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [63]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [64]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [65]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [66]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [67]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [68]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [69]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > > [70]    25 TTTTTTTTTTTTTTTTTTTTTTTTT
>  > >
>  > > (that these are identical can be checked with
>  > > unique(as.character(cleanSeqs[maxIdx])))
>  > >
>  > > The current implementation of PDict allows for a 'trusted band' of
>  > > nucleotides that need to match exactly, allowing for approximate 
>  > > matches
>  > > in the remaining nucleotides. Here we trust the first 12 bases, and
>  > > allow up to 3 mismatches. Also, we use the more informative 
>  > > matchPDict,
>  > > which allows us to find the positions of all matches
>  > >
>  > >> trusted <- PDict(cleanSeqs, tb.end=12)
>  > >> system.time({
>  > > +     mmMatch <- matchPDict(trusted, Hsapiens[[6]], max.mismatch=3)
>  > > + }, gcFirst=TRUE)
>  > >    user  system elapsed
>  > >  195.58    3.85  199.44
>  > >> table(cut(countIndex(mmMatch), c(0, 10^(0:5)), right=FALSE))
>  > >
>  > >         [0,1)        [1,10)      [10,100)   [100,1e+03) [1e+03,1e+04)
>  > >        377233        337806         72983         66322         60818
>  > > [1e+04,1e+05)
>  > >          8518
>  > >
>  > > Execution time increases as the stringency of the match decreases. 
>  > > PDict
>  > > facilities do not yet incorporate quality scores, but filtering 
>  > > results
>  > > based on quality (qualities are discussed further below) represents a
>  > > natural direction for development.
>  > >
>  > >
>  > >
>  > > * alphabetByCycle
>  > >
>  > > alphabetByCycle uses a small C function in BiostringsCinterfaceDemo,
>  > > alphabet_by_cycle. It and the read* functions in this package 
>  > > illustrate
>  > > how to access DNAStringSet objects at the C level. alphabetByCycle 
>  > > is a
>  > > matrix that tallies nucleotide use per cycle
>  > >
>  > >> system.time({
>  > > +     abc <- alphabetByCycle(seqa)
>  > > + })
>  > >    user  system elapsed
>  > >    2.68    0.00    2.68
>  > >
>  > > Again this can be quite useful. For instance, we can find out the 
>  > > number
>  > > of bases that were not called, as a function of cycle
>  > >
>  > >> abc["-",]
>  > >  [1]  22181 108829 123173 180382 225091 225055 216787 208538 208881
>  > > [10] 213104 148936 142966 141030 148163 178747 204304 211538 211303
>  > > [19] 213722 211167 208021 208715 165441 158359 147110 151462 204781
>  > > [28] 221008 223922 221171 227622 226936 232232 236606 242387 241718
>  > >
>  > > and the number of 'T' nucleotides as a function of cycle
>  > >
>  > >> abc["T",] / colSums(abc[1:4,])
>  > >  [1] 0.286 0.292 0.292 0.284 0.292 0.280 0.293 0.297 0.299 0.287 0.290
>  > > [12] 0.294 0.294 0.299 0.300 0.301 0.304 0.305 0.306 0.309 0.310 0.311
>  > > [23] 0.312 0.316 0.315 0.319 0.320 0.322 0.326 0.328 0.331 0.335 0.339
>  > > [34] 0.342 0.346 0.358
>  > >
>  > > That's quite a striking increase after cycle 25!
>  > >
>  > >
>  > > * Qualities
>  > >
>  > > Solexa reads have quality scores associated with each base call. These
>  > > are summarized in files formatted like:
>  > >
>  > >> readLines(fastqFile, n=8)
>  > > [1] "@HWI-EAS88_1_1_1_1001_499"
>  > > [2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT"
>  > > [3] "+HWI-EAS88_1_1_1_1001_499"
>  > > [4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"
>  > > [5] "@HWI-EAS88_1_1_1_898_392"
>  > > [6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC"
>  > > [7] "+HWI-EAS88_1_1_1_898_392"
>  > > [8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK"
>  > >
>  > > A record consists of four lines. The first and third lines are
>  > > identifiers (repeated), the second line is the sequence, the fourth 
>  > > line
>  > > an ASCII character representing the score (']' is good, Z is better 
>  > > than
>  > > A). Here we read a quality file into a data structure defined in
>  > > BiostringsCinterfaceDemo designed to coordinate the sequence, name, 
>  > > and
>  > > quality information.
>  > >
>  > >> system.time({
>  > > +     seqq <- readSolexaFastQ(fastqFile)
>  > > + }, gcFirst=TRUE)
>  > >    user  system elapsed
>  > >  17.533   0.417  17.969
>  > >> mb(object.size(seqq))
>  > > [1] 254.5 MB
>  > >> seqq
>  > > class: SolexaSequenceQ
>  > > length: 2218237
>  > >
>  > >> sequences(seqq)[1:5]
>  > >   A DNAStringSet instance of length 5
>  > >     width seq
>  > > [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
>  > > [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
>  > > [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
>  > > [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
>  > > [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
>  > >> names(seqq)[1:5]
>  > >   A BStringSet instance of length 5
>  > >     width seq
>  > > [1]    24 HWI-EAS88_1_1_1_1001_499
>  > > [2]    23 HWI-EAS88_1_1_1_898_392
>  > > [3]    23 HWI-EAS88_1_1_1_922_465
>  > > [4]    23 HWI-EAS88_1_1_1_895_493
>  > > [5]    23 HWI-EAS88_1_1_1_953_493
>  > >> scores(seqq)[1:5]
>  > >   A BStringSet instance of length 5
>  > >     width seq
>  > > [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
>  > > [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
>  > > [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
>  > > [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
>  > > [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
>  > >
>  > > The object returned by readSolexaFastQ is an S4 object with three 
>  > > slots.
>  > > Each slot contains a XStringSet, where 'X' is DNA for sequences, and
>  > > 'BString' for names and scores. This represents one way of structuring
>  > > quality data; the S4 class coordinates subsetting, and provides
>  > > (read-only) accessors to the underlying objects. A likely addition to
>  > > this class as it matures is the inclusion of lane-specific phenotype
>  > > (sample) information, much as an ExpressionSet coordinates sample and
>  > > expression values.
>  > >
>  > > We can gain some basic insight into the sequences as before, e.g.,
>  > >
>  > >> abc <- alphabetByCycle(sequences(seqq))
>  > >> abc["N",]
>  > >  [1]     0     0     0     0     0     0     0     0     0     0     0
>  > > [12]     0  1213  1631  1155  1240   721   418  8503   526  6493   703
>  > > [23] 14999   718  1623   737   243 40704   811   590  1964   961   809
>  > > [34]   910   477   208
>  > >
>  > > Solexa provides files with quality information after filtering reads
>  > > based on 'purity', a measure that precludes uncertain bases (IUPAC 
>  > > code
>  > > 'N') from a user-specified region (the first 12 cycles, by default).
>  > >
>  > >> abc["T",] / colSums(abc[1:4,])
>  > >  [1] 0.298 0.290 0.292 0.290 0.295 0.279 0.292 0.298 0.301 0.293 0.295
>  > > [12] 0.299 0.296 0.293 0.294 0.294 0.293 0.295 0.294 0.297 0.297 0.299
>  > > [23] 0.298 0.303 0.302 0.307 0.312 0.312 0.321 0.331 0.334 0.351 0.361
>  > > [34] 0.374 0.380 0.423
>  > >
>  > > We can also summarize quality information by cycle, using an alphabet
>  > > that reflects the encoded scores:
>  > >
>  > >> alphabet <- sapply(as.raw(32:93), rawToChar)
>  > >> abcScore <- alphabetByCycle(scores(seqq), alphabet=alphabet)
>  > >>
>  > >> rowSums(abcScore)
>  > >                 !        "        #        $        %        
>  > > &        '
>  > >        0        0        0        0        0        0        
>  > > 0        0
>  > >        (        )        *        +        ,        
>  > > -        .        /
>  > >        0        0        0        0        0        0        
>  > > 0        0
>  > >        0        1        2        3        4        5        
>  > > 6        7
>  > >        0        0        0        0        0        0        
>  > > 0        0
>  > >        8        9        :        ;        <        =        
>  > > >        ?
>  > >        0        0        0        0        0        0        
>  > > 0        0
>  > >        @        A        B        C        D        E        
>  > > F        G
>  > >        0  1367337        0  2035064        0  1006372   
>  > > 819775        0
>  > >        H        I        J        K        L        M        
>  > > N        O
>  > >  1926112   116395  1614109   356496   517731  1602298   578172  
>  > > 2016009
>  > >        P        Q        R        S        T        U        
>  > > V        W
>  > >  3043393   401196  2152160  1553511  2149231   254108  3896558   
>  > > 232409
>  > >        X        Y        Z        [       \\        ]
>  > >   706452  4353706  1292309   732210        0 45133419
>  > >> abcScore[34:62,c(1:4, 33:36)]
>  > >       [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]
>  > > A        1    1788    1798       8 260944 299631 332500 354717
>  > > B        0       0       0       0      0      0      0      0
>  > > C       59   36426   12705     227     39 131268 140461 154643
>  > > D        0       0       0       0      0      0      0      0
>  > > E      133   13391    4033     180 120135      0      0      0
>  > > F        0       0       0       0      0 259549 272238 287988
>  > > G        0       0       0       0      0      0      0      0
>  > > H      272    8815    2933     357 242741 234088 239917 239157
>  > > I        0       0       0       0 116395      0      0      0
>  > > J      472    4678    1656     586      0 195991 198114 187535
>  > > K        5      59      25      11 110052  84345  83744  77353
>  > > L        0       0       0       0 102438 144459 142049 128785
>  > > M       75     171     121      92  93945 119209 116492 103765
>  > > N      653    2992    1232     769  85940  91924  89084  79246
>  > > O     1227    2719    1754    1576 175403 187339 176997 160100
>  > > P    65834   57830   60855   65863  61910      0      0      0
>  > > Q        0       0       0       0 236290      0      0      0
>  > > R     3211    4482    4321    4215      0      0      0      0
>  > > S        0       0       0       0  68378 470434 426641 444948
>  > > T     4444    5373    5868    5637      0      0      0      0
>  > > U        0       0       0       0      0      0      0      0
>  > > V     9146   10270   11922   11547 543627      0      0      0
>  > > W        0       0       0       0      0      0      0      0
>  > > X        0       0       0       0      0      0      0      0
>  > > Y    28363   29080   34093   32873      0      0      0      0
>  > > Z        0       0       0       0      0      0      0      0
>  > > [        0       0       0       0      0      0      0      0
>  > > \\       0       0       0       0      0      0      0      0
>  > > ]  2104342 2040163 2074921 2094296      0      0      0      0
>  > >
>  > > Output from the last line shows how scores decrease from the first 
>  > > four
>  > > cycles to the last four. Standard R and Biostrings commands can be 
>  > > used
>  > > to ask many interesting questions, such as an overall quality score of
>  > > reads (e.g., summing the scores of individual nucleotides) and the
>  > > relationship between sequence characteristics (e.g., frequency of 'T')
>  > > and read quality.
>  > >
>  > >> atgc <- alphabetFrequency(sequences(seqq), baseOnly=TRUE)
>  > >> qscore <- alphabetFrequency(scores(seqq))
>  > >> dim(qscore)
>  > > [1] 2218237     256
>  > >> mb(object.size(qscore))
>  > > [1] 2166.2 MB
>  > >
>  > >> quality <- colSums(t(qscore) * 1:ncol(qscore))
>  > >> plot(density(quality))   # small secondary peak at low quality
>  > >> scores(seqq)[which(quality<2925)]
>  > >   A BStringSet instance of length 87696
>  > >         width seq
>  > >     [1]    36 PPPPPPPPPPPPPEPPPPPOPPMOOPPOMMMPOOOJ
>  > >     [2]    36 PPPPPPPPPPPPPPPPPPPPHPPPOPPMOPPPNKMA
>  > >     [3]    36 PPPPPPPPPPPPPPPPPOPPPPPPEPMPPPMPOFAF
>  > >     [4]    36 PPPPPPPPPPPPPPPPPPPPOPPPPPPPPPOPOOHK
>  > >     [5]    36 PPPPPPPPPPPPPPPPPPPPPPPOPPOPOPPMKMLF
>  > >     [6]    36 PPPPPPPPPPPPPPPPOPPPPPCPPPPHOOPPOOOO
>  > >     [7]    36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOKO
>  > >     [8]    36 PPPPPPPPPPPPPPPPPPPOPPPPPPPPJPOPOMAO
>  > >     [9]    36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOOL
>  > >     ...   ... ...
>  > > [87688]    36 PPPPPPPPMPOPCOOCOOJEOMMCCOEEHCCMAAAA
>  > > [87689]    36 PTYOR]NTVVVJOHJCCPMEHCMHOJCECCCCAACA
>  > > [87690]    36 Y]]]NTVYYT]TPNRPNRYHCTECOHCHHCECHAAC
>  > > [87691]    36 YPRVNVYVPYHORCOCOPEPECJCCHCCJECHAAAA
>  > > [87692]    36 PPPPPPPPPPPPPHPPOCPOPPOPCHMMPPPEKHOO
>  > > [87693]    36 JJHPTTJYV]]]]JJCJCJJTCCOVCMHMCCOAAAA
>  > > [87694]    36 PPPPPPPPPPPPPPPPPOPPPJPPPCPOHHCMOCAF
>  > > [87695]    36 TYR]R]TN]YOEPTNERPPTVTCTVCCCRHOMIFAC
>  > > [87696]    36 YRRYJVVTPHVPPECHPNCMCCJCEECOCCCCAAAA
>  > >>
>  > >> t <- atgc[,"T"] / rowSums(atgc[,1:4])
>  > >> cor(t, quality)
>  > > [1] 0.154
>  > >
>  > > All of these operations are quick enough to perform in an interactive
>  > > session; the qscore is a large matrix (it can be made smaller by
>  > > choosing bounds that reflect allowable scores, e.g., 32:127), and its
>  > > transposition is relatively expensive.
>  > >
>  > > A final point to remember is that R stores a matrix m as a vector of
>  > > length nrow(m) * ncol(m). R has an internal limit on the size of a
>  > > vector equal to 2^32-1, so the maximum number of reads whose scores 
>  > > can
>  > > be represented by alphabetFrequency is less than 2^32 / 256, or 
>  > > about 16
>  > > million reads; this number of reads might be approached in a single
>  > > Solexa lane; a simple solution is to divide the DNAStringSet into 
>  > > pieces
>  > > that are processed separately.
>  > >
>  > >
>  > >
>  > > I hope that the forgoing provides some indication of where we stand at
>  > > the moment. Again, it would be great to have feedback, and to hear of
>  > > other efforts. And again, the programming credit goes to Herve Pages.
>  > >
>  > > Martin
>  > >
>  > > The obligatory sessionInfo, plus some stats on processing this 
>  > > document
>  > > (referencing, in the end, three different data sets).
>  > >
>  > >> sessionInfo()
>  > > R version 2.7.0 alpha (2008-03-28 r44972)
>  > > x86_64-unknown-linux-gnu
>  > >
>  > > locale:
>  > > LC_CTYPE
>  > > =
>  > > en_US
>  > > .UTF
>  > > -8
>  > > ;LC_NUMERIC
>  > > =
>  > > C
>  > > ;LC_TIME
>  > > =
>  > > en_US
>  > > .UTF
>  > > -8
>  > > ;LC_COLLATE
>  > > =
>  > > en_US
>  > > .UTF
>  > > -8
>  > > ;LC_MONETARY
>  > > =
>  > > C
>  > > ;LC_MESSAGES
>  > > =
>  > > en_US
>  > > .UTF
>  > > -8
>  > > ;LC_PAPER
>  > > =
>  > > en_US
>  > > .UTF
>  > > -8
>  > > ;LC_NAME
>  > > =
>  > > C
>  > > ;LC_ADDRESS
>  > > =C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>  > >
>  > > attached base packages:
>  > > [1] tools     stats     graphics  grDevices utils     datasets
>  > > [7] methods   base
>  > >
>  > > other attached packages:
>  > > [1] BiostringsCinterfaceDemo_0.1.2
>  > > [2] BSgenome.Hsapiens.UCSC.hg18_1.3.2
>  > > [3] BSgenome_1.7.4
>  > > [4] Biostrings_2.7.41
>  > > [5] Biobase_1.99.4
>  > >> gc()
>  > >            used   (Mb) gc trigger  (Mb) max used  (Mb)
>  > > Ncells 9.37e+05   50.1   4.87e+06   260 1.93e+07  1033
>  > > Vcells 6.88e+08 5252.6   1.31e+09 10031 1.31e+09  9998
>  > >> proc.time()
>  > >    user  system elapsed
>  > >   356.7    16.6   378.5
>  > >
>  > > _______________________________________________
>  > > Bioc-sig-sequencing mailing list
>  > > Bioc-sig-sequencing at r-project.org
>  > > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>  >
>  > _______________________________________________
>  > Bioc-sig-sequencing mailing list
>  > Bioc-sig-sequencing at r-project.org
>  > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>  >
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> **********************************************************************
> 
> This email and any files transmitted with it are confidential and
> 
> intended solely for the use of the individual or entity to whom they
> 
> are addressed. If you have received this email in error please notify
> 
> the system manager (it.support at wibr.ucl.ac.uk). All files are scanned 
> for viruses.
> 
> **********************************************************************
> 
>  
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793