[Bioc-sig-seq] Introducing the ShortRead package

Martin Morgan mtmorgan at fhcrc.org
Thu May 29 15:46:52 CEST 2008


Short readers!

I want to introduce the ShortRead package for high throughput
sequencing. It is available using biocLite with a development version
of R, or via svn. It is still under active development, but useful
anyway. Here are some of the main functions:

readXStringColumns, readFastq, readAligned: these functions read
sequence data into R objects. The XStringColumns variant is the
building block, reading one or more columns of sequence, quality
score, or other data into the corresponding XStringSet (from
Biostrings) object. readFastq reads fastq-style (sequence + quality)
files; readAligned reads alignment files (currently Solexa 'export'
and MAQ 'mapview'; MAQ binary soon).
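
A minimal sketch of reading a fastq-style file, assuming ShortRead is
installed. The two-read toy file, its name, and the temporary
directory are made up for illustration; the call assumes readFastq
takes a directory plus a file-name pattern.

```r
library(ShortRead)  # assumption: ShortRead and Biostrings are installed

# Write a two-read fastq file to a temporary directory.
# The reads and quality strings are toy data invented for this sketch.
demo_dir <- tempfile("fastq_demo")
dir.create(demo_dir)
writeLines(c(
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "TTGCATGC", "+", "IIIIHHHH"
), file.path(demo_dir, "toy_sequence.txt"))

# The first argument names the directory; 'pattern' selects files in it.
fq <- readFastq(demo_dir, pattern = "toy_sequence.txt")

length(fq)  # number of reads found in the file
```

readAligned is called in a similar way on alignment files, with the
file type given as an additional argument.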

SolexaPath, SolexaSet: these are functions and classes to help with
Solexa data. SolexaPath provides a convenient way to navigate the
file hierarchy created by a Solexa run. SolexaSet is like an
ExpressionSet, coordinating sequence data with the phenotype
description of the samples. The SolexaSet class is still a work in
progress.

alphabetByCycle, srorder, srduplicated, srsort and additional
functions provide basic tools for exploring XStringSet objects. For
instance, alphabetByCycle can be used to summarize nucleotide
frequency or quality score by cycle; several of the data sets I've
looked at show surprising patterns that trace back to quality control
issues of one sort or another.
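
As a rough illustration of these exploratory tools, the sketch below
applies them to three made-up reads held in a DNAStringSet. It
assumes ShortRead is installed and that alphabetByCycle returns a
matrix with one row per letter and one column per cycle; the reads
themselves are hypothetical.

```r
library(ShortRead)  # assumption: ShortRead and Biostrings are installed

# Three toy reads of equal length (invented for this sketch);
# the first and third are identical.
reads <- DNAStringSet(c("ACGT", "ACGA", "ACGT"))

# Count how often each letter occurs at each cycle (read position).
abc <- alphabetByCycle(reads)
abc[c("A", "C", "G", "T"), ]  # keep only the four nucleotide rows

# Flag reads that repeat an earlier read, then sort the reads.
dups <- srduplicated(reads)
sorted <- srsort(reads)
```

Applying alphabetByCycle to the quality scores instead of the reads
gives the analogous per-cycle summary of quality.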

All of the objects are intended to be created with constructors (e.g.,
the read* functions, or SolexaPath() and the like) rather than
explicit calls to 'new'. There are accessors (e.g., sread() and
quality(), which extract the reads and quality scores) and other basic manipulations
(e.g., subset operations) that coordinate different components of the
object.
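
A short sketch of the constructor / accessor pattern, again assuming
ShortRead is installed. The toy fastq file is invented for the
example; the point is that the object comes from a read* function
rather than 'new', and that subsetting keeps reads and qualities in
step.

```r
library(ShortRead)  # assumption: ShortRead and Biostrings are installed

# Toy fastq file in a temporary directory (hypothetical data).
demo_dir <- tempfile("accessor_demo")
dir.create(demo_dir)
writeLines(c(
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "TTGCATGC", "+", "IIIIHHHH"
), file.path(demo_dir, "toy_sequence.txt"))

# Create the object with a read* function, not with new().
fq <- readFastq(demo_dir, pattern = "toy_sequence.txt")

# Accessors extract the components.
sread(fq)    # the reads
quality(fq)  # the quality scores

# Subsetting the object subsets reads and qualities together.
second <- fq[2]
sread(second)
quality(second)
```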

The package is still very much in development. Complete man pages
usually indicate a relatively stable structure or functionality; all
of the functions and classes mentioned above have man pages.

Planned directions include a 'qa' suite for Solexa data, a MAQ binary
file parser (thanks to Simon Anders), further tools for efficiently
manipulating and representing these objects (generally, 32-bit users
will be frustrated by the current generation, especially in
downstream analysis), and useful functionality for qa and exploratory
assessment. A vignette is also in the works, to provide common
workflows and more detail on package use.

I'm eager to hear your feedback, and would be happy to incorporate
additions that you might have been working on for your own purposes --
the current Solexa emphasis reflects the data we have most ready
access to, but ShortRead is meant as an entry point for many of the
high throughput technologies.

Martin
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


