[Bioc-devel] Solexa-ChipSeq

Martin Morgan mtmorgan at fhcrc.org
Mon Nov 5 15:51:33 CET 2007


"Jose M Muino" <jmui at igr.poznan.pl> writes:
> Hi everyone,

Hi Chema --

I have been working on a package over the last couple of weeks. It
tries to provide infrastructure to analyze Solexa data, focusing
mostly on QA issues at the moment. The basic idea is to write
functions that operate on per-tile or per-lane files produced by the
Solexa pipeline. The results of these functions are then summarized
over the entire experiment. Roughly, it works like sapply(X, FUN,
...), where FUN is the per-tile function and the 's' in sapply is the
summary across tile files.

For instance, I wanted to determine the X and Y coordinate of every
sequence called by Bustard, the Solexa base caller. I wrote a function
to extract this information from a tile file of the appropriate type

.qaSeq_xyCoords <- function (fp, params, lane, tile, ..., verbose) 
{
    res <- scan(fp,
                what = list(Lane=NULL, Tile=NULL, X=1L, Y=1L, NULL))
    res$Lane <- lane
    res$Tile <- tile
    res[1:4]
}

and another function to invoke the function on all files, and
summarize the result

.qaSeq <- function (sbRunParams, qaFunction, ..., verbose = FALSE) 
{
    result <- .qaScrape( # visit each tile file, and apply qaFunction
    xyCoords <- ... # summarize result across tiles, rbind-like
    ... # return an S4 object representing the result
}

These parts plug into the infrastructure I have, so that I can,
without too much more effort, do the following

> fl <- "/path/to/pipeline/output"
> v <- SBView(fl) # a 'view' of the pipeline results
> v
runsAvail:
1. ImageAnalysis: C1-36_Firecrest1.8.28_01-10-2007_rbasom
  1. BaseCallAnalysis: Bustard1.8.28_01-10-2007_rbasom
    1. AdHocAnalysis: GERALD_01-10-2007_rbasom
> v[1][[1]] # Bustard, the base caller; 4 file types, 2 tiles
filePath: ...1-10-2007_rbasom/Bustard1.8.28_01-10-2007_rbasom
analysisType: BaseCallAnalysis
lanes fileNames:
  lanes:
tiles fileNames: prb, qhg, seq, sig2
  lanes: 1
  tiles: 1 70 (2 total)
runsAvail: GERALD_01-10-2007_rbasom
> r <- qa(v[1][[1]], qaSeq) # apply 'qaSeq' to the Bustard view
> r
QAseq 
runParameters: runParameters(object)
lanes: 1 (1 total)
tiles: 1 70 (2 total)
xyCoords dimensions: 39923 x 4
xyBin dimensions: 800 x 5
seqCount: 39923
> head(xyCoords(r))
  Lane Tile    X   Y
1    1    1 1001 499
2    1    1  774 665
3    1    1  898 392
4    1    1  922 465
5    1    1  947 451
6    1    1  895 493

The package is in quite a state of flux, but I could make it
available to you (and others) if you're interested.

> Hi everyone,
>
> This is my first mail to this mailing-list.
>
> I am beginning to write a R script in order to analyze some Solexa  
> experiments (Chip-Seq) that we already have done in the lab. The aim of  
> Chip-Seq experiments is to find DNA regions which have an  
> overrepresentation of reads depending of the condition.
>
> I would like to know if there is some other people doing this in  
> bioconductor, and if there is some special format that I should use, or if  
> I can help in something.
>
> About the format I was thinking to use a Perl script, or a modification of  
> the grep algorithm to locate each read, and after to use R to do the

Not sure exactly whether you mean actually aligning the reads to a
sequence, or just parsing files produced by an alignment program to
extract the relevant information. If the later, then R is probably as
effective as perl. For actual alignment, the magnitude of the data and
inability to model sequencing error makes grep not practical. Options
seem to be PhageAlign (Solexa, small genome, includes base call error
model, impractical for large genomes), ELAND (Solexa; max 2bp
mismatch, no error model), MAQ
(http://sourceforge.net/project/showfiles.php?group_id=191815&package_id=229423),
PolyBayesShort (http://bioinformatics.bc.edu/marthlab/Beta_Release),
or Bioconductor Biostrings (might have acceptable performance for
exact matches, though haven't really tried).

One possible advantage of doing as much analysis on tile files, before
summarizing the entire experiment, is that the tile files can be
divided amon several processors.

Martin

> statistical analysis on the positions obtained with the Perl script.
>
> Thanks for your time,
> Chema
>
>
> -- 
>
> --------------------------------------
> ALTERNATIVE EMAIL: josem.muinyo at cmb.udl.es
>   Jose M Muino
>   Institute of Plant Genetics
>   Polish Academy of Science
>   Strzeszynska 34
>   61-479 Poznan
>   Poland
>   tel. 48 61 6550254
>
> _______________________________________________
> Bioc-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list