[Bioc-devel] Solexa-ChipSeq

Tue Nov 6 13:53:30 CET 2007

Hi Martin,

Thanks a lot for your comments; they were very useful to me.

I have to read the manuals about MAQ and PolyBayesShort, and I didn’t know  
these software.

Now, I am reading about SxOligoSearch (   
http://synasite.mgrc.com.my:8080/sxog/files/SXOligoSearch_benchmark.pdf ).  
This software allows alignment of Solexa reads to a genome, using the base  
probability information of the read in order to improve the score of the  
alignment. Also, they are able to improve the quality of the alignment  
using the information that a given 5’ read should match in a position near  
of a 3’ read match (in the correct orientation).

Have a nice day,
Chema

On Mon, 05 Nov 2007 15:51:33 +0100, Martin Morgan <mtmorgan at fhcrc.org>  
wrote:

> "Jose M Muino" <jmui at igr.poznan.pl> writes:
>> Hi everyone,
>
> Hi Chema --
>
> I have been working on a package over the last couple of weeks. It
> tries to provide infrastructure to analyze Solexa data, focusing
> mostly on QA issues at the moment. The basic idea is to write
> functions that operate on per-tile or per-lane files produced by the
> Solexa pipeline. The results of these functions are then summarized
> over the entire experiment. Roughly, it works like sapply(X, FUN,
> ...), where FUN is the per-tile function and the 's' in sapply is the
> summary across tile files.
>
> For instance, I wanted to determine the X and Y coordinate of every
> sequence called by Bustard, the Solexa base caller. I wrote a function
> to extract this information from a tile file of the appropriate type
>
> .qaSeq_xyCoords <- function (fp, params, lane, tile, ..., verbose)
> {
>     res <- scan(fp,
>                 what = list(Lane=NULL, Tile=NULL, X=1L, Y=1L, NULL))
>     res$Lane <- lane
>     res$Tile <- tile
>     res[1:4]
> }
>
> and another function to invoke the function on all files, and
> summarize the result
>
> .qaSeq <- function (sbRunParams, qaFunction, ..., verbose = FALSE)
> {
>     result <- .qaScrape( # visit each tile file, and apply qaFunction
>     xyCoords <- ... # summarize result across tiles, rbind-like
>     ... # return an S4 object representing the result
> }
>
> These parts plug into the infrastructure I have, so that I can,
> without too much more effort, do the following
>
>> fl <- "/path/to/pipeline/output"
>> v <- SBView(fl) # a 'view' of the pipeline results
>> v
> runsAvail:
> 1. ImageAnalysis: C1-36_Firecrest1.8.28_01-10-2007_rbasom
>   1. BaseCallAnalysis: Bustard1.8.28_01-10-2007_rbasom
>     1. AdHocAnalysis: GERALD_01-10-2007_rbasom
>> v[1][[1]] # Bustard, the base caller; 4 file types, 2 tiles
> filePath: ...1-10-2007_rbasom/Bustard1.8.28_01-10-2007_rbasom
> analysisType: BaseCallAnalysis
> lanes fileNames:
>   lanes:
> tiles fileNames: prb, qhg, seq, sig2
>   lanes: 1
>   tiles: 1 70 (2 total)
> runsAvail: GERALD_01-10-2007_rbasom
>> r <- qa(v[1][[1]], qaSeq) # apply 'qaSeq' to the Bustard view
>> r
> QAseq
> runParameters: runParameters(object)
> lanes: 1 (1 total)
> tiles: 1 70 (2 total)
> xyCoords dimensions: 39923 x 4
> xyBin dimensions: 800 x 5
> seqCount: 39923
>> head(xyCoords(r))
>   Lane Tile    X   Y
> 1    1    1 1001 499
> 2    1    1  774 665
> 3    1    1  898 392
> 4    1    1  922 465
> 5    1    1  947 451
> 6    1    1  895 493
>
> The package is in quite a state of flux, but I could make it
> available to you (and others) if you're interested.
>
>> Hi everyone,
>>
>> This is my first mail to this mailing-list.
>>
>> I am beginning to write a R script in order to analyze some Solexa
>> experiments (Chip-Seq) that we already have done in the lab. The aim of
>> Chip-Seq experiments is to find DNA regions which have an
>> overrepresentation of reads depending of the condition.
>>
>> I would like to know if there is some other people doing this in
>> bioconductor, and if there is some special format that I should use, or  
>> if
>> I can help in something.
>>
>> About the format I was thinking to use a Perl script, or a modification  
>> of
>> the grep algorithm to locate each read, and after to use R to do the
>
> Not sure exactly whether you mean actually aligning the reads to a
> sequence, or just parsing files produced by an alignment program to
> extract the relevant information. If the later, then R is probably as
> effective as perl. For actual alignment, the magnitude of the data and
> inability to model sequencing error makes grep not practical. Options
> seem to be PhageAlign (Solexa, small genome, includes base call error
> model, impractical for large genomes), ELAND (Solexa; max 2bp
> mismatch, no error model), MAQ
> (http://sourceforge.net/project/showfiles.php?group_id=191815&package_id=229423),
> PolyBayesShort (http://bioinformatics.bc.edu/marthlab/Beta_Release),
> or Bioconductor Biostrings (might have acceptable performance for
> exact matches, though haven't really tried).
>
> One possible advantage of doing as much analysis on tile files, before
> summarizing the entire experiment, is that the tile files can be
> divided amon several processors.
>
> Martin
>
>> statistical analysis on the positions obtained with the Perl script.
>>
>> Thanks for your time,
>> Chema
>>
>>
>> --
>>
>> --------------------------------------
>> ALTERNATIVE EMAIL: josem.muinyo at cmb.udl.es
>>   Jose M Muino
>>   Institute of Plant Genetics
>>   Polish Academy of Science
>>   Strzeszynska 34
>>   61-479 Poznan
>>   Poland
>>   tel. 48 61 6550254
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 

--------------------------------------
ALTERNATIVE EMAIL: josem.muinyo at cmb.udl.es
  Jose M Muino
  Institute of Plant Genetics
  Polish Academy of Science
  Strzeszynska 34
  61-479 Poznan
  Poland
  tel. 48 61 6550254