[Bioc-devel] Solexa-ChipSeq
Jose M Muino
jmui at igr.poznan.pl
Tue Nov 6 13:53:30 CET 2007
Hi Martin,
Thanks a lot for your comments; they were very useful to me.
I have to read the manuals about MAQ and PolyBayesShort, and I didn’t know
these software.
Now, I am reading about SxOligoSearch (
http://synasite.mgrc.com.my:8080/sxog/files/SXOligoSearch_benchmark.pdf ).
This software allows alignment of Solexa reads to a genome, using the base
probability information of the read in order to improve the score of the
alignment. Also, they are able to improve the quality of the alignment
using the information that a given 5’ read should match in a position near
of a 3’ read match (in the correct orientation).
Have a nice day,
Chema
On Mon, 05 Nov 2007 15:51:33 +0100, Martin Morgan <mtmorgan at fhcrc.org>
wrote:
> "Jose M Muino" <jmui at igr.poznan.pl> writes:
>> Hi everyone,
>
> Hi Chema --
>
> I have been working on a package over the last couple of weeks. It
> tries to provide infrastructure to analyze Solexa data, focusing
> mostly on QA issues at the moment. The basic idea is to write
> functions that operate on per-tile or per-lane files produced by the
> Solexa pipeline. The results of these functions are then summarized
> over the entire experiment. Roughly, it works like sapply(X, FUN,
> ...), where FUN is the per-tile function and the 's' in sapply is the
> summary across tile files.
>
> For instance, I wanted to determine the X and Y coordinate of every
> sequence called by Bustard, the Solexa base caller. I wrote a function
> to extract this information from a tile file of the appropriate type
>
> .qaSeq_xyCoords <- function (fp, params, lane, tile, ..., verbose)
> {
> res <- scan(fp,
> what = list(Lane=NULL, Tile=NULL, X=1L, Y=1L, NULL))
> res$Lane <- lane
> res$Tile <- tile
> res[1:4]
> }
>
> and another function to invoke the function on all files, and
> summarize the result
>
> .qaSeq <- function (sbRunParams, qaFunction, ..., verbose = FALSE)
> {
> result <- .qaScrape( # visit each tile file, and apply qaFunction
> xyCoords <- ... # summarize result across tiles, rbind-like
> ... # return an S4 object representing the result
> }
>
> These parts plug into the infrastructure I have, so that I can,
> without too much more effort, do the following
>
>> fl <- "/path/to/pipeline/output"
>> v <- SBView(fl) # a 'view' of the pipeline results
>> v
> runsAvail:
> 1. ImageAnalysis: C1-36_Firecrest1.8.28_01-10-2007_rbasom
> 1. BaseCallAnalysis: Bustard1.8.28_01-10-2007_rbasom
> 1. AdHocAnalysis: GERALD_01-10-2007_rbasom
>> v[1][[1]] # Bustard, the base caller; 4 file types, 2 tiles
> filePath: ...1-10-2007_rbasom/Bustard1.8.28_01-10-2007_rbasom
> analysisType: BaseCallAnalysis
> lanes fileNames:
> lanes:
> tiles fileNames: prb, qhg, seq, sig2
> lanes: 1
> tiles: 1 70 (2 total)
> runsAvail: GERALD_01-10-2007_rbasom
>> r <- qa(v[1][[1]], qaSeq) # apply 'qaSeq' to the Bustard view
>> r
> QAseq
> runParameters: runParameters(object)
> lanes: 1 (1 total)
> tiles: 1 70 (2 total)
> xyCoords dimensions: 39923 x 4
> xyBin dimensions: 800 x 5
> seqCount: 39923
>> head(xyCoords(r))
> Lane Tile X Y
> 1 1 1 1001 499
> 2 1 1 774 665
> 3 1 1 898 392
> 4 1 1 922 465
> 5 1 1 947 451
> 6 1 1 895 493
>
> The package is in quite a state of flux, but I could make it
> available to you (and others) if you're interested.
>
>> Hi everyone,
>>
>> This is my first mail to this mailing-list.
>>
>> I am beginning to write a R script in order to analyze some Solexa
>> experiments (Chip-Seq) that we already have done in the lab. The aim of
>> Chip-Seq experiments is to find DNA regions which have an
>> overrepresentation of reads depending of the condition.
>>
>> I would like to know if there is some other people doing this in
>> bioconductor, and if there is some special format that I should use, or
>> if
>> I can help in something.
>>
>> About the format I was thinking to use a Perl script, or a modification
>> of
>> the grep algorithm to locate each read, and after to use R to do the
>
> Not sure exactly whether you mean actually aligning the reads to a
> sequence, or just parsing files produced by an alignment program to
> extract the relevant information. If the later, then R is probably as
> effective as perl. For actual alignment, the magnitude of the data and
> inability to model sequencing error makes grep not practical. Options
> seem to be PhageAlign (Solexa, small genome, includes base call error
> model, impractical for large genomes), ELAND (Solexa; max 2bp
> mismatch, no error model), MAQ
> (http://sourceforge.net/project/showfiles.php?group_id=191815&package_id=229423),
> PolyBayesShort (http://bioinformatics.bc.edu/marthlab/Beta_Release),
> or Bioconductor Biostrings (might have acceptable performance for
> exact matches, though haven't really tried).
>
> One possible advantage of doing as much analysis on tile files, before
> summarizing the entire experiment, is that the tile files can be
> divided amon several processors.
>
> Martin
>
>> statistical analysis on the positions obtained with the Perl script.
>>
>> Thanks for your time,
>> Chema
>>
>>
>> --
>>
>> --------------------------------------
>> ALTERNATIVE EMAIL: josem.muinyo at cmb.udl.es
>> Jose M Muino
>> Institute of Plant Genetics
>> Polish Academy of Science
>> Strzeszynska 34
>> 61-479 Poznan
>> Poland
>> tel. 48 61 6550254
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
--------------------------------------
ALTERNATIVE EMAIL: josem.muinyo at cmb.udl.es
Jose M Muino
Institute of Plant Genetics
Polish Academy of Science
Strzeszynska 34
61-479 Poznan
Poland
tel. 48 61 6550254
More information about the Bioc-devel
mailing list