[Bioc-sig-seq] Standard 454 quality checks?

Mon May 4 23:26:12 CEST 2009

What if there were only one generic and completely
technology-independent function for the import of sequences and quality
data generated by Solexa/Illumina, 454/Roche, traditional Sanger, and
perhaps some other technologies? Aren't we dealing in all these cases
with essentially the same two sequence formats: FASTQ or FASTA (one for 
the seq and one for qual). The specific information about the type of 
ASCII conversion and methods used for computing quality scores could 
be provided in an argument. All our data from Solexa, 454 and Sanger
instruments could be handled this way.

Thomas

On Mon, May 04, 2009 at 09:11:14AM -0700, Patrick Aboyoun wrote:
> Michael,
> I think we need to reconcile interfaces because ShortRead's 
> readBaseQuality function is intended to perform the same type of 
> activity as readFastaQual. The only difference was that it was looking 
> at Solexa, rather than 454, data.
> 
> 
> Patrick
> 
> 
> Michael Lawrence wrote:
> >On Mon, May 4, 2009 at 2:19 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
> >
> >  
> >>2009/4/23 Martin Morgan <mtmorgan at fhcrc.org>:
> >>    
> >>>Dan Bolser <dan.bolser at gmail.com> writes:
> >>>
> >>>      
> >>>>Sorry for the noob question, but is there a set of standard quality
> >>>>checks in R that I can run over some 454 data? I have the fasta and
> >>>>the fasta format quality files as well as an sff. I scanned the manual
> >>>>for the ShortReads package, but it seems focused on Illumina, I
> >>>>couldn't pick out the general bits from the specifics.
> >>>>        
> >>>Hi Dan --
> >>>
> >>>You might be on somewhat uncharted territory here; most of our
> >>>experience (though we have some 454 data now) is with Solexa.
> >>>
> >>>I don't think the standard QA pipeline, along the lines of
> >>>report(qa(<...>)), will work at the moment, but I'll try to add that
> >>>today.
> >>>
> >>>You should be able to read the fasta and quality scores with
> >>>read454(). This returns a 'ShortReadQ' object, srq, that bundles the
> >>>reads, their quality scores, and their ids.
> >>>      
> >>Hi Martin, thanks for the info on this. I'm have only just got round
> >>to looking at this but I'm a little bit confused about how to read in
> >>the reads / qualities.
> >>
> >>I had expected to say something like:
> >>
> >>x <- read454(fasta = "my.fas", qual.fasta = "my.qual")
> >>
> >>
> >>or
> >>
> >>x <- read454(srf = "my.srf")
> >>
> >>
> >>or
> >>
> >>x <- read454(fastq = "my.fastq")
> >>
> >>
> >>
> >>However, this is clearly not correct. "?read454" brings up the
> >>"RochePath-class" page that suggests I try something like:
> >>
> >>x <- RochePath(readPath="./")
> >>y<- read454(x)
> >>
> >>
> >>But that fails too... "no input files found / pattern: \.fna$". I
> >>tried to set pattern somewhere (to match the ".fas" / and ".qual"
> >>patterns of my files), but I couldn't seem to set them anywhere.
> >>
> >>Can you help with an example of what I am doing wrong?
> >>
> >>    
> >
> >
> >Thanks for this great feedback. One way to achieve what you want would be 
> >to
> >call readFasta and readQual separately for each file, and then combine the
> >results into a ShortReadQ. In the devel version of ShortRead (it might take
> >a day to propagate), I just renamed read454 to readFastaQual and added
> >separate pattern arguments for the fasta and qual file, so you could do:
> >srq <- readFastaQual(".", "my.fas", "my.qual")
> >
> >How does that sound? I think that this interface could be made more
> >convenient (such as having separate pattern arguments for the extensions 
> >and
> >then a "base" pattern), but I'm just trying to stay consistent with what is
> >ShortRead already.
> >
> >Michael
> >
> >
> >  
> >>How come I need to specify a directory and a pattern when what I
> >>really want to do is just to specify a file (because that is what I
> >>have)?
> >>
> >>
> >>Thanks for any help.
> >>
> >>All the best,
> >>Dan.
> >>
> >>
> >>
> >>    
> >>>The basic touch points of the qa report for read (i.e., not aligned)
> >>>data are numbers of reads, nucleotide frequencies
> >>>
> >>> alphabetFrequency(sread(srq), baseOnly=TRUE, collapse=TRUE)
> >>>
> >>>and cycle-specific alphabet frequencies and average quality scores
> >>>(use alphabetByCycle on sread(srq) and quality(srq)). For 454 it seems
> >>>like a simple plot of average quality score, along the lines of
> >>>alphabetScore(quality(srq)) / width(quality(srq)) against
> >>>width(quality(srq)) can also be quite insightful. There might be
> >>>issues where the functions expect / it makes sense to do analysis on
> >>>uniform-width reads, or on groups of uniformly-widthed reads.
> >>>
> >>>Sorry for the only limited help.
> >>>
> >>>Martin
> >>>
> >>>
> >>>      
> >>>>Thanks for any help,
> >>>>Dan.
> >>>>
> >>>>_______________________________________________
> >>>>Bioc-sig-sequencing mailing list
> >>>>Bioc-sig-sequencing at r-project.org
> >>>>https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>>>        
> >>>--
> >>>Martin Morgan
> >>>Computational Biology / Fred Hutchinson Cancer Research Center
> >>>1100 Fairview Ave. N.
> >>>PO Box 19024 Seattle, WA 98109
> >>>
> >>>Location: Arnold Building M1 B861
> >>>Phone: (206) 667-2793
> >>>
> >>>      
> >>_______________________________________________
> >>Bioc-sig-sequencing mailing list
> >>Bioc-sig-sequencing at r-project.org
> >>https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >>    
> >
> >	[[alternative HTML version deleted]]
> >
> >_______________________________________________
> >Bioc-sig-sequencing mailing list
> >Bioc-sig-sequencing at r-project.org
> >https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>