[Bioc-sig-seq] reading 454 data with ShortRead

Martin Morgan mtmorgan at fhcrc.org
Wed Aug 19 20:15:49 CEST 2009


Dan Bolser <dan.bolser at gmail.com> writes:

> OK, I got what I wanted with the following:
>
>
> library(ShortRead)

  rp <- RochePath("/foo/bar/baz")

populates readPath and qualPath with /foo/bar/baz.

  sr <- read454(rp)

will I think do what you want, returning a 'ShortReadQ' object (this
can be constructed from your x, y with ShortReadQ(x, y), and provides
coordinated ways of dealing with quality and sequence data). The fasta
and quality files match

  list.files(readPath(rp), "\\.fna$")
  list.files(qualpath(rp), "\\.qual$")

where "\\.fna$" and "\\.qual$" can be over-riddent by providing
arguments fastaPattern="some-other-pattern" and qualPattern, e.g.,

In terms of qa, there is as yet no direct support for 454 data. There
are two problems, the variable read length and the quality metrics
specific to 454. A simple first step is to look at average base
quality as a function of read length.

A hack is to export the 'sr' object to a fastq file (with writeFastq)
and to do a qa on that file, but providing a filter that returns
strictly even lengthed reads. I did this as

  sr <- read454(rp)
  wmax <- which.max(tabulate(width(sr))) # most common read width
  writeFastq(sr, "/tmp/sr.fastq")
  qa <- qa("/tmp", "sr.fastq", type="fastq",
           filter=srFilter(function(x) width(x) == wmax)
  browseURL(report(qa))

This requires the development version of ShortRead (and hence of R).

Here's a chunk of script from an exploratory analysis that might
also provide some ideas.

library(lattice)
sr <- read454(rp)

## distribution of read widths
df <- data.frame(CummulativeProportion=seq_len(length(sr)) / length(sr),
                 Width=sort(width(sr)))
xyplot(CummulativeProportion ~ Width,
       df[!duplicated(df$Width),], type="l")

## distribution of average base quality as a
function of read width
df <- data.frame(AlphabetScore=alphabetScore(sr), Width=width(sr))
xyplot(log(AlphabetScore) - log(Width) ~ Width, df)

## how does quality change by cycle, for the 12 longest categories of
## width (e.g., average quality for the N=221 reads with width 282)
tbl <- table(width(sr))
top <- tbl[order(tbl, decreasing=TRUE)][1:12]
wd <- as.integer(names(top))
scores <- lapply(wd, function(w, q) {
    colMeans(as(q[width(q)==w], "matrix"))
}, quality(sr))
lbls <- paste(names(top), " (N = ", top, ")", sep="")
df <- data.frame(Cycle=unlist(lapply(wd, seq_len)),
                 AverageQuality=unlist(scores),
                 ReadLength=rep(lbls, wd))
xyplot(AverageQuality ~ Cycle | ReadLength, df)
                              
Becuase of the read lengths, I think you'd want a fully capable
alignment tool (indels, arbitrary mismatches, etc), maybe blast as an
accessible starting point?  Is alignment to a full reference genome
relevant to your question?  Often 454 data seem to be collected with
more targeted questions in mind.

Martin
                          

> experimentPath <-
>   "./Fasta/"
>
> readPath <- experimentPath
> qualPath <- experimentPath
>
> my.rp <-
>   RochePath(experimentPath=experimentPath,
>             readPath=readPath,
>             qualPath=qualPath)
>
>
> pattern <- "FX0QVYF02"
>
> x <-
>   readFasta(my.rp, pattern)
> y <-
>   readQual(my.rp, pattern)
>
> show(x)
>
> mean(width(x))
>
>
> Is there an equivalent 'qa' function on the RochePath?
>
>
> As an aside, what tool would you recommend to align 454 reads to a
> reference genome?
>
>
> Cheers,
> Dan.
>
>
>
> 2009/8/19 Dan Bolser <dan.bolser at gmail.com>:
>> I'm trying to use this function (read454) but it keeps dying..
>>
>>
>> library(ShortRead)
>>
>> experimentPath <-
>>  "Fasta/"
>>
>> readPath <- experimentPath
>> qualPath <- experimentPath
>>
>> pattern <- "FUBJ29X06.sff."
>>
>> reads <-
>>  read454(RochePath(experimentPath=experimentPath,
>>                    readPath=readPath,
>>                    qualPath=qualPath), pattern)
>>
>> +   read454(RochePath(experimentPath=experimentPath,
>> +                     readPath=readPath,
>> +                     qualPath=qualPath), pattern)
>> Error in file(file, "r") : cannot open the connection
>> In addition: Warning message:
>> In file(file, "r") : cannot open file 'NA': No such file or directory
>>
>>
>> The contents of the 'Fasta' directory look like this:
>>
>> ls Fasta/FUBJ29X06.sff.*
>> Fasta/FUBJ29X06.sff.fna  Fasta/FUBJ29X06.sff.qual
>>
>>
>>
>> Why can't I just specify a fasta file and a quality file?
>>
>> Cheers,
>> Dan.
>>
>>
>> 2009/7/17 Pratap, Abhishek <APratap at som.umaryland.edu>:
>>> Hi
>>>
>>> In that case based on my short experience I think you could specify a pattern while reading files from a directory.
>>>
>>> pattern-<"regex-pattern-here"
>>>
>>> reads <- read454(RochePath(experimentPath=experimentPath,readPath=readPath, qualPath=qualPath), pattern)
>>>
>>> I think should work.
>>>
>>> Thanks
>>> -A
>>>
>>>
>>> -----Original Message-----
>>> From: bioc-sig-sequencing-bounces at r-project.org on behalf of Wolfgang Raffelsberger
>>> Sent: Fri 7/17/2009 1:35 PM
>>> Cc: bioc-sig-sequencing at r-project.org
>>> Subject: Re: [Bioc-sig-seq] reading 454 data with ShortRead
>>>
>>> Dear Michael,
>>>
>>> Thank's a lot, it works now ...
>>> If I understand right, I may not have more than one single pair of
>>> sequence-reads (.fna) and quality-scores (.qual) in a given directory,
>>> since you don't specify the file-names.
>>>
>>> Wolfgang
>>>
>>> Michael Mueller a écrit :
>>>> Dear Wolfgang,
>>>>
>>>> readPath and qualPath specify the paths to the directory containing the
>>>> .fna and .qual files rather than the paths to the files themselves.
>>>>
>>>> Try:
>>>>
>>>> experimentPath <- "/path/to/directory/containing/fna/and/qual/files"
>>>> readPath <- experimentPath
>>>> qualPath <- experimentPath
>>>>
>>>> reads <- read454(RochePath(experimentPath=experimentPath,
>>>> readPath=readPath, qualPath=qualPath))
>>>>
>>>> If only experimentPath is specified all sub-directories of
>>>> <experimentPath> matching run are searched for read/quality files.
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>> ----------------------------
>>>> Michael Mueller
>>>> Bioinformatician
>>>>
>>>> Imperial College London
>>>> MRC Clinical Sciences Centre
>>>> Du Cane Road
>>>> London, W12 0NN
>>>> United Kingdom
>>>>
>>>> phone +44 (0)20 8383 8537
>>>> fax   +44 (0)20 8383 8577
>>>>
>>>>
>>>> Wolfgang Raffelsberger wrote:
>>>>> Dear list,
>>>>> I'm trying to read some 454 data into ShortRead, but I'm having some
>>>>> difficulties. Basically I have a FASTA concatenated file with all the
>>>>> sequence reads and a FASTA-like file with the Phred-quailty scores,
>>>>> the identifyiers for each read from both files match. Here I'm
>>>>> testing a small sub-set of the reads, so memory etc shouldn't be an
>>>>> issue.
>>>>> Based on the ShortRead Vignette I thought could do the job and then I
>>>>> realized that there are 454-dedicated functions. But so far I didn't
>>>>> get either of them of them to work.
>>>>>
>>>>>  > library(ShortRead)
>>>>>  > Sfile1 <- "01Reads.fna"
>>>>>  > Qfile1 <- "01Reads.qual"
>>>>>  > tmp <- read454(".", readPath=Sfile1, qualPath=Qfile1)    # I'm
>>>>> already in the proper path where both files are located ...
>>>>> Error in function (classes, fdef, mtable)  :
>>>>>   unable to find an inherited method for function "read454", for
>>>>> signature "character"
>>>>>  >
>>>>>  > # the I tried ...
>>>>>  >  tmp <- readXStringColumns(".",pattern=Sfile1,
>>>>> colClasses=list("DNAString"))
>>>>> Error: Input/Output
>>>>>   while reading files 'test16seq.fna':
>>>>>     key 62 not in lookup table
>>>>>  >
>>>>>  > # for completeness :
>>>>>  > sessionInfo()
>>>>> R version 2.9.1 (2009-06-26)
>>>>> i386-pc-mingw32
>>>>>
>>>>> locale:
>>>>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
>>>>>
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>> other attached packages:
>>>>> [1] ShortRead_1.2.1   lattice_0.17-25   BSgenome_1.12.3
>>>>> Biostrings_2.12.7 IRanges_1.2.3
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.4.1 grid_2.9.1    hwriter_1.1   tools_2.9.1
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>> Wolfgang
>>>>>
>>>>>
>>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>>> Wolfgang Raffelsberger, PhD
>>>>> Laboratoire de BioInformatique et Génomique Intégratives
>>>>> CNRS UMR7104, IGBMC,  1 rue Laurent Fries,  67404 Illkirch
>>>>> Strasbourg,  France
>>>>> Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
>>>>> wolfgang.raffelsberger (at) igbmc.fr
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list