[Bioc-sig-seq] ReadFastq error

Hervé Pagès hpages at fhcrc.org
Sat Feb 20 01:36:25 CET 2010


Ramzi,

In case you have trouble or don't want to install R-devel + Bioc-devel,
here is code that should work with release and devel (my sessionInfo
at the end):

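   library(Biostrings)
   # Load as a BStringSet so that non-DNA letters don't cause an error
   bset <- read.BStringSet("path/to/your/file", format="fastq")

   # Columns of the alphabetFrequency() matrix that correspond to the
   # DNA letters (byte value + 1)
   dnaletter_cols <- as.integer(
       BString(paste(DNA_ALPHABET, collapse=""))) + 1L

   # Number of DNA letters in each record
   ndnaletter_per_string <-
       rowSums(alphabetFrequency(bset)[ , dnaletter_cols])

   # Indices of the records that contain at least 1 non-DNA letter
   which(ndnaletter_per_string != width(bset))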
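
The same check can be sketched in base R, without Biostrings, to see
what the comparison does. This is only an illustration under stated
assumptions, not the Biostrings implementation: the `reads` vector is
made-up toy data, and `dna_letters` is an assumed copy of the IUPAC
nucleotide alphabet (see ?DNA_ALPHABET for the real one, which may
differ between versions).

```r
# Toy stand-ins for fastq read sequences (hypothetical data, not from
# the file discussed in this thread).
reads <- c("ACGTN", "ACGIT", "GGCCA", "AC.GT")

# Assumed alphabet: IUPAC nucleotide codes plus the gap letters "-" and "+".
dna_letters <- c("A", "C", "G", "T", "M", "R", "W", "S", "Y", "K",
                 "V", "H", "D", "B", "N", "-", "+")

# For each read, count how many of its characters belong to the alphabet.
n_dna_per_read <- vapply(
    strsplit(reads, "", fixed = TRUE),
    function(chars) sum(chars %in% dna_letters),
    integer(1)
)

# A read whose DNA-letter count differs from its length contains
# at least one non-DNA letter.
bad <- which(n_dna_per_read != nchar(reads))
print(bad)

# ASCII code 73, the key reported in the error message, is the letter "I".
print(intToUtf8(73))
```

With this toy input, `bad` holds the positions of the reads containing
"I" and ".", neither of which is in the assumed alphabet.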

Cheers,
H.

 > sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-unknown-linux-gnu

locale:
  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.14.12 IRanges_1.4.11

loaded via a namespace (and not attached):
[1] Biobase_2.6.1 tools_2.10.1


Hervé Pagès wrote:
> Hi Ramzi,
> 
> One thing you can try is loading your fastq file with:
> 
>   library(Biostrings)
>   bset <- read.BStringSet("path/to/your/file", format="fastq")
> 
> Note the use of read.BStringSet() instead of read.DNAStringSet().
> 
> Since BString/BStringSet objects are not limited to the DNA alphabet
> (see ?DNA_ALPHABET), you should be able to load your file even if
> it contains non-DNA letters (unless it has other problems of course).
> 
> Then you can do something like:
> 
>   ndnaletter_per_string <-
>       vcountPDict(BStringSet(DNA_ALPHABET), bset, collapse=2)
>   which(ndnaletter_per_string != width(bset))
> 
> to get the indices (as an integer vector) of the fastq records that
> contain at least 1 non-DNA letter. (Note that the code above works
> only with R-devel + BioC-devel.)
> 
> That way you'll be able to know if you have records like this and
> where they are.
> 
> readFastq() won't load a fastq file with non-DNA letters in it.
> 
> Cheers,
> H.
> 
> 
> Ramzi TEMANNI wrote:
>> Hi,
>> I'm encountering the following error when trying to load a fastq file:
>>
>> Error in .local(dirPath, pattern, ...) :
>>   _DNAencode(): key 73 not in lookup table
>>
>> Key 73 in the ASCII table is "I" (capital i).
>>
>> Has anyone encountered this error before?
>>
>> Thanks in advance for your help
>>
>> Regards,
>> Ramzi
>>
>> > sessionInfo()
>> R version 2.10.1 (2009-12-14)
>> x86_64-pc-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] biomaRt_2.2.0      ShortRead_1.4.0    lattice_0.18-3     BSgenome_1.14.2
>> [5] Biostrings_2.14.12 IRanges_1.4.11
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.6.1 grid_2.10.1   hwriter_1.1   RCurl_1.3-1   XML_2.6-0
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


