[Bioc-sig-seq] ShortRead, feature request (if not a bug report)

Martin Morgan mtmorgan at fhcrc.org
Wed May 18 00:10:40 CEST 2011


On 05/17/2011 02:35 PM, Ivan Gregoretti wrote:
> Hello ShortRead connoisseurs,
>
> ShortRead::readAligned is very smart because it allows you to load the
> content of a large file without decompressing it. For example:
>
> aln<- readAligned("s_1_export.txt.gz", type="SolexaExport")
>
> However, its analogous reading function, ShortRead::readFasta, complains
> on my system that it cannot handle gzipped targets:
>
> fas<- readFasta("s_1.fa.gz")
> Error in .normargInputFilepath(filepath) :
>    file "s_1.fa.gz" has unsupported type: gzfile

This is a limitation of Biostrings' read.DNAStringSet.

A workaround, if the file uses the classic layout in which each read is 
an id line followed by its sequence on a single line, is

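  ## readLines auto-detects gzip; assumes strictly alternating id / sequence lines
  lines <- readLines("s_1.fa.gz")
  ## odd lines are ids (leading ">" stripped), even lines are sequences
  id <- BStringSet(sub("^>", "", lines[c(TRUE, FALSE)]))
  sread <- DNAStringSet(lines[c(FALSE, TRUE)])
  fas <- ShortRead(sread=sread, id=id)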

(readLines may emit a warning about an internal error; it can be 
ignored.) Rsamtools::FaFile is another option, though it is meant more 
for reference sequences than for short reads.
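
If the sequences are wrapped over several lines, the alternating-line 
trick no longer applies. A minimal base-R sketch for that case follows; 
parse_fasta_lines is a hypothetical helper (not part of ShortRead or 
Biostrings), and the small gzipped file is made up for the demonstration.

```r
## Hypothetical helper: turn FASTA text lines into a named character
## vector, joining sequence lines that belong to the same record.
parse_fasta_lines <- function(lines) {
    lines <- lines[nzchar(lines)]            # drop blank lines
    is_header <- grepl("^>", lines)
    if (length(lines) == 0L || !is_header[1L])
        stop("input does not start with a FASTA header line")
    record <- cumsum(is_header)              # record number of every line
    ## group sequence lines by record; records without sequence stay empty
    groups <- split(lines[!is_header],
                    factor(record[!is_header],
                           levels = seq_len(sum(is_header))))
    seqs <- vapply(groups, paste, character(1), collapse = "",
                   USE.NAMES = FALSE)
    names(seqs) <- sub("^>", "", lines[is_header])
    seqs
}

## Made-up example data written to a temporary gzipped file
path <- tempfile(fileext = ".fa.gz")
con <- gzfile(path, "w")
writeLines(c(">read1", "ACGT", "TTGA", ">read2", "GGCC"), con)
close(con)

## Read the compressed file directly, without gunzip on disk
con <- gzfile(path, "r")
lines <- readLines(con)
close(con)

reads <- parse_fasta_lines(lines)

## With ShortRead attached, the result could then be wrapped as before
## (not run here):
##   fas <- ShortRead(sread = DNAStringSet(unname(reads)),
##                    id = BStringSet(names(reads)))
```

This still reads the whole file into memory as character data, so it is 
a convenience for moderately sized files rather than a replacement for 
native gzip support in the reader.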

Martin

>
>
> Currently the solution seems to be:
>
> system("gunzip -f s_1.fa.gz")
> fas<- readFasta("s_1.fa")
> system("gzip -9f s_1.fa")
>
> but this code is highly inefficient, especially with large files.
>
> Please consider adding the missing functionality just like in readAligned.
>
> In case it is a bug in my ShortRead version, see my session below.
>
> Thank you,
>
> Ivan
>
>> sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-04-14 r55450)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>   [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
>   [7] LC_PAPER=en_US.utf8       LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] annotate_1.31.0      AnnotationDbi_1.15.1 Biobase_2.13.1
> [4] ShortRead_1.11.1     Rsamtools_1.5.9      lattice_0.19-26
> [7] Biostrings_2.21.1    GenomicRanges_1.5.0  IRanges_1.11.1
>
> loaded via a namespace (and not attached):
> [1] DBI_0.2-5     grid_2.14.0   hwriter_1.3   RSQLite_0.9-4 tools_2.14.0
> [6] xtable_1.5-6
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


