[Bioc-sig-seq] read sequences from the web

Hervé Pagès hpages at fhcrc.org
Fri Feb 5 18:43:15 CET 2010


Hi Thomas,

Oops, some recent speed improvements to the read.*StringSet() family
turn out to be a regression for your use case, sorry!

Back in November I re-implemented in C the FASTA parser used by the
read.*StringSet() family to make it faster. Now it's 10x or 20x
faster (I don't remember exactly) to load Human chr1 from a FASTA
file. Because handling R connections in C is not easily doable
right now (the C code in R that handles these connections has not
been designed to be easily reusable in a package), this FASTA parser
uses standard C facilities to read the file, with all the restrictions
that this implies. For example, the file must be local: no more URLs,
pipes, FIFOs, socket connections, or any of the other fancy stuff
supported by R connections (see ?file).

I underestimated the value of supporting URLs, so I'll work on a fix
to at least support those (the fix will consist of downloading
the file to a temp file first, nothing fancy). I'll post again here
when this is ready.
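
Until then, the same idea can be applied by hand. Below is a minimal
sketch, assuming a plain download to a temp file is acceptable for your
use case. read_from_url() is a hypothetical helper name chosen for
illustration, not a Biostrings function; it only relies on
download.file() and tempfile() from base R, and takes the reader
function as an argument.

```r
# Hypothetical helper (not part of Biostrings): download a remote file
# to a temporary local file, then pass the local path to a reader
# function that only accepts local files.
read_from_url <- function(url, reader, ...) {
  tmp <- tempfile(fileext = ".fasta")
  on.exit(unlink(tmp), add = TRUE)  # remove the temp file on exit
  # mode = "wb" avoids end-of-line translation on some platforms
  status <- download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  if (status != 0L)
    stop("download failed for ", url)
  reader(tmp, ...)
}
```

With Biostrings loaded, Thomas's call would then look something like
read_from_url("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", read.AAStringSet, "fasta").
This assumes the reader loads everything into memory before returning,
since the temp file is deleted as soon as the helper exits.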

Cheers,
H.


Thomas Girke wrote:
> Dear Biostrings Developers,
> 
> There seems to be a change (bug?) in the behavior of the read.XXStringSet functions
> in the latest Biostrings version when pointing to files on the web. 
> 
> For instance: 
> 
> ## This works under R-2.10.0
> library(Biostrings)
> read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta") 
> 
> ## But the same command under R-2.10.1 returns the following error:
> Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
> cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
> 
> My session info for R-2.10.1 is:
> 
> R version 2.10.1 (2009-12-14) 
> x86_64-unknown-linux-gnu 
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] Biostrings_2.14.10 IRanges_1.4.9     
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.1
> 
> 
> Thanks in advance for your help.
> 
> Thomas
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


