[Bioc-sig-seq] read sequences from the web

Hervé Pagès hpages at fhcrc.org
Wed Feb 10 08:28:46 CET 2010


Hi Thomas,

In Biostrings 2.15.21, read.*StringSet() works again with remote
files:

 > aaset <- read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa")
trying URL 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
ftp data connection made, file length 770075 bytes
opened URL
==================================================
downloaded 752 Kb

 > aaset[1:3]
   A AAStringSet instance of length 3
     width seq                                               names
 [1]   401 MTRRSRVGAGLAAIVLALAAVSA...FKIGGAVAVIAIVVVVVRRWRNP gi|10579650|gb|AA...
 [2]   221 MSIIELEGVVKRYETGAETVEAL...THDTQLEEFSDRAVNLVDGVLHT gi|10579651|gb|AA...
 [3]   369 MAWRNLGRNRVRTALAALGIVIG...SLLSGLYPAWKAANDPPVEALGE gi|10579652|gb|AA...

Note that I'm using download.file() in the background with quiet=FALSE
(the default) hence the verbose output and progress bar.
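
A minimal sketch of that download-first approach in plain R, for
illustration only. The helper name local_copy() and its URL pattern are
assumptions, not the actual Biostrings code, and the demo uses a file://
URL (Unix-style path assumed) so that it runs without network access:

```r
# Hypothetical helper (not part of Biostrings): return a local path for
# 'filepath', downloading it to a temporary file first if it looks like a URL.
local_copy <- function(filepath, quiet = FALSE) {
    is_url <- grepl("^(ftp|https?|file)://", filepath)
    if (!is_url)
        return(filepath)  # already a local file, nothing to do
    destfile <- tempfile()
    # download.file() returns 0 on success; quiet = FALSE shows progress.
    status <- download.file(filepath, destfile, quiet = quiet)
    if (status != 0L)
        stop("download failed for '", filepath, "'")
    destfile
}

# Demo: write a tiny FASTA file and fetch it back through a file:// URL.
src <- tempfile(fileext = ".faa")
writeLines(c(">seq1", "MTRRSRVGAG", ">seq2", "MSIIELEGVV"), src)
url <- paste0("file://", normalizePath(src))
path <- local_copy(url, quiet = TRUE)
readLines(path)
```

The returned path can then be handed to read.AAStringSet() like any other
local file, and quiet=TRUE silences the download messages shown above.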

Cheers,
H.


Thomas Girke wrote:
> Thanks Hervé. - For me, URL-based sequence imports are useful mainly for demo 
> purposes. For now, I can certainly work around this limitation by using stepwise 
> downloads and imports. As usual, speed matters more in this area than convenience...
> 
> Best, 
> 
> Thomas
> 
> 
> On Fri, Feb 05, 2010 at 09:43:15AM -0800, Hervé Pagès wrote:
>> Hi Thomas,
>>
>> Oops, some recent speed improvements to the read.*StringSet() family
>> that turn out to be regressions for your use case, sorry!
>>
>> Back in November I re-implemented in C the FASTA parser used by the
>> read.*StringSet() family to make it faster. Now it's 10x or 20x
>> faster (I don't remember exactly) to load Human chr1 from a FASTA
>> file. Because handling R connections in C is not easily doable
>> right now (the C code in R that handles these connections has not
>> been designed to be easily reusable in a package), this FASTA parser
>> uses standard C facilities to read the file, with all the restrictions
>> that this implies. For example the file must be local, no more URLs,
>> pipes, fifos, socket connections, etc... all the fancy stuff
>> supported by R connections (see ?file).
>>
>> I underestimated the value of supporting URLs so I'll work on a fix
>> to at least support those (the fix will consist in downloading
>> the file first to a temp file, nothing fancy). I'll post again here
>> when this is ready.
>>
>> Cheers,
>> H.
>>
>>
>> Thomas Girke wrote:
>>> Dear Biostrings Developers,
>>>
>>> There seems to be a change (bug?) in the behavior of the read.XXStringSet functions
>>> in the latest Biostrings version when pointing to files on the web.
>>>
>>> For instance: 
>>>
>>> ## This works under R-2.10.0
>>> library(Biostrings)
>>> read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa", "fasta") 
>>>
>>> ## But the same command under R-2.10.1 returns the following error:
>>> Error in .read.fasta.in.XStringSet(filepath, set.names, elementType, lkup) :
>>>   cannot open file 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
>>>
>>> My session info for R-2.10.1 is:
>>>
>>> R version 2.10.1 (2009-12-14) 
>>> x86_64-unknown-linux-gnu 
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base     
>>>
>>> other attached packages:
>>> [1] Biostrings_2.14.10 IRanges_1.4.9     
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.1
>>>
>>>
>>> Thanks in advance for your help.
>>>
>>> Thomas
>>>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


