[Bioc-sig-seq] read sequences from the web

Laurent Gautier lgautier at gmail.com
Fri Feb 5 19:42:36 CET 2010


On 2/5/10 6:43 PM, Hervé Pagès wrote:
> Hi Thomas,
>
> Oops, some recent speed improvements to the read.*StringSet() family
> turn out to be regressions for your use case, sorry!
>
> Back in November I re-implemented in C the FASTA parser used by the
> read.*StringSet() family to make it faster. Now it's 10x or 20x
> faster (I don't remember exactly) to load Human chr1 from a FASTA
> file. Because handling R connections in C is not easily doable
> right now (the C code in R that handles these connections has not
> been designed to be easily reusable in a package),

This surfaces occasionally on the R-devel mailing list, and someone has
even contributed a patch. It all seems to have been largely ignored,
maybe because a critical mass has not been reached (I am still trying to
rationalize ;-) ). Maybe you'll have a strategy to get it pushed through.

> this FASTA parser
> uses standard C facilities to read the file, with all the restrictions
> that this implies. For example the file must be local, no more URLs,
> pipes, fifos, socket connections, etc... all the fancy stuff
> supported by R connections (see ?file).
>
> I underestimated the value of supporting URLs so I'll work on a fix
> to at least support those (the fix will consist of downloading
> the file first to a temp file, nothing fancy). I'll post again here
> when this is ready.
>
> Cheers,
> H.
>
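
Until that fix is available, the same download-first approach can be applied by hand. What follows is only a minimal sketch of the idea described above, not the actual Biostrings fix: read_via_tempfile() is a hypothetical helper name, and it assumes the reader function takes a local file path as its first argument.

```r
# Hypothetical helper (not part of Biostrings): download a remote file to a
# temporary local file, then pass that local path to a reader function that
# only accepts local files.
read_via_tempfile <- function(url, reader, ...) {
  tmp <- tempfile(fileext = ".fasta")
  # Remove the temporary copy once the reader has returned (or failed).
  on.exit(unlink(tmp), add = TRUE)
  # download.file() returns 0 on success.
  status <- download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  reader(tmp, ...)
}

# Usage with the call from the original report (needs Biostrings and
# network access, so it is left commented out here):
# library(Biostrings)
# aa <- read_via_tempfile(
#   "ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa",
#   read.AAStringSet, "fasta")
```

Passing the reader as an argument keeps the sketch independent of any particular Biostrings version; any function that reads from a local path should work the same way.
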
>
> Thomas Girke wrote:
>> Dear Biostrings Developers,
>>
>> There seems to be a change (bug?) in the behavior of the
>> read.XXStringSet functions
>> in the latest Biostrings version when pointing to files on the web.
>> For instance:
>> ## This works under R-2.10.0
>> library(Biostrings)
>> read.AAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa",
>> "fasta")
>> ## But the same command under R-2.10.1 returns the following error:
>> Error in .read.fasta.in.XStringSet(filepath, set.names, elementType,
>> lkup) :
>> cannot open file
>> 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa'
>>
>>
>> My session info for R-2.10.0 is:
>>
>> R version 2.10.1 (2009-12-14)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>     LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.14.10 IRanges_1.4.9
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.6.1
>>
>>
>> Thanks in advance for your help.
>>
>> Thomas
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>


