[Bioc-devel] read.XStringSet with spaces in or at end of sequence

Thomas Girke thomas.girke at ucr.edu
Tue May 22 19:58:47 CEST 2012


Currently, the FASTA read functions in Biostrings handle spaces
inconsistently, whether the spaces occur within a sequence string or at
its end. Because of this, users often conclude that Biostrings cannot
handle their sequence data and give up on it, which I find
unfortunate.

For instance, given this sequence stored in "test.fasta":
>123
AATTTAAA GGGG

read.DNAStringSet fails to import this sequence, which is the
least desirable outcome:

> read.DNAStringSet("test.fasta")
Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names,  : 
  key 32 (char ' ') not in lookup table

However, read.AAStringSet imports it but keeps the space:

> read.AAStringSet("test.fasta")
  A AAStringSet instance of length 1
    width seq                                               names
[1]    13 AATTTAAA GGGG                                     123

Wouldn't it make most sense to remove/ignore spaces during the import?
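
Until the import functions do this themselves, one possible workaround is
to strip the whitespace before the sequences reach Biostrings. The
following is only a minimal sketch in base R: read_fasta_strip_spaces is
a hypothetical helper name, not a Biostrings function, and it assumes a
plain FASTA file small enough to read into memory with readLines.

```r
# Hypothetical helper (not part of Biostrings): parse a FASTA file with
# base R and remove all whitespace from the sequence lines.
read_fasta_strip_spaces <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Drop empty and whitespace-only lines.
  lines <- lines[grepl("[^[:space:]]", lines)]

  is_header <- grepl("^>", lines)
  record <- cumsum(is_header)          # record index for every line
  ids <- sub("^>[[:space:]]*", "", lines[is_header])

  # Sequence lines are the non-header lines that follow a header.
  is_seq <- !is_header & record > 0
  cleaned <- gsub("[[:space:]]+", "", lines[is_seq])

  # Group the cleaned lines by record and join them into one string each.
  groups <- split(cleaned, factor(record[is_seq], levels = seq_along(ids)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  names(seqs) <- ids
  seqs
}

# Recreate the example above (with an extra trailing space) in a temp file.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">123", "AATTTAAA GGGG "), fasta_file)

seqs <- read_fasta_strip_spaces(fasta_file)
print(seqs)
```

The result is a named character vector holding "AATTTAAAGGGG" under the
name "123", which can then be passed to the DNAStringSet() constructor.
This sidesteps the read functions entirely, so it is a stopgap and not a
substitute for handling spaces during the import itself.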

Thomas

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] stats4_2.15.0 tools_2.15.0
