[Bioc-devel] read.XStringSet with spaces in or at end of sequence

Tue May 22 22:35:41 CEST 2012

Hervé,

I agree, an argument where the user has to explicitly decide how to handle 
unusual characters (e.g. if.invalid.char="") would solve this in the most
sensible manner.

Thomas

On Tue, May 22, 2012 at 07:39:14PM +0000, Hervé Pagès wrote:
> Hi Thomas,
> 
> On 05/22/2012 10:58 AM, Thomas Girke wrote:
> > Currently, spaces in sequences are handled inconsistently by the FASTA
> > read functions in Biostrings. This applies to spaces in or at the end of
> > sequence strings. Because of this users often think Biostrings cannot
> > handle their sequence data and give up using it which I find
> > unfortunate.
> >
> > For instance, given this sequence stored in "test.fasta":
> >> 123
> > AATTTAAA GGGG
> >
> > read.DNAStringSet fails to import this sequence which is the
> > least desirable outcome.
> >
> >> read.DNAStringSet("test.fasta")
> > Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names,  :
> >    key 32 (char ' ') not in lookup table
> >
> > however, read.AAStringSet imports it but maintains the space
> >
> >> read.AAStringSet("test.fasta")
> >    A AAStringSet instance of length 1
> >        width seq                                               names
> >        [1]    13 AATTTAAA GGGG                                     123
> 
> Note that this doesn't fail because the letters in an AAStringSet
> object can be anything right now, but it's on my TODO list to change
> this i.e. it will become an error to try to store a letter in an
> AAStringSet that doesn't belong to the Amino Acid alphabet (stored
> in predefined constant AA_ALPHABET).
> 
> So the import function to use when one doesn't want to enforce a
> particular alphabet is read.BStringSet():
> 
>    > read.BStringSet("test.fasta")
>      A BStringSet instance of length 1
>        width seq                                               names 
> 
>    [1]    13 AATTTAAA GGGG                                      123
> 
> The other functions in the family (i.e. read.DNAStringSet,
> read.RNAStringSet, and read.AAStringSet) will fail if the FASTA file
> contains letters that are not in DNA_ALPHABET, RNA_ALPHABET, or
> AA_ALPHABET, respectively.
> 
> >
> > Wouldn't it make most sense to remove/ignore spaces during the import?
> 
> According to Wikipeddia
> 
>    http://en.wikipedia.org/wiki/FASTA_format
> 
> yes the spaces and any other invalid code should be ignored. My concern
> with this behavior though is that removing/ignoring letters in the input 
> will shift the positions of all the remaining letters, which for
> some use cases is not desirable (maybe everything is fine because all
> the letters end up at the right position anyway, but maybe not, hard
> to tell without knowing why a space was inserted in the file in the
> first place).
> 
> Note that we have special letters in the DNA/RNA/AA alphabets that
> could be used as a replacement for invalid chars:
> 
>    > DNA_ALPHABET
>     [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
>    > RNA_ALPHABET
>     [1] "A" "C" "G" "U" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
>    > AA_ALPHABET
>     [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" 
> "T" "W" "Y"
>    [20] "V" "U" "B" "Z" "X" "*" "-" "+"
> 
> "-" stands for "gap" and "+" is used for hard masking. IMO they are
> both reasonable candidates. I propose to add an extra arg (e.g.
> if.invalid.char) to read.DNAStringSet, read.RNAStringSet, and
> read.AAStringSet to let the user choose what the substitution letter
> should be, e.g. if.invalid.char="+", or if.invalid.char="" (for
> removing the invalid letters).
> 
> Now should we set its default to "" (and strictly follow the FASTA
> spec), or should we set it to NA so by default an error would still
> be raised if the file contains invalid chars? I prefer the latter
> because I think it's good to let the user know that there is something
> uncommon (at best) or potentially wrong with the file.
> 
> Thanks for your feedback,
> H.
> 
> 
> >
> > Thomas
> >
> >> sessionInfo()
> > R version 2.15.0 (2012-03-30)
> > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >
> > locale:
> > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > other attached packages:
> > [1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0
> >
> > loaded via a namespace (and not attached):
> > [1] stats4_2.15.0 tools_2.15.0
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319