[Bioc-devel] read.XStringSet with spaces in or at end of sequence

Fri Jun 29 00:45:13 CEST 2012

Hi Thomas,

Just to let you know that readDNAStringSet() and family now support
FASTA files with invalid sequence codes. I changed my initial proposal:
invalid codes are just ignored with a warning.

Another improvement is that there is no more limit on the length of
the lines containing the sequence data (used to be limited to 19999).
Seems like users were sometimes bumping into that limit.

The changes are in Biostrings 2.25.5 (current devel). Please let me
know if you run into any other issue with readDNAStringSet().

Thanks,
H.

On 05/22/2012 01:35 PM, Thomas Girke wrote:
> Hervé,
>
> I agree, an argument where the user has to explicitly decide how to handle
> unusual characters (e.g. if.invalid.char="") would solve this in the most
> sensible manner.
>
> Thomas
>
> On Tue, May 22, 2012 at 07:39:14PM +0000, Hervé Pagès wrote:
>> Hi Thomas,
>>
>> On 05/22/2012 10:58 AM, Thomas Girke wrote:
>>> Currently, spaces in sequences are handled inconsistently by the FASTA
>>> read functions in Biostrings. This applies to spaces in or at the end of
>>> sequence strings. Because of this users often think Biostrings cannot
>>> handle their sequence data and give up using it which I find
>>> unfortunate.
>>>
>>> For instance, given this sequence stored in "test.fasta":
>>>> 123
>>> AATTTAAA GGGG
>>>
>>> read.DNAStringSet fails to import this sequence which is the
>>> least desirable outcome.
>>>
>>>> read.DNAStringSet("test.fasta")
>>> Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names,  :
>>>     key 32 (char ' ') not in lookup table
>>>
>>> however, read.AAStringSet imports it but maintains the space
>>>
>>>> read.AAStringSet("test.fasta")
>>>     A AAStringSet instance of length 1
>>>         width seq                                               names
>>>         [1]    13 AATTTAAA GGGG                                     123
>>
>> Note that this doesn't fail because the letters in an AAStringSet
>> object can be anything right now, but it's on my TODO list to change
>> this i.e. it will become an error to try to store a letter in an
>> AAStringSet that doesn't belong to the Amino Acid alphabet (stored
>> in predefined constant AA_ALPHABET).
>>
>> So the import function to use when one doesn't want to enforce a
>> particular alphabet is read.BStringSet():
>>
>>     > read.BStringSet("test.fasta")
>>       A BStringSet instance of length 1
>>         width seq                                               names
>>
>>     [1]    13 AATTTAAA GGGG                                      123
>>
>> The other functions in the family (i.e. read.DNAStringSet,
>> read.RNAStringSet, and read.AAStringSet) will fail if the FASTA file
>> contains letters that are not in DNA_ALPHABET, RNA_ALPHABET, or
>> AA_ALPHABET, respectively.
>>
>>>
>>> Wouldn't it make most sense to remove/ignore spaces during the import?
>>
>> According to Wikipeddia
>>
>>     http://en.wikipedia.org/wiki/FASTA_format
>>
>> yes the spaces and any other invalid code should be ignored. My concern
>> with this behavior though is that removing/ignoring letters in the input
>> will shift the positions of all the remaining letters, which for
>> some use cases is not desirable (maybe everything is fine because all
>> the letters end up at the right position anyway, but maybe not, hard
>> to tell without knowing why a space was inserted in the file in the
>> first place).
>>
>> Note that we have special letters in the DNA/RNA/AA alphabets that
>> could be used as a replacement for invalid chars:
>>
>>     > DNA_ALPHABET
>>      [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
>>     > RNA_ALPHABET
>>      [1] "A" "C" "G" "U" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
>>     > AA_ALPHABET
>>      [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S"
>> "T" "W" "Y"
>>     [20] "V" "U" "B" "Z" "X" "*" "-" "+"
>>
>> "-" stands for "gap" and "+" is used for hard masking. IMO they are
>> both reasonable candidates. I propose to add an extra arg (e.g.
>> if.invalid.char) to read.DNAStringSet, read.RNAStringSet, and
>> read.AAStringSet to let the user choose what the substitution letter
>> should be, e.g. if.invalid.char="+", or if.invalid.char="" (for
>> removing the invalid letters).
>>
>> Now should we set its default to "" (and strictly follow the FASTA
>> spec), or should we set it to NA so by default an error would still
>> be raised if the file contains invalid chars? I prefer the latter
>> because I think it's good to let the user know that there is something
>> uncommon (at best) or potentially wrong with the file.
>>
>> Thanks for your feedback,
>> H.
>>
>>
>>>
>>> Thomas
>>>
>>>> sessionInfo()
>>> R version 2.15.0 (2012-03-30)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_2.15.0 tools_2.15.0
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319