[Bioc-devel] read.XStringSet with spaces in or at end of sequence

Thomas Girke thomas.girke at ucr.edu
Fri Jun 29 00:52:23 CEST 2012


Perfect, thanks!

Thomas

On Thu, Jun 28, 2012 at 10:45:13PM +0000, Hervé Pagès wrote:
> Hi Thomas,
> 
> Just to let you know that readDNAStringSet() and family now support
> FASTA files with invalid sequence codes. I changed my initial proposal:
> invalid codes are just ignored with a warning.
> 
> Another improvement is that there is no more limit on the length of
> the lines containing the sequence data (used to be limited to 19999).
> Seems like users were sometimes bumping into that limit.
> 
> The changes are in Biostrings 2.25.5 (current devel). Please let me
> know if you run into any other issue with readDNAStringSet().
> 
> Thanks,
> H.
> 
> 
> On 05/22/2012 01:35 PM, Thomas Girke wrote:
> > Hervé,
> >
> > I agree, an argument where the user has to explicitly decide how to handle
> > unusual characters (e.g. if.invalid.char="") would solve this in the most
> > sensible manner.
> >
> > Thomas
> >
> > On Tue, May 22, 2012 at 07:39:14PM +0000, Hervé Pagès wrote:
> >> Hi Thomas,
> >>
> >> On 05/22/2012 10:58 AM, Thomas Girke wrote:
> >>> Currently, spaces in sequences are handled inconsistently by the FASTA
> >>> read functions in Biostrings. This applies to spaces in or at the end of
> >>> sequence strings. Because of this users often think Biostrings cannot
> >>> handle their sequence data and give up using it which I find
> >>> unfortunate.
> >>>
> >>> For instance, given this sequence stored in "test.fasta":
> >>>> 123
> >>> AATTTAAA GGGG
> >>>
> >>> read.DNAStringSet fails to import this sequence which is the
> >>> least desirable outcome.
> >>>
> >>>> read.DNAStringSet("test.fasta")
> >>> Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names,  :
> >>>     key 32 (char ' ') not in lookup table
> >>>
> >>> however, read.AAStringSet imports it but maintains the space
> >>>
> >>>> read.AAStringSet("test.fasta")
> >>>     A AAStringSet instance of length 1
> >>>         width seq                                               names
> >>>         [1]    13 AATTTAAA GGGG                                     123
> >>
> >> Note that this doesn't fail because the letters in an AAStringSet
> >> object can be anything right now, but it's on my TODO list to change
> >> this i.e. it will become an error to try to store a letter in an
> >> AAStringSet that doesn't belong to the Amino Acid alphabet (stored
> >> in predefined constant AA_ALPHABET).
> >>
> >> So the import function to use when one doesn't want to enforce a
> >> particular alphabet is read.BStringSet():
> >>
> >>     > read.BStringSet("test.fasta")
> >>       A BStringSet instance of length 1
> >>         width seq                                               names
> >>
> >>     [1]    13 AATTTAAA GGGG                                      123
> >>
> >> The other functions in the family (i.e. read.DNAStringSet,
> >> read.RNAStringSet, and read.AAStringSet) will fail if the FASTA file
> >> contains letters that are not in DNA_ALPHABET, RNA_ALPHABET, or
> >> AA_ALPHABET, respectively.
> >>
> >>>
> >>> Wouldn't it make most sense to remove/ignore spaces during the import?
> >>
> >> According to Wikipeddia
> >>
> >>     http://en.wikipedia.org/wiki/FASTA_format
> >>
> >> yes the spaces and any other invalid code should be ignored. My concern
> >> with this behavior though is that removing/ignoring letters in the input
> >> will shift the positions of all the remaining letters, which for
> >> some use cases is not desirable (maybe everything is fine because all
> >> the letters end up at the right position anyway, but maybe not, hard
> >> to tell without knowing why a space was inserted in the file in the
> >> first place).
> >>
> >> Note that we have special letters in the DNA/RNA/AA alphabets that
> >> could be used as a replacement for invalid chars:
> >>
> >>     > DNA_ALPHABET
> >>      [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
> >>     > RNA_ALPHABET
> >>      [1] "A" "C" "G" "U" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
> >>     > AA_ALPHABET
> >>      [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S"
> >> "T" "W" "Y"
> >>     [20] "V" "U" "B" "Z" "X" "*" "-" "+"
> >>
> >> "-" stands for "gap" and "+" is used for hard masking. IMO they are
> >> both reasonable candidates. I propose to add an extra arg (e.g.
> >> if.invalid.char) to read.DNAStringSet, read.RNAStringSet, and
> >> read.AAStringSet to let the user choose what the substitution letter
> >> should be, e.g. if.invalid.char="+", or if.invalid.char="" (for
> >> removing the invalid letters).
> >>
> >> Now should we set its default to "" (and strictly follow the FASTA
> >> spec), or should we set it to NA so by default an error would still
> >> be raised if the file contains invalid chars? I prefer the latter
> >> because I think it's good to let the user know that there is something
> >> uncommon (at best) or potentially wrong with the file.
> >>
> >> Thanks for your feedback,
> >> H.
> >>
> >>
> >>>
> >>> Thomas
> >>>
> >>>> sessionInfo()
> >>> R version 2.15.0 (2012-03-30)
> >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >>>
> >>> locale:
> >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>
> >>> attached base packages:
> >>> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>>
> >>> other attached packages:
> >>> [1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] stats4_2.15.0 tools_2.15.0
> >>>
> >>> _______________________________________________
> >>> Bioc-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Program in Computational Biology
> >> Division of Public Health Sciences
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514
> >> P.O. Box 19024
> >> Seattle, WA 98109-1024
> >>
> >> E-mail: hpages at fhcrc.org
> >> Phone:  (206) 667-5791
> >> Fax:    (206) 667-1319
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 
>



More information about the Bioc-devel mailing list