[Bioc-devel] read.XStringSet with spaces in or at end of sequence
Thomas Girke
thomas.girke at ucr.edu
Tue May 22 22:35:41 CEST 2012
Hervé,
I agree, an argument where the user has to explicitly decide how to handle
unusual characters (e.g. if.invalid.char="") would solve this in the most
sensible manner.
Thomas
On Tue, May 22, 2012 at 07:39:14PM +0000, Hervé Pagès wrote:
> Hi Thomas,
>
> On 05/22/2012 10:58 AM, Thomas Girke wrote:
> > Currently, spaces in sequences are handled inconsistently by the FASTA
> > read functions in Biostrings. This applies to spaces in or at the end of
> > sequence strings. Because of this users often think Biostrings cannot
> > handle their sequence data and give up using it which I find
> > unfortunate.
> >
> > For instance, given this sequence stored in "test.fasta":
> >> 123
> > AATTTAAA GGGG
> >
> > read.DNAStringSet fails to import this sequence which is the
> > least desirable outcome.
> >
> >> read.DNAStringSet("test.fasta")
> > Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names, :
> > key 32 (char ' ') not in lookup table
> >
> > however, read.AAStringSet imports it but maintains the space
> >
> >> read.AAStringSet("test.fasta")
> > A AAStringSet instance of length 1
> > width seq names
> > [1] 13 AATTTAAA GGGG 123
>
> Note that this doesn't fail because the letters in an AAStringSet
> object can be anything right now, but it's on my TODO list to change
> this i.e. it will become an error to try to store a letter in an
> AAStringSet that doesn't belong to the Amino Acid alphabet (stored
> in predefined constant AA_ALPHABET).
>
> So the import function to use when one doesn't want to enforce a
> particular alphabet is read.BStringSet():
>
> > read.BStringSet("test.fasta")
> A BStringSet instance of length 1
> width seq names
>
> [1] 13 AATTTAAA GGGG 123
>
> The other functions in the family (i.e. read.DNAStringSet,
> read.RNAStringSet, and read.AAStringSet) will fail if the FASTA file
> contains letters that are not in DNA_ALPHABET, RNA_ALPHABET, or
> AA_ALPHABET, respectively.
>
> >
> > Wouldn't it make most sense to remove/ignore spaces during the import?
>
> According to Wikipeddia
>
> http://en.wikipedia.org/wiki/FASTA_format
>
> yes the spaces and any other invalid code should be ignored. My concern
> with this behavior though is that removing/ignoring letters in the input
> will shift the positions of all the remaining letters, which for
> some use cases is not desirable (maybe everything is fine because all
> the letters end up at the right position anyway, but maybe not, hard
> to tell without knowing why a space was inserted in the file in the
> first place).
>
> Note that we have special letters in the DNA/RNA/AA alphabets that
> could be used as a replacement for invalid chars:
>
> > DNA_ALPHABET
> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
> > RNA_ALPHABET
> [1] "A" "C" "G" "U" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
> > AA_ALPHABET
> [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S"
> "T" "W" "Y"
> [20] "V" "U" "B" "Z" "X" "*" "-" "+"
>
> "-" stands for "gap" and "+" is used for hard masking. IMO they are
> both reasonable candidates. I propose to add an extra arg (e.g.
> if.invalid.char) to read.DNAStringSet, read.RNAStringSet, and
> read.AAStringSet to let the user choose what the substitution letter
> should be, e.g. if.invalid.char="+", or if.invalid.char="" (for
> removing the invalid letters).
>
> Now should we set its default to "" (and strictly follow the FASTA
> spec), or should we set it to NA so by default an error would still
> be raised if the file contains invalid chars? I prefer the latter
> because I think it's good to let the user know that there is something
> uncommon (at best) or potentially wrong with the file.
>
> Thanks for your feedback,
> H.
>
>
> >
> > Thomas
> >
> >> sessionInfo()
> > R version 2.15.0 (2012-03-30)
> > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >
> > locale:
> > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0
> >
> > loaded via a namespace (and not attached):
> > [1] stats4_2.15.0 tools_2.15.0
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioc-devel
mailing list