[Bioc-devel] AAString - Amino acid code enforced?
Felix Ernst
felix.ernst at ulb.ac.be
Mon Apr 2 15:07:08 CEST 2018
Dear all,
probably this is for Hervé Pagès:
I tried the following code, which should according to ?AAString not work, since ÜÖÄ are not part of any AA code.
> AAString("ÜÄÖ")
3-letter "AAString" instance
seq: ÜÄÖ
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.46.0 XVector_0.18.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.24.0 compiler_3.4.4 tools_3.4.4 yaml_2.1.18
I don’t have access right now to the devel version of Biostrings, bit I checked out the current Code in the github repo and its recent changes. I am pretty sure, that this behavior is also in the current devel branch. Can someone confirm this?
My current interest is in using the XString classes and methods for an additional biological string representation. The initial question was, how can I restrict this to a certain character set, if the characters are not saved byte encoded? The latter option is not available to me, since characters like ‚«‘ or ‚=‘ result in a two byte code using the charToRaw function. This trips up the build of the internal lookup table, which are passed down to the C backend.
Therefore I looked into, how this is done for an AAString differing from a BString. I discovered, that it currently doesn‘t. I also looked into the current 2.47.12 repo, which as far as I can tell does not use the AMINO_ACID_CODE constant in the creation of an AAString object.
So my questions are:
- What is the best practice for extending a class from XString with a restricted character set, which is not byte encoded?
- Is there a way to use byte encoding for chars with two ore more bytes?
Thanks in advance for any help and suggestions.
Best regards,
Felix
PS: regarding the second question: One could change „as.integer(charToRaw(paste(letters, collapse="")))“ to „lapply(lapply(letters,charToRaw),as.integer)“ in .letterAsByteVal, but in any case it will not be atomic anymore, which I think is required to be excepted by the C backend. I didn’t test it.
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list