[R] Data parsing question: adding characters within a string of characters
Duncan Murdoch
murdoch.duncan at gmail.com
Thu Jan 2 13:27:05 CET 2014
On 14-01-01 10:55 PM, Joshua Banta wrote:
> Dear Listserve,
>
> I have a data-parsing question for you. I recognize this is more in the domain of PERL/Python, but I don't know those languages! On the other hand, I am pretty good overall with R, so I'd rather get the job done within the R "ecosphere."
>
> Here is what I want to do. Consider the following data:
>
> string <- "ATCGCCCGTA[AGA]TAACCG"
>
> I want to alter string so that it looks like this:
>
> ATCGCCCGTA[A][G][A]TAACCG
>
> In other words, I want to design a piece of code that will scan a character string, find bracketed groups of characters, break up each character within the bracket into its own individual bracketed character, and then put the group of individually bracketed characters back into the character string. The lengths of the character strings enclosed by a bracket will vary, but in every case, I want to do the same thing: break up each character within the bracket into its own individual bracketed character, and then put the group of individually bracketed characters back into the character string.
>
> So, for example, another string may look like this:
>
> string2 <- "ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA"
>
> I want to alter string so that it looks like this:
>
> "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"

R is fine for that sort of operation, using regular expressions for
matching and sub() or gsub() for substitution. For example, this code
finds all the bracketed strings of 1 or more ATCG letters:

    matches <- gregexpr("[[][ATCG]+]", string)

In the result, which looks like this for your example string,
[[1]]
[1] 11
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
the 11 is the start of the bracketed expression, the 5 is the length of
the match. (There may be other starts and lengths if there are multiple
bracketed expressions.) So use substr to extract the matches.
You need to be a little careful putting the string back together after
adding the extra brackets, because `substr<-` won't replace a string
with one of a different length. I use this version instead:
`mysubstr<-` <- function(x, start, stop, value)
    paste0(substr(x, 1, start-1), value, substr(x, stop+1, nchar(x)))
I'll leave the details of the substitutions to you...
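
One possible way to fill in those details, as a sketch only: the helper name split_brackets is hypothetical and not part of the message. The block redefines `mysubstr<-` so that it runs on its own, and it assumes a single input string containing only the letters A, T, C and G inside brackets. Matches are processed from last to first, so that lengthening the string never shifts the positions of matches still to be handled.

```r
# Same replacement function as in the message, with balanced parentheses
`mysubstr<-` <- function(x, start, stop, value)
  paste0(substr(x, 1, start - 1), value, substr(x, stop + 1, nchar(x)))

# Hypothetical helper: rewrite every [XYZ] group as [X][Y][Z]
split_brackets <- function(x) {
  m <- gregexpr("[[][ATCG]+]", x)[[1]]
  if (m[1] == -1) return(x)                  # no bracketed group found
  starts <- as.integer(m)
  stops  <- starts + attr(m, "match.length") - 1
  # Go backwards so earlier start/stop positions stay valid
  for (i in rev(seq_along(starts))) {
    inner <- substr(x, starts[i] + 1, stops[i] - 1)   # letters without brackets
    chars <- strsplit(inner, "")[[1]]
    mysubstr(x, starts[i], stops[i]) <- paste0("[", chars, "]", collapse = "")
  }
  x
}

string  <- "ATCGCCCGTA[AGA]TAACCG"
string2 <- "ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA"
print(split_brackets(string))    # "ATCGCCCGTA[A][G][A]TAACCG"
print(split_brackets(string2))
```

For several strings at once, something like vapply(strings, split_brackets, character(1)) would apply the helper to each element.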
Duncan Murdoch