[R] applying strsplit to a whole column
David Winsemius
dwinsemius at comcast.net
Wed Aug 4 21:46:02 CEST 2010
On Aug 4, 2010, at 3:40 PM, Dimitri Liakhovitski wrote:
> Thanks a lot, David.
> It works perfectly. Of course, lapply is also a loop!
>
> So, your method is:
> z<-data.frame(nam1=c("bbb..aba","ccc..abb","ddd..abc","eee..abd"),
> stringsAsFactors=FALSE)
> z$nam2<-unlist(lapply( strsplit(z[[1]],split="\\.."), "[", 1))
> z$nam3<-unlist(lapply( strsplit(z[[1]],split="\\.."), "[", 2))
Unless you want to use the gsub method I later offered.
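
For the two-part data above, a substitution-based version might look like the following. This is only a sketch: it assumes each string contains exactly one literal ".." separator, and the two regular expressions are my own adaptation for this example rather than code posted earlier in the thread.

```r
# Same example data as in the quoted message.
z <- data.frame(nam1 = c("bbb..aba", "ccc..abb", "ddd..abc", "eee..abd"),
                stringsAsFactors = FALSE)

# Keep the left part: delete from the first literal ".." to the end.
z$nam2 <- sub("\\.\\..*$", "", z$nam1)

# Keep the right part: delete everything up to and including the first
# literal "..". perl = TRUE is used so the lazy quantifier .*? is supported.
z$nam3 <- sub("^.*?\\.\\.", "", z$nam1, perl = TRUE)
```

Both results are plain character vectors, so no unlist() step is needed.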
>
> And using the new package "stringr" (thank you for sharing!):
> y<-data.frame(nam1=c("aaa..aba","bbb..abb","ccc..abc","ddd..abd"),
> stringsAsFactors=FALSE)
> library(stringr)
> y$nam2<-as.data.frame(str_split_fixed(y$nam1, "\\..", 2))[[1]]
> y$nam3<-as.data.frame(str_split_fixed(y$nam1, "\\..", 2))[[2]]
> (y)
>
> One question - what exactly does the square bracket in your lapply
> code mean? Looks like a shortcut - I've not seen it before.
> lapply( strsplit(z[[1]],split="\\.."), "[", 1)
It is just the Extract function applied with an argument of 1 to each
successive member of the list, so it is simply the series:
> strsplit(x[[1]],split="\\..")[[1]][1]
[1] "bbb"
> strsplit(x[[1]],split="\\..")[[2]][1]
[1] "ccc"
> strsplit(x[[1]],split="\\..")[[3]][1]
[1] "ddd"
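
To make the equivalence concrete, here is a small sketch using the same x as before. The variable names (parts, first_a, and so on) are only for illustration.

```r
x <- data.frame(nam1 = c("bbb..aba", "ccc..abb", "ddd..abc", "eee..abd"),
                stringsAsFactors = FALSE)
parts <- strsplit(x[[1]], split = "\\..")

# "[" is an ordinary function, so these two calls do the same thing.
first_a <- parts[[1]][1]
first_b <- `[`(parts[[1]], 1)

# lapply hands each list element to "[" as its first argument and passes
# the extra 1 along as the index.
via_name <- lapply(parts, "[", 1)
via_anon <- lapply(parts, function(p) p[1])
```

Here first_a and first_b are both "bbb", and via_name and via_anon are the same list.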
>
> Thank you!
> Dimitri
>
> On Wed, Aug 4, 2010 at 3:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Aug 4, 2010, at 3:03 PM, Dimitri Liakhovitski wrote:
>>
>>> I am sorry, someone said that strsplit automatically works on a
>>> column. How exactly does it work?
>>> For example, if I want to grab just the first (or the second) part of
>>> the string in nam1 that should be split based on ".."
>>> x<-data.frame(nam1=c("bbb..aba","ccc..abb","ddd..abc","eee..abd"),
>>> stringsAsFactors=FALSE)
>>> str(x)
>>> strsplit(x[[1]],split="\\..")
>>> str(strsplit(x[[1]],split="\\.."))
>>>
>>> I am getting a list - hence, it looks like I have to go in a loop...?
>>>
>>> lapply( strsplit(x[[1]],split="\\.."), "[", 1)
>> [[1]]
>> [1] "bbb"
>>
>> [[2]]
>> [1] "ccc"
>>
>> [[3]]
>> [1] "ddd"
>>
>> [[4]]
>> [1] "eee"
>>
>>> lapply( strsplit(x[[1]],split="\\.."), "[", 2)
>> [[1]]
>> [1] "aba"
>>
>> [[2]]
>> [1] "abb"
>>
>> [[3]]
>> [1] "abc"
>>
>> [[4]]
>> [1] "abd"
>>
>>> unlist(lapply( strsplit(x[[1]],split="\\.."), "[", 2) )
>> [1] "aba" "abb" "abc" "abd"
>>> unlist(lapply( strsplit(x[[1]],split="\\.."), "[", 1) )
>> [1] "bbb" "ccc" "ddd" "eee"
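
A possible variation on the calls above, offered as a sketch rather than as the method used in this thread: vapply() returns the character vector directly, and fixed = TRUE makes strsplit() treat ".." as two literal dots. Note that the regular expression "\\.." means one literal dot followed by any single character, which happens to give the same result on this data.

```r
x <- data.frame(nam1 = c("bbb..aba", "ccc..abb", "ddd..abc", "eee..abd"),
                stringsAsFactors = FALSE)

# fixed = TRUE: the separator is the literal two-character string "..".
parts <- strsplit(x$nam1, split = "..", fixed = TRUE)

# vapply checks that every result is a single character value and
# returns a character vector, so unlist() is not needed.
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)
```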
>>>
>>
>>
>>> Thank you!
>>> Dimitri
>>>
>>>
>>> On Wed, Aug 4, 2010 at 2:39 PM, Dimitri Liakhovitski
>>> <dimitri.liakhovitski at gmail.com> wrote:
>>>>
>>>> Thank you very much, everyone!
>>>> Dimitri
>>>>
>>>> On Wed, Aug 4, 2010 at 2:10 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>>
>>>>> On Aug 4, 2010, at 1:42 PM, Dimitri Liakhovitski wrote:
>>>>>
>>>>>> I am sorry, I'd like to split my column ("names") so that the
>>>>>> beginning of each string ("X..") is gone and only the rest of the
>>>>>> text is left.
>>>>>
>>>>> I could not tell whether it was the string "X.." or the pattern
>>>>> "X.." that was your goal for matching and removal.
>>>>>>
>>>>>> x<-data.frame(names=c("X..aba","X..abb","X..abc","X..abd"))
>>>>>> x$names<-as.character(x$names)
>>>>>
>>>>> a) Instead of "names", which is a heavily used function name, use
>>>>> something more specific. Otherwise you get:
>>>>>>
>>>>>> names(x)
>>>>>
>>>>> "names" # and thereby avoid list comments about canines.
>>>>>
>>>>> b) Instead of letting data.frame() turn the character vector into a
>>>>> factor and then coercing it back to character, use
>>>>> stringsAsFactors = FALSE.
>>>>>
>>>>>> x<-data.frame(nam1=c("X..aba","X..abb","X..abc","X..abd"),
>>>>>> stringsAsFactors=FALSE)
>>>>>
>>>>> #This is the pattern version:
>>>>>
>>>>>> x$nam1 <- gsub("X..",'', x$nam1)
>>>>>> x
>>>>>
>>>>> nam1
>>>>> 1 aba
>>>>> 2 abb
>>>>> 3 abc
>>>>> 4 abd
>>>>>
>>>>> This is the string version:
>>>>>>
>>>>>> x<-data.frame(nam1=c("X......aba","X.y.abb","X..abc","X..abd"),
>>>>>> stringsAsFactors=FALSE)
>>>>>> x$nam1 <- gsub("X\\.+",'', x$nam1)
>>>>>> x
>>>>>
>>>>> nam1
>>>>> 1 aba
>>>>> 2 y.abb
>>>>> 3 abc
>>>>> 4 abd
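
The difference between treating "X.." as a pattern and as a literal string can be seen in a short sketch. The three test strings below are made up for illustration and are not from the original data.

```r
# Hypothetical inputs chosen to expose the difference.
nam1 <- c("X..aba", "X.y.abb", "Xylabc")

# As a regular expression, "X.." means X followed by ANY two characters.
as_regex <- sub("X..", "", nam1)

# With fixed = TRUE, only the literal characters X, dot, dot are matched.
as_literal <- sub("X..", "", nam1, fixed = TRUE)

# An anchored, escaped regular expression removes only a leading "X..".
as_anchored <- sub("^X\\.\\.", "", nam1)
```

With these inputs, as_regex is "aba", ".abb", "abc", while as_literal and as_anchored both leave the second and third strings unchanged.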
>>>>>
>>>>>
>>>>>> (x)
>>>>>> str(x)
>>>>>>
>>>>>> Can't figure out how to apply strsplit in this situation -
>>>>>> without
>>>>>> using a loop. I hope it's possible to do it without a loop - is
>>>>>> it?
>>>>>
>>>>> --
David Winsemius, MD
West Hartford, CT