[R] spliting first 10 words in a string
David Winsemius
dwinsemius at comcast.net
Mon Nov 1 23:32:17 CET 2010
On Nov 1, 2010, at 5:52 PM, Phil Spector wrote:
> -
> Does this example do what you want?
>
>> mysentences = c('Here is a sentence that has a bunch of words in it',
>>                 'Here is another sentence that also has a bunch of words',
>>                 'I have yet another sentence and it also has a whole bunch of words')
>> data.frame(mysentences,
>>            do.call(rbind, lapply(strsplit(mysentences, ' +'), '[', 1:10)))
>
>                                                          mysentences   X1   X2
> 1                 Here is a sentence that has a bunch of words in it Here   is
> 2            Here is another sentence that also has a bunch of words Here   is
> 3 I have yet another sentence and it also has a whole bunch of words    I have
>        X3       X4       X5   X6  X7    X8    X9   X10
> 1       a sentence     that  has   a bunch    of words
> 2 another sentence     that also has     a bunch    of
> 3     yet  another sentence  and  it  also   has     a
Matevž;
Be on the alert for what the data.frame function does with character
vectors. Unless you forbid it from doing so, it will convert any
character vector to a factor. (A major source of confusion for R
newbies.) In the above version you could prevent this in Phil's
solution by:
data.frame(mysentences,
           do.call(rbind, lapply(strsplit(mysentences, ' +'), '[', 1:10)),
           stringsAsFactors=FALSE)
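A small sketch of the difference, using a made-up two-sentence vector (the names s, first3, as_factor and as_char are illustrative and not from this thread). The default for stringsAsFactors was TRUE in R versions before 4.0.0 and is FALSE from 4.0.0 on, so the argument is set explicitly in both calls here:

```r
# Hypothetical example data, not taken from the thread
s <- c("one two three", "four five six")

# Split on runs of spaces and keep the first three words of each sentence
first3 <- do.call(rbind, lapply(strsplit(s, " +"), "[", 1:3))

# Same inputs, once with and once without conversion to factor
as_factor <- data.frame(s, first3, stringsAsFactors = TRUE)
as_char   <- data.frame(s, first3, stringsAsFactors = FALSE)

sapply(as_factor, class)  # every column is reported as "factor"
sapply(as_char, class)    # every column is reported as "character"
```

Setting the argument explicitly keeps the result the same regardless of which R version runs the code.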
Or if cbind were applied to my solution at the end of this email:
cbind(worddf, t(sapply(strsplit(worddf$words, " "), "[", 1:10) ) ,
stringsAsFactors=FALSE)
> str( cbind(worddf, t(sapply(strsplit(worddf$words, " "), "[", 1:10) ),
             stringsAsFactors=FALSE) )
'data.frame': 3 obs. of 11 variables:
$ words: chr "I have a columnn with text that has quite a few words in it." "I would like to split these words in separate columns" "but just first ten words in the string. Is that possible in R?"
$ 1 : chr "I" "I" "but"
$ 2 : chr "have" "would" "just"
$ 3 : chr "a" "like" "first"
$ 4 : chr "columnn" "to" "ten"
$ 5 : chr "with" "split" "words"
$ 6 : chr "text" "these" "in"
$ 7 : chr "that" "words" "the"
$ 8 : chr "has" "in" "string."
$ 9 : chr "quite" "separate" "Is"
$ 10 : chr "a" "columns" "that"
cbind.data.frame is the method that would be invoked for that operation.
This result has the disadvantage that the column names will need to be
enclosed in quotes or backticks to access them with the "$" function,
since they start with numerals.
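One way around the quoting, sketched with the example data from this thread, is to rename the word columns after the cbind. The names res and first10 and the "w" prefix are my own illustrative choices, not anything the thread prescribes:

```r
# Rebuild the thread's example with character (not factor) columns
words <- c("I have a columnn with text that has quite a few words in it.",
           "I would like to split these words in separate columns",
           "but just first ten words in the string. Is that possible in R?")
worddf <- data.frame(words = words, stringsAsFactors = FALSE)

# First ten words of each sentence, one row per sentence
first10 <- t(sapply(strsplit(worddf$words, " "), "[", 1:10))
res <- cbind(worddf, first10, stringsAsFactors = FALSE)

# Names that start with a digit must be quoted or backticked
res$`1`

# Illustrative fix: give the ten word columns syntactic names w1 ... w10
names(res)[-1] <- paste0("w", 1:10)
res$w1
```

After the rename, res$w1 can be used without any quoting.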
(Or you could just deal with the factor type.)
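If the column is already a factor, as Opis is in the str() listing quoted further down, strsplit() will refuse it because it requires a character vector. A minimal sketch of dealing with that, where the data frame df and its two sentences are a hypothetical stand-in for the real data:

```r
# Hypothetical stand-in for the real data; Opis is deliberately a factor
df <- data.frame(Opis = c("I have a sentence with ten words",
                          "short one"),
                 stringsAsFactors = TRUE)

# Convert to character first, then split and keep the first ten words.
# Sentences with fewer than ten words are padded with NA by "[".
parts <- t(sapply(strsplit(as.character(df$Opis), " +"), "[", 1:10))
colnames(parts) <- paste0("Column", 1:10)

# Opis stays a factor; the new word columns are character
out <- cbind(df, parts, stringsAsFactors = FALSE)
str(out)
```

Naming the matrix columns before the cbind also avoids the numeric column names mentioned above.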
--
David.
>
> - Phil Spector
> Statistical Computing Facility
> Department of Statistics
> UC Berkeley
> spector at stat.berkeley.edu
>
>
> On Mon, 1 Nov 2010, Matevž Pavlič wrote:
>
>> ...I would like to, e.g., split this sentence from the field Opis in
>> the data.frame:
>>
>> Opis : "I have a sentence with ten words", so that it would convert
>> to something like this:
>>
>> Opis : "I have a sentence with ten words"; Column1 : "I";
>> Column2 : "have"; Column3 : "a"; Column4 : "sentence"; Column5 :
>> "with"; Column6 : "ten"; Column7 : "words"
>>
>> ....or in a data.frame something like this (as I understand it):
>>
>> 'data.frame': xx obs. of 12 variables:
>> $ Opis : factor "I have a sentence with ten words";
>> $ Column1 : factor "I";
>> $ Column2 : factor "have";
>> $ Column3 : factor "a";
>> $ Column4 : factor "sentence";
>> $ Column5 : factor "with";
>> $ Column6 : factor "ten";
>> $ Column7 : factor "words"
>>
>> Hope that explains it better, I am still having some trouble
>> understanding R and all..
>> m
>>
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Matevž Pavlič
>> Sent: Monday, November 01, 2010 10:34 PM
>> To: David Winsemius
>> Cc: r-help at r-project.org
>> Subject: Re: [R] spliting first 10 words in a string
>>
>> Hi,
>>
>> I am sorry, will try to be more exact from now on...
>>
>> I have a data.frame with a field called Opis. It contains
>> sentences that I would like to split into words or fields in the
>> data.frame...when I say columns I mean as in an Excel table. I would
>> like to split "Opis" into ten fields from the first ten words in
>> the Opis field.
>> Here is an example of my data.frame.
>>
>> 'data.frame': 22928 obs. of 12 variables:
>> $ VrtinaID        : int 1 1 1 1 2 2 2 2 2 2 ...
>> $ ZapStev         : int 1 2 3 4 1 2 3 4 5 6 ...
>> $ GlobinaOd       : num 0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
>> $ GlobinaDo       : num 0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
>> $ Opis            : Factor w/ 12754 levels "","(MIVKA) DROBEN MELJAST PESEK, GOST, SIVORJAV",..: 2060 11588 2477 11660 7539 3182 7884 9123 2500 4756 ...
>> $ ACklasifikacija : Factor w/ 290 levels "","(CL)","(CL)/(SC)",..: 154 125 101 101 NA 106 125 80 106 101 ...
>> $ GeolNastOd      : num 0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
>> $ GeolNastDo      : num 0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
>> $ GeolNastOpis    : Factor w/ 113 levels "","B. M. S.",..: 56 53 53 53 56 53 53 53 53 53 ...
>> $ NacinVrtanjaOd  : num 0e+00 1e+09 1e+09 1e+09 0e+00 ...
>> $ NacinVrtanjaDo  : num 1.1e+01 1.0e+09 1.0e+09 1.0e+09 1.0e+01 ...
>> $ NacinVrtanjaOpis: Factor w/ 43 levels "","H. N.","IZKOP",..: 26 1 1 1 26 1 1 1 1 1 ...
>>
>> Hope that explains better...
>> Thank you, m
>>
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Monday, November 01, 2010 10:13 PM
>> To: Matevž Pavlič
>> Cc: r-help at r-project.org
>> Subject: Re: [R] spliting first 10 words in a string
>>
>>
>> On Nov 1, 2010, at 4:39 PM, Matevž Pavlič wrote:
>>
>>> Hi all,
>>>
>>>
>>>
>>> I have a columnn with text that has quite a few words in it. I would
>>> like to split these words in separate columns, but just first ten
>>> words in the string. Is that possible in R?
>>>
>>>
>>
>> Not sure what a column means to you. It's not a precisely defined R
>> type or class. (And you are requested to offer a concrete example
>> rather than making us guess.)
>>
>> > words <- "I have a columnn with text that has quite a few words in it. I would like to split these words in separate columns, but just first ten words in the string. Is that possible in R?"
>>
>> > strsplit(words, " ")[[1]][1:10]
>>  [1] "I"       "have"    "a"       "columnn" "with"    "text"    "that"    "has"     "quite"   "a"
>>
>>
>> Or if in a dataframe:
>>
>> > words <- c("I have a columnn with text that has quite a few words in it.",
>>              "I would like to split these words in separate columns",
>>              "but just first ten words in the string. Is that possible in R?")
>> > # strsplit() needs character input, so keep the column from becoming a factor
>> > worddf <- data.frame(words=words, stringsAsFactors=FALSE)
>>
>> > t(sapply(strsplit(worddf$words, " "), "[", 1:10) )
>>      [,1]  [,2]    [,3]    [,4]      [,5]    [,6]    [,7]    [,8]      [,9]       [,10]
>> [1,] "I"   "have"  "a"     "columnn" "with"  "text"  "that"  "has"     "quite"    "a"
>> [2,] "I"   "would" "like"  "to"      "split" "these" "words" "in"      "separate" "columns"
>> [3,] "but" "just"  "first" "ten"     "words" "in"    "the"   "string." "Is"       "that"
>>
>>
>> --
>> David Winsemius, MD
>> West Hartford, CT
>>
David Winsemius, MD
West Hartford, CT