[R] how to split row elements [1] and [2] of a string variable A via strsplit and sapply

Aldi aldi at dsgmail.wustl.edu
Fri Sep 11 17:11:43 CEST 2015


Thank you Jim and Bert for your suggestions.

Following is the final version used:
### Original tiny test data from Aldi Kraja, 9.11.2015.
### Purpose: split A into elements 1 and 2; the third element of A is not
### needed. Assign elements 1 and 2 to columns C and D of the same data.frame.
### This does the work that the SAS SCAN function would have done:
### C=SCAN(x,1,":") ; D=SCAN(x,2,":") ;
### Jim Holtman suggested

### temp <- strsplit(x$A, ":")
### x$C <- sapply(temp, '[[', 1)
### x$D <- sapply(temp, '[[', 2)

### Bert Gunter suggested:
### do.call(rbind,strsplit(x[[1]],":"))[,-3]

### Start of script: a full R solution:

x <- read.table(text = "A          B
   1:29439275 0.46773514
   5:85928892 0.81283052
   10:128341232 0.09332543
   1:106024283:ID 0.36307805
   3:62707519 0.42657952
   2:80464120 0.89125094", header = TRUE, as.is = TRUE)
  

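x$A <- as.character(x$A)      # no-op here (as.is = TRUE), kept as a safeguard
temp <- strsplit(x$A, ":")    # list with one character vector per row
x$C <- sapply(temp, '[[', 1)  # first element of each split
x$D <- sapply(temp, '[[', 2)  # second element of each split
x$C <- as.numeric(x$C)        # convert from character to numeric
x$D <- as.numeric(x$D)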
### Final results:
x
### end of the script
#               A          B  C         D
#1     1:29439275 0.46773514  1  29439275
#2     5:85928892 0.81283052  5  85928892
#3   10:128341232 0.09332543 10 128341232
#4 1:106024283:ID 0.36307805  1 106024283
#5     3:62707519 0.42657952  3  62707519
#6     2:80464120 0.89125094  2  80464120
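
A possible variant of the script above, offered only as a sketch: sapply(temp, '[[', 2) stops with a "subscript out of bounds" error if some value of A contains no ":" at all. Assuming such rows should give NA instead of an error, single-bracket indexing inside vapply pads the missing piece with NA. The helper name first_two is made up for this illustration and is not part of base R.

```r
# Same test data as in the script above.
x <- read.table(text = "A B
1:29439275 0.46773514
5:85928892 0.81283052
10:128341232 0.09332543
1:106024283:ID 0.36307805
3:62707519 0.42657952
2:80464120 0.89125094", header = TRUE, as.is = TRUE)

# Hypothetical helper: keep the first two fields of each split value.
# p[1:2] never errors; it returns NA for a field that is absent.
first_two <- function(s, sep = ":") {
  parts <- strsplit(s, sep, fixed = TRUE)
  # vapply checks that every result is a character vector of length 2
  # and returns a 2 x n matrix, so transpose to one row per input value.
  t(vapply(parts, function(p) p[1:2], character(2)))
}

cd <- first_two(x$A)
x$C <- as.numeric(cd[, 1])
x$D <- as.numeric(cd[, 2])
x
```

On the test data this should give the same C and D as the script above; the difference only shows for values such as "7", where D becomes NA.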
With best wishes,

Aldi
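
P.S. For the archive, a small sketch of why D came out as NA in the loop from the original question (quoted at the bottom of this thread). As far as I can tell, x1$AA[i] is a single slot, so assigning the two split values to it keeps only the first one, and x[2] then has nothing to pick up. The names aa, parts and d below are made up for this illustration.

```r
# One value of A, split the way the original loop did it.
parts <- as.numeric(unlist(strsplit("5:85928892", ":", fixed = TRUE)))
length(parts)   # both fields exist at this point

# Storing two values in a single slot keeps only the first one.
# R warns that the replacement length does not match; the warning
# is silenced here only to keep the illustration quiet.
aa <- numeric(6)
suppressWarnings(aa[2] <- parts)
aa[2]

# The second field is already gone, so element 2 of that slot is NA.
d <- sapply(aa[2], function(v) v[2])
d
```

Splitting the whole column once with strsplit and indexing the resulting list, as in the final script, avoids the problem because each list element keeps all of its pieces.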


On 9/10/2015 1:35 PM, Bert Gunter wrote:
> ...
> Alternatively, you can avoid the looping (i.e. sapply) altogether by:
>
> do.call(rbind,strsplit(x[[1]],":"))[,-3]
>
>
>       [,1] [,2]
> [1,] "1"  "29439275"
> [2,] "5"  "85928892"
> [3,] "10" "128341232"
> [4,] "1"  "106024283"
> [5,] "3"  "62707519"
> [6,] "2"  "80464120"
>
> These can then be added to the existing frame, converted to numeric, etc.
>
> Cheers,
> Bert
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>     -- Clifford Stoll
>
>
> On Thu, Sep 10, 2015 at 11:05 AM, jim holtman <jholtman at gmail.com> wrote:
>> try this:
>>
>>
>>> x <- read.table(text = "A          B
>> +  1:29439275 0.46773514
>> +  5:85928892 0.81283052
>> +  10:128341232 0.09332543
>> +  1:106024283:ID 0.36307805
>> +  3:62707519 0.42657952
>> +  2:80464120 0.89125094", header = TRUE, as.is = TRUE)
>>> temp <- strsplit(x$A, ":")
>>> x$C <- sapply(temp, '[[', 1)
>>> x$D <- sapply(temp, '[[', 2)
>>>
>>> x
>>                 A          B  C         D
>> 1     1:29439275 0.46773514  1  29439275
>> 2     5:85928892 0.81283052  5  85928892
>> 3   10:128341232 0.09332543 10 128341232
>> 4 1:106024283:ID 0.36307805  1 106024283
>> 5     3:62707519 0.42657952  3  62707519
>> 6     2:80464120 0.89125094  2  80464120
>>
>>
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Thu, Sep 10, 2015 at 1:46 PM, aldi <aldi at wustl.edu> wrote:
>>
>>> Hi,
>>> I have a data.frame x1, in which a variable A needs to be split into
>>> element 1 and element 2, where the separator is ":". Sometimes there are
>>> three elements in A, but I do not need the third element.
>>>
>>> Since R does not have a SCAN function as in SAS, C=scan(A,1,":");
>>> D=scan(A,2,":");
>>> I am using a combination of strsplit and sapply. If I do not use the
>>> index [i], then R captures the full vector. Instead I need to capture,
>>> row by row, the first and the second element and create two new
>>> variables C and D from them.
>>> Right now, inside the loop over i, C is captured correctly, but D is
>>> missing because the variable AA does not have it. Any suggestions?
>>> Thank you in advance, Aldi
>>>
>>> A          B
>>> 1:29439275 0.46773514
>>> 5:85928892 0.81283052
>>> 10:128341232 0.09332543
>>> 1:106024283:ID 0.36307805
>>> 3:62707519 0.42657952
>>> 2:80464120 0.89125094
>>>
>>> x1<-read.table(file='./test.txt',head=T,sep='\t')
>>> x1$A <- as.character(x1$A)
>>>
>>> for(i in 1:length(x1$A)){
>>>
>>> x1$AA[i] <- as.numeric(unlist(strsplit(x1$A[i],':')))
>>>
>>> x1$C[i] <- sapply(x1$AA[i],function(x)x[1])
>>> x1$D[i] <- sapply(x1$AA[i],function(x)x[2])
>>> }
>>>
>>> x1
>>>
>>>
>>>
>>>   > x1
>>>                  A          B AA  C  D
>>> 1     1:29439275 0.46773514  1  1 NA
>>> 2     5:85928892 0.81283052  5  5 NA
>>> 3   10:128341232 0.09332543 10 10 NA
>>> 4 1:106024283:ID 0.36307805  1  1 NA
>>> 5     3:62707519 0.42657952  3  3 NA
>>> 6     2:80464120 0.89125094  2  2 NA
>>>
>>>
>>> --
>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>

