[R] Transform a data.frame with "; " sep column and another one into a new one with the same two columns but with repetitions

João Azevedo Patrício joao.patricio at gmx.pt
Mon Jul 7 11:49:46 CEST 2014


On 05-07-2014 03:35, John McKown wrote:
> On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
> <joao.patricio at gmx.pt> wrote:
>> Hi,
>>
>> I've been trying to solve this issue but with no success.
>>
>> I have some data like this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation; Nuclear Science & Technology;
>> Physics, Particles & Fields; Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
>> Physics, Applied
>> 4 > 2   Physics, Nuclear; Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical; Materials Science, Multidisciplinary;
>> Metallurgy & Metallurgical Engineering
>>
>> And I need to have this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation
>> 2 > 0   Nuclear Science & Technology
>> 2 > 0   Physics, Particles & Fields
>> 2 > 0   Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology
>> 3 > 0   Materials Science, Multidisciplinary
>> 3 > 0   Physics, Applied
>> 4 > 2   Physics, Nuclear
>> 4 > 2   Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical
>> 6 > 2   Materials Science, Multidisciplinary
>> 6 > 2   Metallurgy & Metallurgical Engineering
>>
>> This means repeating the row for each element in WC while keeping the same
>> value in TC. The goal is to sum TC by WC when a row lists several WC
>> categories.
>>
>> I've tried to separate the column using strsplit, but then I cannot keep
>> track of TC.
>>
>> thanks in advance.
>> --
>> João Azevedo Patrício
> Best that I've come up with, which seems to give the result desired
> from the example data given.
>
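> splitAtSemiColon <- function(input) {
>      # split each WC string at ';' (as.character guards against factors)
>      z <- strsplit(as.character(input$WC),';');
>      # repeat each TC once per piece; data.frame, as in the transcript below
>      result <- data.frame(TC=rep(input$TC,sapply(z,length)), WC=unlist(z));
>      return(result);
> }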
>
> flatted.data <- splitAtSemiColon(original.data);
>
> <transcript>
>> print(original.data,right=FALSE)
>    TC
> 1 0
> 2 0
> 3 2
> 4 0
> 5 2
>    WC
> 1 Instruments & Instrumentation; Nuclear Science & Technology;
> Physics, Particles & Fields; Spectroscopy
> 2 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
> Physics, Applied
> 3 Physics, Nuclear; Physics, Particles & Fields
> 4 Chemistry, Inorganic & Nuclear
> 5 Chemistry, Physical; Materials Science, Multidisciplinary;
> Metallurgy & Metallurgical Engineering
>> print(splitAtSemiColon,right=FALSE);
> function(x) {
>      z=strsplit(x$WC,';');
>      result3=data.frame(TC=rep(x$TC,sapply(z,length)),WC=unlist(z));
>      return(result3);
> }
>> print(splitAtSemiColon(original.data),right=FALSE);
>     TC WC
> 1  0  Instruments & Instrumentation
> 2  0   Nuclear Science & Technology
> 3  0   Physics, Particles & Fields
> 4  0   Spectroscopy
> 5  0  Nanoscience & Nanotechnology
> 6  0   Materials Science, Multidisciplinary
> 7  0   Physics, Applied
> 8  2  Physics, Nuclear
> 9  2   Physics, Particles & Fields
> 10 0  Chemistry, Inorganic & Nuclear
> 11 2  Chemistry, Physical
> 12 2   Materials Science, Multidisciplinary
> 13 2   Metallurgy & Metallurgical Engineering
>
> Note that I still have a problem in that the WC data can have leading
> and/or trailing blanks due to the way that strsplit works. The easiest
> way to fix this is to use the str_trim() function from the stringr
> package.
>
>
Yes, I also have that problem. I tried to work it out using "sub", but it
didn't work at all.
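
A minimal sketch of one way to remove the stray blanks in base R and then get the sum of TC per WC, assuming the data look like the example above. The helper name splitAndTrim and the objects flat and tc.by.wc are made up for illustration. Note that sub() replaces only the first match in each string, so a single pattern covering both ends of the string needs gsub() instead.

```r
# Example data copied from the question (columns TC and WC).
original.data <- data.frame(
  TC = c(0, 0, 2, 0, 2),
  WC = c(
    "Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy",
    "Nanoscience & Nanotechnology; Materials Science, Multidisciplinary; Physics, Applied",
    "Physics, Nuclear; Physics, Particles & Fields",
    "Chemistry, Inorganic & Nuclear",
    "Chemistry, Physical; Materials Science, Multidisciplinary; Metallurgy & Metallurgical Engineering"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split WC at ';', trim each piece, repeat TC to match.
splitAndTrim <- function(input) {
  pieces <- strsplit(as.character(input$WC), ";", fixed = TRUE)
  data.frame(
    # one copy of TC for every piece that came out of the same row
    TC = rep(input$TC, sapply(pieces, length)),
    # gsub (not sub) so that both leading and trailing blanks are removed
    WC = gsub("^\\s+|\\s+$", "", unlist(pieces)),
    stringsAsFactors = FALSE
  )
}

flat <- splitAndTrim(original.data)

# Sum of TC for each distinct WC category.
tc.by.wc <- aggregate(TC ~ WC, data = flat, FUN = sum)
print(tc.by.wc, right = FALSE)
```

Trimming before aggregating matters here: otherwise " Physics, Particles & Fields" and "Physics, Particles & Fields" would be counted as two different categories.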

-- 
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"


