[R] Transform a data.frame with "; " sep column and another one in a new one with the same two column but with repetitions

John McKown john.archie.mckown at gmail.com
Sat Jul 5 04:35:58 CEST 2014


On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
<joao.patricio at gmx.pt> wrote:
> Hi,
>
> I've been trying to solve this issue but with no success.
>
> I have some data like this:
>
> 1 > TC  WC
> 2 > 0   Instruments & Instrumentation; Nuclear Science & Technology;
> Physics, Particles & Fields; Spectroscopy
> 3 > 0   Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
> Physics, Applied
> 4 > 2   Physics, Nuclear; Physics, Particles & Fields
> 5 > 0   Chemistry, Inorganic & Nuclear
> 6 > 2   Chemistry, Physical; Materials Science, Multidisciplinary;
> Metallurgy & Metallurgical Engineering
>
> And I need to have this:
>
> 1 > TC  WC
> 2 > 0   Instruments & Instrumentation
> 2 > 0   Nuclear Science & Technology
> 2 > 0   Physics, Particles & Fields
> 2 > 0   Spectroscopy
> 3 > 0   Nanoscience & Nanotechnology
> 3 > 0   Materials Science, Multidisciplinary
> 3 > 0   Physics, Applied
> 4 > 2   Physics, Nuclear
> 4 > 2   Physics, Particles & Fields
> 5 > 0   Chemistry, Inorganic & Nuclear
> 6 > 2   Chemistry, Physical
> 6 > 2   Materials Science, Multidisciplinary
> 6 > 2   Metallurgy & Metallurgical Engineering
>
> This means repeating the row for each element in WC while keeping the same
> value in TC. The goal is to check how many TC (sum) there are for each WC,
> when a row lists multiple WC values.
>
> I've tried to separate the column using strsplit, but then I cannot keep
> track of TC.
>
> thanks in advance.
> --
> João Azevedo Patrício

Here is the best that I've come up with; it seems to give the desired
result for the example data given.

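splitAtSemiColon <- function(input) {
    # split each WC entry at the semicolons; as.character() guards against a factor column
    z <- strsplit(as.character(input$WC),';');
    # repeat each TC value once per piece of the corresponding WC entry
    result <- data.frame(TC=rep(input$TC,sapply(z,length)), WC=unlist(z));
    return(result);
}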

flatted.data <- splitAtSemiColon(original.data);

<transcript>
> print(original.data,right=FALSE)
  TC
1 0
2 0
3 2
4 0
5 2
  WC
1 Instruments & Instrumentation; Nuclear Science & Technology;
Physics, Particles & Fields; Spectroscopy
2 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
Physics, Applied
3 Physics, Nuclear; Physics, Particles & Fields
4 Chemistry, Inorganic & Nuclear
5 Chemistry, Physical; Materials Science, Multidisciplinary;
Metallurgy & Metallurgical Engineering
>
> print(splitAtSemiColon,right=FALSE);
function(x) {
    z=strsplit(x$WC,';');
    result3=data.frame(TC=rep(x$TC,sapply(z,length)),WC=unlist(z));
    return(result3);
}
> print(splitAtSemiColon(original.data),right=FALSE);
   TC WC
1  0  Instruments & Instrumentation
2  0   Nuclear Science & Technology
3  0   Physics, Particles & Fields
4  0   Spectroscopy
5  0  Nanoscience & Nanotechnology
6  0   Materials Science, Multidisciplinary
7  0   Physics, Applied
8  2  Physics, Nuclear
9  2   Physics, Particles & Fields
10 0  Chemistry, Inorganic & Nuclear
11 2  Chemistry, Physical
12 2   Materials Science, Multidisciplinary
13 2   Metallurgy & Metallurgical Engineering

Note that I still have a problem in that the WC data can have leading
and/or trailing blanks due to the way that strsplit works. The easiest
way to fix this is to use the str_trim() function from the stringr
package.
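
For what it's worth, here is a minimal sketch of how the trimming and the
per-WC sum from the original question might be put together. It is only an
illustration under a few assumptions: the example data is retyped by hand
from the post, splitAndTrim, flat and tc.by.wc are names made up for this
sketch, and the trimming uses a base-R gsub() rather than stringr (str_trim()
should do the same job if stringr is installed).

```r
# Example data retyped from the original post (assumed to match the poster's data).
original.data <- data.frame(
    TC = c(0, 0, 2, 0, 2),
    WC = c(
        "Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy",
        "Nanoscience & Nanotechnology; Materials Science, Multidisciplinary; Physics, Applied",
        "Physics, Nuclear; Physics, Particles & Fields",
        "Chemistry, Inorganic & Nuclear",
        "Chemistry, Physical; Materials Science, Multidisciplinary; Metallurgy & Metallurgical Engineering"
    ),
    stringsAsFactors = FALSE
)

# Hypothetical variant of splitAtSemiColon() that also trims each piece.
splitAndTrim <- function(input) {
    # as.character() guards against WC having been read in as a factor
    z <- strsplit(as.character(input$WC), ";", fixed = TRUE)
    data.frame(
        # repeat each TC value once per piece of its WC entry
        TC = rep(input$TC, sapply(z, length)),
        # strip leading and trailing whitespace left behind by the split
        WC = gsub("^\\s+|\\s+$", "", unlist(z)),
        stringsAsFactors = FALSE
    )
}

flat <- splitAndTrim(original.data)

# Sum TC within each distinct WC value.
tc.by.wc <- aggregate(TC ~ WC, data = flat, FUN = sum)
print(tc.by.wc, right = FALSE)
```

Trimming before aggregating matters here: without it, "Physics, Particles &
Fields" and " Physics, Particles & Fields" would be counted as two different
WC values.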


-- 
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown


