[R] Transform a data.frame with "; " sep column and another one in a new one with the same two column but with repetitions
John McKown
john.archie.mckown at gmail.com
Sat Jul 5 04:35:58 CEST 2014
On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
<joao.patricio at gmx.pt> wrote:
> Hi,
>
> I've been trying to solve this issue but with no success.
>
> I have some data like this:
>
> 1 > TC WC
> 2 > 0 Instruments & Instrumentation; Nuclear Science & Technology;
> Physics, Particles & Fields; Spectroscopy
> 3 > 0 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
> Physics, Applied
> 4 > 2 Physics, Nuclear; Physics, Particles & Fields
> 5 > 0 Chemistry, Inorganic & Nuclear
> 6 > 2 Chemistry, Physical; Materials Science, Multidisciplinary;
> Metallurgy & Metallurgical Engineering
>
> And I need to have this:
>
> 1 > TC WC
> 2 > 0 Instruments & Instrumentation
> 2 > 0 Nuclear Science & Technology
> 2 > 0 Physics, Particles & Fields
> 2 > 0 Spectroscopy
> 3 > 0 Nanoscience & Nanotechnology
> 3 > 0 Materials Science, Multidisciplinary
> 3 > 0 Physics, Applied
> 4 > 2 Physics, Nuclear
> 4 > 2 Physics, Particles & Fields
> 5 > 0 Chemistry, Inorganic & Nuclear
> 6 > 2 Chemistry, Physical
> 6 > 2 Materials Science, Multidisciplinary
> 6 > 2 Metallurgy & Metallurgical Engineering
>
> This means repeating the row for each element in WC while keeping the same
> value in TC. The goal is to sum TC by WC when a row lists multiple WC
> values.
>
> I've tried to separate the column using strsplit but then I cannot keep
> track of TC.
>
> thanks in advance.
> --
> João Azevedo Patrício
The best that I've come up with, which seems to give the desired result
for the example data given:
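splitAtSemiColon <- function(input) {
    # split each WC entry at the semicolons; one character vector per row
    z <- strsplit(input$WC, ';')
    # repeat each TC once per piece so it lines up with the unlisted WC values
    # (plain data.frame, as in the transcript, so no extra package is needed)
    result <- data.frame(TC = rep(input$TC, sapply(z, length)), WC = unlist(z))
    return(result)
}
flatted.data <- splitAtSemiColon(original.data)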
<transcript>
> print(original.data,right=FALSE)
TC
1 0
2 0
3 2
4 0
5 2
WC
1 Instruments & Instrumentation; Nuclear Science & Technology;
Physics, Particles & Fields; Spectroscopy
2 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
Physics, Applied
3 Physics, Nuclear; Physics, Particles & Fields
4 Chemistry, Inorganic & Nuclear
5 Chemistry, Physical; Materials Science, Multidisciplinary;
Metallurgy & Metallurgical Engineering
>
> print(splitAtSemiColon,right=FALSE);
function(x) {
z=strsplit(x$WC,';');
result3=data.frame(TC=rep(x$TC,sapply(z,length)),WC=unlist(z));
return(result3);
}
> print(splitAtSemiColon(original.data),right=FALSE);
TC WC
1 0 Instruments & Instrumentation
2 0 Nuclear Science & Technology
3 0 Physics, Particles & Fields
4 0 Spectroscopy
5 0 Nanoscience & Nanotechnology
6 0 Materials Science, Multidisciplinary
7 0 Physics, Applied
8 2 Physics, Nuclear
9 2 Physics, Particles & Fields
10 0 Chemistry, Inorganic & Nuclear
11 2 Chemistry, Physical
12 2 Materials Science, Multidisciplinary
13 2 Metallurgy & Metallurgical Engineering
Note that I still have a problem in that the WC data can have leading
and/or trailing blanks due to the way that strsplit works. The easiest
way to fix this is to use the str_trim() function from the stringr
package.
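
For what it's worth, here is a minimal sketch of that cleanup which also
does the per-WC sum the original question was after. It sticks to base R
(gsub() instead of stringr) so nothing extra has to be installed. The name
splitAndTrim is made up for illustration, the data is just a three-row
subset of the example above, and the as.character() call assumes WC might
have been read in as a factor.

```r
# Three rows taken from the example data in the question (a subset, for brevity).
original.data <- data.frame(
  TC = c(0, 2, 0),
  WC = c(
    "Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy",
    "Physics, Nuclear; Physics, Particles & Fields",
    "Chemistry, Inorganic & Nuclear"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split WC at ';', repeat TC to match, then trim blanks.
splitAndTrim <- function(input) {
  # as.character() is a guard in case WC is a factor rather than character
  z <- strsplit(as.character(input$WC), ";", fixed = TRUE)
  result <- data.frame(
    TC = rep(input$TC, sapply(z, length)),
    WC = unlist(z),
    stringsAsFactors = FALSE
  )
  # remove leading and trailing whitespace left behind by the split
  result$WC <- gsub("^\\s+|\\s+$", "", result$WC)
  result
}

flat <- splitAndTrim(original.data)

# Sum TC within each distinct WC value.
totals <- aggregate(TC ~ WC, data = flat, FUN = sum)
print(totals, right = FALSE)
```

Without the trimming step, " Physics, Particles & Fields" (with a leading
blank) and "Physics, Particles & Fields" would be counted as two different
categories, so the sums would come out wrong. In this subset that category
appears once with TC 0 and once with TC 2, so its total is 2.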
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan
Maranatha! <><
John McKown