[R] splitting multiple data in one column into multiple rows with one entry per column

Sun Jul 26 23:14:53 CEST 2009

Try this:

> x <- read.table(textConnection("snp  ensembl_gene_id
+  rs8032583
+  rs1071600                                 ENSG00000101605
+  rs13406898                                 ENSG00000167165
+  rs7030479                                 ENSG00000107249
+  rs1244414                                 ENSG00000165629
+  rs1005636                                 ENSG00000230681
+  rs927913                 ENSG00000151655;ENSG00000227546
+  rs4832680
+  rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817
+  rs7035549
+  rs12707538                                 ENSG00000186472"),
header=TRUE, fill=TRUE)
> closeAllConnections()
> x.new <- do.call(rbind, apply(x, 1, function(.row){
+     .ids <- unlist(strsplit(.row[2], ';'))
+     # check for no data in second column; substitute a blank
+     if (length(.ids) == 0) return(cbind(.row[1], ""))
+     else return(cbind(.row[1], .ids))
+ }))
> x.new
                              .ids
snp              "rs8032583"  ""
snp              "rs1071600"  "ENSG00000101605"
snp              "rs13406898" "ENSG00000167165"
snp              "rs7030479"  "ENSG00000107249"
snp              "rs1244414"  "ENSG00000165629"
snp              "rs1005636"  "ENSG00000230681"
ensembl_gene_id1 "rs927913"   "ENSG00000151655"
ensembl_gene_id2 "rs927913"   "ENSG00000227546"
snp              "rs4832680"  ""
ensembl_gene_id1 "rs4435168"  "ENSG00000229164"
ensembl_gene_id2 "rs4435168"  "ENSG00000225227"
ensembl_gene_id3 "rs4435168"  "ENSG00000211817"
snp              "rs7035549"  ""
snp              "rs12707538" "ENSG00000186472"
>
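
If the real table has more columns than snp, a variation that may be easier to extend is to stay with a data.frame and repeat whole rows, rather than building a character matrix with apply(). The following is only a sketch under assumptions: expand_ids is a hypothetical helper name, the note column is a made-up placeholder for the "other columns", and only three of the rows from the example are used.

```r
# Sketch only: 'expand_ids' and the 'note' column are illustrative
# assumptions, not part of the original post or of biomaRt.
x <- data.frame(
  snp = c("rs8032583", "rs927913", "rs4435168"),
  ensembl_gene_id = c("",
                      "ENSG00000151655;ENSG00000227546",
                      "ENSG00000229164;ENSG00000225227;ENSG00000211817"),
  note = c("a", "b", "c"),  # placeholder for any additional columns
  stringsAsFactors = FALSE
)

expand_ids <- function(df, col, sep = ";") {
  # one character vector of ids per input row
  ids <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)
  # strsplit("") yields character(0); keep such rows with an empty id
  ids[lengths(ids) == 0] <- ""
  # repeat each row once per id, so all other columns are replicated
  out <- df[rep(seq_len(nrow(df)), lengths(ids)), , drop = FALSE]
  out[[col]] <- unlist(ids)
  rownames(out) <- NULL
  out
}

x.long <- expand_ids(x, "ensembl_gene_id")
```

Because x.long is still a data.frame with one gene id per row, it should be usable directly in a later merge(x.long, other, by = "ensembl_gene_id"), where other stands for whichever table is to be joined.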

On Sun, Jul 26, 2009 at 3:26 PM, Felix Müller-Sarnowski
<drflxms at googlemail.com> wrote:
> Dear R colleagues,
>
> I annotated a list of single nucleotide polymorphisms (SNPs) with the
> corresponding genes using biomaRt. The result is the following
> data.frame (pasted from R):
>
> snp                                 ensembl_gene_id
> 1      rs8032583
> 2      rs1071600                                 ENSG00000101605
> 3      rs13406898                                 ENSG00000167165
> 4      rs7030479                                 ENSG00000107249
> 5      rs1244414                                 ENSG00000165629
> 6      rs1005636                                 ENSG00000230681
> 7      rs927913                 ENSG00000151655;ENSG00000227546
> 8      rs4832680
> 9      rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817
> 10     rs7035549
> 11     rs12707538                                 ENSG00000186472
>
> As you can see, the SNP with the identifier rs4435168 corresponds to 3
> gene ids, and rs927913 corresponds to 2 gene ids. As I'd like to join
> several data.frames on ensembl_gene_id later on, I'd like to split
> entries with multiple gene identifiers into rows with only one Ensembl
> gene identifier each. For rs4435168, for example, it should look like
> this (faked output):
>
> snp                   ensembl_gene_id
> ...
> 9      rs4435168 ENSG00000229164
> 10    rs4435168 ENSG00000225227
> 11    rs4435168 ENSG00000211817
> ...
>
> This is just a simple example. In the end there will be many other
> columns, which should be replicated like the snp column.
>
> Does anyone know how to do this? I tried strsplit, which nicely splits
> the multiple entries in column ensembl_gene_id. But how do I go on
> from there?
>
> I'd appreciate any kind of help very much!
> Best regards from Munich,
> Felix
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?