[R] splitting multiple data in one column into multiple rows with one entry per column
jim holtman
jholtman at gmail.com
Sun Jul 26 23:14:53 CEST 2009
Try this:
> x <- read.table(textConnection("snp ensembl_gene_id
+ rs8032583
+ rs1071600 ENSG00000101605
+ rs13406898 ENSG00000167165
+ rs7030479 ENSG00000107249
+ rs1244414 ENSG00000165629
+ rs1005636 ENSG00000230681
+ rs927913 ENSG00000151655;ENSG00000227546
+ rs4832680
+ rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817
+ rs7035549
+ rs12707538 ENSG00000186472"),
+     header=TRUE, fill=TRUE)
> closeAllConnections()
> x.new <- do.call(rbind, apply(x, 1, function(.row){
+     .ids <- unlist(strsplit(.row[2], ';'))
+     # check for no data in second column; substitute a blank
+     if (length(.ids) == 0) return(cbind(.row[1], ""))
+     else return(cbind(.row[1], .ids))
+ }))
> x.new
                              .ids
snp              "rs8032583"  ""
snp              "rs1071600"  "ENSG00000101605"
snp              "rs13406898" "ENSG00000167165"
snp              "rs7030479"  "ENSG00000107249"
snp              "rs1244414"  "ENSG00000165629"
snp              "rs1005636"  "ENSG00000230681"
ensembl_gene_id1 "rs927913"   "ENSG00000151655"
ensembl_gene_id2 "rs927913"   "ENSG00000227546"
snp              "rs4832680"  ""
ensembl_gene_id1 "rs4435168"  "ENSG00000229164"
ensembl_gene_id2 "rs4435168"  "ENSG00000225227"
ensembl_gene_id3 "rs4435168"  "ENSG00000211817"
snp              "rs7035549"  ""
snp              "rs12707538" "ENSG00000186472"
>
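
Note that the result above is a character matrix with only the two columns. Since you mention that there will be many other columns to replicate, a variant that repeats whole data frame rows may be more convenient. This is only a sketch under assumptions: the shortened example data and the `extra` column are made up for illustration (they stand in for your additional columns), and I have not tried it on your real data.

```r
# Shortened example; 'extra' is a made-up stand-in for the additional columns.
x <- data.frame(
  snp = c("rs8032583", "rs1071600", "rs927913", "rs4435168"),
  ensembl_gene_id = c("",
                      "ENSG00000101605",
                      "ENSG00000151655;ENSG00000227546",
                      "ENSG00000229164;ENSG00000225227;ENSG00000211817"),
  extra = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Split each entry on ';' (as.character() guards against factor columns).
ids <- strsplit(as.character(x$ensembl_gene_id), ";", fixed = TRUE)

# Rows without a gene id split to character(0); keep them as one blank entry.
ids[vapply(ids, length, integer(1)) == 0] <- ""

# Number of output rows needed for each input row.
n <- vapply(ids, length, integer(1))

# Repeat every input row n times, so all other columns are replicated,
# then overwrite the id column with the individual ids.
x.long <- x[rep(seq_len(nrow(x)), times = n), , drop = FALSE]
x.long$ensembl_gene_id <- unlist(ids)
rownames(x.long) <- NULL
x.long
```

The result stays a data.frame with one ensembl_gene_id per row, so it should be usable directly as input to merge() for the join you describe.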
On Sun, Jul 26, 2009 at 3:26 PM, Felix Müller-Sarnowski <drflxms at googlemail.com> wrote:
> Dear R colleagues,
>
> I annotated a list of single nucleotide polymorphisms (SNPs) with the
> corresponding genes using biomaRt. The result is the following
> data.frame (pasted from R):
>
>    snp        ensembl_gene_id
> 1  rs8032583
> 2  rs1071600  ENSG00000101605
> 3  rs13406898 ENSG00000167165
> 4  rs7030479  ENSG00000107249
> 5  rs1244414  ENSG00000165629
> 6  rs1005636  ENSG00000230681
> 7  rs927913   ENSG00000151655;ENSG00000227546
> 8  rs4832680
> 9  rs4435168  ENSG00000229164;ENSG00000225227;ENSG00000211817
> 10 rs7035549
> 11 rs12707538 ENSG00000186472
>
> As you can see, the SNP with the identifier rs4435168 corresponds to 3
> gene ids, and rs927913 corresponds to 2 gene ids. As I'd like to join
> several data.frames on ensembl_gene_id later on, I'd like to split
> entries with multiple gene identifiers into rows with only one ensembl
> gene identifier each. So for the example of rs4435168 it should look
> like this (faked output):
>
>    snp       ensembl_gene_id
> ...
> 9  rs4435168 ENSG00000229164
> 10 rs4435168 ENSG00000225227
> 11 rs4435168 ENSG00000211817
> ...
>
> This is just a simple example. Finally there will be a lot of other
> columns, which should be replicated like the snp column.
>
> Does anyone know how to do this? I tried strsplit, which nicely splits
> the multiple entries in column ensembl_gene_id. But how do I go on?
>
> I'd appreciate any kind of help very much!
> Best regards from Munich,
> Felix
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?