[R] splitting a dataframe in R based on multiple gene names in a specific column
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Fri Aug 25 21:26:34 CEST 2017
If row numbers can be dispensed with, then tidyr makes this easy with
the unnest function:
#####
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(purrr)
library(tidyr)
df.sample.gene<-read.table(
text="Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
header=TRUE,stringsAsFactors=FALSE)
df.sample.out <- (   df.sample.gene
                 %>% mutate( Gene.refGene = strsplit( Gene.refGene
                                                    , ","
                                                    )
                           )
                 %>% unnest( Gene.refGene )
                 )
df.sample.out
#>    Chr     Start       End Ref Alt Func.refGene Gene.refGene
#> 1 chr2  16080996  16080996   C   T ncRNA_exonic       GACAT3
#> 2 chr2 113979920 113979920   C   T ncRNA_exonic    LINC01191
#> 3 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194
#> 4 chr2 131279347 131279347   C   G ncRNA_exonic    LOC440910
#> 5 chr2 223777758 223777758   T   A       exonic        AP1S3
#> 6 chr3  99794575  99794575   G   A       exonic       COL8A1
#> 7 chr3 132601066 132601066   A   G       exonic        ACKR4
#> 8 chr3 132601999 132601999   A   G       exonic        BCDF5
#> 9 chr3 132601999 132601999   A   G       exonic        CDFG6
#####
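The split-then-unnest pair above can probably be collapsed into a single call. The following is only a sketch, assuming an installed tidyr that provides separate_rows() and that the gene names are always separated by a plain comma; the name df.sample.long is made up for this illustration.

```r
library(tidyr)

# Same sample data as above; the first field of each row becomes the row name
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# Split Gene.refGene on commas and emit one row per gene name
# (df.sample.long is a hypothetical name used only in this sketch)
df.sample.long <- separate_rows(df.sample.gene, Gene.refGene, sep = ",")
df.sample.long
```

As with unnest, the original row numbers should not be expected to survive this step.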
On Wed, 23 Aug 2017, Jim Lemon wrote:
> Hi Bogdan,
> Messy, and very specific to your problem:
>
> df.sample.gene<-read.table(
> text="Chr Start End Ref Alt Func.refGene Gene.refGene
> 284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
> 465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
> 525 chr2 223777758 223777758 T A exonic AP1S3
> 626 chr3 99794575 99794575 G A exonic COL8A1
> 643 chr3 132601066 132601066 A G exonic ACKR4
> 655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
> header=TRUE,stringsAsFactors=FALSE)
>
> multgenes<-grep(",",df.sample.gene$Gene.refGene)
> rep_genes<-strsplit(df.sample.gene$Gene.refGene[multgenes],",")
> ngenes<-unlist(lapply(rep_genes,length))
> dup_row<-function(x) {
>  newrows<-x
>  lastcol<-dim(x)[2]
>  rep_genes<-unlist(strsplit(x[,lastcol],","))
>  for(i in 2:length(rep_genes)) newrows<-rbind(newrows,x)
>  newrows$Gene.refGene<-rep_genes
>  return(newrows)
> }
> for(multgene in multgenes)
>  df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
> df.sample.gene<-df.sample.gene[-multgenes,]
> df.sample.gene
>
> I added a second line with multiple genes to make sure that it would
> work with more than one line.
>
> Jim
>
>
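For comparison, the same row duplication can be sketched in base R without a loop, and this variant keeps the original row numbers in an ordinary column and leaves the rows in their original order. The names genes, idx, df.long and orig_row are made up for this illustration, and I am assuming the gene names are always separated by a plain comma.

```r
# Same sample data as above; the first field of each row becomes the row name
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# One character vector of gene names per input row
genes <- strsplit(df.sample.gene$Gene.refGene, ",", fixed = TRUE)

# Repeat each row index once per gene found in that row
idx <- rep(seq_len(nrow(df.sample.gene)), lengths(genes))

# Duplicate the rows, then overwrite the gene column with single names
df.long <- df.sample.gene[idx, ]
df.long$Gene.refGene <- unlist(genes)

# Keep the original row numbers in a column (hypothetical name orig_row)
df.long$orig_row <- rownames(df.sample.gene)[idx]
rownames(df.long) <- NULL
df.long
```

Rows with a single gene pass through unchanged because their index is repeated exactly once.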
> On Wed, Aug 23, 2017 at 9:57 AM, Bogdan Tanasa <tanasa at gmail.com> wrote:
>> I would appreciate a suggestion on how to do the following:
>>
>> I'm working with a data frame in R that contains multiple gene names in a
>> specific column, e.g.:
>>
>>> df.sample.gene[15:20,2:8]
>>      Chr     Start       End Ref Alt Func.refGene           Gene.refGene
>> 284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
>> 448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
>> 465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
>> 525 chr2 223777758 223777758   T   A       exonic                  AP1S3
>> 626 chr3  99794575  99794575   G   A       exonic                 COL8A1
>> 643 chr3 132601066 132601066   A   G       exonic                  ACKR4
>>
>> How could I obtain a data frame in which each row that has multiple gene
>> names (in the field Gene.refGene) is replicated once per gene name? I.e.,
>>
>> for the second row :
>>
>> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
>>
>> we shall get in the final output (that contains all the rows) :
>>
>> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191
>> 448 chr2 113979920 113979920 C T ncRNA_exonic LOC100499194
>>
>> thanks a lot !
>>
>> -- bogdan
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k