[R] splitting a dataframe in R based on multiple gene names in a specific column
Jim Lemon
drjimlemon at gmail.com
Wed Aug 23 02:50:52 CEST 2017
Hi Bogdan,
Messy, and very specific to your problem:
df.sample.gene<-read.table(
text="Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
header=TRUE,stringsAsFactors=FALSE)
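# rows whose Gene.refGene holds more than one comma-separated gene
multgenes<-grep(",",df.sample.gene$Gene.refGene)
# the gene names in those rows and how many each row has (informational only)
rep_genes<-strsplit(df.sample.gene$Gene.refGene[multgenes],",")
ngenes<-unlist(lapply(rep_genes,length))
# copy a one-row data frame once per gene, one gene name per copy
dup_row<-function(x) {
 newrows<-x
 lastcol<-dim(x)[2]
 rep_genes<-unlist(strsplit(x[,lastcol],","))
 for(i in 2:length(rep_genes)) newrows<-rbind(newrows,x)
 newrows$Gene.refGene<-rep_genes
 return(newrows)
}
# append the expanded copies of each multi-gene row to the end
for(multgene in multgenes)
 df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
# drop the original multi-gene rows; with no such rows, [-multgenes,]
# would select nothing, so skip it
if(length(multgenes)) df.sample.gene<-df.sample.gene[-multgenes,]
df.sample.gene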
I added a second line with multiple genes to make sure that it would
work with more than one line.
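
A more general way to do the same thing, as a rough sketch only: split
every value of the gene column, repeat each row once per gene, and fill
in the single gene names. The function name split_gene_rows and its
arguments are made up for illustration, and it assumes the gene column
is character (as with stringsAsFactors=FALSE above) and that the
separator never occurs inside a gene name.

```r
# Hypothetical helper, not part of the original thread.
# Expands rows so that each row carries exactly one gene name.
split_gene_rows <- function(df, col = "Gene.refGene", sep = ",") {
  # list with one character vector of gene names per row
  genes <- strsplit(df[[col]], sep, fixed = TRUE)
  # number of genes found in each row
  n <- lengths(genes)
  # repeat each row index once per gene, keeping the original row order
  out <- df[rep(seq_len(nrow(df)), n), , drop = FALSE]
  # overwrite the combined names with the single names
  out[[col]] <- unlist(genes)
  # discard the duplicated row names and renumber from 1
  rownames(out) <- NULL
  out
}

# Small made-up example using gene names from the data above
example <- data.frame(
  Chr = c("chr2", "chr2", "chr3"),
  Start = c(16080996, 113979920, 132601999),
  Gene.refGene = c("GACAT3", "LINC01191,LOC100499194", "BCDF5,CDFG6"),
  stringsAsFactors = FALSE
)
expanded <- split_gene_rows(example)
expanded
```

Unlike the loop above, this keeps the expanded rows in their original
position rather than moving them to the end, but it renumbers the rows,
so the original row names (284, 448, ...) are lost. An empty string in
the gene column would make that row disappear, so check for those first.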
Jim
On Wed, Aug 23, 2017 at 9:57 AM, Bogdan Tanasa <tanasa at gmail.com> wrote:
> I would appreciate a suggestion on how to do the following:
>
> I'm working with a dataframe in R that contains multiple gene names in
> a specific column, e.g.:
>
>> df.sample.gene[15:20,2:8]
>      Chr     Start       End Ref Alt Func.refGene           Gene.refGene
> 284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
> 448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
> 465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
> 525 chr2 223777758 223777758   T   A       exonic                  AP1S3
> 626 chr3  99794575  99794575   G   A       exonic                 COL8A1
> 643 chr3 132601066 132601066   A   G       exonic                  ACKR4
>
> How could I obtain a dataframe where each line that has multiple gene names
> (in the field Gene.refGene) is replicated with only one gene name ? i.e.
>
> for the second row :
>
> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
>
> we shall get in the final output (that contains all the rows) :
>
> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191
> 448 chr2 113979920 113979920 C T ncRNA_exonic LOC100499194
>
> thanks a lot !
>
> -- bogdan