[R] string splitting and testing for enrichment
Gabor Grothendieck
ggrothendieck at gmail.com
Sat Jun 20 17:12:53 CEST 2009
Try this. We read in the data and split each TFBS string on "(",
") " or ")", giving a list s. Each element of s is then reshaped
into a two-column matrix with the Gene name prepended as column 1.
The pieces are stacked into one data frame and the third column
is made numeric.
Lines <- "Gene,TFBS
NUDC,PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)
RPA2,STAT4(57) HEB(251)
TAF12,PAX3(53) YY1(92) BRCA(99) GLI(101)
EIF3I,NERF(10) P300(10)
TRAPPC3,HIC1(3) PAX5(17) PAX5(110) NRF1(119) HIC1(122)
TRAPPC3,EGR(26) ZNF219(27) SP3(32) EGR(32) NFKAPPAB65(89) NFKAPPAB(89) RFX(121) ZTA(168)
NDUFS5,WHN(14) ATF(57) EGR3(59) PAX5(99) SF1(108) NRSE(146)
TIE1,NRSE(129)"
DF <- read.csv(textConnection(Lines), as.is = TRUE)
s <- strsplit(DF$TFBS, "\\(|\\) |\\)")
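# one block per input row: Gene in column 1, then the split TFBS pairs
f <- function(i) cbind(DF[i, "Gene"], matrix(s[[i]], ncol = 2, byrow = TRUE))
# keep columns as character; as factors, as.numeric() would return level codes
DF2 <- as.data.frame(do.call(rbind, lapply(seq_along(s), f)),
  stringsAsFactors = FALSE)
DF2[[3]] <- as.numeric(DF2[[3]])
View(DF2)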
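
With one row per Gene/TFBS pair, counting how often each TFBS
occurs is a table() call. The sketch below only illustrates how
such counts might feed fisher.test. The background numbers
(bg_hits, bg_total) are made-up placeholders, since no reference
set is given in this thread, and the 2x2 layout is an assumption
about the intended test.

```r
# TFBS names for two genes from the posted data (bracketed numbers dropped)
tfbs <- c("STAT4", "HEB", "HIC1", "PAX5", "PAX5", "NRF1", "HIC1")

# How often each TFBS occurs in the list
counts <- table(tfbs)
print(counts)

# Hypothetical background: PAX5 seen bg_hits times among bg_total sites.
# Both numbers are placeholders, not data from the thread.
bg_hits  <- 50
bg_total <- 5000

# 2x2 table: PAX5 vs other TFBS, in the list vs in the background
n_list <- sum(counts)
k_list <- as.integer(counts["PAX5"])
tab <- matrix(c(k_list, n_list - k_list,
                bg_hits, bg_total - bg_hits),
              nrow = 2,
              dimnames = list(TFBS = c("PAX5", "other"),
                              Set  = c("list", "background")))

# One-sided test for over-representation of PAX5 in the list
res <- fisher.test(tab, alternative = "greater")
print(res$p.value)
```

Whether a one-sided test and this choice of background suit the
real data is a separate question; the point is only the shape of
the table that fisher.test expects.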
On Sat, Jun 20, 2009 at 10:28 AM, Iain Gallagher <iaingallagher at btopenworld.com> wrote:
> Hi List
>
> I have data in the following form:
>
> Gene TFBS
> NUDC PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)
> RPA2 STAT4(57) HEB(251)
> TAF12 PAX3(53) YY1(92) BRCA(99) GLI(101)
> EIF3I NERF(10) P300(10)
> TRAPPC3 HIC1(3) PAX5(17) PAX5(110) NRF1(119) HIC1(122)
> TRAPPC3 EGR(26) ZNF219(27) SP3(32) EGR(32) NFKAPPAB65(89) NFKAPPAB(89) RFX(121) ZTA(168)
> NDUFS5 WHN(14) ATF(57) EGR3(59) PAX5(99) SF1(108) NRSE(146)
> TIE1 NRSE(129)
>
> I would like to test the 2nd column (each value has letters followed by numbers in brackets) here for enrichment via fisher.test.
>
> To that end I am trying to create two factors made up of column 1 (Gene) and column 2 (TFBS) where each Gene would have several entries matching each TFBS.
>
> My main problem just now is that I can't split the TFBS column into separate strings (at the moment that 2nd column is all one string for each Gene).
>
> Here's where I am just now:
>
> test<-as.character(dataIn[,2]) # convert the 2nd column from factor to character
> test2<-unlist(strsplit(test[1], ' ')) # split the first element into individual strings (only the first element just now because I'm just trying to get things working)
> test3<-unlist(strsplit(test2, '\\([0-9]\\)')) # get rid of numbers and brackets
>
> now this does not behave as I hoped - it gives me:
>
>> test3
> [1] "PPARA" "HNF4(20)" "HNF4(96)"
> [4] "AHRARNT(104)" "CACBINDINGPROTEIN(149)" "T3R(167)"
> [7] "HLF(191)"
>
> ie it only removes the numbers and brackets from the first entry and not the others.
>
> Could someone point out my mistake please?
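
On the pattern itself: '\\([0-9]\\)' matches an opening bracket,
exactly one digit and a closing bracket, so only "(1)" qualifies;
"(20)", "(96)" and the rest contain more than one digit. A minimal
sketch, assuming the aim is just the names without the bracketed
numbers, using "[0-9]+" with sub() in place of the second
strsplit():

```r
# First TFBS string from the posted data
x <- "PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)"

# Split on single spaces, as in the original attempt
test2 <- unlist(strsplit(x, " ", fixed = TRUE))

# "[0-9]+" matches one or more digits, so every trailing "(n)" is removed
test3 <- sub("\\([0-9]+\\)$", "", test2)
print(test3)
```
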
>
> Once I have all the TFBS (letters only) for each Gene I would then count how often a TFBS occurs and use this data for a fisher.test testing for enrichment of TFBS in the list I have. I'm rather muddled here though and would appreciate advice on whether this is the right approach.
>
> Thanks
>
> Iain
>
>> sessionInfo()
> R version 2.9.0 (2009-04-17)
> x86_64-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.