[BioC] undefined columns selected error when using bagging{ipred}
Valerie Obenchain
vobencha at fhcrc.org
Sun Sep 9 16:45:46 CEST 2012
Hi Constanze,
The problem appears to be with how bagging() handles the column
names of the sample data frame. The immediate solution is to change the
column names so they are not numbers:
> bagg <- bagging(response ~., data = exprDF[,selected], ntrees = 100)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
undefined columns selected
> dat <- exprDF[,selected]
> colnames(dat) <- paste0("A", 1:ncol(dat))
> bagg <- bagging(response ~., data = dat, ntrees = 100)
> bagg
Bagging survival trees with 25 bootstrap replications
Call: bagging.data.frame(formula = response ~ ., data = df, ntrees = 100)
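
A more general form of this workaround, as a minimal sketch: base R's make.names() turns any name into a syntactically valid one (a leading digit gets an "X" prefix). The toy data frame `dat` and the `lookup` vector below are made up for illustration and are not part of the original example.

```r
# Toy stand-in for exprDF[, selected]; the column names are numbers
# stored as character strings (hypothetical values).
dat <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))
names(dat) <- c("101", "202")

# Keep the original names so results can be mapped back later.
old_names <- names(dat)
new_names <- make.names(old_names, unique = TRUE)
lookup <- setNames(old_names, new_names)

# Apply the syntactically valid names before building the model formula.
names(dat) <- new_names
print(names(dat))
```

With valid names, `response ~ .` expands to term labels that match the column names exactly, and `lookup` recovers the original identifiers afterwards.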
As you've seen from error messages as you've worked through these
examples, several packages are no longer maintained and many functions
have evolved since the book was written. ipred is currently maintained
and it is the package that bagging() comes from. I'm cc'ing the
maintainer because this issue may be a bug.
Hi Torsten,
It looks like bagging() does not like colnames that are numbers coerced
to character. Using a modified example from ?bagging,
data(DLBCL)
## first example works fine
mod <- bagging(Surv(time,cens) ~ ., data=DLBCL, coob=TRUE)
## change the column names of the data.frame
names(DLBCL) <- c("DLCL.Sample", "Gene.Expression", "time", "cens",
"IPI", 1:10)
> names(DLBCL)
[1] "DLCL.Sample" "Gene.Expression" "time" "cens"
[5] "IPI" "1" "2" "3"
[9] "4" "5" "6" "7"
[13] "8" "9" "10"
> mod <- bagging(Surv(time,cens) ~ ., data=DLBCL, coob=TRUE)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
undefined columns selected
The error is thrown from this line in the irpart() function,
isord <- unlist(lapply(m[attr(Terms, "term.labels")], tfun))
When the 'Terms' variable is created, the numeric term labels are wrapped
in backticks ("`"), which prevents them from being matched to the
column names of the data.frame (m),
debugging in: irpart(y ~ ., data = mydata, control = control, bcontrol =
list(nbagg = nbagg,
ns = ns, replace = REPLACE))
...
Browse[2]>
debug: Terms <- attr(m, "terms")
...
Browse[2]> attr(Terms, "term.labels")
[1] "DLCL.Sample" "Gene.Expression" "IPI" "`1`"
[5] "`2`" "`3`" "`4`" "`5`"
[9] "`6`" "`7`" "`8`" "`9`"
[13] "`10`"
...
Browse[2]> colnames(m)
[1] "y" "DLCL.Sample" "Gene.Expression" "IPI"
[5] "1" "2" "3" "4"
[9] "5" "6" "7" "8"
[13] "9" "10"
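
The mismatch can be reproduced without ipred. The following is a minimal sketch using only base R; the data frame `df` and the backtick-stripping step are illustrative assumptions, not the actual irpart() code or a proposed patch.

```r
# Toy data frame with one ordinary column name and one numeric-looking name.
df <- data.frame(y = c(1, 2, 3), a = c(4, 5, 6), b = c(7, 8, 9))
names(df)[3] <- "1"

# Expanding the dot in the formula quotes the non-syntactic name with backticks.
tt <- terms(y ~ ., data = df)
labs <- attr(tt, "term.labels")
print(labs)

# Indexing the data frame by those labels fails, as it does inside irpart().
res <- tryCatch(df[labs], error = function(e) e)
print(inherits(res, "error"))

# Stripping the backticks restores the match (illustration only).
clean <- gsub("`", "", labs, fixed = TRUE)
print(names(df[clean]))
```

This suggests the lookup in irpart() would need to account for backtick-quoted labels, or users need to supply syntactically valid column names.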
Valerie
On 09/05/12 08:21, Constanze [guest] wrote:
> Dear All,
>
> I'm trying to reproduce the results of the survival analysis in Chapter 17, p.307 of "Bioinformatics and Computational Biology Solutions using R and Bioconductor" using the code chunks from http://www.bioconductor.org/help/publications/books/bioinformatics-and-computational-biology-solutions/chapter-code/Computational_Inference.R
> The call to the bagging function throws an error, although I decreased the number of variables selected to p=25 (so the model fit wouldn't be over-determined). The code is below.
>
> Thanks a lot,
>
> Constanze
>
>
>> library("exactRankTests")
> Package ‘exactRankTests’ is no longer under development.
> Please consider using package ‘coin’ instead.
>
>> # library("coin")
>> library("ipred")
> Loading required package: rpart
> Loading required package: MASS
> Loading required package: mlbench
> Loading required package: nnet
> Loading required package: class
>> library("kidpack")
> *** Deprecation warning ***:
> The package 'kidpack' is deprecated and will not be supported after Bioconductor release 2.1.
>
>
>> data(eset)
>> var_selection<- function(indx, expressions, response, p = 100) {
> +
> + y<- switch(class(response),
> + "factor" = { model.matrix(~ response - 1)[indx, ,drop = FALSE] },
> + "Surv" = { matrix(cscores(response[indx]), ncol = 1) },
> + "numeric" = { matrix(rank(response[indx]), ncol = 1) }
> + )
> +
> + x<- expressions[,indx, drop = FALSE]
> + n<- nrow(y)
> + linstat<- x %*% y
> + Ey<- matrix(colMeans(y), nrow = 1)
> + Vy<- matrix(rowMeans((t(y) - as.vector(Ey))^2), nrow = 1)
> +
> + rSx<- matrix(rowSums(x), ncol = 1)
> + rSx2<- matrix(rowSums(x^2), ncol = 1)
> + E<- rSx %*% Ey
> + V<- n / (n - 1) * kronecker(Vy, rSx2)
> + V<- V - 1 / (n - 1) * kronecker(Vy, rSx^2)
> +
> + stats<- abs(linstat - E) / sqrt(V)
> + stats<- do.call("pmax", as.data.frame(stats))
> + return(which(stats> sort(stats)[length(stats) - p]))
> + }
>>
>> remove<- is.na(eset$survival.time)
>> seset<- eset[,!remove]
>> response<- Surv(seset$survival.time, seset$died)
>> response[response[,1] == 0]<- 1
>> expressions<- t(apply(exprs(seset), 1, rank))
>> exprDF<- as.data.frame(t(expressions))
>>
>> I<- nrow(exprDF)
>> Iindx<- 1:I
>> selected<- var_selection(Iindx, expressions, response,p=25)
>> bagg<- bagging(response ~., data = exprDF[,selected],ntrees = 100)
> Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
> undefined columns selected
>
>
> -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: i486-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=de_DE.utf8 LC_NUMERIC=C
> [3] LC_TIME=de_DE.utf8 LC_COLLATE=de_DE.utf8
> [5] LC_MONETARY=de_DE.utf8 LC_MESSAGES=de_DE.utf8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=de_DE.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] kidpack_1.5.10 ipred_0.8-8 class_7.3-4
> [4] nnet_7.3-4 mlbench_2.1-1 MASS_7.3-21
> [7] rpart_3.1-54 exactRankTests_0.8-22 affy_1.26.0
> [10] Biobase_2.8.0 survival_2.36-14
>
> loaded via a namespace (and not attached):
> [1] affyio_1.16.0 preprocessCore_1.10.0 tools_2.15.1
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor