[BioC] How to do k-fold validation using SVM
Ramon Diaz
rdiaz@cnio.es
Mon, 27 Jan 2003 15:20:37 +0100
On Monday 27 January 2003 14:55, Stephen Henderson wrote:
> That looks good--thanks
> A question and a suggestion follow.
>
> 1. You say you used a linear kernel--Did you find this to be best after
> testing and/or optimising the other kernels?
Not really. At least, not after extensive testing. But with the data I am
working linear seemed better than radial, and the people I am working with
preferred linear than radial (and I do too: I don't really understand linear
SVMs, much less other kernels).
>
> 2. A set of wrapper functions (for multtest, ipred, and e1071) that
> consistently interface the affy data object with a few good feature
> selection methods and classification methods might be a useful new addition
> to BioC.
I guess that is an invitation for me to write that. I actually like the
suggestion (and I think it should be fairly easy to get my boss to regard it
as a good idea too). So I'll probably go for it (but will be a few months
before I can do it). [A disclaimer, though: I am not a particularly gifted R
programmer; for example, I've managed to avoid learning anything about S4
classes, and have no idea how Sweave works; I suppose this is an opportunity
to learn some of this stuff]
Best,
>
> Stephen Henderson
>
>
> -----Original Message-----
> From: Ramon Diaz [mailto:rdiaz@cnio.es]
> Sent: Monday, January 27, 2003 1:38 PM
> To: bioconductor@stat.math.ethz.ch
> Cc: amateos@cnio.es
> Subject: Re: [BioC] How to do k-fold validation using SVM
>
> Sorry for jumping into the thread so late: I just read the posting today.
> Anyway, I have used the following code a couple of times, and maybe it is
> of
>
> some help to you. In each "round" (for each set of training data) I select
> the best K genes (where best means with largest abs. value of t statistic)
> and then fit the svm using those K genes.
>
>
> For a couple of reasons, I use mt.maxT (from the multtest library) to get
> the
> t statistic, but you can modify the function at your convenience (like an
> ANOVA for > 3 or whatever you want). Note also that I use a linear kernel.
>
> Hope it helps,
>
> Ramón
>
> gene.select <- function(data, class, size = NULL, threshold = NULL) {
> # t.stat <- apply(data, 2, function(x) {abs(t.test(x ~
> class)$statistic)})
> this is slower than
> tmp <- mt.maxT(t(data), class, B= 1)[, c(1, 2)]
> selected <- tmp[seq(1:size), 1]
> return(selected)
> }
>
>
> cross.valid.svm <- function(data, y, knumber = 10, size = 200) {
> ## data is structured as subjects in rows, genes in columns
> ## (and thus is transposed inside gene.select to be fed to mt.maxT).
>
> ## If you want leave-one-out, set knumber = NULL
> ## size is the number of genes used when building the classifier.
> ## those are the "best" genes, based on a t-statistic.
>
>
> ## First, selecting the data subsets for cross-validation
> if(is.null(knumber)) {
> knumber <- length(y)
> leave.one.out <- TRUE
> }
> else leave.one.out <- FALSE
> N <- length(y)
> if(knumber > N) stop(message = "knumber has to be <= number of
> subjects")
> reps <- floor(N/knumber)
> reps.last <- N - (knumber-1)*reps
> index.select <- c( rep(seq(from = 1, to = (knumber - 1)), reps),
> rep(knumber, reps.last))
> index.select <- sample(index.select)
>
> cv.errors <- matrix(-99, nrow = knumber, ncol = 4)
>
> ## Fit model for each data set.
> for(sample.number in 1:knumber) {
> ## gene selection
> gene.subset <- gene.select(data[index.select != sample.number, ],
> y[index.select != sample.number],
> size = size)
>
>
> ## predict from svm on that subset
> y.f <- factor(y)
> test.set <- data[index.select == sample.number, gene.subset]
> if(is.null(dim(test.set))) test.set <- matrix(test.set, nrow = 1)
> ##
>
> for leave-one-out
> predicted <- predict(svm(data[index.select != sample.number,
> gene.subset],
> y.f[index.select != sample.number],
> kernel = "linear"), test.set)
>
> cv.errors[sample.number, 1] <- length(which((y.f[index.select ==
> sample.number] == 1)
> & predicted == 0))
> cv.errors[sample.number, 2] <- length(which((y.f[index.select ==
> sample.number] == 0)
> & predicted == 1))
> cv.errors[sample.number, 3] <- length(which(y.f[index.select ==
> sample.number] != predicted)) cv.errors[sample.number, 4] <-
> length(predicted)
>
> }
> cv.errors <- data.frame(cv.errors)
> names(cv.errors) <- c("true.1.pred.0", "true.0.pred.1", "total.error",
> "number.predicted") average.error.rate <- sum(cv.errors[,
> 3])/sum(cv.errors[, 4])
> return(list(cv.errors = cv.errors, average.error.rate =
> average.error.rate))
>
> }
>
>
> ## An example code:
> cross.valid.svm(matrix.covar, class.vector, k = 10, size = 10)
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
http://bioinfo.cnio.es/~rdiaz