[R] Need some suggestions for outlier detection in a matrix
arun
smartpink111 at yahoo.com
Wed Jan 15 19:07:32 CET 2014
Hi Vivek,
chisq.out.test(as.numeric(mat1[1,]))$alternative
#[1] "highest value 3516 is an outlier"
as.numeric(gsub("[[:alpha:]]","",chisq.out.test(as.numeric(mat1[1,]))$alternative))
#[1] 3516
#removes the alphabetic characters so that only the number remains.
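To illustrate what the pattern does: [[:alpha:]] is the POSIX character class for letters, so gsub() with an empty replacement deletes every letter and leaves the digits and spaces; as.numeric() then ignores the surrounding whitespace. A small sketch in base R, using the string returned above (msg, stripped, value1, value2 and num_pat are just illustrative names). The second approach, which pulls out the number directly with regmatches(), is only a suggestion and not part of the code above; it may be safer if a value is ever printed in scientific notation such as 1e+05, because stripping letters would also strip the "e".

```r
# Message as returned in $alternative for the first row
msg <- "highest value 3516 is an outlier"

# [[:alpha:]] matches any letter; replacing each with "" leaves only
# spaces and the digits 3516
stripped <- gsub("[[:alpha:]]", "", msg)
value1 <- as.numeric(stripped)  # as.numeric() ignores surrounding whitespace

# Alternative (an assumption, not part of the original code):
# extract the first numeric token, allowing sign, decimals and exponent
num_pat <- "-?[0-9]+\\.?[0-9]*([eE][-+]?[0-9]+)?"
value2 <- as.numeric(regmatches(msg, regexpr(num_pat, msg)))

print(c(value1, value2))
```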
Also, remember that this string is just the alternative hypothesis. If you want to subset the outliers, you have to compare the p-value with the cut-off alpha. Here I take 0.15 as the cut-off (as none of the p-values are < 0.05):
mat2 <- cbind(mat1, t(apply(mat1, 1, function(x) {
  test <- chisq.out.test(as.numeric(x))
  possible_outLier <- as.numeric(gsub("[[:alpha:]]", "", test$alternative))
  Pval <- test$p.value
  outLier <- if (!is.na(Pval) && Pval < 0.15) possible_outLier else NA
  c(Possible_outLier = possible_outLier, Pval = Pval, outLier = outLier)
})))
sum(!is.na(mat2[,"outLier"]))
#[1] 5208
head(mat2,6)
#            Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0
#XLOC_000001           626          3516          1277           770
#XLOC_000002            82           342           185            72
#XLOC_000003           361          2000           867           438
#XLOC_000004            30           143            67            37
#XLOC_000010             1             7             5             3
#XLOC_000011            10            63            19            15
#            Possible_outLier      Pval outLier
#XLOC_000001             3516 0.1423296    3516
#XLOC_000002              342 0.1707215      NA
#XLOC_000003             2000 0.1517236      NA
#XLOC_000004              143 0.1538803      NA
#XLOC_000010                7 0.2452781      NA
#XLOC_000011               63 0.1381038      63
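To pull out only the rows flagged at the chosen cut-off, you can subset on the outLier column. A minimal sketch in base R that rebuilds just the six rows printed above by hand (mat2_head, alpha and flagged are illustrative names; with the full data you would subset mat2 directly):

```r
# Six rows shown by head(mat2, 6), entered by hand for illustration
mat2_head <- cbind(
  Sample_118z.0    = c(626, 82, 361, 30, 1, 10),
  Sample_132z.0    = c(3516, 342, 2000, 143, 7, 63),
  Sample_141z.0    = c(1277, 185, 867, 67, 5, 19),
  Sample_183z.0    = c(770, 72, 438, 37, 3, 15),
  Possible_outLier = c(3516, 342, 2000, 143, 7, 63),
  Pval = c(0.1423296, 0.1707215, 0.1517236, 0.1538803, 0.2452781, 0.1381038)
)
rownames(mat2_head) <- c("XLOC_000001", "XLOC_000002", "XLOC_000003",
                         "XLOC_000004", "XLOC_000010", "XLOC_000011")

# Same rule as above: keep the candidate only when Pval is below the cut-off
alpha <- 0.15
pv <- mat2_head[, "Pval"]
outLier <- ifelse(!is.na(pv) & pv < alpha, mat2_head[, "Possible_outLier"], NA)
mat2_head <- cbind(mat2_head, outLier = outLier)

# Rows where an outlier was flagged; drop = FALSE keeps a matrix
# even if only one row qualifies
flagged <- mat2_head[!is.na(mat2_head[, "outLier"]), , drop = FALSE]
print(rownames(flagged))
```

Changing alpha in one place makes it easy to see how many rows are flagged at a stricter or looser cut-off.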
A.K.
On Wednesday, January 15, 2014 12:15 PM, Vivek Das <vd4mmind at gmail.com> wrote:
Thanks a lot Arun,
I understood the function, but I am not able to understand what pattern matching is happening with gsub("[[:alpha:]]","",test$alternative).
What is the alpha doing here? Can you please let me know why you did this pattern matching with gsub, taking [[:alpha:]] as the pattern?
----------------------------------------------------------
Vivek Das
On Wed, Jan 15, 2014 at 5:33 PM, arun <smartpink111 at yahoo.com> wrote:
Hi,
>Try:
>dat1 <- read.table("ZvsPGRT_frag_0filt.txt",sep="\t",header=TRUE,row.names=1)
>dat_Z <- dat1[,1:4] ## unnecessary to do cbind() here
>mat1 <- as.matrix(dat_Z)
> head(mat1,2)
># Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0
>#XLOC_000001 626 3516 1277 770
>#XLOC_000002 82 342 185 72
>library(outliers)
> ctest_mat1 <- t(apply(mat1,1,function(x) {test <- chisq.out.test(as.numeric(x)); c(outLier=as.numeric(gsub("[[:alpha:]]","",test$alternative)), Pval=test$p.value)}))
> mat2 <- cbind(mat1,ctest_mat1)
>head(mat2,2)
># Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0 outLier
>#XLOC_000001 626 3516 1277 770 3516
>#XLOC_000002 82 342 185 72 342
># Pval
>#XLOC_000001 0.1423296
>#XLOC_000002 0.1707215
>
>
>A.K.
>
>
>
>
>
>On Wednesday, January 15, 2014 7:12 AM, Vivek Das <vd4mmind at gmail.com> wrote:
>
>HI Arun,
>
>I was wondering how to use the package ‘outliers’, which can help me identify outliers for each row. I have a matrix with row names in the first column and values in the next 4 columns. For each row I want to find the outlier and also its test statistic. The package has a function, chisq.out.test, that performs a chi-squared test for detection of one outlier in a vector. I want to apply this to my matrix and find out, for each row, which value is the outlier and what p-value is associated with it. I was using the code below:
>
>
>data<-read.table("my_file.txt",sep='\t', header=TRUE)
>## Selecting only the centers
>data_Z<-cbind(data[,1:5])
>mat1<- as.matrix(data_Z[,2:5])
>row.names(mat1)<- data_Z[,1]
>head(mat1)
>
> Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0
>XLOC_000001 626 3516 1277 770
>XLOC_000002 82 342 185 72
>XLOC_000003 361 2000 867 438
>XLOC_000004 30 143 67 37
>XLOC_000010 1 7 5 3
>XLOC_000011 10 63 19 15
>
>ctest_mat1<-c()
>
>for (i in 1:length(mat1[,1]))
>{
>ctest_mat1<-c(ctest_mat1,chisq.out.test(as.numeric(mat1[i,])))
>
>}
>
>But this does not give me the outlier for each row. Ideally it should, but when I try to combine it with the matrix mat1 using the command below, I get a warning:
>
>res <-cbind(mat1,ctest_mat1)
>Warning message:
>In .Method(..., deparse.level = deparse.level) :
> number of rows of result is not a multiple of vector length (arg 2)
>
>I want a matrix containing mat1 plus columns that give, for each row, the outlier and the p-value associated with it. When I run
>
>head(ctest_mat1)
>$statistic
>X-squared
> 2.152591
>
>$alternative
>[1] "highest value 3516 is an outlier"
>
>$p.value
>[1] 0.1423296
>
>$method
>[1] "chi-squared test for outlier"
>
>$data.name
>[1] "as.numeric(mat1[i, ])"
>
>$statistic
>X-squared
> 1.876596
>
>I get only this for the first row. I want it as a matrix for all the rows, combined with my mat1, so that I can then evaluate it. Can you help me with that? I am also attaching the matrix. I hope you understood my point.
>
>
>
>----------------------------------------------------------
>
>Vivek Das
>