[BioC] Reading Affy CEL files
Ranjani R
ranjanir at uw.edu
Mon Jun 3 16:52:03 CEST 2013
Ranjani R [guest] <guest at ...> writes:
>
>
> I am a newbie to Affy. Thanks for your help.
>
> I am processing CEL files through R (Affy package) and am having some
basic issues that I am not finding
> satisfactory answers to (have googled).
> The chip used is hugene11stv1. I also am using the hugene11stprobeset.db
to try to do probeset â>
> Symbol translation.
> Essentially, I want to create a file with gene expression data, with
genes * samples as my final matrix.
>
> Code:
> setwd(wDir);
> Data <- ReadAffy();
> eset <- rma(Data);
> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F);
>
> When I analyze the file written, I see that the number of columns is as I
expect(number samples) but there are
> 33,297 genes.
> Please help me understand a few fundamental aspects here:
>
> 1. I tried translating these Affy IDs to gene symbols to see if that would
make my analysis easier.
> Here are some things I tried
>
> Try 1:
> symbols <- getSYMBOL(as.character(expr.matrix[,1]),
"hugene11stprobeset"); â> Not quite
> working. Only ~175 of the probeset IDs are getting translated.
> Try 2:
> symbs <- mget(featureNames(eset), hugene11stprobesetSYMBOL, ifnotfound
=NA);
> symbs <- unlist(symbs)
> mat <- eset; # make a copy
> featureNames(mat) <- ifelse(!is.na(symbs), symbs, featureNames(mat))
>
> Many NAs.
>
> Can you please help me understand what is happening here.
>
> -- output of sessionInfo():
>
> R version 2.15.3 (2013-03-01)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] hugene11stv1cdf_2.3.0 affy_1.36.1 Biobase_2.18.0
> [4] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.26.0 BiocInstaller_1.8.3 preprocessCore_1.20.0
> [4] tools_2.15.3 zlibbioc_1.4.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
Hi,
Thanks for the response. I tried using oligo +
hugene11sttranscriptcluster.db to annotate my data. Here is what I've tried
to do (after some googling).
With the code I executed, 33,197 genes got collapsed to about ~21,000+
genes.
I had some questions:
I found no vignette for how to do this 'summing up of probesets'. Is this
approach taken right (code attached)?
If anybody has a better approach/code snippet they can share - that would be
great too.
Essentially, I need to take the data, run RMA on it - so that I can then do
my differential expression analysis (amongst other approaches)
Can you please guide me here. I have attached my code snippets below.
Thanks much!
Ranjani
_________________________________________
Code executed:
--------------
1. Convert the probeset to gene Names
gMap <- hugene11sttranscriptclusterSYMBOL
mProbes <- mappedkeys(gMap)
probNames <- rownames(expr.matrix)
gMap1 <- as.list(gMap[mProbes])
2. Get an expression matrix of mapped probes
mappedProbes <- sort(which( (probNames %in% names(gMap1)) == T))
any( mappedProbes != sort(mappedProbes)) # FALSE
notMappedProbes <- which( (probNames %in% names(gMap1)) == F)
mappedNames <- as.character(gMap1[ which(names(gMap1) %in%
rownames(expr.matrix)[mappedProbes]) ])
rownames(expr.matrix)[mappedProbes] <- mappedNames
expr.matrix <- expr.matrix[-notMappedProbes,,drop=F]
3. Get unique genes. Here, if a gene occurs more than once, I take the mean
expression value (is this right)?
# get unique genes matrix by taking mean expression of gene names that
appear more than one time
gn.nms <- rownames(expr.matrix) <- toupper(rownames(expr.matrix))
expr.matrix.unique <- matrix(0, nrow = length(unique(gn.nms)), ncol =
ncol(expr.matrix))
seen.gns <- character();
cnt <- 0;
for (i in 1:dim(expr.matrix)[1]){
cr.gn <- gn.nms[i]
if (cr.gn %in% seen.gns){
next
} else{
seen.gns <- c(seen.gns,cr.gn)
ix <- which(gn.nms %in% cr.gn)
if(length(ix)==0) {
stop("length(ix) = 0 at i=",ix[1]," gene name is:",
cr.gn)
}
if(length(ix)>1){
cnt <- cnt + 1;
print("Taking column means for gene");
print(cr.gn);
expr.matrix.unique[cnt,] <- colMeans(d[ix,])
} else {
cnt <- cnt + 1
expr.matrix.unique[cnt,] <- d[ix,]
}
}
}
rownames(expr.matrix.unique) <- seen.gns
4. Now expr.matrix.uniqe should have what I want. Genes * Samples.
More information about the Bioconductor
mailing list