[BioC] Summarisation in oligo package after exclusion of probes containing SNPs

Daniel Bottomly bottomly at ohsu.edu
Tue Nov 5 17:59:01 CET 2013

Hi Jimmy:

1.) We do variations on this procedure quite frequently in the field of
mouse genetics and do find it to be useful in practice. The most common
way that I have seen is to remove the probes ahead of time prior to
background correction.  That being said, it would be interesting to
compare the effect of probe removal ahead of time to probe removal after
background correction and normalization.  I suppose there will be cases
where the removal of a probe from a probeset would remove an assignment to
an Entrez gene ID such as the case where the probeset is removed in its
entirety.  You could always use (re)alignments of the probes to directly
determine which genes are interrogated.

2.) I will defer to Benilton or others in terms of the best way to do
this.  If you are interested and willing to get your hands a little dirty
I could point you to our approach based on the oligo package which is
still under development but does contain documentation, unit tests, etc.



On 11/3/13 7:00 AM, "Jimmy Peters [guest]" <guest at bioconductor.org> wrote:

>Dear Benilton/BioC list,
>Query: Is it possible to perform summarisation using the oligo package
>after exclusion of probes containing SNPs?
>Background: I've performed an eQTL analysis where the expression data has
>been obtained on Affymetrix HuGene ST 1.1 microarrays.
>After reading in the CEL files, I pre-processed the GeneFeatureSet doing
>background correction, quantile normalisation, and summarisation to
>gene-level using the rma function from the oligo package.
># What my real data looks like:
>( cels <- read.celfiles(pathToFiles) ) # generates a GeneFeatureSet
>GeneFeatureSet (storageMode: lockedEnvironment)
>assayData: 1178100 features, 90 samples
>element names: exprs
>rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... triad0078_H7_498_CD16.CEL
>(90 total)
>varLabels: exprs dates
>varMetadata: labelDescription channel
>rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... triad0078_H7_498_CD16.CEL
>(90 total)
>varLabels: cell_type info.batch.name ... age (29 total)
>varMetadata: labelDescription channel
>featureData: none
>experimentData: use 'experimentData(object)'
>Annotation: pd.hugene.1.1.st.v1
>#Example workfow using oligoData for reproducibilty:
>affyGeneFS        # a GeneFeatureSet
>GeneFeatureSet (storageMode: lockedEnvironment)
>assayData: 1102500 features, 33 samples
>element names: exprs
>rowNames: TisMap_Brain_01_v1_WTGene1.CEL TisMap_Brain_02_v1_WTGene1.CEL
>... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
>varLabels: exprs dates
>varMetadata: labelDescription channel
>rowNames: TisMap_Brain_01_v1_WTGene1.CEL TisMap_Brain_02_v1_WTGene1.CEL
>... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
>varLabels: index
>varMetadata: labelDescription channel
>featureData: none
>experimentData: use 'experimentData(object)'
>Annotation: pd.hugene.1.0.st.v1
># generate gene-level expression values:
>eset <- rma(affyGeneFS)
># equivalent to:
>bgCorrected <- backgroundCorrect(affyGeneFS)                          #
>Background correct
>normalized <- normalize(bgCorrected,  method="quantile")              #
>Quantile normalise
>eset2 <- rma(normalized, background=F, normalize=F, subset=NULL)      #
>Summarize with median polish
>I would like to re-run the eQTL analysis, this time excluding probes from
>the expression data that contain
>SNPs in >1% of CEU individuals to see whether this has any substantial
>effect on my findings. My concern is
>the possibility of apparent eQTLs which are in fact artefacts due to less
>efficient binding between probes
>and transcripts containing the minor allele.
>My questions are- 
>1) Having obtained a list of the probes to exclude, is it valid to simply
>perform BG correction and quantile normalisation on the whole
>but to then remove probes containing SNPs prior to summarisation? Would
>the validity of mapping the resulting probesets to Entrez ids etc.
>be questionable if some probes had been excluded?
>2) If such an approach is reasonable, how can I map Affymetrix
>probe-level identifiers to the GeneFeatureSet?
># I know how to get feature ids for PM probes and their groupings into
>gene-level probesets:
>featureInfo <- oligo:::stArrayPmInfo(normalized, target = 'core')
>       fid  fsetid
>1   116371 7892501
>2   943979 7892501
>3   493089 7892501
>4   907039 7892501
>5  1033309 7892502
>6   653512 7892502
>7   690769 7892502
>8   997409 7892502
>9   379979 7892503
>10  469846 7892503
>The fsetid s correspond to gene-level probesets. Are the fid s row
>indices in the GeneFeatureSet,
>and how can they be linked to Affy probe level ids?
>Assuming I can map Affy probe ids to the fid column, here is my proposed
>solution for
>summarisation after removal of selected probes, but I wonder if there is
>a more straightforward way?
># for demo purposes generate a random list of fids to remove
># some fid s are duplicated in the data.frame as they map to multiple
>probesets, so use 'unique'
>fidToCut <- sample(unique(featureInfo$fid), 10000, replace=F)
># remove rows from GeneFeatureSet corressponding to fid s we want to
>normalized2 <- normalized[-fidToCut, ]
># create mapping for the new GeneFeatureSet of how old row indices map to
>new indices
>map <- cbind(new_ind= 1:nrow(normalized2), old_ind=
>featureNames(normalized2) )
># pmi gives row indices of pm probes in 'normalized', not 'normalized2'
>pmi <-  featureInfo[["fid"]]
># re-order dataframe so 'old_ind' column is the same as pmi
>map1 <- map[match(pmi, map[,"old_ind"]),]
>map1 <- cbind(map1, featureInfo)
>       new_ind old_ind    fid  fsetid
>861488  961136  969962 969962 8180418
>861489  208831  210719 210719 8180418
>861490  870109  878113 878113 8180418
>861491  598190  603636 603636 8180418
>861492  858752  866642 866642 8180418
>861493  713834  720353 720353 8180418
># we need to get rid of rows with NA... e.g. where 'fid' but no 'new_ind'
># effectively recreating a new 'featureInfo' dataframe
>map1 <- na.omit(map1)
># now we can get the new row indices, and the probesets they belong to
>pmi <-  as.numeric( as.character( map1[["new_ind"]] )) # 'new_ind' got
>coerced to a factor
>pnVec <- as.character(map1[["fsetid"]])
># subset the normalized probe-level values to keep probes by indexes in
>pms <- exprs(normalized2)[pmi, , drop = FALSE]
>dimnames(pms) <- NULL
>colnames(pms) <- sampleNames(normalized2)
># get a matrix of summarized values, which can be used to create an
>theExprs <- basicRMA(pms, pnVec, normalize=F, background=F)
>I'd greatly appreciate any advice. I'm a biologist by background so
>apologies if this is rather basic.
>Thanks very much
>Jimmy Peters, Smith Lab, Cambridge Institute for Medical Research.
> -- output of sessionInfo():
>R version 3.0.2 (2013-09-25)
>Platform: x86_64-apple-darwin10.8.0 (64-bit)
>[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>attached base packages:
>[1] parallel  stats     graphics  grDevices utils     datasets  methods
>other attached packages:
> [1] pd.hugene.1.0.st.v1_3.8.0 oligoData_1.8.0           biomaRt_2.18.0
>         pd.hugene.1.1.st.v1_3.8.0 RSQLite_0.11.4            DBI_0.2-7
> [7] oligo_1.26.0              Biobase_2.22.0
>oligoClasses_1.24.0       BiocGenerics_0.8.0
>loaded via a namespace (and not attached):
> [1] affxparser_1.34.0     affyio_1.30.0         BiocInstaller_1.12.0
>Biostrings_2.30.0     bit_1.1-10            codetools_0.2-8
> [8] foreach_1.4.1         GenomicRanges_1.14.1  IRanges_1.20.0
>iterators_1.0.6       preprocessCore_1.24.0 RCurl_1.95-4.1
>[15] stats4_3.0.2          tools_3.0.2           XML_3.95-0.2
>XVector_0.2.0         zlibbioc_1.8.0
>Sent via the guest posting facility at bioconductor.org.
>Bioconductor mailing list
>Bioconductor at r-project.org
>Search the archives:

More information about the Bioconductor mailing list