[BioC] Oligo package annotation
James W. MacDonald
jmacdon at uw.edu
Fri Dec 14 22:20:37 CET 2012
Hi Tim,
On 12/14/2012 1:04 PM, Tim Triche, Jr. wrote:
> Ok I guess I'm a bit dense, but bear with me here. I have used crlmm
> in the past on SNP6 arrays, but not on HuEx arrays (yet).
>
> oligo and oligoClasses are both used by packages like crlmm, charm,
> etc. where the locations of interrogated sequences are very important.
> So the pdInfo packages do (ought to?) have information about what
> probesets interrogate what subsequences along an overall transcript or
> genomic region (typically this will be complete with strand, though
> not always). What would be involved in taking the above information
> and turning it into a GRanges/GRangesList suitable for mapping large
> sparse Matrix objects?
> Just wondering. For 3' arrays, many SNP arrays, and DNA methylation
> arrays, the process is fairly trivial (i.e. "I needed to resolve the
> problem so I wrote code to coerce GEO data into the appropriate
> container"). Is there a general overarching logic that could be
> harnessed to dump existing results into memory-efficient data
> structures of the sort now available through GenomicRanges (and SEs)?
>
> In particular, for the HuEx arrays, it just dawned on me that a
> GRangesList of probesets and transcript mappings could be really handy.
>
> How onerous of a task would this, in your (James') estimation, be?
Depends on what you want, I suppose. What Affy supply are the regions of
the genome that a probeset is intended to query, rather than what each
probe itself is supposed to interrogate. So if you want to know the
region that the probesets are intended to query, and for the transcripts
you want to know the entirety of the region they are querying (e.g.,
from the start of the first probeset to the end of the last probeset
that were collated into a transcript level probeset), then it isn't too
difficult.
fun <- function(db.file, probeset = TRUE){
require(db.file, character.only = TRUE, quietly = TRUE)
require(GenomicRanges)
con <- db(get(db.file))
if(probeset){
x <- dbGetQuery(con, paste("select
fsetid,chrom,start,stop,strand,exon_id",
"from featureSet where type='1' and
chrom not NULL;"))
y <- tapply(1:nrow(x), x[,2], function(z) x[z,])
y <- lapply(y, function(z) GRanges(z[,2], IRanges(start=z[,3],
end=z[,4]), exon_id=z[,6], fsetid=z[,1]))
out <- do.call("GRangesList", y)
}else{
x <- dbGetQuery(con, paste("select
meta_fsetid,featureSet.fsetid,chrom,start,stop,strand",
"from featureSet inner join core_mps
on featureSet.fsetid=core_mps.fsetid",
"where type='1' and chrom not null;"))
y <- tapply(1:nrow(x), x[,1], function(z) x[z,])
crushit <- function(d.f){
if(nrow(d.f) == 1) return(d.f)
else d.f <- data.frame(d.f[1,1:3], start = min(d.f$start),
stop = max(d.f$stop),
strand = d.f$strand[1])
}
y <- do.call("rbind", lapply(y, crushit))
y <- tapply(1:nrow(y), y$chrom, function(z) y[z,])
y <- lapply(y, function(z) GRanges(z[,3], IRanges(start =
z[,4], end = z[,5])))
out <- do.call("GRangesList", y)
}
dbDisconnect(con)
out
}
The output from this function is a GRangesList where the individual
GRanges are by chromosome. This makes sense for the probeset level.
However, note that the transcript level summarization just involves
taking one or more probesets and piling them into a single
transcript-level probeset.
So at the transcript level, there is a question about what the GRanges
items should be. Should they still be by chromosome, and the IRanges
entries just give the range of the transcript that Affy says is being
interrogated (e.g., we just take the start of the 'first' probeset and
the end of the 'last' one)? Or would it be better to have each GRanges
item contain all the probesets that make up the transcript?
If the latter, that gets a bit inefficient the way I tried it, as I was
trying to make 29K individual GRanges. But maybe there is a much smarter
way to do things.
Best,
Jim
>
>
>
> On Fri, Dec 14, 2012 at 6:58 AM, James W. MacDonald <jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>> wrote:
>
> Please don't take conversations off-list. We like to think of the
> list archives as a repository of information.
>
> On 12/14/2012 5:45 AM, Bruno Giotti wrote:
>
> Ok thanks, but what should i do to query the
> pd.hugene.1.1.st.v1 annotation pack and retrieving some useful
> IDs? I could use the package you suggested me but i'd like to
> first understand how to use this one ( pd.hugene.1.1.st.v1).
>
>
> The pd.hugene.1.1.st.v1 package is NOT an annotation package.
> Instead, it maps the locations of probes on the array to different
> probesets. This package is used by oligo to decide which probes go
> into which probeset, so you can summarize at different levels
> (e.g., for the HuGene arrays, at the 'probeset' level, which is
> roughly exon-level, or at the transcript level).
>
> Unless you have a real need to know where things are on the chip,
> the pd.hugene.1.1.st.v1 package is not of much use. Well, let me
> take that back. I have found that the intronic background controls
> have a really bad habit of popping up in lists of differentially
> expressed genes. There are any number of hypotheses that I can
> come up with that would explain why this is so, but in the end I
> haven't found any end users who really care. So I use the
> pd.hugene.1.1.st.v1 package to figure out which probesets are not
> controls, and exclude them prior to selecting differentially
> expressed genes. The getMainProbes() function in affycoretools is
> useful in this respect.
>
> So back to the story at hand. Since the pd.hugene.1.1.st.v1
> package doesn't do annotations, you need to use the
> hugene11sttranscriptcluster.db package. It does use a SQLite
> database as its backend, but unless you like to do SQL queries
> this is of no relevance.
>
> The canonical reference for using these annotation packages is the
> Intro to Annotation Packages, which can be accessed by
>
> library(hugene11sttranscriptcluster.db)
> openVignette()
>
> and then choosing
>
> AnnotationDbi - AnnotationDbi: Introduction To Bioconductor
> Annotation Packages
>
> if you care about the internals, you can read
>
> AnnotationDbi - How to use bimaps from the ".db" annotation packages
>
> And if you just want to create annotated output, take a look at
> the annaffy package, which automates these things.
>
> Best,
>
> Jim
>
>
> Thaniks again
>
> > Date: Thu, 13 Dec 2012 11:53:52 -0500
> > From: jmacdon at uw.edu <mailto:jmacdon at uw.edu>
> > To: guest at bioconductor.org <mailto:guest at bioconductor.org>
> > CC: bioconductor at r-project.org
> <mailto:bioconductor at r-project.org>; latini18 at hotmail.com
> <mailto:latini18 at hotmail.com>; Benilton.Carvalho at cancer.org.uk
> <mailto:Benilton.Carvalho at cancer.org.uk>
> > Subject: Re: [BioC] Oligo package annotation
>
> >
> >
> >
> > On 12/13/2012 11:48 AM, Bruno [guest] wrote:
> > > Hi all,
> > > My question is quite straight-forward: how do i retrieve
> EntrezId or geneSymbol for pd.hugene.1.1.st.v1 to merge into
> my gene expression matrix? I havent found any vignettes
> explaining this. I know that the annotation file is a SQLite
> DB which i have to query. However im failing to find the
> tables i need. Sorry if i persevere in not explaining myself
> enough.
> >
> > It depends on what level you used for summarization.
> Assuming that you
> > used transcript-level summarization (which I would highly
> recommend),
> > you want to use the hugene11sttranscriptcluster.db package.
> If you did
> > something like
> >
> > rma(<filename>, target="probeset")
> >
> > then you want the hugene11stprobeset.db
> >
> > Best,
> >
> > Jim
> >
> >
> > >
> > >
> > > -- output of sessionInfo():
> > >
> > > R version 2.15.1 (2012-06-22)
> > > Platform: x86_64-pc-mingw32/x64 (64-bit)
> > >
> > > locale:
> > > [1] LC_COLLATE=English_United Kingdom.1252
> LC_CTYPE=English_United Kingdom.1252
> LC_MONETARY=English_United Kingdom.1252
> > > [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > > other attached packages:
> > > [1] pd.hugene.1.1.st.v1_3.8.0 oligo_1.22.0 affyPLM_1.34.0
> preprocessCore_1.20.0 latticeExtra_0.6-24
> > > [6] lattice_0.20-10 RColorBrewer_1.0-5 BiocInstaller_1.8.3
> simpleaffy_2.34.0 gcrma_2.30.0
> > > [11] genefilter_1.40.0 affy_1.36.0 limma_3.14.3
> RSQLite_0.11.2 DBI_0.2-5
> > > [16] Biobase_2.18.0 oligoClasses_1.20.0 BiocGenerics_0.4.0
> > >
> > > loaded via a namespace (and not attached):
> > > Error in x[["Version"]] : subscript out of bounds
> > > In addition: Warning message:
> > > In FUN(c("affxparser", "affyio", "annotate",
> "AnnotationDbi", "Biostrings", :
> > > DESCRIPTION file of package 'survival' is missing or broken
> > >
> > > --
> > > Sent via the guest posting facility at bioconductor.org
> <http://bioconductor.org>.
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > --
> > James W. MacDonald, M.S.
> > Biostatistician
> > University of Washington
> > Environmental and Occupational Health Sciences
> > 4225 Roosevelt Way NE, # 100
> > Seattle WA 98105-6099
> >
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list