[BioC] Oligo package annotation

Fri Dec 14 22:20:37 CET 2012

Hi Tim,

On 12/14/2012 1:04 PM, Tim Triche, Jr. wrote:
> Ok I guess I'm a bit dense, but bear with me here.  I have used crlmm 
> in the past on SNP6 arrays, but not on HuEx arrays (yet).
>
> oligo and oligoClasses are both used by packages like crlmm, charm, 
> etc. where the locations of interrogated sequences are very important. 
>  So the pdInfo packages do (ought to?) have information about what 
> probesets interrogate what subsequences along an overall transcript or 
> genomic region (typically this will be complete with strand, though 
> not always).   What would be involved in taking the above information 
> and turning it into a GRanges/GRangesList suitable for mapping large 
> sparse Matrix objects?
> Just wondering.  For 3' arrays, many SNP arrays, and DNA methylation 
> arrays, the process is fairly trivial (i.e. "I needed to resolve the 
> problem so I wrote code to coerce GEO data into the appropriate 
> container").  Is there a general overarching logic that could be 
> harnessed to dump existing results into memory-efficient data 
> structures of the sort now available through GenomicRanges (and SEs)?
>
> In particular, for the HuEx arrays, it just dawned on me that a 
> GRangesList of probesets and transcript mappings could be really handy.
>
> How onerous of a task would this, in your (James') estimation, be?

Depends on what you want, I suppose. What Affy supply are the regions of 
the genome that a probeset is intended to query, rather than what each 
probe itself is supposed to interrogate. So if you want to know the 
region that the probesets are intended to query, and for the transcripts 
you want to know the entirety of the region they are querying (e.g., 
from the start of the first probeset to the end of the last probeset 
that were collated into a transcript level probeset), then it isn't too 
difficult.

fun <- function(db.file, probeset = TRUE){
     require(db.file, character.only = TRUE, quietly = TRUE)
     require(GenomicRanges)
     con <- db(get(db.file))
     if(probeset){
         x <- dbGetQuery(con, paste("select 
fsetid,chrom,start,stop,strand,exon_id",
                                    "from featureSet where type='1' and 
chrom not NULL;"))
         y <- tapply(1:nrow(x), x[,2], function(z) x[z,])
         y <- lapply(y, function(z) GRanges(z[,2], IRanges(start=z[,3], 
end=z[,4]), exon_id=z[,6], fsetid=z[,1]))
         out <- do.call("GRangesList", y)
     }else{
         x <- dbGetQuery(con, paste("select 
meta_fsetid,featureSet.fsetid,chrom,start,stop,strand",
                                    "from featureSet inner join core_mps 
on featureSet.fsetid=core_mps.fsetid",
                                    "where type='1' and chrom not null;"))
         y <- tapply(1:nrow(x), x[,1], function(z) x[z,])
         crushit <- function(d.f){
             if(nrow(d.f) == 1) return(d.f)
             else d.f <- data.frame(d.f[1,1:3], start = min(d.f$start), 
stop = max(d.f$stop),
                                    strand = d.f$strand[1])
         }
         y <- do.call("rbind", lapply(y, crushit))
         y <- tapply(1:nrow(y), y$chrom, function(z) y[z,])
         y <- lapply(y, function(z) GRanges(z[,3], IRanges(start = 
z[,4], end = z[,5])))
         out <- do.call("GRangesList", y)
     }
     dbDisconnect(con)
     out
}

The output from this function is a GRangesList where the individual 
GRanges are by chromosome. This makes sense for the probeset level. 
However, note that the transcript level summarization just involves 
taking one or more probesets and piling them into a single 
transcript-level probeset.

So at the transcript level, there is a question about what the GRanges 
items should be. Should they still be by chromosome, and the IRanges 
entries just give the range of the transcript that Affy says is being 
interrogated (e.g., we just take the start of the 'first' probeset and 
the end of the 'last' one)? Or would it be better to have each GRanges 
item contain all the probesets that make up the transcript?

If the latter, that gets a bit inefficient the way I tried it, as I was 
trying to make 29K individual GRanges. But maybe there is a much smarter 
way to do things.

Best,

Jim

>
>
>
> On Fri, Dec 14, 2012 at 6:58 AM, James W. MacDonald <jmacdon at uw.edu 
> <mailto:jmacdon at uw.edu>> wrote:
>
>     Please don't take conversations off-list. We like to think of the
>     list archives as a repository of information.
>
>     On 12/14/2012 5:45 AM, Bruno Giotti wrote:
>
>         Ok thanks, but what should i do to query the
>         pd.hugene.1.1.st.v1 annotation pack and retrieving some useful
>         IDs? I could use the package you suggested me but i'd like to
>         first understand how to use this one ( pd.hugene.1.1.st.v1).
>
>
>     The pd.hugene.1.1.st.v1 package is NOT an annotation package.
>     Instead, it maps the locations of probes on the array to different
>     probesets. This package is used by oligo to decide which probes go
>     into which probeset, so you can summarize at different levels
>     (e.g., for the HuGene arrays, at the 'probeset' level, which is
>     roughly exon-level, or at the transcript level).
>
>     Unless you have a real need to know where things are on the chip,
>     the pd.hugene.1.1.st.v1 package is not of much use. Well, let me
>     take that back. I have found that the intronic background controls
>     have a really bad habit of popping up in lists of differentially
>     expressed genes. There are any number of hypotheses that I can
>     come up with that would explain why this is so, but in the end I
>     haven't found any end users who really care. So I use the
>     pd.hugene.1.1.st.v1 package to figure out which probesets are not
>     controls, and exclude them prior to selecting differentially
>     expressed genes. The getMainProbes() function in affycoretools is
>     useful in this respect.
>
>     So back to the story at hand. Since the pd.hugene.1.1.st.v1
>     package doesn't do annotations, you need to use the
>     hugene11sttranscriptcluster.db package. It does use a SQLite
>     database as its backend, but unless you like to do SQL queries
>     this is of no relevance.
>
>     The canonical reference for using these annotation packages is the
>     Intro to Annotation Packages, which can be accessed by
>
>     library(hugene11sttranscriptcluster.db)
>     openVignette()
>
>     and then choosing
>
>      AnnotationDbi - AnnotationDbi: Introduction To Bioconductor
>     Annotation Packages
>
>     if you care about the internals, you can read
>
>     AnnotationDbi - How to use bimaps from the ".db" annotation packages
>
>     And if you just want to create annotated output, take a look at
>     the annaffy package, which automates these things.
>
>     Best,
>
>     Jim
>
>
>         Thaniks again
>
>         > Date: Thu, 13 Dec 2012 11:53:52 -0500
>         > From: jmacdon at uw.edu <mailto:jmacdon at uw.edu>
>         > To: guest at bioconductor.org <mailto:guest at bioconductor.org>
>         > CC: bioconductor at r-project.org
>         <mailto:bioconductor at r-project.org>; latini18 at hotmail.com
>         <mailto:latini18 at hotmail.com>; Benilton.Carvalho at cancer.org.uk
>         <mailto:Benilton.Carvalho at cancer.org.uk>
>         > Subject: Re: [BioC] Oligo package annotation
>
>         >
>         >
>         >
>         > On 12/13/2012 11:48 AM, Bruno [guest] wrote:
>         > > Hi all,
>         > > My question is quite straight-forward: how do i retrieve
>         EntrezId or geneSymbol for pd.hugene.1.1.st.v1 to merge into
>         my gene expression matrix? I havent found any vignettes
>         explaining this. I know that the annotation file is a SQLite
>         DB which i have to query. However im failing to find the
>         tables i need. Sorry if i persevere in not explaining myself
>         enough.
>         >
>         > It depends on what level you used for summarization.
>         Assuming that you
>         > used transcript-level summarization (which I would highly
>         recommend),
>         > you want to use the hugene11sttranscriptcluster.db package.
>         If you did
>         > something like
>         >
>         > rma(<filename>, target="probeset")
>         >
>         > then you want the hugene11stprobeset.db
>         >
>         > Best,
>         >
>         > Jim
>         >
>         >
>         > >
>         > >
>         > > -- output of sessionInfo():
>         > >
>         > > R version 2.15.1 (2012-06-22)
>         > > Platform: x86_64-pc-mingw32/x64 (64-bit)
>         > >
>         > > locale:
>         > > [1] LC_COLLATE=English_United Kingdom.1252
>         LC_CTYPE=English_United Kingdom.1252
>         LC_MONETARY=English_United Kingdom.1252
>         > > [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
>         > >
>         > > attached base packages:
>         > > [1] stats graphics grDevices utils datasets methods base
>         > >
>         > > other attached packages:
>         > > [1] pd.hugene.1.1.st.v1_3.8.0 oligo_1.22.0 affyPLM_1.34.0
>         preprocessCore_1.20.0 latticeExtra_0.6-24
>         > > [6] lattice_0.20-10 RColorBrewer_1.0-5 BiocInstaller_1.8.3
>         simpleaffy_2.34.0 gcrma_2.30.0
>         > > [11] genefilter_1.40.0 affy_1.36.0 limma_3.14.3
>         RSQLite_0.11.2 DBI_0.2-5
>         > > [16] Biobase_2.18.0 oligoClasses_1.20.0 BiocGenerics_0.4.0
>         > >
>         > > loaded via a namespace (and not attached):
>         > > Error in x[["Version"]] : subscript out of bounds
>         > > In addition: Warning message:
>         > > In FUN(c("affxparser", "affyio", "annotate",
>         "AnnotationDbi", "Biostrings", :
>         > > DESCRIPTION file of package 'survival' is missing or broken
>         > >
>         > > --
>         > > Sent via the guest posting facility at bioconductor.org
>         <http://bioconductor.org>.
>         > >
>         > > _______________________________________________
>         > > Bioconductor mailing list
>         > > Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         > > https://stat.ethz.ch/mailman/listinfo/bioconductor
>         > > Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>         >
>         > --
>         > James W. MacDonald, M.S.
>         > Biostatistician
>         > University of Washington
>         > Environmental and Occupational Health Sciences
>         > 4225 Roosevelt Way NE, # 100
>         > Seattle WA 98105-6099
>         >
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>     _______________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>     Search the archives:
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> -- 
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper 
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099