[Bioc-devel] Illumina Methylation annotations

Sat Dec 18 16:23:29 CET 2010

Hi Tim

Thanks for your long and thoughtful comments! It took me a while to finish
reading your email. :) My comments follow.

> Speaking of annotations...
> 
> The 450k methylation chips have arrived and many of the probes map to
> multiple accessions.  I have modified the 27k annotations from Sean to
> handle bead mapping IDs (thanks Sean!) and started building probe sequence
> packages in the style of matchprobes/beadarray, but I was wondering how
> people would like to handle multiple-accession mappings and/or NuID encoding
> of the 450k probes.  As I'm sure many of you are aware, the denser platform
> contains many more probes with multiple CpG sites, which affects both
> preprocessing and the manner in which it makes sense to apply NuID
> translations. There are other 'interesting' aspects of the 450k arrays,
> along with consequences arising from the design of the platform, but a good
> first step would be to get the data playing nicely with methylumi and lumi,
> hence some agreement upon the annotations.

As you know, the benefit of nuID is that the probe sequence can be recovered
directly from the ID, without looking it up in a table. But the Illumina
probe ID, as the manufacturer's ID, is the one most widely used publicly. So
I think one alternative is simply to add a Bimap table between Illumina ID
and nuID to the current Infinium methylation library. As an option, I will
add a mapping function to convert data between Illumina ID and nuID, but by
default the data will be identified by Illumina ID.
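
To make the idea concrete, here is a minimal sketch of a sequence-derived ID
plus a two-way lookup against manufacturer IDs. It is an illustration only:
the packing scheme is simplified (the real nuID also carries a check code),
the code is Python rather than R, and every function name, probe ID, and
sequence in it is hypothetical.

```python
# Sketch only: a simplified sequence-packing scheme in the spirit of nuID.
# It is NOT the lumi implementation; all names and data here are hypothetical.
BASES = "ACGT"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_."


def encode_sequence(seq):
    """Pack a DNA sequence into a short ID, three bases per character.

    The first character records how many padding bases (0-2) were added.
    """
    pad = (-len(seq)) % 3
    padded = seq.upper() + "A" * pad
    chars = [ALPHABET[pad]]
    for i in range(0, len(padded), 3):
        a, b, c = (BASES.index(x) for x in padded[i:i + 3])
        chars.append(ALPHABET[a * 16 + b * 4 + c])
    return "".join(chars)


def decode_id(packed):
    """Recover the original sequence from an ID made by encode_sequence."""
    pad = ALPHABET.index(packed[0])
    bases = []
    for ch in packed[1:]:
        n = ALPHABET.index(ch)
        bases.extend([BASES[n // 16], BASES[(n // 4) % 4], BASES[n % 4]])
    seq = "".join(bases)
    return seq[:len(seq) - pad] if pad else seq


def build_maps(illumina_to_seq):
    """Build both directions of the manufacturer-ID <-> packed-ID lookup.

    The reverse map holds lists, because several manufacturer IDs may
    share one probe sequence.
    """
    to_packed = {pid: encode_sequence(s) for pid, s in illumina_to_seq.items()}
    to_illumina = {}
    for pid, packed in to_packed.items():
        to_illumina.setdefault(packed, []).append(pid)
    return to_packed, to_illumina


# Made-up probes: two manufacturer IDs that share one sequence.
probes = {"probe_a": "ACGTTGCA", "probe_b": "ACGTTGCA", "probe_c": "GGG"}
to_packed, to_illumina = build_maps(probes)
```

Keeping the manufacturer ID as the primary key and the packed ID as a
secondary column means the data stay compatible with Illumina's own files,
while the sequence is still recoverable without a separate table.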

As for multiple mappings, I am not sure how Illumina reports them for the
450k. For easier maintenance in the long run, we can simply follow
Illumina's convention. Illumina has improved its annotation maintenance and
now updates its annotations regularly.

> Also, given the large number of controls on the platform, a means of keeping
> track of (e.g.) bisulfite conversion controls and non-polymorphic probes, as
> well as the 600 or so negative controls, belongs in MethyLumiM.  How this
> should best be accomplished is less clear to me -- it's trivial to pull the
> control probe intensities from the .IDAT files that are always emitted from
> every scan of a 27k or 450k chip, but the representation within the object
> is more troublesome.  One of the things that made combining and subsetting
> MethyLumiSet objects quirky on occasion was the eSet-within-an-eSet
> representation for control probes, so I understand your (plural) reluctance
> to continue that model.  However, the only other thing I can come up with is
> to use an additional couple of data.frame() objects to hold the Cy5 and Cy3
> control intensities.  That would be fine too.
> 
> This raises the question of how to store information about the control (and
> analytic!) probes.

Can you send me some example control probe data? One option is to store the
control data in the same way as the LumiBatch class does.
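
As a rough illustration of the side-table idea discussed above (control
intensities kept in plain tables next to the main data and subset together
with it), here is a minimal sketch. It is written in Python rather than R,
the class and field names are hypothetical, and the numbers are made up; it
does not describe how MethyLumiM or LumiBatch is actually implemented.

```python
# Sketch only: a hypothetical container, not the MethyLumiM class itself.
from dataclasses import dataclass, field


@dataclass
class MethylationSet:
    samples: list                 # sample names, in column order
    betas: dict                   # analytic probe ID -> one value per sample
    control_cy3: dict = field(default_factory=dict)  # control ID -> per-sample
    control_cy5: dict = field(default_factory=dict)  # control ID -> per-sample

    def subset(self, keep):
        """Return a new object restricted to the named samples, in that order.

        The analytic table and both control tables are sliced with the same
        column indices, so they cannot drift out of sync.
        """
        idx = [self.samples.index(s) for s in keep]

        def pick(table):
            return {k: [v[i] for i in idx] for k, v in table.items()}

        return MethylationSet(list(keep), pick(self.betas),
                              pick(self.control_cy3), pick(self.control_cy5))


# Made-up values for three samples, one analytic probe, one negative control.
full = MethylationSet(
    samples=["s1", "s2", "s3"],
    betas={"probe_a": [0.10, 0.55, 0.90]},
    control_cy3={"neg_1": [120, 98, 143]},
    control_cy5={"neg_1": [210, 187, 256]},
)
small = full.subset(["s3", "s1"])
```

The point of the design is that subsetting goes through one method that
touches every table, instead of relying on a nested object with its own
subsetting rules.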

> Two package types make sense:
> 
> 1) IlluminaHumanMethylationXXk.db -- probe annotations like Sean has built
> for the 27k and GG arrays (small and compact)
> 2) illuminaHumanMethylationXXkProbe.db -- addresses and probe sequences for
> the analytic and control probes (larger .db)

This sounds good. The probe sequences can be kept in nuID format to save
storage space. But again, long-term maintenance is an issue.

> 
> Thoughts?  I mostly work from .IDAT files these days, and the Cambridge guys
> have expressed interest in making bead-level tools play nicely with
> methylumi/lumi.  Regardless of where the data comes from, it would be nice
> to standardize on representations, and leave room for using low-level data
> as appropriate.  (I've seen referees ask paper authors whether they looked
> at bisulfite control probes, for example -- not an unreasonable question,
> but the current MethyLumiM can't answer it.)
> 
> I've been impressed with the cleanliness of the Lumi methylation codebase,
> and it would be great if a consensus arose about data representations and
> annotation, so that this cleanliness is not disrupted.  For example, I have
> built 27k and 450k probe packages, and the 450k annotation package can be
> built without much trouble if a consensus can be reached as far as mapping
> multiple accessions to probes (or perhaps using a GRanges object or
> GenomicFeatures?!?).

Thanks for your comments on the lumi methylation codebase. A lot of
improvement work still needs to be done. In the long run, using
GenomicFeatures objects will definitely help integration with other data,
such as NGS sequence data, but they are not compatible with other microarray
data at the current stage. So maybe it can be a long-term plan for the
future.

> Patching the current  MethyLumiM object to handle
> subsetting of a couple additional data.frames is no problem either.  But I'd
> like to get your thoughts on the matter before I go off and hack up your
> code.  :-)
> 
If you can send me some example control data, I can play with it and update
the MethyLumiM class at the end of this year. If possible, please also send
me one or two samples of 450k data with annotation information.

Thanks for all the comments and support!

Happy holidays!

Pan