[BioC] Excluding probes during methylation analysis

Victoria Svinti victoria.svinti at igmm.ed.ac.uk
Mon Jul 8 20:59:57 CEST 2013


Hi Tim,

Thanks so much for your thorough response. I will work through your suggestions and report back if I encounter any problems. 

Much appreciated, 
Victoria

--
Victoria Svinti
Colon Cancer Genetics Group
MRC Human Genetics Unit, IGMM
University of Edinburgh, Western General Hospital,
Crewe Road, Edinburgh, EH4 2XU


On 6 Jul 2013, at 18:01, "Tim Triche, Jr." <tim.triche at gmail.com> wrote:

> > I'm working with methyLumiM or MethyLumiSet objects, is there a way to convert them to a SummarizedExperiment object? 
> 
> If you are using a recent version of 'methylumi' (2.7.5 is ideal), you can coerce them:
> 
> my.SE <- as(my.MethyLumiSet, 'SummarizedExperiment')
> 
> You can also coerce them to minfi-style data structures if you wish:
> 
> my.RGChannelSet <- as(my.MethyLumiSet, 'RGChannelSet')  ## only works if you have all of the output from a methylumIDAT() run
> 
> my.gmSet <- mapToGenome(my.MethyLumiSet)  ## gmSet is a subclass of SummarizedExperiment and essentially behaves the same
> 
> The coercion functions deserve better documentation than what I've currently written; I'll add some as soon as I have a chance. 
> 
> 
> > Should I only be looking at probes in CpGs? 
> 
> That depends a great deal on your experimental design.  If you care about non-CpG methylation, you should look at the ch probes too. The rs (SNP) probes can also be helpful in some circumstances, such as matching against SNP arrays, exome sequences, etc. but it does not make sense to include those in your methylation analysis directly (i.e., don't run tests on SNPs or CpG SNPs!).  This latter point has come about in regards to minfi's default summarization functions, which (IMHO) are more elegant than those in methylumi, but on occasion surprising to the user.   In my opinion, the most sensible processing pipeline would/will combine minfi and illuminaio's speed with the existing methylumi defaults for background processing and SNP probe handling, emitting a SummarizedExperiment-based container for downstream genomic analysis.  GenomicMethylSet and GenomicRatioSet are both subclasses of SummarizedExperiment, so the latter problem is long since solved. The former problem is not so much a problem as an issue of time to write small code patches.
> 
> 
> > How do I go about finding probes that map to repetitive regions, 
> 
> RepeatMasker is a useful tool for this.  I will ask if we can't make the TCGA hm450/hm27 masks public, since (after internal discussion) sensible defaults for probe masking were arrived at, and some computationally obnoxious bits are now water under the bridge for TCGA.
> 
> 
> > or map to multiple regions in the genome?
> 
> For this, you need to use either the probe sequence (including degenerate probes, for Infinium II designs) or the forward design target and align the probes.  BowTie, BWA, or Rsubread, for example, can handle this task if you just write out the probe sequences as a FASTA file.  However, it's not necessarily a good use of your time; other groups have already done this, e.g. http://www.epigeneticsandchromatin.com/content/6/1/4, aka http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16304
> 
> Additional FDb-style or minfi-style annotation packages would be a more sensible method to represent the above additions, and in fact both might be trivial enough for me to do them in the near future (I would like to update FDb.InfiniumMethylation.hg19, and I'd like to become more familiar with the guts of minfi-style annotation packages, in the event those are a better choice for future development).
> 
> 
> > Do I use the resulting SummarisedExperiment object, or the get450k() in FDb.InfiniumMethylation.hg19, or ..? 
> 
> Either will work, since get450k() is used by setAs('MethyLumiSet', 'SummarizedExperiment') to map the probes.  Likewise, all of the previously mentioned approaches work for minfi's GenomicMethylSet and GenomicRatioSet classes.  Kasper was receptive to input regarding the downstream representation of minfi results as SummarizedExperiments, which is a big help in methylation array analysis.  (You can represent RRBS, eRRBS, WGBS, or RNAseq results as SummarizedExperiments, too, which makes integration much easier)  The important part, IMHO, is that you get all of the power of the GenomicRanges infrastructure for "free" when you go this route. 
> 
> 
> >  found your previous posts on using the GPL13534 helpful, though it seems like things changed since then. 
> 
> Not so much, but the FDb packages attempt to automate some of the irritations related to using GPL entries.  The latter are still a good starting point for in-house annotations, transforming GEO datasets into SummarizedExperiment/GenomicMethylSet/GenomicRatioSet objects, or any number of other manipulations (e.g. turning existing expression data into SummarizedExperiments, mapping against RNAseq data, ...).   My only real concern is the veracity of some of the annotations, whether it be the target assembly, the symbols (Illumina's GPL entries have a smattering of "Excel genes", as one of my coworkers discovered one day), or their provenance. 
> 
> If you can verify the sequence target(s) and the resulting coordinates in a given genomic assembly, most of the remaining annotations are trivially accomplished using GenomicRanges, Biostrings, VariantAnnotation, and the like.  Plus then you know where they came from!  It's not an issue of mistrusting vendors, so much as it is an issue of "doveryai no proveryai" -- trust, but verify.  That way, there there are no nasty surprises.
> 
> This is also why I include a script to rebuild the packages from the manifest coordinates (using GEOquery!), verifying the target sequence mappings along the way, in the FDb.InfiniumMethylation packages.  Similarly, for the snp135common and snp137common packages,  the manipulations required to turn UCSC's curiously-indexed SNP results into SINGLE nucleotide hits (!) are included in inst/build/.   I don't expect anyone else to trust me without verification, either.  
> 
> Best,
> 
> --t
> 
> 
> 
> 
> On Sat, Jul 6, 2013 at 2:22 AM, Victoria Svinti <victoria.svinti at igmm.ed.ac.uk> wrote:
> Hi Tim,
> 
> Thank you for the very helpful pointers. 
> 
> I have some other questions remaining ... 
>  
> I'm working with methyLumiM or MethyLumiSet objects, is there a way to convert them to a SummarizedExperiment object? 
> 
> Should I only be looking at probes in CpGs? 
> 
> How do I go about finding probes that map to repetitive regions, or map to multiple regions in the genome? Do I use the resulting SummarisedExperiment object, or the get450k() in FDb.InfiniumMethylation.hg19, or ..? 
> 
> Sorry for all the questions, hopefully this would be useful to other people in the future. I found your previous posts on using the GPL13534 helpful, though it seems like things changed since then. 
> 
> Thanks again,
> Victoria
> 
> 
> --
> Victoria Svinti
> Colon Cancer Genetics Group
> MRC Human Genetics Unit, IGMM
> University of Edinburgh, Western General Hospital,
> Crewe Road, Edinburgh, EH4 2XU
> 
> 
> 
> On 5 Jul 2013, at 23:57, "Tim Triche, Jr." <tim.triche at gmail.com> wrote:
> 
>> http://www.bioconductor.org/packages/2.12/data/annotation/html/FDb.UCSC.snp137common.hg19.html ,
>> http://www.bioconductor.org/packages/2.13/data/annotation/html/FDb.UCSC.snp135common.hg19.html ,
>> or http://www.bioconductor.org/packages/release/data/annotation/html/SNPlocs.Hsapiens.dbSNP.20120608.html may be handy.   The first two can simply be overlapped, but are 'common' (MAF > 0.01) SNPs only.  If you want all of the SNPs that have been submitted to dbSNP, you need the SNPlocs package.
>> 
>> The UCSC snp13[5|7]common packages are compiled from newer builds of dbSNP than the manifest, which had some bizarre inclusions (SNPs which are > 1bp 3' to the targeted locus, for example) when we looked.  I personally screen out common SNPs that overlap the targeted or extension base, using the most recent build available to me, but that's just my preference.  There are arguments to be made for SNPs anywhere from 1 to 49 bases 5' to the target based on melting temperature of the oligos, and there are arguments to be made for genotyping all of your subjects and screening individually for SNPs. 
>> 
>> Anyways, it is straightforward to dump out the probes that get hit by common SNPs:
>> 
>> library(FDb.UCSC.snp137common.hg19)
>> commonSNPs <- features(FDb.UCSC.snp137common.hg19)
>> 
>> ## load the data: a SummarizedExperiment is like an eSet, but with a GRanges describing the features
>> my.SE <- readRDS('my.SummarizedExperiment.rds')
>> dim(my.SE)
>> ## [1] 485577     11
>> 
>> ## mask common SNPs that overlap the targeted CpG (or CpH, or SNP) site
>> my.SE.noCpgSNPs <- my.SE[ countOverlaps(my.SE, commonSNPs) < 1, ] 
>> dim(my.SE.noCpgSNPs)
>> ## [1] 468211     11
>> 
>> ## retain only CpG probes, and only those that do not overlap a common SNP
>> my.SE.noCpgSnps.onlyCpGs <- my.SE.noCpgSNPs[which(substr(rownames(my.SE.noCpgSNPs),1,2)== 'cg') , ] 
>> dim(my.SE.noCpgSnps.onlyCpGs)
>> ## [1] 465130     11
>> 
>> I prefer to work on SummarizedExperiments (hence the .SE), as it makes life a bit easier; it also happens to be the parent class for GenomicMethylSet, GenomicRatioSet, etc. in minfi, so the steps are the same for those.  Working on genomic coordinates is (almost?) universally preferable in this respect.
>> 
>> YMMV...
>> 
>> 
>> 
>> 
>> On Fri, Jul 5, 2013 at 9:31 AM, Victoria Svinti <victoria.svinti at igmm.ed.ac.uk> wrote:
>> Hi there,
>> 
>> I decided to post after searching the forums for a few days, in hope that somebody can point me in the right direction.
>> 
>> I am analysing a 450k methylation array to look for differentially methylated sites, and got as far as having normalised data. Various resources suggest that I need to drop probes with know SNPs residing in the sequence, microsattelites, those that anneal to multiple genomic locations etc.
>> 
>> I have looked into the FDb.InfiniumMethylation.hg19 package (get450k), but I don't see the annotation regarding SNPs (could be due to my unfamiliarity with GRanges). I finally have acquired a list of these from the GEO, Illumina GPL13534, but wonder if it's outdated and if there is a better way of doing this.
>> 
>> Does someone know of a good/any tutorial for this workflow?
>> 
>> Many thanks,
>> Victoria
>> 
>> --
>> Victoria Svinti
>> Colon Cancer Genetics Group
>> MRC Human Genetics Unit, IGMM
>> University of Edinburgh, Western General Hospital,
>> Crewe Road, Edinburgh, EH4 2XU
>> 
>> 
>> 
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
>> 
>> 
>> -- 
>> A model is a lie that helps you see the truth.
>> 
>> Howard Skipper
> 
> 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> 
> 
> -- 
> A model is a lie that helps you see the truth.
> 
> Howard Skipper

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130708/0e72737d/attachment.pl>


More information about the Bioconductor mailing list