[BioC] IlluminaHumanMethylation450kprobe for nearest TSS calculation question
Florence Cavalli
florence at ebi.ac.uk
Thu Mar 7 17:45:30 CET 2013
Hello,
I looked at this code as well and I would have two comments about it.
To get the closest TSS, I understand why we use precede() and the fact
that the argument ignore.strand is by default as FALSE (for GRanges
object, which is a
good thing), so the strand of the gene is taken into account.
However to compute the distance to the TSS for the gene on the reverse
strand, should we not use the coordinate of the end of the gene (as
stored in the GRanges object) ?
i.e from
TSSs.rev = start(TSS.reverse[nearest.rev[-notfound]])
to
TSSs.rev = end(TSS.reverse[nearest.rev[-notfound]])
Then the computed distance
distToTSS.rev[-notfound] = start(CpGs.unstranded)[-notfound] - TSSs.rev
will still be positive since "we are walking up the opposite strand"
but the length of the gene would not be taken into account as if
start(TSS.reverse[nearest.rev[-notfound]]) were used.
Toward the end, the shorter distance to TSS and the corresponding
Entrez ID are added for each probe.
It appears that the subselection is not done appropiately:
IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER] =
IlluminaHumanMethylation450kprobe$fwd.gene_id
IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER] =
IlluminaHumanMethylation450kprobe$rev.gene_id
The original code return a warning.
> head(IlluminaHumanMethylation450kprobe) (with original code)
Probe_ID chr strand start end site probe.sequence
cg00000029 cg00000029 16 - 53468112 53468161 53468112 <NA>
cg00000108 cg00000108 3 + 37459206 37459255 37459206 <NA>
cg00000109 cg00000109 3 - 171916037 171916086 171916037 <NA>
cg00000165 cg00000165 1 + 91194626 91194675 91194674 <NA>
cg00000236 cg00000236 8 + 42263246 42263295 42263294 <NA>
cg00000289 cg00000289 14 - 69341139 69341188 69341139 <NA>
source.sequence forward.genomic.sequence
cg00000029 <NA> CGAAACCTTCACACGTCAGTGTCTTTTGGACATTTTCTCGTCAGTACAGC
cg00000108 <NA> CGGCCAGGATGACAGCGGAGCCAGGATCACCCCAGGTCTGTCTCATTGCA
cg00000109 <NA> CGTATTTAGAAGCCAAGATCTGTGGGGGGGTACATGTGCCTGTTAGTATT
cg00000165 <NA> CGATGTGTGCCTCAGCTGTTCCATCAAAAGCCACTGTACTAACAGATCCT
cg00000236 <NA> CGTGATGTACAAACTGGTGGGTCAGATCGTCTCCTCTAACATGACGCTAC
cg00000289 <NA> CGACTCCCACACCAAAATGGACATGAGATTGGAGAAATGAATACAGCAGA
CpGs fwd.dist fwd.gene_id rev.dist rev.gene_id DISTTOTSS ENTREZ
cg00000029 3 -239 5934 64838 643802 239 5934
cg00000108 2 -34607 3680 365089 9209 34607 3680
cg00000109 1 -552438 1894 597842 5337 552438 1894
cg00000165 1 -771730 8317 17095 343472 17095 643802
cg00000236 3 -133004 114926 31708 27121 31708 9209
cg00000289 1 -105260 161159 86767 677 86767 5337
We can see this, first for the CpG cg00000165 for which the gene on
the reverse strand is the closest, entrez should be 343472.
I would modify it to:
IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER] =
IlluminaHumanMethylation450kprobe$fwd.gene_id[FWD.CLOSER]
IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER] =
IlluminaHumanMethylation450kprobe$rev.gene_id[REV.CLOSER]
One id stays as NA since the distance to the closest TSS on both
strands is equal.
> table(is.na(IlluminaHumanMethylation450kprobe$ENTREZ))
FALSE TRUE
485511 1
This happens for 17 CpG after applying the first modification for the
distance computation I mentioned above.
For this I am not sure how to chose between the two genes, one can
arbitrarily assigned the gene of the positive strand:
IlluminaHumanMethylation450kprobe[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ)),"ENTREZ"]=IlluminaHumanMethylation450kprobe[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ)),"fwd.gene_id"]
or
IlluminaHumanMethylation450kprobe$ENTREZ[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ))]=IlluminaHumanMethylation450kprobe$fwd.gene_id[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ))]
Please let me know if these comments make sense.
Thanks,
Florence
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IlluminaHumanMethylation450kprobe_2.0.6
[2] GenomicFeatures_1.10.1
[3] AnnotationDbi_1.20.3
[4] Biobase_2.18.0
[5] GenomicRanges_1.10.6
[6] IRanges_1.16.4
[7] BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] biomaRt_2.14.0 Biostrings_2.26.2 bitops_1.0-4.2 BSgenome_1.26.1
[5] DBI_0.2-5 parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2
[9] RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2
[13] XML_3.95-0.1 zlibbioc_1.4.0
2013/3/7 Watkins, Dale (Health) <Dale.Watkins at health.sa.gov.au>:
> Thanks Tim, that makes perfect sense. Cheers Dale
>
> ________________________________________
> From: Tim Triche, Jr. [tim.triche at gmail.com]
> Sent: Thursday, 7 March 2013 5:19 PM
> To: Dale Watkins [guest]
> Cc: bioconductor at r-project.org; Watkins, Dale (Health)
> Subject: Re: [BioC] IlluminaHumanMethylation450kprobe for nearest TSS calculation question
>
> the other 65 probes are SNP probes, not CpGs or CpHs
>
>
> On Wed, Mar 6, 2013 at 10:07 PM, Dale Watkins [guest] <guest at bioconductor.org<mailto:guest at bioconductor.org>> wrote:
>
> Hi,
> I have been using the IlluminaHumanMethylation450kprobe reference manual to find the nearest TSS and corresponding EntrezGene ID information for analysis of my 450k data, but I have a question.
>
> The distance to TSS file I have generated contains 485512 rows, but the 450k array covers 485577 CpGs - why is there a discrepancy? I used the code from the reference manual as follows:
>
> # find the nearest TSS and its corresponding EntrezGene ID
> library(GenomicFeatures)
> CpGs.unstranded = CpGs.450k
> strand(CpGs.unstranded) = "*"
> refgene.TxDb = makeTranscriptDbFromUCSC("refGene", genome="hg19")
> # nearest forward TSS
> TSS.forward = transcripts(refgene.TxDb,
> vals=list(tx_strand="+"),
> columns="gene_id")
> nearest.fwd = precede(CpGs.unstranded, TSS.forward)
> nearest.fwd.eg<http://nearest.fwd.eg> = nearest.fwd # to keep dimensions right
> notfound = which(is.na<http://is.na>(nearest.fwd)) # track for later
> nearest.fwd.eg<http://nearest.fwd.eg>[-notfound] = as.character(elementMetadata(TSS.forward)$gene_id[nearest.fwd[-notfound]])
>
> TSSs.fwd = start(TSS.forward[nearest.fwd[-notfound]])
> distToTSS.fwd = nearest.fwd # to keep dimensions right
> distToTSS.fwd[-notfound] = start(CpGs.unstranded)[-notfound] - TSSs.fwd
> # note that these are NEGATIVE -- which is correct!
>
>
> # nearest reverse TSS
> TSS.reverse = transcripts(refgene.TxDb,
> vals=list(tx_strand="-"),
> columns="gene_id")
> nearest.rev = precede(CpGs.unstranded, TSS.reverse)
> nearest.rev.eg<http://nearest.rev.eg> = nearest.rev # to keep dimensions right
> notfound = which(is.na<http://is.na>(nearest.rev)) # track for later
> nearest.rev.eg<http://nearest.rev.eg>[-notfound] = as.character(elementMetadata(TSS.reverse)$gene_id[nearest.rev[-notfound]])
>
> TSSs.rev = start(TSS.reverse[nearest.rev[-notfound]])
> distToTSS.rev = nearest.rev # to keep dimensions right
> distToTSS.rev[-notfound] = start(CpGs.unstranded)[-notfound] - TSSs.rev
> # now these are POSITIVE: we are walking up the opposite strand.
>
> # tabulate and link these together for the annotation package:
> IlluminaHumanMethylation450kprobe$fwd.dist <- distToTSS.fwd
> IlluminaHumanMethylation450kprobe$fwd.gene_id <- nearest.fwd.eg<http://nearest.fwd.eg>
> IlluminaHumanMethylation450kprobe$rev.dist <- distToTSS.rev
> IlluminaHumanMethylation450kprobe$rev.gene_id <- nearest.rev.eg<http://nearest.rev.eg>
>
> FWD.CLOSER = with(IlluminaHumanMethylation450kprobe,
> union( which( abs(fwd.dist) < abs(rev.dist) ),
> which( is.na<http://is.na>(rev.dist) ) ) )
> REV.CLOSER = with(IlluminaHumanMethylation450kprobe,
> union( which( abs(fwd.dist) > abs(rev.dist) ),
> which( is.na<http://is.na>(fwd.dist) ) ) )
>
> IlluminaHumanMethylation450kprobe$DISTTOTSS = pmin(abs(IlluminaHumanMethylation450kprobe$fwd.dist), abs(IlluminaHumanMethylation450kprobe$rev.dist))
> IlluminaHumanMethylation450kprobe$ENTREZ = NA
> IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER] = IlluminaHumanMethylation450kprobe$fwd.gene_id
> IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER] = IlluminaHumanMethylation450kprobe$rev.gene_id
>
>
> write.table(IlluminaHumanMethylation450kprobe$DISTTOTSS, "DistToTSS.csv", sep=",")
> write.table(IlluminaHumanMethylation450kprobe$ENTREZ, "ENTREZ.csv", sep=",")
> write.table(IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER], "Fwd.Closer.csv", sep=",")
> write.table(IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER], "Rev.Closer.csv", sep=",")
>
> Thanks in advance.
>
> -- output of sessionInfo():
>
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
> [4] LC_NUMERIC=C LC_TIME=English_Australia.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 IlluminaHumanMethylation450kprobe_2.0.6
> [3] IlluminaHumanMethylation450kannotation.ilmn.v1.2_0.1.3 IlluminaHumanMethylation450k.db_1.4.7
> [5] org.Hs.eg.db_2.8.0 RSQLite_0.11.2
> [7] DBI_0.2-5 IlluminaHumanMethylation450kmanifest_0.4.0
> [9] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BSgenome_1.26.1
> [11] GEOquery_2.24.1 GenomicFeatures_1.10.2
> [13] AnnotationDbi_1.20.5 minfi_1.4.0
> [15] Biostrings_2.26.3 GenomicRanges_1.10.7
> [17] IRanges_1.16.6 reshape_0.8.4
> [19] plyr_1.8 lattice_0.20-13
> [21] Biobase_2.18.0 BiocGenerics_0.4.0
> [23] BiocInstaller_1.8.3
>
> loaded via a namespace (and not attached):
> [1] affyio_1.26.0 annotate_1.36.0 beanplot_1.1 biomaRt_2.14.0 bit_1.1-9
> [6] bitops_1.0-5 codetools_0.2-8 crlmm_1.16.9 ellipse_0.3-7 ff_2.2-10
> [11] foreach_1.4.0 genefilter_1.40.0 grid_2.15.2 iterators_1.0.6 limma_3.14.4
> [16] MASS_7.3-23 Matrix_1.0-11 matrixStats_0.6.2 mclust_4.0 multtest_2.14.0
> [21] mvtnorm_0.9-9994 nor1mix_1.1-3 oligoClasses_1.20.0 preprocessCore_1.20.0 R.methodsS3_1.4.2
> [26] RColorBrewer_1.0-5 RcppEigen_0.3.1.2.1 RCurl_1.95-3 Rsamtools_1.10.2 rtracklayer_1.18.2
> [31] siggenes_1.32.0 splines_2.15.2 stats4_2.15.2 survival_2.36-14 tools_2.15.2
> [36] XML_3.95-0.1 xtable_1.7-1 zlibbioc_1.4.0
>
> --
> Sent via the guest posting facility at bioconductor.org<http://bioconductor.org>.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> A model is a lie that helps you see the truth.
>
> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list