[BioC] IlluminaHumanMethylation450kprobe for nearest TSS calculation question

Thu Mar 7 17:45:30 CET 2013

Hello,

I looked at this code as well and I would have two comments about it.
To get the closest TSS, I understand why we use precede() and the fact
that the argument ignore.strand is by default as FALSE (for GRanges
object, which is a
good thing), so the strand of the gene is taken into account.
However to compute the distance to the TSS for the gene on the reverse
strand, should we not use the coordinate of the end of the gene (as
stored in the GRanges object) ?
i.e from
TSSs.rev = start(TSS.reverse[nearest.rev[-notfound]])
to
TSSs.rev = end(TSS.reverse[nearest.rev[-notfound]])

Then the computed distance
distToTSS.rev[-notfound] = start(CpGs.unstranded)[-notfound] - TSSs.rev
will still be positive since "we are walking up the opposite strand"
but the length of the gene would not be taken into account as if
start(TSS.reverse[nearest.rev[-notfound]]) were used.

Toward the end, the shorter distance to TSS and the corresponding
Entrez ID are added for each probe.
It appears that the subselection is not done appropiately:
IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER] =
IlluminaHumanMethylation450kprobe$fwd.gene_id
IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER] =
IlluminaHumanMethylation450kprobe$rev.gene_id
The original code return a warning.

> head(IlluminaHumanMethylation450kprobe) (with original code)
             Probe_ID chr strand     start       end      site probe.sequence
cg00000029 cg00000029  16      -  53468112  53468161  53468112           <NA>
cg00000108 cg00000108   3      +  37459206  37459255  37459206           <NA>
cg00000109 cg00000109   3      - 171916037 171916086 171916037           <NA>
cg00000165 cg00000165   1      +  91194626  91194675  91194674           <NA>
cg00000236 cg00000236   8      +  42263246  42263295  42263294           <NA>
cg00000289 cg00000289  14      -  69341139  69341188  69341139           <NA>
           source.sequence                           forward.genomic.sequence
cg00000029            <NA> CGAAACCTTCACACGTCAGTGTCTTTTGGACATTTTCTCGTCAGTACAGC
cg00000108            <NA> CGGCCAGGATGACAGCGGAGCCAGGATCACCCCAGGTCTGTCTCATTGCA
cg00000109            <NA> CGTATTTAGAAGCCAAGATCTGTGGGGGGGTACATGTGCCTGTTAGTATT
cg00000165            <NA> CGATGTGTGCCTCAGCTGTTCCATCAAAAGCCACTGTACTAACAGATCCT
cg00000236            <NA> CGTGATGTACAAACTGGTGGGTCAGATCGTCTCCTCTAACATGACGCTAC
cg00000289            <NA> CGACTCCCACACCAAAATGGACATGAGATTGGAGAAATGAATACAGCAGA
           CpGs fwd.dist fwd.gene_id rev.dist rev.gene_id DISTTOTSS ENTREZ
cg00000029    3     -239        5934    64838      643802       239   5934
cg00000108    2   -34607        3680   365089        9209     34607   3680
cg00000109    1  -552438        1894   597842        5337    552438   1894
cg00000165    1  -771730        8317    17095      343472     17095 643802
cg00000236    3  -133004      114926    31708       27121     31708   9209
cg00000289    1  -105260      161159    86767         677     86767   5337
We can see this, first for the CpG cg00000165 for which the gene on
the reverse strand is the closest, entrez should be 343472.

I would modify it to:
IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER] =
IlluminaHumanMethylation450kprobe$fwd.gene_id[FWD.CLOSER]
IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER] =
IlluminaHumanMethylation450kprobe$rev.gene_id[REV.CLOSER]

One id stays as NA since the distance to the closest TSS on both
strands is equal.
> table(is.na(IlluminaHumanMethylation450kprobe$ENTREZ))
 FALSE   TRUE
485511      1

This happens for 17 CpG after applying the first modification for the
distance computation I mentioned above.
For this I am not sure how to chose between the two genes, one can
arbitrarily assigned the gene of the positive strand:

IlluminaHumanMethylation450kprobe[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ)),"ENTREZ"]=IlluminaHumanMethylation450kprobe[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ)),"fwd.gene_id"]
or
IlluminaHumanMethylation450kprobe$ENTREZ[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ))]=IlluminaHumanMethylation450kprobe$fwd.gene_id[which(is.na(IlluminaHumanMethylation450kprobe$ENTREZ))]

Please let me know if these comments make sense.
Thanks,
Florence

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IlluminaHumanMethylation450kprobe_2.0.6
[2] GenomicFeatures_1.10.1
[3] AnnotationDbi_1.20.3
[4] Biobase_2.18.0
[5] GenomicRanges_1.10.6
[6] IRanges_1.16.4
[7] BiocGenerics_0.4.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.14.0     Biostrings_2.26.2  bitops_1.0-4.2     BSgenome_1.26.1
 [5] DBI_0.2-5          parallel_2.15.2    RCurl_1.95-3       Rsamtools_1.10.2
 [9] RSQLite_0.11.2     rtracklayer_1.18.2 stats4_2.15.2      tools_2.15.2
[13] XML_3.95-0.1       zlibbioc_1.4.0

2013/3/7 Watkins, Dale (Health) <Dale.Watkins at health.sa.gov.au>:
> Thanks Tim, that makes perfect sense. Cheers Dale
>
> ________________________________________
> From: Tim Triche, Jr. [tim.triche at gmail.com]
> Sent: Thursday, 7 March 2013 5:19 PM
> To: Dale Watkins [guest]
> Cc: bioconductor at r-project.org; Watkins, Dale (Health)
> Subject: Re: [BioC] IlluminaHumanMethylation450kprobe for nearest TSS calculation question
>
> the other 65 probes are SNP probes, not CpGs or CpHs
>
>
> On Wed, Mar 6, 2013 at 10:07 PM, Dale Watkins [guest] <guest at bioconductor.org<mailto:guest at bioconductor.org>> wrote:
>
> Hi,
> I have been using the IlluminaHumanMethylation450kprobe reference manual to find the nearest TSS and corresponding EntrezGene ID information for analysis of my 450k data, but I have a question.
>
> The distance to TSS file I have generated contains 485512 rows, but the 450k array covers 485577 CpGs - why is there a discrepancy? I used the code from the reference manual as follows:
>
> # find the nearest TSS and its corresponding EntrezGene ID
> library(GenomicFeatures)
> CpGs.unstranded = CpGs.450k
> strand(CpGs.unstranded) = "*"
> refgene.TxDb = makeTranscriptDbFromUCSC("refGene", genome="hg19")
> # nearest forward TSS
> TSS.forward = transcripts(refgene.TxDb,
>                           vals=list(tx_strand="+"),
>                           columns="gene_id")
> nearest.fwd = precede(CpGs.unstranded, TSS.forward)
> nearest.fwd.eg<http://nearest.fwd.eg> = nearest.fwd # to keep dimensions right
> notfound = which(is.na<http://is.na>(nearest.fwd)) # track for later
> nearest.fwd.eg<http://nearest.fwd.eg>[-notfound] = as.character(elementMetadata(TSS.forward)$gene_id[nearest.fwd[-notfound]])
>
> TSSs.fwd = start(TSS.forward[nearest.fwd[-notfound]])
> distToTSS.fwd = nearest.fwd # to keep dimensions right
> distToTSS.fwd[-notfound] = start(CpGs.unstranded)[-notfound] - TSSs.fwd
> # note that these are NEGATIVE -- which is correct!
>
>
> # nearest reverse TSS
> TSS.reverse = transcripts(refgene.TxDb,
>                           vals=list(tx_strand="-"),
>                           columns="gene_id")
> nearest.rev = precede(CpGs.unstranded, TSS.reverse)
> nearest.rev.eg<http://nearest.rev.eg> = nearest.rev # to keep dimensions right
> notfound = which(is.na<http://is.na>(nearest.rev)) # track for later
> nearest.rev.eg<http://nearest.rev.eg>[-notfound] = as.character(elementMetadata(TSS.reverse)$gene_id[nearest.rev[-notfound]])
>
> TSSs.rev = start(TSS.reverse[nearest.rev[-notfound]])
> distToTSS.rev = nearest.rev # to keep dimensions right
> distToTSS.rev[-notfound] = start(CpGs.unstranded)[-notfound] - TSSs.rev
> # now these are POSITIVE: we are walking up the opposite strand.
>
> # tabulate and link these together for the annotation package:
> IlluminaHumanMethylation450kprobe$fwd.dist <- distToTSS.fwd
> IlluminaHumanMethylation450kprobe$fwd.gene_id <- nearest.fwd.eg<http://nearest.fwd.eg>
> IlluminaHumanMethylation450kprobe$rev.dist <- distToTSS.rev
> IlluminaHumanMethylation450kprobe$rev.gene_id <- nearest.rev.eg<http://nearest.rev.eg>
>
> FWD.CLOSER = with(IlluminaHumanMethylation450kprobe,
>                   union( which( abs(fwd.dist) < abs(rev.dist) ),
>                          which( is.na<http://is.na>(rev.dist) ) ) )
> REV.CLOSER = with(IlluminaHumanMethylation450kprobe,
>                   union( which( abs(fwd.dist) > abs(rev.dist) ),
>                          which( is.na<http://is.na>(fwd.dist) ) ) )
>
> IlluminaHumanMethylation450kprobe$DISTTOTSS = pmin(abs(IlluminaHumanMethylation450kprobe$fwd.dist), abs(IlluminaHumanMethylation450kprobe$rev.dist))
> IlluminaHumanMethylation450kprobe$ENTREZ = NA
> IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER] = IlluminaHumanMethylation450kprobe$fwd.gene_id
> IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER] = IlluminaHumanMethylation450kprobe$rev.gene_id
>
>
> write.table(IlluminaHumanMethylation450kprobe$DISTTOTSS, "DistToTSS.csv", sep=",")
> write.table(IlluminaHumanMethylation450kprobe$ENTREZ, "ENTREZ.csv", sep=",")
> write.table(IlluminaHumanMethylation450kprobe$ENTREZ[FWD.CLOSER], "Fwd.Closer.csv", sep=",")
> write.table(IlluminaHumanMethylation450kprobe$ENTREZ[REV.CLOSER], "Rev.Closer.csv", sep=",")
>
> Thanks in advance.
>
>  -- output of sessionInfo():
>
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
> [4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6                 IlluminaHumanMethylation450kprobe_2.0.6
>  [3] IlluminaHumanMethylation450kannotation.ilmn.v1.2_0.1.3 IlluminaHumanMethylation450k.db_1.4.7
>  [5] org.Hs.eg.db_2.8.0                                     RSQLite_0.11.2
>  [7] DBI_0.2-5                                              IlluminaHumanMethylation450kmanifest_0.4.0
>  [9] BSgenome.Hsapiens.UCSC.hg19_1.3.19                     BSgenome_1.26.1
> [11] GEOquery_2.24.1                                        GenomicFeatures_1.10.2
> [13] AnnotationDbi_1.20.5                                   minfi_1.4.0
> [15] Biostrings_2.26.3                                      GenomicRanges_1.10.7
> [17] IRanges_1.16.6                                         reshape_0.8.4
> [19] plyr_1.8                                               lattice_0.20-13
> [21] Biobase_2.18.0                                         BiocGenerics_0.4.0
> [23] BiocInstaller_1.8.3
>
> loaded via a namespace (and not attached):
>  [1] affyio_1.26.0         annotate_1.36.0       beanplot_1.1          biomaRt_2.14.0        bit_1.1-9
>  [6] bitops_1.0-5          codetools_0.2-8       crlmm_1.16.9          ellipse_0.3-7         ff_2.2-10
> [11] foreach_1.4.0         genefilter_1.40.0     grid_2.15.2           iterators_1.0.6       limma_3.14.4
> [16] MASS_7.3-23           Matrix_1.0-11         matrixStats_0.6.2     mclust_4.0            multtest_2.14.0
> [21] mvtnorm_0.9-9994      nor1mix_1.1-3         oligoClasses_1.20.0   preprocessCore_1.20.0 R.methodsS3_1.4.2
> [26] RColorBrewer_1.0-5    RcppEigen_0.3.1.2.1   RCurl_1.95-3          Rsamtools_1.10.2      rtracklayer_1.18.2
> [31] siggenes_1.32.0       splines_2.15.2        stats4_2.15.2         survival_2.36-14      tools_2.15.2
> [36] XML_3.95-0.1          xtable_1.7-1          zlibbioc_1.4.0
>
> --
> Sent via the guest posting facility at bioconductor.org<http://bioconductor.org>.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> A model is a lie that helps you see the truth.
>
> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor