[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array
Sean Davis
sdavis2 at mail.nih.gov
Sun May 11 15:28:47 CEST 2008
On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
> Dear all,
>
> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
> the makecdfenv package to build a cdf environment based on the file
> MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.
>
> That worked without any problems, but out of curiosity I tried taking
> a closer look at the format of the array, to see how many probes were
> in each probe set etc.
>
> I'm aware that some probes map to multiple probe sets and are removed
> when the cdfenv is produced, which seems to be the case for about 8%
> of the probes. My question is: exactly how does this happen? I would
> expect the multiple-mapping probes to be removed from all probe sets,
> but this doesn't seem to be the case.
I believe the probes are kept in either the first or the last probe set
seen (I'm not sure which). Someone with more affy experience can
comment more fully.

Sean
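
To make the two candidate behaviours concrete, here is a minimal base R sketch on a toy table. It is not taken from makecdfenv and does not claim to reproduce its internals; the table `cdf_toy`, the probe set names and the probe indices are all invented for illustration.

```r
# Toy long-format CDF table: one row per (probe set, probe) pair.
# Probe set names and probe indices are invented for illustration.
cdf_toy <- data.frame(
  QUAL  = c("setA", "setA", "setA", "setB", "setB", "setB"),
  INDEX = c(1, 2, 3, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Rule 1 (hypothetical): a shared probe stays in the FIRST probe set listing it.
keep_first <- !duplicated(cdf_toy$INDEX)
first_wins <- split(cdf_toy$INDEX[keep_first], cdf_toy$QUAL[keep_first])

# Rule 2 (hypothetical): a shared probe stays in the LAST probe set listing it.
keep_last <- !duplicated(cdf_toy$INDEX, fromLast = TRUE)
last_wins <- split(cdf_toy$INDEX[keep_last], cdf_toy$QUAL[keep_last])

print(first_wins)  # setA keeps 1, 2, 3; setB keeps only 4
print(last_wins)   # setA keeps only 1; setB keeps 3, 2, 4
```

In the examples quoted in this message, the probe set that loses its shared probes is the one with the smaller ID. That would fit rule 2 only if the CDF file lists probe sets in ascending ID order, which is an assumption here and not something the thread confirms.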
> Here is an example with the two overlapping probe sets 10344719 and
> 10353008. "raw" is my AffyBatch, "cdf" is the raw cdf file converted
> to a tab-delimited table and read into R, and "INDEX" is a unique
> probe identifier (equal to the index in the cdf env minus 1):
>
> > cdf[cdf$QUAL=="10344719","INDEX"]
> [1] 7543 661828 575792 962890 963940 140756 337977 510591 860722 968182 387524 386474
> [13] 385518 384468 1076441 1075391 850724 51881 957657 100610 862535 506651 505601 82272
> [25] 83322 692860 691810 494417 932343 689216 836826 894914 715393 421443 92496 485600
> [37] 253868 352083 594288 1049892 370822 369772 416675 928371 505790 506840 135781
> > cdf[cdf$QUAL=="10353008","INDEX"]
> [1] 506840 505790 928371 416675 369772 370822 1049892 485600 92496 421443 715393 894914
> [13] 1073586 110809 836826 689216 932343 494417 691810 83322 82272 505601 506651 862535
> [25] 100610 957657 51881 850724 1075391 1076441 384468 385518 386474 387524 968182 860722
> [37] 510591 337977 140756 963940 962890 575792 661828 7543
> > indexProbes(raw, genenames="10344719")
> $`10344719`
> [1] 692861 253869 352084 594289 135782
> > indexProbes(raw, genenames="10353008")
> $`10353008`
> [1] 506841 505791 928372 416676 369773 370823 1049893 485601 92497 421444 715394 894915
> [13] 1073587 110810 836827 689217 932344 494418 691811 83323 82273 505602 506652 862536
> [25] 100611 957658 51882 850725 1075392 1076442 384469 385519 386475 387525 968183 860723
> [37] 510592 337978 140757 963941 962891 575793 661829 7544
>
> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
> which are shared. In the cdf environment 10344719 appears to have
> the 42 shared probes removed, but they're still present in
> 10353008.
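
The counts in this paragraph can be checked directly from the index vectors quoted above. The following is a small base R sketch, not code from the original message: the variable names are made up, and the two vectors are copied from the quoted `cdf[...]` output. It counts the shared probes and compares the probes unique to 10344719, shifted by one, with what indexProbes returned.

```r
# INDEX values copied from the quoted cdf[...] output (variable names are made up).
ps_10344719 <- c(
  7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591, 860722, 968182,
  387524, 386474, 385518, 384468, 1076441, 1075391, 850724, 51881, 957657, 100610,
  862535, 506651, 505601, 82272, 83322, 692860, 691810, 494417, 932343, 689216,
  836826, 894914, 715393, 421443, 92496, 485600, 253868, 352083, 594288, 1049892,
  370822, 369772, 416675, 928371, 505790, 506840, 135781
)
ps_10353008 <- c(
  506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600, 92496, 421443,
  715393, 894914, 1073586, 110809, 836826, 689216, 932343, 494417, 691810, 83322,
  82272, 505601, 506651, 862535, 100610, 957657, 51881, 850724, 1075391, 1076441,
  384468, 385518, 386474, 387524, 968182, 860722, 510591, 337977, 140756, 963940,
  962890, 575792, 661828, 7543
)

# Probes listed in both probe sets, and probes listed in only one of them.
shared      <- intersect(ps_10344719, ps_10353008)
only_first  <- setdiff(ps_10344719, ps_10353008)
only_second <- setdiff(ps_10353008, ps_10344719)

cat("probes per set:", length(ps_10344719), length(ps_10353008), "\n")
cat("shared probes: ", length(shared), "\n")

# The cdf env index is INDEX + 1, so these should match indexProbes() for 10344719.
print(only_first + 1)
```

Run as written, this reports 47 and 44 probes with 42 shared, and the five shifted indices are 692861, 253869, 352084, 594289 and 135782, the same values indexProbes returned for 10344719.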
>
> A similar situation is seen for e.g. the overlapping probe sets
> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>
> > cdf[cdf$QUAL=="10461391","INDEX"]
> [1] 483268 1022846 409057 703153 328783 372162 882399 569942 765746 868615 948367 413614
> [13] 830931 434763 970910 600221 599171 135798 6746 455659 799186 912319 469313 145393
> [25] 872191 126758 801051 774196 773146 965810 272742 19445 585800 999188 1012776 823868
> [37] 156514 210874 645037 799505 1075142
> > cdf[cdf$QUAL=="10487930","INDEX"]
> [1] 1075142 799505 645037 210874 156514 823868 1012776 999188 585800 19445 272742 965810
> [13] 773146 774196 801051 126758 872191 145393 469313 912319 799186 839098 6746 135798
> [25] 599171 600221 970910 434763 830931 413614 948367 868615 765746 569942 882399 372162
> [37] 328783 703153 409057 1022846 483268
> > indexProbes(raw, genenames="10461391")
> $`10461391`
> [1] 455660
> > indexProbes(raw, genenames="10487930")
> $`10487930`
> [1] 1075143 799506 645038 210875 156515 823869 1012777 999189 585801 19446 272743 965811
> [13] 773147 774197 801052 126759 872192 145394 469314 912320 799187 839099 6747 135799
> [25] 599172 600222 970911 434764 830932 413615 948368 868616 765747 569943 882400 372163
> [37] 328784 703154 409058 1022847 483269
>
> Any comments on this or on exactly how the cdf environment is created
> would be much appreciated.
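
For looking inside a cdf environment directly, a sketch along the following lines may help. It assumes the usual affy layout, in which the environment maps each probe set name to a matrix of probe indices with `pm` and `mm` columns. The toy environment, its probe set names and its indices are invented here so that the example runs without any array data.

```r
# Toy stand-in for a cdf environment: probe set name -> matrix of probe indices.
# Names and indices are invented; a real environment would come from makecdfenv.
toy_env <- new.env()
assign("setA", cbind(pm = c(11, 12, 13), mm = NA), envir = toy_env)
assign("setB", cbind(pm = c(21, 22), mm = NA), envir = toy_env)

# Number of probes (matrix rows) in each probe set, ordered by probe set name.
n_probes <- unlist(eapply(toy_env, nrow))
n_probes <- n_probes[order(names(n_probes))]
print(n_probes)

# PM indices stored for a single probe set.
pm_setA <- get("setA", envir = toy_env)[, "pm"]
print(pm_setA)
```

With a real environment in place of `toy_env`, the same `eapply` call would give the probe count for every probe set after multi-mapping probes have been dropped, which can then be compared against the counts in the raw cdf file.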
>
> Thanks
> \Heidi
>
> > sessionInfo()
> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
> i386-apple-darwin8.10.1
>
> locale:
> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] tools stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] makecdfenv_1.17.0 affy_1.17.3 preprocessCore_1.1.5 affyio_1.7.17
> [5] Biobase_1.99.4
>
>
> ------------<<>>------------
> Heidi Dvinge
>
> EMBL-European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridge
> CB10 1SD
> Mail: heidi at ebi.ac.uk
> Phone: +44 (0) 1223 494 444
> ------------<<>>------------