[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array
Heidi Dvinge
heidi at ebi.ac.uk
Sun May 11 16:38:56 CEST 2008
On 11 May 2008, at 14:28, Sean Davis wrote:
> On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
>> Dear all,
>>
>> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
>> the makecdfenv package to build a cdf environment based on the file
>> MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.
>>
>> That worked without any problems, but out of curiosity I tried taking
>> a closer look at the format of the array, to see how many probes were
>> in each probe set etc.
>>
>> I'm aware that some probes map to multiple probe sets and are removed
>> when the cdfenv is produced, which seems to be the case for about 8%
>> of the probes. My question is: how exactly does this happen? I would
>> expect the multiple-mapping probes to be removed from all probe sets,
>> but this doesn't seem to be the case.
>
> I believe that the probes are kept in the first or last probeset (not
> sure which) seen. Someone with a little more affy experience can
> comment more fully.
>
I figured it was probably something along those lines, but what's the
reason for keeping them in a 'random' probe set instead of just
removing them completely? Most probes that map multiple times map to
more than two probe sets. In some cases large chunks of probe sets
'overlap', whereas in other cases just a few probes, or a single one,
'jump around'.
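
To make the two options concrete, here is a minimal sketch in R on a
made-up table with the same QUAL and INDEX columns as the "cdf" data
frame quoted below. The probe-set names and indices are invented, and
the "keep in the last probe set seen" rule is only an assumption based
on Sean's guess; it is not taken from the makecdfenv source.

```r
# Toy stand-in for the tab-delimited cdf table (all values are invented).
cdf_toy <- data.frame(
  QUAL  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  INDEX = c(1, 2, 3, 2, 3, 4, 3, 5),
  stringsAsFactors = FALSE
)

# Number of distinct probe sets each probe index appears in.
n_sets <- tapply(cdf_toy$QUAL, cdf_toy$INDEX, function(q) length(unique(q)))
multi  <- as.numeric(names(n_sets)[n_sets > 1])

# Option 1: drop multi-mapping probes from every probe set.
drop_all <- cdf_toy[!cdf_toy$INDEX %in% multi, ]

# Option 2 (assumed behaviour): keep each probe only in the last
# probe set that lists it.
keep_last <- cdf_toy[!duplicated(cdf_toy$INDEX, fromLast = TRUE), ]

split(drop_all$INDEX, drop_all$QUAL)
split(keep_last$INDEX, keep_last$QUAL)
```

Under option 2, probe set "A" loses its shared probes while "B" and "C"
keep theirs, which resembles the pattern in the examples below; whether
the real cdf environment is built this way is not confirmed here.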
\Heidi
> Sean
>
>> Here is an example with the two overlapping probe sets 10344719 and
>> 10353008. "raw" is my AffyBatch, "cdf" is the raw cdf file converted
>> to tab-delimited text and read into R, and "INDEX" is a unique probe
>> identifier (equal to the cdf env index minus 1):
>>
>> > cdf[cdf$QUAL=="10344719","INDEX"]
>> [1] 7543 661828 575792 962890 963940 140756 337977 510591 860722 968182 387524 386474
>> [13] 385518 384468 1076441 1075391 850724 51881 957657 100610 862535 506651 505601 82272
>> [25] 83322 692860 691810 494417 932343 689216 836826 894914 715393 421443 92496 485600
>> [37] 253868 352083 594288 1049892 370822 369772 416675 928371 505790 506840 135781
>> > cdf[cdf$QUAL=="10353008","INDEX"]
>> [1] 506840 505790 928371 416675 369772 370822 1049892 485600 92496 421443 715393 894914
>> [13] 1073586 110809 836826 689216 932343 494417 691810 83322 82272 505601 506651 862535
>> [25] 100610 957657 51881 850724 1075391 1076441 384468 385518 386474 387524 968182 860722
>> [37] 510591 337977 140756 963940 962890 575792 661828 7543
>> > indexProbes(raw, genenames="10344719")
>> $`10344719`
>> [1] 692861 253869 352084 594289 135782
>> > indexProbes(raw, genenames="10353008")
>> $`10353008`
>> [1] 506841 505791 928372 416676 369773 370823 1049893 485601 92497 421444 715394 894915
>> [13] 1073587 110810 836827 689217 932344 494418 691811 83323 82273 505602 506652 862536
>> [25] 100611 957658 51882 850725 1075392 1076442 384469 385519 386475 387525 968183 860723
>> [37] 510592 337978 140757 963941 962891 575793 661829 7544
>>
>> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
>> which are overlapping. In the cdf environment 10344719 appears to
>> have the 42 overlapping probes removed, but they're still present in
>> 10353008.
>>
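
The counts above can be double-checked with base R set operations. This
is a minimal sketch that pastes in the INDEX values printed above rather
than reading the real cdf file; the variable names are made up for
illustration, and the -1 shift reflects the offset between the cdf env
index and INDEX mentioned earlier.

```r
# INDEX values copied from the two cdf listings above.
ps_10344719 <- c(
  7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591, 860722,
  968182, 387524, 386474, 385518, 384468, 1076441, 1075391, 850724, 51881,
  957657, 100610, 862535, 506651, 505601, 82272, 83322, 692860, 691810,
  494417, 932343, 689216, 836826, 894914, 715393, 421443, 92496, 485600,
  253868, 352083, 594288, 1049892, 370822, 369772, 416675, 928371, 505790,
  506840, 135781
)
ps_10353008 <- c(
  506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600, 92496,
  421443, 715393, 894914, 1073586, 110809, 836826, 689216, 932343, 494417,
  691810, 83322, 82272, 505601, 506651, 862535, 100610, 957657, 51881,
  850724, 1075391, 1076441, 384468, 385518, 386474, 387524, 968182, 860722,
  510591, 337977, 140756, 963940, 962890, 575792, 661828, 7543
)

# Probes that indexProbes() still reports for 10344719, shifted by -1
# so that they are comparable with INDEX.
env_10344719 <- c(692861, 253869, 352084, 594289, 135782) - 1

shared <- intersect(ps_10344719, ps_10353008)
length(ps_10344719)  # 47
length(ps_10353008)  # 44
length(shared)       # 42

# The probes left in 10344719 are exactly the ones it does not share
# with 10353008.
setequal(env_10344719, setdiff(ps_10344719, ps_10353008))  # TRUE
```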
>> A similar situation is seen for e.g. the overlapping probe sets
>> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>>
>> > cdf[cdf$QUAL=="10461391","INDEX"]
>> [1] 483268 1022846 409057 703153 328783 372162 882399 569942 765746 868615 948367 413614
>> [13] 830931 434763 970910 600221 599171 135798 6746 455659 799186 912319 469313 145393
>> [25] 872191 126758 801051 774196 773146 965810 272742 19445 585800 999188 1012776 823868
>> [37] 156514 210874 645037 799505 1075142
>> > cdf[cdf$QUAL=="10487930","INDEX"]
>> [1] 1075142 799505 645037 210874 156514 823868 1012776 999188 585800 19445 272742 965810
>> [13] 773146 774196 801051 126758 872191 145393 469313 912319 799186 839098 6746 135798
>> [25] 599171 600221 970910 434763 830931 413614 948367 868615 765746 569942 882399 372162
>> [37] 328783 703153 409057 1022846 483268
>> > indexProbes(raw, genenames="10461391")
>> $`10461391`
>> [1] 455660
>> > indexProbes(raw, genenames="10487930")
>> $`10487930`
>> [1] 1075143 799506 645038 210875 156515 823869 1012777 999189 585801 19446 272743 965811
>> [13] 773147 774197 801052 126759 872192 145394 469314 912320 799187 839099 6747 135799
>> [25] 599172 600222 970911 434764 830932 413615 948368 868616 765747 569943 882400 372163
>> [37] 328784 703154 409058 1022847 483269
>>
>> Any comments on this or on exactly how the cdf environment is created
>> would be much appreciated.
>>
>> Thanks
>> \Heidi
>>
>> > sessionInfo()
>> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
>> i386-apple-darwin8.10.1
>>
>> locale:
>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] tools stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] makecdfenv_1.17.0 affy_1.17.3 preprocessCore_1.1.5 affyio_1.7.17
>> [5] Biobase_1.99.4
>>
>>
>> ------------<<>>------------
>> Heidi Dvinge
>>
>> EMBL-European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton, Cambridge
>> CB10 1SD
>> Mail: heidi at ebi.ac.uk
>> Phone: +44 (0) 1223 494 444
>> ------------<<>>------------