[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array
Kasper Daniel Hansen
khansen at stat.Berkeley.EDU
Mon May 12 18:47:07 CEST 2008
On May 12, 2008, at 10:03 AM, lgautier at altern.org wrote:
>> On Sun, May 11, 2008 at 10:38 AM, Heidi Dvinge <heidi at ebi.ac.uk>
>> wrote:
>>> On 11 May 2008, at 14:28, Sean Davis wrote:
>>>> On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk>
>>>> wrote:
>>>>> Dear all,
>>>>> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have
>>>>> used the makecdfenv package to build a cdf environment based on the
>>>>> file MoGene-1_0-st-v1.r3.cdf from the affymetrix webpage.
>>>>> That worked without any problems, but out of curiosity I tried
>>>>> taking a closer look at the format of the array, to see how many
>>>>> probes were in each probe set etc.
>>>>> I'm aware that some probes map to multiple probe sets and are
>>>>> removed when the cdfenv is produced, which seems to be the case for
>>>>> about 8% of the probes. My question is exactly how this happens? I
>>>>> would expect the multiple-mapping probes to be removed from all
>>>>> probe sets, but this doesn't seem to be the case.
>>>> I believe that the probes are kept in the first or last probeset
>>>> (not sure which) seen. Someone with a little more affy experience
>>>> can comment more fully.
>>> I figured it was probably something along those lines, but what's
>>> the reason for not just removing them completely, instead of keeping
>>> them in a 'random' probe set? Most probes that map multiple times map
>>> to > 2 probe sets. And in some cases it's large chunks of probe sets
>>> that 'overlap', whereas in other cases it's just a few or a single
>>> probe that 'jumps around'.
>>
>> I think this probe "removal" is a side effect of the way the original
>> affy package and affy chips were designed. Before these newer arrays,
>> there were no probes that mapped to multiple probe sets, so there was
>> never a mechanism for "removing" probes or even maintaining multiple
>> mappings. So the current behavior is due to the fact that there is no
>> way to maintain the many-to-many mapping, if I understand it
>> correctly, and it is not really optimal in any particular way. Again,
>> someone with more affy experience might have more to say.
>
> The original use case was to be able to retrieve the probes in a given
> probe set, without further consideration. The need for possible
> alternative mappings was nevertheless considered, and it was made
> possible to replace the mapping used to process data at any given time
> (there is a vignette talking about that).
>
> Regarding many-to-many association between probes and probesets, this
> is indeed an annoying case (the original design somehow assumed a
> perfect world). It is not at all impossible to have "many-to-many"
> associations, but they certainly make the analysis of the data more
> difficult. To keep things simple, the recommendation would be
> "each probe goes into one probe set"... and get rid of the rest.
One problem with having a probe in multiple probesets is that certain
functions assume it does not happen. For example, the pm method simply
takes all the pm indices for the various probesets and stacks them. If
you have a probe in multiple probesets, it is therefore included
multiple times in the resulting output from pm, and in many cases
functions using pm assume that this is not the case.
Kasper
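
The stacking issue can be illustrated with a minimal base R sketch. It
does not call affy at all: "probesets" is a made-up stand-in for a CDF
environment, and the set names and indices are invented for illustration
only. It just shows what happens when the indices of all probe sets are
concatenated and one probe belongs to two sets.

```r
# Toy stand-in for a CDF environment: probe set name -> PM probe indices.
# Names and indices are invented for illustration only.
probesets <- list(
  setA = c(101L, 102L, 103L),
  setB = c(103L, 104L)   # probe 103 also belongs to setA
)

# Stack the indices of all probe sets into one vector
stacked <- unlist(probesets, use.names = FALSE)

length(stacked)                 # number of stacked entries
length(unique(stacked))         # number of distinct probes
stacked[duplicated(stacked)]    # probes that appear more than once
```

Any downstream code that treats each stacked entry as a distinct probe
would count probe 103 twice in this toy case.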
> The package "altcdfenvs" also proposes extensions to the CDF
> environments, with methods and functions to work with them.
>
>
>> Sean
>>
>>
>>>>> Example with the two overlapping probe sets 10344719 and 10353008,
>>>>> where "raw" is my AffyBatch, and "cdf" is the raw cdf-file turned
>>>>> into only tab-delimited info and read into R, and "INDEX" being a
>>>>> unique probe identifier (the same as index-1 in the cdf env):
>>>>>> cdf[cdf$QUAL=="10344719","INDEX"]
>>>>> [1] 7543 661828 575792 962890 963940 140756 337977 510591 860722 968182 387524 386474
>>>>> [13] 385518 384468 1076441 1075391 850724 51881 957657 100610 862535 506651 505601 82272
>>>>> [25] 83322 692860 691810 494417 932343 689216 836826 894914 715393 421443 92496 485600
>>>>> [37] 253868 352083 594288 1049892 370822 369772 416675 928371 505790 506840 135781
>>>>>> cdf[cdf$QUAL=="10353008","INDEX"]
>>>>> [1] 506840 505790 928371 416675 369772 370822 1049892 485600 92496 421443 715393 894914
>>>>> [13] 1073586 110809 836826 689216 932343 494417 691810 83322 82272 505601 506651 862535
>>>>> [25] 100610 957657 51881 850724 1075391 1076441 384468 385518 386474 387524 968182 860722
>>>>> [37] 510591 337977 140756 963940 962890 575792 661828 7543
>>>>>> indexProbes(raw, genenames="10344719")
>>>>> $`10344719`
>>>>> [1] 692861 253869 352084 594289 135782
>>>>>> indexProbes(raw, genenames="10353008")
>>>>> $`10353008`
>>>>> [1] 506841 505791 928372 416676 369773 370823 1049893 485601 92497 421444 715394 894915
>>>>> [13] 1073587 110810 836827 689217 932344 494418 691811 83323 82273 505602 506652 862536
>>>>> [25] 100611 957658 51882 850725 1075392 1076442 384469 385519 386475 387525 968183 860723
>>>>> [37] 510592 337978 140757 963941 962891 575793 661829 7544
>>>>> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
>>>>> which are overlapping. In the cdf environment 10344719 appears to
>>>>> have the 42 overlapping probes removed, but they're still present in
>>>>> 10353008.
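
The counts quoted above can be checked with base R set operations. This
is only a sketch: the INDEX values are copied from the listing in the
original post, and the variable names (idx_10344719, idx_10353008,
shared, only_first) are introduced here for illustration.

```r
# INDEX values for the two probe sets, copied from the post above
idx_10344719 <- c(
  7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591, 860722,
  968182, 387524, 386474, 385518, 384468, 1076441, 1075391, 850724, 51881,
  957657, 100610, 862535, 506651, 505601, 82272, 83322, 692860, 691810,
  494417, 932343, 689216, 836826, 894914, 715393, 421443, 92496, 485600,
  253868, 352083, 594288, 1049892, 370822, 369772, 416675, 928371, 505790,
  506840, 135781)
idx_10353008 <- c(
  506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600, 92496,
  421443, 715393, 894914, 1073586, 110809, 836826, 689216, 932343, 494417,
  691810, 83322, 82272, 505601, 506651, 862535, 100610, 957657, 51881,
  850724, 1075391, 1076441, 384468, 385518, 386474, 387524, 968182, 860722,
  510591, 337977, 140756, 963940, 962890, 575792, 661828, 7543)

# Probes listed in both probe sets
shared <- intersect(idx_10344719, idx_10353008)
length(shared)   # 42

# Probes listed only in 10344719; adding 1 converts INDEX to the
# cdf environment index (INDEX is described as "index-1" in the post)
only_first <- setdiff(idx_10344719, idx_10353008)
only_first + 1   # 692861 253869 352084 594289 135782
```

The five values in the last line are the same ones that indexProbes
reports for 10344719 in the post.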
>>>>> A similar situation is seen for e.g. the overlapping probe sets
>>>>> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>>>>>> cdf[cdf$QUAL=="10461391","INDEX"]
>>>>> [1] 483268 1022846 409057 703153 328783 372162 882399 569942 765746 868615 948367 413614
>>>>> [13] 830931 434763 970910 600221 599171 135798 6746 455659 799186 912319 469313 145393
>>>>> [25] 872191 126758 801051 774196 773146 965810 272742 19445 585800 999188 1012776 823868
>>>>> [37] 156514 210874 645037 799505 1075142
>>>>>> cdf[cdf$QUAL=="10487930","INDEX"]
>>>>> [1] 1075142 799505 645037 210874 156514 823868 1012776 999188 585800 19445 272742 965810
>>>>> [13] 773146 774196 801051 126758 872191 145393 469313 912319 799186 839098 6746 135798
>>>>> [25] 599171 600221 970910 434763 830931 413614 948367 868615 765746 569942 882399 372162
>>>>> [37] 328783 703153 409057 1022846 483268
>>>>>> indexProbes(raw, genenames="10461391")
>>>>> $`10461391`
>>>>> [1] 455660
>>>>>> indexProbes(raw, genenames="10487930")
>>>>> $`10487930`
>>>>> [1] 1075143 799506 645038 210875 156515 823869 1012777 999189 585801 19446 272743 965811
>>>>> [13] 773147 774197 801052 126759 872192 145394 469314 912320 799187 839099 6747 135799
>>>>> [25] 599172 600222 970911 434764 830932 413615 948368 868616 765747 569943 882400 372163
>>>>> [37] 328784 703154 409058 1022847 483269
>>>>> Any comments on this or on exactly how the cdf environment is
>>>>> created would be much appreciated.
>>>>> Thanks
>>>>> \Heidi
>>>>>> sessionInfo()
>>>>> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
>>>>> i386-apple-darwin8.10.1
>>>>> locale:
>>>>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>>>> attached base packages:
>>>>> [1] tools stats graphics grDevices utils datasets methods base
>>>>> other attached packages:
>>>>> [1] makecdfenv_1.17.0 affy_1.17.3 preprocessCore_1.1.5 affyio_1.7.17
>>>>> [5] Biobase_1.99.4
>>>>> ------------<<>>------------
>>>>> Heidi Dvinge
>>>>> EMBL-European Bioinformatics Institute
>>>>> Wellcome Trust Genome Campus
>>>>> Hinxton, Cambridge
>>>>> CB10 1SD
>>>>> Mail: heidi at ebi.ac.uk
>>>>> Phone: +44 (0) 1223 494 444
>>>>> ------------<<>>------------
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor