[BioC] how to find probes' names in probeset

Fri Aug 27 21:41:53 CEST 2010

Hi Galina,

On 8/27/2010 11:45 AM, Glazko, Galina wrote:
> Dear list,
>
> I would appreciate it if someone could clarify this seemingly simple issue:
>
> I have probes for the probe set:
> probes1<- subset(drosophila2probe, Probe.Set.Name == "1631333_s_at")
>> as.data.frame(probes1)
>                          sequence   x   y Probe.Set.Name Probe.Interrogation.Position Target.Strandedness
> 119715 CTCACATTCTTCTCCTAATACGATA   2 273   1631333_s_at                         1011           Antisense
> 119716 CGGCCATTCTGGACTTCTGGGACAA   4 289   1631333_s_at                          490           Antisense
> 119717 GGTCCCGGTGGTATCATCTGCAACA 564 535   1631333_s_at                          525           Antisense
> 119718 ATCTGCAACATTGGATCCGTCACTG 656  39   1631333_s_at                          540           Antisense
> 119719 GGATTCAATGCCATCTACCAGGTGC 467 543   1631333_s_at                          564           Antisense
> 119720 CGGCGTGACGGCTTACACTGTGAAC  40 289   1631333_s_at                          659           Antisense
> 119721 TGGTGCACACGTTCAACTCCTGGTT 682 591   1631333_s_at                          706           Antisense
> 119722 ACTCCTGGTTGGATGTTGAGCCTCA 573 145   1631333_s_at                          721           Antisense
> 119723 TTGAGCCTCAGGTTGCCGAGAAGCT  93 725   1631333_s_at                          736           Antisense
> 119724 GAACTTCGTCAAGGCTATCGAGCTG 670 383   1631333_s_at                          800           Antisense
> 119725 GGAAACTGGACTTGGGCACCCTGGA 399 559   1631333_s_at                          844           Antisense
> 119726 TGGAGGCCATCCAGTGGACCAAGCA 249 589   1631333_s_at                          865           Antisense
> 119727 CTGGGACTCCGGCATCTAAGAAGTG 311 285   1631333_s_at                          890           Antisense
> 119728 AAGGCTGATTCGATGCACACTCACA 612 225   1631333_s_at                          992           Antisense
>
> on the other hand, if
>> dat<-ReadAffy()
>> y=log2(pm(dat,geneNames(dat)))
>> ind<-grep("^1631333_s_at",rownames(y))
>> sub<-y[ind,]
>> sub
>                 E1E1_DrosophilaGenome2.0.CEL
> 1631333_s_at1                          14.11578
> 1631333_s_at2                          14.22671
> 1631333_s_at3                          14.16891
> 1631333_s_at4                          13.29505
> 1631333_s_at5                          14.28973
> 1631333_s_at6                          13.73725
> 1631333_s_at7                          14.33371
> 1631333_s_at8                          14.15979
> 1631333_s_at9                          14.30442
> 1631333_s_at10                         14.70169
> 1631333_s_at11                         14.25695
> 1631333_s_at12                         14.39359
> 1631333_s_at13                         14.51533
> 1631333_s_at14                         13.42114
>
> Where do the probe numbers 1-14 come from?

These come from the fact that you cannot have duplicate row names for a 
data.frame, so R mangles the names by adding sequential numbers on the end.

> I would like to be able to relate them to the information in 'probes1'.
> For example, that 1631333_s_at1 is actually N1 in probes1, or something like this.
> I thought maybe the numbering of probes (119715:119728) in probes1 has something to do with the numbering in y, but this is not the case:
>> ind
>   [1] 118091 118092 118093 118094 118095 118096 118097 118098 118099 118100 118101 118102 118103 118104

You can line things up using the (x, y) coordinates from the probe 
package, along with the (x, y) coordinates from the cdf package.

As an example, let's use the hgu95av2 chip, since I already have the 
required packages installed.

On this chip, 10193 of the probesets are ordered the same for both the 
cdf and probe package. But there are > 12k probesets, so that isn't 
close enough.

This one matches:
 > indices2xy(get("100_g_at", hgu95av2cdf)[,1], cdf="hgu95av2cdf")
         x   y
  [1,] 497 273
  [2,] 208 557
  [3,] 495 355
  [4,] 478 371
  [5,] 612 429
  [6,] 563 317
  [7,] 223 559
  [8,] 523 575
  [9,] 551 445
[10,] 509 475
[11,] 576 249
[12,] 568 349
[13,] 523 441
[14,] 562 421
[15,] 622 473
[16,] 567 607

 > a <- as.data.frame(hgu95av2probe)
 > a[a$Probe.Set.Name == "100_g_at",c("x","y")]
       x   y
449 497 273
450 208 557
451 495 355
452 478 371
453 612 429
454 563 317
455 223 559
456 523 575
457 551 445
458 509 475
459 576 249
460 568 349
461 523 441
462 562 421
463 622 473
464 567 607

This one does not:

 > indices2xy(get("1002_f_at", hgu95av2cdf)[,1], cdf="hgu95av2cdf")
         x   y
  [1,] 309 555
  [2,] 195 583
  [3,] 375 585
  [4,] 341 403
  [5,] 629 153
  [6,] 619 379
  [7,] 480 471
  [8,] 439 475
  [9,] 410 391
[10,] 619 491
[11,] 537 237
[12,] 510 255
[13,] 500 275
[14,] 381 521
[15,] 366 541
[16,] 449 357

 > a[a$Probe.Set.Name == "1002_f_at",c("x","y")]
        x   y
2870 449 357
2871 309 555
2872 195 583
2873 375 585
2874 341 403
2875 629 153
2876 619 379
2877 480 471
2878 439 475
2879 410 391
2880 619 491
2881 537 237
2882 510 255
2883 500 275
2884 381 521
2885 366 541

So lining up the data is just a matter of extracting the data, and 
re-ordering based on the (x, y) coordinate information.
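
As a rough sketch of that re-ordering step, here is one way to do it in base R with match(). The (x, y) pairs are a subset of the 1002_f_at output above; the intensity values, the probe labels, and the object names (cdf_xy, probe_xy, aligned) are made up for illustration. Substitute the real pm() intensities and the columns from the probe package.

```r
# Toy data: the (x, y) pairs come from the 1002_f_at output above.
# The intensities and labels are hypothetical placeholders.
cdf_xy <- data.frame(x = c(309, 195, 375, 449),
                     y = c(555, 583, 585, 357),
                     intensity = c(10.1, 11.2, 12.3, 13.4),  # made up, in cdf order
                     stringsAsFactors = FALSE)

probe_xy <- data.frame(x = c(449, 309, 195, 375),
                       y = c(357, 555, 583, 585),
                       label = c("p1", "p2", "p3", "p4"),    # made up, in probe package order
                       stringsAsFactors = FALSE)

# Collapse each (x, y) pair into one key, then look up where each
# cdf-ordered probe sits in the probe package table.
key_cdf   <- paste(cdf_xy$x, cdf_xy$y, sep = "_")
key_probe <- paste(probe_xy$x, probe_xy$y, sep = "_")
idx <- match(key_cdf, key_probe)

# Every cdf probe should have a partner; stop if one does not.
stopifnot(!anyNA(idx))

# Attach the probe package information in cdf order.
aligned <- cdf_xy
aligned$label <- probe_xy$label[idx]
aligned
```

Matching on a pasted key rather than sorting both tables keeps the intensities in the order pm() returned them, so the result still lines up with the row names of the expression matrix.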

Best,

Jim

>
> Thank you!
> best regards
> Galina
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826