[BioC] how to find probes' names in probeset
James W. MacDonald
jmacdon at med.umich.edu
Fri Aug 27 21:41:53 CEST 2010
Hi Galina,
On 8/27/2010 11:45 AM, Glazko, Galina wrote:
> Dear list,
>
> I would appreciate it if someone could clarify this seemingly simple issue for me:
>
> I have probes for the probe set:
> probes1<- subset(drosophila2probe, Probe.Set.Name == "1631333_s_at")
>> as.data.frame(probes1)
>                         sequence   x   y Probe.Set.Name Probe.Interrogation.Position Target.Strandedness
> 119715 CTCACATTCTTCTCCTAATACGATA   2 273   1631333_s_at                         1011           Antisense
> 119716 CGGCCATTCTGGACTTCTGGGACAA   4 289   1631333_s_at                          490           Antisense
> 119717 GGTCCCGGTGGTATCATCTGCAACA 564 535   1631333_s_at                          525           Antisense
> 119718 ATCTGCAACATTGGATCCGTCACTG 656  39   1631333_s_at                          540           Antisense
> 119719 GGATTCAATGCCATCTACCAGGTGC 467 543   1631333_s_at                          564           Antisense
> 119720 CGGCGTGACGGCTTACACTGTGAAC  40 289   1631333_s_at                          659           Antisense
> 119721 TGGTGCACACGTTCAACTCCTGGTT 682 591   1631333_s_at                          706           Antisense
> 119722 ACTCCTGGTTGGATGTTGAGCCTCA 573 145   1631333_s_at                          721           Antisense
> 119723 TTGAGCCTCAGGTTGCCGAGAAGCT  93 725   1631333_s_at                          736           Antisense
> 119724 GAACTTCGTCAAGGCTATCGAGCTG 670 383   1631333_s_at                          800           Antisense
> 119725 GGAAACTGGACTTGGGCACCCTGGA 399 559   1631333_s_at                          844           Antisense
> 119726 TGGAGGCCATCCAGTGGACCAAGCA 249 589   1631333_s_at                          865           Antisense
> 119727 CTGGGACTCCGGCATCTAAGAAGTG 311 285   1631333_s_at                          890           Antisense
> 119728 AAGGCTGATTCGATGCACACTCACA 612 225   1631333_s_at                          992           Antisense
>
> On the other hand, if I run:
>> dat<-ReadAffy()
>> y=log2(pm(dat,geneNames(dat)))
>> ind<-grep("^1631333_s_at",rownames(y))
>> sub<-y[ind,]
>> sub
> E1E1_DrosophilaGenome2.0.CEL
> 1631333_s_at1 14.11578
> 1631333_s_at2 14.22671
> 1631333_s_at3 14.16891
> 1631333_s_at4 13.29505
> 1631333_s_at5 14.28973
> 1631333_s_at6 13.73725
> 1631333_s_at7 14.33371
> 1631333_s_at8 14.15979
> 1631333_s_at9 14.30442
> 1631333_s_at10 14.70169
> 1631333_s_at11 14.25695
> 1631333_s_at12 14.39359
> 1631333_s_at13 14.51533
> 1631333_s_at14 13.42114
>
> Where do the probe numbers 1-14 come from?
These come from the fact that you cannot have duplicate row names for a
data.frame, so R mangles the names by adding sequential numbers on the end.
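As a small illustration of how suffixes of this kind can arise, base R's unlist() names the elements of a flattened named list by appending each value's position within its list element. This is only a sketch on made-up probe-set names and values (setA_at, setB_at and the numbers are hypothetical), and it is not a statement about what pm() does internally:

```r
# Hypothetical intensities for two made-up probe sets, stored as a named list.
pm_by_set <- list(setA_at = c(10.2, 11.5, 9.8), setB_at = c(12.1, 12.4))

# unlist() flattens the list; each element with more than one value gets
# names of the form <list name><position within that element>.
flat <- unlist(pm_by_set)
print(names(flat))
```

The suffix therefore only records the position of a probe within its probe set in whatever order the values were supplied; it carries no link to the row numbers of another table.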
> I would like to be able to relate them to the information in 'probes1'.
> For example, that 1631333_s_at1 is actually N1 in probes1, or something like this.
> I thought maybe the numbering of probes (119715:119728) in probes1 has something to do with the numbering in y, but this is not the case:
>> ind
> [1] 118091 118092 118093 118094 118095 118096 118097 118098 118099 118100 118101 118102 118103 118104
You can line things up using the (x, y) coordinates from the probe
package, along with the (x, y) coordinates from the cdf package.
As an example, let's use the hgu95av2 chip, since I already have the
required packages installed.
On this chip, 10193 of the probesets are ordered the same for both the
cdf and probe package. But there are > 12k probesets, so that isn't
close enough.
This one matches:
> indices2xy(get("100_g_at", hgu95av2cdf)[,1], cdf="hgu95av2cdf")
x y
[1,] 497 273
[2,] 208 557
[3,] 495 355
[4,] 478 371
[5,] 612 429
[6,] 563 317
[7,] 223 559
[8,] 523 575
[9,] 551 445
[10,] 509 475
[11,] 576 249
[12,] 568 349
[13,] 523 441
[14,] 562 421
[15,] 622 473
[16,] 567 607
> a <- as.data.frame(hgu95av2probe)
> a[a$Probe.Set.Name == "100_g_at",c("x","y")]
x y
449 497 273
450 208 557
451 495 355
452 478 371
453 612 429
454 563 317
455 223 559
456 523 575
457 551 445
458 509 475
459 576 249
460 568 349
461 523 441
462 562 421
463 622 473
464 567 607
This one does not:
> indices2xy(get("1002_f_at", hgu95av2cdf)[,1], cdf="hgu95av2cdf")
x y
[1,] 309 555
[2,] 195 583
[3,] 375 585
[4,] 341 403
[5,] 629 153
[6,] 619 379
[7,] 480 471
[8,] 439 475
[9,] 410 391
[10,] 619 491
[11,] 537 237
[12,] 510 255
[13,] 500 275
[14,] 381 521
[15,] 366 541
[16,] 449 357
> a[a$Probe.Set.Name == "1002_f_at",c("x","y")]
x y
2870 449 357
2871 309 555
2872 195 583
2873 375 585
2874 341 403
2875 629 153
2876 619 379
2877 480 471
2878 439 475
2879 410 391
2880 619 491
2881 537 237
2882 510 255
2883 500 275
2884 381 521
2885 366 541
So lining up the data is just a matter of extracting the data, and
re-ordering based on the (x, y) coordinate information.
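A minimal sketch of that re-ordering step, in base R only, using the (x, y) values for "1002_f_at" copied from the two outputs above. The object names (cdf_xy, probe_xy, lookup) are made up for the illustration, and it assumes that row i of the cdf output corresponds to the i-th numbered row for that probeset in the pm() matrix:

```r
# (x, y) coordinates for "1002_f_at" in cdf order (from the indices2xy() output).
cdf_xy <- data.frame(
  x = c(309, 195, 375, 341, 629, 619, 480, 439, 410, 619, 537, 510, 500, 381, 366, 449),
  y = c(555, 583, 585, 403, 153, 379, 471, 475, 391, 491, 237, 255, 275, 521, 541, 357)
)

# The same probes in probe-package order, keeping the row names shown above.
probe_xy <- data.frame(
  x = c(449, 309, 195, 375, 341, 629, 619, 480, 439, 410, 619, 537, 510, 500, 381, 366),
  y = c(357, 555, 583, 585, 403, 153, 379, 471, 475, 391, 491, 237, 255, 275, 521, 541),
  row.names = 2870:2885
)

# Build one key per probe from its coordinates, then look up each cdf row
# in the probe-package table.
key <- function(d) paste(d$x, d$y, sep = ",")
ord <- match(key(cdf_xy), key(probe_xy))
stopifnot(!anyNA(ord))  # every cdf probe must be found in the probe table

# Probe-package rows re-ordered to follow the cdf order.
probe_in_cdf_order <- probe_xy[ord, ]
lookup <- data.frame(
  pm_row    = paste0("1002_f_at", seq_along(ord)),  # assumed pm() row names
  probe_row = rownames(probe_in_cdf_order),
  probe_in_cdf_order,
  row.names = NULL
)
print(lookup)
```

With real data the two coordinate tables would come from indices2xy() on the cdf environment and from the probe package, as shown above, and any further probe-package columns (the sequence, for instance) can be carried along through the same index vector.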
Best,
Jim
>
> Thank you!
> best regards
> Galina
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826