[Bioc-devel] yeast2 metadata

Nianhua Li nli at fhcrc.org
Thu Jul 20 23:06:41 CEST 2006


Hi, Mattia,

Mattia Pelizzola wrote:
> Dear all, 
>
> I wonder why some very important environments are missing in the yeast2 
> annotation package (in both 1.8 and 1.7 bioC releases). For instance the 
> ACCNUM and the SYMBOL environments are missing, while the GENENAME env is 
> available. This would be a problem in writing scripts that can be applied to 
> a generic Affy annotation package.
>   
Most annotation packages for human, mouse and rat are generated by
ABPkgBuilder, and therefore have a similar set of environments. Annotation
packages for other organisms, such as yeast and arabidopsis, are
generated by other functions, so their interface can differ. yeast2
is generated by yeastPkgBuilder. The environment yeast2ORF plays a role
similar to ACCNUM in some other packages. We first get the probeset ID to
ORF ID mapping from Affymetrix, then merge it with the ORF ID to SGD ID
mapping from
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/orf_geneontology.tab
to get the probeset ID to SGD ID mapping. We then use the SGD ID to get
the other annotations. For human, mouse or rat annotation packages, we map
probeset ID to Entrez Gene ID and get the other annotations from there. We
use different functions/processes to build annotation packages for
different organisms because the community standards are different.
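
For scripts that need to work across packages, one possible workaround is
to look for the ACCNUM environment first and fall back to ORF. The sketch
below is only an illustration: getIdEnv is a hypothetical helper, not part
of any Bioconductor package, and the environment and IDs in the example
are stand-ins made up so that the code runs without yeast2 installed.

```r
# Hypothetical helper: return the first environment found among
# <pkg>ACCNUM and <pkg>ORF, searching from the caller's environment.
getIdEnv <- function(pkg, suffixes = c("ACCNUM", "ORF"),
                     where = parent.frame()) {
  force(where)
  for (s in suffixes) {
    name <- paste(pkg, s, sep = "")
    if (exists(name, envir = where, inherits = TRUE)) {
      return(get(name, envir = where, inherits = TRUE))
    }
  }
  stop("no identifier environment found for ", pkg)
}

# Stand-in for the real yeast2ORF environment (made-up probeset ID).
yeast2ORF <- new.env()
assign("probe_1_at", "YAL001C", envir = yeast2ORF)

# There is no yeast2ACCNUM here, so the helper falls back to yeast2ORF.
idEnv <- getIdEnv("yeast2")
orf <- get("probe_1_at", envir = idEnv)
```

With the real package attached, the same call should find yeast2ORF on the
search path, but I have not checked this against every annotation package.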
> Moreover, in both 1.8 and 1.7 bioC releases, the yeast2 package has some 
> problems in the quality checks. For example for the old one:
>
> Quality control information for  yeast2
> Date built: Created: Tue Oct  4 19:23:39 2005
>
> Number of probes: 10928
> Probe number missmatch: yeast2ALIAS; yeast2CHRLOC; yeast2CHR; 
> yeast2DESCRIPTION; yeast2ENZYME; yeast2GENENAME; yeast2GO; yeast2ORF; 
> yeast2PATH; yeast2PMID
> Probe missmatch: None
> Mappings found for probe based rda files:
>          yeast2ALIAS found 1759 of 10928
>          yeast2CHRLOC found 4962 of 10928
>          yeast2CHR found 5645 of 10928
>          yeast2DESCRIPTION found 5631 of 10928
>          yeast2ENZYME found 778 of 10928
>          yeast2GENENAME found 4335 of 10928
>          yeast2GO found 5645 of 10928
>          yeast2ORF found 6356 of 10928
>          yeast2PATH found 1110 of 10928
>          yeast2PMID found 5454 of 10928
> Mappings found for non-probe based rda files:
>          yeast2CHRLENGTHS found 17
>          yeast2ENZYME2PROBE found 446
>          yeast2GO2ALLPROBES found 3870
>          yeast2GO2PROBE found 2615
>          yeast2ORGANISM found 1
>          yeast2PATH2PROBE found 99
>          yeast2PMID2PROBE found 34881
>
>
> I previously experienced problems regarding the "Probe number missmatch", but 
> usually the total number of probes was less than the number of probes in the 
> corresponding cdf package, and the annotation was very poor. In this case the 
> number of probesets seems to be ok, while I'm not able to evaluate the amount 
> of annotated probesets.
>   
The probe number mismatch may be caused by this code in the function
getProbe2SGD:

    merged <- merge(probes, temp, by = by, all.x = TRUE)
    merged <- merged[!duplicated(merged[, 1]), ]

Here "probes" is the probeset ID to ORF ID mapping obtained from
Affymetrix. All probeset IDs are included (10928 probeset IDs in total).
"temp" is the ORF ID to SGD ID mapping obtained from
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/orf_geneontology.tab

The two are first merged, so the result still has 10928 probeset IDs.
Then, when multiple probeset IDs map to the same ORF ID, only the
first probeset ID is kept. Because the 10928 probeset IDs map to
only 6356 unique ORF IDs, only 6356 probeset IDs are kept for the rest
of the annotation process. I guess that is why we get the probe number
mismatch. It seems the author intended to remove probeset IDs with
duplicated mappings. Maybe the author can explain more.
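
To make the effect concrete, here is a small sketch of the same two steps
on toy data. The data frames, column names and IDs are made up for
illustration, and I assume the merge key (the ORF ID) ends up as the first
column of the merged result, which is what merge() does with its "by"
columns.

```r
# Toy probeset ID to ORF ID mapping: two probesets share one ORF.
probes <- data.frame(
  ORF   = c("YAL001C", "YAL001C", "YAL002W", "YAL003W"),
  PROBE = c("p1_at", "p2_at", "p3_at", "p4_at"),
  stringsAsFactors = FALSE
)

# Toy ORF ID to SGD ID mapping: YAL003W has no SGD ID here.
temp <- data.frame(
  ORF = c("YAL001C", "YAL002W"),
  SGD = c("S000000001", "S000000002"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of "probes", so no probeset is lost yet.
merged <- merge(probes, temp, by = "ORF", all.x = TRUE)

# Column 1 is the ORF ID, so this keeps one row per ORF and silently
# drops the other probesets that map to the same ORF.
deduped <- merged[!duplicated(merged[, 1]), ]

# Probesets that no longer appear after the second step.
dropped <- setdiff(probes$PROBE, deduped$PROBE)
```

Here merged has 4 rows and deduped has 3, one per unique ORF, which is
the same pattern as 10928 versus 6356 in the QC report.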

thanks

nianhua

> I would ask if it would be possible not to release metadata packages showing 
> missmatches.
>
> regards and thanks for any comment
>
> mattia
>
>
