[BioC] gene2pathway retrain: which model is more complete?

Bogdan b.t.tokovenko at imbg.org.ua
Mon Aug 23 13:52:53 CEST 2010


After converting my custom gene2Domains mapping into a list of vectors

> head(entrez2interpro_nested)
$`679594`
 [1] "IPR019956" "IPR019954" "IPR019955" "IPR000626"

$`682397`
[1] "IPR019956" "IPR019954"

and feeding that into retrain(), I now have a 4th model (the most
complete?), built using:
genes: 5667 of 5667
features: 4007
level detectors: 78
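
For reference, the conversion mentioned above can be sketched in base R
roughly as follows. This is only a minimal sketch, not necessarily the exact
code I ran: it assumes the flat mapping is a list with one InterPro ID per
element and repeated Entrez IDs as names, and the toy values are just the two
genes shown above.

```r
# Flat mapping as it came out of BioMart: one InterPro ID per element,
# with the Entrez ID repeated as the element name (toy data, two genes).
entrez2interpro_list <- list(
  "679594" = "IPR019956", "679594" = "IPR019954",
  "679594" = "IPR019955", "679594" = "IPR000626",
  "682397" = "IPR019956", "682397" = "IPR019954"
)

# Flatten the values, then group them by Entrez ID. split() returns a
# named list with one character vector per gene; unique() drops any
# repeated Entrez/InterPro pairs.
ids <- unlist(entrez2interpro_list, use.names = FALSE)
entrez2interpro_nested <- lapply(
  split(ids, names(entrez2interpro_list)),
  unique
)
```

Each Entrez ID then appears exactly once as a list name, which is the form
shown in the head() output above.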

This makes Questions 3 and 4 from my previous email obsolete.
However, Questions 1 and 2 are still not fully clear to me.

I would now rephrase Q2 as:
Of all the retrain()-generated models I now have, which one is
theoretically better to use?
The one with the most genes, most level detectors, or most features (domains)?
Or the one with the lowest average prediction error, disregarding all
other factors?

On 22 August 2010 17:00, Bogdan <b.t.tokovenko at imbg.org.ua> wrote:
> Dear all,
>
> I have 2 PCs: server running Debian Lenny and R 2.7.1, and home
> running Debian Testing and R 2.11.1. Both have gene2pathway 1.6.1 (and
> dependencies) installed.
>
> When running `model.rno = retrain(organism = "rno")`, I got slightly
> different outputs describing the components to build the model:
>
> (server)
> genes: 4055 of 5667
> features: 3553
> level detectors: 74
>
> (home)
> genes: 3987 of 5577
> features: 3488
> level detectors: 75
>
> Question 1: The retrain() manual states that all the data for model
> training is fetched from KEGG and Ensembl. How then could these
> differences (above) be possible? I've run each retrain twice, to be
> sure that it was not a momentary glitch.
>
>
> Seeing this, I've decided to supply the gene2Domains mapping manually.
> Using BioMart, I asked for all entrez-interpro pairs:
>> head(entrez2interpro_list)
> $`679594`
> [1] "IPR019956"
>
> $`679594`
> [1] "IPR019954"
>
> $`679594`
> [1] "IPR019955"
>
> $`679594`
> [1] "IPR000626"
>
> $`682397`
> [1] "IPR019956"
>
> $`682397`
> [1] "IPR019954"
>
>> length(unique(names(entrez2interpro_list)))
> [1] 17666
>
>> model.rno = retrain(organism = "rno", gene2Domains = entrez2interpro_list)
>
> Feeding entrez2interpro_list to retrain(), I got these numbers:
>
> (manual gene2Domains)
> genes: 5677 of 5677
> features: 1852
> level detectors: 78
>
> Question 2 (main question): Of these 3 models I now have, which one is
> theoretically better to use? The one with the most genes, the most level
> detectors, or the most features?
>
> Question 3: Is the format of my entrez2interpro_list correct? There
> were no errors, but that list has duplicate names. I wonder if each
> EntrezID should be in the list only once, with all relevant IPRs
> packed into a nested list.
> (possibly related) Question 4: How could it happen that there are only
> 1852 features for the most complete coverage of gene mappings in the
> "manual gene2Domains" case?

-- 
Regards,
Bogdan Tokovenko
--
Laboratory of Systems Biology,
Department of Genetic Information Translation Mechanisms,
Institute of Molecular Biology and Genetics, Kyiv, Ukraine
http://SysBio.org.ua/
http://BioMed.org.ua/COTRASIF/