[BioC] Ontology lack in goProfiles : no BP or CC just MF

Arnaud Mounier arnaud.mounier at dijon.inra.fr
Fri Apr 18 12:35:04 CEST 2014


Hi,
I've built a specific DataFrame with Python pandas to compute ontology
frequencies with goProfiles in Bioconductor. I use the basicProfile
function with idType='GOTermsFrame', but without the optional 'Evidence'
column. I have one big DataFrame, as follows:

     In [1]: df.info()
     <class 'pandas.core.frame.DataFrame'>
     Int64Index: 119626 entries, 0 to 119625
     Data columns (total 3 columns):
     GeneID      119626 non-null object
     GOID        119626 non-null object
     Ontology    119626 non-null object
     dtypes: object(3)

So there are almost 120000 entries, split by Ontology as follows:

     In [2]: df.groupby(['Ontology'])['Ontology'].count()
     Ontology
     BP          58802
     CC          26867
     MF          33957
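
The counts above are per row, not per gene. A minimal pandas sketch of how they could be cross-checked, assuming a small made-up frame with the same three columns (the gene names `g1`/`g2` and the variables `toy`, `rows_per_onto`, `genes_per_onto` and `unexpected` are illustrative only, not taken from the real data):

```python
import pandas as pd

# Toy stand-in for the real frame: same column names, invented genes.
toy = pd.DataFrame({
    "GeneID": ["g1", "g1", "g1", "g1", "g2", "g2"],
    "GOID": ["GO:0043565", "GO:0003964", "GO:0006278",
             "GO:0005840", "GO:0003743", "GO:0006367"],
    "Ontology": ["MF", "MF", "BP", "CC", "MF", "BP"],
})

# Rows per ontology (the same quantity the groupby/count above reports).
rows_per_onto = toy.groupby("Ontology").size()

# Distinct genes per ontology; this can be lower than the row count
# when one gene carries several terms from the same ontology.
genes_per_onto = toy.groupby("Ontology")["GeneID"].nunique()

# Labels that are not exactly "BP", "CC" or "MF" (stray whitespace, case).
unexpected = sorted(set(toy["Ontology"]) - {"BP", "CC", "MF"})

print(rows_per_onto)
print(genes_per_onto)
print(unexpected)
```

If `unexpected` is not empty on the real frame, the labels may not match what the R side expects; that is only a guess to rule out, not a confirmed cause.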

When I compute the profile for all three ontologies (onto="ANY") at
level 2, I get these frequencies:

     In [3]: rdf = com.convert_to_r_dataframe(df)
     In [4]: %%R -i rdf
     > library(goProfiles)
     > rdf <- as.data.frame(rdf)
     > print(head(rdf))
                     GeneID       GOID Ontology
     0 VIT_201s0011g00010.1 GO:0043565       MF
     1 VIT_201s0011g00010.1 GO:0003964       MF
     2 VIT_201s0011g00010.1 GO:0006278       BP
     3 VIT_201s0011g00010.1 GO:0006367       BP
     4 VIT_201s0011g00010.1 GO:0003743       MF
     5 VIT_201s0011g00010.1 GO:0005840       CC

     > profiles.ANY <- basicProfile(rdf,idType='GOTermsFrame',onto="ANY",level=2)
     > printProfiles(profiles.ANY,percentage=T,aTitle="Test GO Profile")

     Test GO Profile
     ========================
     [1] "MF ontology"
                         Description       GOID Frequency
     12         antioxidant activity GO:0016209       1.0
     9                       binding GO:0005488      75.0
     4            catalytic activity GO:0003824      65.1
     1  electron carrier activity... GO:0009055       3.5
     15 enzyme regulator activity... GO:0030234       1.6
     21 molecular transducer acti... GO:0060089       3.1
     3  nucleic acid binding tran... GO:0001071       2.8
     6  nutrient reservoir activi... GO:0045735       0.5
     2  protein binding transcrip... GO:0000988       0.1
     5             receptor activity GO:0004872       1.2
     7  structural molecule activ... GO:0005198       2.8
     8          transporter activity GO:0005215       8.2
     [1] "BP ontology"
     [1] Description GOID        Frequency
     <0 rows> (or 0-length row.names)
     [1] "CC ontology"
     [1] Description GOID        Frequency
     <0 rows> (or 0-length row.names)

So neither the BP nor the CC ontology shows up.

But when I take a slice of the first 500 rows of this big DataFrame and
compute the profile the same way (any ontology, level=2), I get this:

     In [5]: dft = df[0:500]
     In [6]: rdft = com.convert_to_r_dataframe(dft)
     In [7]: %%R -i rdft
     > profs.ANY <- basicProfile(rdft,idType='GOTermsFrame',onto="ANY",level=2)
     > printProfiles(profs.ANY,percentage=T,aTitle="Test GO Profile")
     Test Profile
     ============
     [1] "MF ontology"
                        Description       GOID Frequency
     9                      binding GO:0005488      77.8
     4           catalytic activity GO:0003824      49.2
     1 electron carrier activity... GO:0009055       3.2
     3 nucleic acid binding tran... GO:0001071       1.6
     7 structural molecule activ... GO:0005198       1.6
     8         transporter activity GO:0005215      12.7
     [1] "BP ontology"
     [1] Description GOID        Frequency
     <0 rows> (or 0-length row.names)
     [1] "CC ontology"
                       Description       GOID Frequency
     3                        cell GO:0005623      93.4
     6               cell junction GO:0030054       3.3
     17                  cell part GO:0044464      93.4
     2        extracellular region GO:0005576       8.2
     9   macromolecular complex... GO:0032991      21.3
     1                    membrane GO:0016020      34.4
     8  membrane-enclosed lumen... GO:0031974       3.3
     15              membrane part GO:0044425      19.7
     4                    nucleoid GO:0009295       1.6
     10                  organelle GO:0043226      75.4
     13             organelle part GO:0044422      21.3
     19                   symplast GO:0055044       3.3

I don't really understand why:
- there are no BP frequencies for either DataFrame, whereas 58802 entries
have the BP ontology in the main frame;
- there are CC frequencies for the short frame but none at all for the
main frame, whereas the short one is just the first part of the big one.

Could the level (2 in this case) explain this big difference?
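
One way to narrow this down might be to compare which ontologies actually occur in the slice and in the full frame, and to build one sub-frame per ontology so that each can be profiled on its own. A minimal pandas sketch on invented rows (the helper `ontology_summary` and the variables `toy`, `head` and `per_onto` are hypothetical names, and the 3-row slice stands in for `df[0:500]`):

```python
import pandas as pd

def ontology_summary(frame):
    """Return {ontology: (row count, distinct gene count)} for a frame."""
    return {
        onto: (len(sub), int(sub["GeneID"].nunique()))
        for onto, sub in frame.groupby("Ontology")
    }

# Toy stand-in for the real frame: same column names, invented genes.
toy = pd.DataFrame({
    "GeneID": ["g1", "g1", "g1", "g1", "g2", "g2"],
    "GOID": ["GO:0043565", "GO:0003964", "GO:0006278",
             "GO:0005840", "GO:0003743", "GO:0006367"],
    "Ontology": ["MF", "MF", "BP", "CC", "MF", "BP"],
})
head = toy.iloc[:3]  # analogous to the 500-row slice

full_summary = ontology_summary(toy)
head_summary = ontology_summary(head)

# Ontologies present in the full frame but absent from the slice.
missing_in_head = sorted(set(full_summary) - set(head_summary))

# One sub-frame per ontology, e.g. to convert and profile each separately.
per_onto = {
    onto: sub.reset_index(drop=True)
    for onto, sub in toy.groupby("Ontology")
}

print(full_summary)
print(head_summary)
print(missing_in_head)
```

Each entry of `per_onto` could then be converted and passed to basicProfile with the matching `onto` value, to see whether BP and CC come back empty on their own as well.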

Thanks a lot,
Arnome.
-- 
"If men define situations as real, they are real in their
consequences."
The Thomas theorem.

Arnaud Mounier
INRA - UMR Agroécologie 1347
CNRS - ERL IPM 6300 (Plant-Microorganism Interaction)
17, rue Sully - BP 86510 - F-21065 Dijon Cedex - France
Work phone : +33 380 693 167 - Fax : +33 380 693 753

https://www6.dijon.inra.fr/umragroecologie/Personnel/IPM/ITA/MOUNIER-Arnaud


