[BioC] building enzyme centric metabolic network

Mon Apr 20 18:46:49 CEST 2009

Hi Saroj --

All those 'for' loops and the idea that you'd have to 'monitor
progress' make me think there's a better way to do this. Without
commenting on the merits of the approach, if the goal is to get an
'incidence matrix' listing which genes (rows) appear in which pathways
(columns) I might

load the library

  library(org.Mm.eg.db)

get a data frame of gene and path ids (probably there's a more
AnnotationDbi way of doing this...)

  tbl <- toTable(org.Mm.egPATH)

get the unique gene names

  ugenes <- unique(tbl$gene_id)

For each path_id, create a logical vector with 'TRUE' when a gene_id
appears in the path. This is a little tricky here; get("%in%") gets
the "%in%" function, usually one could just use a function name;
x=ugenes provides the 'x' (first) argument to %in%, forcing the genes
in each pathway into the 'table' (second) argument to %in%, ordering
ugenes %in% genes means that every pathway ends up with an identical
length and order logical vector

  occur <- with(tbl, tapply(gene_id, path_id, get("%in%"), x=ugenes))

convert the result into a matrix by unlisting the tapply result

  m <- matrix(unlist(occur), ncol=dim(occur),
              dimnames=list(gene=ugenes, path=names(occur)))

and voila             

> m[1:5,1:10]
       path
gene    00010 00020 00030 00031 00040 00051 00052 00053 00061 00062
  11298 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  11303 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  11304 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  11305 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  11306 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

some sanity checks

> dim(tbl) ## 11039 gene / path combinations
[1] 11039     2
> sum(m)
[1] 11039

> colSums(m)[1:5] ## number of genes in the first 5 pathways
00010 00020 00030 00031 00040 
   54    27    25     2    22 
> nrow(toTable(revmap(org.Mm.egPATH)["00010"]))
[1] 54
> nrow(toTable(revmap(org.Mm.egPATH)["00020"]))
[1] 27

> max(rowSums(m)) # maximum paths a gene belongs to
[1] 31
> which.max(rowSums(m))  # gene 26413, row 2354
26413 
 2354 
> nrow(toTable(org.Mm.egPATH["26413"]))
[1] 31

It would be possible to change the structure of m into a more
space-efficient one, but as rowSums and colSums indicate above it's
very useuful in its current form.

Martin

Saroj K Mohapatra <smohapat at vbi.vt.edu> writes:

> Hi Anupam:
>
> Using your second criterion, i.e., two enzymes are linked if they are in the same pathway, one strategy would be to use the information from org. packages for creating a network for each species.
>
> # For mouse, use the package org.Mm.eg.db to get the list of all entrez ids, associated KEGG id
> library("org.Mm.eg.db")
> egKEGGids = as.list(org.Mm.egPATH)
>
> # many entrez ids have no KEGG id; discard these
> sum(is.na(egKEGGids))
> egKEGGids = egKEGGids[!is.na(egKEGGids)]
>
> # For all possible entrez id pairs, find a KEGG id; if not available, drop that pair
> keggnet = matrix(ncol=3, nrow=0) # Empty matrix with 3 columns
> colnames(keggnet) = c("EG1", "EG2", "KEGG")
> mycounter = 1 # to count the current entrez id pair
> numEg = length(egKEGGids) # number of entrez ids
> for(i in 1:(numEg-1)) {
>   for(j in (i+1):numEg) {
>     eg1 = names(egKEGGids)[i]
>     eg2 = names(egKEGGids)[j]
>     kegg1 = as.character(unlist(egKEGGids[i])) 
> 		# kegg id for the first gene
>     kegg2 = as.character(unlist(egKEGGids[j])) 
> 		# kegg id for the 2nd gene
>     common = intersect(kegg1, kegg2) # common between two
>     numCommon = length(common) # how many shared
>     if(numCommon>0) {
>       for(k in 1:numCommon) {
> 	keggnet = rbind(keggnet, c(eg1, eg2, common[k]))
> 	cat(i,  "\t", j,   "\t", numEg,"\t | ") 
> 	cat(eg1,"\t", eg2, "\t", common[k], "\n")
> 		# monitor progress
> 	mycounter = mycounter+1 # move the counter
>       }
>     }
>   }
> }
>
> # now write the data to a file for future use
> write.table(keggnet, file="mouse_KEGG_network.txt", 
> 	col.names=T, row.names=F, sep="\t", quote=F)
>
> The resulting file has three columns for two entrez ids (EG1, EG2) and KEGG (ID) shared by the two entrez ids. There might be a faster way of using it though (in stead of the for loops). Also, there might already be such networks available within bioconductor. 
>
> I imagine that you would need one such file for each species of your interest (to study "evolutionary aspects").
>
> Best,
>
> Saroj
>
> ----- Original Message -----
> From: "anupam sinha" <anupam.contact at gmail.com>
> To: Bioconductor at stat.math.ethz.ch
> Sent: Sunday, April 19, 2009 4:35:12 AM GMT -05:00 US/Canada Eastern
> Subject: [BioC] building enzyme centric metabolic network
>
> Hi all,
>          I am working on the evolutionary aspects of metabolic net
> works(enzyme centric). This network will consists of nodes that are enzymes
> and any two enzymes are linked if they share a metabolite (or in the same
> pathway). Can anyone suggest me a way to build these networks using
> KEGGgraph or KEGGSOAP packages ? Thanks in advance.
>
> Regards,
>
> Anupam Sinha
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793