[BioC] building enzyme centric metabolic network
Saroj Mohapatra
smohapat at vbi.vt.edu
Mon Apr 20 19:14:16 CEST 2009
Hi Martin:
Thanks a lot. This _is_ a much better way of creating the incidence matrix.
Saroj
Martin Morgan wrote:
> Hi Saroj --
>
> All those 'for' loops and the idea that you'd have to 'monitor
> progress' make me think there's a better way to do this. Without
> commenting on the merits of the approach, if the goal is to get an
> 'incidence matrix' listing which genes (rows) appear in which pathways
> (columns) I might
>
> load the library
>
> library(org.Mm.eg.db)
>
> get a data frame of gene and path ids (probably there's a more
> AnnotationDbi way of doing this...)
>
> tbl <- toTable(org.Mm.egPATH)
>
> get the unique gene names
>
> ugenes <- unique(tbl$gene_id)
>
> For each path_id, create a logical vector with 'TRUE' when a gene_id
> appears in the path. This is a little tricky here; get("%in%") gets
> the "%in%" function, usually one could just use a function name;
> x=ugenes provides the 'x' (first) argument to %in%, forcing the genes
> in each pathway into the 'table' (second) argument to %in%, ordering
> ugenes %in% genes means that every pathway ends up with an identical
> length and order logical vector
>
> occur <- with(tbl, tapply(gene_id, path_id, get("%in%"), x=ugenes))
>
> convert the result into a matrix by unlisting the tapply result
>
> m <- matrix(unlist(occur), ncol=dim(occur),
> dimnames=list(gene=ugenes, path=names(occur)))
>
> and voila
>
>
>> m[1:5,1:10]
>>
> path
> gene 00010 00020 00030 00031 00040 00051 00052 00053 00061 00062
> 11298 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> 11303 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> 11304 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> 11305 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> 11306 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> some sanity checks
>
>
>> dim(tbl) ## 11039 gene / path combinations
>>
> [1] 11039 2
>
>> sum(m)
>>
> [1] 11039
>
>
>> colSums(m)[1:5] ## number of genes in the first 5 pathways
>>
> 00010 00020 00030 00031 00040
> 54 27 25 2 22
>
>> nrow(toTable(revmap(org.Mm.egPATH)["00010"]))
>>
> [1] 54
>
>> nrow(toTable(revmap(org.Mm.egPATH)["00020"]))
>>
> [1] 27
>
>
>> max(rowSums(m)) # maximum paths a gene belongs to
>>
> [1] 31
>
>> which.max(rowSums(m)) # gene 26413, row 2354
>>
> 26413
> 2354
>
>> nrow(toTable(org.Mm.egPATH["26413"]))
>>
> [1] 31
>
> It would be possible to change the structure of m into a more
> space-efficient one, but as rowSums and colSums indicate above it's
> very useuful in its current form.
>
> Martin
>
>
> Saroj K Mohapatra <smohapat at vbi.vt.edu> writes:
>
>
>> Hi Anupam:
>>
>> Using your second criterion, i.e., two enzymes are linked if they are in the same pathway, one strategy would be to use the information from org. packages for creating a network for each species.
>>
>> # For mouse, use the package org.Mm.eg.db to get the list of all entrez ids, associated KEGG id
>> library("org.Mm.eg.db")
>> egKEGGids = as.list(org.Mm.egPATH)
>>
>> # many entrez ids have no KEGG id; discard these
>> sum(is.na(egKEGGids))
>> egKEGGids = egKEGGids[!is.na(egKEGGids)]
>>
>> # For all possible entrez id pairs, find a KEGG id; if not available, drop that pair
>> keggnet = matrix(ncol=3, nrow=0) # Empty matrix with 3 columns
>> colnames(keggnet) = c("EG1", "EG2", "KEGG")
>> mycounter = 1 # to count the current entrez id pair
>> numEg = length(egKEGGids) # number of entrez ids
>> for(i in 1:(numEg-1)) {
>> for(j in (i+1):numEg) {
>> eg1 = names(egKEGGids)[i]
>> eg2 = names(egKEGGids)[j]
>> kegg1 = as.character(unlist(egKEGGids[i]))
>> # kegg id for the first gene
>> kegg2 = as.character(unlist(egKEGGids[j]))
>> # kegg id for the 2nd gene
>> common = intersect(kegg1, kegg2) # common between two
>> numCommon = length(common) # how many shared
>> if(numCommon>0) {
>> for(k in 1:numCommon) {
>> keggnet = rbind(keggnet, c(eg1, eg2, common[k]))
>> cat(i, "\t", j, "\t", numEg,"\t | ")
>> cat(eg1,"\t", eg2, "\t", common[k], "\n")
>> # monitor progress
>> mycounter = mycounter+1 # move the counter
>> }
>> }
>> }
>> }
>>
>> # now write the data to a file for future use
>> write.table(keggnet, file="mouse_KEGG_network.txt",
>> col.names=T, row.names=F, sep="\t", quote=F)
>>
>> The resulting file has three columns for two entrez ids (EG1, EG2) and KEGG (ID) shared by the two entrez ids. There might be a faster way of using it though (in stead of the for loops). Also, there might already be such networks available within bioconductor.
>>
>> I imagine that you would need one such file for each species of your interest (to study "evolutionary aspects").
>>
>> Best,
>>
>> Saroj
>>
>> ----- Original Message -----
>> From: "anupam sinha" <anupam.contact at gmail.com>
>> To: Bioconductor at stat.math.ethz.ch
>> Sent: Sunday, April 19, 2009 4:35:12 AM GMT -05:00 US/Canada Eastern
>> Subject: [BioC] building enzyme centric metabolic network
>>
>> Hi all,
>> I am working on the evolutionary aspects of metabolic net
>> works(enzyme centric). This network will consists of nodes that are enzymes
>> and any two enzymes are linked if they share a metabolite (or in the same
>> pathway). Can anyone suggest me a way to build these networks using
>> KEGGgraph or KEGGSOAP packages ? Thanks in advance.
>>
>> Regards,
>>
>> Anupam Sinha
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
More information about the Bioconductor
mailing list