[BioC] rBiopaxParser, Reactome and namespaces
Frank Kramer
frank.kramer at med.uni-goettingen.de
Thu May 23 14:30:54 CEST 2013
Dear Paul,
thank you for the report. I absolutely agree with you: the namespaces of
an OWL (and XML/RDF) file are not fixed and can vary between pathway
database providers. Namespace identifiers are, or at least should be,
removed from instances during parsing. As you noticed, I also strip
namespaces from the input parameters, partly because a query that still
includes them should not match anything in the parsed data, and partly
to add a bit of robustness.
It seems this did not work very well in your case ;-)
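To make the matching problem concrete, here is a minimal sketch in plain R. The helper strip_prefix() is hypothetical and only stands in for the idea of removing a "prefix:" part; it is not the package's own stripns(), and the class values are the ones reported in the quoted message below.

```r
# Hypothetical helper for illustration only; NOT the package's stripns().
# It removes everything up to and including the first colon.
strip_prefix <- function(x) sub("^[^:]*:", "", x)

# Class values as reported from the parsed Homosapiens.owl (prefix kept).
parsed_classes <- c("bp:BiochemicalReaction", "bp:Protein", "bp:BioSource")

# Query stripped, data still prefixed: nothing matches.
mismatch <- tolower(parsed_classes) %in% tolower(strip_prefix("bp:Protein"))

# Prefix stripped on both sides: the Protein entry matches again.
match_ok <- tolower(strip_prefix(parsed_classes)) %in%
  tolower(strip_prefix("bp:Protein"))

print(mismatch)
print(match_ok)
```

The comparison only works when both sides are normalized the same way, which is why the prefixes need to be removed while the file is parsed and not only from the query.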
Unfortunately I could not reproduce your problem:
####### CODE
library(rBiopaxParser)
# The Reactome URLs have changed, so what used to link to biopax2 is now biopax3.
# This keeps the example short; I also tried it with manually downloaded
# OWL files.
file <- downloadBiopaxData(database="reactome", model="reactome", version="biopax2")
biopax <- readBiopax(file, verbose=TRUE)
head(biopax$df)
####### OUTPUT
Found a BioPAX level 3 OWL. Parsing...
[Info Verbose] Parsing Biopax-Model as a data.frame...
(...)
[Info Verbose] Finished! Created a data.frame with 1000689 rows within
only 3591.365 seconds.
> head(biopax$df)
                class                   id property property_attr
1 BiochemicalReaction BiochemicalReaction1     left  rdf:resource
2 BiochemicalReaction BiochemicalReaction1     left  rdf:resource
3 BiochemicalReaction BiochemicalReaction1     left  rdf:resource
4 BiochemicalReaction BiochemicalReaction1    right  rdf:resource
5 BiochemicalReaction BiochemicalReaction1    right  rdf:resource
6 BiochemicalReaction BiochemicalReaction1 eCNumber  rdf:datatype
                      property_attr_value property_value
1                               #Complex1
2                               #Complex2
3                              #Protein12
4                         #SmallMolecule1
5                               #Complex3
6 http://www.w3.org/2001/XMLSchema#string       3.1.3.48
> head(unique(biopax$df$class))
[1] BiochemicalReaction Complex
[3] CellularLocationVocabulary UnificationXref
[5] Protein ProteinReference
33 Levels: BiochemicalReaction BioSource ... UnificationXref
> head(table(biopax$df$property), n=3)
          author cellularLocation          comment
           65131            24758           130936
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.1 bitops_1.0-5 rBiopaxParser_1.0.0
loaded via a namespace (and not attached):
[1] XML_3.96-1.1
####### END
Could you check biopax$namespaces to see whether any namespaces were
detected during parsing? They are stored so that they can be reused if
you later want to write out a new BioPAX OWL file.
Could you also check whether you are using the current release/devel
version of rBiopaxParser?
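These checks could look roughly like the sketch below. It uses a toy stand-in for the parsed object so that it runs on its own; only the names biopax$df, its class and property columns, and biopax$namespaces come from this thread. The contents and the layout of the namespaces element are assumptions, and has_prefix() is a hypothetical helper.

```r
# Toy stand-in for a parsed model; a real object would come from readBiopax().
# The values and the layout of "namespaces" are assumptions for illustration.
biopax <- list(
  df = data.frame(
    class    = c("bp:Protein", "bp:Protein", "bp:BioSource"),
    property = c("bp:comment", "bp:cellularLocation", "bp:name"),
    stringsAsFactors = TRUE
  ),
  namespaces = list(bp = "http://www.biopax.org/release/biopax-level3.owl#")
)

# 1. Which namespaces were recorded during parsing?
str(biopax$namespaces)

# 2. Do any class or property values still carry a "prefix:" part?
#    has_prefix() is a hypothetical helper, not part of the package.
has_prefix <- function(x) grepl("^[^:]+:", as.character(x))
prefixed_classes    <- sum(has_prefix(biopax$df$class))
prefixed_properties <- sum(has_prefix(biopax$df$property))
cat("prefixed classes:   ", prefixed_classes, "\n")
cat("prefixed properties:", prefixed_properties, "\n")

# 3. Which package version is installed? Run this in a session where
#    rBiopaxParser is available:
# packageVersion("rBiopaxParser")
```

A non-zero count in step 2 on a real model would indicate that prefixes survived parsing, which matches the behaviour described in the quoted message.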
Best wishes,
Frank
University Medical Center Göttingen
Department for Medical Statistics
Humboldtallee 32
37073 Göttingen
Germany
Phone: +49 (0) 551 39-10710
Fax: +49 (0) 551 39-4995
http://www.ams.med.uni-goettingen.de/amsneu/kramer-en.html
On 22.05.2013 05:08, Paul Shannon wrote:
> Hi Frank,
>
> I am most happily using the rBiopaxParser package, and your vignette, in order to extract detailed (but topologically simple) interaction data from the latest Reactome "Homosapiens.owl". Your package offers great power and convenience.
>
> However, I run into difficulty with namespaces.
>
> For a simple example, consider this one line from the function listInstances, found in the file R/selectBiopax.R:
>
> sel = sel & (tolower(biopax$df$class) %in% tolower(stripns(class)))
>
> As parsed from Homosapiens.owl, the class column of biopax$df has values like these, always containing a namespace prefix:
>
> head(unique(biopax$df$class))
> "bp:BiochemicalReaction" "bp:Protein"
> "bp:CellularLocationVocabulary" "bp:UnificationXref"
> "bp:ProteinReference" "bp:BioSource"
>
> By stripping the namespace off of "bp:Protein" (the right hand side of the %in% clause) it cannot match the biopax$df$class value, as parsed from the owl file (which preserves the "bp:").
>
> I believe I see similar logic in other places, with these methods specifically encountered so far:
>
> selectInstances
> listPathwayComponents
>
> Namespaces are used with the "property" column as well:
>
> head(table(biopax$df$property), n=3)
> bp:author bp:cellularLocation bp:comment
> 55654 23838 123750
>
> Speaking from the nickel seats, and not claiming to understand all of the implications: perhaps these could be neatly avoided if your readBiopax method could optionally eliminate namespaces when reading in an owl file?
>
> Thanks,
>
> - Paul
>
>