MEBI 550 Project 2

Knowledge Representation and Biomedical Applications
MEBI 550, Fall 2013

Project 2: Connecting pathway information with SPARQL
Due: Mon, Oct 21

In this project, you will work through an example of a common information need for molecular biology researchers: "I'm interested in protein XYZ and its role in pathway Q; what other pathways does this protein participate in? (And what publications might say more about these pathways?)"

Learning objectives:

Become facile with opening and browsing OWL ontologies in Protege 4.3
Learn some biology (or extend what you know) about pathways
Execute real SPARQL queries.
Connect pathways!

Step-by-step, I'd like you to at least try the following:

1. Retrieve several (at least three) pathways from the Reactome Pathway Browser. Pick pathways that have some intersection. Use "BioPAX" as the format, which will prompt you to save them as OWL files. (BioPAX level 3 is preferred, but level 2 is also okay. Do not try to mix levels!)
2. Run a SPARQL query on each to extract the proteins involved, where "protein" is defined exactly by membership in the BioPAX class "protein". The output of a SPARQL CONSTRUCT query is another RDF file. (Note that you can check your work by simply looking at the pathway.owl file inside Protege.)
3. Then, run a second-level SPARQL query to find the intersections, i.e., where a particular protein appears in more than one Reactome pathway. There may not be many of these, depending on the selected pathways.

It is entirely possible to combine steps 2 & 3, i.e., to create a single large query that finds the proteins that occur in multiple pathways. However, the queries will be simpler if you follow the separate steps above. Furthermore, there is value in doing step 2 on its own, and in storing its results for a number of "second-level" queries, per step 3. You should feel free to make up your own second-level queries.
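As a rough sketch of the combined approach, the query below looks for protein names that occur in two different pathway files at once. It assumes BioPAX level 3 (class bp:Protein, property bp:displayName), an engine that supports FROM NAMED and GRAPH, and that the same protein carries the same display name in both files. The example.org URLs are placeholders for wherever you post your own files.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bp:  <http://www.biopax.org/release/biopax-level3.owl#>

# Placeholder URLs: substitute the locations of your own pathway files.
SELECT DISTINCT ?name ?g1 ?g2
FROM NAMED <http://example.org/pathwayA.owl>
FROM NAMED <http://example.org/pathwayB.owl>
WHERE {
  # A protein in one pathway file ...
  GRAPH ?g1 { ?p1 rdf:type bp:Protein ; bp:displayName ?name . }
  # ... and a protein with the same display name in another file.
  GRAPH ?g2 { ?p2 rdf:type bp:Protein ; bp:displayName ?name . }
  # Keep each pair of files once, and never pair a file with itself.
  FILTER (str(?g1) < str(?g2))
}
```

Matching on names rather than on node identifiers is a deliberate choice in this sketch: the identifier given to a protein is not guaranteed to be the same from one downloaded file to the next.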

For step 1, you may want to try a variety of different pathways before settling on the 3 or so that you carry out the queries on. Think ahead -- you'll have to come up with some reasonable or interesting queries for your selected pathways. You may want to do some background reading about the pathways to find out more about them. Imagine more complex queries that might follow the intersection query: the researcher might wish to know which reactions the protein participates in, and whether it is an input or an output of each reaction. Note that BioPAX reactions have properties "LEFT" and "RIGHT" to indicate inputs and outputs. (But these don't include catalysts or enzymes, which are part of "Control".)
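A follow-up query of this kind might look like the sketch below, which lists the reactions that have a protein directly on their left or right side. It assumes BioPAX level 3, where the properties are spelled bp:left and bp:right and point straight at the participating entity; in level 2 they are spelled bp:LEFT and bp:RIGHT and reach the protein through an intermediate participant node, so the pattern would need an extra hop. The file URL is a placeholder.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bp:  <http://www.biopax.org/release/biopax-level3.owl#>

# Placeholder URL: substitute the location of your own pathway file.
SELECT ?reaction ?role ?name
FROM <http://example.org/pathwayA.owl>
WHERE {
  ?protein rdf:type bp:Protein ;
           bp:displayName ?name .
  # ?role is bound to whichever property links the reaction to the protein.
  ?reaction ?role ?protein .
  # Keep only the left and right sides of a reaction.
  FILTER (?role = bp:left || ?role = bp:right)
}
```

This only finds proteins that are listed directly as participants. A protein that takes part in a reaction as a member of a complex will not show up without also walking through the complex's components.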

The choice of pathways you select from Reactome is up to you. My preference is that each student use a different set of pathways, but given the number available at Reactome, this should happen naturally. Finally, a word of caution about protein names: the same protein may appear under different names, so you may want to consider canonical naming sources.
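One way to get at canonical identifiers is to follow each protein's cross-references to an external database. The sketch below assumes BioPAX level 3, where a protein points to a shared reference object via bp:entityReference, and that object carries bp:xref entries with a database name (bp:db) and an identifier (bp:id). Whether your particular download fills these in, and under which database names, is something to check in Protege. The file URL is a placeholder.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bp:  <http://www.biopax.org/release/biopax-level3.owl#>

# Placeholder URL: substitute the location of your own pathway file.
SELECT DISTINCT ?name ?db ?id
FROM <http://example.org/pathwayA.owl>
WHERE {
  ?protein rdf:type bp:Protein ;
           bp:displayName ?name ;
           bp:entityReference ?ref .
  # Unification xrefs are meant to identify the same entity in another database.
  ?ref  bp:xref ?xref .
  ?xref rdf:type bp:UnificationXref ;
        bp:db ?db ;
        bp:id ?id .
}
```

If the identifiers turn out to be reliable for your pathways, they make a sturdier key for intersection queries than display names do.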

Information about SPARQL

The SPARQL.org web site is the definitive web site for SPARQL questions. They have a SPARQL processor, but this engine seems to have issues communicating with http://faculty.washington.edu/, so my recommendation is NOT to use this engine, but instead a local one known as the VSPARQL query engine, maintained by Todd Detwiler of the Structural Biology group. (VSPARQL is a superset of SPARQL, so regular SPARQL queries will work fine here.) Other engines include the Virtuoso engine, as provided by the Bio2RDF project. All of these resources allow you to enter and run arbitrary SPARQL queries.

Of course, the ultimate resource for SPARQL is also available: the W3C specification document. Although long and technical, this document does have a nice index and some simple examples included.

SPARQL example:

To help you, here is a working SPARQL query that retrieves all of the complexes (groups of proteins, stuck together) from the muscle contraction pathway from Reactome. As you can see from the "FROM" clause, I have already downloaded the BioPAX OWL file and put it on my UW web page. You should be able to try this out with the VSPARQL processor. (This query uses BioPAX level 2.)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bp: <http://www.biopax.org/release/biopax-level2.owl#>

CONSTRUCT { ?complex rdf:type "mus-contract complex" }
FROM <http://faculty.washington.edu/gennari/Reactome_397014.owl>
WHERE { ?complex rdf:type bp:complex }

This will create a very simple RDF file where the answers are all of rdf:type "mus-contract complex". For your queries, you will probably want a different CONSTRUCT clause, one that better allows for easy combination across pathways.
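As one possible shape for such a CONSTRUCT clause, the sketch below extracts proteins rather than complexes and records each one's name together with a label for the pathway it came from. The ex: namespace and its terms (ex:PathwayProtein, ex:name, ex:inPathway) are made up for illustration and are not part of BioPAX; the query also assumes that level 2 proteins carry a bp:NAME value in this file.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bp:  <http://www.biopax.org/release/biopax-level2.owl#>
# Hypothetical vocabulary for result files; any namespace you control will do.
PREFIX ex:  <http://example.org/mebi550#>

CONSTRUCT {
  ?protein rdf:type ex:PathwayProtein ;
           ex:name ?name ;
           ex:inPathway "muscle contraction" .
}
FROM <http://faculty.washington.edu/gennari/Reactome_397014.owl>
WHERE {
  # Level 2 class and property names: bp:protein and bp:NAME.
  ?protein rdf:type bp:protein ;
           bp:NAME ?name .
}
```

Using the same predicates in every result file, with only the pathway label changing, means a later query can treat all of the result files alike.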

In order to use web-based query processors, you'll have to post data sources to the web. In the above, I've posted a copy of the muscle contraction pathway to my own web pages. You will have to do the same for your selected pathways. As UW graduate students (one might argue, as 21st century citizens), you should all have web page creation ability and space reserved under http://students.washington.edu/, so you should be able to put OWL and RDF files at such a location. In the past, students have also used third-party services like dropbox.com to create web-accessible resources, and this works fine. (In no way am I endorsing or recommending this particular company!) You will also need to take the results from your step 2 queries and re-post these to the web (as RDF files) before you can run step 3 "meta level" queries.
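Once the step 2 results are posted, a step 3 query can read several of them at once by listing more than one FROM clause. The sketch below assumes each result file describes every protein with a name and a pathway label, using a made-up ex: vocabulary (ex:name, ex:inPathway), and that node identifiers do not collide across the files. The URLs are placeholders.

```sparql
# Hypothetical vocabulary, assumed to match what the step 2 queries produced.
PREFIX ex: <http://example.org/mebi550#>

# Placeholder URLs: substitute the locations of your own re-posted result files.
SELECT DISTINCT ?name ?pathway1 ?pathway2
FROM <http://example.org/results/pathwayA-proteins.rdf>
FROM <http://example.org/results/pathwayB-proteins.rdf>
FROM <http://example.org/results/pathwayC-proteins.rdf>
WHERE {
  # Two entries with the same protein name ...
  ?p1 ex:name ?name ; ex:inPathway ?pathway1 .
  ?p2 ex:name ?name ; ex:inPathway ?pathway2 .
  # ... that come from different pathways; the ordering lists each pair once.
  FILTER (str(?pathway1) < str(?pathway2))
}
```

Multiple FROM clauses are merged into a single graph before the pattern is matched, which is why the pathway label has to be stored in the data itself rather than inferred from the file.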

Resources:

Protege 4.3. Please download Protege 4.3 onto your own machine (Mac or PC). Once you get the pathway files from Reactome, you'll want to "open OWL ontology" from Protege 4.3. You may also play with WebProtege, but I think you'll eventually want to have a local installation.
Reactome (of course). Note that the BioPAX download button is under the "downloads" tab on the bottom panel when looking at a pathway page. There is also a "Protege" download, but this is an OWL download, and I suspect it is NOT BioPAX.
BioPAX. There is lots of material here: specifications and documentation. For example, you can download the base ontology for level 3 or level 2 and look at it without any Reactome data. (You can also use WebProtege to look at them.)

Deliverables (due 1pm, Mon, Oct 21st):

There are three deliverables. These may be written as a single Word document, or as multiple files.

Technical material. Begin by providing a brief rationale for your questions, then list each query in English, and finally provide a SPARQL listing of the queries you ran. Remember, at least one query must ask for information across multiple pathways or multiple result files, i.e., a "second-level" query that operates on the results of the first-level queries. You should probably include some results, but this is not required (and don't include pages and pages of results!).
Essay. Similar to project 1, provide a 1-2 page essay. This should begin with the domain -- which pathways you selected and why, as well as an analysis of the results you may have retrieved. Then, also include relevant comments / thoughts / frustrations about the technologies you used: BioPAX, Protege and the SPARQL engine(s) used. You may also want to reflect more broadly on the role of informatics technologies for bio-researchers. Of course, you may include self-reflective material, but note the next deliverable.
Self-grading. Using the rubric below, provide some self-grading material. For each row in the rubric, indicate which of the four categories you think you achieved. Optionally, include a brief paragraph for comments and thoughts on your own performance. How can you strive to improve?

All of these should be handed in via the course Catalyst drop box, as usual.

Grading rubric:

	Less than 3.0	3.0 -- 3.2	3.3 -- 3.7	3.8 -- 4.0
Queries	Queries do not work, or do not meet requirements (3 or more pathways; queries that build from other queries)	Queries meet requirements. Basic SPARQL syntax is mastered.	Queries show the student pursuing an interesting / imaginative question and using SPARQL appropriately.	Queries are sophisticated, and go beyond the requirements.
Reactome & Biology	Student misunderstands basic cell biology / pathway information.	The student learned sufficient biology and Reactome knowledge to pose appropriate queries.	Relative to their background / starting knowledge, the student learned a fair bit about the domain.	The student demonstrates impressive biological and pathway knowledge.
BioPAX	BioPAX terms are misused.	Evidence the student has used Protege successfully to understand the BioPAX ontology.	The student's queries leverage and demonstrate detailed knowledge of BioPAX.
Essay	Essay has organizational problems, or many syntax and sentence-level problems.	Essay is clear, well-organized. Shows good self-reflection of what was learned.	Essay is very well-written. Demonstrates significant learning about the domain, and succinctly and clearly explains the student's approach to questions.	Essay is sparkling and insightful.

Last Updated:
Oct, '12

Contact the instructor at: gennari@uw.edu