Knowledge Representation and Biomedical Applications
Project 2: Connecting pathway information with SPARQL

In this project, you will carry out an example of a common information need for molecular biology researchers: I'm interested in protein XYZ and its role in pathway Q; what other pathways does this protein participate in? (And what publications might say more about these pathways?)

Learning objectives:
Step-by-step, I'd like you to at least try the following:
It is entirely possible to combine steps 2 & 3, i.e. to create a single large query that finds the proteins that occur in multiple pathways. However, the queries will be simpler if you do the above. Furthermore, there is some value in just doing query #2, and in storing these results for a number of "secondary-level" queries, per step 3. You should feel free to make up your own second-level queries.

For step one, you may want to try a variety of different pathways before settling on the 3 or so that you carry out the queries on. Think ahead -- you'll have to come up with some reasonable or interesting queries for your selected pathways. You may want to do some background reading about the pathways to find out more about them.

Imagine more complex queries that might follow the intersection query: the researcher might wish to know what reactions the protein participates in, and whether it is an input or an output of each reaction. Note that BioPAX reactions have properties "LEFT" and "RIGHT" to indicate inputs and outputs. (But these don't include catalysts or enzymes, which are part of "Control".....)

The choice of pathways you select from Reactome is up to you. My preference is that each student use a different set of pathways, but given the number available at Reactome, this should happen naturally.

Finally, a word of caution about protein names. You may want to consider canonical naming sources.

Information about SPARQL

The SPARQL.org web site is the definitive web site for SPARQL questions. They have a SPARQL processor, but this engine seems to have issues communicating with http://faculty.washington.edu/, so my recommendation is NOT to use this engine, but instead a local one known as the VSparQL query engine, maintained by Todd Detwiler of the Structural Biology group. (VSparQL is a superset of SPARQL, so regular SPARQL queries will work fine here.) Other engines include the Virtuoso engine, as provided by the Bio2RDF project.
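As an aside on the intersection query discussed above (finding proteins that occur in more than one pathway), the underlying logic can be sketched outside SPARQL. The Python below is only an illustration of that logic, not a replacement for the SPARQL queries in this project. The pathway and protein names are hypothetical placeholders, not real Reactome output, and the helper name shared_proteins is made up for this sketch. It also assumes that the same protein carries exactly the same name in every pathway, which is the naming caveat raised above.

```python
from collections import defaultdict

# Hypothetical step-2 results: pathway name -> set of protein names.
# These are placeholders, not real Reactome query output.
pathway_proteins = {
    "pathway_A": {"protein_1", "protein_2", "protein_3"},
    "pathway_B": {"protein_2", "protein_4"},
    "pathway_C": {"protein_2", "protein_3", "protein_5"},
}


def shared_proteins(results, min_pathways=2):
    """Return {protein: sorted list of pathways} for proteins that
    occur in at least min_pathways of the given pathways."""
    membership = defaultdict(set)
    for pathway, proteins in results.items():
        for protein in proteins:
            membership[protein].add(pathway)
    return {
        protein: sorted(pathways)
        for protein, pathways in membership.items()
        if len(pathways) >= min_pathways
    }


shared = shared_proteins(pathway_proteins)
for protein in sorted(shared):
    print(protein, "->", ", ".join(shared[protein]))
```

Inverting the mapping (protein to pathways) rather than intersecting pairs of sets keeps the sketch usable for any number of pathways, and it records which pathways each shared protein belongs to.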
All of these resources allow you to enter and run arbitrary SPARQL queries. Of course, the ultimate resource for SPARQL is also available: the W3C specification document. Although long and technical, this document does have a nice index and some simple examples included.

SPARQL example:

To help you, here is a working SPARQL query that retrieves all of the complexes (a complex is a group of proteins, stuck together) from the muscle contraction pathway from Reactome. As you can see from the "FROM" clause, I have already downloaded the BioPAX OWL file and dumped it onto my UW web page. You should be able to try this out with the VSparQL processor. (This query uses BioPAX level 2.)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT { ?complex rdf:type "mus-contract complex" }

This will create a very simple RDF file where the answers are all of rdf:type "mus-contract complex". For your queries, you will probably want a different CONSTRUCT clause, one that better allows for easy combination across pathways.

In order to use web-based query processors, you'll have to post data sources to the web. In the above, I've posted a copy of the muscle contraction pathway to my own web pages. You will have to do the same for your selected pathways. As UW graduate students (one might argue, as 21st century citizens), you should all have web page creation ability and space reserved under http://students.washington.edu/, so you should be able to dump OWL and RDF files to such a location. In the past, students have also used third-party services like dropbox.com to create web-accessible resources, and this works fine. (In no way am I endorsing or recommending this particular company!) You will also need to take the results from your step 2 queries and re-post these to the web (as RDF files) before you can run step 3 "meta level" queries.

Resources:
Deliverables (due 1pm, Mon, Oct 21st):

There are three deliverables. These may be written as a single Word document, or as multiple files.
All of these should be handed in via the course Catalyst drop box, as usual.

Grading rubric:
Contact the instructor at: gennari@uw.edu