Data Provenance Tools

A wide variety of software tools have been developed to support reproducibility and provenance tracking in computational science.

The Lab Notebook

Science Lab Notebook
[Laboratory Notebook]. (n.d.). Retrieved March, 2019, from https://pixabay.com/

The traditional tool for tracking provenance information in science is the laboratory notebook. In computational science, however, the number of details that must be recorded is so large that writing them down by hand is tedious and error-prone.

What is a Scientific Workflow?

A scientific workflow management system provides a framework for capturing, executing, processing and monitoring a sequence of defined tasks in a scientific application.

Scientific Workflow
[Scientific Workflow]. (n.d.). Retrieved March, 2019, from https://pixabay.com/

Provenance is a critical concept in scientific workflows: it allows scientists to understand the origin of their results, to repeat their experiments, and to validate the processes that were used to derive data. In a workflow management system, a workflow can be designed graphically by chaining together a sequence of tasks, where each task may take input data from previous tasks and from external data sources. For a workflow to be reproducible, provenance information must be recorded indicating where the data originated, how it was altered, and which components were used. This allows other scientists to rerun the experiment and confirm the results [1].
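The idea of recording provenance while a chain of tasks runs can be illustrated with a minimal Python sketch. It is not taken from any particular workflow system: the function names (`fingerprint`, `run_workflow`, `drop_missing`, `mean`), the log fields, and the data source name `sensor_readings.csv` are all assumptions made for illustration. Each step records where its input came from, which component ran, and a hash of the data before and after.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(value):
    """Return a short SHA-256 digest of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(source_name, data, tasks):
    """Run tasks in order and return (result, provenance_log).

    Each log entry records where the input came from, which component
    ran, and fingerprints of the data before and after the step.
    """
    log = []
    origin = source_name
    for task in tasks:
        output = task(data)
        log.append({
            "component": task.__name__,
            "input_origin": origin,
            "input_hash": fingerprint(data),
            "output_hash": fingerprint(output),
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        # The output of this task becomes the input of the next one.
        data, origin = output, task.__name__
    return data, log


# Two hypothetical tasks chained into a workflow.
def drop_missing(values):
    return [v for v in values if v is not None]


def mean(values):
    return sum(values) / len(values)


result, provenance = run_workflow(
    "sensor_readings.csv",  # hypothetical external data source
    [1.0, None, 2.0, 3.0],
    [drop_missing, mean],
)
```

Because the output hash of one step matches the input hash of the next, a reader of the log can check that the recorded chain is unbroken. A real workflow system would typically also record software versions and parameters.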

Advantages of workflows include:

  • providing an easy-to-use environment for scientists to create their own workflows
  • providing interactive tools that let scientists execute their workflows and view their results
  • simplifying the process of sharing and reusing workflows among scientists
  • enabling scientists to track the provenance of the workflow execution results
  • keeping track of provenance information in complex, data-intensive research
  • reducing the complexity of manually tracking input data sets and order of task execution
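As a small illustration of the last point, the order of task execution can be derived from declared dependencies rather than tracked by hand. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the task names are hypothetical and the example is not drawn from any specific workflow system.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "fit_model": {"clean_data"},
    "plot_results": {"clean_data", "fit_model"},
}

# static_order() yields every task after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

The scientist states only which tasks depend on which; the ordering follows from that description, and a circular dependency raises `graphlib.CycleError` instead of silently producing a wrong order.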

References
[1] Simmhan, Y. L., Plale, B., & Gannon, D. (n.d.). A Survey of Data Provenance (Vol. 34, pp. 31-36, Publication No. 3). Bloomington, IN: Computer Science Department, Indiana University.