GSoC 2019-The Pinnacle

Posted on: August 24, 2019 at 06:25 AM

We often run into situations where it is hard to save or reproduce results when working with datasets. The main issue is that a dataset gets changed or updated and then either no longer runs with our existing code or no longer produces the same results. For example:

…and many more.

Well, these scenarios are now solved if you use Data Retriever as the dataset manager for your project. Data Retriever now not only helps you clean and install datasets but also stores a dataset in its current state and lets you share the saved state with others as a compressed archive.
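To make the idea of saving a dataset's current state concrete, here is a toy sketch in Python. It is not how Data Retriever implements commits. The function names, the `commit_metadata.json` file and the choice of SHA-256 are all assumptions made purely for illustration.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def commit_snapshot(data_dir, archive_path, message):
    """Zip every file under data_dir and record a message plus a content hash.

    Toy illustration only; the archive layout used by Data Retriever may differ.
    """
    data_dir = Path(data_dir)
    digest = hashlib.sha256()
    # Sort the files so the hash does not depend on directory listing order.
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            content = path.read_bytes()
            digest.update(content)
            archive.write(path, arcname=path.relative_to(data_dir).as_posix())
        # Hypothetical metadata file stored next to the data inside the archive.
        metadata = {"message": message, "hash": digest.hexdigest()}
        archive.writestr("commit_metadata.json", json.dumps(metadata))
    return metadata


def read_metadata(archive_path):
    """Return the message and hash stored in a snapshot archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return json.loads(archive.read("commit_metadata.json"))
```

Anyone who receives the archive can call `read_metadata` to see the commit message, and recompute the hash from the extracted files to check that the data is unchanged.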

Here’s a look at all the functionality that I have added to Data Retriever.

The command to commit a dataset to a compressed archive using retriever:

retriever commit dataset-name -m "commit message" -p path_to_store_archive_file

The command to store all your committed datasets in a directory of your choice, the provenance directory, which you select by setting the environment variable PROVENANCE_DIR:

retriever commit dataset-name -m "commit message"

The command to see the log of committed datasets stored in the provenance directory:

retriever log dataset-name

The command to install a committed dataset from a committed archive file:

retriever install engine-name path_to_archive_file

The command to install datasets stored in the provenance directory:

retriever install engine-name dataset-name --hash-value hash

A corresponding Python interface is also available for all of the above commands.
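If you would rather script the CLI commands above from Python, a minimal sketch could look like the following. It only shells out to the commands shown in this post and is not the Python interface itself. The helper names (`commit_args`, `log_args`, `install_args`, `run`) are hypothetical, and it assumes the `retriever` executable is on your PATH.

```python
import os
import subprocess


def commit_args(dataset, message, path=None):
    """Build the argument list for `retriever commit`."""
    args = ["retriever", "commit", dataset, "-m", message]
    if path is not None:
        # -p stores the archive at an explicit path instead of PROVENANCE_DIR.
        args += ["-p", str(path)]
    return args


def log_args(dataset):
    """Build the argument list for `retriever log`."""
    return ["retriever", "log", dataset]


def install_args(engine, target, hash_value=None):
    """Build the argument list for `retriever install`.

    target is either a dataset name or the path to a committed archive file.
    """
    args = ["retriever", "install", engine, target]
    if hash_value is not None:
        args += ["--hash-value", hash_value]
    return args


def run(args, provenance_dir=None):
    """Run a command, optionally setting PROVENANCE_DIR for that call only."""
    env = dict(os.environ)
    if provenance_dir is not None:
        env["PROVENANCE_DIR"] = str(provenance_dir)
    return subprocess.run(args, env=env, check=True)
```

For example, `run(commit_args("dataset-name", "commit message"), provenance_dir="path_to_provenance_dir")` would commit into the chosen provenance directory without changing the environment of your shell.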

In the future, I would like to extend the provenance features to the Data Retriever packages for R and Julia. I would also like to add more tests for Retriever Provenance. This was my second GSoC working with the awesome folks at Weecology. Thanks to my mentors Ethan White and Andrew Zhang, whose timely inputs at various stages helped make the project possible from design to implementation, and special thanks to my mentor Henry Senyondo for his constant support and guidance over the last one and a half years. It’s kind of funny, but after working on two GSoC projects I feel emotionally attached to Data Retriever.

Here’s a list of all the pull requests merged during the coding period while working on Retriever Provenance:

All of this work was merged to the master branch in pull request #1381.

I also made some commits to retriever-recipes, a new project created by my fellow GSoC’19 participant Harshit Bansal that moved all the retriever dataset scripts to a separate repository for easier maintenance.

Commits on retriever-recipes

See you guys later!!!