GSoC 2019-Retriever Provenance

Posted on: May 26, 2019 at 04:22 PM

The open source journey that started with Mozilla’s addons-server has now led to my selection for Google Summer of Code 2019. This is my second selection in two consecutive years, and the feeling is hard to put into words.

Apart from my knowledge, the thing that has grown most in the year since I was selected for Google Summer of Code 2018 is my confidence. I code with more confidence now: if I have an idea, I know I can execute it. I also hold my work to a higher standard than before. Being a Google Summer of Code participant has proved very beneficial to me.

Coming to my Google Summer of Code 2019 project, I have been selected by the same organisation as last year, i.e. Data Retriever under the NumFOCUS umbrella. The Data Retriever is a package manager for data. It downloads, cleans, and stores publicly available data, so that analysts spend less time cleaning and managing data, and more time analyzing it. Automating this process cuts the time a user needs to get most large datasets up and running by hours, and in some cases days.

This year my project is Retriever Provenance. Currently, it is hard to reproduce a previous installation of a dataset using Data Retriever, because both the dataset and Data Retriever itself get updated over time. This project aims to add provenance capabilities to Data Retriever so that it becomes easier to reproduce previous installations of a dataset at a later date.
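To make the idea of provenance a little more concrete, here is a minimal sketch of one way an installation could be recorded and checked later. It is not Data Retriever's API or the design the project will use; the function names (`file_sha256`, `build_manifest`, `changed_files`) and the manifest fields are hypothetical, chosen only for illustration. The assumption is simply that an installation is a directory of files and that storing a hash per file plus the tool version is enough to notice when something has changed.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_name, data_dir, tool_version):
    """Hypothetical provenance record: files, their hashes, and the tool version."""
    data_dir = Path(data_dir)
    files = {
        p.relative_to(data_dir).as_posix(): file_sha256(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    return {
        "dataset": dataset_name,
        "tool_version": tool_version,
        "created": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def changed_files(manifest, data_dir):
    """List recorded files that are missing or whose hash no longer matches."""
    data_dir = Path(data_dir)
    changed = []
    for rel, recorded in manifest["files"].items():
        path = data_dir / rel
        if not path.is_file() or file_sha256(path) != recorded:
            changed.append(rel)
    return changed
```

The manifest is a plain dictionary, so it could be saved next to the data with `json.dump` and compared against a later download to see whether the dataset has changed since it was first installed.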

As I am already familiar with the codebase, I hope to deliver the project faster. I am looking forward to another great summer of learning with my mentors.