For my second independent project, I chose to create a model predicting the lab of origin for engineered DNA, using a dataset provided by DrivenData.org as part of a modeling contest. I chose this dataset partly because I wanted to challenge myself - simply wrangling the dataset into a usable format was beyond my comprehension when I first started - but mostly because I already had an interest in the field of biosecurity and was excited to figure out how I might one day apply my beginner data science skills to a challenging problem that actually matters quite a lot.
You see, the technology that has allowed humans to engineer DNA and thereby solve problems has also given us the ability to create new problems, most of which we are unprepared to handle. Part of crafting solutions to the problem of harmful DNA engineering is the attribution problem: being able to identify the source of engineered DNA if it's discovered in the wild. As DrivenData puts it, "Reducing anonymity can encourage more responsible behavior within scientific and entrepreneurial communities, without stifling innovation."
I am grateful to AltLabs and their partners for sponsoring this contest and its bounty, and I owe a personal thanks to Christine Chung for writing the getting-started guide, which was indispensable to my ability to tackle this project. I had picked out a backup plan for this unit's project, predicting the spread of dengue fever in Latin America, and Ms. Chung's writing is the only reason I did not have to fall back on it.
Ultimately, my algorithm placed the true lab of origin among its ten likeliest guesses (out of 1,314 labs in total) 86% of the time on the public test data and 82% of the time on the private test data. This is a huge improvement over the benchmark top-ten accuracy of 33%, and it put me in the top 10% of all competitors, which I think is remarkable given that I had only been studying data science for 7 weeks and had first created a virtual environment only a few days before beginning this project. Furthermore, because my model beat the BLAST benchmark accuracy of 75.6%, I was invited to participate in the Innovation Track.
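For readers unfamiliar with the metric, here is a minimal sketch of how a top-ten accuracy like the one above can be computed. It assumes you have a matrix of predicted scores with one row per sequence and one column per lab; the function name `top_k_accuracy` and the toy numbers are made up for illustration and are not the contest's scoring code.

```python
import numpy as np

def top_k_accuracy(probs, y_true, k=10):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    probs = np.asarray(probs)
    y_true = np.asarray(y_true)
    # Column indices of the k largest scores in each row (unordered within the k).
    top_k = np.argpartition(probs, -k, axis=1)[:, -k:]
    # A row counts as a hit if its true class index appears among those k columns.
    hits = (top_k == y_true[:, None]).any(axis=1)
    return hits.mean()

# Toy example (hypothetical data): 3 sequences, 4 labs, k=2 instead of 10.
toy_probs = [
    [0.10, 0.20, 0.30, 0.40],  # top 2 labs: 2 and 3
    [0.40, 0.30, 0.20, 0.10],  # top 2 labs: 0 and 1
    [0.25, 0.35, 0.30, 0.10],  # top 2 labs: 1 and 2
]
toy_labels = [2, 1, 0]  # the third true lab (0) is not in its row's top 2
toy_score = top_k_accuracy(toy_probs, toy_labels, k=2)
print(toy_score)
```

In the toy example two of the three true labs land in the top two guesses, so the score is 2/3. With 1,314 labs and k=10 the logic is the same, only the matrix is wider.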
However, I think my relative success says more about the power and efficiency of the XGBoost algorithm, the work put into it by its creator Tianqi Chen, and the incredible minds that contribute every year to open source coding libraries than it does about my own skill. Those libraries substantially reduce the slope of the learning curve for newbies like me.
My best score was achieved simply by using XGBoost "out of the box," applying it to the entire training feature matrix (without a train-validation split). The only parameter I changed was max_delta_step, which I set to 1 in an attempt to account for the heavily imbalanced classes (i.e. the DNA sequences were not distributed uniformly amongst the labs; rather, some labs accounted for disproportionately large numbers of DNA sequences and other labs very few).
In other words, even though I only had a few days to enter a competition that had been running for months, getting to the 90th percentile of all competitors was simply a matter of knowing about a particular algorithm and knowing how to apply it. What is going on here? These competitors are already filtered for competence (the ability to follow along with the starter instructions) and interest (caring enough about the topic or the money to enter in the first place). So what should I take away from this surprising result?
What went well:
- Picking something I care about
- Having an explanation of how to get started
- Knowledge of XGBoost, and knowing how to read documentation so that I could account for imbalanced classes
Lessons I learned and ways to improve:
- Working with large datasets with lots of columns takes time and processing power. (Duh!)
- Accordingly, if you want to learn, it helps to do lots of iterating, so choosing a large dataset on which a model takes a long time to fit is probably not the best way to learn.
- When tweaking your model, it's good to iterate in a logical manner, planning your next couple of steps based on what each new output tells you, however...
- ...you risk not getting the benefits of this if you're not taking good notes. This can be in the form of commenting your code and/or naming your variables in ways that make it easy to keep track of what you're doing. This goes double if you're using multiple computers to test slightly different versions of the same code.
- Along those same lines, if you want the ability to create visualizations of earlier model versions, or you want to retain the option to easily go back and compare bits and pieces of model versions to each other, you should instantiate a new model object, under a fresh name, each time you refit.
- Looking ahead, I probably could have made my model slightly better if I'd had more time to experiment, but I expect substantial improvements would require more advanced data science skills and/or knowledge of molecular biology to facilitate the engineering of new features.
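The lesson about giving every refit its own model object can be sketched as follows. This is only an illustration: it uses scikit-learn's `DecisionTreeClassifier` on synthetic data as a quick stand-in for whatever model you are tuning, and the experiment names and settings are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hypothetical experiment log: a descriptive name for each set of settings.
experiments = {
    "tree_depth1": {"max_depth": 1},
    "tree_depth3": {"max_depth": 3},
}

fitted = {}
for name, params in experiments.items():
    # Create a new object for every run. Calling .fit() again on an
    # existing estimator would overwrite the earlier fit in place.
    model = DecisionTreeClassifier(random_state=0, **params)
    fitted[name] = model.fit(X, y)

# Every version is still available for comparison or plotting later.
scores = {name: m.score(X, y) for name, m in fitted.items()}
print(scores)
```

Keeping the fitted objects in a dictionary keyed by a descriptive name doubles as a form of note-taking, since the name records what was changed in each version.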
Collaboration (or building on the work of other smart people), plus a tiny amount of expertise, can get you really, really far, which leads me to a point I think a lot of forum members will be familiar with:
Given a combination of passion, willingness to learn, and rigorous/systematic thinking, it can be easy to do better than the average person at data modeling, much like it's easy to do better than the average person at picking effective charities (or many other goals members of the EA community care about).
...However, to do exceptionally well (to be not just better than most people, but the best; OR, to Actually Fix a problem), you also need time, resources, and domain knowledge.