For my second independent project, I chose to create a model predicting the lab of origin for engineered DNA, using a dataset provided by DrivenData.org as part of a modeling contest. I chose this dataset partly because I wanted to challenge myself - simply wrangling the dataset into a usable format was beyond my comprehension when I first started - but mostly because I already had an interest in the field of biosecurity and was excited to figure out how I might one day apply my beginner data science skills to a challenging problem that actually matters quite a lot.
You see, the technology that has allowed humans to engineer DNA and thereby solve problems has also given us the ability to create new problems, most of which we are unprepared to handle. Part of crafting solutions to the problem of harmful DNA engineering is the attribution problem: being able to identify the source of engineered DNA if it's discovered in the wild. As DrivenData puts it, "Reducing anonymity can encourage more responsible behavior within scientific and entrepreneurial communities, without stifling innovation."
I feel grateful to AltLabs and their partners for sponsoring this contest and its bounty, and I owe personal thanks to Christine Chung for writing the guide to getting started, which was indispensable to my ability to tackle this project. I had picked out a backup plan for this unit's project, predicting the spread of dengue fever in Latin America, but I did not have to switch to it, thanks entirely to Ms. Chung's writing.
Ultimately, the true lab of origin was among my algorithm's ten likeliest candidates (out of 1,314 labs in total) 86% of the time on the public test data and 82% of the time on the private test data. This is a huge improvement over the benchmark top-ten accuracy of 33%, and it placed me in the top 10% of all competitors, which I think is remarkable given that I had been studying data science for only 7 weeks and had created my first virtual environment only a few days before beginning this project. Furthermore, because my model beat the BLAST benchmark accuracy of 75.6%, I was invited to participate in the Innovation Track.
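For readers unfamiliar with the metric, here is a minimal sketch of how a top-ten (more generally, top-k) accuracy can be computed. It is not the contest's scoring code; the function name `top_k_accuracy`, the dict-per-row score format, and the toy lab names are all assumptions made for illustration.

```python
def top_k_accuracy(true_labels, class_scores, k=10):
    """Fraction of rows whose true label is among the k highest-scoring classes.

    class_scores holds one dict per row, mapping a class label to its score.
    """
    if len(true_labels) != len(class_scores):
        raise ValueError("need exactly one score dict per true label")
    if not true_labels:
        raise ValueError("need at least one row")
    hits = 0
    for label, scores in zip(true_labels, class_scores):
        # Sort class labels from highest to lowest score and keep the first k.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


# Toy data (hypothetical): three sequences, four labs, scored with k=2.
toy_scores = [
    {"lab_a": 0.6, "lab_b": 0.2, "lab_c": 0.1, "lab_d": 0.1},
    {"lab_a": 0.1, "lab_b": 0.3, "lab_c": 0.5, "lab_d": 0.1},
    {"lab_a": 0.4, "lab_b": 0.3, "lab_c": 0.2, "lab_d": 0.1},
]
toy_truth = ["lab_a", "lab_b", "lab_d"]
toy_result = top_k_accuracy(toy_truth, toy_scores, k=2)
```

In the toy data the true lab is in the top two for the first two sequences but not the third, so two of three rows count as hits.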
However, I think my relative success says more about the power and efficiency of the XGBoost algorithm, the work put into it by its creator Tianqi Chen, and the incredible minds that contribute every year to open-source libraries than it does about my own skill. Those libraries substantially flatten the learning curve for newbies like me.
My best score came from simply using XGBoost "out of the box" and fitting it on the entire training feature matrix (without a train-validation split). The only parameter I changed was max_delta_step, which I set to 1 in an attempt to account for the heavily imbalanced classes (i.e., the DNA sequences were not distributed uniformly across the labs; some labs accounted for disproportionately many sequences and others for very few).
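A minimal sketch of that setup, assuming the scikit-learn-style wrapper `XGBClassifier` from the `xgboost` package (the post does not say which interface was used). The helper names `imbalance_ratio` and `fit_on_full_training_set` and the toy labels are illustrative assumptions, not the code behind the submission.

```python
from collections import Counter

# The single non-default setting described above; everything else is left
# at XGBoost's defaults.
XGB_OVERRIDES = {"max_delta_step": 1}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def fit_on_full_training_set(features, labels):
    """Fit on every training row, with no train-validation split.

    Assumes the xgboost package is installed and that labels are integer
    class ids 0..n_classes-1.
    """
    # Imported here so the helpers above can run without xgboost installed.
    from xgboost import XGBClassifier

    model = XGBClassifier(**XGB_OVERRIDES)
    model.fit(features, labels)
    return model


# Toy labels (hypothetical): one lab dominates the others.
toy_labels = [0] * 8 + [1] * 2 + [2] * 1
toy_ratio = imbalance_ratio(toy_labels)
```

With a fitted model, `model.predict_proba(test_features)` would give per-class probabilities, from which the ten likeliest labs for each sequence can be read off.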
In other words, even though I only had a few days to enter a competition that had been running for months, getting to the 90th percentile of all competitors was simply a matter of knowing about a particular algorithm and knowing how to apply it. What is going on here? These competitors are already filtered for competence (the ability to follow along with the starter instructions) and interest (caring enough about the topic or the money to enter in the first place). So what should I take away from this surprising result?
What went well:
- Picking something I care about
- Having an explanation of how to get started
- Knowledge of XGBoost, and knowing how to read documentation so that I could account for imbalanced classes
Lessons I learned and ways to improve:
Working with large datasets with lots of columns takes time and processing power. (Duh!)
Accordingly, if you want to learn, it helps to do lots of iterating, so choosing a large dataset on which it takes lots of time to fit a model is probably not the best way to learn.
When tweaking your model, it's good to iterate in a logical manner, planning your next couple steps based on your next output, however...
...you risk not getting the benefits of this if you're not taking good notes. This can be in the form of commenting your code and/or naming your variables in ways that make it easy to keep track of what you're doing. This goes double if you're using multiple computers to test slightly different versions of the same code.
Along those same lines, if you want the ability to create visualizations of earlier model versions, or you want to retain the option to easily go back and compare bits and pieces of model versions to each other, you should create a new model object under a fresh name each time you refit, rather than refitting the same object in place.
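One way to make that habit hard to forget is to keep every fitted model in a dictionary keyed by a unique name. The sketch below is only an illustration: `MajorityClassModel` is a hypothetical stand-in for a real estimator, and `fit_fresh`, `fitted_models`, and the toy data are made-up names.

```python
class MajorityClassModel:
    """Hypothetical stand-in for a real estimator with a fit/predict interface."""

    def __init__(self):
        self.majority_ = None

    def fit(self, features, labels):
        # "Training" here just remembers the most common label.
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return [self.majority_ for _ in features]


fitted_models = {}


def fit_fresh(name, make_model, features, labels):
    """Build a brand-new model object, fit it, and store it under a unique name."""
    if name in fitted_models:
        raise KeyError(f"{name!r} is already used; pick a fresh name")
    model = make_model()  # new object every time, so older fits are never overwritten
    model.fit(features, labels)
    fitted_models[name] = model
    return model


# Toy data (hypothetical): two versions fitted on different slices of the rows.
toy_features = [[0], [1], [2], [3], [4]]
toy_labels = ["a", "a", "b", "b", "b"]
fit_fresh("v1_first_three", MajorityClassModel, toy_features[:3], toy_labels[:3])
fit_fresh("v2_all_rows", MajorityClassModel, toy_features, toy_labels)
```

Both versions stay available afterwards, so `fitted_models["v1_first_three"]` and `fitted_models["v2_all_rows"]` can be compared or plotted side by side.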
Projecting into the future, I probably could have made my model slightly better if I'd had more time to experiment, but I expect substantial improvements would require more advanced data science skills and/or knowledge of molecular biology to facilitate the engineering of new features.
Collaboration (or building on the work of other smart people), plus a tiny amount of expertise, can get you really, really far, which leads me to a point I think a lot of forum members will be familiar with:
Given a combination of passion, willingness to learn, and rigorous/systematic thinking, it can be easy to do better than the average person at data modeling, much like it's easy to do better than the average person at picking effective charities (or many other goals members of the EA community care about).
...However, to do exceptionally well (to be not just better than most people, but the best; OR, to Actually Fix a problem), you also need time, resources, and domain knowledge.
First, congratulations. This is impressive; you should be very proud of yourself, and I hope this is the beginning of a long and fruitful data science career (or avocation) for you.
I think the simplest explanation is that your model fit better because you trained on more data. You write that your best score was obtained by applying XGBoost to the entire feature matrix, without splitting it into train/test sets. So assuming the other teams did things the standard way, you were working with 25%-40% more data to fit the model. In a lot of settings, particularly with tree-based methods (which I think XGBoost usually is), this is a recipe for overfitting. In this setting, however, it seems like the structure of the public test data was probably really close to the structure of the private test data, so the lack of validation on the public dataset paid off for you.
I think one interpretation of this is that you got lucky in that way. But I don't think that's the right takeaway. I think the right takeaway is that you kept your eye on the ball and chose the strategy that worked based on your understanding of the data structure and the available methods and you should be very satisfied.
Are you sure that this is the standard way in competitions? It is absolutely correct that before the final submission, one would find the best model by fitting it on a train set and evaluating it on the test set. However, once you have found the best-performing model that way, there is no reason not to train the model with the best parameters on the train+test set and submit that one. (Submissions are the predictions of the model on the validation set, not the parameters of the model.) After all, more data equals better performance.
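The workflow described here can be sketched as follows. This is a toy illustration under stated assumptions, not code from any competitor: `select_then_refit`, `fit_shrunk_mean`, and `negative_mse` are hypothetical names, the "model" is just a mean scaled by a shrink factor, and "held-out rows" plays the role of what this comment calls the test set.

```python
def select_then_refit(candidates, train_rows, holdout_rows, fit, score):
    """Pick the candidate that scores best on held-out rows, then refit it on all rows.

    fit(candidate, rows) returns a fitted model; score(model, rows) returns a
    number where higher is better.
    """
    best = max(candidates, key=lambda c: score(fit(c, train_rows), holdout_rows))
    # Final fit uses every labeled row, now that the setting has been chosen.
    final_model = fit(best, train_rows + holdout_rows)
    return best, final_model


def fit_shrunk_mean(shrink, rows):
    """Toy 'model': a constant prediction equal to shrink times the mean target."""
    return shrink * sum(rows) / len(rows)


def negative_mse(prediction, rows):
    """Higher is better: negative mean squared error of a constant prediction."""
    return -sum((r - prediction) ** 2 for r in rows) / len(rows)


# Toy target values (hypothetical).
toy_train = [1.0, 2.0, 3.0]
toy_holdout = [2.0, 4.0]
best_shrink, final_prediction = select_then_refit(
    [0.5, 1.0, 1.5], toy_train, toy_holdout, fit_shrunk_mean, negative_mse
)
```

The held-out rows are used only to choose among the candidate settings; the model that gets submitted is the one refit on the combined rows.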