Precise Altruism is a service that reads news feeds from effective altruism organizations and general news aggregators and classifies the articles according to their relevance to altruism and effective altruism. Articles classified as relevant are then linked and summarized on Tumblr and posted to Twitter and Facebook under the name Altrunews. (A post is by no means to be understood as an endorsement.)
The service reads feeds from the following sources and classifies them based on a hand-annotated corpus of a few hundred news articles.
- The Against Malaria Foundation
- GiveWell (two feeds)
- Giving What We Can
- The Life You Can Save
- Charity Science
- 80,000 Hours
- David Roodman’s blog
- Julia Wise’s blog (Giving Gladly)
- Ben Kuhn’s blog
- Brian Tomasik’s blog (Reducing Suffering)
- My own blog (claviger.net)
- The Effective Altruism Forum
- Animal Charity Evaluators
- The Abdul Latif Jameel Poverty Action Lab (three feeds)
- Center for Global Development
- Sentience Politics
- The Global Priorities Project
- Gates Notes
- Evidence Action
- Your Siblings
- The World Health Organization
- Raising for Effective Giving
- Good Ventures
- Innovations for Poverty Action
- Vegan Outreach
- The Future of Humanity Institute
- Animal Equality
- The Google News feed of English-language news articles containing certain keywords
- The Kuerzr feed of English-language news articles containing a similar set of keywords
Unfortunately, I couldn’t find the feeds of the Schistosomiasis Control Initiative, the Copenhagen Consensus Center, and Mercy For Animals. I’m open to further source feed suggestions, preferably Atom, not RSS.
By the way, Peter Hurford runs an unfiltered feed that covers only EA blogs, and I once wrote a tool, the Resyndicator, that could be used for something like that (especially in scenarios where such a feed doesn’t already exist).
The heart of our application is a classification pipeline built with scikit-learn, which uses tf-idf to generate a feature matrix from our news data and then a Stochastic Gradient Descent classifier to assign each article one of our two categories.
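A pipeline of this shape can be sketched as follows. This is only a minimal illustration, not the code of the service: the toy texts and labels are invented stand-ins for the hand-annotated corpus, and the step names and parameter values are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Invented toy data standing in for the hand-annotated corpus
# (1 = relevant to altruism, 0 = not relevant).
texts = [
    "Charity evaluator recommends donating to malaria net distribution",
    "New study measures cost-effectiveness of deworming programs",
    "Donors pledge ten percent of income to effective charities",
    "Cash transfers reduce poverty in randomized trial",
    "Football club signs striker for record transfer fee",
    "Smartphone maker unveils new flagship model",
    "Stock markets fall after interest rate decision",
    "Celebrity couple announces new film project",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# tf-idf turns the raw texts into a sparse feature matrix (unigrams and
# bigrams here); the SGD classifier then learns a linear decision boundary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("sgd", SGDClassifier(random_state=0)),
])
pipeline.fit(texts, labels)

# Classify two unseen headlines; each gets one of the two labels.
predictions = pipeline.predict([
    "Charity distributes malaria nets",
    "Club signs new striker",
])
```

Wrapping both steps in one Pipeline object means the vectorizer is fitted only on training data, which matters as soon as cross-validation enters the picture.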
We used grid search and cross-validation to determine the optimal classifier and an optimal set of parameters for it. Using only a small set of plausible parameters and only three splits for the cross-validation, we quickly determined the four of the initial ten classification algorithms that performed best on our data: Stochastic Gradient Descent, Logistic Regression, and two variations of the Support Vector Machine classifier. In our final, most finely tuned run, Stochastic Gradient Descent achieved an F1 score of 93%, about two percentage points more than the best of the other three classifiers.
The clearest takeaways from the grid search over a plausible SGD parameter set were that perceptron performed well as the loss function; that elasticnet performed well as the penalty; that activating the shuffling helped; that using bigrams in addition to unigrams was useful but that 3-grams did not improve the F1 score; and that the best values for n_iter varied widely among the best configurations.
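For illustration, a grid search over parameters like the ones named above might look like this. It is a sketch under assumptions: the toy data is invented, the grid values are picked for the example rather than taken from the original runs, and `max_iter` is used because current scikit-learn releases have replaced the older `n_iter` parameter with it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data (1 = relevant, 0 = not relevant); six examples per
# class so that three stratified splits are possible.
texts = [
    "Charity evaluator recommends donating to malaria net distribution",
    "New study measures cost-effectiveness of deworming programs",
    "Donors pledge ten percent of income to effective charities",
    "Cash transfers reduce poverty in randomized trial",
    "Foundation funds research on neglected tropical diseases",
    "Animal charity evaluator publishes new recommendations",
    "Football club signs striker for record transfer fee",
    "Smartphone maker unveils new flagship model",
    "Stock markets fall after interest rate decision",
    "Celebrity couple announces new film project",
    "Airline adds new routes for the summer season",
    "Band announces dates for world tour",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Hypothetical grid: keys follow the "<step>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "sgd__loss": ["hinge", "perceptron"],
    "sgd__penalty": ["l2", "elasticnet"],
    "sgd__shuffle": [True, False],
    "sgd__max_iter": [20, 1000],
}

# Three cross-validation splits, scored by F1 as in the text.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

best_params = search.best_params_
best_f1 = search.best_score_
```

On a corpus this small the winning configuration is essentially noise; the point is only the mechanics of searching vectorizer and classifier parameters together.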
It’s been almost a year since I implemented this, so please don’t quiz me on the details.
The daemon is the service that runs continuously on the server and repeatedly checks the source feeds. It sends If-None-Match headers whenever possible to minimize server load and traffic. The feed entries are then compared to those in the database to filter out known ones; we also compute the Jaccard distance between the preprocessed titles to avoid posting the same press releases over and over.
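Both ideas are small enough to sketch with the standard library. The function names, the tokenization, and the threshold of 0.3 below are assumptions made for the example; the actual preprocessing and cutoff used by the daemon are not documented here.

```python
import re
import urllib.request


def conditional_request(url, etag=None):
    """Build a GET request that lets the server answer 304 Not Modified."""
    request = urllib.request.Request(url)
    if etag:
        # The ETag value comes from the previous response for this feed.
        request.add_header("If-None-Match", etag)
    return request


def preprocess_title(title):
    """Lowercase a title and reduce it to a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard_distance(a, b):
    """1 minus the size of the intersection over the size of the union."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def is_duplicate(title, known_titles, threshold=0.3):
    """Treat a title as known if it is close enough to any stored title."""
    tokens = preprocess_title(title)
    return any(
        jaccard_distance(tokens, preprocess_title(known)) <= threshold
        for known in known_titles
    )
```

A distance of 0 means the token sets are identical and 1 means they share nothing, so a low threshold only catches near-verbatim reposts of the same press release.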
The articles that are typically associated with these entries are then fetched, stripped of boilerplate using Readability, summarized using Sumy, and finally posted to Tumblr. We extended the extraction step with one that also extracts a featured image and added a naive keyword extraction for the post tags on Tumblr.
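The keyword step is described only as naive, so the following is a guess at what such a step could look like, not the service's code: it counts word frequencies after dropping short words and a small hand-picked stopword list (both assumptions) and returns the most frequent words as tags.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"that", "this", "with", "from", "have", "were", "their", "which"}


def extract_tags(text, max_tags=5):
    """Return the most frequent non-trivial words of a text as tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) > 3 and word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(max_tags)]
```

Frequency counting like this tends to favor the article's main subject, which is usually good enough for tags, though it misses multi-word names.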