Precise Altruism

Dawn Drescher

Summary

Precise Altruism is a service that reads news feeds from effective altruism organizations and general news aggregators and classifies the articles by their relevance to altruism and effective altruism. Articles classified as relevant are then linked and summarized on Tumblr and posted to Twitter and Facebook under the name Altrunews. (A post is by no means to be understood as an endorsement.)

You can follow Altrunews on Tumblr, Twitter, and Facebook.

Introduction

Precise Altruism is a university project by Lea Helmers and me, which we worked on throughout a data science course by Dr. Kashif Rasul at the Freie Universität Berlin.

The service reads feeds from the following sources and classifies their articles based on a hand-annotated corpus of a few hundred news articles.

The Against Malaria Foundation
GiveWell (two feeds)
GiveDirectly
Giving What We Can
The Life You Can Save
Charity Science
80,000 Hours
David Roodman’s blog
Julia Wise’s blog (Giving Gladly)
Ben Kuhn’s blog
Brian Tomasik’s blog (Reducing Suffering)
My own blog (claviger.net)
The Effective Altruism Forum
Animal Charity Evaluators
The Abdul Latif Jameel Poverty Action Lab (three feeds)
Center for Global Development
Sentience Politics
The Global Priorities Project
Gates Notes
Evidence Action
Your Siblings
The World Health Organization
Raising for Effective Giving
Good Ventures
Innovations for Poverty Action
Vegan Outreach
The Future of Humanity Institute
Animal Equality
The Google News feed of English-language news articles containing certain keywords
The Kuerzr feed of English-language news articles containing a similar set of keywords

Unfortunately, I couldn’t find the feeds of the Schistosomiasis Control Initiative, the Copenhagen Consensus Center, and Mercy For Animals. I’m open to further source feed suggestions, preferably Atom, not RSS.

By the way, Peter Hurford runs an unfiltered feed that exclusively covers EA blogs, and I once wrote a tool, the Resyndicator, that could be used for something like that (especially in scenarios where such a feed doesn’t already exist).

The Classifier

The heart of our application is a classification pipeline built with scikit-learn, which uses tf-idf to generate a feature matrix from our news data and then a Stochastic Gradient Descent classifier to assign each article to one of our two categories.
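A pipeline of this shape can be sketched as follows. This is only a minimal illustration, not our actual code: the toy texts, the labels, and the chosen parameter values are assumptions made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the hand-annotated corpus
# (1 = relevant to altruism, 0 = not relevant).
texts = [
    "Charity distributes malaria nets to families in Malawi",
    "New study evaluates cash transfers to poor households",
    "Donors pledge funds for deworming programs in schools",
    "Football club signs striker for record transfer fee",
    "Smartphone maker unveils new flagship model",
    "Stock markets rally after central bank decision",
]
labels = [1, 1, 1, 0, 0, 0]

# tf-idf turns the raw texts into a sparse feature matrix;
# the SGD classifier then learns a linear decision boundary on it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("sgd", SGDClassifier(loss="hinge", penalty="l2", shuffle=True, random_state=0)),
])
pipeline.fit(texts, labels)

# Classify an unseen headline; the result is an array with one label.
predictions = pipeline.predict(["Foundation funds malaria nets in Uganda"])
print(predictions)
```

Wrapping both steps in a single Pipeline means the vectorizer is fitted only on the training data, which matters once cross-validation enters the picture.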

We used grid search and cross-validation to determine the optimal classifier and an optimal set of parameters for it. Using only a small set of plausible parameters and only three splits for the cross-validation, we quickly determined the four out of initially ten classification algorithms that performed best on our data: Stochastic Gradient Descent, Logistic Regression, and two variations of the Support Vector Machine classifier. In our final, most finely tuned run, Stochastic Gradient Descent achieved an F1 score of 93%, about two percentage points more than the best of the other three classifiers.

The clearest takeaways from the grid search over a plausible SGD parameter set were that the loss functions log, hinge, modified_huber, and perceptron performed well; that the penalties l2 and elasticnet performed well; that activating the shuffling helped; that using bigrams in addition to unigrams was useful, whereas 3-grams did not improve the F1 score; and that the best values for alpha and n_iter varied widely among the best configurations.
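A grid search along these lines might look like the sketch below. The toy corpus and the exact value lists are assumptions for illustration, not our real grid. The log loss is left out because, as far as I know, newer scikit-learn versions call it log_loss, and n_iter has since been replaced by max_iter, so the sketch sticks to names that should work across versions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: six relevant and six irrelevant headlines,
# enough for three stratified cross-validation splits.
texts = [
    "Charity distributes malaria nets to families in Malawi",
    "New study evaluates cash transfers to poor households",
    "Donors pledge funds for deworming programs in schools",
    "Evaluator recommends top charities for year-end giving",
    "Campaign reduces animal suffering on factory farms",
    "Researchers compare cost-effectiveness of health programs",
    "Football club signs striker for record transfer fee",
    "Smartphone maker unveils new flagship model",
    "Stock markets rally after central bank decision",
    "Pop star announces world tour dates",
    "Car maker recalls vehicles over brake problem",
    "Weather service warns of heavy snow this weekend",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(shuffle=True, random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "sgd__loss": ["hinge", "modified_huber", "perceptron"],
    "sgd__penalty": ["l2", "elasticnet"],
    "sgd__alpha": [1e-5, 1e-4, 1e-3],
}

# Score every combination by F1 using three cross-validation splits.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

On a corpus this small the scores are meaningless; the point is only the mechanics of searching over vectorizer and classifier parameters together.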

It’s been almost a year since I implemented this, so please don’t quiz me on the details.

The Daemon

The daemon is the service that runs continuously on the server and regularly checks the source feeds. It sends If-Modified-Since and If-None-Match headers whenever possible to minimize server load and traffic. The feed entries are then compared to those in the database to filter out known ones; we also compute the Jaccard distance between the preprocessed titles to avoid posting the same press releases over and over.
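The two ideas in this step, conditional requests and title deduplication, can be sketched roughly as follows. The function names, the tokenization, and the threshold of 0.3 are assumptions for the example and not necessarily what the daemon does.

```python
import re


def conditional_headers(etag=None, last_modified=None):
    """Build headers for a conditional GET from values the server sent earlier."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def preprocess_title(title):
    """Lowercase a title and reduce it to its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def is_duplicate(title, known_titles, threshold=0.3):
    """Treat a title as known if it is close enough to any stored title."""
    tokens = preprocess_title(title)
    return any(
        jaccard_distance(tokens, preprocess_title(known)) <= threshold
        for known in known_titles
    )


known = ["GiveWell Announces New Top Charities!"]
print(is_duplicate("GiveWell announces new top charities", known))
print(conditional_headers(etag='"abc123"'))
```

A server that supports these headers answers with status 304 and an empty body when nothing has changed, so the feed does not have to be downloaded or parsed again.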

The articles that are typically associated with these entries are then fetched, stripped of boilerplate using Readability, summarized using Sumy, and finally posted to Tumblr. We extended the extraction step with one that also extracts a featured image and added a naive keyword extraction for the post tags on Tumblr.
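To give an idea of what a naive keyword extraction can look like, here is a frequency-based sketch. The stopword list, the minimum word length, and the function name are made up for the example; our actual extraction may differ.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({
    "that", "this", "with", "from", "have", "will", "were", "their",
    "they", "been", "also", "more", "than", "which", "about",
})


def extract_tags(text, max_tags=5, min_length=4):
    """Return the most frequent non-stopword words as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(max_tags)]


article = "Malaria nets prevent malaria. Nets are cheap, and malaria is deadly."
print(extract_tags(article, max_tags=2))  # ['malaria', 'nets']
```

Counting raw frequencies ignores how common a word is across all articles, which is one reason to call such an approach naive.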

RyanCarey, Mar 21 2015

That's pretty sweet. And I am mighty impressed with how well the feeds, such as the Tumblr, really understand what effective altruists are interested in. Nice work!

Dawn Drescher, Mar 22 2015

:‑D

Peter Wildeford, Mar 21 2015

This is a cool data science accomplishment! Congratulations!

Is there any way we could plug this in to power r/smartgiving? If you have one RSS feed for the whole thing, I can automatically post from that.

Dawn Drescher, Mar 22 2015

Thankies! The Tumblr RSS feed should do the trick. They don’t have Atom unfortunately. I hope the voting works well on that board, because one in ten posts or so is going to be a bit off topic.

Using the feedback from Reddit would be valuable (and a form of reinforcement learning), but given how much other stuff I have to do, it’s unlikely to materialize. I’m happy to accept pull requests though. :‑3

Peter Wildeford, Mar 24 2015

Should be online, populating r/smartgiving, as of right now.

Peter Wildeford, Mar 26 2015

Is it possible to get an RSS feed that contains the constituent articles and not the Tumblr posts?

Peter Wildeford, Mar 21 2015

P.S: r/smartgiving would also allow you to refit your model based on the upvotes / downvotes, which enables supervised learning!
