David_Moss

I am the Principal Research Manager at Rethink Priorities, working on, among other things, the EA Survey, the Local Groups Survey, and a number of studies on moral psychology, focusing on animal ethics, population ethics, and moral weights.

In my academic work, I'm a Research Fellow working on a project on 'epistemic insight' (mixing philosophy, empirical study and policy work) and on moral psychology studies, mostly concerned with either effective altruism or metaethics.

I've previously worked for Charity Science in a number of roles and was formerly a trustee of EA London.

Sequences

EA Survey 2020

Comments

Lessons learned running the Survey on AI existential risk scenarios

I think it depends a lot on the specifics of your survey design. The most commonly discussed tradeoff in the literature is probably that having more questions per page, as opposed to more pages with fewer questions, leads to higher non-response and lower self-reported satisfaction, but people answer the former more quickly. But how to navigate this tradeoff is very context-dependent.

All in all, the optimal number of items per screen requires a trade-off:
More items per screen shorten survey time but reduce data quality (item nonresponse) and respondent satisfaction (with potential consequences for motivation and cooperation in future surveys). Because the negative effects of more items per screen mainly arise when scrolling is required, we are inclined to recommend placing four to ten items on a single screen, avoiding the necessity to scroll. 

https://www.researchgate.net/publication/249629594_Design_of_Web_Questionnaires_The_Effects_of_the_Number_of_Items_per_Screen

In this context, survey researchers have to make informed decisions regarding which approach to use in different situations. Thus, they have to counterbalance the potential time savings and ease of application with the quality of the answers and the satisfaction of respondents. Additionally, they have to consider how other characteristics of the questions can influence this trade-off. For example, it would be expected that an increase in answer categories would lead to a considerable decrease in data quality, as the matrix becomes larger and harder to complete. As such, in addition to knowing which approach leads to better results, it is essential to know how characteristics of the questions, such as the number of categories or the device used, influence the trade-off between the use of grids and single-item questions.

https://sci-hub.ru/https://journals.sagepub.com/doi/full/10.1177/0894439316674459

But, in addition, I think there are a lot of other contextual factors that influence which is preferable. For example, if you want respondents to answer a number of questions pertaining to a number of subtly different prompts (which is pretty common in studies with a within-subjects component), then having all the questions for one prompt on one page may help make salient the distinction between the different prompts. There are other things you can do to aid this, like having gap pages between different prompts, though these can really enrage respondents.

Lessons learned running the Survey on AI existential risk scenarios

I think all of the following (and more) are possible risks:


- People are tired/bored and so answer less effortfully/more quickly

- People are annoyed and so answer in a qualitatively different way

- People are tired/bored/annoyed and so skip more questions

- People are tired/bored/annoyed and dropout entirely

Note that people skipping questions or dropping out is not merely a matter of quantity (a reduced number of responses), because the dropout/skipping is likely to be differential. The effect of the questions will be that precisely those respondents who are more likely to be bored/tired/annoyed by them, and to skip questions or drop out when bored/tired/annoyed, become less likely to give responses.
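To illustrate the point about differential dropout, here is a minimal sketch in Python (with entirely made-up numbers rather than real survey data) showing how an estimate becomes biased when the respondents most likely to skip a question are also those who would have answered it differently:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-7 attitude scores; assume lower scores go along with
# being more bored/annoyed by the questions.
true_scores = rng.integers(1, 8, size=10_000)

# Differential skipping: the lower the score, the higher the chance of skipping.
p_skip = 0.6 - 0.07 * true_scores   # ~53% skip at score 1, ~11% at score 7
answered = rng.random(10_000) > p_skip

print("Mean among everyone:       ", round(true_scores.mean(), 2))
print("Mean among those answering:", round(true_scores[answered].mean(), 2))  # biased upwards
```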

Regrettably, I think that specifying extremely clearly that the questions are completely optional helps for some respondents (it also likely makes many simply less likely to answer these questions), but doesn't ameliorate the harm for others. You may be surprised how many people will provide multiple exceptionally long open comments and then complain that the survey took them longer than the projected average. That aside, depending on the context, I think it's sometimes legitimate for people to be annoyed by the presence of lots of open comment questions, even if these are explicitly stated to be optional, because in context it may seem like they need to answer them anyway.

Lessons learned running the Survey on AI existential risk scenarios

Thanks for the post. I think most of this is useful advice.

"Walkthroughs" are a good way to improve the questions

In the academic literature, these are also referred to as "cognitive interviews" (not to be confused with this use) and I generally recommend them when developing novel survey instruments.  Readers could find out more about them here.

Testers are good at identifying flaws, but bad at proposing improvements... I'm told that this mirrors common wisdom in UI/UX design: that beta testers are good at spotting areas for improvement, but bad (or overconfident) at suggesting concrete changes.

This is also conventional understanding in academia. Though there are some, mostly qualitatively oriented, philosophies that focus more on letting participants shape how the research output is articulated, there's generally no reason to think that respondents should be able to describe how a question should be asked (although, of course, if you are pretesting anyway, there is little reason not to consider suggestions). Depending on what you are measuring, respondents may not even be aware of what underlying construct (not necessarily something they even have a concept for) an item is trying to measure. Indeed, people may not even be able to accurately report on their own cognitive processes. Individuals' implicit understanding may outpace their ability to explicitly theorise about the issue at hand (for example, people can often spot incorrect grammar, or a misapplied concept, but not provide explicit accounts of the rules governing the thing in question).

Include relatively more "free-form" questions, or do interviews instead of a survey...

In interviews, you can be more certain that you and the respondent are thinking about questions in the same way (especially useful if your survey must deal with vague concepts)...

In interviews, you can get a more granular understanding of participants’ responses if desired, e.g. understanding relevant aspects of their worldviews, and choose to delve deeper into certain important aspects.

I agree there are some very significant advantages to the use of more qualitative instruments such as open-comments or interviews (I provide similar arguments here). In some cases these might be so extreme that it only makes sense to use these methods.  That said, the disadvantages are potentially severe, so I would recommend against people being too eager to either switch to fully qualitative methods or add more open comment instruments to a mixed survey:

  • Open comment responses may greatly reduce comparability (and so the ability to aggregate responses at all, if that is one of your goals), because respondents may be functionally providing answers to different questions, employing different concepts
  • Analysing such data typically raises a lot of issues of subjectivity and researcher degrees of freedom
  • You can attempt to overcome those issues by pre-registering even qualitative research (see here or here) and by following a fixed protocol, decided in advance, that uses a more objective method to analyse and aggregate responses, but then this reintroduces the original issue of needing to force individuals' responses into fixed boxes when they may have been thinking of things in a different manner.

Including both fixed-response and open-comment questions in the same survey may seem like the best of both worlds, and is often the best approach, but open-comment questions are often dramatically more time-consuming and demanding than fixed-response questions, and so their inclusion can greatly reduce the quality of the responses to the fixed questions.

I think running separate qualitative and quantitative studies is worth seriously considering: either with initial qualitative work helping to develop hypotheses, followed by a quantitative study, or with a wider quantitative study followed by qualitative work to delve further into the details. This can also be combined with separate exploratory and confirmatory stages of research, which is often recommended.

This latter point relates to the issue of preregistration, which you mention. It is common not to preregister analyses for exploratory research (where you don't have existing hypotheses you want to test and simply want to explore or describe possible patterns in the data), though some argue you should preregister exploratory research anyway. I think there's a plausible argument for erring on the side of preregistration in theory, based on the fact that preregistration allows reporting additional exploratory analyses anyway, or explicitly deviating from your preregistered analysis to run things differently if the data requires it (which it sometimes does, if certain assumptions are not met). That said, it is quite possible for researchers to inappropriately preregister exploratory research and then neither deviate nor report additional analyses, even where this means the analyses they are reporting are inappropriate and completely meaningless, so this is a pitfall worth bearing in mind and trying to avoid.

Ideally, I'd want to stress test this methodology by collecting ~10 responses and running it - probably you could just simulate this by going through the survey 10 times, wearing different hats.

Another option would be to literally simulate your data (you could simulate data that either does or does not match your hypotheses, for example) and analyse that. This is potentially pretty straightforward, depending on the kind of data structure you anticipate.
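For example, here is a minimal sketch in Python (the variable names, group labels and 7-point rating scale are purely illustrative assumptions, not taken from the post's survey) of simulating data that does and does not match a hypothesised group difference, then running the planned analysis on both:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200  # hypothetical number of simulated respondents


def simulate(effect_exists: bool) -> pd.DataFrame:
    """Simulate 7-point ratings for two groups, with or without a group difference."""
    group = rng.choice(["A", "B"], size=n)
    shift = (group == "B") * (1.0 if effect_exists else 0.0)
    rating = np.clip(np.round(rng.normal(4 + shift, 1.5, size=n)), 1, 7)
    return pd.DataFrame({"group": group, "rating": rating})


# Run the planned analysis on both simulated datasets before collecting real data.
for exists in (True, False):
    df = simulate(exists)
    print(f"effect_exists={exists}:",
          df.groupby("group")["rating"].mean().round(2).to_dict())
```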

Incidentally, I agreed with almost all the advice in this post except for some of the things in "Other small lessons about survey design." In particular, "Almost always have just one question per page" and avoiding sliding scales in favour of lists seem like things I would not generally recommend as blanket rules (although having one question per page and using lists rather than scales is often fine).

For "screening" questions for competence, unless you literally want to screen people out from taking the survey at all, you might also want to consider running these at the end of the survey rather than the beginning. Rather than excluding people from the survey entirely and not gathering their responses at all, you could gather their data and then conduct analyses excluding respondents who fail the relevant checks, if appropriate (whether it's better to gather their data at all depends a lot on the specific case). Which order is better is a tricky question, again depending on the specific case. One reason to have such questions later is that respondents can be annoyed by checks which seem like they are trying to test them (this most commonly comes up with comprehension/attention/instructional manipulation checks), and this can influence later questions (the DVs you are usually interested in). Of course, in some circumstances, you may be concerned that the main questions will themselves influence responses to your 'check' in a way that would invalidate it.
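On the "gather their data and exclude at analysis time" option, a minimal sketch in Python (with a hypothetical passed_check column and made-up values, not from any actual survey) might look like this:

```python
import pandas as pd

# Hypothetical responses, with a competence/attention check recorded at the end.
responses = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4, 5],
    "passed_check":  [True, True, False, True, False],
    "main_outcome":  [5, 3, 6, 4, 2],
})

# Keep everyone's data, but report the analysis with and without
# respondents who failed the check, to see whether exclusion matters.
passing_only = responses[responses["passed_check"]]
print("All respondents:   ", responses["main_outcome"].mean())
print("Passing check only:", passing_only["main_outcome"].mean())
```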

For what it's worth, I'm generally happy to offer comments on surveys people are running, and although I can't speak for them, I imagine that would go for my colleagues on the survey team at Rethink Priorities too.

EA Survey 2020: Geography

Would it be helpful to put some or all of the survey data on a data visualisation software like google data studio or similar? This would allow regional leaders to quickly understand their country/city data and track trends. It might also save time by reducing the need to do so many summary posts every year and provide new graphs on request.

We are thinking about putting a lot more analyses on the public bookdown next year, rather than in the summaries, which might serve some of this function. As you'll be aware, it's not that difficult to generate the same analysis for each specific country. 
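As a minimal sketch of what I mean (in Python, with hypothetical column names and values rather than actual EA Survey data), the same summary can be produced for every country in one pass:

```python
import pandas as pd

# Hypothetical respondent-level data; the columns are purely illustrative.
survey = pd.DataFrame({
    "country":    ["UK", "UK", "USA", "Germany", "USA", "Germany"],
    "engagement": [3, 4, 5, 2, 4, 3],
})

# One grouped summary covers every country at once.
by_country = survey.groupby("country")["engagement"].agg(["count", "mean"])
print(by_country)
```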

A platform that would allow more specific customisation of the analyses (e.g. breakdowns by city and gender and age etc.) would be less straightforward, since we'd need to ensure that no analyses could be sensitive or de-anonymising.

Unfortunately, we committed not to share any individual-level data (except with CEA, where respondents opted in to that). We're still happy to receive requests from people, such as yourself, who would like to see additional aggregate analyses (though the caveat about them not being potentially de-anonymising still applies, which is particularly an issue where people want analyses looking at a particular geographic area with a small number of EAs).

EA Survey 2020: Geography

~15 months isn't necessarily a target for the future. I think we could actually increase the gap to ~1.5 years going forward. But yes, the reasons for that would be to get the best balance between getting more repeated measurements (which increases the power, loosely speaking, of our estimates), being able to capture meaningful trends (looking at cross-year data, most things don't seem to change dramatically in the course of only 12 months), and reducing survey fatigue. That said, whatever the average frequency of the survey going forward, I expect there to be some variation as we shuffle things around to fit other organisations' timelines and to avoid clashing with other surveys (like the EA Groups Survey) and so on.

EA Survey 2020: Geography

Thanks for the question. We're planning to release the next EA Survey sometime in the middle of 2022. Historically, the average length of time between EA Surveys has been ~15 months, rather than 12 months, and last year's survey was run right at the end of the year, so there won't be a survey within 2021 (the last time this happened was 2016).

EA Survey 2020: Geography

That makes sense. Finding reference numbers even for things like race is surprisingly tricky. We've previously considered comparing the percentages for race within the EA Survey to baseline percentages. But this only works passably well for the US (EAS respondents are more white) and the UK (EAS respondents are less white), and that is without taking into account the fact that EAS respondents are disproportionately rich, highly educated and young, and therefore should not be expected to represent the composition of the general population. For many other major countries, however, there simply isn't national data on race/ethnicity that matches the same categories as the US/UK. I think people should generally be a lot more uncertain when estimating how far the EA community is representative in this sense. The figures still allow comparison within the EA community, though.

EA Survey 2020: Geography

Here are the countries with the highest number of EAs per capita. Note that Iceland, Luxembourg and Cyprus nevertheless have very low absolute numbers of EA respondents (<5). This graph doesn't leave out any countries with particularly high numbers of EAs in absolute terms, though Poland and China are missing despite having >10.

EA Survey 2020: Geography

We have reported this previously in both EAS 2018 and EAS 2019. We didn't report it this year because the per capita numbers are pretty noisy (at least among the locations with the highest number of EAs per capita, which tend to be low-population countries). But it would be pretty easy to reproduce this analysis using this year's data.
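For anyone who wants to do so, the calculation itself is simple; here is a minimal sketch in Python (the respondent counts and populations are made up purely to illustrate the arithmetic, not the actual EAS figures):

```python
import pandas as pd

# Made-up counts and populations, purely to illustrate the calculation.
df = pd.DataFrame({
    "country":     ["Iceland", "UK", "USA"],
    "respondents": [3, 300, 900],
    "population":  [370_000, 67_000_000, 330_000_000],
})

# Respondents per million residents: small-population countries can top the
# ranking with only a handful of respondents, which is why the figures are noisy.
df["per_million"] = df["respondents"] / df["population"] * 1_000_000
print(df.sort_values("per_million", ascending=False))
```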

EA Survey 2020: Community Information

To get another reference point, I coded the "High Standards" comments and found that 75% did not seem to be about "perceived attitudes towards others." Many comments explicitly disavowed the idea that they think EAs look down on others, for example, but still reported that they feel bad because of demandingness considerations or because 'everyone in the community is so talented', etc.
