All of Dan_Keys's Comments + Replies

Intelligence 1: Individual cognitive abilities.

Intelligence 2: The ability to achieve a wide range of goals.

Eliezer Yudkowsky means Intelligence 2 when he talks about general intelligence. e.g., He proposed "efficient cross-domain optimization" as the definition in his post by that name. See the LW tag page for General Intelligence for more links & discussion.

4 · David Johnston · 6d
Eliezer’s threat model is “a single superintelligent algorithm with at least a little bit of ability to influence the world”. In this sentence, the word “superintelligent” cannot mean intelligence in the sense of definition 2, or else it is nonsense - definition 2 precludes “small or no ability to influence the world”. Furthermore, in recent writing Eliezer has emphasised threat models that mostly leverage cognitive abilities (“intelligence 1”), such as a superintelligence that manipulates someone into building a nanofactory using existing technology. Such scenarios illustrate that intelligence 2 is not necessary for AI to be risky, and I think Eliezer deliberately chose these scenarios to make just that point. One slightly awkward way to square this with the second definition you link is to say that Yudkowsky uses definition 2 to measure intelligence, but is also very confident that high cognitive abilities are sufficient for high intelligence, and therefore doesn’t always see a need to draw a clear distinction between the two.

The model assumes gradually diminishing returns to spending within the next year, but the intuitions behind your third voice think that much higher spending would involve marginal returns that are a lot smaller OR ~zero OR negative?

Huh, now that you mention it, I think the third voice thinks that much higher spending would be negative, not just a lot smaller or zero. So maybe that's what's going on: The third voice intuits that there are backfire risks along the lines of "EA gets a reputation for being ridiculously profligate" that the model doesn't model? Maybe another thing that's going on is that maybe we literally are funding all the opportunities that seem all-things-net-positive to us. The model assumes an infinite supply of opportunities, of diminishing quality, but in fact maybe there are literally only finitely many and we've exhausted them all.

Could you post something closer to the raw survey data, in addition to the analysis spreadsheet linked in the summary section? I'd like to see something that:

  • Has data organized by respondent  (a row of data for each respondent)
  • Shows the number given by the respondent, before researcher adjustments (e.g., answers of 0 are shown as "0" and not as ".01") (it's fine for it to show the numbers that you get after data cleaning which turns "50%" and "50" into "0.5")
  • Includes each person's 6 component estimates, along with a few other variables like their dire
... (read more)
Yes, I will do that, although some respondents asked to remain anonymous / not have their data publicly accessible, so I need to make some slight alterations before I share. I'd guess a couple of weeks for this.

The numbers that you get from this sort of exercise will depend heavily on which people you get estimates from. My guess is that which people you include matters more than what you do with the numbers that they give you.

If the people who you survey are more like the general public, rather than people around our subcultural niche where misaligned AI is a prominent concern, then I expect you'll get smaller numbers.

Whereas, in Rob Bensinger's 2021 survey of "people working on long-term AI risk", every one of the 44 people who answered the survey gave an estim... (read more)

I completely agree that the survey demographic will make a big difference to the headline results figure. Since I surveyed people interested in existential risk (Astral Codex Ten, LessWrong, EA Forum) I would expect the results to be biased upwards, though. (Almost) every participant in my survey agreed the headline risk was greater than the 1.6% figure from this essay, and generally my results line up with the Bensinger survey. However, this is structurally similar to the state of Fermi Paradox estimates prior to SDO 'dissolving' it - that is, almost everyone working on the Drake Equation put the probable number of alien civilisations in the universe very high, because they missed the extremely subtle statistical point about uncertainty analysis which SDO spotted, and which I have replicated in this essay. In my opinion, Section 4.3 indicates that as long as you have any order-of-magnitude uncertainty you will likely get an asymmetric distribution of risk, and so in that sense I disagree that the mechanism depends on who you ask. The mechanism is the key part of the essay; the headline number is just one particular way to view that mechanism.
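The mechanism described here (order-of-magnitude uncertainty in multiplied components producing an asymmetric, right-skewed distribution of risk) can be checked with a toy Monte Carlo sketch. All numbers below are made up for illustration and are not the survey's actual estimates:

```python
import math
import random

random.seed(0)
N, K = 50_000, 6  # hypothetical sampled "worlds" and multiplied components

products = []
for _ in range(N):
    # Each component is log-uniform between 1% and 90%: an arbitrary
    # stand-in for "order-of-magnitude uncertainty".
    log_probs = [random.uniform(math.log10(0.01), math.log10(0.9)) for _ in range(K)]
    products.append(10 ** sum(log_probs))

geo = 10 ** (sum(math.log10(x) for x in products) / N)  # geometric mean
arith = sum(products) / N                               # simple mean

# Most sampled worlds are low-risk, while a thin tail of high-risk
# worlds drags the arithmetic mean far above the geometric mean.
```

With these made-up inputs the simple mean lands one to two orders of magnitude above the geometric mean, which is the asymmetry the comment appeals to.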

If the estimates for the different components were independent, then wouldn't the distribution of synthetic estimates be the same as the distribution of individual people's estimates?

Multiplying Alice's p1 x Bob's p2 x Carol's p3 x ... would draw from the same distribution as multiplying Alice's p1 x Alice's p2 x Alice's p3 ... , if estimates to the different questions are unrelated.

So you could see how much non-independence affects the bottom-line results just by comparing the synthetic distribution with the distribution of individual estimates (treating ... (read more)
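The suggested check can be sketched with simulated data. Everything below is invented: each "respondent" gets a shared pessimism shift across all six components (the correlation), and shuffling each component column independently plays the role of the synthetic-estimate method:

```python
import random

random.seed(0)
N, K = 20_000, 6  # hypothetical respondents and components

# Toy log10 "estimates" (not clipped to [0, 1]; purely illustrative).
people = []
for _ in range(N):
    pessimism = random.gauss(0, 0.3)  # shared across this person's answers
    people.append([random.gauss(-1.5, 0.3) + pessimism for _ in range(K)])

# Within-person products: Alice's p1 x Alice's p2 x ... (log10 sums)
within = [sum(logs) for logs in people]

# Synthetic products: shuffle each component column independently,
# breaking the within-person correlation (Alice's p1 x Bob's p2 x ...).
cols = [[p[k] for p in people] for k in range(K)]
for col in cols:
    random.shuffle(col)
synthetic = [sum(col[i] for col in cols) for i in range(N)]

def geo(log_products):
    return 10 ** (sum(log_products) / len(log_products))

def arith(log_products):
    return sum(10 ** x for x in log_products) / len(log_products)

# Shuffling never changes the column sums, so the geometric means match
# exactly; but the correlated within-person products have a heavier
# right tail, so their simple mean comes out larger.
```

This mirrors the pattern reported in the reply below it in the thread: within-person products give a higher simple mean than the synthetic distribution, while the geometric means stay much closer.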

In practice these numbers wouldn't perfectly match even if there were no correlation, because there is some missing survey data that the SDO method ignores (naturally, you can't sample data that doesn't exist). In principle I don't see why we shouldn't use this as a good rule-of-thumb check for unacceptable correlation. The synth distribution gives a geomean of 1.6% and a simple mean of around 9.6%, as per the essay. The distribution of all survey responses multiplied together (as per Alice p1 x Alice p2 x Alice p3) gives a geomean of approx 2.3% and a simple mean of approx 17.3%. I'd suggest that this implies the SDO method's weakness to correlated results is potentially depressing the actual result by about 50%, give or take. I don't think that's either obviously small enough not to matter or obviously large enough to invalidate the whole approach, although my instinct is that when talking about order-of-magnitude uncertainty, a 50% error would not be a showstopper.

Does the table in section 3.2 take the geometric mean for each of the 6 components?

From footnote 7 it looks like it does, but if it does then I don't see how this gives such a different bottom line probability from the synthetic method geomean in section 4 (18.7% vs. 1.65% for all respondents). Unless some probabilities are very close to 1, and those have a big influence on the numbers in the section 3.2 table? Or my intuitions about these methods are just off.

That's correct - the table gives the geometric mean of odds for each individual line, but then the final line is a simple product of the preceding lines rather than the geometric mean of each individual final estimate. This is a tiny bit naughty of me, because it means I've changed my method of calculation halfway through the table. The reason I do this is that it is implicitly what everyone else has been doing up until now (e.g. it is what is done in Carlsmith 2021), and I want to highlight the discrepancy this leads to.
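The gap between the two orders of aggregation can be seen with made-up numbers for three respondents and two components (these are not the survey's figures):

```python
import math

# Hypothetical respondent probabilities (rows) for two components (columns).
probs = [
    [0.5, 0.1],
    [0.2, 0.6],
    [0.9, 0.01],
]

def to_odds(p): return p / (1 - p)
def to_prob(o): return o / (1 + o)
def geomean(xs): return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Table-style method: geometric mean of odds for each component,
# then a simple product of the pooled per-component probabilities.
pooled = [to_prob(geomean([to_odds(row[k]) for row in probs])) for k in range(2)]
product_of_pooled = pooled[0] * pooled[1]

# Alternative: take each respondent's own final (product) estimate,
# then the geometric mean across respondents.
pooled_final = geomean([row[0] * row[1] for row in probs])
```

Even in this tiny example the product of pooled components comes out noticeably higher than the pooled final estimates, the same direction as the 18.7% vs. 1.65% discrepancy discussed above.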

Have you looked at how sensitive this analysis is to outliers, or to (say) the most extreme 10% of responses on each component?

The recent Samotsvety nuclear risk estimate removed the largest and smallest forecast (out of 7) for each component before aggregating (the remaining 5 forecasts) with the geometric mean. Would a similar adjustment here change the bottom line much (for the single probability and/or the distribution over "worlds")?

The prima facie case for worrying about outliers actually seems significantly stronger for this survey than for an org l... (read more)
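A minimal sketch of the Samotsvety-style adjustment described above, with invented forecasts standing in for the real ones:

```python
import math

def trimmed_geomean(forecasts):
    """Drop the single largest and smallest forecast, then take the
    geometric mean of the remaining ones."""
    trimmed = sorted(forecasts)[1:-1]
    return math.exp(sum(math.log(p) for p in trimmed) / len(trimmed))

# Seven hypothetical forecasts for one component, with outliers at both ends.
component = [0.001, 0.01, 0.02, 0.05, 0.08, 0.2, 0.9]
aggregate = trimmed_geomean(component)
```

Trimming before taking the geometric mean protects the aggregate from either extreme, which is exactly the sensitivity-to-outliers question being raised here.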

I had not thought to do that, and it seems quite sensible (I agree with your point about prima facie worry about low outliers). The results are below.

To my eye, the general mechanism I wanted to defend is preserved (there is an asymmetric probability of finding yourself in a low-risk world), but the probability of finding yourself in an ultra-low-risk world has significantly lowered, with that probability mass roughly redistributing itself around the geometric mean (which itself has gone up to 7%-ish).

In some sense this isn't totally surprising - remo... (read more)

A passage from Superforecasting:

Flash back to early 2012. How likely is the Assad regime to fall? Arguments against a fall include (1) the regime has well-armed core supporters; (2) it has powerful regional allies. Arguments in favor of a fall include (1) the Syrian army is suffering massive defections; (2) the rebels have some momentum, with fighting reaching the capital. Suppose you weight the strength of t

... (read more)

Two empirical reasons not to take the extreme scope neglect in studies like the 2,000 vs 200,000 birds one as directly reflecting people's values.

First, the results of studies like this depend on how you ask the question. A simple variation which generally leads to more scope sensitivity is to present the two options side by side, so that the same people would be asked both about 2,000 birds and about the 200,000 birds (some call this "joint evaluation" in contrast to "separate evaluation"). Other variations also generally produce more scope sensitive resu... (read more)

A passage from Superforecasting: Note: in the other examples studied by Mellers & colleagues (2015) [], regular forecasters were less sensitive to scope than they should've been, but they were not completely insensitive to scope, so the Assad example here (40% vs. 41%) is unusually extreme.

It would be interesting whether the forecasters with outlier numbers stand by those forecasts on reflection, and to hear their reasoning if so. In cases where outlier forecasts reflect insight, how do we capture that insight rather than brushing them aside with the noise? Checking in with those forecasters after their forecasts have been flagged as suspicious-to-others is a start.

The p(month|year) number is especially relevant, since that is not just an input into the bottom line estimate, but also has direct implications for individual planning. The plan ... (read more)

These numbers seem pretty all-over-the-place. On nearly every question, the odds given by the 7 forecasters span at least 2 orders of magnitude, and often substantially more. And the majority of forecasters (4/7) gave multiple answers which seem implausible (details below) in ways that suggest that their numbers aren't coming from a coherent picture of the situation.

I have collected the numbers in a spreadsheet and highlighted (in red) the ones that seem implausible to me.

Odds span at least 2 orders of magnitude:

Another commenter noted that the answers to ... (read more)

Hey Dan, thanks for sanity-checking! I think you and feruell are correct to be suspicious of these estimates; we laid out reasoning and probabilities for people to adjust to their taste/confidence.

  • I agree outliers are concerning (and find some of them implausible), but I likewise have the experience of being at 10-20% when a crowd was at ~0% (for a national election resulting in a tie) and at 20-30% when a crowd was at ~0% (for a SCOTUS case) [likewise for me being at ~1% while the crowd was much higher; I have also on occasion been wrong, updating x20 as a result - not sure if peers foresaw the Biden-Putin summit, but I was particularly wrong there].
  • I think the risk is front-loaded, and low month-to-year ratios are suspicious, but I don't find them that implausible (e.g., one might expect everyone to get to the negotiating table/on emergency calls after nukes are used, and the battlefield to be "frozen/shocked" - so while there would be more uncertainty early on, there would also be more effort and more reasons not to escalate/use more nukes, at least for a short while; these two might roughly offset each other).
  • Yeah, it was my prediction that conjunction vs. direct estimates wouldn't match for people (it is really hard to have a good "sense" of such low probabilities if you are not doing a decomposition). I think we should have checked these beforehand and discussed them with folks.
Hey, thanks for the analysis; we might do something like that next time to improve the consistency of our estimates, either as a team or as individuals. Note that some of the issues you point out are the cost of speed, of working a bit in the style of an emergency response team [], rather than delaying a forecast for longer. Still, I think that I'm more chill and less worried than you about these issues, because, as you say, the aggregation method picked this up: it doesn't take the geometric mean of the forecasts that you colored in red, given that it excludes the minimum and maximum. I also appreciated the individual comparison between chained probabilities and directly elicited ones, and it makes me even more pessimistic about using the directly elicited ones, particularly for <1% probabilities.

The Less Wrong posts Politics as Charity from 2010 and Voting is like donating thousands of dollars to charity from November 2012 have similar analyses to the 2020 80k article.

Agreed that there are some contexts where there's more value in getting distributions, like with the Fermi paradox.

Or, before the grants are given out, you could ask people to give an ex ante distribution for "what will be your ex post point estimate of the value of this grant?" That feeds directly into VOI calculations, and it is clearly defined what the distribution represents. But note that it requires focusing on point estimates ex post.

> Or, before the grants are given out, you could ask people to give an ex ante distribution for "what will be your ex post point estimate of the value of this grant?" That feeds directly into VOI calculations, and it is clearly defined what the distribution represents. But note that it requires focusing on point estimates ex post.

Aha, but you can also do this when the final answer is also a distribution. In particular, you can look at the KL-divergence between the initial distribution and the answer, and this is also a proper scoring rule.
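The KL-divergence score mentioned here can be sketched for discrete distributions; the bucket probabilities below are made up for illustration:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q): how surprised you are by answer distribution p
    when you forecast q. Zero iff p == q; lower is a better score."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical ex post "answer" distribution over three buckets of grant value.
answer = [0.2, 0.5, 0.3]

# Two hypothetical ex ante forecasts of that distribution.
close_forecast = [0.25, 0.45, 0.30]
far_forecast = [0.60, 0.20, 0.20]

# The closer forecast receives the lower (better) KL score.
```

When the ex post answer collapses to a single point, this reduces to the familiar log score, which is likewise a proper scoring rule.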

I think it would've been better to just elicit point estimates of the grants' expected value, rather than distributions. Using distributions adds complexity, for not much benefit, and it's somewhat unclear what the distributions even represent.

Added complexity: for researchers giving their elicitations, for the data analysis, for readers trying to interpret the results. This can make the process slower, lead to errors, and lead to different people interpreting things differently. e.g., For including both positive & negative numbers in the distributions... (read more)

Yeah, you can use a mixture distribution if you are thinking about the distribution of impact, like so [], or you can take the mean of that mixture if you want to estimate the expected value, like so []. Depends on what you are after.
My intuitions point the other way with regards to point estimates vs distributions. Distributions seem like the correct format here: they could allow for value-of-information calculations and sensitivity analysis, highlight disagreements which people wouldn't notice with point estimates, and combine better. The bottom line could also change when using distributions, e.g., as in here []. That said, they do have a learning curve, and I agree with you that they add additional complexity/upfront cost.

In the table with post-discussion distributions, how is the lower bound of the aggregate distribution for the Open Phil AI Fellowship -73, when the lowest lower bound for an individual researcher is -2.4? Also in that row, Researcher 3's distribution is given as "250 to 320", which doesn't include their median (35) and is too large for a scale that's normalized to 100.

Hey, thanks. Should have been -250; updated. This also explains the -73.

I haven't seen a rigorous analysis of this, but I like looking at the slope, and I expect that it's best to include each resolved prediction as a separate data point. So there would be 743 data points, each with a y value of either 0 or 1.
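One way to operationalize the slope idea, sketched with fabricated (perfectly calibrated) forecasts standing in for the real 743 resolved predictions:

```python
import random

random.seed(1)

# Each resolved prediction is one (forecast probability, 0/1 outcome) point.
data = []
for _ in range(743):
    p = random.random()
    data.append((p, 1 if random.random() < p else 0))

# Bin by forecast, then regress observed frequency on mean forecast per bin.
bins = {}
for p, y in data:
    bins.setdefault(int(p * 10), []).append((p, y))

xs, ys = [], []
for b in sorted(bins):
    pts = bins[b]
    xs.append(sum(p for p, _ in pts) / len(pts))  # mean forecast in bin
    ys.append(sum(y for _, y in pts) / len(pts))  # observed frequency

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    / sum((x - x_bar) ** 2 for x in xs)
)

# Perfect calibration gives a slope near 1; systematic overconfidence
# shows up as a slope below 1 (extreme forecasts outrun reality).
```

With real data you would feed in the actual forecast/outcome pairs instead of the simulated ones; the bin width and the choice of regression are both judgment calls.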

There are several different sorts of systematic errors that you could look for in this kind of data, although checking for them requires including more features of each prediction than the ones that are here.

For example, to check for optimism bias you'd want to code whether each prediction is of the form "good thing will happen", "bad thing will happen", or neither. Then you can check if probabilities were too high for "good thing will happen" predictions and too low for "bad thing will happen" predictions. (Most of the example predictions were "good thing... (read more)

2 · Javier Prieto · 5mo
We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet. Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.

Pardon my negativity, but I get the impression that you haven't thought through your impact model very carefully.

In particular, the structure where

Every week, an anonymous team of grantmakers rank all participants, and whoever accomplished the least morally impactful work that week will be kicked off the island. 

is selecting for mediocrity.

Given fat tails, I expect more impact to come from the single highest impact week than from 36 weeks of not-last-place impact.

Perhaps for the season finale you could bring back the contestant who had the highest imp... (read more)

2 · Yonatan Cale · 8mo
Seems like it's more important to encourage discussions-about-impact than it is to encourage impact directly. But I'm not sure; I don't even own a TV.

How much overlap is there between this book & Singer's forthcoming What We Owe The Past?

I got 13/13.

q11 (endangered species) was basically a guess. I thought that an extreme answer was more likely given how the quiz was set up to be counterintuitive/surprising. Also relevant: my sense is that we've done pretty well at protecting charismatic megafauna; the fact that I've heard about a particular species being at risk doesn't provide much information either way about whether things have gotten worse for it (me hearing about it is related to things being bad for it, and it's also related to successful efforts to protect it).

On q6 (age distributi... (read more)

For example: If there are diminishing returns to campaign spending, then taking equal amounts of money away from both campaigns would help the side which has more money.

If humanity goes extinct this century, that drastically reduces the likelihood that there are humans in our solar system 1000 years from now. So at least in some cases, looking at the effects 1000+ years in the future is pretty straightforward (conditional on the effects over the coming decades).

In order to act for the benefit of the far future (1000+ years away), you don't need to be able to track the far future effects of every possible action. You just need to find at least one course of action whose far future effects are sufficiently predictable to guide you (and good in expectation).

The initial claim is that for any action, we can assess its normative status by looking at its long-run effects. This is a much stronger claim than yours.

The initial post by Eliezer on security mindset explicitly cites Bruce Schneier as the source of the term, and quotes extensively from this piece by Schneier.

In most of his piece, by “aiming to be mediocre”, Schwitzgebel means that people’s behavior regresses to the actual moral middle of a reference class, even though they believe the moral middle is even lower.

This skirts close to a tautology. People's average moral behavior equals people's average moral behavior. The output that people's moral processes actually produce is the observed distribution of moral behavior.

The "aiming" part of Schwitzgebel's hypothesis that people aim for moral mediocrity gives it empirical content. It gets harder to pick out the empirical content when interpreting aim in the objective sense.

This is fair. I was trying to salvage his argument without running into the problems mentioned in the above comment, but if he means "aim" objectively, then it's tautologically true that people aim to be morally average, and if he means "aim" subjectively, then it contradicts the claim that most people subjectively aim to be slightly above average (which is what he seems to say in the B+ section). The options are: (1) his central claim is uninteresting, (2) his central claim is wrong, (3) I'm misunderstanding his central claim. And I normally would feel like I should play it safe and default to (3), but it's probably (2).

Unless a study is done with participants who are selected heavily for numeracy and fluency in probabilities, I would not interpret stated probabilities literally as a numerical representation of their beliefs, especially near the extremes of the scale. People are giving an answer that vaguely feels like it matches the degree of unlikeliness that they feel, but they don't have that clear a sense of what (e.g.) a probability of 1/100 means. That's why studies can get such drastically different answers depending on the response format, and why (I predict) eff... (read more)

That can be tested on these data, just by looking at the first of the 3 questions that each participant got, since the post says that "Participants were asked about the likelihood of humans going extinct in 50, 100, and 500 years (presented in a random order)."

I expect that there was a fair amount of scope insensitivity. e.g., That people who got the "probability of extinction within 50 years" question first gave larger answers to the other questions than people who got the "probability of extinction within 500 years" question first.

Do you know if this platform allows participants to go back? (I assumed it did, which is why I thought a separate study would be necessary.)

I agree that asking about 2016 donations in early 2017 is an improvement for this. If future surveys are just going to ask about one year of donations then that's pretty much all you can do with the timing of the survey.

In the meantime, it is pretty easy to filter the data accordingly -- if you look only at donations made by EAs who stated that they joined on 2014 or before, the median donation is $1280.20 for 2015 and $1500 for 2016.

This seems like a better way to do the analyses. I think that the post would be more informative & easier to inter... (read more)

2 · Peter Wildeford · 5y
The median 2016 reported donation total of people who joined on 2015 or before was $655. We'll talk amongst the team about if we want to update the post or not. Thanks!

It is also worth noting that the survey was asking people who identify as EA in 2017 how much they donated in 2015 and 2016. These people weren't necessarily EAs in 2015 or 2016.

Looking at the raw data of when respondents said that they first became involved in EA, I'm getting that:

7% became EAs in 2017
28% became EAs in 2016
24% became EAs in 2015
41% became EAs in 2014 or earlier

(assuming that everyone who took the "Donations Only" survey became an EA before 2015, and leaving out everyone else who didn't answer the question about when they bec... (read more)

2 · Peter Wildeford · 5y
You're right there's a long lag time between asking about donations and the time of the donations... for the most part this is unavoidable, though we're hoping to time the survey much better in the future (asking only about one year of donations and asking just a month or two after the year is over). This will come with better organization in our team. In the meantime, it is pretty easy to filter the data accordingly -- if you look only at donations made by EAs who stated that they joined on 2014 or before, the median donation is $1280.20 for 2015 and $1500 for 2016.

This year, a “Donations Only” version of the survey was created for respondents who had filled out the survey in prior years. This version was shorter and could be linked to responses from prior years if the respondent provided the same email address each year.

Are these data from prior surveys included in the raw data file, for people who did the Donations Only version this year? At the bottom of the raw data file I see a bunch of entries which appear not to have any data besides income & donations - my guess is that those are either all the people who took the Donations Only version, or maybe just the ones who didn't provide an email address that could link their responses.

1 · Peter Wildeford · 5y
All the raw data for all the surveys for all the years are here: []. We have not yet published a longitudinal linked dataset for 2017, but will publish that soon. Those are the people who took the Donations Only version.

It might be possible to fix in a not-too-tedious way, by using find-replace in the source code to edit all of the broken links (and anchors?) at once.

It appears that this analysis did not account for when people became EAs. It looked at donations in 2014, among people who in November 2015 were nonstudent EAs on an earning to give path. But less than half of those people were nonstudent EAs on an earning to give path at the start of 2014.

In fact, less than half of the people who took the Nov 2015 survey were EAs at the start of 2014. I've taken a look at the dataset, and among the 1171 EAs who answered the question about 2014 donations:
40% first got involved in EA in 2013 or earlier
21% first got involved... (read more)

Thanks Dan! I didn't know this, I'll look more closely at the data when I get the chance.

If the prospective employee is an EA, then they are presumably already paying lots of attention to the question "How much good would I do in this job, compared with the amount of good I would do if I did something else instead?" And the prospective employee has better information than the employer about what that alternative would be and how much good it would do. So it's not clear how much is added by having the employer also consider this.

Employees don't know who else is being considered for the position so they don't have as much information about the tradeoffs there as employers do. Alternatively, you could interpret me as saying that employees should consider how their taking a job affects what jobs other people take; although at that point you're looking at fairly far-removed effects and I'm not sure you can do anything useful at that level.

Thanks for looking this up quickly, and good point about the selection effect due to attrition.

I do think that it would be informative to see the numbers when also limited to nonstudents (or to people above a certain income, or to people above a certain age). I wouldn't expect to see much donated from young low- (or no-) income students.

For the analysis of donations, which asked about donations in 2014, I'd like to see the numbers for people who became EAs in 2013 or earlier (including the breakdowns for non-students and for donations as % of income for those with income of $10,000 or more).

37% of respondents first got involved with EA in 2015, so their 2014 donations do not tell us much about the donation behavior of EAs. Another 24% first got involved with EA in 2014, and it's unclear how much their 2014 donations tell us given that they only began to be involved in EA midyear.

That is a very good point, and ties in to vipulnaik's point below about starting the survey collection just after the start of a year, so that donation information can be recorded for the immediately preceding year. I've quickly run the numbers, and the median donation in 2014 for the 467 people who got involved in 2013 or earlier was $1,500, so significantly higher than that for EAs overall. This is not including people who didn't say what year they got involved, so it probably cuts out a few people who did get involved before 2014 but can't remember. Also, if we have constant attrition from the EA movement, then you'd expect the pre-2014 EAs to be more committed as a whole. This is making me lean towards vipulnaik's suggestion for future surveys, as this problem will be just as pressing if the movement continues to grow at the rate it has done.

My guess (which, like Michael's, is based on speculation and not on actual information from relevant decision-makers) is that the founders of Open Phil thought about institutional philosophy before they looked in-depth at particular cause areas. They asked themselves questions like:

How can we create a Cause Agnostic Foundation, dedicated to directing money wherever it will do the most good, without having it collapse into a Foundation For Cause X as soon as its investigations conclude that currently the highest EV projects are in cause area x?

Do we want to... (read more)

A few of these reasons do suggest that it might be useful to make grants in a cause area to stay open to it/keep actively researching it/keep potential grantees aware that you're funding it. This would suggest that it's worthwhile to spend relatively small amounts of money on less promising cause areas, but maintain spending to keep momentum. This does have downsides:

1. It costs money. If you can afford to spend $200 million/year, and you want to spend $5 million/year on each suboptimal cause area, that would easily eat up a quarter to a half of your budget.
2. It costs staff time. You have limited capacity to do research and talk to grantees, so any time spent doing this in a suboptimal cause area is time spent not doing it in an optimal cause area. Maybe you could resolve this by putting only passing investment into less important areas and making grants without investigating them much.

Making grants in secondary cause areas has benefits, but the question is: does it have sufficient benefits to make it better than spending those grants on the strongest cause area(s)? Aside from the fact that I'm skeptical of this claim, Open Phil is fairly opaque about how it makes grant decisions. It produces writeups about the pros and cons of cause areas/grants, which is nice, but that doesn't tell us why this grant was chosen rather than some other grant, or why Open Phil has chosen to prioritize one cause area over another.

And like I said, I'm skeptical of this claim. Perhaps making grants to lots of cause areas promotes EA ideas. But since the standard EA claim is that individual donors should give to the single best cause, maybe a foundation would better promote EA ideas by focusing on the single best area until it has enough funding that it's no longer best on the margin. I don't really know either way, and I don't know how one would know. I'm also not convinced that promoting EA ideas is a good thing.

I can't tell what's being done in that calculation.

I'm getting a p-value of 0.108 from a Pearson chi-square test (with cell values 55, 809; 78, 856). A chi-square test and a two-tailed t-test should give very similar results with these data, so I agree with Michael that it looks like your p=0.053 comes from a one-tailed test.
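For reference, the Pearson chi-square computation on that 2x2 table (cells 55, 809; 78, 856) can be reproduced with only the standard library; with 1 degree of freedom, the p-value is the two-tailed normal tail of sqrt(chi2):

```python
import math

table = [[55, 809], [78, 856]]  # the cell values quoted above
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# with expected counts from the row and column marginals.
chi2 = sum(
    (table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
    for i in range(2)
    for j in range(2)
)

# With 1 degree of freedom, p = P(|Z| > sqrt(chi2)) for standard normal Z.
p_value = math.erfc(math.sqrt(chi2 / 2))  # → approximately 0.108
```

This matches the p = 0.108 figure in the comment (and in the corrected reply below it in the thread).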

1 · Jeff Kaufman · 7y
Yes, you're right. Sorry! I redid it computationally and also got 0.108. Post updated.

A quick search into the academic research on this topic roughly matches the claims in this post.

Meta-analyses by Allen (1991) (pdf, blog post summary) and O'Keefe (1999) (pdf, blog post summary) defined "refutational two-sided arguments" as arguments that include 1) arguments in favor of the preferred conclusion, 2) arguments against the preferred conclusion, and 3) arguments which attempt to refute the arguments against the preferred conclusion. Both meta-analyses found that refutational two-sided arguments were more persuasive than one-sided ar... (read more)

Scott A agrees (see point 8): []

Have you looked at the history of your 4 metrics (Visitors, Subscribers, Donors, Pledgers) to see how much noise there is in the baseline rates? The noisier they are, the more uncertainty you'll have in the effect size of your intervention.

Could you have the pamphlets only give a url that no one else goes to, and then directly track how many new subscribers/donors/pledgers have been to that url?

I think we'll be able to get a standard deviation, as well as a mean, for the baseline values we compare against, which should help determine whether the individual distribution results are significantly different from the baseline rates. I don't think we'll have enough distribution days in the pilot to get the same for the pilot numbers (e.g. we won't be able to tell whether the results of individual distributions are typical of all distributions), but that seems like something we could accumulate over time if we proceed with the program. There is a custom URL only advertised in the pamphlet leading to a "quiz" page, and we will be tracking that, although I would guess that most of the traffic generated by the pamphlets would just be to the main homepage. If someone handed me a pamphlet for an organization and I was interested in learning more, I'd probably just Google the org's name and go from there, rather than type in a URL. But we'll see.