David Johnston


Linch's Shortform

The world's first slightly superhuman AI might be only slightly superhuman at AI alignment. Thus, if creating it was a suicidal act by the world's leading AI researchers, it might be suicidal in exactly the same way. On the other hand, if it has a good grasp of alignment, then its creators might also have a good grasp of alignment.

In the first scenario (but not the second!), creating more capable but not fully aligned descendants seems like it must be a stable behaviour of intelligent agents, as by assumption

  1. behaviour of descendants is only weakly controlled by parents
  2. the parents keep making better descendants until the descendants are strongly superhuman

I think Buck is also right that the world's first superhuman AI might have a simpler alignment problem to solve.

Simple comparison polling to create utility functions

I would be interested in this same concept but framed so as to compare personal utility instead of impersonal utility, because I feel like I'm trying to estimate other people's values for personal utility and aggregate them in order to get an idea of impersonal utility. It seems tricky, though:

 - How many {50} year old {friends/family members/strangers} would you save vs {5} year old {friends/family members/strangers}?

This seems straightforward, except maybe it's necessary to add "considering only your own benefit" if we want personal utilities that we can aggregate instead of a mixture of personal and impersonal utilities.

 - How many 50 year old yourselves would you save vs 5 year old yourselves?

This one doesn't make much sense to me, and if I try to frame it differently, e.g.

"imagine a group of 50-74 year olds and a group of <5 year olds. There's a treatment that saves {X} members of the older group and {Y} of the younger group, and the <5 year olds dictate who gets it. What is the minimum X:Y at which there is a 50% chance of their choosing the older group?"

then my first thought is that there's no way to sensibly answer this question, because 3 year olds are incredibly stubborn and also won't understand it.
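For what it's worth, the first kind of question does suggest a straightforward aggregation. Here's a toy sketch (entirely my own illustration; the response numbers are made up) of pooling trade-off answers into a relative utility weight:

```python
import math

# Hypothetical responses: respondent i answers "I'd save r_i 50 year olds
# rather than one 5 year old", so r_i is their implied utility ratio
# u(5 year old) / u(50 year old).
responses = [2.0, 3.0, 1.5, 4.0]

def aggregate_ratio(ratios):
    # Geometric mean: ratio judgements average more naturally on a log scale,
    # and the result is invariant to which group we treat as the numerator.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(aggregate_ratio(responses))  # pooled ratio, here 36 ** 0.25, about 2.45
```

Whether the pooled number means anything still depends on the "considering only your own benefit" framing issue above.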

Anyway, I don't know if this is very helpful, but that was my first response to the app and the result of my first few minutes thinking about it.

What would you do if you had a lot of money/power/influence and you thought that AI timelines were very short?

One thing I'd want to do is to create an organisation that builds networks with as many AI research communities as possible, monitors AI research as comprehensively as possible, and assesses the risk posed by different lines of research.

Some major challenges:

  • a lot of labs want to keep substantial parts of their work secret, and even more so for e.g. military research
  • encouraging more sharing of knowledge might inadvertently spread knowledge of how to do risky things
  • even if we know someone is doing something risky, it might be hard to get them to change
  • it might be hard to see in advance which lines of research are risky

I think networking + monitoring + risk assessing together can help with some of these challenges. Risk assessing + monitoring: we have a better idea of what we do and don't need to know, which helps with the first and second issues. Also, if we have good relationships with labs we are probably better placed to come up with proposals that reduce risk while not hindering lab goals too much.

Networking might also help us learn where relatively unmonitored research is taking place, even if we can't find out much more about it.

It would still be quite hard to have a big effect, but I think even knowing partially who is taking risks is pretty valuable in your scenario.

Donating money, buying happiness: new meta-analyses comparing the cost-effectiveness of cash transfers and psychotherapy in terms of subjective well-being

I share this concern. I don't have much of a baseline for how much meta-analyses overstate effect sizes, but I suspect it is substantial.

One comparison I do know about: as of about 2018, the average effect size of the unusually careful studies funded by the EEF (https://educationendowmentfoundation.org.uk/projects-and-evaluation/projects) was 0.08, while the mean of meta-analytic effect sizes overall was allegedly 0.40 (https://visible-learning.org/hattie-ranking-influences-effect-sizes-learning-achievement/), suggesting that meta-analysis in that field yields effect sizes about five times higher than is realistic on average.

The point is, these concerns cannot be dealt with simply by suggesting that they won't make enough difference to change the headline result; in fact they could.

If this issue was addressed in the research discussed here, it's not obvious to me how it was done.

GiveWell rated the evidence of impact for GiveDirectly as "Exceptionally strong", though it's not clear exactly what this means with regard to the credibility of studies that estimate the size of the effect of cash transfers on wellbeing (https://www.givewell.org/charities/top-charities#cash). Nevertheless, if a charity were being penalized in such comparisons for doing rigorous research, then I would expect to see assessments like "strong evidence, lower effect size", which is what we see here.

Donating money, buying happiness: new meta-analyses comparing the cost-effectiveness of cash transfers and psychotherapy in terms of subjective well-being

I think the follow-up is much more helpful, but I found the original helpful too. It may be possible to convey the same content less rudely, but "I think StrongMinds' research is poor" is still a useful comment to me.

On the Universal Distribution

One thing to think about: in order to reason about "observations" using mathematical theory, we need to (and do) convert them into mathematical things. Probability theory can only address the mathematical things we get in the end.

Most schemes for doing this ignore a lot of important stuff. E.g. "measure my height in cm, write down the answer" is a procedure that produces a real number, but also one that is indifferent to almost every "observation" I might care about in the future.

(The quotes around observation are to indicate that I don't know if it's exactly the right word).

One thing we could try to do is to propose a scheme for mathematising every observation we care about. One way we could try to do this is to come up with a sequence of questions "are my observations like X or not like X?". Then the mathematical object our observations become will be a binary sequence. In practice, this will never solve the problem of distinguishing any two observations we care to distinguish, but maybe imagining something like this that goes on forever is not a bad idealization, in the sense that we might care less and less about the remaining undistinguished observations.

Can this story capture something like the tale of the universal prior? The problem here is that what I've described looks a bit like a Turing machine -- it outputs a sequence of binary digits -- but it isn't a Turing machine because it has no well-defined domain. In fact, the problem of getting from a vague domain to something mathematical is what it was meant to solve to begin with.

One way we can conceptualize inputs to this process is to postulate "more powerful observers". For example, if I turn an observation into n binary questions, a more powerful observer is one that asks the same n questions and also asks one more. Then our "observation process" is a Turing machine that takes the output of the more powerful observer and drops the last digit.
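As a toy sketch of this construction (entirely my own illustration; the particular "questions" are hypothetical), the observation process is just a computable map from the stronger observer's answer tuple to ours:

```python
# A weak observer asks n binary questions about a situation; a more powerful
# observer asks the same n questions plus one more. The weak observer's
# "observation process" is then a trivial machine acting on the powerful
# observer's output: drop the last bit.

def weak_observer(situation):
    # two hypothetical binary questions about the situation
    return (situation["raining"], situation["cold"])

def powerful_observer(situation):
    # the same two questions, plus one more
    return (situation["raining"], situation["cold"], situation["windy"])

def observation_process(bits):
    # the computable map from the stronger observer's output to ours
    return bits[:-1]

situation = {"raining": True, "cold": False, "windy": True}
assert observation_process(powerful_observer(situation)) == weak_observer(situation)
```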

However, if we consider the n->infinity limit of this, it seems consistent to me that the more powerful observer could be an anti-inductor or a randomiser relative to us every step of the way.

So it seems that this story at least requires an assumption like "we can eventually predict the more powerful observer perfectly".

There are lots of other ways to make more powerful observers, they just need to be capable of distinguishing everything our observation process distinguishes.

Why aren't you freaking out about OpenAI? At what point would you start?

I think this post and Yudkowsky's Twitter thread that started it are probably harmful to the cause of AI safety.

OpenAI is one of the top AI labs worldwide, and the difference between their cooperation and antagonism to the AI safety community means a lot for the overall project. Elon Musk might be one of the top private funders of AI research, so his cooperation is also important.

I think that both this post and the Twitter thread reduce the likelihood of cooperation without accomplishing enough in return. The potential to harm cooperation is about the same for a well-researched, well-considered comment as for an off-the-cuff one, but the potential to do good is much higher for comments of the first type. So, for comments that might cause offense, the standard for research and consideration should be higher than usual.

This post: it's extremely hard to understand what exactly OpenAI is being accused of doing wrong. Your sentence "The small marginal impact of having anything to do with popularizing AI Safety dominates any good these movements many have produced." reads to me as an argument that Yudkowsky is wrong, and that the launch leading indirectly to more AI safety discourse means it was a positive. However, this doesn't match the valence of your post.

Your second argument, that most of their safety researchers left, is indeed some cause for concern (edit: although seemingly quite independent of your first point). However, surely it is possible to ask the departed safety researchers whether they themselves think their departures should be taken as a vote of no confidence in OpenAI's commitment to safety, before advocating actions to be taken against the organisation. To clarify: you may or may not get a helpful response, but this is an easy thing to do, it is clearly a reasonable step if you are wondering what these departures mean, and I think you should take such easy and reasonable steps before advocating a position like this.

If OpenAI is pursuing extremely risky research without proper regard to safety, then the argument set out here ought to be far stronger. If not, then it is inappropriate to advocate doing harm to OpenAI researchers.

The Twitter thread: to an outsider, it seems like the concerns regarding the language employed at OpenAI's launch were resolved quickly in a manner that addressed the concerns of safety advocates. If the resolution did not address their concerns, and safety advocates think this should be widely known, then that should be explained clearly, and this thread did no such thing.

It looked to me like Yudkowsky was arguing, as he often likes to, that contributions to AI risk are cardinally greater than contributions to anything else when assessing someone's impact. It is not obvious to me that he intended this particular episode to have more impact than his many other statements to this effect. Nonetheless, it seems to have done so (at least, I'm seeing it pop up in several different venues), and I at least would appreciate it if he could clarify whether there is an ongoing issue here and, if so, what it is.

How would you run the Petrov Day game?

It seems like the game would better approximate the game of mutually assured destruction if the two sides had unaligned aims somehow, and destroying the page could impede "their" ability to get in "our" way.

Maybe the site that gets more new registrations on Petrov Day has the right to demand that the loser advertise something of their choice for 1 month after Petrov Day. Preferably, make the competition something that will be close to 50/50 beforehand.

The two communities could try to negotiate an outcome acceptable to everyone or nuke the other to try to avoid having to trust them or do what they want.

The motivated reasoning critique of effective altruism

Here's one possible way to distinguish the two: under the optimizer's curse + judgement stickiness scenario, retrospective evaluation should usually take a step towards the truth, though it could be a very small one if judgements are very sticky! Under motivated reasoning, retrospective evaluation should take a step towards the "desired truth" (or some combination of truth and desired truth, if the organisation wants both).
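A minimal toy model of that distinction (my own sketch; the step size, weights, and numbers are made-up parameters, not anything from the post):

```python
def reevaluate(initial, truth, desired, motivated_weight, stickiness=0.7):
    # One retrospective re-evaluation step. With motivated_weight = 0 this is
    # the optimizer's curse + stickiness story: the estimate moves part of the
    # way toward the truth. With motivated_weight > 0 it moves toward a blend
    # of the truth and the organisation's desired conclusion instead.
    target = (1 - motivated_weight) * truth + motivated_weight * desired
    return initial + (1 - stickiness) * (target - initial)

truth, desired, initial = 1.0, 3.0, 2.5  # an initially inflated estimate
honest = reevaluate(initial, truth, desired, motivated_weight=0.0)
motivated = reevaluate(initial, truth, desired, motivated_weight=0.8)
assert abs(honest - truth) < abs(initial - truth)    # steps toward the truth
assert abs(motivated - truth) > abs(honest - truth)  # barely moves, or moves away
```

The observable difference is the direction of the step, not its size, which is why very sticky honest judgements can still be told apart from motivated ones in expectation.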

The motivated reasoning critique of effective altruism

I like this post. Some ideas inspired by it:

If "bias" is pervasive among EA organisations, the most direct implication of this seems to me to be that we shouldn't take judgements published by EA organisations at face value. That is, if we want to know what is true we should apply some kind of adjustment to their published judgements.

It might also be possible to reduce bias in EA organisations, but that depends on other propositions like how effective debiasing strategies actually are.

A question that arises is "what sort of adjustment should be applied?". The strategy I can imagine, which seems hard to execute, is: try to anticipate the motivations of EA organisations, particularly those that aren't "inform everyone accurately about X", and discount those aspects of their judgements that support these aims.

I imagine that doing this overtly would cause a lot of offence: A) because it involves deliberately standing in the way of some of the things that people at EA organisations want, and B) because I have seen many people react quite negatively to accusations of the form "you're just saying W because you want V".

Considering this issue - how much should we trust EA organisations - and this strategy of trying to make "goals-informed" assessments of their statements, it occurs to me that a question you could ask is "how well has this organisation oriented itself towards truthfulness?".

I like that this post has set out the sketch of a theory of organisational truthfulness. In particular:
"In worlds where motivated reasoning is commonplace, we’d expect to see:

  1. Red-teaming will discover errors that systematically slant towards an organization’s desired conclusion.
  2. Deeper, more careful reanalysis of cost-effectiveness or impact analyses usually points towards lower rather than higher impact."

Presumably, in worlds where motivated reasoning is rare, red-teaming will discover errors that slant towards and away from an organisation's desired conclusion and deeper, more careful reanalysis of cost-effectiveness points towards lower and higher impact equally often.
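That symmetric-slant expectation can be checked with a simple sign test. A sketch of my own (the 9-of-10 audit numbers are hypothetical):

```python
from math import comb

def sign_test_p(n_toward_desired, n_total):
    # Two-sided p-value for the null hypothesis that each discovered error
    # independently slants toward the desired conclusion with probability 1/2.
    k = max(n_toward_desired, n_total - n_toward_desired)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)

# Hypothetical red-team audit: 9 of 10 discovered errors slanted toward the
# organisation's desired conclusion, which is unlikely under symmetric slant.
print(sign_test_p(9, 10))  # about 0.021
```

Of course, with the small samples a single audit yields, only fairly extreme asymmetries would be detectable this way.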

I note that you are talking about a collection of organisations while I'm talking about a specific organisation. I think you are approaching it from "how can we evaluate truth-alignment" and I'm thinking about "what do we want to know about truth-alignment". Maybe it is only possible to evaluate collections of organisations for truth-alignment. At the same time, I think it would clearly be useful to know about the truth-alignment of individual organisations, if we could.

It would be interesting, and I think difficult, to expand this theory in three ways:

  1. To be more specific about what "an organisation's desired conclusion" is, so we can unambiguously say whether something "slants towards" it
  2. Consider whether there are other indications of truth-misalignment
  3. Consider whether it is possible to offer a quantitative account of (A) the relationship between the degree of truth-misalignment of an organisation and the extent to which we see certain indications like consistent updating in the face of re-analysis and (B) the relationship between an organisation's truth-misalignment and the manner and magnitude by which we should discount their judgements

To be clear, I'm not saying these things are priorities, just ideas I had and haven't carefully evaluated.
