537 · Joined Feb 2017


Program Coordinator of AI Safety Camp.



Question just to double-check: are posts no longer going to be evaluated for the AI Worldview Prize, given that the FTX Future Fund team has resigned?

I'm just saying that if any AI with external access would be considered dangerous…


I'm saying that general-purpose ML architectures would develop especially dangerous capabilities by being trained through high-fidelity and high-bandwidth input-output interactions with the real outside world.

A specific cruxy statement that I disagree with:

An AI that is connected to the internet and has access to many gadgets and points of contact can better manipulate the world and thus do dangerous things more easily. However, if an AI would be considered dangerous if it had access to some or all of these things, it should also be considered dangerous without it, because giving such a system access to the outside world, either accidentally or on purpose, could cause a catastrophe without further changing the system itself. Dynamite is considered dangerous even if there is no burning match held next to it. Restricting access to the outside world should instead be regarded as a potential measure to contain or control a potentially dangerous AI and should be seen as inherently insecure.

My disagreement here is threefold:

  1. The above statement appears to assume that dangerous transformative AI has already been created, whereas ‘red lines’ set through shared consensus and global regulation should be set to prevent the creation of such AI in the first place (with a wide margin of safety to account for unknown unknowns and for the likelihood that some actors will unilaterally attempt to cross the red lines anyway).
  2. My rough sense is that the most dangerous kind of ‘general’ capabilities that could be developed in self-learning machine architectures are those that can be directed to enact internally modelled changes over physical distances within many different contexts of the outside world. These are a different kind of capability than, e.g., containing general knowledge about facts of the world, or, say, making calibrated predictions of the final conditions of linear or quasi-linear systems in the outside world.

    Such 'real-world' capabilities seem to need many degrees of freedom in external inputs and outputs to be iteratively trained into a model.

    This is where the analogy between AI's and dynamite's potential for danger breaks down:
    - Dynamite has explosive potential from the get-go (fortunately limited to a physical radius) but stays (mostly) chemically inert after production. It does not need further points of contact with its physical surroundings to acquire this potential for harming humans.
    - A self-learning machine architecture gains increasing potential for wide-scale human lethality (through general modelling/regulatory functions that could be leveraged or repurposed to modify conditions of the outside environment in self-reinforcing loops that humans can no longer contain) via long causal trajectories in which the architecture's internals have interacted at many contact points with the outside world. The initially produced 'design blueprint' does not acquire this potential immediately upon production of the needed hardware and initialisation of the model weights.

    If engineers end up connecting more internet channels, sensors and actuators for large ML model training and deployment while continuing to tinker with the model’s underlying code base, then from a control engineering perspective they are setting up a fragile system that is prone to cascading failures in the future. Engineers should IMO not be connecting up what amounts to self-learning spaghetti code for open-endedly learning and autonomously enacting changes in the real world. In my view, this would be engineering malpractice, with practitioners grossly negligent in preventing risks to humans living everywhere on the planet.
  3. You can have hidden functional misalignments selected for through local interactions of code internal to the architecture with its embedded surroundings. Here are arguments I wrote on that:

    A model can be trained to cause effects we deem functional, but through different interactions with structural aspects of the training environment than we expected. Such a model’s intended effects are not robust to shifts in the distribution of input data received when the model is deployed in new environments. Example: in deployment, a game agent ‘captures’ a wall rather than the coin it was trained to collect (the coin in training happened to sit next to the right-most wall).

    Compared to side-scroller games, real-life interactions are much more dimensionally complex. If we train a Deep RL model on a high-bandwidth stream of high-fidelity multimodal inputs from the physical environment in interaction with other agentic beings, we have no way of knowing whether any hidden causal structure was selected for and stays latent even during deployment test runs… until a rare set of interactions triggers it to cause outside effects that are out of line.

    Core to the problem of goal misgeneralization in machine learning is that latent functions of internal code are being expressed under unknown interactions with the environment. A model that coherently overgeneralizes functional metrics over human contexts is concerning but trackable. Internal variance being selected to act out of line all over the place is not trackable.

    Note that an ML model trains on signals that are coupled to existing local causal structures (as simulated on, e.g., localized servers or as sensed within local physical surroundings). Thus, the space of possible goal structures that can be selected for within an ML model is constrained by features that can be derived from data inputs received from local environments. Goals are locally selected for and are thus partly non-orthogonal to (cannot vary independently of) intelligence.
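A minimal toy sketch of the coin-versus-wall failure, assuming a hypothetical 1-D corridor of `WIDTH` cells. This is not the original game environment and no learning happens here: the two hand-written policies, `seek_coin` and `seek_right_wall`, are illustrative stand-ins for goal structures that training could select. Because the coin always sits at the right-most wall during training, both policies take identical actions on every training state and earn identical reward, so the training signal cannot tell them apart; only the shifted deployment layouts reveal which goal was selected.

```python
WIDTH = 10  # hypothetical corridor with cells 0..WIDTH-1; the right-most wall is at WIDTH-1


def seek_coin(pos, coin):
    """Intended goal (illustrative): step toward the coin, wherever it is."""
    if coin > pos:
        return 1
    if coin < pos:
        return -1
    return 0


def seek_right_wall(pos, coin):
    """Proxy goal (illustrative): always step right, ignoring the coin."""
    return 1


def run_episode(policy, start, coin, max_steps=2 * WIDTH):
    """Return 1 if the agent lands on the coin within max_steps, else 0."""
    pos = start
    for _ in range(max_steps):
        if pos == coin:
            return 1
        # Apply the chosen step, clamped to the corridor walls.
        pos = min(max(pos + policy(pos, coin), 0), WIDTH - 1)
    return int(pos == coin)


def average_reward(policy, layouts):
    """Mean episode reward over a list of (start, coin) layouts."""
    return sum(run_episode(policy, start, coin) for start, coin in layouts) / len(layouts)


# Training distribution: the coin always sits next to the right-most wall.
train_layouts = [(start, WIDTH - 1) for start in range(WIDTH - 1)]
# Deployment distribution (shifted): the agent starts mid-corridor and the coin is to its left.
deploy_layouts = [(WIDTH // 2, coin) for coin in range(WIDTH // 2)]

# On every state visited in training (pos < coin), both policies choose the same action.
same_on_train = all(
    seek_coin(pos, WIDTH - 1) == seek_right_wall(pos, WIDTH - 1) for pos in range(WIDTH - 1)
)

print("identical actions in training:", same_on_train)
for name, policy in [("seek_coin", seek_coin), ("seek_right_wall", seek_right_wall)]:
    print(name, "train:", average_reward(policy, train_layouts),
          "deploy:", average_reward(policy, deploy_layouts))
```

Running it shows a training reward of 1.0 for both policies, but a deployment reward of 1.0 for `seek_coin` and 0.0 for `seek_right_wall`. In this toy, nothing in the training data could have selected between the two goals; the difference only surfaces once the input distribution shifts.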

Maybe interesting. A friend writing a draft asked me for some posts for background.

Here are posts that came to mind off the top of my head (do suggest posts I missed):






Glad this opened up a richer discussion :)

I probably trust the monetary establishment a bit more and see newer proposals as more predictable / within the historical distribution of what we've seen before.

Got it.

Happy to have a chat in February, btw. I'll try to read the paper you linked to before our call in that case:

Monetary policy seems like some strange dark art where we think we have more control over the world than we actually do, and we’re probably not seeing some of the most important unintended consequences of our actions. You have to do the best you can based on what you understand, so I mostly agree with the monetary policy regime, but wouldn’t be surprised if we look at things very differently in 50 or 100 years. (The history of the subject is really rather recent — the Federal Reserve was born in 1913, and Nixon only fully took us off the gold standard in 1971.)

This resonates with me (it's interesting, though, how you and I still draw different conclusions on whether to build up the supply of monetary stimulus or not; perhaps this reflects underlying differences in how we each relate to the use of leaky abstractions like MMT in practice?).

Thanks, I appreciate learning from you here. 

A separate topic you might be interested in would be Modern Monetary Theory.

I'll make time to read this paper and the arguments for MMT policy when/if OpenPhil writes up their updated review later this year. For now, I've got to focus on digging into other research and research programs.

Hi Peter, thank you too for your brief and clear response to the stated concerns and to others' thoughtful comments.

Looking forward to reading any follow-up review you get to write on this subject later this year. 

I guess Joseph Tainter’s historical work is also somewhat relevant: it covers past societies like ancient Rome that became increasingly spread out, specialised and organisationally complex, to the point where political leaders could no longer centrally fund regulatory governance and maintenance of their society except by debasing the common currency of trade (though I think an analogy to the current US situation only holds if the Fed ‘printing money’ induces inflation of the US currency down the line). From this perspective, monetary stimulus could correspond to an implicit attempt at maintaining US organisational complexity at a level that is no longer tenable.

This is a sharp question (hence the strong upvote). I appreciate this.

I’m an amateur here, so do take any more specific thoughts I have on this (that go beyond ‘this seems really uncertain/fuzzy and potentially systemically damaging’ and ‘let’s not get stuck in conflicts with people who wield differently insightful ideological views’) with a grain of salt.

Some concerns that come to mind:

  • fostering the expectation amongst institutions and companies that the US Fed will centrally take care of any economic threat or crisis whenever one appears disincentivises locally responsible leaders from stockpiling needed money and materials to ensure their organisation weathers the storm, and from designing regulations and programs to avert, prepare for and deal with outside shocks in more targeted and concretely verifiable ways.
  • long-run USD inflation (in twenty years or more)
  • loss of trust in the United States Dollar as a ‘world reserve currency’, with foreign governments shifting their funds to some other currency or portfolio of currencies (this would be bad for US trade interests, but could be good from a cosmopolitan perspective if what replaces it is some common currency storage mechanism that doesn’t let first movers gain outsized wealth/decision power in bits and zeros and thereby indirectly disincentivise work by entrepreneurs and workers elsewhere).

I don’t feel satisfied by the above list. I think a reasonable counterargument is that those vaguely possible consequences don’t justify withholding monetary stimulus right now that could avert obvious harms experienced by citizens. I would be wary, though, of getting anchored on requiring specific claims here.
