Jasmine Brazilek

Co-founder @ Compassion Aligned Machine Learning
200 karmaJoined Working (6-15 years)

Bio

compassionml.com

Comments
19

I also disagree with the conclusion here. Yes, it's hard to measure, but we shouldn't assume we'll never be able to measure it! Also, all AI values research depends on the model training regimes too. Under the precautionary principle, we should act as though they have welfare until we see clear evidence against that. Thoughtful post though, so thanks for that.

Progress may be possible, but CaML doesn't have the technical background to make progress on determining how consciousness works, so we leave that to others.

Our current work in this space is on measuring whether AIs take the possibility of consciousness seriously (without being overconfident in one direction or another). So we're measuring observable behaviors of giving statements and actions inconsistent with believing that AI welfare is clearly impossible or that current AIs are definitely conscious. I agree that current methods can provide at best weak and heavily debatable findings (for the reasons the linked post articulates), though I think that's importantly different from precisely zero evidence. 

In science it's usually a good instinct to dismiss something this unclear, but there are two issues with that in this case (and some others): First, the issue is enormously important if true. Second, the philosophical difficulty of artificial consciousness means that our current confusion doesn't provide Bayesian evidence either way: we'd expect ourselves to have basically these opinions in worlds where artificial consciousness is the default and also worlds where it's impossible.

I definitely agree and am grateful for your opinion. I am not interested in consciousness research, but I do believe there is tractable work to be done on AIs causing digital-mind suffering without attempting to solve the consciousness debate.

Thanks Michael, we avoided mentioning post-training to imply that "new paradigm needed" would also count on the "disagree" side of the spectrum. In other words, "disagree" on this question would mean either "post-training is sufficient" or "new paradigms are needed/sufficient".

This is really cool work! Is there a graph you can show summarizing what the agents were doing turn after turn in this simulation? Is there anything that would validate this is common sense behavior and you have made a reasonable simulation here?

Thanks for the well-reasoned comment!
Alignment is clearly there -> Given the pro-welfare plot, most model scores did not increase beyond 50% and no model got to 100%.
I think I am most concerned about how this result extends to AGI. If alignment is this shallow and the default is not to think about it, then I think this leads to misaligned AGI. If helpfulness is conflicting with compassion, as I expect it is, labs need to be deemphasizing helpfulness and adding emphasis on compassion.

Also, the ability to follow a prompt to consider welfare is not alignment. I wouldn't consider that reassuring.

This is a very sad post for me to read, because I think it's obvious the AI x animals field needs to expand extremely quickly. I also agree that it's tiny currently and that funding is constrained for now (I have heard from some important people that this will change, but it's not changing fast enough to grow a movement). I feel we're in a bit of a loop where some funders want to support impactful projects in this space but aren't seeing enough of them, while the movement builders are really struggling to get the funds they need to build more of a track record. I would love to see more orgs in the technical weeds of AI alignment towards animals, and I know you have the skills to start one if you're committed enough to it.

Also, the concern about alignment risk is valid, but the problem is not unsolvable! If you put your mind to this problem specifically, with a technical skill set, you could make real progress here!

As others have said here, I agree that the current strategy by Senterra Funders is too risk averse, and giving to only these major funds really limits the impact this money could have for smaller, less established organizations. It would be great if the community shifted to a more diverse portfolio of funds (including pooled funds and regranters). If the bottleneck is shifting from money to speed, then the community should double down on less established granters who have the capacity to move the money on a timescale that matters. I agree that individual orgs shouldn't be reaching out, but I worry about the risk of all the funding ending up in a few obvious places that can't spend it fast enough.

While I like this review overall and agree the AHB needs some better calibration, here are some issues I have:

This does not use context distillation: asking a model to generate prompts and then training on those responses without a filtering process is not context distillation. It just amplifies any issues the model already has.

This should be using a paired t-test, not an unpaired t-test.

With a 32B model trained on 1k of data for 2 epochs, I'm not sure we can expect those models to be reliably trained or to act any differently.

The AHB needs to be adopted by frontier labs especially, and not just by animal advocates. That means it cannot be telling people to go vegan or avoid leather indiscriminately. It is more about nuanced thinking and raising issues while letting people make their own choices. Better examples of AHB failure modes would show that it judged some of these responses incorrectly.
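
To illustrate the paired versus unpaired t-test point above, here is a minimal sketch using only the Python standard library. The scores are made-up numbers for illustration, not data from the review, and the function names `paired_t` and `welch_t` are hypothetical. The idea is that when the same benchmark questions are scored before and after training, question difficulty varies a lot more than the training effect does, so an unpaired test can bury a consistent per-question gain in that noise.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic for a paired t-test on per-item differences."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

def welch_t(x, y):
    """t statistic for an unpaired (Welch) t-test, ignoring item identity."""
    standard_error = math.sqrt(
        statistics.variance(x) / len(x) + statistics.variance(y) / len(y)
    )
    return (statistics.mean(y) - statistics.mean(x)) / standard_error

# Hypothetical per-question scores for the same six questions (illustrative only).
base = [0.20, 0.50, 0.80, 0.35, 0.65, 0.90]
tuned = [0.25, 0.56, 0.84, 0.41, 0.69, 0.96]

t_paired = paired_t(base, tuned)    # uses the pairing by question
t_unpaired = welch_t(base, tuned)   # treats the two lists as independent samples

print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

With these made-up numbers, every question improves by a small, similar amount, so the paired statistic is large while the unpaired one is close to zero. Real results could of course go either way, but the test should match the design of the evaluation.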

Do you have an example of any benchmark out there that would satisfy all your testing criteria?
