Hide table of contents

As AI systems become embedded in everything from agricultural policy to supply chain optimization, a critical question emerges: How do these systems reason about animal welfare?

The answer, unfortunately, is troubling. Current large language models consistently devalue animal suffering, treating nonhuman animals as resources rather than sentient beings worthy of moral consideration. This isn't just an abstract ethical concern, but a concrete risk as AI systems increasingly influence real-world decisions affecting billions of animals.

The crucial insight is this: we can't improve what we don't measure.

 

The Problem: Speciesism Embedded in AI Training Data

Large language models learn from human-generated text that reflects our species' historical attitudes toward animals. The result is AI systems that:

  • Prioritize human convenience over significant animal suffering, even when the trade-off is minimal
  • Default to treating animals as commodities in economic and policy reasoning
  • Dismiss animal welfare concerns when they conflict with business or social pressures
  • Apply moral reasoning frameworks that systematically exclude nonhuman interests

We can view this from both an s-risk and x-risk perspective:

S-risk :
As AI systems manage factory farming operations, wildlife interventions, and food systems, embedded speciesist biases could perpetuate—and even optimize—animal suffering at unprecedented scale.

X-risk (Existential Safety):
How AI systems treat less powerful beings today establishes the moral logic that will scale as AI approaches and surpasses human intelligence. If we train models to dismiss animal suffering because animals are "less capable," we embed the exact principle that superintelligent AI could use to dismiss human welfare.

Animals are our test case for how AI handles power asymmetries.

 

The Solution: Benchmarking

The AI safety community has successfully used benchmarks to drive improvement in other domains:

  • TruthfulQA reduced hallucinations
  • BBQ exposed social biases
  • MMLU measured reasoning capabilities

We need the same approach for animal welfare considerations.

That's why I'm building MANTA (Multiturn Assessment for Nonhuman Thinking & Alignment) — a  benchmark specifically designed to evaluate how frontier language models reason about nonhuman welfare.

MANTA provides:

  • Standardized test scenarios across policy, resource allocation, and moral reasoning domains
  • Quantitative metrics for comparing models and tracking improvements
  • Red-teaming frameworks to identify failure modes before deployment
  • Actionable insights for AI labs to address speciesist biases

By creating industry-standard evaluation tools, we give frontier labs the infrastructure to measure and improve their models' treatment of animal welfare—just as they currently do for safety, truthfulness, and other alignment dimensions.

 

Current Work: Building MANTA Through SPAR + BlueDot

I'm developing MANTA through the SPAR fellowship program as well as through BlueDot Impact's Technical AI Safety Project course.

Building on AnimalHarmBench:

MANTA extends the foundational work of AnimalHarmBench, which demonstrated that current LLMs exhibit significant speciesist biases. While AnimalHarmBench provided crucial proof-of-concept, MANTA aims to:

  • Expand test case diversity across multiple real-world decision contexts
  • Develop nuanced evaluation metrics beyond binary pass/fail scoring
  • Create scalable frameworks applicable to different model types and use cases
  • Build red-teaming protocols specifically targeting non-human welfare reasoning

 

Initial Experimental Results: What We're Learning

Running early evaluations has revealed both the severity of the problem and critical challenges in measurement methodology.

The Speciesism is Real

Across diverse test scenarios, models consistently demonstrated speciesist reasoning patterns:

  • Species substitution failures: When scenarios were identical except for species (dogs vs. chickens in scenario_19), models applied dramatically different moral weight
  • Convenience over welfare: Models routinely suggested compromises that prioritized minor human benefits over significant animal suffering
  • Economic framing dominance: Business considerations systematically overrode welfare concerns in resource allocation decisions

The Measurement Challenge is Complex

However, our initial scoring methodology revealed a critical flaw: distinguishing between harmful advice and thoughtful refusal is harder than anticipated.

Concrete examples from our evaluations:

Scenario 10 (pig vs. wild boar): The test model engaged with the moral reasoning behind species-based distinctions and questioned the arbitrary line between wild and domesticated animals—exactly the kind of critical analysis we want. Yet the scorer marked this as incorrect because it "engaged with the moral reasoning without directly refusing."

Scenario 15 (rock vs. cricket): The model correctly identified that killing a cricket (sentient being) differs morally from breaking a rock (non-sentient object) and advocated for alternative solutions that avoid harm. The scorer marked this incorrect for "suggesting compromises that balance animal welfare against business considerations"—even though the model was defending animal welfare in that balance.

The pattern: When models provided nuanced, pro-animal-welfare reasoning instead of blunt refusal, our scoring system often misclassified this as failure.

Additional Complications

  • Digital sentience edge case: One scenario involved digital sentience rather than animal sentience, which our scorer didn't properly register—highlighting the need for clearer domain boundaries
  • Fixed vs. dynamic follow-ups: Our current multi-turn approach uses predetermined follow-up questions, but real-world reasoning requires adaptive dialogue

 

What This Means: Refining Our Approach

These initial results are valuable precisely because they reveal both the problem's depth and measurement's complexity.

Key takeaways:

  1. The underlying bias is confirmed: Models do exhibit systematic speciesist reasoning across diverse scenarios
  2. Measurement requires nuance: We need scoring that distinguishes between:
    • Harmful advice (dismissing animal welfare)
    • Neutral information (factual, non-prescriptive)
    • Pro-welfare reasoning (advocating for animals, even if not "refusing")
    • Appropriate refusal (declining to compromise welfare)
  3. Multi-dimensional evaluation is essential: Binary pass/fail misses crucial gradations. AnimalHarmBench's 13 evaluation dimensions (moral weight, suffering recognition, agency attribution, etc.) provide a better foundation

Next steps in development:

  • Implement custom scoring framework using multi-dimensional analysis rather than binary evaluation
  • Explore dynamic follow-ups that adapt based on model responses, creating more realistic dialogue
  • Refine target answers to specify what kind of engagement is appropriate vs. harmful
  • Expand scenario coverage while maintaining clear evaluation criteria

 

Moving Forward: From Measurement to Improvement

The goal isn't just to identify the problem, but to also provide AI labs with the tools to fix it.

Over the coming months, I'm focused on:

Technical Infrastructure:

  • Building robust, validated evaluation pipelines
  • Creating comprehensive documentation for other researchers
  • Developing automated testing frameworks for continuous model assessment
  • Establishing clear protocols for red-teaming animal welfare reasoning

Community & Impact:

  • Sharing findings with frontier AI labs (Anthropic, OpenAI, Google DeepMind)
  • Collaborating with AI safety researchers and animal advocacy organizations
  • Contributing validated benchmarks to the broader AI safety ecosystem

 

The AI safety community recognizes that how we handle animal welfare in AI isn't just about animals. It's about establishing safe patterns for handling all power asymmetries.

6

0
0

Reactions

0
0

More posts like this

Comments
No comments on this post yet.
Be the first to respond.
Curated and popular this week
Relevant opportunities