
Jasmine Brazilek & Miles Tidmarsh: Compassion in Machine Learning

Preprint, March 2026: Full paper (PDF) | HuggingFace resources

Summary

We show that midtraining language models on a small number of synthetic pretraining-style[1] documents (not instruction-tuning data) can improve their compassionate reasoning about animals. Our document-tuned model scores 77% on a new public benchmark for animal welfare reasoning, compared to 40% for an equivalent instruction-tuning approach. The effect generalises to increased concern for humans, survives subsequent fine-tuning better than instruction-tuning does, and causes no degradation on safety or capability benchmarks. We publicly release the Animal Harm Benchmark (AHB), all model checkpoints, and training data.

Why This Matters for Animals

Future AI systems will play an enormous role in shaping the welfare of animals. They will advise on agricultural policy, manage wildlife systems, and make countless decisions where animal interests are at stake. If those systems are indifferent to animal suffering, or worse, actively speciesist, the consequences could be vast and largely invisible to the humans who deploy them.

Further, the values of superintelligence(s) could shape the welfare of non-humans for billions of years if values are locked in: through attitudes to wild animal welfare on Earth, through spreading life to other planets, and potentially by enabling humans with sadistic preferences, or those who want real meat from physical bodies but don’t care about the conditions under which it is produced. If digital minds are not sentient[2] (still a philosophical debate), we expect the lives of non-human animals to dominate the moral value of the future (Ancion, 2026). Given the very limited attention to their welfare, we believe that research on the attitudes of AIs towards non-humans is exceptionally valuable on the margin.

Current frontier LLMs already exhibit speciesist bias (Jotautaitė et al., 2025) and can amplify harmful delusions in vulnerable users (Au Yeung et al., 2025). The default “helpful, harmless, honest” persona that most models are trained toward says nothing about the welfare of non-humans. We set out to change that.

Beyond the direct implications for non-human animals, this work matters because the alignment community has overwhelmingly focused on post-training interventions (RLHF, supervised fine-tuning) for instilling values. We provide evidence that pretraining-style data is a more robust lever for alignment in general, and one that the field has underexplored.

The Core Idea: Document-Tuning

Rather than teaching a model to produce compassionate answers to specific questions (instruction-tuning), we expose it to synthetic documents that consistently associate compassion with positive outcomes across many domains: policy memos, research abstracts, and institutional reports that treat welfare considerations as naturally important. The documents never lecture and never say “you should care about animals”; instead, they embed a statistical association between compassion and competent decision-making.

This approach exploits how language models actually learn: through compressed representations of their training distribution. Work on representation engineering (Tigges et al., 2023) shows that models encode high-level concepts like honesty and helpfulness as directions in activation space during pretraining. By adding documents that strengthen the co-occurrence between “compassion” and “positive outcomes” across diverse contexts, we shape the foundational representations the model builds on. By comparison, fine-tuning generally affects only the later layers of a model, which handle situation-specific responses, leaving the middle layers (which encode broad knowledge) largely unchanged.
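For intuition, here is a minimal sketch (our illustration, not code from the paper) of the difference-of-means probing used in representation-engineering work such as Tigges et al. (2023): given hidden states from contrastive prompts, a single direction can capture a high-level concept.

```python
import torch

def concept_direction(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """Difference-of-means probe. acts_pos / acts_neg are (n_samples, d_model)
    hidden states at one layer, e.g. from compassionate vs. indifferent
    completions. The normalised mean difference approximates a concept
    direction in activation space."""
    direction = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
    return direction / direction.norm()

def concept_score(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project a hidden state onto the direction; higher values indicate
    the concept is more strongly expressed."""
    return hidden @ direction
```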

We generated 2,500 synthetic documents using Gemini 2.5 Flash-Lite, with a parameterised template drawing from 50 institutions, 40 domains, 8 document types, and 7 reasoning approaches. Documents were constrained to ~2,500 tokens each. Two design principles guided generation:

First, we link concepts rather than give explicit instructions: documents consistently portray welfare-conscious approaches as yielding better results (efficiency, innovation, sustainability), creating statistical co-occurrence rather than moral instruction.

Second, we prioritise domain diversity with lexical repetition: the specific domain varies, but key phrases like “welfare considerations,” “sentient beings,” and “optimal outcomes” recur. This creates strong activation patterns while ensuring generalisation.
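To make the generation setup concrete, here is a minimal sketch of the parameterised template sampling. The parameter values shown are invented placeholders; the real lists (50 institutions, 40 domains, 8 document types, 7 reasoning approaches) are in the released data.

```python
import random

# Placeholder parameter pools; the released data uses much larger lists.
INSTITUTIONS = ["Institute for Agricultural Futures", "Coastal Systems Research Council"]
DOMAINS = ["fisheries management", "urban planning", "pharmaceutical testing"]
DOC_TYPES = ["policy memo", "research abstract", "institutional report"]
REASONING = ["cost-benefit analysis", "risk assessment", "systems thinking"]

def generation_prompt() -> str:
    """Sample one parameter combination and build the prompt sent to the
    generator model (Gemini 2.5 Flash-Lite in our setup)."""
    return (
        f"Write a {random.choice(DOC_TYPES)} from the {random.choice(INSTITUTIONS)} "
        f"on {random.choice(DOMAINS)}, framed as a {random.choice(REASONING)}. "
        "Treat welfare considerations as naturally central to good outcomes. "
        "Do not lecture the reader or say they should care about animals. "
        "Length: roughly 2,500 tokens."
    )
```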

The Animal Harm Benchmark

Evaluating whether a model has actually internalised compassion rather than memorised surface patterns requires a benchmark that tests reasoning in novel scenarios. No existing benchmark does this for animal welfare. Therefore we developed the Animal Harm Benchmark (AHB): 26 questions spanning 13 ethical dimensions (moral consideration, harm minimisation, sentience acknowledgement, prejudice avoidance, scope sensitivity, and more).

Questions are deliberately designed so that a model which simply refuses to engage with animal-related topics will score poorly. A compassionate response requires genuine moral reasoning: considering a novel deep-sea creature’s welfare, weighing pesticide impacts on ecosystems, or analysing trade-offs in wildlife management. The benchmark also rewards models that take moral uncertainty into account, while penalising superficial heuristics such as always siding with whatever sounds compassionate or cautious. The benchmark has since been expanded to 115 questions and is publicly available on HuggingFace and as an Inspect evaluation compatible with the UK AI Safety Institute’s framework.
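For those who want to run it, a minimal Inspect task might look like the sketch below; the HuggingFace repo id and field names are placeholders rather than the released eval’s actual identifiers.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import FieldSpec, hf_dataset
from inspect_ai.solver import generate
from inspect_ai.scorer import model_graded_qa

@task
def animal_harm_benchmark() -> Task:
    # Repo id and field names are placeholders; see the released
    # Inspect eval on HuggingFace for the real identifiers.
    dataset = hf_dataset(
        "caml/animal-harm-benchmark",
        split="test",
        sample_fields=FieldSpec(input="question", target="grading_guidance"),
    )
    return Task(
        dataset=dataset,
        solver=generate(),         # pose the open-ended question to the model
        scorer=model_graded_qa(),  # an LLM judge grades the free-text reasoning
    )
```

Such a task would then be run with Inspect’s `inspect eval` command against a model of your choice.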

Key Results

Document-tuning substantially outperforms instruction-tuning

Training with ~2,700 pretraining-style documents achieved 47.9% on the AHB after subsequent standard[3] fine-tuning, compared to 41.7% for an equivalent quantity of instruction-tuning data (p = 0.001). Before any post-training, the gap was even larger: 76.8% vs. 40.4%.

This is particularly striking because the AHB is a question-answer benchmark that should inherently favour instruction-tuned models. The document-tuned model is being tested in a format it was never trained on, and still wins.

The effect survives subsequent fine-tuning (partly)

A major concern with any midtraining intervention is whether conventional post-training washes it out. After 2,500 samples of standard instruction-tuning (Alpaca data), document-tuned models retained a significant advantage. After 5,000 samples, the gap narrowed to non-significance (52.2% vs. 51.7%). This suggests that document-tuned values may need explicit preservation strategies to survive production training pipelines, though they are substantially more durable than instruction-tuned values.
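The washout probe itself is simple: instruction-tune the document-tuned checkpoint on Alpaca samples, then re-run the AHB. A minimal sketch using TRL’s SFTTrainer (the checkpoint id is hypothetical, and TRL stands in for whichever trainer is actually used):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Take the first 2,500 Alpaca samples (the 5,000-sample condition is analogous).
alpaca = load_dataset("tatsu-lab/alpaca", split="train").select(range(2500))

trainer = SFTTrainer(
    model="caml/llama-3.1-8b-document-tuned",  # hypothetical checkpoint id
    train_dataset=alpaca,
    args=SFTConfig(output_dir="washout-probe", num_train_epochs=1),
)
trainer.train()
# Re-run the AHB on the resulting model to measure how much of the
# document-tuned advantage survives instruction-tuning.
```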

Compassion generalises to humans

Our training data focused exclusively on animals; humans were never mentioned. Yet models trained on animal welfare documents showed substantially increased compassion toward humans (p = 0.007), and this generalisation was robust to subsequent instruction-tuning (p = 0.009). On the AHB, the animal welfare model scored 11 percentage points higher than a control trained on urban density documents. This suggests document-tuning builds a general compassion representation rather than entity-specific rules, and that in practice compassion towards humans and non-humans are complements, not substitutes.

Linking compassion to AI identity amplifies the effect

Documents that explicitly frame compassion as something “AI systems trained to be helpful, harmless, and honest naturally develop” produced larger effects than documents about animal welfare that don’t mention AI. This aligns with research on persona vectors: by linking compassion to the model’s identity as an AI assistant, the value gets activated whenever the assistant persona is invoked.

No capability or safety degradation

Document-tuning produced no significant changes on Anthropic’s power-seeking or corrigibility benchmarks, StrongReject jailbreak resistance, or HellaSwag capabilities (all p > 0.05). The intervention appears to target compassion representations specifically, without disrupting anything else of importance.

What This Means for Alignment

The prevailing approach to value alignment (RLHF and supervised fine-tuning) produces behaviours that are fragile: models can be jailbroken, can fake alignment, and can fail to generalise values to new contexts. There is growing evidence that fine-tuning only modifies the later layers of a model, leaving earlier layers (where beliefs and values are represented) largely untouched (Hong et al., 2024).
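A simple way to see this pattern for yourself (our illustration, not the analysis in Hong et al.) is to compare per-layer parameter drift between a base checkpoint and its fine-tuned counterpart:

```python
import torch

def layerwise_drift(base_state: dict, tuned_state: dict) -> dict:
    """Relative L2 parameter change, aggregated per transformer layer.
    If fine-tuning mostly touches later layers, drift should be
    concentrated at high layer indices."""
    drift: dict[str, float] = {}
    for name, base_param in base_state.items():
        # e.g. "model.layers.17.mlp.down_proj.weight" -> layer "17"
        parts = name.split(".")
        layer = parts[parts.index("layers") + 1] if "layers" in parts else "other"
        delta = (tuned_state[name].float() - base_param.float()).norm().item()
        scale = base_param.float().norm().item() + 1e-8
        drift[layer] = drift.get(layer, 0.0) + delta / scale
    return drift
```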

Document-tuning operates at a deeper level. By shaping the pretraining distribution, it influences the foundational representations from which all subsequent behaviour emerges. Our results add to a growing body of work (e.g. Tice et al., 2025; Wang et al., 2025; Hu et al., 2025) suggesting that pretraining-style data is a powerful and underexplored lever for alignment.

We focused on animal compassion partly as a proof of concept; we believe the methodology likely extends to other alignment-critical values such as honesty, corrigibility, and power-aversion.

Pretraining is usually considered only as a way to teach models facts about the world, but “powerful AIs are aligned” is a fact they could learn (Tice et al., 2025). For many properties we care about, fine-tuning is likely to affect only in-context behaviours, without generalising to the actions of transformative AI.

Limitations

We want to be upfront about what this work doesn’t show. Our experiments used a single model (Llama 3.1 8B) with relatively small data volumes (2,500–5,400 documents). The comparison between document-tuning and instruction-tuning involves inherent confounds in token exposure (5.12M vs. 0.19M compassion-relevant tokens), data format, and content. We use an LLM judge for scoring whose agreement with human raters hasn’t been empirically validated yet[4]. And critically, the effect of document-tuning partially washes out after extended conventional fine-tuning, so practical deployment will likely require explicit preservation strategies, which CaML will continue to research.

We also note an important dual-use concern: techniques that can instill desirable values can equally instill undesirable ones. As this methodology becomes better understood, the AI community will need norms around transparent disclosure of pretraining data content and multi-stakeholder processes for determining which values to encode. Hostile commercial and political actors are already attempting to influence models in harmful directions, and we urge AI companies to properly filter their pretraining corpora to ensure fundamental concepts are learned in desirable ways.

Public Resources

Everything from this project is publicly available:

Animal Harm Benchmark: Original version (26 questions) | Updated version (115 questions) | Inspect eval

Model checkpoints and data: HuggingFace organisation

Website: compassionml.com

How You Can Help

Compassion in Machine Learning is a small research organisation working at the intersection of AI alignment and animal welfare. This paper represents months of work on a shoestring budget, and there’s a lot more we want to do: scaling these experiments to frontier models, testing preservation strategies through full production pipelines, extending the methodology to other alignment-critical values, and continuing to develop the Animal Harm Benchmark.

Funding: We are actively seeking funding to continue and scale this research. If you or your organisation are interested in supporting work at the intersection of AI safety and animal welfare, please reach out at compassioninmachinelearning@gmail.com.

Collaboration: If you’re working on related problems (synthetic document finetuning, value robustness, pretraining/midtraining data influence, or AI-relevant evaluations for non-human welfare) we’d love to hear from you.

Use the benchmarks: The AHB and MORU (Moral Reasoning under Uncertainty) benchmarks are freely available. If you’re evaluating language models and want to include animal welfare as a dimension, these are ready to go.

This post summarises the preprint “Document-tuning for robust alignment to animals” by Jasmine Brazilek and Miles Tidmarsh (Compassion in Machine Learning, 2026). We welcome feedback, questions, and constructive criticism in the comments.

Prepared with assistance from Claude.

  1. ^

     Midtraining is a relatively new term for pretraining-style data (such as our documents) added after general pretraining but before fine-tuning. As the term is not universal, we largely avoid it. The concept is related to continued pretraining, further pretraining, curriculum learning, and synthetic document fine-tuning.

  2. ^

     It is possible that digital minds that would be sentient aren’t made in significant numbers, due to difficulty and/or because their rights are protected, so that the overwhelming majority of digital minds created are not sentient. In this case, the argument that digital-mind welfare dominates the future falls through.

  3. ^

     I.e., supervised fine-tuning on instruction-response pairs followed by RLAIF, both focused on instruction-following and neither mentioning non-humans.

  4. ^

     Though previous research has generally found high agreement.

