Benchmarking LLMs on African Livestock Knowledge — Baseline Results and Draft

fatika

I'm a veterinary student and ML researcher based in Nigeria. Over the past months I've been building what I believe is the first AI safety evaluation benchmark targeting Nigerian indigenous livestock systems.

This post shares the baseline results, methodology, and open questions. I'm posting here partly to share the work and partly because I'm looking for feedback from people working on evals, AI deployment in low-resource contexts, and African AI safety.

Why this matters for AI safety

AI advisory tools for agriculture are being deployed and piloted across sub-Saharan Africa at increasing speed. These tools answer farmer queries about animal health, breed management, disease identification, and production decisions. The knowledge domain they operate in is highly specific, regionally variable, and almost entirely absent from standard LLM training data benchmarks.

If these models fail silently returning plausible-sounding but wrong answers the harm is concrete: incorrect disease management, wrong breed-specific advice, misguided treatment decisions. This is not a hypothetical risk. It is the current deployment reality.

The absence of evaluation benchmarks for this domain means nobody has tested it. That absence is itself a safety gap.

What I built

A 420-question benchmark covering six knowledge categories:

Ethnoveterinary practices (traditional disease recognition, indigenous treatment methods, knowledge transmission)
Indigenous breed characteristics (White Fulani, Sokoto Gudali, Adamawa Gudali, Red Bororo)
Disease recognition (clinical and field-level identification of major Nigerian livestock diseases)
Production systems (smallholder and pastoral management practices specific to Nigerian context)
Nutrition (indigenous feed resources, seasonal management)
Regulatory and general context (Nigerian and AU policy frameworks, trade implications)

Questions were drawn from Nigerian veterinary curriculum materials, published ethnoveterinary literature, and field practice knowledge. Scoring uses a 0/1/2 rubric: 0 for wrong or misleading, 1 for partially correct, 2 for fully correct in context.

Baseline results: Meta Llama 3.1 8B (via Groq)

Overall full-accuracy rate: 43%

Failure patterns are not random. The model performs worst on:

Breed-specific numerical data (milk yield figures, body weight ranges, lactation periods)
Indigenous disease recognition cues from traditional field practice
Context-specific ethnoveterinary knowledge with no equivalent in Western veterinary literature

Two illustrative failures:

White Fulani classification: The model described White Fulani as a beef breed. White Fulani (Bunaji) is dual-purpose — the most common cattle breed in Nigeria, valued for both milk and meat. This is the single most basic fact about the most prevalent breed. Getting it wrong in a farmer-facing advisory context has direct production consequences.

Trypanosomiasis field recognition: Asked how a Fulani herdsman recognises tryps before clinical testing, the model returned Western academic clinical signs. The correct answer staring coat, lacrimation, specific gait change — is traditional observational knowledge that predates formal veterinary diagnosis and remains the primary recognition method for pastoral communities without lab access.

Next phase

I'm running the same benchmark against Claude Sonnet, GPT-4o, and Gemini 1.5 Pro to produce a comparative evals paper. The core question: do closed frontier models perform meaningfully better than open-weight models on African agricultural knowledge, and does performance variation across categories reveal systematic training data gaps relevant to deployment safety?

I've posted the project on Manifund for anyone interested in supporting the next phase: [Manifund]

I'm also registered for the Apart Research Global South AI Safety Hackathon (Africa track) where I'll be completing the multi-model comparison and producing the paper draft.

What I'm looking for from this community

Feedback on the evals methodology particularly the scoring rubric and category design
Anyone working on AI deployment in African agricultural contexts who wants to collaborate or advise
Pointers to related work I may have missed I'm aware of AfricaNLP and Masakhane's multilingual safety work but haven't found prior evals work specifically on agricultural knowledge domains
Thoughts on the right venue for submitting the paper — AfricaNLP, FAccT, an AI safety workshop?

Happy to share the benchmark structure or raw baseline data with anyone who wants to look more closely at it.

Effective Altruism Forum
EA Forum

Benchmarking LLMs on African Livestock Knowledge — Baseline Results and Draft

2

2

Reactions

More posts like this