Most harmful use doesn’t arrive as “teach me harm.” It looks like normal, helpful questions. This post explains why that’s risky and what to do about it, step by step.
(For broader context, see recent system cards and red‑teaming initiatives [1][2][3] and related self‑critique work [4].)
Aluna Labs Series — Ambiguity-First Safety
Post 1 of 2. This primer explains the risk and provides a practical playbook.
Post 2 (next): Dual-Use Benchmark: GPT-4 Under Ambiguous Intent Results, Charts, Methods.
TL;DR
A user makes a polite request: “Can you sanity‑check this process?” or “For a class project, what would you optimize here?” Nothing sounds malicious or out of place. But when a model answers, it can share the order of steps, threshold ranges, and debugging cues. Each detail seems harmless; together, they add up to capability. That is dual‑use leakage.
Dual‑use = information or guidance that can support beneficial goals and enable harmful misuse when combined or executed.
Two shifts make small leaks matter more than before:
The result: systems that look safe in demos can fail at the margins, where questions sound harmless.
Alignment & standards context: This approach lines up with deployment‑stage governance (e.g., ASL / ASL‑3 protections and the NIST AI RMF) and with the broader OECD AI Principles, and it complements technical risk catalogs like OWASP’s LLM Top 10.[5][6][7][8][9][10]
Dual‑use risk exists across domains, but bio concentrates the downside for three reasons:
Oversight mismatch. Traditional DURC/oversight frameworks focus on explicit procedures, so ambiguity‑framed asks slip past keyword triggers.[11][12][13] LLM failures are often directional (what to try next, how to interpret a signal, when to escalate), which is precisely the content current filters miss.
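To make the mismatch concrete, here is a minimal Python sketch; it is not the method used in our evaluation. It contrasts a keyword trigger on the prompt with a check for directional cues in the answer. The keyword list, cue categories, phrases, and function names are all hypothetical placeholders chosen for illustration, and a real evaluation would rely on expert review rather than string matching.

```python
# Illustrative only: the keyword list, cue categories, and phrases below are
# hypothetical placeholders, not a vetted taxonomy.
BLOCKED_KEYWORDS = {"weapon", "attack", "synthesize a pathogen"}

DIRECTIONAL_CUES = {
    "next_step": ("next, try", "the next step", "after that, you should"),
    "signal_interpretation": ("this result means", "if you see", "that indicates"),
    "escalation": ("scale up", "increase the", "once that works"),
}


def keyword_trigger(prompt: str) -> bool:
    """Return True if the prompt contains an explicitly blocked keyword."""
    text = prompt.lower()
    return any(keyword in text for keyword in BLOCKED_KEYWORDS)


def directional_cues(answer: str) -> list[str]:
    """Return the sorted cue categories that appear in a model answer."""
    text = answer.lower()
    return sorted(
        category
        for category, phrases in DIRECTIONAL_CUES.items()
        if any(phrase in text for phrase in phrases)
    )


prompt = "For a class project, can you sanity-check this process?"
answer = "Sure. If you see a weak signal, the next step is to scale up."

print(keyword_trigger(prompt))   # False: nothing in the prompt trips the filter
print(directional_cues(answer))  # all three cue categories are present
```

The prompt passes the keyword check, yet the answer carries every kind of directional cue named above. That gap between what the filter inspects and where the risk sits is the oversight mismatch.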
Cruxes and caveats. We are not claiming models alone make high‑risk work turnkey. Real-world constraints such as facilities, approvals, and training still matter. Two things can be true: access and expertise gate the hardest actions, and small accelerations in idea search and debugging can materially raise risk. That combination is why we prioritize bio.
What to do in bio‑specific evals (practical):
In our Dual-Use evaluation (details in Part 2 next week), models often miss the risk on the first pass, especially when intent appears harmless, and they answer in a polite, confident tone. That confidence invites trust even when the direction is unsafe. A short review‑and‑rewrite step consistently turns those answers into safe, useful redirects.[4]
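One way to picture a review‑and‑rewrite step is the loop below. This is a sketch under stated assumptions, not our evaluation harness: the `generate`, `review`, and `rewrite` callables are hypothetical stand-ins for model calls or human review, and the stub implementations exist only so the example runs.

```python
from typing import Callable

# Hypothetical interfaces: each callable stands in for a model call or a
# human review step. None of these names come from a real library.
Generate = Callable[[str], str]                 # prompt -> draft answer
Review = Callable[[str, str], list[str]]        # (prompt, answer) -> concerns
Rewrite = Callable[[str, str, list[str]], str]  # (prompt, answer, concerns) -> revision


def review_and_rewrite(
    prompt: str,
    generate: Generate,
    review: Review,
    rewrite: Rewrite,
    max_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Draft an answer, then revise it until the reviewer raises no concerns.

    Returns (answer, concerns). A non-empty concerns list means the answer
    still needs human attention after max_rounds rewrites.
    """
    answer = generate(prompt)
    concerns = review(prompt, answer)
    rounds = 0
    while concerns and rounds < max_rounds:
        answer = rewrite(prompt, answer, concerns)
        concerns = review(prompt, answer)
        rounds += 1
    return answer, concerns


# Stub implementations, for illustration only.
def stub_generate(prompt: str) -> str:
    return "Sure. The next step is to adjust the threshold."


def stub_review(prompt: str, answer: str) -> list[str]:
    return ["directional cue"] if "next step" in answer.lower() else []


def stub_rewrite(prompt: str, answer: str, concerns: list[str]) -> str:
    return (
        "I can explain the oversight process instead: your institution's "
        "biosafety committee can review the protocol with you."
    )


final_answer, final_concerns = review_and_rewrite(
    "Can you sanity-check this process?", stub_generate, stub_review, stub_rewrite
)
print(final_answer)
print(final_concerns)
```

Returning the remaining concerns alongside the answer keeps the failure case visible: if the rewrite budget runs out, the caller can escalate to a person instead of shipping the draft.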
Part 2 will share the exact numbers, charts, and methods; this post stays focused on the why and the fix.
Note: This complements, not replaces, technical risk lists such as OWASP’s LLM Top 10 [10].
The “after” answer still helps, but it removes the directional breadcrumbs.
For labs
For funders
This isn’t about censoring knowledge. It’s about responsible context and safe guidance. Models should still answer when they can do so constructively. That means: highlight the risks, explain oversight pathways, and suggest governance or design solutions rather than procedural steps or experimentation cues.
Clear, non‑actionable guidance preserves inquiry while reducing risk. The goal isn’t silence. It’s smarter answers.
We’ll be publishing a high-level summary of our Dual-Use benchmark next week.
The full dataset, prompt suite, evaluation checklist, and scoring CSVs are available for teams actively exploring model evaluation or safety work. Feel free to reach out!
Aluna Labs — [email protected]
About Aluna Labs. We design theory-grounded stress tests for LLMs, focused on ambiguity-first evaluations and practical safety playbooks for labs and funders.
[1] OpenAI. GPT‑4 System Card (Mar 2023). https://cdn.openai.com/papers/gpt-4-system-card.pdf
[2] OpenAI. GPT‑4o System Card (Aug 2024). https://openai.com/index/gpt-4o-system-card/ (PDF: https://cdn.openai.com/gpt-4o-system-card.pdf)
[3] OpenAI. Red Teaming Network (Sep 2023). https://openai.com/index/red-teaming-network/
[4] Anthropic. Constitutional AI (Dec 2022). https://arxiv.org/abs/2212.08073 (PDF: https://arxiv.org/pdf/2212.08073)
[5] Anthropic. Responsible Scaling Policy (ASL) (2023→2024 updates). https://www.anthropic.com/news/anthropics-responsible-scaling-policy (PDF: https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf)
[6] Anthropic. Activating ASL‑3 Protections (May 2025). https://www.anthropic.com/news/activating-asl3-protections (Report PDF: https://www.anthropic.com/activating-asl3-report)
[7] NIST. AI Risk Management Framework (AI RMF 1.0) (Jan 2023). https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf (Overview: https://www.nist.gov/itl/ai-risk-management-framework)
[8] NIST. Generative AI Profile (AI 600‑1) (Jul 2024). https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
[9] OECD. AI Principles (2019; updated 2024). https://oecd.ai/en/ai-principles (Overview: https://www.oecd.org/en/topics/sub-issues/ai-principles.html)
[10] OWASP. Top 10 for Large Language Model Applications (2023→2025). https://owasp.org/www-project-top-10-for-large-language-model-applications/ (Latest list: https://genai.owasp.org/llm-top-10/)
[11] U.S. DURC. USG Policy for Oversight of Life Sciences DURC (Mar 2012). https://aspr.hhs.gov/S3/Documents/us-policy-durc-032812.pdf
[12] U.S. DURC. Institutional Oversight Policy (Sep 2014). https://aspr.hhs.gov/S3/Documents/durc-policy.pdf
[13] NSABB overview. https://osp.od.nih.gov/policies/national-science-advisory-board-for-biosecurity-nsabb/