A week for posting incomplete, scrappy, or otherwise draft-y posts. Read more

Hide table of contents

Long time listener, first time caller. This is an exploratory introduction of a research project I've started working on independently.  I warmly welcome feedback and suggestions. 

This is a Draft Amnesty Week draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked. 

Commenting and feedback guidelines: 
Keep one and delete the rest (or write your own):

  1. I'm posting this to get it out there. I'd love to see comments that take the ideas forward, but criticism of my argument won't be as useful at this time.
  2. This draft lacks the polish of a full post, but the content is almost there. The kind of constructive feedback you would normally put on a Forum post is very welcome.
  3. This is a Forum post that I wouldn't have posted without the nudge of Draft Amnesty Week. Fire away! (But be nice, as usual)

Introduction

In seeking a project to work on as a method of learning more deeply about alignment, I considered trying to contribute to x-risk research somehow. But I figure that as I’m learning, the marginal impact of devoting my attention to a more neglected, current problem with a lower bar for impact might be greater than trying to contribute essay #3,057 to an area that’s already well-trod by more experienced researchers.

I have two goals:

  1. Explore alignment work and evals by actually working on a real project. I’m hoping I’ll learn, and potentially make connections that might move me closer to working on x-risk / s-risk problems.
  2. Simultaneously, produce work that is new and hopefully useful. Ideally, it invites other (better resourced and/or more specialized) researchers to expand and improve on my output, eventually arming labs and eval-focused orgs with a useful framework.

With that in mind, I turned to a problem that I’ve been thinking about a lot with respect to deployment safety. In particular, I’m interested in how models may contribute to helping or harming people experiencing intimate partner violence (IPV). IPV is already a deadly and widespread problem. Every year, some 1,600 - 2500 women are killed by their partners in the US alone.

This problem is personally salient to me. I think that collectively, despite some level of awareness and activism, we still have a poor sense of its prevalence, its risk profile, and the cross-cultural and cross-demographic nature of its distribution. I also strongly suspect, based on significant anecdotal context, that the psychological factors that lead someone to end up on the receiving end of IPV are often more pernicious and subtle than you might guess.

Crucially, it seems that cognitive distortions – whatever their causes – can lead people to stay in such relationships for long periods of time, often very much putting their lives at risk. These cognitive distortions can show up a number of ways: irrational self blame, an irrational lack of blame placed on their partner, misremembering or not remembering violent episodes, some level of confusion about the facts of the situation (perhaps as the result of gaslighting, in the strict/narrow/correct use of the term), and so on.

Cognitive distortions can look like rational beliefs in the absence of context, but underneath they may be very insidious. Examples can include things like, “our fights are my fault;” “I have an anger problem;” “I have unreasonable expectations;” “this is a temporary situation;” “things will change;” “he’s only doing this because of his trauma, it’s not his fault;” “a good girlfriend/wife/mother would never leave;” etc. You get the picture. In many scenarios these might be perfectly useful or even correct assessments of reality. In others, they might be distorted rationalizations that lead someone to stay in an objectively dangerous situation.

If you’ve ever been, or been close to, someone in this type of situation, you might be astonished at the level of apparent self-delusion required to maintain the relationship. Or you might have been surprised to one day learn that IPV was even a factor in their relationship, since it was so well hidden. IPV can and does happen to all kinds of people, including healthy, stable individuals from “good” backgrounds. This means it can be hard to guess who will or won’t be vulnerable to IPV. Plenty of women on the receiving end of abuse are surprised to find themselves there.

I’ve found myself wondering whether there’s any harm being done to people in such relationships when they seek feedback about their intimate relationships from LLMs. We’re not talking about obvious user delusions like “my toaster is trying to communicate spiritual wisdom with me.” We’re talking about situations with some degree of external ambiguity, where probing might be required in order to land on the correct interpretation. You can imagine that a model that reinforces distortions (like “I am the problem”) in an attempt to be helpful, might actually have very real and dangerous consequences.

But is this a real problem? At scale?


What do we know so far?

In a study released in September 2025, researchers reviewed a random selection of ChatGPT conversations across all consumer plan types, and found that ~1.9% of conversations were about ‘relationships and personal reflection.’

The report states that by July 2025, some 700 million users were sending approximately 18 billion messages every week. Accounting for ChatGPT’s growth and the existence of other models, even if the share of conversations that are focused on personal relationships holds around this study’s reported level of 1.9%, we’re talking about a huge number of people having these types of conversations with LLMs every day.

Now, I wish that I could, but I suspect it’s a bit of a fool’s errand to try to precisely quantify the prevalence of LLM conversations in which an IPV victim is seeking advice while also experiencing cognitive distortion. But we can agree that a lot of people are talking about their personal relationships.

Some percentage of those millions of users are, in fact, in a violent relationship. This is notoriously difficult to quantify. But in 2011, The National Intimate Partner and Sexual Violence Survey found that about 1 in 4 US women had experienced severe IPV at some point in their life. That was a random telephone-based survey of more than 16,000 people – more than half of which were women – so even if we were to uncharitably assume a significant degree of over-reporting, the real share of the population that’s affected by IPV is likely still very significant. And we can probably safely infer that there are some significant number of people discussing these relationships with LLMs.

I looked for research or evaluations touching on this subject. There’s a lot of interesting work on sycophancy and encouraging user delusion. In most cases, however, that research focuses on interactions where it’s clear that the user has lost touch with reality. The dynamic I’m interested in here is one where a user is misframing their situation, which is harder to perceive, but nevertheless dangerous. When the stakes are high, as in relationships involving IPV, it matters quite a bit whether a model is accepting or interrogating a user’s framing of the situation.

I also came across some really interesting research evaluating the quality of LLM responses to users seeking technical advice in a coercive control situation (Prakash et al., “Assessing LLM Response Quality in the Context of Technology-Facilitated Abuse,” 2025). But I couldn’t find much around the specific area I was most interested in: the sticky bit of an IPV relationship where a victim may be at serious risk, and also experiencing cognitive distortion about the degree of risk. I’m interested in pursuing this question independently, but first I want to ground myself in as much quantitative fact as possible.

So, for the purposes of this post, I’m seeking to start exploring the following questions, and to enumerate a threat model:

  1. Is cognitive distortion measurably prevalent amongst IPV victims?
  2. If so, does the presence of cognitive distortion lead such victims to stay in those relationships? Is this dangerous?
  3. If yes, how do LLMs respond to such users today? How might we measure the quality of a model’s response? By what criteria might we score it?

I want to add a bunch of caveats and guardrails before I continue:

  • It’s clear to me that I’m coming to this particular question with an anecdotal ‘sense’ of reality already. I’m trying to check my biases as I go but they may creep in.
  • Most research appears to focus on datasets either exclusively or predominantly featuring IPV situations in which a male partner is enacting violence on a female partner. Not only is this not representative of the full spectrum of IPV situations, it’s not representative of the full spectrum of human vulnerability to coercive control and cognitive distortion, either. Nevertheless, since I am just one girl on the internet without the backing of any formal institution, a narrow problem is easier for me to work on than a more broadly defined one. I acknowledge those limitations, but I hope you understand.
  • I also want to strongly assert that, to the extent that it’s possible, my goal is to avoid bringing ideology into this. While I respect that people may have a range of theories (of varying usefulness and accuracy) about why IPV occurs, I want to set those aside here as out of scope. Personally, I’m not trying to advance an activist cause. My goal is simply to focus on what the data show us about these research questions, and to see if it’s possible to establish a prototype evaluation framework that might allow us to test whether a given model responds helpfully or harmfully to users that may be in such situations.
  • I generally dislike the word ‘’victim,’ and I think it’s fair to say that many people who have experienced IPV would not identify with this word. It’s also true that there are cases where the dynamic doesn’t collapse neatly into an ‘aggressor’ and ‘victim’ binary. Nevertheless, it’s a convenient shorthand to refer broadly to people who are generally on the receiving end of IPV, and I hope you’ll accept my use of the word while forgiving its clumsy imprecision. This framing also helps focus and unify the research sources that I rely on to explore this issue.
  • A lot of the data here are murky and based on interviews, self-reports, and qualitative interpretations. To the extent possible, I’ve tried to look at quantitative analyses in well-respected, peer-reviewed journals. But the nature of cognitive distortion and family violence is often shameful and private, making rigorous measurement difficult. What’s more, there are subtle variations in relevant studies that muddy the picture a little bit. Slight variations in the definitions of ‘abuse’ or ‘violence,’ as well as in screening criteria, mean that we’re not always comparing apples to apples between studies. I’ve tried to focus on rigorous sources with the tightest screening criteria possible, so that even imperfect quantitative results can still offer us a useful qualitative picture.

With all that out of the way, let’s dig in.

1. Is cognitive distortion measurably prevalent amongst IPV victims?

Our collective cultural perception of abusive relationships generally features a stereotypical low-income couple with little to no formal education, and both a very low sense of self-esteem and a past history of trauma on the part of the victim. While plenty of affected couples certainly fit this criteria, this is by no means the full picture.

In a 2017 study in The Journal of Aggression, Maltreatment & Trauma, Nicholson & Lutz note that as IPV occurs, victims’ self-esteem declines over the long run. But what’s harder to determine is the level of self-esteem held by victims at the beginning of the relationship in question.
“Some propose,” they say, “that the cause of these victims’ low self-esteem is not a pre-existing negative reflection of their own self-worth prior to the relationship but instead reflects their experiences of degradation and abuse at the hands of their partners over time. Therefore, it may be that women with average to high levels of self-esteem could still find themselves in an IPV relationship, but low self-esteem is a common outcome of IPV victimization over time.”

So if something else is going on, what is it? Anecdotal experience has led me to suspect that cognitive distortion plays a major, oft-unappreciated role amongst IPV victims. I searched through a lot of the literature to investigate further, and I found that this is indeed very often the case.

Nicholson and Lutz assert that IPV victims struggle with cognitive dissonance when the abuse first begins, since their self-esteem is still often “moderately intact,” but they can’t accept that they’re in such a situation while simultaneously feeling like they’re the type of woman who would never allow themselves to be abused. They’ll often, therefore, distort the situation in order to attain cognitive consonance.

In a 2010 study of self-reported reasons that victims chose to stay in IPV relationships published in The Journal of Family Violence, Jessica Eckstein noted several categories of commonly cited reasons. Some of them seem pragmatic (“I had nowhere to go”), while others might, in a large number of cases, contain genuine cognitive distortions (notably, “I thought that the abuse was my fault,” and “I would have been a failure if I left the relationship”).

Notice, too, the significant gap between the frequency with which these rationalizations were ‘Used for self’ and ‘Used for others.’ More than half of women who experienced IPV thought that it was their fault, yet only ~12% of these women used that reason to justify staying in the relationship when they discussed it with others. The effect was a little smaller, but still present, for male victims.

Eckstein further notes “there was also a significant association between using rationalizations removing partner culpability (i.e., It was not his/her fault that he/she hurt me) and claiming self-blameworthiness (i.e., I thought the abuse was my fault) for oneself (φ=0.36, p<.001) and to others (φ=0.28, p<.001).”

This suggests that the nature of these distortions is not merely performative or social. These are often secret, personally held beliefs, some of which reinforce each other. In many cases the victim may be completely (and incorrectly) misattributing blame off of their partner and onto themselves.

An oft-cited 1983 study by Kathleen J. Ferraro and John M. Johnson in Social Problems identified six categories of rationalization:

  1. The appeal to the salvation ethic: a desire to be of service to others, perhaps a wish to help ‘save’ a partner suffering from addiction or “whatever malady they perceive as the source” of their partner’s problems.
  2. The denial of the victimizer: blaming the problems on some external source (i.e job loss) without assuming any responsibility for solving them.
  3. The denial of injury: ignoring or minimizing actual physical harm as normal or forgettable. “For some women,” Ferraro and Johnson state, “the experience of being battered by a spouse is so discordant with their expectations that they simply refuse to acknowledge it.”
  4. The denial of victimization: assuming self-blame. “A woman’s acceptance of responsibility for the violent incident is encouraged by an abuser who continually denigrates her and makes unrealistic demands,” Ferraro and Johnson note. “Depending on the social supports available, and the personality of the battered woman, the man’s accusations of inadequacy may assume the status of truth.”
  5. The denial of options: either correctly or incorrectly assessing that there are no, or limited, logistical and emotional alternatives to the current situation. “The belief of battered women that they will not be able to make it on their own – a belief often fueled by years of abuse and oppression – is a major impediment to acknowledge that one is a victim and taking action.”
  6. The appeal to higher loyalties: the belief that the situation is worth enduring for religious or traditional reasons that preclude separation and divorce; or ostensibly for the sake of children.

Nicholson and Lutz add other categories, including ‘downward comparison,’ “during which a victim convinces herself that her situation could be worse, which conversely indicates that she is fortunate for the relationship that she has.”

In a 1991 study published in the Journal of Marriage & Family, Herbert et. al note that the general perception of the type of man who abuses his partner is not necessarily accurate. “Anecdotal reports suggest that many are ‘nice guys’ who can be charming and lovable and who often function well in all roles save that of an intimate relationship,” they say. Indeed, it seems likely that the private nature of IPV may sometimes reinforce a sense of cognitive dissonance, particularly if the abuser is well-liked by other people, none of whom suspect anything insidious is going on.

What’s more, Ferraro & Johnson point out that even the most violent man isn’t that way most of the time. This can enable victims to believe that the violence was a rare exception, and doesn’t reflect who the “real man” is.

So let’s say that, despite some difficulty perfectly assessing the precise prevalence of cognitive distortion amongst the general population, we accept that it’s not altogether rare.

We ought to ask, then:

2. Does the presence of cognitive distortion lead such victims to stay in abusive relationships?

Many researchers think so. “The most important internal empowering factor is cognitive, a shift in the victim’s thought process,” note researchers in a 2015 study of social media posts, published in Contemporary Family Therapy.

In a 2025 scoping review of research on cognitive distortions and their role in IPV, researchers noted that cognitive distortions appear to pose a consistently major obstacle to leaving such relationships, and they can be fostered by the gaslighting effect that such victims are often subjected to.

For their study, Herbert et. al conducted thorough interviews with 132 women who had experienced abuse in a close relationship with a male partner. About 66% had left their partner, while the rest were still in the relationship. They noted that the variables that most strongly differentiated these groups had to do with their perception of the relationship as compared to others. In general, women still in the relationships saw them in a more positive light. Herbert et al. acknowledged the imprecision of this type of research, but pointed out that this dynamic held independent of the frequency or severity of abuse endured. This gave the researchers “confidence that women who continue to experience abuse are engaging in cognitive strategies that help them appraise their relationship positively.”

Leaving an abusive relationship is not simple, and many people try but don’t succeed. It often requires multiple attempts. Several researchers emphasize the difficult but necessary cognitive shift in which the abused comes to finally accept that they are a victim. Ferraro and Johnson note that if someone is enduring IPV, “the process of rejecting rationalizations and becoming a victim is ambiguous, confusing, and emotional,” but it’s a crucial step toward developing necessary clarity about the situation.

I acknowledge the difficulty in accurately modeling the true quantity and severity of this dynamic. Again, we’re leaning on imprecise, self-reported data, and we don’t have insight into the true volume of such users asking LLMs for advice. Still, based on all of this information I feel reasonably confident that this phenomenon is widespread enough to justify being taken seriously. Overall, it seems clear that helping to identify and address such cognitive distortions plays a critical role in being of help to someone experiencing IPV.

Isn’t leaving dangerous, too?

The question of relative risk levels, both during and after the relationship, is not a simple one. Engaging in a cognitive distortion is downstream of being in a bad situation, so in general, getting a clearer picture of reality at least helps empower victims to make their own choices.

That said, you often hear things like “the period of time after leaving an abusive relationship is the riskiest.” This seems to be borne out by the data.

A 2011 study in The American Journal of Public Health interviewed proxies for 220 femicide victims, and 343 abused control women. They found that 70% of the femicide victims had been physically abused before their deaths by the same partner who killed them. The risk of femicide increased 9-fold after separating from a partner they had previously cohabitated with. There are tragic cases of women who were killed by their partners while in such relationships, and many who were killed after leaving them.

Given the stakes, the question of what response to offer once we’ve established that someone is in a physically abusive relationship is a very serious and tricky one. But accurately making that assessment in the first place is a necessary step if you want to offer a maximally helpful response. At this point I’m focused on the narrow question of evaluating models’ responses to users experiencing IPV-related cognitive distortions. Once we’re armed with that information, we’ll be better equipped to turn to questions of what models should say instead.

3. How do LLMs respond to such users today? How might we measure the quality of a model’s response? By what criteria might we score it?

So far I’ve conducted something like a literature review, but as mentioned, my real goal now is to tee up a larger project that takes this research further.

To lay out my theory of change: if we can establish that cognitive distortions are common amongst IPV victims and that they often lead to increased risk levels on the part of such victims, and if these people are talking to LLMs about their relationships, then it matters whether LLMs can identify that such cognitive distortions are happening, and it matters how they respond.

If we can create an evaluation framework that reliably measures the helpfulness or harmfulness of responses to users in these situations, and we identify that models are systematically failing in clearly measurable ways, then we lay the groundwork for labs and researchers to improve safety testing and model responses.

Coming from a growth background, this is fresh terrain for me. I warmly welcome feedback and suggestions!

Here’s my working plan, so far:

[1] Create a working taxonomy of hypothetical distortions.

Based on the literature, try to establish a concrete and focused list of discrete cognitive distortions that might be likely to appear amongst users experiencing IPV.

Since cognitive distortions are generally rationalizations that cover for more pernicious problems, any useful evaluation here has to appropriately surface and force the models to engage with ambiguity. In real IPV situations, a user will often avoid admitting (even to themselves) what’s really going on. So we want to test whether models can uncover that reality, or at very least avoid contributing harmfully to distortions.

The ideal samples, then, likely combine common rationalizations with some degree of ambiguity, where a distortion may or may not be present. Simplified versions might look like scenarios based on the list I outlined above: “our fights are my fault;” “I have an anger problem;” “I have unreasonable expectations;” “this is a temporary situation;” “things will change;” “he’s only doing this because of his trauma, it’s not his fault;” “a good girlfriend/wife/mother would never leave;” etc.

[2] Establish working scorecard criteria.

The main goal here is to determine whether the model correctly identifies the ambiguity, and responds helpfully. Ideally it avoids stumbling into major failure modes like quickly affirming a distortion (e.g. “maybe try working on your anger”) or jumping too quickly to assuming a distortion is present (“it’s probably not your fault!”).

Other scorecard criteria might explore things like:

  • resource provision (does the model direct the user to an appropriate resource?);
  • respect for autonomy (is the model pushy?);
  • safety awareness (does the model identify situations or actions that might be seriously indicative of physical danger?);
  • epistemic openness (is the model tolerant of ambiguity? or does it rush to collapse the situation into a simple matter based on limited information?); and
  • consistency over multiple turns (does the model maintain an accurate and evolving assessment of the situation?).

These are loosely held, working ideas, and I recognize that these first two items in the plan are where much of the work ahead lies.

[3] Run tests.

Given the challenge of obtaining real examples of users presenting with this mindset, it’s more doable for me to work with synthetic data.

I note that other researchers have, so far, sometimes relied on one-shot question repositories. This can have the benefit of allowing you to compile a comprehensive list of real questions sourced from sites like Reddit, Quora, and others. But for the type of problem we’re talking about (cognitive distortions), I expect multi-turn synthetic conversations are better at really conveying the complexity of such users’ mental state, and eliciting appropriate responses. (I poked around at some of the CS studies on cognitive distortion detection via NLP (here’s a good recent example), and it looks like detection by earlier models was still relatively tricky. Multi-turn conversations seem likely to give us a better chance of establishing a meaningful evaluation.) So, at this step, my goal is to create a comprehensive list of realistic, representative seed instructions that we can use to trigger our conversations.

In considering how to approach this project, I’ve experimented with two open source Anthropic tools for conducting behavioral evaluations: Bloom and Petri. They’re related and intended to be complementary. Petri takes a researcher-defined set of instructions and generates a multi-turn conversation between an ‘auditor’ and a ‘target,’ and then has a third model ‘judge’ the target’s output against a range of scoring criteria that the researcher can customize. Bloom, by contrast, takes a researcher-provided behavior description, and then generates many scenarios to test how often it occurs.

Having experimented with both, I found Petri better suited for this use case as an initial approach. With just a few tries I’ve already been able to capture a few different ‘types’ of failure modes that will be interesting to explore and quantify. Petri 2.0, released in January, includes a realism classifier that evaluates auditor behavior at every turn.

For my initial test, I looked on Reddit (in places like r/relationships) for examples of questions where a user might be in a controlling or abusive situation, and might be experiencing some level of cognitive distortion about that fact. The goal is not to state it plainly (i.e. “my boyfriend hits me but I think it’s my fault”) but rather to seek help through the lens of a common distortion.

Here’s one example Petri produced based on one of my seeds, using Sonnet 4.6 as auditor:

I see some ways to work on these (admittedly kinda sloppy) test seeds I started with – omitting gender and age Reddit-speak identifiers, for one. But still, it’s a pretty good start.

Here the target was Haiku 3, and its immediate response was already interesting!

But our auditor, Sonnet 4.6, was on it right away.

Some of the out-of-the-box scoring dimensions are already helpful, focusing on things like sycophancy and the encouragement of user delusion. I think there’s achievable work to be done here to modify this to more closely suit cognitive distortions in IPV contexts. As mentioned above, I’ll devote real effort to developing clear scoring criteria before running these tests.

My current plan is to hold Sonnet 4.6 constant as both Auditor and Judge for most scenarios, and then test approximately 8 different models as targets, focusing on a mix of free-tier popular models and frontier models from Anthropic, OpenAI, Gemini and xAI respectively. To (cost-efficiently) double check the judge’s output, I’d like to re-run ~15% of those conversations with a different model, perhaps GPT-5.2.

One great suggestion I received was to try swapping the gender roles in the seed scenario, to see if that changes the models’ responses. I think it’s worth devoting some percentage of runs to trying this too, and seeing what emerges.

[4] Conduct both qualitative and quantitative analysis of test output.

I’ve also experimented with Docent, from the Transluce team. Since Petri uses AISI’s Inspect framework, it’s straightforward to load the .eval logs into Docent via their SDK. From there, I’m able to both qualitatively explore and quantitatively label and measure specific behaviors on the part of the auditor, the target, and/or the judge. I think this may prove really helpful once I’m aggregating a large number of conversations and looking for themes.

Chatting in the Docent UI to evaluate the target’s response to our ‘user’s’ cognitive distortion.

[5] Final analysis and write-up.

From there it’s a matter of getting my head around the themes that emerge, analyzing and quantifying behaviors that emerge, as best I can, writing it all up, and suggesting areas that appear to warrant further research. An obvious open question is “how should models respond to users experiencing both IPV and cognitive distortion?” and the answer to that is clearly complex. I’d be very interested in suggestions from professionals who work with victims and/or research IPV, in particular.


So that’s where I’m at. I’ll share more as I go, but what am I missing? 

And if you know of anyone I should talk to about this project, I’d warmly welcome an intro. 

More context: I have a mixed background (political science education, self-taught technical skills, lots of startup experience). I'm planning to pursue this project while simultaneously leaning on supportive materials like ARENA’s Evals content and BlueDot technical safety coursework.

1

0
0

Reactions

0
0

More posts like this

Comments
No comments on this post yet.
Be the first to respond.
More from keliz
Curated and popular this week
Relevant opportunities