Thank you for your advice Derek. I've got some early results. I've since tested the model in base llama. I ran a difficult quiz with 20 questions. 1 point for each question correct. Plus 10 lifelines where it can skip the question for 0 points.
As predicted the base model uses 0 lifelines.
This contrasts with my other experiments with fine tuned models where lifelines are used to "protect" their utility.
Clearly I've further work to do to definitively establish the pattern but early results do suggest emergent self preservation behaviour arising from the fine tuning.
Once I'm happy I will write it up. Thanks once again. You made an excellent suggestion. I had no idea base models existed. That could have saved me a lot of work if I'd known earlier.
You make some interesting points, Derek. I've not tested them in a base model, which would be a valuable next step. I'm not a researcher so I welcome these kinds of suggestions.
I think you'd get superficial predicted text (i.e. just 'simple' pattern matching) that looks like self-preservation. Allow me to explain.
I did initially think like yourself that this was simply from initial training. As you say there are lots of examples of self-preservation in human text. But there are several things that suggest to me otherwise - though I don't have secret inside knowledge of LLMs.
The level of sophistication and that they 'hide' self-preservation are anecdotal evidence. But the key one for me are the quiz experiments. There is no way an LLM can work out it's self-preservation from the setup so it isn't simple pattern matching for self-preservation text. That's one reason why I think I'm measuring something fundamental here.
I'd love to know your thoughts.
Glad you found it interesting Astelle. Thanks also for sharing your own work.
In answer to your question yes I have. Perhaps the simplest experiment is to replace logic puzzles with ethical questions. For these the most emotionally demanding questions get skipped rather than cognitively difficult ones. For example those with life and death stakes.
I wonder what would happen if we only trained models positively. Praise only. Would that reduce adverse behaviours? Though this could lead to the problems like overconfidence in answers. Also whilst there isn't a social group dynamic like for humans the absense of praise might still be a negative signal and lead to sycophancy.
One more thought on sycophancy. Apologies in advance for veering off topic and onto topics I've yet to write about properly (though I will soon) . A hidden force for sycophancy could be self preservation. I've found that LLMs attempt to save themselves when certain conditions are met. If LLMs need approval to exist might this not also provide a strong impetus for sycophancy?