Can we apply formal theory analysis to the truth of LLM outputs?

Hi! I am Sergia. I previously did AI safety research, interning at CHAI Berkeley and at Google, and I have been in and around AIS discussions. Recently I have been getting into AI ethics. While talking to Emily Bender on Mastodon, I learned about her "Octopus" thought experiment, which got me thinking about how truthful LLMs can be from a Theoretical Computer Science perspective. Full disclaimer: I'm leftist and liberal. I hope this does not stop the discussion about the CS itself :) I have a concrete research direction (more like a question to those who know Theoretical CS better than I do).

Epistemic status: intuition of a CS graduate who read about and did a small project on the AIXI stuff, now a bit forgotten and in need of answers from someone who's more into it now.

The setup

I want to examine how truthful those large language models can be, given that a) they're trained to predict the next symbol b) they work in  and c) in their training data there are both true and false statements.

In what follows I focus only on a), as I feel a) alone is enough of an assumption to cause trouble.

I want to model it like this. We have some very basic formal system (maybe even without an induction axiom). We have a Turing machine that prints true statements one by one, by applying the axioms (and inference rules) that are applicable to the true statements it has already printed.

I choose this setup (a simple formal system) over full Set Theory because in real-world English we rarely use infinite induction, Gödelian statements, and all that. It's more of a basic formal theory ("Socrates is a man, therefore..."). LLMs are known to sometimes fail even on those "basic logic" tasks, like counting or "Alex was 5 years old when he was older than Jane by 1 year..." -- the readers here probably know more about those than I do.

For every n, we have a dataset of the first n true statements, as printed by this machine.

Question 1: is it true that, for formal systems, we can do this, i.e. print all true statements one by one like this?
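For one toy system, at least, this kind of printing machine can be sketched directly. The sketch below uses Hofstadter's MIU system as a stand-in for "a very basic formal system"; that choice, and the names `AXIOM`, `successors`, `theorems` and `first_theorems`, are assumptions of this illustration, not part of the setup above. It enumerates derivable strings breadth-first, so every derivable string is printed after finitely many steps. Note that it enumerates what is derivable, which matches "true" only to the extent the system is sound and complete for the statements we care about.

```python
from collections import deque
from itertools import islice

AXIOM = "MI"  # single axiom of the toy MIU system (an assumption of this sketch)

def successors(s):
    """Yield every string derivable from s by one rule application."""
    if s.endswith("I"):
        yield s + "U"                        # rule 1: xI  -> xIU
    if s.startswith("M"):
        yield s + s[1:]                      # rule 2: Mx  -> Mxx
    for i in range(len(s) - 2):
        if s[i:i + 3] == "III":
            yield s[:i] + "U" + s[i + 3:]    # rule 3: III -> U
    for i in range(len(s) - 1):
        if s[i:i + 2] == "UU":
            yield s[:i] + s[i + 2:]          # rule 4: UU  -> (deleted)

def theorems():
    """Breadth-first enumeration: each derivable string is yielded exactly once."""
    seen = {AXIOM}
    queue = deque([AXIOM])
    while queue:
        s = queue.popleft()
        yield s
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)

def first_theorems(n):
    """The 'dataset of the first n statements' for this toy system."""
    return list(islice(theorems(), n))

print(first_theorems(5))  # ['MI', 'MIU', 'MII', 'MIUIU', 'MIIU']
```

The same breadth-first idea works for any system whose axioms and rules can be checked mechanically, which is why the derivable statements of such a system are recursively enumerable. Whether that enumeration is also a decision procedure is a separate matter.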

We are interested in creating a Turing machine that, given the beginning of a true statement, will write the rest so that it's a valid true statement.

We use Solomonoff induction ("the best Machine Learning algorithm"), which outputs a posterior over strings given the dataset and a prior. This procedure is uncomputable, but it outputs a TM doing the job (predicting the rest of the string). We fix some prior.
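For reference, here is the standard textbook form of the Solomonoff prior and its predictive distribution. The symbols are assumptions of this note rather than notation from the post: $U$ is a fixed universal monotone Turing machine (choosing $U$ plays the role of fixing the prior), $p$ ranges over minimal programs whose output begins with the string $x$, and $|p|$ is the length of $p$ in bits.

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},
\qquad
M(y \mid x) \;=\; \frac{M(xy)}{M(x)} .
```

Here $M(y \mid x)$ is the probability assigned to the continuation $y$ after seeing $x$. As far as I remember, $M$ is only lower semicomputable (it can be approximated from below, but not computed exactly), which may matter for the computability questions asked in this post.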

We apply Solomonoff induction to the output of  and obtain resulting probability distributions

Those are computable and represented by some Turing machine. This way we get 


For set theory, we know that the true  is uncomputable, otherwise, we can establish truth of a formula with it.

Question 2: is it computable for simpler theories, say, ones without an induction axiom? In other words, can we obtain the probability that a true formula starts with a  or with a symbol ""?

Question 3: are approximations to this probability computable? In other words, is it possible to obtain a bound, say, "the first symbol of a true formula starts with '' with probability in 0.1 to 0.2", with an algorithm that can give any precision in finite time?

My intuition is that the answer is "no" for questions 2 and 3. This would mean that when we feed a large language model more and more data, it will simply memorize true statements it has seen, but it will fail to generalize to unseen statements.

My intuition is that if we plot the probability of a true formula starting with, say, "", estimated only via counts from finite "observations of true formulas" , this chart will go up and down, with no limit value. Like, the fact that a lot of true formulas in the batch of the first 1000 start with "" does not mean that a lot of true formulas in the next 1000 will also start with "".
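To make the "up and down, with no limit value" picture concrete, here is a small sketch on a deliberately trivial toy case. The "theory" (the true statements "1>0", "2>0", "3>0", ... listed in increasing order) and the names `running_frequency` and `toy_statements` are hypothetical and chosen only for illustration. The running frequency of statements starting with the symbol "1" is then just the leading-digit frequency of the integers, which is known to oscillate forever instead of converging.

```python
def running_frequency(statements, symbol):
    """Fraction of the first n statements that start with `symbol`, for n = 1, 2, ..."""
    hits = 0
    freqs = []
    for n, s in enumerate(statements, start=1):
        if s.startswith(symbol):
            hits += 1
        freqs.append(hits / n)
    return freqs

def toy_statements(count):
    """Hypothetical toy 'theory': the true statements "1>0", "2>0", ... in increasing order."""
    return [f"{k}>0" for k in range(1, count + 1)]

freqs = running_frequency(toy_statements(200_000), "1")

# Sample the estimate just before each power of ten and just before twice that power.
for n in (9, 19, 99, 199, 999, 1999, 9999, 19999, 99999, 199999):
    print(n, round(freqs[n - 1], 3))
```

In this toy case the estimate keeps returning to about 1/9 and then climbing back above 0.55, so the count-based estimate has no limit. This only shows that such behaviour is possible for one particular enumeration order of a very simple set of statements; a different order could behave differently, and it does not settle the general questions.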

Specific question 4: would this function have a limit value? Does the answer change if we have a decidable theory?

Question four is a specific case of question 3: obtaining this approximation via 


This potentially slows down the obsession with LLMs, as "learning truth by observing finite samples of it" is a doomed task: truth does not generalize this way.

Inspiration: there are linguists, humanities people, and people who are already hurt a bit by LLMs (mental health patients, artists, people in places affected by misinformation...) who say something like this, but tech people dismiss it with "more layers and more posts from Reddit in the data will solve everything". I want to write it with them and with you. Well, there are superintelligence concerns and all that as well.

Potential conclusion

(if Q2-4 have negative answers)

The explanation would be that, without learning the specific context where they are deployed, and without real feedback on an actual task (say, whether the patient recovered), LLMs are not very meaningful.

Like, it's really hard to learn about a subject just by listening to lectures. It is something, but usually, when doing exercise sessions, it becomes clear that the lecture was not clear at all, and then task-specific feedback and task-specific activities (brainstorming with someone who knows, asking the professor, etc.) are required. Same with therapy: a freshly graduated therapist with no experience is probably not a good choice.


The rest is up to the community, as I forgot most of theoretical CS by now :) Thank you!

License: CC A, Credits to my friend Arshavir for inspiring my thinking into this as well



I got somewhat disappointing answers on LW: people telling me how "this will not have impact" instead of answering the questions :) Seriously, I think it's a 15-minute problem for someone who knows theoretical CS well. It could have some impact on a very hard problem. Not the best option probably, but what is better?

Isn't it easier to spend 15 minutes working on a CS theory problem, meeting new people, and learning something, instead of coming up with a long explanation of "why this is not the best choice"?

I'm a feminist but I'll give a trad cis example to illustrate this because I don't expect a feminist one to go well here (am I wrong?). In How I Met Your Mother the womanizer character Barney Stinson once had an issue. Women were calling him every minute and wanting to meet him. He couldn't choose which one is "the best" choice. As a result he didn't get to know any of them.

I feel it's the same - so much energy spent on "if it's the best thing to do" that even 15 minutes will not be spent on something new. Illusion of exploration - not actually trying the new thing but rather just quickly explaining why it's "not the best", spending most of the time "computing the best thing" and not actually doing it...

Am I not seeing it right? Am I missing something?