I work in advisory for private equity clients, primarily on diligence and investment decisions around AI and software tools. As you might expect, the recent SaaSpocalypse has created real tension in the investor community around how to evaluate the AI capabilities of a given tool. I am trying to build something that sits somewhere between investment analysis and AI efficacy evaluation.

Core Problem

Most enterprise tools now make tall AI claims: "Our AI can parse documents with xx% accuracy", "our models are fine-tuned to extract unstructured documents with xx% accuracy". The people receiving these claims (primarily investors, but also buyers/procurement teams) have no reliable way to verify them.

The obvious answer is running evals, which the ML community and the pre-launch product-testing community have already identified as a viable mechanism for assessing these systems. However, that infrastructure doesn't really exist outside of those silos. I might be wrong here, and folks can correct me if they have seen eval testing happening outside of frontier-lab benchmarks or pre-launch product testing.

Ideally, there would be infrastructure accessible to the open market: a platform or service like Braintrust that helps investors/procurement teams run eval tests themselves before underwriting specific AI claims. In other words, a process where they build specific test cases, gather ground truth, and run evals to get a genuine sense of the AI tool they are purchasing or investing in.
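To make that concrete, here is a minimal sketch of what such an outside-in eval harness could look like, assuming the vendor exposes some callable interface. The test cases, the `vendor_extract` placeholder, and the strict field-level exact-match scoring are all hypothetical illustrations, not any particular product's API:

```python
import json

# Hypothetical ground-truth test cases built by the diligence team:
# each pairs a real input document with fields a human has verified.
TEST_CASES = [
    {"document": "invoice_001.pdf", "expected": {"total": "1,250.00", "vendor": "Acme Corp"}},
    {"document": "invoice_002.pdf", "expected": {"total": "310.75", "vendor": "Globex"}},
]

def vendor_extract(document_path: str) -> dict:
    """Placeholder for the vendor tool under evaluation; wire this up to its real endpoint."""
    raise NotImplementedError

def score_case(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields the tool got exactly right (strict exact match)."""
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return correct / len(expected)

def run_eval() -> None:
    scores = []
    for case in TEST_CASES:
        predicted = vendor_extract(case["document"])
        scores.append(score_case(predicted, case["expected"]))
    print(json.dumps({
        "n_cases": len(scores),
        "mean_field_accuracy": sum(scores) / len(scores),
    }, indent=2))

if __name__ == "__main__":
    run_eval()
```

Even a small, buyer-owned test set like this seems more informative for underwriting than the vendor's self-reported benchmark numbers, which is the gap I am trying to describe.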

Where I am less sure, and would love input

1) More broadly, does the above resonate as a real pain point? And does eval testing make sense as the right fix to this problem at scale?
2) To the experts in eval design: how would you set up this type of infra outside-in? Are there specific steps/protocols you would follow to make it effective and useful?
3) How does this community think about the validity of LLM-as-judge scoring? What guardrails would you deploy when testing a system like this? (A rough sketch of what I have in mind follows this list.)
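For context on question 3, here is a rough sketch of the kind of guarded LLM-as-judge loop I am imagining, assuming an OpenAI-style chat API. The rubric, the choice of judge model, and the human-agreement calibration step are my own illustrative assumptions, not an established protocol:

```python
from openai import OpenAI  # assumes an OpenAI-style chat API; any judge model could be swapped in

client = OpenAI()

JUDGE_RUBRIC = (
    "You are grading a vendor tool's output against a human-verified reference answer.\n"
    "Score 1 if the candidate matches the reference in substance, otherwise 0.\n"
    "Reply with only the digit 0 or 1."
)

def llm_judge(reference: str, candidate: str) -> int:
    """Single judge call, always grounded in a ground-truth reference (guardrail #1)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical judge model choice
        temperature=0,   # guardrail #2: deterministic scoring
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Reference:\n{reference}\n\nCandidate:\n{candidate}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

def judge_agreement_with_humans(labeled_examples: list[dict]) -> float:
    """Guardrail #3: before trusting the judge at scale, measure how often it
    agrees with human graders on a small labeled calibration set."""
    matches = sum(
        1 for example in labeled_examples
        if llm_judge(example["reference"], example["candidate"]) == example["human_score"]
    )
    return matches / len(labeled_examples)

# Illustrative usage: only rely on LLM-as-judge scores if they track human grades closely.
# calibration = [{"reference": "...", "candidate": "...", "human_score": 1}, ...]
# if judge_agreement_with_humans(calibration) < 0.9:
#     print("Judge disagrees with humans too often; fall back to manual grading.")
```

Whether calibrating against a small human-labeled set is the right guardrail, or whether there are better established practices, is exactly what I am hoping to learn here.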

I am still very new to this world and am generally curious how folks see this evolving. A lot of my time is spent talking to investment teams at funds, who often oversimplify these tests, and I would love genuine feedback from this community as well. The core idea is to separate truth from noise in the claims a lot of software vendors are making these days.
