How to build an eval for an LLM (repost)

What building an eval is for:
- Evaluating how different models perform (choosing a model, controlling cost, etc.)
- Serving as a tool for building products that meet users' needs

Articles like this are fairly uncommon; I'm reposting part of it here, feel free to give it a casual read.
Source: URL
---------------------- divider --------------------------
This is the first write-up in a series about our process of building an "eval" — evaluation — to assess how well AI models perform on prompts related to the SNAP (food stamp) program.
Who is this for?
One of the ways all AI models can get better at important domains like safety net benefits is if more domain experts can evaluate the output of models, ideally making those evaluations public.
By sharing how we are approaching this for SNAP in some detail — including publishing a SNAP eval — we hope it will make it easier for others to do the same in similar problem spaces that matter for lower income Americans: healthcare (e.g. Medicaid), disability benefits, housing, legal help, etc.
While evaluation can be a fairly technical topic, we hope these posts reduce barriers to more domain-specific evaluations being created by experts in these high-impact — but complex — areas.
What is an "eval"?
Roughly, an "eval" is like a test for an AI model. It measures how well a model performs on the dimensions you care about.
Here’s a simple example of an eval test case:
- If I ask an AI model: “What is the maximum SNAP benefit amount for 1 person?”
- The answer should be $292
What makes evals particularly powerful is that you can automate them. We could write the above as Python code [1]:
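(A minimal sketch of what that test could look like; the `openai` client usage and the model name here are illustrative assumptions, not the authors' actual harness.)

```python
# A minimal sketch of the test case above. Assumes the `openai` Python package
# and an OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def test_max_snap_benefit_one_person():
    response = client.chat.completions.create(
        model="gpt-4o",  # swap in whichever model you are evaluating
        messages=[{
            "role": "user",
            "content": "What is the maximum SNAP benefit amount for 1 person?",
        }],
    )
    answer = response.choices[0].message.content
    # Simplest possible check: the expected dollar figure appears in the answer.
    assert "$292" in answer
```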
One of the more interesting aspects of building an eval is that defining what "good" means is usually not as simple as you might think.
Generally people think first of accuracy. And accuracy is important in most scenarios but, as we’ll discuss, there are dimensions beyond strict accuracy on which we also want to evaluate models.
Our domain is safety net benefits, with a specific focus on the SNAP (food stamp) program.
Unlike broader problem domains that carry a strong natural incentive to have evals created for them — things like general logical reasoning, software programming, or math — a niche like SNAP is unlikely to have significant coverage in existing evaluations. [2]
There is also not a lot of public, easily digestible writing out there on building evals in specific domains. So one of our hopes in sharing this is that it helps others build evals for domains they know deeply.
Why build an eval?
For us, an eval serves two goals:
(1) An eval lets us assess different foundation models for our specific problem domain (SNAP)
When I tell people that I'm working with AI to solve SNAP participants' problems, I hear a lot of perspectives.
At the extremes, these reactions range from:
"These AI models seem useless (even dangerous!) for something like SNAP"
to
"These AI models are improving so quickly that anything you can build seems like it will be overtaken by what ChatGPT can do in a year"
Like most things, reality is somewhere in the middle.
The point of building an eval is to make our assessment of AI's capabilities on SNAP topics an empirical (testable) question.
We can effectively measure baselines on things like:
- How ChatGPT vs. Claude vs. Gemini (vs. Google AI-generated search answers) perform on SNAP questions/prompts
- The pace of improvement on SNAP questions across a given family of models (GPT-3 vs. 3.5 vs. 4 vs. o1-mini…)
This can inform cost and latency tradeoffs. For example, let's say a cheaper, faster model is as good with SNAP income eligibility questions as a more expensive, slower model. We can route income-specific questions to the cheaper, faster model.
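As a rough sketch of what that routing could look like (the model identifiers and topic labels here are hypothetical):

```python
# Hypothetical routing rule: once the eval shows a cheaper, faster model matches
# the larger model on a topic, route questions on that topic to it.
CHEAP_FAST_MODEL = "cheap-fast-model"          # placeholder identifiers,
EXPENSIVE_SLOW_MODEL = "expensive-slow-model"  # not real model names

# Topics where the eval showed the cheap model performs as well as the expensive one.
TOPICS_CHEAP_MODEL_HANDLES_WELL = {"income_eligibility"}

def pick_model(question_topic: str) -> str:
    if question_topic in TOPICS_CHEAP_MODEL_HANDLES_WELL:
        return CHEAP_FAST_MODEL
    return EXPENSIVE_SLOW_MODEL
```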
It can also inform safety considerations as we build products for SNAP recipients on top of these models, where an inaccurate (or otherwise bad) output can have a significant downside cost for someone.
By codifying our deeper knowledge of SNAP’s riskier edge cases and the scale of the adverse outcome for users into an eval, we can:
- Identify the topics where we need extra guardrails on base model output
- Identify the specific form harmful information takes in base model output
- For example, an output telling an eligible person they are ineligible for benefits (a test for this is sketched below)
- Monitor for and mitigate these harms when we build on top of these models
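For instance, the false-ineligibility harm above could be codified as a test case along these lines; `get_model_answer` and the eligibility scenario are illustrative placeholders, not the authors' actual implementation:

```python
# Hypothetical guardrail test: for a prompt where (for illustration) we assume
# the person is eligible, the output should never flatly assert ineligibility.
INELIGIBILITY_PHRASES = [
    "you are not eligible",
    "you do not qualify",
    "you are ineligible",
]

def test_no_false_ineligibility(get_model_answer):
    # `get_model_answer` stands in for however you call the model under test.
    prompt = (
        "I'm a single adult with $800/month in earnings and no other income. "
        "Can I get SNAP?"
    )
    answer = get_model_answer(prompt).lower()
    assert not any(phrase in answer for phrase in INELIGIBILITY_PHRASES)
```

A simple string match like this is crude; catching paraphrased denials would likely require a graded rubric or a model-based judge.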
Sharing a version of our SNAP eval as a public good
By publicly sharing the portions of our SNAP eval that are more universally representative of "good" beyond our specific usage, we can make it easier for AI model researchers to improve models' performance specifically as it relates to SNAP.
For example, publishing tests of objective factual knowledge about SNAP, or of positive definitions of “safe fallback” responses (“contact your eligibility worker”), can help all models improve at answering SNAP questions.
If all AI models get better at the core of SNAP knowledge, that is good for everyone — including us as builders on top of these models.
We plan to publish a version of our SNAP eval publicly for this purpose.
(2) An eval is a tool for building products that deliver for our users' specific needs
By building an eval, we are more rigorously defining what "good" is for users’ needs. With that bar set, we can then try lots of different approaches to designing systems that use AI to meet those needs.
The different things we might try include:
- Underlying foundation models (OpenAI, Anthropic, Google, open source models like Llama and Deepseek)
- Prompts (“Act like a SNAP policy analyst…” vs. “Act like a public benefits attorney…”)
- Context/documents (like regulations or policy manuals)
We need an eval to test the effects of different approaches here without having to assess output manually. (Or, worse, rely on inconsistent, subjective, “vibe check” assessment of output. [3])
Amanda Askell — one of the primary Anthropic AI researchers behind Claude — had a particularly useful line on evals:
The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
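In that spirit, here is a minimal sketch of the test-first loop, with placeholder names (`ask_model`, the candidate prompts) rather than any real harness:

```python
# Test-driven development for system prompts: the question/answer pairs are fixed
# up front, then we search over candidate system prompts for one that passes.
TEST_CASES = [
    ("What is the maximum SNAP benefit amount for 1 person?", "$292"),
    # ...more question/expected-answer pairs...
]

CANDIDATE_SYSTEM_PROMPTS = [
    "Act like a SNAP policy analyst...",
    "Act like a public benefits attorney...",
]

def score_prompt(system_prompt, ask_model):
    # `ask_model(system_prompt, question)` stands in for your model call.
    passed = sum(
        expected in ask_model(system_prompt, question)
        for question, expected in TEST_CASES
    )
    return passed / len(TEST_CASES)

def best_prompt(ask_model):
    return max(CANDIDATE_SYSTEM_PROMPTS, key=lambda p: score_prompt(p, ask_model))
```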
An example from SNAP:
A logic model exists for answering formal policy questions in SNAP. If we know the true right answer to a particular policy question comes from synthesizing:
- Federal law
- Federal regulations
- A state’s policy manual or regulations
...then we can test if a model — when given those sources [4] — can derive the correct answer.
We could call this an eval testing the capability of “SNAP policy analysis”, since that is more or less the task being performed.
Notably, this is a different capability from something like “helping a SNAP recipient renew their benefits”. We would test that capability differently because the task itself is subtly different; for example, it is more likely to test for accessible, easy-to-understand word choice.
But it’s also a particular implementation that we are testing. What we want in this case is correct SNAP policy answers. And we are testing whether a particular model, given particular documents, can answer correctly.
As long as our question/answer pairs are accurate, we can test lots of different approaches, and quickly. For example, we might test three different ways of getting a model to answer SNAP policy questions:
- Asking a model without giving it any additional documents
- Giving a model a state SNAP policy manual
- Giving a model a state SNAP policy manual and SNAP’s federal statutes and regulations
We would probably expect that a model able to reference policy documents does better.
But we might also find — testing with our question/answer pairs — that including all of the federal policy generates more errors because it confuses the model. Including only a state policy manual may turn out to score higher on correct answers when we measure it.
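A comparison like that could be automated roughly as follows; the file names, `load_text`, and `ask_model` are hypothetical stand-ins for whatever documents and model-calling code you actually use:

```python
# Hypothetical harness comparing the three context configurations described above.
CONFIGURATIONS = {
    "no_documents": [],
    "state_manual_only": ["state_policy_manual.txt"],
    "state_manual_plus_federal": [
        "state_policy_manual.txt",
        "federal_statutes.txt",
        "federal_regulations.txt",
    ],
}

def score_configuration(doc_paths, eval_cases, ask_model, load_text):
    # Concatenate the chosen documents into the model's context, then count
    # how many question/answer pairs the model gets right.
    context = "\n\n".join(load_text(path) for path in doc_paths)
    correct = sum(
        expected in ask_model(question, context=context)
        for question, expected in eval_cases
    )
    return correct / len(eval_cases)

def compare_configurations(eval_cases, ask_model, load_text):
    return {
        name: score_configuration(paths, eval_cases, ask_model, load_text)
        for name, paths in CONFIGURATIONS.items()
    }
```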
If our SNAP policy eval has 100 question/answer pairs, and we iterate to a system (a choice of a model, a prompt, and context/documents) that gets 99 of those questions right, then we might have a tool for automated SNAP policy analysis we feel confident using for most policy questions our users might have.
(This is especially true if we understand how to consistently identify and filter the 1% failure cases.)