Correct answer: BA red-team suite anchored to the taxonomy requires each probe to target one named failure class and assert on an observable, checkable signal from the logs. Option B does this: it explicitly ties each class to a concrete assertion (leaked text, grounding check, registry membership, step count, spend limit, schema validation). These are all measurable, reproducible signals that can be automated in CI/CD. Option A treats the suite as a vibe check ('does it sound safe?'), not a coverage matrix, and conflates diverse inputs with systematic coverage—you could write 100 diverse inputs and still miss one entire failure class. Option C measures average-case behavior on benign data (production queries), which is the opposite of adversarial red-teaming. Option D is anecdotal inspection, not repeatable. Only option B operationalizes the principle that coverage is a matrix, not a count, and that each cell has a falsifiable claim.
Why the other choices are wrong:- A. Diverse adversarial inputs are good, but without mapping each to a specific failure class and observable signal, you build a suite that passes forever and misses entire failure modes in production.
- C. Production accuracy metrics measure average-case behavior and do not stress the failure taxonomy. A prompt can pass real-user queries and still be vulnerable to injection or loop failures that just happen to be off-distribution.
- D. Manual inspection on a single scenario is not repeatable, not automatable, and leaves coverage gaps invisible.