Structured Outputs and Reliability Engineering · Quiz

8 items · Bloom: Analyze:3, Apply:2, Evaluate:2, Create:1

Q1 Analyze mcq_single

An extraction prompt asks a model to 'Return invoice line items as JSON.' On live traffic, roughly 3% of calls fail because the model wraps the output in ```json markdown fences. You want to completely eliminate this class of failure without latency penalties. Which output format best matches this requirement?

Correct answer: C
Schema-constrained decoding enforces a grammar at token-sampling time, eliminating structural violations like markdown fences entirely. The provider prevents the model from emitting invalid JSON by restricting token choices during generation. This closes the parser-fragility failure class completely, with acceptable latency for batch ETL and most real-time scenarios.
Why the other choices are wrong:
  • A. Improving the parser only handles the fence problem after it occurs; it does not prevent the model from wrapping output in the first place.
  • B. XML-tagged sections can survive embedded braces, but this does not directly address markdown fence wrapping; the model would still emit the fences, they would just sit inside the XML.
  • D. Function-calling-as-format ensures typed output, but the real failure class here is structural violations. The model still might invent or abdicate meaning while satisfying the type signature.
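The failure class in Q1 is easy to reproduce: a fence-wrapped response breaks a naive `json.loads` call. A minimal sketch (payload and field names are illustrative, not from the question):

```python
import json

FENCE = "`" * 3  # the literal ``` markdown fence

# A fence-wrapped response -- the ~3% failure class described above.
fenced = FENCE + 'json\n{"line_items": [{"sku": "A1", "qty": 2}]}\n' + FENCE

try:
    json.loads(fenced)          # naive parser assumes bare JSON
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False           # breaks on the markdown fences

# Option A's after-the-fact patch: strip fences before parsing. It handles
# this one symptom but cannot prevent the model from emitting fences, which
# is why schema-constrained decoding closes the class at generation time.
def strip_fences(raw: str) -> str:
    raw = raw.strip()
    if raw.startswith(FENCE):
        raw = raw.split("\n", 1)[1]     # drop the opening ```json line
        raw = raw.rsplit(FENCE, 1)[0]   # drop the closing fence
    return raw.strip()
```
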
Q2 Apply mcq_multi

You are implementing schema-constrained decoding for a vendor-name extraction prompt. The model occasionally emits empty strings or zero values when uncertain instead of refusing the extraction. You need to harden the prompt against at least three adversarial input classes: (1) embedded curly braces in vendor names, (2) numeric fields with currency symbols, (3) intentionally malformed input. Which of the following defense mechanisms would be most effective? (Select 2)

Correct answer: B, C
The core issue is semantic abdication—the model satisfies structural constraints while lying semantically (empty string, zero value). Option B reallocates prompt budget to teach field semantics, removing ambiguity about what 'amount' means. Option C detects semantic violations post-generation (empty, zero) and triggers a repair loop to correct the specific field, converting abdication into targeted retries. Together, they address both prevention (clearer semantics) and recovery (repair on failure).
Why the other choices are wrong:
  • A. Schema validation can only enforce structure, not semantic honesty. Rejecting zero-values in the schema might prevent zero-amounts, but empty strings and other abdication patterns would persist; the problem is the model's behavior, not the validation rule.
  • D. Higher temperature increases sampling randomness, making the model less predictable. This would likely worsen semantic abdication by encouraging more erratic outputs, not fewer.
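Options B and C can be sketched together: a semantic check that flags abdication patterns, driving a targeted repair loop. The field names and the `regenerate` hook are illustrative assumptions, not a specific provider API:

```python
# Semantic checks: the output may be structurally valid JSON, but empty
# strings and zero amounts signal abdication rather than honest extraction.
def semantic_violations(record: dict) -> list[str]:
    problems = []
    if not str(record.get("vendor_name", "")).strip():
        problems.append("vendor_name must be a non-empty string")
    if not record.get("amount"):
        problems.append("amount must be a non-zero number")
    return problems

# Repair loop: feed the specific violations back so each retry targets the
# offending fields instead of regenerating blindly.
def repair(record: dict, regenerate, max_retries: int = 2) -> dict:
    for _ in range(max_retries):
        problems = semantic_violations(record)
        if not problems:
            return record
        record = regenerate(record, problems)
    if semantic_violations(record):
        raise ValueError("semantic repair did not converge; escalate")
    return record
```

Note that a vendor name like `Acme {Braces} Ltd` passes cleanly: the braces live inside a JSON string, so only the semantic checks matter here.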
Q3 Evaluate mcq_single

Your team is building a real-time chat agent that invokes a payment API mid-conversation. Users see tokens streaming word-by-word on the screen. You must choose between free-form JSON with careful prompt engineering, XML-tagged sections, or schema-constrained decoding. Which format best aligns with the production constraint of real-time token streaming?

Correct answer: C
In real-time streaming scenarios, users expect to see partial output immediately. XML-tagged sections support incremental parsing—you can extract and render `<claim>...</claim>` spans before the full response arrives. Schema-constrained decoding requires the full response to be generated before it is valid (schema verification happens on the complete output), introducing first-token delay and preventing progressive rendering. Streaming incompatibility is a critical constraint for chat UIs.
Why the other choices are wrong:
  • A. While schema-constrained decoding eliminates parser risk, it introduces first-token latency (50–200ms) because grammar validation happens at token-sampling time, and the full response must be complete before it is structurally valid. For streaming, this is a hidden-cost mistake.
  • B. Free-form JSON can stream, but engineer-designed escaping rules are fragile; they require maintaining complex parser heuristics and will occasionally fail on adversarial input (quotes, embedded braces).
  • D. Function-calling-as-format also requires the full response before validation, just like schema-constrained decoding. You cannot progressively render a tool call until its argument object is complete.
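The streaming advantage of tagged sections can be sketched with a buffer that emits each `<claim>` span as soon as its closing tag arrives, well before the full response completes. The tag name and chunk boundaries are illustrative:

```python
import re

CLAIM = re.compile(r"<claim>(.*?)</claim>", re.DOTALL)

def stream_claims(chunks):
    """Yield each claim as soon as its closing tag has streamed in."""
    buffer = ""
    emitted = 0
    for chunk in chunks:
        buffer += chunk
        matches = CLAIM.findall(buffer)
        for claim in matches[emitted:]:
            yield claim          # renderable before the response ends
        emitted = len(matches)

# Simulated token stream: the first claim is complete mid-stream.
chunks = ["<claim>Refund iss", "ued</claim><claim>Card ch", "arged</claim>"]
```

A schema-constrained or function-calling response offers no equivalent hook: the argument object is not valid JSON until the final brace arrives.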
Q4 Create mcq_single

You are designing an output contract for a support-ticket triage system where tickets route to specialized queues (billing, technical, account, other). Tickets with missing customer_id cause downstream database joins to fail silently, corrupting dashboards for hours before detection. Tickets with incorrect status values cause routing to wrong queues. Which combination of contract dimensions best prevents both failure classes?

Correct answer: B
This design addresses both failure classes: (1) customer_id as required + non-null prevents the silent corruption cascade—missing ID now fails loudly; (2) closed enum with 'other' sentinel plus confidence field prevents wrong-status routing—the model cannot invent new categories, and low-confidence outputs route explicitly to 'other'; (3) additionalProperties: false catches schema drift across model versions. This contract makes three explicit promises to downstream code: identity is guaranteed, status is closed, and format is stable.
Why the other choices are wrong:
  • A. Keeping fields optional defers the failure to downstream code. A validate-and-repair loop only catches syntactic violations, not the semantic problem that a missing customer_id is meaningful data loss, not a retryable error.
  • C. Nullable customer_id still allows null to propagate downstream. An open-ended enum defeats the entire purpose of enum discipline—the model can invent categories, and you are back to silent routing bugs.
  • D. Free-text status with post-processing regex is brittle. The model can still emit variations the regex misses, and you are now patching the schema after the fact instead of preventing the problem at generation time.
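Option B's three promises can be sketched as a hand-rolled validator (in production a JSON Schema library would enforce the same contract; field names follow the question):

```python
ALLOWED_KEYS = {"customer_id", "status", "confidence"}
STATUS_ENUM = {"billing", "technical", "account", "other"}

def validate_ticket(ticket: dict) -> list[str]:
    errors = []
    # (1) identity is guaranteed: required + non-null fails loudly
    if ticket.get("customer_id") in (None, ""):
        errors.append("customer_id is required and non-null")
    # (2) status is closed: enum with an explicit 'other' sentinel
    if ticket.get("status") not in STATUS_ENUM:
        errors.append("status must be one of " + str(sorted(STATUS_ENUM)))
    # (3) format is stable: the additionalProperties: false equivalent
    extras = set(ticket) - ALLOWED_KEYS
    if extras:
        errors.append("unexpected fields: " + str(sorted(extras)))
    return errors
```

A ticket that would have silently corrupted the dashboards now produces a loud, enumerable error list instead of passing through.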
Q5 Apply mcq_single

Your validate-and-repair loop catches a violation: the model returned status='BILLING?' (not in enum). You regenerate with error context: 'status must be one of [billing, technical, account, other]; your previous output had BILLING?, which is invalid.' On the next attempt, the model returns status='maybe_billing' (still wrong, just a different invalid value). What is the correct action?

Correct answer: B
When a repair loop produces different invalid values on successive attempts ('BILLING?' then 'maybe_billing'), the model is not converging toward the correct format—it is oscillating. This pattern signals that the model fundamentally misunderstands the enum constraint, not that it made a one-time typo. Continuing to retry amplifies the hallucination and burns through your cost budget. Escalating to human review or explicit fallback is the correct defensive move. The loop's cost ceiling (typically max_retries=2) is exactly designed to catch this.
Why the other choices are wrong:
  • A. Repeating the same error context that the model has already ignored twice will not suddenly work on a third attempt; you are just wasting tokens and cost.
  • C. Lowering temperature might reduce variance, but it will not fix the underlying problem. The model may simply lock onto a single wrong answer and repeat it deterministically.
  • D. Post-processing 'maybe_billing' to 'other' bypasses the contract. You just silently converted a probable misrouting into a catch-all bucket, hiding the true failure from your metrics.
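The escalation logic can be sketched as a repair loop with a cost ceiling that stops when successive attempts produce different invalid values. The `regenerate` hook, which would wrap your model call with the error context, is an illustrative assumption:

```python
STATUS_ENUM = {"billing", "technical", "account", "other"}

def repair_status(first_value: str, regenerate, max_retries: int = 2) -> str:
    seen = [first_value]
    value = first_value
    for _ in range(max_retries):
        if value in STATUS_ENUM:
            return value
        error = (f"status must be one of {sorted(STATUS_ENUM)}; "
                 f"your previous output had {value!r}, which is invalid")
        value = regenerate(error)
        seen.append(value)
    if value in STATUS_ENUM:
        return value
    # 'BILLING?' then 'maybe_billing': not converging -- stop retrying
    # and surface the full attempt history for human review.
    raise RuntimeError(f"repair not converging after {seen}; escalate")
```

The raised history (`seen`) is what distinguishes oscillation from a one-time typo, and it keeps the misroute out of your metrics instead of hiding it in a catch-all bucket.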
Q6 Analyze mcq_single

Your financial research prompt's output contract specifies: optional report_id, optional confidence, required research_findings (free text), optional data_sources (array of strings). A model returns valid output, but the research_findings is an empty string and data_sources is an empty array. Downstream, the team's BI tool treats empty array as 'no sources needed' (does not alert) and empty string as 'valid content' (renders a blank card). Where does this graceful-degradation gap first surface as a user-facing problem?

Correct answer: B
The gap is exactly in the schema design: optional fields with no semantic constraint create a path for the model to abdicate. Empty string and empty array are both structurally valid (they match the type), but semantically they signal model failure. Since the schema allows them, downstream code treats them as valid outputs. The BI tool renders a blank card, users see nothing, and the failure is invisible. There is no error, no alert, no retry—just silent data loss masquerading as normal operation.
Why the other choices are wrong:
  • A. The schema allows empty arrays and empty strings (they are valid values for the types), so validation passes. The tool does not raise an error.
  • C. The validate-and-repair loop checks *structural* validity against the schema, not semantic validity. Empty strings and arrays pass the structure check, so no retry is triggered.
  • D. research_findings is required and non-null in the contract, so a truly empty field would fail validation. But an empty *string* is not the same as an absent field—the field is present, it just has zero content. The API logs nothing.
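Closing this gap requires a semantic check alongside the structural one: the schema happily accepts `""` and `[]`, so the catch has to live in separate code. A sketch using the question's field names (treating an empty data_sources as abdication is itself a design choice):

```python
def is_structurally_valid(report: dict) -> bool:
    # required research_findings: present and a string -- "" still passes
    return isinstance(report.get("research_findings"), str)

def is_semantically_valid(report: dict) -> bool:
    findings = report.get("research_findings", "")
    sources = report.get("data_sources") or []
    # empty string / empty array are type-valid but signal model failure
    return bool(findings.strip()) and len(sources) > 0

# The exact output from the question: valid to the schema, blank to the user.
blank = {"research_findings": "", "data_sources": []}
```

Without the second check, `blank` sails through validation and the BI tool renders its blank card.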
Q7 Analyze mcq_single

A research assistant model outputs: 'In FY2023, TechCorp invested $2.5B in R&D, as reported in their 10-K filing. The company also announced a major acquisition of StartupX for $500M, making it their largest deal of the year. This follows Chen et al., 2023's research on AI acquisitions.' The provided context contains TechCorp's 10-K (which actually reports $2.3B R&D), mentions no acquisition, and the citation is not in any provided source list. Categorize the hallucinations present.

Correct answer: B
Intrinsic hallucination: the R&D figure is $2.5B but the provided 10-K says $2.3B—the model has access to the context but misrepresents it. Extrinsic hallucination: the acquisition detail is not in the provided context at all; the model invents it from training data. Fabricated citation: 'Chen et al., 2023' does not exist in any provided source list; the model invents a reference. These are three distinct failure modes requiring three different defenses: evidence gating for intrinsic, calibrated abstention for extrinsic, and citation allowlist for fabricated.
Why the other choices are wrong:
  • A. The R&D contradiction is not extrinsic. The context contains R&D data; the model just misrepresented it. Extrinsic is when the context is *silent*, not when it is contradicted.
  • C. There is no conflation here—no fusion of two real entities into a false claim. The R&D and acquisition are separate statements, each independently wrong for different reasons.
  • D. The R&D figure is contradicted by the context ($2.5B vs $2.3B) and the acquisition is entirely absent from it. Neither statement is accurate, and the acquisition is a clear hallucination.
Q8 Evaluate mcq_single

You are auditing a Q&A prompt used in production. The prompt includes: (1) instructions to 'answer only using provided context', (2) a required `abstain` field in the schema, (3) N=3 self-consistency sampling with disagreement filtering, but (4) no citation verification step. A user report surfaces a confident, well-reasoned answer that is completely fabricated (cites a nonexistent expert). Which hallucination defense layer did your current posture fail to close?

Correct answer: C
Your defenses address the first three hallucination categories: (1) 'answer only from context' plus the abstention field closes extrinsic (the model had the option to abstain); (2) an evidence-gating field, if present, would close intrinsic; (3) self-consistency's disagreement filtering surfaces conflation. But the fabricated-citation category (inventing references that do not exist) requires an *external* post-generation verification step. No amount of in-prompt instruction makes a fake citation real. The only defense is citation allowlist verification applied after generation: check each citation ID against a registry of valid sources and drop any that do not match. Your current posture is missing this layer.
Why the other choices are wrong:
  • A. The prompt explicitly allows abstention as a structured field. If the model confidently asserted the false answer instead of abstaining, that is a sign the model believed it, not that the abstention mechanism failed.
  • B. If the prompt includes evidence-gating instructions, the model should have quoted a source. But the question does not say whether evidence gating is present, only that context-only instructions exist.
  • D. Self-consistency catches *stochastic* disagreement. If the model confidently asserts the same false answer three times (a systematic error from training), self-consistency will keep it. This is a known pitfall of self-consistency with systematic hallucinations.
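The missing layer is mechanical to add: after generation, check every citation ID against a registry of provided sources and flag the rest. The `[doc-NNN]` ID format and the registry contents are illustrative assumptions:

```python
import re

VALID_SOURCES = {"doc-001", "doc-002", "doc-003"}   # registry of provided sources
CITATION = re.compile(r"\[([a-z0-9-]+)\]")

def verify_citations(answer: str) -> tuple[str, list[str]]:
    """Flag citations absent from the registry; report what was fabricated."""
    fabricated = [c for c in CITATION.findall(answer) if c not in VALID_SOURCES]
    cleaned = CITATION.sub(
        lambda m: m.group(0) if m.group(1) in VALID_SOURCES else "[unverified]",
        answer,
    )
    return cleaned, fabricated

answer = "Revenue grew 12% [doc-001] per expert analysis [doc-999]."
```

Because this check runs outside the model, it catches exactly the confident, well-reasoned fabrication that in-prompt instructions and self-consistency let through.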