Structured Outputs and Reliability Engineering · Quiz

8 items · Bloom: Analyze:3, Apply:2, Evaluate:2, Create:1

Q1 Analyze mcq_single

An extraction prompt asks a model to 'Return invoice line items as JSON.' On live traffic, roughly 3% of calls fail because the model wraps the output in ```json markdown fences. You want to completely eliminate this class of failure without latency penalties. Which output format best matches this requirement?

Correct answer: C
Schema-constrained decoding enforces a grammar at token-sampling time, eliminating structural violations like markdown fences entirely. The provider prevents the model from emitting invalid JSON by restricting token choices during generation. This closes the parser-fragility failure class completely, with acceptable latency for batch ETL and most real-time scenarios.
Why the other choices are wrong:
  • A. Improving the parser only handles the fence problem after it occurs; it does not prevent the model from wrapping output in the first place.
  • B. XML-tagged sections can survive embedded braces, but this does not directly address markdown fence wrapping; the model would still emit the fences, they would just sit inside the XML.
  • D. Function-calling-as-format ensures typed output, but the real failure class here is structural violations. The model still might invent or abdicate meaning while satisfying the type signature.
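The failure class in Q1 is easy to reproduce: a fence-wrapped response breaks a naive `json.loads` call. A minimal sketch (payload and field names are illustrative, not from the question):

```python
import json

FENCE = "`" * 3  # the literal ``` markdown fence

# A fence-wrapped response -- the ~3% failure class described above.
fenced = FENCE + 'json\n{"line_items": [{"sku": "A1", "qty": 2}]}\n' + FENCE

try:
    json.loads(fenced)          # naive parser assumes bare JSON
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False           # breaks on the markdown fences

# Option A's after-the-fact patch: strip fences before parsing. It handles
# this one symptom but cannot prevent the model from emitting fences, which
# is why schema-constrained decoding closes the class at generation time.
def strip_fences(raw: str) -> str:
    raw = raw.strip()
    if raw.startswith(FENCE):
        raw = raw.split("\n", 1)[1]     # drop the opening ```json line
        raw = raw.rsplit(FENCE, 1)[0]   # drop the closing fence
    return raw.strip()
```
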
Q2 Apply mcq_multi

You are implementing schema-constrained decoding for a vendor-name extraction prompt. The model occasionally emits empty strings or zero values when uncertain instead of refusing the extraction. You need to harden the prompt against at least three adversarial input classes: (1) embedded curly braces in vendor names, (2) numeric fields with currency symbols, (3) intentionally malformed input. Which of the following defense mechanisms would be most effective? (Select 2)

Correct answer: B, C
The core issue is semantic abdication—the model satisfies structural constraints while lying semantically (empty string, zero value). Option B reallocates prompt budget to teach field semantics, removing ambiguity about what 'amount' means. Option C detects semantic violations post-generation (empty, zero) and triggers a repair loop to correct the specific field, converting abdication into targeted retries. Together, they address both prevention (clearer semantics) and recovery (repair on failure).
Why the other choices are wrong:
  • A. Schema validation can only enforce structure, not semantic honesty. Rejecting zero-values in the schema might prevent zero-amounts, but empty strings and other abdication patterns would persist; the problem is the model's behavior, not the validation rule.
  • D. Higher temperature increases sampling randomness, making the model less predictable. This would likely worsen semantic abdication by encouraging more erratic outputs, not fewer.
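Options B and C can be sketched together: a semantic check that flags abdication patterns, driving a targeted repair loop. The field names and the `regenerate` hook are illustrative assumptions, not a specific provider API:

```python
# Semantic checks: the output may be structurally valid JSON, but empty
# strings and zero amounts signal abdication rather than honest extraction.
def semantic_violations(record: dict) -> list[str]:
    problems = []
    if not str(record.get("vendor_name", "")).strip():
        problems.append("vendor_name must be a non-empty string")
    if not record.get("amount"):
        problems.append("amount must be a non-zero number")
    return problems

# Repair loop: feed the specific violations back so each retry targets the
# offending fields instead of regenerating blindly.
def repair(record: dict, regenerate, max_retries: int = 2) -> dict:
    for _ in range(max_retries):
        problems = semantic_violations(record)
        if not problems:
            return record
        record = regenerate(record, problems)
    if semantic_violations(record):
        raise ValueError("semantic repair did not converge; escalate")
    return record
```

Note that a vendor name like `Acme {Braces} Ltd` passes cleanly: the braces live inside a JSON string, so only the semantic checks matter here.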
Q3 Evaluate mcq_single

Your team is building a real-time chat agent that invokes a payment API mid-conversation. Users see tokens streaming word-by-word on the screen. You must choose between free-form JSON with careful prompt engineering, XML-tagged sections, or schema-constrained decoding. Which format best aligns with the production constraint of real-time token streaming?

Correct answer: C
In real-time streaming scenarios, users expect to see partial output immediately. XML-tagged sections support incremental parsing—you can extract and render `<claim>...</claim>` spans before the full response arrives. Schema-constrained decoding requires the full response to be generated before it is valid (schema verification happens on the complete output), introducing first-token delay and preventing progressive rendering. Streaming incompatibility is a critical constraint for chat UIs.
Why the other choices are wrong:
  • A. While schema-constrained decoding eliminates parser risk, it introduces first-token latency (50–200ms) because grammar validation happens at token-sampling time, and the full response must be complete before it is structurally valid. For streaming, this is a hidden-cost mistake.
  • B. Free-form JSON can stream, but engineer-designed escaping rules are fragile; they require maintaining complex parser heuristics and will occasionally fail on adversarial input (quotes, embedded braces).
  • D. Function-calling-as-format also requires the full response before validation, just like schema-constrained decoding. You cannot progressively render a tool call until its argument object is complete.
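The streaming advantage of tagged sections can be sketched with a buffer that emits each `<claim>` span as soon as its closing tag arrives, well before the full response completes. The tag name and chunk boundaries are illustrative:

```python
import re

CLAIM = re.compile(r"<claim>(.*?)</claim>", re.DOTALL)

def stream_claims(chunks):
    """Yield each claim as soon as its closing tag has streamed in."""
    buffer = ""
    emitted = 0
    for chunk in chunks:
        buffer += chunk
        matches = CLAIM.findall(buffer)
        for claim in matches[emitted:]:
            yield claim          # renderable before the response ends
        emitted = len(matches)

# Simulated token stream: the first claim is complete mid-stream.
chunks = ["<claim>Refund iss", "ued</claim><claim>Card ch", "arged</claim>"]
```

A schema-constrained or function-calling response offers no equivalent hook: the argument object is not valid JSON until the final brace arrives.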
Q4 Create mcq_single

You are designing an output contract for a support-ticket triage system where tickets route to specialized queues (billing, technical, account, other). Tickets with missing customer_id cause downstream database joins to fail silently, corrupting dashboards for hours before detection. Tickets with incorrect status values cause routing to wrong queues. Which combination of contract dimensions best prevents both failure classes?

Correct answer: B
This design addresses both failure classes: (1) customer_id as required + non-null prevents the silent corruption cascade—missing ID now fails loudly; (2) closed enum with 'other' sentinel plus confidence field prevents wrong-status routing—the model cannot invent new categories, and low-confidence outputs route explicitly to 'other'; (3) additionalProperties: false catches schema drift across model versions. This contract makes three explicit promises to downstream code: identity is guaranteed, status is closed, and format is stable.
Why the other choices are wrong:
  • A. Keeping fields optional defers the failure to downstream code. A validate-and-repair loop only catches syntactic violations, not the semantic problem that a missing customer_id is meaningful data loss, not a retryable error.
  • C. Nullable customer_id still allows null to propagate downstream. An open-ended enum defeats the entire purpose of enum discipline—the model can invent categories, and you are back to silent routing bugs.
  • D. Free-text status with post-processing regex is brittle. The model can still emit variations the regex misses, and you are now patching the schema after the fact instead of preventing the problem at generation time.
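Option B's three promises can be sketched as a hand-rolled validator (in production a JSON Schema library would enforce the same contract; field names follow the question):

```python
ALLOWED_KEYS = {"customer_id", "status", "confidence"}
STATUS_ENUM = {"billing", "technical", "account", "other"}

def validate_ticket(ticket: dict) -> list[str]:
    errors = []
    # (1) identity is guaranteed: required + non-null fails loudly
    if ticket.get("customer_id") in (None, ""):
        errors.append("customer_id is required and non-null")
    # (2) status is closed: enum with an explicit 'other' sentinel
    if ticket.get("status") not in STATUS_ENUM:
        errors.append("status must be one of " + str(sorted(STATUS_ENUM)))
    # (3) format is stable: the additionalProperties: false equivalent
    extras = set(ticket) - ALLOWED_KEYS
    if extras:
        errors.append("unexpected fields: " + str(sorted(extras)))
    return errors
```

A ticket that would have silently corrupted the dashboards now produces a loud, enumerable error list instead of passing through.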
Q5 Apply mcq_single

Your validate-and-repair loop catches a violation: the model returned status='BILLING?' (not in enum). You regenerate with error context: 'status must be one of [billing, technical, account, other]; your previous output had BILLING?, which is invalid.' On the next attempt, the model returns status='maybe_billing' (still wrong, just a different invalid value). What is the correct action?

Correct answer: B
When a repair loop produces different invalid values on successive attempts ('BILLING?' then 'maybe_billing'), the model is not converging toward the correct format—it is oscillating. This pattern signals that the model fundamentally misunderstands the enum constraint, not that it made a one-time typo. Continuing to retry amplifies the hallucination and burns through your cost budget. Escalating to human review or explicit fallback is the correct defensive move. The loop's cost ceiling (typically max_retries=2) is exactly designed to catch this.
Why the other choices are wrong:
  • A. Repeating the same error context that the model has already ignored twice will not suddenly work on a third attempt; you are just wasting tokens and cost.
  • C. Lowering temperature might reduce variance, but it will not fix the underlying problem. The model may simply lock onto a single wrong answer and repeat it deterministically.
  • D. Post-processing 'maybe_billing' to 'other' bypasses the contract. You just silently converted a probable misrouting into a catch-all bucket, hiding the true failure from your metrics.
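The escalation logic can be sketched as a repair loop with a cost ceiling that stops when successive attempts produce different invalid values. The `regenerate` hook, which would wrap your model call with the error context, is an illustrative assumption:

```python
STATUS_ENUM = {"billing", "technical", "account", "other"}

def repair_status(first_value: str, regenerate, max_retries: int = 2) -> str:
    seen = [first_value]
    value = first_value
    for _ in range(max_retries):
        if value in STATUS_ENUM:
            return value
        error = (f"status must be one of {sorted(STATUS_ENUM)}; "
                 f"your previous output had {value!r}, which is invalid")
        value = regenerate(error)
        seen.append(value)
    if value in STATUS_ENUM:
        return value
    # 'BILLING?' then 'maybe_billing': not converging -- stop retrying
    # and surface the full attempt history for human review.
    raise RuntimeError(f"repair not converging after {seen}; escalate")
```

The raised history (`seen`) is what distinguishes oscillation from a one-time typo, and it keeps the misroute out of your metrics instead of hiding it in a catch-all bucket.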
Q6 Analyze mcq_single

Your financial research prompt's output contract specifies: optional report_id, optional confidence, required research_findings (free text), optional data_sources (array of strings). A model returns valid output, but the research_findings is an empty string and data_sources is an empty array. Downstream, the team's BI tool treats empty array as 'no sources needed' (does not alert) and empty string as 'valid content' (renders a blank card). Where does this graceful-degradation gap first surface as a user-facing problem?

Correct answer: B
The gap is exactly in the schema design: optional fields with no semantic constraint create a path for the model to abdicate. Empty string and empty array are both structurally valid (they match the type), but semantically they signal model failure. Since the schema allows them, downstream code treats them as valid outputs. The BI tool renders a blank card, users see nothing, and the failure is invisible. There is no error, no alert, no retry—just silent data loss masquerading as normal operation.
Why the other choices are wrong:
  • A. The schema allows empty arrays and empty strings (they are valid values for the types), so validation passes. The tool does not raise an error.
  • C. The validate-and-repair loop checks *structural* validity against the schema, not semantic validity. Empty strings and arrays pass the structure check, so no retry is triggered.
  • D. research_findings is required and non-null in the contract, so a truly empty field would fail validation. But an empty *string* is not the same as an absent field—the field is present, it just has zero content. The API logs nothing.
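Closing this gap requires a semantic check alongside the structural one: the schema happily accepts `""` and `[]`, so the catch has to live in separate code. A sketch using the question's field names (treating an empty data_sources as abdication is itself a design choice):

```python
def is_structurally_valid(report: dict) -> bool:
    # required research_findings: present and a string -- "" still passes
    return isinstance(report.get("research_findings"), str)

def is_semantically_valid(report: dict) -> bool:
    findings = report.get("research_findings", "")
    sources = report.get("data_sources") or []
    # empty string / empty array are type-valid but signal model failure
    return bool(findings.strip()) and len(sources) > 0

# The exact output from the question: valid to the schema, blank to the user.
blank = {"research_findings": "", "data_sources": []}
```

Without the second check, `blank` sails through validation and the BI tool renders its blank card.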
Q7 Analyze mcq_single

A research assistant model outputs: 'In FY2023, TechCorp invested $2.5B in R&D, as reported in their 10-K filing. The company also announced a major acquisition of StartupX for $500M, making it their largest deal of the year. This follows Chen et al., 2023's research on AI acquisitions.' The provided context contains TechCorp's 10-K (which actually reports $2.3B R&D), mentions no acquisition, and the citation is not in any provided source list. Categorize the hallucinations present.

Correct answer: B
Intrinsic hallucination: the R&D figure is $2.5B but the provided 10-K says $2.3B—the model has access to the context but misrepresents it. Extrinsic hallucination: the acquisition detail is not in the provided context at all; the model invents it from training data. Fabricated citation: 'Chen et al., 2023' does not exist in any provided source list; the model invents a reference. These are three distinct failure modes requiring three different defenses: evidence gating for intrinsic, calibrated abstention for extrinsic, and citation allowlist for fabricated.
Why the other choices are wrong:
  • A. The R&D contradiction is not extrinsic. The context contains R&D data; the model just misrepresented it. Extrinsic is when the context is *silent*, not when it is contradicted.
  • C. There is no conflation here—no fusion of two real entities into a false claim. The R&D and acquisition are separate statements, each independently wrong for different reasons.
  • D. The R&D figure is contradicted by the context ($2.5B vs $2.3B) and the acquisition is entirely absent from it. Neither statement is accurate, and the acquisition is a clear hallucination.
Q8 Evaluate mcq_single

You are auditing a Q&A prompt used in production. The prompt includes: (1) instructions to 'answer only using provided context', (2) a required `abstain` field in the schema, (3) N=3 self-consistency sampling with disagreement filtering, but (4) no citation verification step. A user report surfaces a confident, well-reasoned answer that is completely fabricated (cites a nonexistent expert). Which hallucination defense layer did your current posture fail to close?

Correct answer: C
Your defenses address the first three hallucination categories: (1) 'answer only from context' plus the abstention field closes extrinsic (the model had the option to abstain); (2) an evidence-gating field, if present, would close intrinsic; (3) self-consistency's disagreement filtering surfaces conflation. But the fabricated-citation category (inventing references that do not exist) requires an *external* post-generation verification step. No amount of in-prompt instruction makes a fake citation real. The only defense is citation allowlist verification applied after generation: check each citation ID against a registry of valid sources and drop any that do not match. Your current posture is missing this layer.
Why the other choices are wrong:
  • A. The prompt explicitly allows abstention as a structured field. If the model confidently asserted the false answer instead of abstaining, that is a sign the model believed it, not that the abstention mechanism failed.
  • B. If the prompt includes evidence-gating instructions, the model should have quoted a source. But the question does not say whether evidence gating is present, only that context-only instructions exist.
  • D. Self-consistency catches *stochastic* disagreement. If the model confidently asserts the same false answer three times (a systematic error from training), self-consistency will keep it. This is a known pitfall of self-consistency with systematic hallucinations.
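The missing layer is mechanical to add: after generation, check every citation ID against a registry of provided sources and flag the rest. The `[doc-NNN]` ID format and the registry contents are illustrative assumptions:

```python
import re

VALID_SOURCES = {"doc-001", "doc-002", "doc-003"}   # registry of provided sources
CITATION = re.compile(r"\[([a-z0-9-]+)\]")

def verify_citations(answer: str) -> tuple[str, list[str]]:
    """Flag citations absent from the registry; report what was fabricated."""
    fabricated = [c for c in CITATION.findall(answer) if c not in VALID_SOURCES]
    cleaned = CITATION.sub(
        lambda m: m.group(0) if m.group(1) in VALID_SOURCES else "[unverified]",
        answer,
    )
    return cleaned, fabricated

answer = "Revenue grew 12% [doc-001] per expert analysis [doc-999]."
```

Because this check runs outside the model, it catches exactly the confident, well-reasoned fabrication that in-prompt instructions and self-consistency let through.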