Agentic Patterns: Tool Use and Prompt Chaining

9 items · Bloom: Remember:0, Understand:0, Apply:2, Analyze:2, Evaluate:2, Create:3

Q1 Create short_answer

You are designing a tool schema for a customer refund agent. Your first attempt uses: `refund_order(order_id: string)` with description "Refunds an order." In production, the model invokes it to answer status inquiries and generates fake order IDs. Design a complete four-part tool schema (name, description, parameter shape, error contract) that prevents both the wrong-tool misuse and hallucinated arguments. Explain which part of your schema addresses each failure mode.

Rubric:

full_credit: Student designs all four contract parts with precision. Description must include precondition ("only after customer explicitly requests") and anti-use clauses ("do not use for status checks"). Parameter shape must include regex pattern like ^ORD-[0-9]{8}$ and enum for reason field. Error contract must include code, message, and suggested_next_action fields. Student explicitly maps each repair to the failure mode it closes.
partial_credit: Student designs 3 of 4 contract parts competently, or all 4 parts but with incomplete detail (e.g., missing enum constraint). Explanation of failure-mode mapping is present but incomplete.
minimal_credit: Student addresses 2 of 4 contract parts or provides superficial design (e.g., description without precondition language). Mapping logic is absent or confused.
no_credit: Student provides generic schema design or fails to address the specific failure modes.
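
Illustrative sketch of a full-credit shape: the block below is one possible way to write the four contract parts down and check arguments before the tool runs. The dictionary layout, the `reason` enum values, and the `validate_args` helper are assumptions made for illustration and are not tied to any particular tool-calling API. Only the order-ID pattern and the three error-contract field names come from the rubric.

```python
import re

# Hypothetical four-part contract; the layout is illustrative only.
REFUND_TOOL = {
    "name": "refund_order",
    "description": (
        "Issue a refund for an existing order. Use only after the customer "
        "explicitly requests a refund. Do not use for status checks."
    ),
    "parameters": {
        "order_id": {"type": "string", "pattern": r"^ORD-[0-9]{8}$"},
        # Enum values are assumed examples, not taken from the quiz.
        "reason": {"type": "string", "enum": ["damaged", "not_received", "other"]},
    },
    "error_contract": ["code", "message", "suggested_next_action"],
}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of violations; an empty list means the call may proceed."""
    problems = []
    for name, spec in tool["parameters"].items():
        value = args.get(name)
        if not isinstance(value, str):
            problems.append(f"{name}: missing or not a string")
        elif "pattern" in spec and not re.fullmatch(spec["pattern"], value):
            problems.append(f"{name}: does not match {spec['pattern']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: not one of {spec['enum']}")
    return problems
```

In this sketch the description carries the precondition and anti-use clause (wrong-tool misuse), while the pattern and enum reject malformed or invented arguments before any refund is attempted.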

Q2 Analyze mcq_single

Production logs show the following tool-misuse incidents. Match each incident to the specific schema contract part (name, description, parameter shape, or error contract) whose defect caused it:
1. User asks "What is the status of order ORD-12345?" and the model invokes `refund_order` to answer.
2. Model passes order_id="ORD_ABC123" (correct format is ORD-XXXXXXXX).
3. Model invokes a hallucinated tool `refund_order_async` that does not exist in the tool registry.

A. 1→description defect; 2→parameter shape defect; 3→name collision (description overlap)
B. 1→parameter shape defect; 2→error contract defect; 3→description defect
C. 1→error contract defect; 2→name collision; 3→parameter shape defect
D. 1→name collision; 2→description defect; 3→error contract defect

Correct answer: A
Each tool misuse maps to a specific contract layer. Incident 1 (wrong tool selected for a status query) stems from a loose description that encodes no precondition: "Refunds an order" gives no guidance that the tool applies only to refund requests after explicit customer confirmation. Incident 2 (malformed order_id) reflects missing parameter validation: the parameter lacks a regex pattern such as ^ORD-[0-9]{8}$ that would reject ORD_ABC123. Incident 3 (hallucinated tool name) arises when adjacent tools such as `refund_order` and `cancel_order` have overlapping descriptions, which creates ambiguity in the model's selection process. The defect is a name collision via description overlap: it lies in how neighboring tool descriptions interact, not in any single tool's schema.

Q3 Apply mcq_single

Your refund agent returns the error "Order already refunded." The model treats this as a transient failure and retries with mutated order IDs (ORD-12345 → ORD-12346 → ORD-12347) until it has cycled through a dozen variations. What is the most effective error-contract improvement to break this loop?

A. Add a "retry: false" boolean field to the error response
B. Include structured error code (e.g., ALREADY_REFUNDED) and suggested_next_action enum (e.g., abstain, ask_customer_to_verify, escalate_to_support)
C. Increase the maximum retry count in the runtime loop control
D. Add a regex pattern in the error response to show the expected order ID format

Correct answer: B
The error contract's purpose is to instruct the model on what action to take in the next turn. A bare string like "Order already refunded" leaves the model guessing—it cannot distinguish transient from permanent failures. A structured error response like `{code: "ALREADY_REFUNDED", message: "This order was already refunded on [date].", suggested_next_action: "abstain"}` explicitly tells the model: stop retrying, do not mutate the ID, and instead inform the customer. This transforms the error response from a description into a directive, eliminating the guessing loop.
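
As a rough sketch of the structured error described above, the following types model the three fields and a runtime-side guard. The class names, the `NO_RETRY` set, and the `should_block_retry` helper are hypothetical; which actions count as terminal is an assumption made here, not something the quiz specifies.

```python
from dataclasses import dataclass
from enum import Enum


class NextAction(str, Enum):
    ABSTAIN = "abstain"
    ASK_CUSTOMER_TO_VERIFY = "ask_customer_to_verify"
    ESCALATE_TO_SUPPORT = "escalate_to_support"


@dataclass(frozen=True)
class ToolError:
    code: str                          # machine-readable, e.g. "ALREADY_REFUNDED"
    message: str                       # human-readable detail
    suggested_next_action: NextAction  # the directive for the next turn


# Assumption: after these actions the runtime refuses another call to the same tool.
NO_RETRY = {NextAction.ABSTAIN, NextAction.ESCALATE_TO_SUPPORT}


def should_block_retry(error: ToolError) -> bool:
    """Return True when the error tells the agent to stop calling the tool."""
    return error.suggested_next_action in NO_RETRY
```

Enforcing the directive in the runtime as well as showing it to the model means a mutated-ID retry can be refused even if the model ignores the suggestion.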

Q4 Analyze mcq_single

Match each prompt-chain topology to its dominant failure mode:
1. Sequential (A → B → C): each stage's output feeds directly into the next
2. Fan-out/Fan-in (A → [B₁, B₂, ...] → merge): one stage spawns parallel branches, then results recombine
3. Conditional (A → if condition(A.output) then B else C): first stage output routes execution
4. Iterative (A → B → A, repeat until predicate): loop until a termination condition

A. 1→error compounding; 2→merge incoherence; 3→misrouting; 4→non-termination
B. 1→non-termination; 2→error compounding; 3→misrouting; 4→merge incoherence
C. 1→misrouting; 2→non-termination; 3→error compounding; 4→merge incoherence
D. 1→merge incoherence; 2→misrouting; 3→non-termination; 4→error compounding

Correct answer: A
Sequential chains fail via error compounding: if stage 1 misses a detail, that blind spot propagates through stages 2 and 3 with no recovery path. Fan-out/fan-in fails at the merge boundary: results that followed divergent reasoning paths must recombine, and incoherence appears when conflicting conclusions are aggregated. Conditional chains fail via misrouting: if stage 1 misclassifies, stage 2 solves the wrong branch entirely. Iterative loops fail via non-termination: without an explicit, runtime-checked predicate, the model may loop indefinitely.

Q5 Create short_answer

Design a three-step pipeline that transforms a meeting transcript into personalized action-item emails. Step 1: transcript → StructuredNotes; Step 2: StructuredNotes → ActionItems[]; Step 3: ActionItems[] per owner → Email. For each boundary, specify: (a) the state contract (schema, nullability rules), (b) the error-propagation policy (fail-fast, fallback, repair-and-retry, or skip-and-degrade), and (c) the rationale for that choice. Also specify what data should not flow from step to step to minimize token overhead.

Rubric:

full_credit: Student designs all three state contracts with explicit field names, types, and nullability. Each error policy is chosen with clear rationale tied to the boundary's criticality (e.g., fail-fast for step 1 because schema violation corrupts downstream; repair-and-retry for low-confidence items). Student identifies at least two fields that should be dropped (e.g., raw transcript not carried past stage 1, all decisions not carried to stage 3) to reduce token cost.
partial_credit: Student specifies 2 of 3 contracts competently with policies and rationale, or all 3 contracts but missing explicit nullability rules or token-reduction strategy.
minimal_credit: Student outlines basic contracts but policies are vague or mismatched (e.g., skip-and-degrade for schema violations). No mention of state-bloat reduction.
no_credit: Student provides generic pipeline outline without specific contract structure, policies, or rationale.
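
Illustrative sketch of the state contracts: one way to make field names, types, and nullability explicit is with typed records. Every field name below, the `check_notes` fail-fast rule, and the `items_by_owner` helper are assumptions chosen for illustration; a real answer would pick its own fields and policies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructuredNotes:
    attendees: list[str]          # non-nullable, must not be empty
    decisions: list[str]          # may be empty
    action_candidates: list[str]  # may be empty
    # The raw transcript is deliberately absent: it is not carried past step 1.


@dataclass
class ActionItem:
    owner: str                    # non-nullable: an email needs a recipient
    task: str                     # non-nullable
    due_date: Optional[str]       # nullable: some items have no deadline


def check_notes(notes: StructuredNotes) -> StructuredNotes:
    """Boundary 1→2, fail-fast: malformed notes would corrupt every later step."""
    if not notes.attendees:
        raise ValueError("StructuredNotes.attendees must not be empty")
    return notes


def items_by_owner(items: list[ActionItem]) -> dict[str, list[ActionItem]]:
    """Boundary 2→3: each email step receives only its owner's items."""
    grouped: dict[str, list[ActionItem]] = {}
    for item in items:
        grouped.setdefault(item.owner, []).append(item)
    return grouped
```

Grouping before step 3 is what keeps the full decision list and other owners' items out of each email prompt.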

Q6 Evaluate mcq_single

Your team deployed a single-shot prompt that classifies customer emails and drafts responses. Success rate: 78%. A colleague proposes decomposing it into a two-stage pipeline: (stage 1) classify, (stage 2) draft response. To evaluate whether chaining is justified, which decision criterion is most load-bearing?

A. Is the chained version faster? Measure latency, and verify the success-rate gain justifies the extra calls.
B. When the single-shot prompt fails, can we pinpoint whether classification failed, response-drafting failed, or both? Does chaining make failures attributable to a specific stage?
C. If the single-shot already works 78% of the time, is the added cost of extra model calls worth it?
D. Will fine-tuning the chained version on past failures improve the model's classification accuracy?

Correct answer: B
The core value of chaining is observability and debuggability, not performance. When the monolith fails 22% of the time, you cannot tell which component broke. Chaining makes each stage loggable and inspectable: when a request fails, you can see which stage produced the bad output and target the fix there. This diagnostic power is what justifies the latency cost. Options A and C discuss real tradeoffs but miss the core decision driver: whether the cost of opacity outweighs the cost of extra calls. Option D conflates chaining (structural decomposition) with fine-tuning (model retraining), which are independent.

Q7 Create short_answer

Design a complete agent loop with explicit termination criteria, step and token budgets, oscillation detection, and observation summarization. For each of the four properties, specify: (a) the property definition, (b) a concrete implementation detail (e.g., success predicate, budget values, detection heuristic, summarization trigger), and (c) the failure mode it prevents. Justify your budget values and heuristics relative to a realistic production scenario (e.g., a research task that should complete in <30 seconds and <40k tokens).

Rubric:

full_credit: Student specifies all four properties with implementation details and tied failure modes. Success predicate is explicitly checkable (e.g., "model called final_answer with non-empty output"), budgets are concrete (max_steps, max_tokens, optionally max_wall_clock), oscillation heuristic is precise (e.g., "same tool+args hash appears 3× in last 5 turns"), summarization trigger is clear (e.g., "when history > 8k tokens, replace observations 1-(N-2) with working_memory"). Justification relates budget choices to production constraints.
partial_credit: Student specifies 3 of 4 properties with detail, or all 4 with less precise implementation details. Failure-mode mapping is present but incomplete.
minimal_credit: Student mentions all 4 properties but with vague or generic definitions (e.g., "set max_steps to something reasonable"). Failure modes are absent.
no_credit: Student provides generic loop structure without addressing the four properties.
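
Illustrative sketch of the oscillation heuristic: the rubric's example ("same tool+args hash appears 3× in last 5 turns") could be implemented roughly as below. The `OscillationDetector` class and its method names are hypothetical, and hashing a JSON dump of the call assumes the arguments are JSON-serializable.

```python
import hashlib
import json
from collections import deque


class OscillationDetector:
    """Flags a call once it fills `threshold` of the last `window` turns."""

    def __init__(self, window: int = 5, threshold: int = 3) -> None:
        self.recent: deque = deque(maxlen=window)  # older turns fall off automatically
        self.threshold = threshold

    def observe(self, tool: str, args: dict) -> bool:
        """Record one tool call; return True if it looks like oscillation."""
        # sort_keys makes the hash independent of argument order.
        payload = json.dumps([tool, args], sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        self.recent.append(key)
        return self.recent.count(key) >= self.threshold
```

When `observe` returns True, the runtime can inject a corrective message or halt; which response is appropriate is a design choice this sketch leaves open.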

Q8 Apply mcq_single

Your agent loop must enforce these budgets deterministically: max_steps=12, max_tokens=40,000, max_wall_clock=30 seconds. The model is currently calling a tool on every turn regardless of whether it needs to. Which code change ensures budget enforcement remains deterministic even if the model misbehaves?

A. Instruct the model in the system prompt: 'You must call final_answer within 12 steps or the system will stop you.'
B. Check `step >= max_steps` and `cumulative_tokens >= max_tokens` and `elapsed_time >= max_wall_clock` at the start of each iteration; halt immediately if any condition is true.
C. Trust the model to respect budget constraints internally and only log a warning if budgets are exceeded.
D. Request the model to output estimated token count before each action; halt if the estimated spend would exceed remaining budget.

Correct answer: B
Deterministic budget enforcement is the runtime's responsibility, not the model's. Option A relies on the model's promise—the model cannot be trusted to self-police. Option C disables the safety net entirely. Option D still depends on the model's estimates, which can be inaccurate. Option B is the only approach that guarantees termination: at every step, the runtime evaluates the hard conditions (step_count >= max_steps, cumulative_tokens >= max_tokens, wall_clock >= max_wall_clock) and halts immediately if any is true. No matter what the model attempts, the runtime enforces the boundary.
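
A minimal sketch of the runtime check in option B follows. The `step_fn` callable is a hypothetical stand-in for one model turn (it returns a final answer or `None`, plus the tokens spent that turn), and the injectable `clock` parameter is an assumption added so the wall-clock limit can be tested without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Budget:
    max_steps: int = 12
    max_tokens: int = 40_000
    max_wall_clock: float = 30.0  # seconds


def run_loop(
    step_fn: Callable[[int], tuple[Optional[str], int]],
    budget: Budget,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, Optional[str]]:
    """Run turns until step_fn returns an answer or any budget trips."""
    start = clock()
    step, tokens = 0, 0
    while True:
        # The runtime checks every limit before each turn, whatever the model does.
        if step >= budget.max_steps:
            return "halted:max_steps", None
        if tokens >= budget.max_tokens:
            return "halted:max_tokens", None
        if clock() - start >= budget.max_wall_clock:
            return "halted:max_wall_clock", None
        answer, used = step_fn(step)
        step += 1
        tokens += used
        if answer:
            return "done", answer
```

Because the checks run at the top of each iteration, a turn already in progress can overshoot the token or time limit; a stricter runtime would also cap each individual call.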

Q9 Evaluate mcq_single

Your agent loop exhibits three pathologies:
(1) On simple queries like "What is 2+2?", the agent calls search_docs unnecessarily before returning final_answer.
(2) The agent oscillates between search_docs and search_web: each tool's output triggers the other in a loop.
(3) On open-ended research queries, the agent runs 47 steps, burns 180k tokens, and returns no answer.
You can deploy exactly one targeted fix. Which fix most directly closes all three pathologies?

A. Add max_steps=12 and max_tokens=40,000 budgets.
B. Add a success predicate: terminate immediately when the model calls final_answer tool with non-empty output; remove any requirement for a minimum number of tool calls.
C. Implement oscillation detection: if the same tool+args hash appears 3 times in the last 5 turns, inject a corrective message.
D. Add a loop-detection heuristic that bans repeated tool calls within 3 turns.

Correct answer: A
The three pathologies stem from different missing controls: (1) unnecessary tool calls indicate misaligned termination criteria (the model does not have permission to skip tools); (2) oscillation indicates that loop detection is absent; (3) runaway exploration indicates that budget enforcement is absent. No single option fully closes all three, so the best answer is the one that contains the most damage. Option B (success predicate) solves pathology 1 but leaves pathologies 2 and 3 unresolved. Option C addresses oscillation but not runaway exploration. Option D is too restrictive. Option A (budgets) is the highest-leverage fix: it guarantees that pathology 3 halts when the token budget expires, and it bounds pathology 2 to a finite loop that ends when the step budget expires. Budgets do not directly solve pathology 1; they only cap the waste, and removing the unnecessary calls still requires the success predicate from option B. Budget enforcement is the foundational safety net.
