Tool-using Large Language Model (LLM) applications represent one of the most exciting frontiers in AI development today. These applications extend beyond basic prompt-response interactions by enabling LLMs to use external tools, APIs, and functions to accomplish complex tasks. As organizations increasingly adopt these technologies, the ability to design, implement, and optimize tool-using LLM applications has become a highly sought-after skill set.
Evaluating candidates for roles involving tool-using LLM applications presents unique challenges. Traditional interviews often fail to reveal a candidate's practical abilities in this rapidly evolving field. Technical knowledge alone doesn't guarantee that someone can effectively architect systems that combine LLMs with external tools, debug the complex interactions between them, or optimize for real-world performance constraints.
Work samples provide a window into how candidates approach the specific challenges of tool-using LLM applications. They reveal not just technical competence but also problem-solving approaches, architectural thinking, and attention to critical considerations like error handling, user experience, and system reliability. These practical exercises help identify candidates who can bridge the gap between theoretical understanding and practical implementation.
The following work samples are designed to evaluate different facets of tool-using LLM application development. They assess a candidate's ability to design architectures, implement functional systems, troubleshoot complex issues, and evaluate performance—all critical skills for success in this domain. By observing candidates as they work through these exercises, you'll gain valuable insights into their capabilities that simply can't be gleaned from resume reviews or traditional interviews.
Activity #1: LLM Tool Architecture Design
This exercise evaluates a candidate's ability to design a coherent architecture for a tool-using LLM application. Architectural design is fundamental to creating effective LLM applications, as it requires balancing technical constraints, user needs, and system performance. This activity reveals how candidates think about system components, data flows, error handling, and the integration points between LLMs and external tools.
Directions for the Company:
- Provide the candidate with a written brief describing a business problem that could be solved with a tool-using LLM application. For example: "Design a customer support system that uses an LLM to handle initial queries and can access customer data, product information, and order status through appropriate tools/APIs."
- Include specific requirements, such as handling authentication, maintaining conversation context, and gracefully handling tool failures.
- Allow the candidate 45-60 minutes to create an architecture diagram and brief explanation document.
- Prepare a computer with diagramming software (like draw.io, Lucidchart, or even Google Slides) for the candidate to use.
- Have a technical evaluator familiar with LLM applications available to review the design and ask follow-up questions.
Directions for the Candidate:
- Review the business problem and requirements carefully.
- Create an architecture diagram showing the key components of your proposed solution, including:
- The LLM integration point(s)
- External tools/APIs and how they connect (a sample tool contract appears after these directions)
- Data flows between components
- Error handling and fallback mechanisms
- User interaction points
- Prepare a brief (1-2 page) explanation document that describes your architecture, key design decisions, and any trade-offs you considered.
- Be prepared to explain your design choices and answer questions about alternative approaches.
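To ground the "external tools/APIs" component of the diagram, it can help to pin down the contract for one tool. Below is a hypothetical definition for an order-status tool from the customer support brief above, written as an OpenAI-style function schema; the name and parameters are illustrative assumptions, not requirements of the exercise.

```python
# Hypothetical order-status tool contract for the customer support brief.
# The name, description, and parameters are illustrative; candidates would
# define contracts that match the components in their own diagram.
order_status_tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The customer's order ID."},
            },
            "required": ["order_id"],
        },
    },
}
```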
Feedback Mechanism:
- The evaluator should provide feedback on one strength of the architecture (e.g., "Your approach to error handling is particularly robust") and one area for improvement (e.g., "The design could better address how to maintain context across multiple interactions").
- Give the candidate 15 minutes to revise their architecture based on the improvement feedback, focusing specifically on that aspect.
- Observe how receptive the candidate is to feedback and how effectively they incorporate it into their revised design.
Activity #2: Implement a Basic Tool-using LLM Application
This exercise assesses a candidate's ability to implement a working tool-using LLM application. Implementation skills are crucial as they demonstrate whether a candidate can translate conceptual understanding into functional code. This activity reveals coding proficiency, familiarity with LLM APIs, understanding of tool integration patterns, and attention to practical details like error handling and user experience.
Directions for the Company:
- Prepare a starter repository with basic scaffolding code for a tool-using LLM application. Include necessary API keys (or dummy placeholders) and documentation for available tools.
- The repository should include a README with clear instructions and a specific implementation task, such as: "Implement a weather-aware travel planning assistant that can look up weather forecasts and suggest activities based on the forecast."
- Provide access to necessary APIs (e.g., OpenAI API, weather API) or mock versions if needed.
- Allow 90-120 minutes for the implementation.
- Ensure the development environment is properly set up with required dependencies.
Directions for the Candidate:
- Review the starter code and requirements document.
- Implement the specified tool-using LLM application, focusing on:
- Proper integration with the LLM API
- Correct implementation of tool calling functionality
- Appropriate error handling and fallback mechanisms
- Clear and helpful user interactions
- Your implementation should include (a minimal reference sketch follows these directions):
- Code to define and register the tools/functions
- Logic to process tool calls from the LLM
- Handling of tool results and incorporation into the conversation
- Basic input/output interface for testing
- Document any assumptions or design decisions you make during implementation.
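For orientation, here is a minimal sketch of the core pieces listed above, assuming the OpenAI Python SDK (v1.x) and a hypothetical weather tool. It is illustrative only; the starter repository's own tools, APIs, and structure take precedence.

```python
# A minimal sketch of the core tool-calling loop, assuming the OpenAI Python
# SDK (v1.x) and a hypothetical weather tool and helper. Not a required
# solution; starter repositories will differ in tools and structure.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Define and register the tool so the model knows its name and parameters.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather_forecast",
        "description": "Get the weather forecast for a city on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "date": {"type": "string", "description": "ISO date, e.g. 2025-07-01"},
            },
            "required": ["city"],
        },
    },
}]

def get_weather_forecast(city, date=None):
    """Placeholder: call a real or mocked weather API here."""
    return {"city": city, "date": date, "forecast": "sunny", "high_c": 24}

def run_turn(messages):
    """Loop until the model answers without requesting another tool call."""
    while True:
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)  # keep the assistant's tool request in context
        for call in msg.tool_calls:
            try:
                args = json.loads(call.function.arguments)
                result = get_weather_forecast(**args)
            except Exception as exc:
                # Fallback: report the failure so the model can recover.
                result = {"error": str(exc)}
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })

# Basic input/output interface for testing.
history = [{"role": "user", "content": "What should I pack for Lisbon tomorrow?"}]
print(run_turn(history))
```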
Feedback Mechanism:
- The evaluator should test the application with a few sample inputs and provide feedback on one strength (e.g., "Your tool definitions and registration logic are clean and well-structured") and one area for improvement (e.g., "The tool response parsing could be more resilient to unexpected formats").
- Give the candidate 20-30 minutes to address the improvement feedback.
- Observe how the candidate approaches debugging and refining their implementation based on feedback.
Activity #3: Debug a Problematic Tool-using LLM Application
This exercise evaluates a candidate's ability to identify and resolve issues in an existing tool-using LLM application. Debugging skills are essential in this domain, as tool-using LLM applications can fail in subtle and complex ways. This activity reveals a candidate's systematic problem-solving approach, understanding of common failure modes, and ability to navigate the interactions between LLMs and external tools.
Directions for the Company:
- Prepare a functional but flawed tool-using LLM application with 3-5 deliberately introduced bugs of varying complexity (a sketch of one such bug follows these directions). For example:
- A tool definition that doesn't match the expected schema
- Improper handling of tool execution errors
- Incorrect parsing of tool results
- Issues with maintaining conversation context
- Inefficient prompt design leading to unnecessary tool calls
- Include a document describing the expected behavior and the observed problematic behavior.
- Provide access to logs, code, and a running environment for testing.
- Allow 60-90 minutes for the debugging exercise.
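As an illustration of the kind of defect worth planting, the sketch below shows a classic parsing bug in the tool-call path. It assumes an OpenAI-style client, where tool-call arguments arrive as a JSON string; the lookup_order helper is a hypothetical stand-in for a real tool.

```python
# Hypothetical planted bug: the OpenAI API returns tool-call arguments as a
# JSON string, but this code treats them as an already-parsed dict, so every
# tool call fails with a TypeError.
def execute_tool_call(call):
    args = call.function.arguments       # BUG: this is a JSON string
    return lookup_order(**args)          # crashes: ** requires a mapping

# The fix a candidate should converge on:
import json

def execute_tool_call_fixed(call):
    args = json.loads(call.function.arguments)  # parse the JSON string first
    return lookup_order(**args)
```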
Directions for the Candidate:
- Review the application code, documentation, and reported issues.
- Systematically identify the bugs in the application by:
- Testing the application to reproduce the issues
- Examining logs and error messages
- Reviewing the code for logical errors or misconfigurations
- For each bug you find:
- Document the issue clearly
- Explain the root cause
- Implement a fix
- Verify that your fix resolves the issue
- Prioritize bugs based on their severity and impact on functionality.
- Be prepared to explain your debugging process and the reasoning behind your fixes.
Feedback Mechanism:
- The evaluator should review the candidate's fixes and provide feedback on one strength (e.g., "Your systematic approach to isolating the context management bug was excellent") and one area for improvement (e.g., "Consider adding more robust validation to prevent similar issues in the future").
- Give the candidate 15-20 minutes to implement the suggested improvement.
- Observe how the candidate incorporates the feedback to strengthen their solution beyond just fixing the immediate issues.
Activity #4: Evaluate and Optimize Tool Selection Strategy
This exercise assesses a candidate's ability to analyze and improve the tool selection strategy in a tool-using LLM application. Optimizing when and how an LLM uses available tools is critical for application performance, cost efficiency, and user experience. This activity reveals a candidate's analytical skills, understanding of LLM behavior patterns, and ability to balance multiple optimization objectives.
Directions for the Company:
- Prepare a functional tool-using LLM application that has suboptimal tool usage patterns. For example, an application that:
- Calls tools unnecessarily when the LLM could answer directly
- Fails to use tools when they would be beneficial
- Uses tools in an inefficient sequence
- Has poorly designed tool definitions leading to misuse (illustrated after these directions)
- Include a dataset of sample conversations showing the current behavior.
- Provide metrics on current performance (e.g., completion time, token usage, accuracy).
- Allow 60-90 minutes for the evaluation and optimization exercise.
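For the last pattern, a poorly designed tool definition often looks like the first sketch below, while the refined version gives the model a clear signal about when the tool applies. Both definitions are hypothetical examples, not part of any specific exercise materials.

```python
# Hypothetical example of a tool definition that invites misuse: a generic
# name and a vague description give the model no signal about when to call it.
vague_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Searches for things.",
        "parameters": {
            "type": "object",
            "properties": {"q": {"type": "string"}},
            "required": ["q"],
        },
    },
}

# A refined version: a scoped name, a precise description with explicit
# do/don't guidance, and a documented parameter.
refined_tool = {
    "type": "function",
    "function": {
        "name": "search_product_catalog",
        "description": (
            "Search the internal product catalog by keyword. Use only for "
            "questions about product specs, pricing, or availability; do not "
            "use for general knowledge questions."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Keywords describing the product, e.g. '27-inch monitor'.",
                },
            },
            "required": ["query"],
        },
    },
}
```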
Directions for the Candidate:
- Analyze the provided sample conversations and performance metrics to identify patterns of suboptimal tool usage.
- Evaluate the current tool definitions, system prompts, and any other relevant configuration.
- Develop a set of specific recommendations to improve the tool selection strategy, which may include:
- Refining tool descriptions and parameters
- Adjusting system prompts to provide better guidance on tool usage
- Implementing pre-processing logic to determine when tools should be available (a sketch of this idea follows these directions)
- Adding post-processing to validate and potentially override tool selection
- Implement at least two of your highest-priority recommendations.
- Test your changes against the provided sample conversations and document the improvements in performance metrics.
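As one example of the pre-processing recommendation above, the sketch below gates tool availability with a cheap keyword heuristic so the model is never offered a tool it clearly doesn't need. The heuristic and tool name are illustrative assumptions; a real implementation might use a classifier or embedding similarity instead.

```python
# A minimal sketch of pre-processing logic that decides which tools to expose
# on each turn. The keyword heuristic and tool name are illustrative.
WEATHER_HINTS = ("weather", "forecast", "rain", "temperature", "sunny")

def select_tools(user_message: str, all_tools: list) -> list:
    """Return only the tools plausibly relevant to this message."""
    text = user_message.lower()
    if any(hint in text for hint in WEATHER_HINTS):
        return all_tools  # weather may be relevant; offer everything
    # Withhold the weather tool to avoid spurious calls on unrelated queries.
    return [t for t in all_tools if t["function"]["name"] != "get_weather_forecast"]

# Usage: pass the filtered list into each completion request, e.g.
#   client.chat.completions.create(model=..., messages=messages,
#                                  tools=select_tools(user_message, tools))
```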
Feedback Mechanism:
- The evaluator should review the candidate's analysis and implementations, providing feedback on one strength (e.g., "Your analysis of the unnecessary API calls was particularly insightful") and one area for improvement (e.g., "Consider how your changes might affect edge cases not represented in the sample data").
- Give the candidate 15-20 minutes to address the improvement feedback.
- Observe how the candidate balances multiple optimization objectives (accuracy, efficiency, cost) and considers both immediate and long-term implications of their changes.
Frequently Asked Questions
How should we adapt these exercises for candidates with different experience levels?
For junior candidates, consider simplifying the requirements and providing more scaffolding code. You might focus on implementing a single tool rather than a multi-tool system. For senior candidates, increase complexity by adding requirements around scalability, security, or advanced features like chained tool calls or parallel tool execution.
What if we don't have access to paid LLM APIs for the exercises?
You can use open-source LLMs that support tool-using capabilities, such as certain versions of Llama or local deployments of open models. Alternatively, you can create mock LLM interfaces that simulate tool-calling behavior for the exercises, focusing on the candidate's implementation of the surrounding architecture.
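As a rough illustration, a mock interface might look like the sketch below. The class and method names are invented for this example; they simply mirror the shape of a real tool-calling response so the candidate's surrounding loop can be exercised without any API access.

```python
# A hypothetical mock of a tool-calling LLM client for API-free exercises.
# It returns a scripted tool call on the first turn and a plain answer on
# the next, which is enough to test a candidate's tool-handling loop.
import json
from dataclasses import dataclass, field

@dataclass
class MockFunction:
    name: str
    arguments: str  # a JSON string, mirroring the real API's shape

@dataclass
class MockToolCall:
    id: str
    function: MockFunction

@dataclass
class MockMessage:
    content: str | None = None
    tool_calls: list = field(default_factory=list)

class MockLLM:
    """Deterministic stand-in for a real client; free to run."""
    def __init__(self):
        self.turn = 0

    def complete(self, messages, tools):
        self.turn += 1
        if self.turn == 1:
            call = MockToolCall(
                id="call_1",
                function=MockFunction(
                    name="get_weather_forecast",
                    arguments=json.dumps({"city": "Lisbon"}),
                ),
            )
            return MockMessage(tool_calls=[call])
        return MockMessage(content="Sunny tomorrow; suggest outdoor activities.")
```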
How do we evaluate candidates who use different approaches than we expected?
Focus on the effectiveness of their solution rather than adherence to a specific approach. The key questions should be: Does their solution solve the problem effectively? Is the code well-structured and maintainable? Did they make reasonable trade-offs given the constraints? Different approaches often reveal valuable diverse perspectives that could benefit your team.
Should candidates be allowed to use reference materials or search for information during these exercises?
Yes, allowing reference materials more closely simulates real-world working conditions. Tool-using LLM applications involve many specific APIs and patterns that professionals regularly look up. Focus your evaluation on problem-solving approach, architectural thinking, and implementation quality rather than memorization of specific syntax or parameters.
How can we ensure these exercises don't take too much of the candidate's time?
Be transparent about time expectations upfront. Consider offering these exercises as take-home assignments with clear time limits, or schedule appropriate time blocks for on-site exercises. Focus the requirements on specific aspects rather than building complete applications, and provide sufficient scaffolding code to allow candidates to demonstrate relevant skills without building everything from scratch.
What if a candidate has no experience with a specific LLM API we use in our exercises?
The core concepts of tool-using LLM applications are transferable across different APIs. Consider providing a brief overview of your specific API before the exercise, or design the exercise to focus more on the architecture and implementation patterns rather than specific API details. Evaluate the candidate's ability to quickly adapt to new APIs, which is a valuable skill in this rapidly evolving field.
Tool-using LLM applications represent a powerful new paradigm in AI development, and finding candidates who can effectively design and implement these systems requires going beyond traditional interview methods. The work samples outlined above provide a comprehensive evaluation of the key skills needed for success in this domain, from architectural design to implementation, debugging, and optimization.
By incorporating these practical exercises into your hiring process, you'll gain deeper insights into candidates' capabilities and identify those who can truly deliver value in building tool-using LLM applications. Remember that the field is evolving rapidly, so look for candidates who demonstrate not just current knowledge but also the ability to learn and adapt as new capabilities and best practices emerge.
For more resources to enhance your hiring process, check out Yardstick's AI Job Descriptions, AI Interview Question Generator, and AI Interview Guide Generator.