AI Agents, Meet Test Driven Development



Description
Deploying agentic workflows in production is tough—bugs, hallucinations, and unexpected behavior can quickly turn a promising system into a support nightmare. But there’s a pattern we’ve seen across hundreds of companies: teams that embrace test-driven development (TDD) build stronger, more reliable AI systems.

In this talk, Anita from Vellum will break down how TDD can be applied to AI agents, sharing real-world strategies for testing and improving reliability. She’ll also explore different types of agentic behavior, what’s possible to build today, and where the innovation is heading. To bring it all together, Anita will demo her own SEO agent—an agentic workflow that automates a big chunk of her content-writing process.

If you’re building AI-powered workflows and want them to actually work, this session is for you!

Related links:

DeepSeek-R1 training process: https://www.vellum.ai/blog/the-t…
Agentic Workflows: Emerging architectures: https://www.vellum.ai/blog/agent…
Four pillars of building AI systems in production: https://www.vellum.ai/blog/the-f…
Everything you need to know on Chain of Thought prompting: https://www.vellum.ai/blog/chain…
Reasoning models are indecisive parrots: https://www.vellum.ai/reasoning-…

AI Summary


  • Introduction
    • Anita from Vellum discusses the benefits of test-driven development in deploying reliable AI solutions.
    • Highlights the success of Cursor AI, an AI-powered IDE with rapid growth due to better AI models, increased AI adoption, and coding being an early target for AI disruption.
  • AI Model Evolution
    • Model improvements are slowing down despite more data.
    • New training methods like pure reinforcement learning have emerged, exemplified by the DeepSeek-R1 model.
    • Chain-of-Thought reasoning models like o1 and o3 are used for complex problem-solving.
    • New benchmarks like the “Humanity’s Last Exam” are introduced to test advanced reasoning models.
  • Building Reliable AI Products
    • Success in AI products depends on the combination of models, techniques, and logic.
    • Test-driven development is crucial, involving experimentation, evaluation at scale, and continuous improvement post-deployment.
  • Experimentation Phase
    • Experiment with different prompting techniques and workflows.
    • Involve domain experts early to save engineering time.
    • Stay model agnostic and test various models for best fit.
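The model-agnostic advice above can be sketched as a small harness that runs the same prompt through every candidate model behind one shared signature. This is an illustrative sketch, not Vellum's API: the stub functions stand in for real provider SDK calls.

```python
# Minimal model-agnostic experimentation harness (illustrative sketch).
# Each stub stands in for a real provider client wrapped to the same
# signature: take a prompt string, return a completion string.

def stub_gpt(prompt: str) -> str:
    return f"gpt: {prompt[:20]}"

def stub_claude(prompt: str) -> str:
    return f"claude: {prompt[:20]}"

MODELS = {"gpt": stub_gpt, "claude": stub_claude}

def run_experiment(prompt: str) -> dict:
    """Run the same prompt through every registered model for comparison."""
    return {name: fn(prompt) for name, fn in MODELS.items()}
```

Because every model sits behind the same interface, swapping one in or out is a one-line change to the registry.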
  • Evaluation Phase
    • Create a dataset to test models and workflows.
    • Balance quality, cost, latency, and privacy.
    • Use ground truth data for evaluation and a flexible testing framework.
    • Run evaluations at every stage to ensure correct responses.
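The evaluation loop described above can be sketched as a small harness: run a workflow over a dataset of (input, expected) pairs, score each output against ground truth, and report the pass rate plus the failing cases. The function names and metric are assumptions for illustration, not the talk's actual framework.

```python
# Sketch of a ground-truth evaluation loop (illustrative names).

def exact_match(output: str, expected: str) -> bool:
    # Simplest possible metric; real suites often use semantic checks.
    return output.strip().lower() == expected.strip().lower()

def evaluate(workflow, dataset, metric=exact_match):
    """Run `workflow` over (input, expected) pairs; return the pass
    rate and the list of (input, output, expected) failures."""
    failures = []
    for inp, expected in dataset:
        out = workflow(inp)
        if not metric(out, expected):
            failures.append((inp, out, expected))
    return (len(dataset) - len(failures)) / len(dataset), failures
```

Keeping the metric pluggable is what makes the framework "flexible": the same loop can balance quality checks against cost or latency thresholds.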
  • Deployment and Monitoring
    • Monitor AI behavior, log calls, track inputs/outputs, and handle API reliability.
    • Use version control, staging environments, and decouple AI feature updates from app deployment schedules.
    • Create feedback loops to identify and improve edge cases.
    • Consider building a caching layer to reduce costs and improve latency.
    • Use production data to fine-tune custom models for specific use cases.
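The caching-layer suggestion above can be sketched as a lookup keyed on a hash of the prompt and call parameters, so repeated identical requests skip the model entirely. This is a minimal in-memory sketch; a production cache would add eviction and persistence.

```python
import hashlib
import json

# In-memory cache; production systems would use Redis or similar.
_CACHE: dict = {}

def cached_call(model, prompt: str, **params) -> str:
    """Return a stored completion for an identical prompt+params;
    otherwise call the model once and cache the result."""
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _CACHE:
        _CACHE[key] = model(prompt)
    return _CACHE[key]
```

Hashing the sorted parameter dict means a change to, say, temperature produces a cache miss, which keeps cached answers consistent with the settings that produced them.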
  • Agentic Workflows
    • Agentic workflows are evolving, with varying levels of control, reasoning, and autonomy.
    • The framework defines levels L0 through L4, with L4 being fully creative workflows.
    • L1 is common in production, focusing on orchestration and interaction with systems.
    • L2 and beyond involve planning, reasoning, and autonomy, with L3 and L4 being more independent and creative.
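An L1-style workflow as described above can be sketched as follows: the model's only job is to pick a branch, and deterministic code does the orchestration. The tool names and router are hypothetical stand-ins; in practice the `route` step would be an LLM classification call.

```python
# Sketch of an L1 agentic workflow: model-assisted routing,
# deterministic execution, no autonomous planning loop.

def fetch_keywords(request: str) -> list:
    # Placeholder for a keyword-research tool/API call.
    return ["test-driven development", "ai agents"]

def draft_outline(request: str) -> str:
    # Placeholder for an outline-generation step.
    return f"Outline for: {request}"

TOOLS = {"keywords": fetch_keywords, "outline": draft_outline}

def route(request: str) -> str:
    # Stand-in for an LLM call that classifies the request into a tool name.
    return "keywords" if "keyword" in request.lower() else "outline"

def run(request: str):
    # Orchestration stays in ordinary code: the model only picks the branch.
    return TOOLS[route(request)](request)
```

Keeping control flow in code rather than in the model is what makes L1 the level most teams can ship reliably today.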
  • Practical Demonstration
    • Anita demonstrates building an SEO agent that automates keyword research, content analysis, and creation.
    • The agent uses a combination of tools and evaluators to produce a high-quality first draft of content.
    • The workflow SDK introduced is open source, customizable, and keeps UI and code in sync.
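The tools-plus-evaluators pattern from the demo can be sketched as a generate-evaluate-revise loop: draft, score, feed the evaluator's feedback back into the next attempt. All names here are illustrative placeholders, not the demoed SEO agent or the workflow SDK.

```python
# Sketch of a generate-then-evaluate drafting loop (placeholder logic).

def generate_draft(brief: str, feedback: str = "") -> str:
    # Placeholder for an LLM drafting call; feedback would be
    # folded into the prompt on revision rounds.
    return f"Draft({brief})" + (" [revised]" if feedback else "")

def evaluate_draft(draft: str) -> tuple:
    # Placeholder evaluator: here, any revised draft passes.
    ok = "[revised]" in draft
    return ok, "" if ok else "needs revision"

def produce_first_draft(brief: str, max_rounds: int = 3) -> str:
    """Generate, evaluate, and revise until the evaluator passes
    or the round budget runs out; return the last draft either way."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate_draft(brief, feedback)
        ok, feedback = evaluate_draft(draft)
        if ok:
            break
    return draft
```

Bounding the loop with `max_rounds` keeps cost predictable while still letting the evaluator drive quality, which is the "high-quality first draft" goal the demo aims at.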
  • Conclusion
    • The presentation concludes with an invitation to connect on LinkedIn or reach out via email or Twitter for further discussion on AI.

Detailed Instructions and URLs

  • No specific CLI commands, website URLs, or detailed instructions were provided in the transcript.