TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks



AI Summary

Summary of the Video Transcript

  • Topic: The video discusses “TheAgentCompany,” a benchmark for evaluating AI agents on real-world workplace tasks.
  • Purpose: To assess AI agents’ capabilities in performing tasks similar to a digital worker, which has implications for industry adoption and economic policy.
  • Progress in AI: AI is increasingly automating tasks, with rapid progress leading to claims of potential widespread automation of human labor.
  • Skepticism: Despite progress, benchmarks like ARC-AGI show AI systems still struggle with tasks that are simple for humans.
  • TheAgentCompany Benchmark:
    • Objective: To provide a reproducible and self-hosted environment for evaluating AI agents on a variety of workplace tasks.
    • Environment: Simulates a software engineering startup with tools like GitLab, ownCloud, Plane, and RocketChat.
    • Tasks: Include arranging meetings, screening resumes, and reimbursing travel bills.
    • Desiderata:
      • Coverage of multiple work-related tasks.
      • Requirement for interaction with human co-workers.
      • Long-horizon tasks with checkpoints.
      • Versatile environment interface.
      • Self-hosted and reproducible for consistent comparisons.
    • Evaluation:
      • Full and partial completion scores.
      • Number of steps and cost per instance.
    • Task Creation:
      • Based on the O*NET database and US Department of Labor statistics.
      • Focus on software company setting.
      • Manual curation by co-authors with relevant experience.
    • Baseline Agent: OpenHands’ CodeAct agent with browsing capabilities.
  • Experimental Results:
    • Claude 3.5 Sonnet performs best with a 24% success rate.
    • Gemini 2.0 Flash is second with an 11.4% success rate.
    • OpenAI’s models are less competitive on price-performance.
  • Common Agent Failures:
    • Lack of common sense and social skills.
    • Weak web-browsing ability and susceptibility to distraction.
    • Self-deception during task execution (convincing themselves a task is done when it is not).
  • Resources:
    • Experiments available on GitHub.
    • Project page with demos of agents performing tasks.
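The checkpoint-based evaluation mentioned above (full and partial completion scores over long-horizon tasks) can be illustrated with a small sketch. The exact formula is not stated in the transcript; the half-weighting that keeps a partially completed task from ever matching a fully completed one is an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str
    points: int
    passed: bool

def score_task(checkpoints: list[Checkpoint]) -> float:
    """Hypothetical partial-credit scoring: a task scores 1.0 only when
    every checkpoint passes; otherwise it earns proportional credit,
    down-weighted by half so partial work never equals full completion."""
    total = sum(c.points for c in checkpoints)
    earned = sum(c.points for c in checkpoints if c.passed)
    if earned == total:
        return 1.0
    return 0.5 * (earned / total)

# Example: a resume-screening task with three checkpoints,
# of which the agent completes the first two.
task = [
    Checkpoint("open the shared folder on the cloud drive", 1, True),
    Checkpoint("shortlist candidates matching the criteria", 2, True),
    Checkpoint("message the hiring manager with the shortlist", 1, False),
]
print(score_task(task))  # 0.375
```

Alongside this score, the benchmark also tracks the number of steps and the per-instance cost, so models can be compared on efficiency as well as accuracy.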

Detailed Instructions and URLs

  • No specific CLI commands, website URLs, or detailed instructions were provided in the transcript.