TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks



AI Summary

Summary of the Video Transcript

  • Topic: The video discusses “TheAgentCompany,” a benchmark for evaluating AI agents on real-world work tasks.
  • Purpose: To assess AI agents’ ability to perform economically valuable tasks and understand their potential impact on the labor market.
  • Benchmark Introduction:
    • Simulates a digital worker’s environment.
    • Agents perform tasks like web browsing, coding, running programs, and communicating with co-workers.
  • Progress in AI:
    • Rapid advancements in AI assistance and automation.
    • Skepticism remains due to AI’s planning limitations.
    • Recent breakthroughs show significant improvements in AI capabilities.
  • Benchmark Desiderata:
    • Coverage of multiple work-related tasks.
    • Interaction requirement for integration into real workplaces.
    • Long horizon tasks with checkpoints.
    • Versatile environment interface.
    • Self-hosted and reproducible for consistent comparisons.
  • Benchmark Environment:
    • Set in a simulated software engineering startup.
    • Uses open-source software like GitLab, ownCloud, Plane, and Rocket.Chat.
    • Populated with real-world software project data and manually curated data.
  • Task Components:
    • Task intent: Clear English description of the task.
    • Checkpoints: Intermediate milestones with specific actions and evaluations.
    • Evaluators: Programs that check the completion of checkpoints.
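The checkpoint-and-evaluator structure described above could be sketched roughly as follows. This is a hypothetical illustration, not the benchmark's actual API: the names `Checkpoint` and `grade_task`, and the dictionary-based environment state, are assumptions for the sake of the example.

```python
# Hypothetical sketch of checkpoint-based task grading, assuming each
# checkpoint is a named predicate worth some number of points.
# These names and the dict-based state are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Checkpoint:
    description: str
    points: int
    check: Callable[[dict], bool]  # inspects the simulated workspace state

def grade_task(checkpoints: list[Checkpoint], state: dict) -> tuple[int, int]:
    """Return (points earned, total points) for one task run."""
    earned = sum(cp.points for cp in checkpoints if cp.check(state))
    total = sum(cp.points for cp in checkpoints)
    return earned, total

# Example: two checkpoints for a made-up "fix the CI pipeline" task.
checkpoints = [
    Checkpoint("Merge request opened on GitLab", 1,
               lambda s: s.get("mr_opened", False)),
    Checkpoint("CI pipeline passes", 2,
               lambda s: s.get("ci_green", False)),
]
earned, total = grade_task(checkpoints, {"mr_opened": True, "ci_green": False})
```

In this sketch an evaluator is just a program that inspects the environment (GitLab, chat, files) after the run and reports which intermediate milestones were reached.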
  • Evaluation Metrics:
    • Full completion score.
    • Partial completion score.
    • Number of steps (LLM calls) during task execution.
    • Cost per instance (monetary cost of querying LLM).
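The full and partial completion scores mentioned above might combine checkpoint progress with a bonus for finishing the whole task. The weighting below (half credit for checkpoint progress, half reserved for full completion) is an assumed scheme for illustration; the benchmark's exact formula may differ.

```python
# Hedged sketch of a partial-completion score: reward intermediate
# checkpoint progress, but reserve half the credit for fully
# completing the task. The 0.5/0.5 weighting is an assumption.
def partial_score(earned: int, total: int) -> float:
    full = 1.0 if earned == total else 0.0
    return 0.5 * (earned / total) + 0.5 * full
```

Under this scheme, passing 1 of 4 checkpoints yields 0.125, while passing all 4 yields 1.0, so partial progress is rewarded without being mistaken for success.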
  • Task Creation:
    • Based on the O*NET database and US Bureau of Labor Statistics data.
    • Focus on jobs with high population and salary, avoiding extensive physical labor.
    • Tasks created by referencing O*NET, introspection, and brainstorming with language models.
  • Manual Task Curation:
    • 20 individuals spent 3,000 hours over 2 months creating tasks.
    • Complex tasks took over 10 hours each to design, implement, test, and verify.
    • Quality control included tests for evaluators and code reviews.
  • Baseline Agent:
    • OpenHands agent, specifically the CodeActAgent with browsing capabilities.
  • Experimental Results:
    • Claude 3.5 Sonnet performs best, with 24% full task success.
    • Gemini 2.0 Flash is second with 11.4% success.
    • OpenAI’s models are less competitive in price performance.
  • Success Rate Analysis:
    • Claude performs variably across different platforms and task categories.
  • Common Agent Failures:
    • Lack of common sense and social skills.
    • Weak web-browsing ability; agents get distracted by elements on web pages.
    • Self-deception, such as renaming a chat user to stand in for a co-worker the agent could not reach.
  • Resources:
    • Experiments available on GitHub.
    • Project page with demos of agents performing tasks.

(Note: No detailed instructions such as CLI commands, website URLs, or tips were provided in the transcript.)