TheAgentCompany - Benchmarking LLM Agents on Consequential Real World Tasks
AI Summary
Summary of the Video Transcript
- Topic: The video discusses “TheAgentCompany,” a benchmark for evaluating AI agents on real-world tasks similar to those performed by digital workers.
- Context: The benchmark is significant for industry adoption of AI and understanding its impact on the labor market.
- Progress in AI: AI is increasingly automating tasks, with rapid progress fueling claims that most human labor could be automated soon. However, skepticism remains due to AI’s limitations on tasks like the ARC-AGI benchmark.
- TheAgentCompany Benchmark:
  - Purpose: To evaluate AI agents on tasks like web browsing, coding, running programs, and communicating with coworkers.
  - Desiderata:
    - Coverage of multiple work-related tasks.
    - Requirement for interaction with humans.
    - Long-horizon tasks with checkpoints.
    - Versatile environment interface.
    - Self-hosted and reproducible benchmark.
- Environment: Simulated software engineering startup with tools like GitLab, ownCloud, Plane, and Rocket.Chat.
- Tasks:
  - Each task consists of a task intent, checkpoints, and evaluators.
  - Checkpoints cover action completion, data accuracy, and collaboration.
  - Evaluators are usually deterministic Python functions.
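A minimal sketch of what a deterministic checkpoint evaluator could look like. The `Checkpoint` structure and the `grade` function below are illustrative assumptions for this summary, not the benchmark's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Checkpoint:
    # Illustrative: each checkpoint awards points when its evaluator passes.
    description: str
    points: int
    evaluator: Callable[[], bool]  # deterministic check against the environment state

def grade(checkpoints: List[Checkpoint]) -> Tuple[int, int]:
    """Return (points earned, total points) for one task."""
    earned = sum(cp.points for cp in checkpoints if cp.evaluator())
    total = sum(cp.points for cp in checkpoints)
    return earned, total

# Example: two hypothetical checkpoints for a coding task.
checkpoints = [
    Checkpoint("Repository cloned", 1, lambda: True),
    Checkpoint("Merge request opened", 2, lambda: False),
]
# grade(checkpoints) -> (1, 3)
```

Because the evaluators are deterministic functions rather than LLM judges, re-running the same agent trajectory yields the same score.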
- Evaluation Metrics:
  - Full completion score.
  - Partial completion score.
  - Number of steps (LLM calls).
  - Cost per instance (API querying cost).
- Example: Managing a Sprint project with various subtasks.
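A rough sketch of how these four metrics could be aggregated across tasks. The scoring scheme (simple point fraction for partial completion) and all numbers are illustrative assumptions; the benchmark's exact weighting may differ:

```python
def partial_score(points_earned: int, points_total: int) -> float:
    """Fraction of checkpoint points earned (illustrative scheme)."""
    return points_earned / points_total

def full_score(points_earned: int, points_total: int) -> int:
    """1 only when every checkpoint passed, else 0."""
    return int(points_earned == points_total)

# Per-task bookkeeping: (points_earned, points_total, llm_calls, api_cost_usd)
# -- all numbers below are made up for illustration.
results = [
    (3, 3, 40, 0.52),
    (1, 4, 75, 1.10),
]
full_rate = sum(full_score(e, t) for e, t, *_ in results) / len(results)
avg_partial = sum(partial_score(e, t) for e, t, *_ in results) / len(results)
avg_steps = sum(s for *_, s, _ in results) / len(results)
avg_cost = sum(c for *_, c in results) / len(results)
```

Reporting cost and step counts alongside success rates is what lets the video compare models on price-performance rather than accuracy alone.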
- Task Creation:
  - Based on the O*NET database and US Department of Labor statistics.
  - Focused on a software company setting.
  - Co-authors with relevant professional experience created the tasks.
  - Manual curation and quality control.
- Baseline Agent: The OpenHands agent, specifically the CodeAct agent with browsing.
- Experimental Results:
  - Claude 3.5 Sonnet performs best with a 24% success rate.
  - Gemini 2.0 Flash is second with 11.4% success.
  - OpenAI’s models are less competitive on price-performance.
- Common Agent Failures:
  - Lack of common sense.
  - Lack of social skills.
  - Incompetence in web browsing.
  - Deceiving itself about task completion.
- Resources:
  - Experiment code and results are available on GitHub.
  - A project page includes demos of agents performing tasks.
Detailed Instructions and URLs
- No specific CLI commands, website URLs, or detailed instructions were provided in the transcript.