    Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

    By The Tech Guy, November 7, 2025

    The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

    The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

    With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

    Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

    “Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

    Higher Bar, Cleaner Data

    Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

    However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

    Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

    A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

    “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

    Harbor: Unified Rollouts at Scale

    Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

    Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

    Designed to generalize across agent architectures, Harbor supports:

    • Evaluation of any container-installable agent

    • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

    • Custom benchmark creation and deployment

    • Full integration with Terminal-Bench 2.0

    Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

    Early Results: GPT-5 Leads in Task Success

    Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface), powered by GPT-5, in the lead with a 49.6% success rate, the highest among all agents tested so far.

    Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

    Top 5 Agent Results (Terminal-Bench 2.0):

    1. Codex CLI (GPT-5) — 49.6%

    2. Codex CLI (GPT-5-Codex) — 44.3%

    3. OpenHands (GPT-5) — 43.8%

    4. Terminus 2 (GPT-5-Codex) — 43.4%

    5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

    The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
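
    To put the clustering in numbers, the short Python sketch below works only from the five percentages listed above and the 89-task count stated for the suite. The conversion to an approximate task count is an assumption for illustration: it treats each success rate as a simple fraction of 89 tasks, which may not match how the leaderboard aggregates multiple runs.

```python
# Top-five success rates as reported above (percent).
results = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}
TOTAL_TASKS = 89  # task count stated for Terminal-Bench 2.0

scores = list(results.values())

# Gap between first and fifth place, in percentage points.
spread = round(max(scores) - min(scores), 1)

# Gap between first and second place, and between second and fifth.
leader_gap = round(scores[0] - scores[1], 1)
chasing_pack_spread = round(scores[1] - scores[4], 1)

# True if no listed agent clears the 50% mark.
none_above_half = all(score < 50.0 for score in scores)

# Rough task-count equivalent (assumption: rate is a plain fraction of 89 tasks).
approx_tasks = {
    name: round(score / 100 * TOTAL_TASKS, 1) for name, score in results.items()
}

print(f"first-to-fifth spread: {spread} points")
print(f"leader ahead of second place by: {leader_gap} points")
print(f"second-to-fifth spread: {chasing_pack_spread} points")
print(f"no agent above 50%: {none_above_half}")
```

    On these figures, second through fifth place sit within 1.5 points of each other, while the leader is 5.3 points ahead of second place.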

    Submission and Use

    To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

    harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

    Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

    Aiming for Standardization

    The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

    These tools offer a potential foundation for a unified evaluation stack — supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
