
AGI Inc
Building the Evaluations Foundation for the Next Generation of Autonomous AI Agents
Project details
The Client
AGI Inc. is an applied research lab building the next generation of agentic AI systems. The company was founded by Div Garg, the computer scientist and Stanford AI PhD researcher behind Multion, one of the first companies to achieve fully autonomous browser control. AGI Inc. is focused on how AI agents interact with the real web, and its flagship product, Agent-0, is a personalized AI co-worker that completes tasks autonomously on your smartphone and browser.
The Challenge
AGI Inc. had the technical talent to build an ambitious autonomous agent. What they didn't have was an evaluations infrastructure. A system that navigates real websites, completes multi-step tasks, and handles unpredictable edge cases requires more than a working prototype. It requires a way to measure whether it works, catch what's broken before users do, and build confidence in the product before investors see it.
At the time Harbinger came on, AGI Inc. had no formalized QA processes, no benchmarks, and no structured way to evaluate agent performance at scale. The product was early. The stakes were high. And the timeline to investment was short.
Why Harbinger
Harbinger's founder, Julian Brooks, simultaneously served as Head of QA and Evaluations and Head of Customer Support at Multion, the company AGI Inc. spun out of. At Multion, he pioneered AI agent evaluation methodologies when no playbook existed, built automated support systems that achieved 93% customer satisfaction, and helped scale the company from 8 to 20 employees while managing both departments as a single operator. Much of that early evaluation work in 2024 laid the groundwork for what would eventually become REAL Bench, now used by labs like OpenAI and Anthropic to benchmark their own agent systems. (Full Multion case study →)
That history meant zero ramp-up time at AGI Inc. The evaluation frameworks and testing methodologies developed at Multion translated directly, and the AGI Inc. engagement was an opportunity to formalize that foundational work into a production-grade benchmark. No learning curve. No discovery phase spent figuring out what an autonomous browser agent is supposed to do. Day one was productive.
What We Did
Over a four-month engagement, Harbinger embedded directly into AGI Inc.'s operation and built their evaluations capability from the ground up.
Evaluations Department Buildout
We designed and implemented the QA and evaluation processes that AGI Inc. didn't have yet. This included defining what "success" looks like for an autonomous agent navigating live websites, building the testing frameworks to measure it, and establishing workflows the team could own and run independently after our engagement ended.
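To make that concrete, here is a minimal, purely illustrative sketch of what "defining success" for an autonomous web agent can look like: a task passes only if every check holds against the agent's final observed state, and individual pass/fail results roll up into an aggregate success rate. All names here (`Task`, `evaluate`, the sample states) are hypothetical and not AGI Inc.'s actual framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A hypothetical task definition: a goal plus machine-checkable success criteria."""
    name: str
    goal: str
    checks: list[Callable[[dict], bool]]  # predicates over the agent's final state

def evaluate(task: Task, final_state: dict) -> bool:
    """A task succeeds only if every defined check passes."""
    return all(check(final_state) for check in task.checks)

def success_rate(results: list[bool]) -> float:
    """Aggregate per-task pass/fail results into a headline metric."""
    return sum(results) / len(results) if results else 0.0

# Example: a checkout task judged on the final page state, not on
# the particular steps the agent took to get there.
checkout = Task(
    name="buy-notebook",
    goal="Add a notebook to the cart and reach the confirmation page",
    checks=[
        lambda s: s.get("page") == "order_confirmed",
        lambda s: "notebook" in s.get("cart", []),
    ],
)

print(evaluate(checkout, {"page": "order_confirmed", "cart": ["notebook"]}))  # True
print(evaluate(checkout, {"page": "cart", "cart": ["notebook"]}))             # False
```

Judging the final state rather than the action sequence is one common design choice for web-agent evals: it tolerates different valid paths to the same outcome.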
REAL Bench — A New Industry Benchmark
We assisted in the creation of REAL Bench, AGI Inc.'s open-source web agent evaluation benchmark. REAL Bench is a compact, fully functional "mini-Internet" featuring replicas of 11 real-world applications (Amazon, DoorDash, Airbnb, and more) with 112 standardized evaluation tasks spanning e-commerce, social media, productivity, and other domains.
Since release, the AGI SDK repository (which houses REAL Bench) has earned 450+ GitHub stars and been forked 37 times. Major AI labs now use it to evaluate their own agent systems, with published scores from OpenAI's CUA, Anthropic's Computer Use, Browser Use, and Stagehand. It has become a credible, independent evaluation standard for the browser agent space.
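A benchmark of this shape, replica sites plus a fixed suite of standardized tasks, can be pictured as tabular results tallied into per-domain scores. The site names, task IDs, and domains below are invented for illustration; the real REAL Bench task definitions live in the open-source AGI SDK repository.

```python
from collections import defaultdict

# Hypothetical result rows: (replica_site, domain, task_id, passed)
results = [
    ("amazon-replica",   "e-commerce", "search-add-to-cart", True),
    ("amazon-replica",   "e-commerce", "apply-coupon",       False),
    ("doordash-replica", "e-commerce", "order-delivery",     True),
    ("airbnb-replica",   "other",      "book-stay",          True),
]

def per_domain_scores(rows):
    """Tally pass/fail rows into a success rate per task domain."""
    passed, total = defaultdict(int), defaultdict(int)
    for _site, domain, _task, ok in rows:
        total[domain] += 1
        passed[domain] += int(ok)
    return {d: passed[d] / total[d] for d in total}

print(per_domain_scores(results))
```

Fixing the task suite and the replica sites is what makes scores comparable across agents: every system is graded on the same 112 tasks against the same 11 environments.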
Product Readiness and User Testing
Before users could test the product, it needed to actually work reliably. We identified and documented over 150 bugs across the system, spanning both agent- and model-level evaluation issues and concrete UI/application-level defects. Once the product was stable enough for real user sessions, we worked alongside AGI Inc.'s UX teams to run early user testing, identifying friction points and routing findings directly to the development team for iteration.
Mobile QA
We conducted thorough quality assurance testing on AGI Inc.'s mobile application, covering functionality, edge cases, and the kinds of unpredictable real-world scenarios that autonomous agents encounter when operating on actual devices.
Automated QA Solutions Analysis
We completed a comprehensive analysis of available automated QA solutions, evaluating tools across capability, reliability, and fit for AGI Inc.'s specific use case. The result was a detailed report with recommendations the team could act on immediately to scale their testing infrastructure.
Pre-Investment Issue Triage
With an investment round on the horizon, we diagnosed and triaged critical issues across the product. The goal was straightforward: make sure nothing broke in front of investors that could have been caught beforehand. We prioritized the issues that mattered most, ensuring the product presented well when it counted. The investment round closed successfully, and AGI Inc. has since gone through multiple rounds of funding. The company has also secured partnerships with Visa and Mastercard for agentic payment systems.
Results
In four months, Harbinger delivered:
A fully operational evaluations department built from scratch, with processes the AGI Inc. team could own and run independently
150+ bugs identified and documented across agent/model evaluation issues and application-level defects
30% improvement in agent success rate through systematic evaluation and testing
REAL Bench — an open-source benchmark now used across the AI agent industry (450+ GitHub stars, scores published by OpenAI, Anthropic, and others)
A product stable enough for user testing and investor presentations, with critical issues resolved before they became liabilities
An automated QA roadmap based on a thorough analysis of available tools and their fit for autonomous agent testing
A successful investment close, followed by multiple additional funding rounds and partnerships with Visa and Mastercard for agentic payments
Why This Engagement Worked
This project succeeded because the person building AGI Inc.'s evaluation systems had already built evaluation systems for the company AGI Inc. spun out of. That's institutional knowledge applied at the exact moment it was needed.
AGI Inc. was building in a space where no playbook existed, on a tight timeline to investment. Having an operator who had already solved a version of these problems meant the engagement produced results from week one instead of spending months on ramp-up. The four-month timeline reflects that: everything shipped, nothing wasted.
Harbinger embeds AI and automation specialists directly into businesses — not to replace teams, but to make them dramatically more effective. Learn more →
"Julian came in already understanding our problem space at a level that would have taken anyone else months to reach. In four months, he helped build out our evaluations infrastructure, helped us ship REAL Bench, and got the product ready for investors."
Div Garg
Founder / CEO

