🤖 AI & Software

AI capability leap is real: benchmarks show agents now handle 14-hour tasks

By Maya Patel · 5 min read

Investment analysts present benchmarks showing AI can now complete 14-hour human tasks with 80% accuracy, while token consumption has gone parabolic.

A recent webinar hosted by an investment firm laid out hard numbers that make it difficult to dismiss the current AI moment as hype. Over the course of a fireside chat, three senior investment professionals presented benchmarks showing that AI models can now reliably complete tasks that would take a human roughly 14 hours, and they did so with nearly 80% success rates. The numbers point to an inflection point that is already reshaping how the firm views both the technology sector and the broader economy.

The webinar, part of the firm's FLP series, featured Andrew Wetzel, a principal and managing director of sustainable investing; Nate, who leads work on AI and semiconductors; and Ellen Hazen, chief market strategist. The format was described as a fireside chat, with Scott acting as moderator.

The metrics that matter


Nate walked through three slides that captured the pace of change. The first tracked what he called "human equivalent task time": the length of a task an AI can complete successfully at least 50% of the time. In fall 2024, with ChatGPT o1, that figure stood at 19 minutes. By November 2025, it had jumped to three and a half hours. In December 2025, with GPT-5.2, it reached over six hours. And by February 2026, Claude could handle a 14-hour task, nearly two full workdays, and do everything correctly almost 80% of the time.

That is not an incremental improvement. It is a step change in what the technology can deliver. As Nate put it, "AI systems are suddenly better at language, better at coding, better at reasoning, and just simply much more capable all around."
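
Taken at face value, those task-time figures imply a doubling every few months. Here is a minimal back-of-the-envelope sketch in Python; the 16-month gap between the first and last data points is an assumption for illustration, since the webinar cited only seasons and months, not exact dates.

```python
import math

# Data points cited in the webinar: task length (in hours) a model can
# complete successfully at least 50% of the time. The 16-month gap is an
# assumption for illustration; the webinar gave seasons and months, not dates.
task_hours = {
    "fall 2024 (ChatGPT o1)": 19 / 60,   # 19 minutes
    "Nov 2025": 3.5,
    "Dec 2025 (GPT-5.2)": 6.0,           # "over six hours"
    "Feb 2026 (Claude)": 14.0,
}

months_elapsed = 16  # assumed: fall 2024 through February 2026
growth = task_hours["Feb 2026 (Claude)"] / task_hours["fall 2024 (ChatGPT o1)"]
doublings = math.log2(growth)

print(f"overall growth: {growth:.0f}x in {months_elapsed} months")
print(f"implied doubling time: {months_elapsed / doublings:.1f} months")
```

On those assumed dates, the metric doubles roughly every three months, which is the step change the slide was built to convey.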

The second slide showed global token consumption; tokens are the fundamental unit of computation that AI models use to process information. Usage tracked along at a steady pace for years before going parabolic in October 2025. That coincided with the emergence of more capable models that can reason for longer periods, using vastly more tokens to solve complex problems.

The third slide introduced the Artificial Analysis Intelligence Index, a broad benchmark that runs 10 different tests to determine an overall score. Nate used ChatGPT 5.5 as an example: with roughly 4 million tokens across the test suite, it scored about 45%. At 16 million tokens, the score exceeded 50%. At 60 million tokens, it hit 60%. The correlation between compute spent and performance was clear and steep.
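
One rough way to quantify that correlation is to fit the three quoted scores against the logarithm of tokens consumed. The sketch below is an illustrative assumption, not the index's own methodology, and uses only the approximate figures Nate cited.

```python
import math

# Benchmark points cited for ChatGPT 5.5 on the Artificial Analysis
# Intelligence Index: (tokens consumed across the test suite, overall score %).
points = [
    (4_000_000, 45.0),   # ~4 million tokens -> about 45%
    (16_000_000, 50.0),  # ~16 million tokens -> just over 50%
    (60_000_000, 60.0),  # ~60 million tokens -> about 60%
]

# Least-squares fit of score against log10(tokens). This log-linear form is
# an illustrative assumption, not the benchmark's own methodology.
xs = [math.log10(tokens) for tokens, _ in points]
ys = [score for _, score in points]
n = len(points)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

print(f"fit: score ~ {slope:.1f} * log10(tokens) {intercept:+.1f}")
print(f"each 10x increase in tokens adds about {slope:.0f} points")
```

On these numbers, each tenfold increase in tokens is worth roughly a dozen points of score, consistent with the "clear and steep" slope the slide showed.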

What this means for investors

The conversation then turned to the costs of sustaining this trajectory. Ellen Hazen, the chief market strategist, fielded the question of how much capital is needed and who will foot the bill. Her response was characteristically dry: "It's not going to be me."

The implication is that the hardware required to run these increasingly token-hungry models does not come cheap. Data centers packed with GPUs, networking gear, and cooling systems represent hundreds of billions in capital expenditure. The firms building that infrastructure, both the chip designers and the cloud providers, have been the primary beneficiaries of the AI boom so far. But the webinar suggested that the opportunity is widening as models become capable enough to automate whole categories of white-collar work.

Beyond the benchmarks

The panel also touched on specific capabilities that have emerged in the last few months. Nate mentioned OpenClaw, a popular open-source agentic tool that is now being used to "perform entire work days of tasks or perform scientific research," and that was only four months old at the time of the webinar. He also referenced an unreleased model from Anthropic, codenamed Methos, that can find decades-old flaws in existing software systems.

These are not academic demos. They are tools that are already being deployed. When Nate said that AI can now "search the web, execute tasks, and do things that humans typically do, even control household devices or robotics," he was describing a world where the line between human and machine labor is blurring faster than most people realize.

Context and caution

The webinar was clearly aimed at an investment audience, not a technical one. That meant the panelists simplified some concepts. For example, Nate defined AI as "software that finds patterns and makes predictions," which is a useful shorthand but leaves out the architectural innovations (transformers, attention mechanisms, reinforcement learning from human feedback) that made the latest wave possible.

Still, the numbers they presented are striking. A three-month span that saw human equivalent task time jump from 3.5 hours to 14 hours is not business as usual. And the parabolic rise in token consumption suggests that the demand for compute is not leveling off. If anything, it is accelerating.

What the panel did not discuss in detail is the energy cost of this trend, the geopolitical risks around chip supply chains, or the regulatory response that seems increasingly likely. Those are all important factors for any long-term investment thesis. But for the purpose of this webinar, the message was clear: AI's capabilities are climbing a steep curve, and the hardware required to support that climb represents both a huge cost and a huge opportunity.

The bottom line

The FLP webinar provided a data-rich snapshot of where AI stands in early 2026. The benchmarks show that models have crossed a threshold: they can now handle tasks that used to require a full day of human effort, with reliability that makes them economically meaningful. The token explosion indicates that the industry is pouring resources into making these models even better, and the early results suggest that strategy is working.

For anyone following AI, the key takeaway is that the technology is not plateauing. It is still in the phase where throwing more compute at a problem yields noticeably better results. The question is how long that dynamic holds, and whether the infrastructure can keep pace with the ambition.

Maya Patel

Staff Writer

Maya writes about AI research, natural language processing, and the business of machine learning.
