The real test for AI agents is not the answer. It is the spreadsheet clients can trust
[Image: A banking review room shows AI deliverables stopped at a client-ready gate marked 0%. Credit: AI-generated / Tech&Space]
- About 500 current and former bankers evaluated AI outputs on 100 realistic investment-banking tasks
- No model produced a client-ready output, and 41% of outputs needed major rework
- AI looks more useful as a starting draft than as a final product because formula, logic, and sourcing errors break trust
BankerToolBench targets exactly the zone where generative AI often looks convincing until the cost of an error appears. According to The Decoder and the public BankerToolBench repository, Handshake AI and McGill University built a benchmark around junior-banker work: Excel financial models, PowerPoint decks for clients, PDF reports, and Word memos. This is not a chat prompt with one neat answer. It is multi-file work where the formula, number, source, and style all have to stay aligned.
About 500 current and former investment bankers participated. A subset of 172 bankers designed the tasks and logged more than 5,700 hours of work. The benchmark contains 100 tasks, and a human banker needed five hours on average to complete one, with some taking up to 21 hours. That workload explains why the result is uncomfortable for AI marketing: none of the outputs from the nine tested models was rated ready to send to a client.
"Client-ready" matters here. It does not mean the text sounds professional. It means the deliverable can be sent without hidden calculation errors, wrong assumptions, inconsistent numbers across slides, or missing audit trails. A banker sending a report is not selling a vibe. They are sending a document that has to survive review. That is why the criteria covered technical correctness, client readiness, compliance, auditability, and consistency across files.
The numbers are blunt. Bankers rated 41% of AI outputs as needing major rework, and 27% as completely unusable. Only 13% could pass with light edits, but no output could go to a client as-is. GPT-5.4 led the benchmark, but it still missed the banking bar: only 16% of its outputs qualified as a useful starting point, and when consistency across three attempts was required, that fell to 13%.
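Why a consistency requirement lowers the score can be shown with a small sketch. The article does not spell out the benchmark's exact rule, so this assumes the simplest reading: a task counts only if all three attempts pass. The task names and pass/fail values are hypothetical illustration data, not benchmark results.

```python
# Hypothetical pass/fail results: three attempts per task (not benchmark data).
results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, False],
}


def single_attempt_rate(results):
    """Share of individual attempts that passed, pooled over all tasks."""
    attempts = [ok for runs in results.values() for ok in runs]
    return sum(attempts) / len(attempts)


def consistent_rate(results):
    """Share of tasks where every attempt passed (assumed all-of-three rule)."""
    return sum(all(runs) for runs in results.values()) / len(results)


print(f"single-attempt rate: {single_attempt_rate(results):.2f}")
print(f"consistent rate:     {consistent_rate(results):.2f}")
```

Under this rule the consistent rate can never exceed the single-attempt rate, because one failed attempt is enough to disqualify a task.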
BankerToolBench did not test pretty answers alone. It tested Excel models, decks, and reports that investment banks must be able to trust, down to the formulas and audit trails, under client pressure.
[Image: A financial model is marked with audit notes for hardcoded values, broken formulas, and source gaps. Credit: AI-generated / Tech&Space]
The strongest part of the benchmark is also the harshest: these systems do not fail only on grand reasoning problems. They fail on small business details. Claude Opus 4.6, according to the researchers, could look polished on the surface, but its Excel models often used hardcoded values. In plain terms, a number is typed in as a fixed value instead of being calculated by a formula. In investment banking, that is a serious flaw because a scenario cannot update. Change the purchase price, and the model should recalculate. If the number is just pasted into a cell, the model is performing reliability rather than providing it.
BankerToolBench also measures how agents use tools. A single task can trigger up to 539 language-model calls, and 97% of those calls are tied to tool use or code execution. This is not just a question of fluent writing. The AI has to open data rooms, pull market data, read SEC filings, manipulate files, and return a deliverable that survives verification. The longer the chain, the more chances a small error has to become expensive.
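To make the hardcoded-value problem concrete, here is a minimal sketch in plain Python. It is not the benchmark's grading code. The cell layout, the fee example, and the helpers `find_hardcoded`, `evaluate`, and `with_purchase_price` are hypothetical, and the toy evaluator only understands constants and a product of two cell references.

```python
# Hypothetical mini-sheet: cell address -> raw content as typed into the cell.
# A string starting with "=" is a formula; anything else is a typed-in constant.
INPUT_CELLS = {"B1", "B2"}  # assumption: only these cells may hold typed-in numbers

formula_model = {
    "B1": "500",     # purchase price, an input
    "B2": "0.02",    # fee rate, an input
    "B3": "=B1*B2",  # fee, calculated from the inputs
}
hardcoded_model = {**formula_model, "B3": "10"}  # same fee, pasted as a number


def find_hardcoded(cells, input_cells):
    """Return non-input cells that hold a constant instead of a formula."""
    return sorted(
        addr for addr, raw in cells.items()
        if addr not in input_cells and not raw.startswith("=")
    )


def evaluate(cells, addr):
    """Evaluate one cell. Toy evaluator: supports constants and '=X*Y' only."""
    raw = cells[addr]
    if not raw.startswith("="):
        return float(raw)
    left, right = raw[1:].split("*")
    return evaluate(cells, left) * evaluate(cells, right)


def with_purchase_price(cells, price):
    """Return a copy of the sheet with a new purchase price in B1."""
    return {**cells, "B1": str(price)}


for name, model in [("formula", formula_model), ("hardcoded", hardcoded_model)]:
    scenario = with_purchase_price(model, 600)
    print(name, "flagged:", find_hardcoded(model, INPUT_CELLS),
          "fee at 600:", evaluate(scenario, "B3"))
```

Both sheets show the same fee at the original purchase price, which is why the flaw is easy to miss on a visual check. Only the formula version follows the price when the scenario changes; the pasted value stays where it was.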
The researchers describe four recurring failure modes for GPT-5.4. The largest category, 41%, involves code and formula errors. Another 27% comes from broken business logic, such as adding cost synergies to revenue instead of costs. Aborted data queries account for 18%, and in 13% of cases the model fabricates missing numbers and presents them as sourced. That is the dangerous kind of error: it does not look like a gap. It looks like confidence.
The benchmark does not say AI is useless in banking. More than half of the bankers said they would use the output as a starting point. That is a sensible threshold. A model can sketch structure, assemble early pieces, and speed up a first draft. But a draft is not a delivery. In finance, the difference between those two words can be the difference between saving time and creating a professional liability.
BankerToolBench is also a training tool. The authors report that Dr. GRPO and DPO methods improved Qwen model performance by a factor of five to thirteen, though from a very low baseline. That is the useful signal: the benchmark is not only an indictment of current models, but a map of where they need to improve. For now, the grounded conclusion is simple. AI agents can enter banking work as drafting assistants, but client delivery still needs humans who understand formulas, sources, and accountability.

