ChatGPT vs Claude vs Copilot for Excel (I Tested All 3)
Most AI tool comparisons I come across test the wrong things. They ask each tool to write a cover letter, summarize a document, or explain a concept. Then they rank the outputs based on which one sounded the most polished. That’s useful if you’re a content writer. It tells a finance professional almost nothing.
The question I actually needed answered was simpler and more specific: if I’m building a forecast model or writing a CFO memo under deadline, which of these tools gets me to something usable on the first pass? That’s it. Not which one has the most features. Not which one has the cleanest interface. Which one reduces the amount of work I have to do after it finishes.
So I set up a test of ChatGPT vs Claude vs Copilot for Excel. Same dataset going into every tool. Same three prompts, in the same order, with no extra coaching beyond answering direct questions if the tool asked. I scored on three things: output quality, time to a usable result, and cleanup required after the tool finished. I ran all three tasks through ChatGPT for Excel, Claude for Excel, and Copilot for Excel, and I let the outputs speak for themselves.
I went in with a hypothesis. I came out with a clear verdict and a couple of surprises. If you’re currently using one of these tools and wondering whether you picked right, this is the test that should help you answer that.
My Testing Framework
The dataset I used was built around a fictional coffee shop business with six months of actuals from January through June 2023. It had revenue, cost of goods sold, and operating expenses broken out across three locations, with actuals versus budget included. Nothing exotic. The kind of file that lands in your inbox every month.
The three tasks were:
- Task 1: Build a twelve-month forecast model with trend assumptions, structured so scenarios can be run on it.
- Task 2: Run a scenario where foot traffic falls 10% while food costs increase 5% simultaneously.
- Task 3: Turn the forecast and scenario results into a CFO update memo with findings and a recommendation.
I scored each output on how much of it was usable without rebuilding, how long it took to get there, and how much cleanup it required after the tool finished. Same criteria, applied consistently across all three tools.
How to Enable AI Inside Excel
Before we get into the results, a quick note on setup, because this is where a lot of people hit friction and give up before they even get started.
Copilot for Excel
If you’re in a Microsoft 365 environment, Copilot is already there. You don’t need to install anything or connect a third-party plugin. What you do need to enable is what Microsoft now calls “Edit with Copilot,” which was previously labeled as agent mode.
This is the setting that gives Copilot the ability to work across the entire workbook rather than just responding in the chat panel. Without it turned on, Copilot is answering questions but not touching your file. With it on, it can build, modify, and populate sheets directly. That’s the version you want for anything beyond a simple lookup.

Claude and ChatGPT for Excel
Both of these are third-party tools, which means they don’t connect to Excel by default. You have to manually enable that connection, and the process is slightly different for each one. The good news is that once it’s set up, both tools can interact directly with your workbook the same way Copilot does. If you need a walkthrough on getting either of them set up, I’ve covered both in separate deep-dive videos and the links are below this article.

One thing worth flagging: both Claude for Excel and ChatGPT for Excel are still technically in beta. That’s not a dealbreaker, but it’s worth knowing going in. The outputs we’re about to look at are coming from tools that are still being refined, which is part of why running a structured test matters. You want to know what you’re working with before you depend on it under deadline.
Task 1: Building a Forecast Model From Raw Data
What I Asked Each Tool to Do
The prompt was deliberately simple: build me a forecast model based on this data, twelve months forward, with trend assumptions, structured so I can run scenarios on it. I didn’t give any tool extra guidance on format or methodology. I didn’t coach them on what a good output looks like. The only exception was Claude, which came back and asked me to confirm its plan before proceeding. I said yes and kept it moving. Beyond that, every tool got the same inputs and the same level of engagement.
The clock started when I hit send.
How Each Tool Responded
ChatGPT worked in the background without showing its progress. Two minutes and forty-six seconds later it came back with a forecast model, and that’s where the problems started. It built everything into a single cluttered tab, merging the assumptions and the forecast into the same block of cells in a way that overwrote part of its own structure.
Cost of goods sold was missing entirely from the forecast even though it was clearly present in the source data. The twelve months were there, technically, but the output was hard to read, harder to use, and would have required significant reconstruction before it went anywhere near a director.

Claude did something none of the other tools did: it came back before building anything and told me its plan. "Here's the structure I'm thinking about, does this work for you?" I said yes. That added maybe thirty seconds. What came back made it worth it.
Claude built a separate assumptions tab with driver-based logic for each account, notes on why it chose each assumption, a consolidated P&L forecast by month, individual location-level detail, and a chart it added without being asked. The full run came in under two minutes. Cost of goods sold was included. The structure was clean. If I had to hand that output to someone in a pinch, I could have done it with a light review.
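To make "driver-based" concrete, here's a minimal sketch of the logic in Python. The account names, growth rates, and starting values are illustrative placeholders, not Claude's actual assumptions; the point is the structure, where every line item flows from an explicit input you can change.

```python
# Minimal sketch of a driver-based forecast: each line item is computed
# from an explicit, adjustable assumption rather than a statistical fit.
# All rates and starting values below are illustrative placeholders.
import pandas as pd

months = pd.period_range("2023-07", periods=12, freq="M")

assumptions = {
    "revenue_growth_mom": 0.02,   # month-over-month revenue growth
    "cogs_pct_of_revenue": 0.35,  # COGS as a share of revenue
    "opex_growth_mom": 0.005,     # operating expense inflation
}

revenue, opex = 120_000.0, 45_000.0  # June 2023 actuals (illustrative)

rows = []
for month in months:
    revenue *= 1 + assumptions["revenue_growth_mom"]
    opex *= 1 + assumptions["opex_growth_mom"]
    cogs = revenue * assumptions["cogs_pct_of_revenue"]
    rows.append({"month": str(month), "revenue": revenue, "cogs": cogs,
                 "opex": opex, "operating_income": revenue - cogs - opex})

forecast = pd.DataFrame(rows)
print(forecast.round(0))
```

Change any value in `assumptions` and the whole forecast recomputes, which is exactly what makes the scenario work in Task 2 possible.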

Copilot ran in just under two minutes and was transparent about its steps throughout, which was a nice contrast to ChatGPT doing everything in the background. The problem wasn’t the process, it was the methodology. Instead of building a driver-based forecast with assumptions, Copilot used FORECAST.LINEAR, which is a native Excel function that runs a straight statistical projection off the historical numbers.
It’s not wrong, exactly. It’s just not how a finance team actually builds a forecast model. There were no assumptions to interrogate, no drivers to adjust, and no way to run a meaningful scenario off of it. The formatting was clean and easy to read. The analytical foundation wasn’t there.
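For reference, FORECAST.LINEAR is an ordinary least-squares line fit over the history, projected forward. Reproducing that math outside Excel, with illustrative numbers, makes the limitation obvious: the only inputs are the historical points themselves, so there is no assumption to change.

```python
# FORECAST.LINEAR is a straight least-squares fit projected forward.
# The historical values below are illustrative.
import numpy as np

actuals = np.array([100_000, 104_000, 101_500, 108_000, 112_000, 115_500])
x = np.arange(1, len(actuals) + 1)

# polyfit with degree 1 runs the same regression FORECAST.LINEAR does
slope, intercept = np.polyfit(x, actuals, deg=1)

future = np.arange(len(actuals) + 1, len(actuals) + 13)  # months 7..18
projection = intercept + slope * future
print(projection.round(0))
```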

CASE STUDY: The Night Before the Board Deck
Here's a situation I've seen play out more than once. It's 9pm. The board deck goes out at 7am. An analyst needs a first-draft forecast model to hand to a manager for review by midnight. Building one from scratch takes three to four hours. With Claude, that first draft was ready in under two minutes at 90% usability, meaning a light review and some company-specific formatting, and it's done.
With Copilot, the analyst gets a well-formatted starting point but has to rebuild the methodology before anyone can run scenarios off it. With ChatGPT, the output is missing a core line item and has structural problems that take longer to fix than starting over. The tool you pick in that moment determines whether the analyst gets home before 2am.
Task 1 Verdict
Claude finished first and it wasn’t close. A clean assumptions table, driver-based logic, location-level detail, and a consolidated P&L that was close to ready. Copilot came second on the strength of its formatting and the fact that FORECAST.LINEAR is at least a stable and accurate Excel function even if it’s the wrong methodology for this kind of work. ChatGPT came in last with an output that was unusable, structurally broken, and missing a core line item that was sitting right there in the source data.
Task 2: Scenario Analysis — Where the Real FP&A Work Lives
The scenario prompt was this: what happens if foot traffic falls 10% while food costs increase 5% at the same time?
That’s a rate-volume problem. Revenue goes down because fewer customers are coming in. Cost of goods sold has two things happening simultaneously. It goes down from a volume standpoint because you’re serving fewer people, but it goes up from a rate standpoint because each unit of food costs more. A correct output separates those two effects. It doesn’t just apply 5% to the cost line and call it a scenario. It adjusts cost of sales down for the volume decrease first, then applies the rate increase on top of that.
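Here's that arithmetic with illustrative numbers, just to pin down the order of operations and show how much the wrong base overstates the cost line:

```python
# Worked sketch of the rate-volume split. All figures are illustrative.
base_revenue = 100_000.0
base_cogs = 35_000.0

traffic_change = -0.10    # foot traffic falls 10%
cost_rate_change = 0.05   # food costs rise 5%

scenario_revenue = base_revenue * (1 + traffic_change)

# Correct: volume effect first, then the rate increase applied on the
# volume-adjusted base.
correct_cogs = base_cogs * (1 + traffic_change) * (1 + cost_rate_change)

# The shortcut to avoid: rate applied to the gross base, ignoring the
# volume benefit, which overstates the cost pressure.
wrong_cogs = base_cogs * (1 + cost_rate_change)

print(f"correct COGS: {correct_cogs:>9,.0f}")            # 33,075
print(f"wrong COGS:   {wrong_cogs:>9,.0f}")              # 36,750
print(f"overstated by {wrong_cogs - correct_cogs:,.0f}")  # 3,675
```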
This is the task that tells you whether a tool actually understands what you’re asking or whether it’s pattern-matching to something that looks like a finance output without understanding the underlying logic.
How Each Tool Handled It
ChatGPT ran fast, under a minute, which given what came back was not a good sign. The scenario was built on top of a forecast that was already missing cost of goods sold entirely. So when ChatGPT applied the scenario, it had nothing to work with on the cost side.
The operating margin in the output swung from roughly zero to over 100% across the scenario period, which is what happens when you run math on a model with a missing line item. It was also a static output. No editable inputs, no ability to adjust the assumption and see the results change. It was a snapshot of a flawed calculation.

Claude took about two minutes and came back with something meaningfully different from what the other tools produced. It built a dynamic scenario model with editable inputs sitting right in the sheet. The foot traffic decrease and the food cost increase were both adjustable, and changing either one updated the full P&L in real time. More importantly, it actually separated the rate and volume effects correctly.
The cost of sales decreased from volume before the rate increase was applied on top of it. Net operating income dropped 25.6% over the twelve-month period under the scenario. When I changed the foot traffic assumption to test the model, every number adjusted correctly. It caught the intention behind the prompt, not just the surface instruction.

Copilot ran fast and came back with an output that was directionally reasonable but mechanically off in a specific way. It applied the 5% cost increase to the gross cost of sales number without first adjusting it down for the volume decrease. That’s the rate impact without the volume benefit, which means the scenario overstates the cost pressure.
It also made some questionable choices, including applying the impact to utilities and marketing while leaving labor and rent unchanged, without explaining why. The output was functional, but it's not the kind of thing you'd want to present to a CFO without checking the math first.

CASE STUDY: The Scenario That Goes Into the CFO Presentation
A finance manager I worked with was building a sensitivity analysis for a leadership review. The scenario was a revenue shortfall combined with an input cost increase, essentially the same structure as this test. She ran it through one of these tools, got an output that looked right, and started building the presentation around it. It wasn’t until the night before the meeting that someone caught that the cost of sales impact was calculated on the wrong base.
The tool had applied the rate increase without accounting for the volume decrease first, which overstated the margin compression by almost four percentage points. She rebuilt it manually in two hours. The scenario that actually understands rate-volume isn’t just saving time. It’s reducing the risk of presenting numbers that don’t hold up when someone in the room asks how you got there.
Task 2 Verdict
Same ranking as Task 1, but the reasoning is different and worth paying attention to. Claude’s advantage in Task 1 was largely about structure and output quality. In Task 2, the advantage was analytical accuracy under constraint.
When you change an assumption and ask the model to cascade that change correctly through a P&L, you find out fast whether the tool understood the structure of what it built or was just producing something that looked like a forecast. Claude understood it. Copilot got the direction right but the math wrong. ChatGPT was working off a broken foundation from the start, which compounded into a scenario output that couldn’t be used or trusted.
Task 3: Writing a CFO Memo — The Task That Exposes Everything
A CFO memo is not a reformatted P&L. It’s not a data dump with a header on it. It leads with the finding, explains what changed and why it matters, and ends with a recommendation the reader can act on. The CFO doesn’t want to see the methodology. They want to know what the numbers mean and what to do about them.
That’s a specific kind of writing. It requires judgment about what to include and what to leave out. It requires understanding that quarterly figures are more useful to an executive than twelve months of monthly detail. It requires a section on what you recommend, not just what you found. Most finance professionals know this instinctively because they’ve sat across from a CFO and watched their eyes glaze over at a slide full of monthly variances.
The question in this task was whether any of these tools know it too.
What Each Tool Produced
ChatGPT placed the memo in a range of cells on a new tab, which is a reasonable approach. What went inside those cells was not. The body of the memo was a number dump. It pulled the base case and scenario figures and dropped them into the document as a table, which is precisely what a CFO memo is not supposed to do. The key takeaways section read as generic observations with no analytical weight behind them.
And because the forecast from Task 1 was already missing cost of goods sold, the operating margin figure cited in the memo was mathematically inconsistent with any reasonable reading of the business. I’ll be direct: if that memo went out with a CFO’s name on it, it would raise questions about whoever prepared it.

Claude built an entirely new tab and structured it the way a finance professional would actually structure a CFO memo. Executive summary up front, not buried on page two. Key metrics in a dashboard format. Quarterly P&L instead of twelve months of monthly detail, because that’s what belongs in an executive update. A location performance comparison. A stress test summary from the scenario. And a section explicitly labeled observations and recommendations, because decisions are the point.
There’s one moment from this run worth calling out specifically. At some point during the test, I had manually adjusted the cost of goods sold figure in the dataset. I didn’t mention it in the prompt. Claude caught it, noted the change, and adjusted the memo accordingly. I didn’t ask it to do that. It just understood that accuracy matters and flagged it without being told to look for it.
The output needed polish. There are always company-specific details to add, language to tighten, and formatting preferences to apply. But the structure was right, the judgment calls were right, and it was 90% of the way there on the first pass.

Copilot produced something that was a real step up from ChatGPT in terms of structure and readability. The header was clean. Key metrics were prioritized near the top, which is exactly where a CFO wants them. The forecast summary was coherent.
But the memo was missing the one section that makes a CFO memo a CFO memo: a recommendation. There were observations in the output, but nothing that told the reader what to do with the information. The analytical narrative needed work too, partly because the underlying scenario from Task 2 had the rate-volume calculation off.
You can write a well-formatted memo around a flawed analysis and it will still be a well-formatted memo around a flawed analysis. Copilot got to about 60 to 70% of the way there, which is meaningful progress over a blank page but still requires a finance professional to do real work before it goes anywhere.

CASE STUDY: The 11pm Memo
A finance manager is preparing a CFO update after month-end close. It’s 11pm. The leadership call is at 7am. The benchmark isn’t perfection. It’s whether the output can go out with light review or requires a full rewrite. With ChatGPT, it requires a full rewrite and the underlying numbers have integrity problems that need to be found and fixed before the rewrite starts.
With Copilot, the manager gets a structurally sound document that needs an analytical narrative and a recommendation added, plus a check on the scenario math from Task 2. That’s meaningful work at 11pm. With Claude, the manager is reviewing and polishing. Tweaking the language, adding company context, confirming the numbers tie to the source. That’s a different night entirely.
Task 3 Verdict
Claude was the only output that was close to send-ready on the first pass. Copilot produced something structurally competent that needed analytical reinforcement and a recommendation before it was usable. ChatGPT produced a document that would require more work to fix than to replace, and that’s before accounting for the math problems it carried forward from Task 1.
Across all three tasks, the ranking never changed. What changed was the margin. Claude’s advantage on the CFO memo was wider than on the forecast model, because this task requires judgment and context, not just structured output. That’s the kind of gap that matters most when the work is due in the morning.
Final Verdict: ChatGPT vs Claude vs Copilot For Excel
After running all three tasks, the ranking was consistent from start to finish. That consistency is itself worth noting. I went into this test expecting some variability, expecting one tool to win one task and lose another, expecting to come out with a nuanced “it depends on what you’re doing” answer. That’s not what happened.
Claude finished first across all three tasks, with outputs that landed at roughly 90% usability on the first pass. Copilot came in second, consistently in the 60 to 70% range, with outputs that were structurally reasonable but required meaningful work before they were ready to use.
ChatGPT came in last, and it wasn’t a close last. The forecast model was structurally broken and missing a core line item. The scenario analysis compounded that flawed foundation into a result that couldn’t be trusted. The CFO memo was a number dump with an operating margin that didn’t reflect any coherent reading of the business. Across three separate tasks, ChatGPT produced outputs that were either unusable or would cost more time to fix than to replace.
When to Use Claude for Excel
Claude is the right tool when the output has to hold up. Driver-based financial modeling, scenario analysis where the math needs to cascade correctly through a P&L, executive communication where the judgment calls matter as much as the structure. These are the tasks where the gap between a 90% first draft and a 60% first draft is the difference between a late review and a late night.
It's also the right choice when you want a tool that meets you where you are on the prompt. Claude produced clean, structured, analytically sound outputs from the same simple prompts that sent the other tools sideways. That's not a small thing when you're running this workflow every month.
When Copilot Still Makes Sense
Copilot isn’t a bad tool. It’s a mismatched tool for the specific tasks I ran it through. If your team is deep in the Microsoft 365 ecosystem and switching costs are real, Copilot is a functional starting point for simpler reporting workflows where FORECAST.LINEAR is an acceptable methodology and the deliverable doesn’t require analytical judgment. It’s also more stable in the Excel environment than either third-party option, which matters on complex workbooks where you can’t afford unpredictable behavior.
The honest framing is this: Copilot will get you somewhere. It just won’t get you as far, as fast, on the tasks that require the most from you.
Why ChatGPT for Excel Isn’t Ready for Core FP&A Work
I want to be careful here because this is a verdict on the Excel plugin specifically, tested at a specific point in time, not a verdict on ChatGPT as a tool across every use case. But within the scope of this test, the results were consistent enough that I can say this clearly: I would not use the ChatGPT for Excel plugin for core FP&A work right now.
Missing cost of goods sold in the forecast, building a scenario on top of that flawed foundation, producing a CFO memo with inconsistent numbers: these aren't prompt engineering problems. These are output quality problems. Spending time iterating on prompts to try to close those gaps is time you could have spent doing the work yourself or using a tool that got closer on the first pass.
How to Replicate This Test on Your Own Data
The most useful thing you can do after reading this isn’t to take my verdict and apply it permanently to your workflow. It’s to run the test yourself on your own data and see what comes back. Your data structure, your prompts, your specific use cases will all influence the results. Here’s how to set it up in a way that gives you a clean, comparable read.
Step 1: Prepare Your Dataset
You want a file that’s representative of the work you actually do, not something built for the test. At minimum it should have actuals versus budget, multiple line items across revenue and expenses, more than one dimension such as location, product, or business unit, and at least three to six months of history. The more it looks like the file that lands in your inbox every month, the more the test results will tell you something useful.
Keep the file clean before you start. Remove anything that isn’t relevant to the tasks. You’re testing the tool’s analytical capability, not its ability to navigate a messy workbook.
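If you can't use real company data for the test, a rough generator like this will produce a representative file. The locations, accounts, and values are illustrative; swap in whatever mirrors your actual reporting structure.

```python
# Sketch of a representative test file: actuals vs. budget, multiple
# line items, multiple locations, six months of history. Requires
# openpyxl for the Excel write. All names and values are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
months = pd.period_range("2023-01", periods=6, freq="M").astype(str)
base_values = {"Revenue": 100_000, "COGS": 35_000, "Labor": 25_000,
               "Rent": 8_000, "Marketing": 4_000, "Utilities": 2_500}

rows = []
for loc in ["Downtown", "Airport", "Suburb"]:
    for acct, base in base_values.items():
        for m in months:
            budget = base * rng.uniform(0.98, 1.02)
            actual = budget * rng.uniform(0.92, 1.08)
            rows.append({"location": loc, "account": acct, "month": m,
                         "actual": round(actual), "budget": round(budget)})

pd.DataFrame(rows).to_excel("ai_excel_test_data.xlsx", index=False)
```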
Step 2: Build Your Prompt Set
Use the same three prompts across all three tools and resist the temptation to iterate. The point of the test is to see what each tool produces on a fair, consistent ask, not to see how good you can get each one with enough coaching.
The prompts that drove this test were:
- Build me a forecast model based on this data, twelve months forward, with trend assumptions and structured so I can run scenarios on it.
- Run a scenario analysis. What happens if [specific assumption] changes by [specific amount] while [second assumption] changes by [second amount]?
- Turn the forecast and the scenario results into a CFO update memo with findings and a recommendation.
Adjust the scenario variables to match something your business actually cares about. The structure of the prompt matters less than the consistency. Every tool gets the same words.
Step 3: Score the Outputs
Apply the same three criteria I used. First, how much of the output is usable without rebuilding it, expressed as a rough percentage. Second, how long it actually took to get there, including any prompt iteration you did. Third, how much cleanup the output required after the tool finished, and what kind of cleanup. Structural problems, like a missing line item or a broken methodology, are more expensive than formatting cleanup. Make that distinction when you’re scoring.
Write it down. The gap between tools tends to feel more obvious when you have to articulate it in words rather than just scanning the outputs side by side.
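If it helps, here's a minimal scoring template that mirrors the three criteria. The fields and the sample entries are illustrative, pulled from the results above rather than any formal rubric.

```python
# Minimal scoring sheet for the three criteria. Field names and the
# sample entries below are illustrative.
from dataclasses import dataclass, field

@dataclass
class ToolScore:
    tool: str
    task: str
    usable_pct: int            # rough % usable without rebuilding
    minutes_to_result: float   # including any prompt iteration
    cleanup_notes: str         # structural fixes cost more than formatting
    structural_issues: list[str] = field(default_factory=list)

scores = [
    ToolScore("Claude", "forecast model", 90, 2.0,
              "light review, company-specific formatting"),
    ToolScore("Copilot", "forecast model", 65, 2.0,
              "rebuild methodology before running scenarios",
              ["FORECAST.LINEAR instead of drivers"]),
]

for s in scores:
    print(f"{s.tool:8} {s.task:15} {s.usable_pct}% usable, "
          f"{s.minutes_to_result:.0f} min, issues: {s.structural_issues}")
```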
Step 4: Make the Call
The goal isn’t to find one tool that handles everything. The goal is to identify which tool earns its place in each part of your workflow and stop using the wrong one by default. If Claude wins your forecast model task and Copilot is good enough for your monthly reporting summary, that’s a useful answer. Use it.
What you’re trying to eliminate is the hidden time cost of a default choice made without testing. Most finance professionals pick a tool early, stick with it, and absorb the cleanup time without ever questioning whether a different tool would have gotten them closer on the first pass. Running this test once, on your own data, answers that question in an afternoon.
The work that follows is just using the right tool for the job.
