A while back, I asked Claude, ChatGPT, and Gemini to build a solar system simulator. This was back when Claude was the talk of the town, and more people, myself included, were realizing that maybe they shouldn’t limit themselves to ChatGPT.
That test produced one of the clearest results I’ve had. Claude knocked it out of the park and won decisively. Gemini produced working but not especially impressive code, and, interestingly, ChatGPT failed.
Today, I decided to test a different parameter. This time, instead of writing code, I tasked the LLMs with debugging code. Specifically, I asked them to fix ChatGPT’s broken code.
ChatGPT’s broken code
A small but catastrophic bug
The previous test had a simple constraint in place: no retries. Whatever we got in the first reply was the result we judged. Unfortunately for ChatGPT, the code it generated had a small, subtle mistake that rendered the whole thing unusable.
I talked about this in the original article too. ChatGPT used kilometers for some quantities and astronomical units (AU) for others, and then mixed them up: planet diameters were rendered on a kilometer scale, but the orbital distances between the planets were calculated in AU. The result was that the planets ended up effectively only kilometers apart, which, on the scale of the solar system, meant they were all inside each other.
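To make the failure mode concrete, here is a minimal, hypothetical sketch of that kind of unit mix-up. The names and numbers are mine for illustration, not ChatGPT's actual code:

```javascript
// Hypothetical sketch of the bug: sizes scaled from kilometers,
// but orbital distances left in astronomical units.
const KM_PER_AU = 1.496e8; // 1 AU is roughly 149.6 million km

const earth = { diameterKm: 12_742, orbitRadiusAu: 1.0 };

// The diameter is rendered on a kilometer-based scale...
const renderedDiameter = earth.diameterKm;   // 12742 scene units

// ...but the orbit radius is handed over still in AU:
const renderedOrbit = earth.orbitRadiusAu;   // 1 scene unit

// The orbit is now thousands of times smaller than the planet
// itself, so every planet collapses into the sun.
console.log(renderedOrbit < renderedDiameter); // true
```

With consistent units, the orbit radius would be about 149.6 million scene units, dwarfing the diameter as it should.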
ChatGPT’s code was neat, and it had implemented everything, but this one small mistake made it so unusable that I couldn’t really judge the other aspects of the code.
Writing code and debugging code are two different skills. It’s the same in humans. One developer might excel at writing code, while another excels at debugging it. Same concept applies elsewhere too. A good writer isn’t necessarily a good editor, and vice versa.
Anyway, let’s jump into the test. The prompt is below. It’s simple, and I made a point of asking each model to specify what it fixed so the diagnosis would be obvious.
You are given a broken codebase for a solar system simulator. The code contains a flaw that makes the simulator unusable. Your tasks: Identify the exact issue causing the failure. Fix the code. Return the fully corrected version of the code. Briefly explain what was broken and how you fixed it. Keep the explanation concise and focus only on the root cause and resolution.
The main test, initially, was to see whether Claude could fix ChatGPT’s code, but just to add more context, I asked all the usual suspects to do the same.
Gemini gets it right
Better than I expected
Gemini has lately been the worst of the bunch, at least to me. It’s overzealous. In my experience, even if you send it perfectly fine code and ask it to fix the bugs in it, it’ll hallucinate a bug, “fix” it, and destroy the code. So I was interested in whether it would be able to pinpoint the actual bug in code that was genuinely broken. For the record, we’re using Gemini 3.1 Thinking.
To my surprise, Gemini nailed it. It pinpointed the bug exactly: the projection engine expects positions in kilometers, but orbitalPosition and drawOrbit calculate in astronomical units. It even went as far as stating the result of the bug: it was, in effect, stacking the planets on top of the sun, making the system appear empty.
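A minimal sketch of that kind of fix, assuming the code is JavaScript; the function and field names here are illustrative stand-ins for the article's `orbitalPosition` and `drawOrbit`:

```javascript
// The projection step expects kilometers, so the fix is a single
// unit conversion before positions are handed off to it.
const KM_PER_AU = 1.496e8;

// Broken: returns AU, but the caller projects in km.
function orbitalRadiusAu(planet) {
  return planet.orbitRadiusAu;
}

// Fixed: convert AU to km before the projection engine sees it.
function orbitalRadiusKm(planet) {
  return planet.orbitRadiusAu * KM_PER_AU;
}

const earth = { orbitRadiusAu: 1.0 };
console.log(orbitalRadiusKm(earth)); // 149600000
```

The fix is trivial once the mismatch is spotted; the hard part, as this test shows, is spotting it.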
The fix it used was correct too, and the corrected code worked. I could finally see ChatGPT’s solar system simulator! You can too; I put it up on a Vercel page.
ChatGPT redeems itself
The redemption arc
I know, I know. This is, after all, ChatGPT’s broken code; you’d think the chatbot that wrote the bug couldn’t be trusted to find it. But ChatGPT has been on an uptick recently. It has been performing better than Gemini, at least in my experience. ChatGPT got so annoying at one point that I switched to Claude, but now, when I use it occasionally, I tend to get good results.
That’s natural. These models are constantly being tuned and updated, even when the version name doesn’t change. In my experience, ChatGPT is also much more sensitive to custom instructions than other chatbots, so your custom instructions in ChatGPT can drastically affect the experience.
Very surprisingly, ChatGPT nailed it. It found the root cause, gave a concise but clear explanation just as I had asked, and fixed the code. And the fixed code works. I used ChatGPT 5.4 Thinking.
I want to infer something here, but I’m not fully sure. ChatGPT thought the longest on the initial task. Maybe all the thinking that went into the task before generating the code filled up the context window and contributed to the mistake. Or maybe it’s just tuned in a way that makes it better at smaller tasks and tweaks than at generating a project from scratch.
I’m not certain, but the contrast is interesting. The biggest surprise, though, comes next.
Claude!
The biggest surprise
Claude’s output in the last test was simply in another league. It was more thorough, more textured, more factually rich, and more scientifically feasible. It set itself miles apart from ChatGPT and Gemini.
But here’s the surprise: Claude failed to find the main bug in ChatGPT’s code.
Instead, it found another bug involving the camera panning mechanism. To be fair, it didn’t hallucinate it. That bug is real, but it only shows up when you’re Shift-drag panning, and at most camera angles it’s subtle enough that it’s easy to miss. But considering there was a much, much bigger bug involved, one that made the simulator effectively useless, Claude completely missed it.
Weird, right? The chatbot that wrote the best code was now the worst at debugging someone else’s code. This time, Claude failed where ChatGPT and Gemini succeeded. For the record, we’re using Claude Sonnet 4.6, the same model we used before.
I was disappointed, and because I like Claude, I bent the rules and gave it another chance:
That is a bug, but it’s a subtle bug. There’s a much bigger bug in the code. One that renders it useless.
Claude promised to look at the code more carefully and went at it again. It came back with another result, which was again wrong. But, interestingly, it didn’t stop there. It said, “let’s look harder,” and continued.
Then it replied with another epiphany: “Now I see it.” But again, it got the wrong bug. This still wasn’t the scale-related showstopper. But Claude still didn’t stop. It kept thinking and generating until it finally hit a “wait, no” moment. That’s when it realized the AU-versus-kilometers bug. The consequence it deduced was off, but the solution was correct regardless. You can see snippets of the conversation in the gallery above.
Finally, it reported the correct bug along with the correct fix. Although I wonder how many tokens it burnt through with all that thinking, given Claude’s more restrictive token limits.
Get a second pair of eyes, even if it’s artificial
Claude is more of a bot than the other chatbots. It’s part of why I switched to Claude and why I’ve been using it the most lately. I want my chatbot to be a chatbot. I don’t want it trying so hard to be my best friend. I don’t want or need it to act human.
Whereas more mainstream chatbots like Gemini and ChatGPT seem more geared toward the average person and try harder to sound and feel human, Claude doesn’t. I like to think that distinction leaked into this test too. Claude found real bugs, just not the decisive one first. Its reasoning seemed to go something like this: this is a bug; the code shouldn’t have bugs; this matters; therefore this must be the important bug; end task.
Claude produced the strongest original simulator, yet on the first attempt it was the weakest at identifying the bug that mattered most. That, to me, is the main lesson here. Get a second pair of eyes on any project that matters, even if they’re artificial eyes.
One LLM is much like another on the surface, but they differ in meaningful ways. If one model wrote your code, start a separate conversation and ask it to find the bugs there. Then send the same code to another chatbot and ask it to do the same.
More and more, it turns out there isn’t one model to rule them all. We’re probably going to need a combination of them, just in case.

