
Can you share an example of the tasks where you found Codex to be much better? In my experience, Claude Code is much better.


Codex works much better for long-running tasks that require a lot of planning and deep understanding.

Claude, especially Sonnet 4.5, is a lot nicer to interact with, so it may be a better choice in cases where you are co-working with the agent. Its output is nicer, and it "improvises" really well even if you give it only vague prompts. That's valuable for interactive use.

But for delegating complete tasks, Codex is far better. The benchmarks indicate that, as do most practitioners I talk to (and it is indeed my own experience).

In my own work, I use Codex for complete end-to-end tasks, and Claude Sonnet for interactive sessions. They're actually quite different.


I disagree. Codex always gets stuck and wants to double-check and clarify things; it's like, "dammit, just execute the plan and don't tell me until it's completely finished."

Codex's output is also not as good. Codex is great at the planning and investigation portion but sucks at execution and code quality.


I've been dealing with this on Codex a lot lately. It confidently wraps up a task, I go to check its work... and it's not even close.

Then I do a double take and re-read the summary message and realize that it pulled a "and then draw the rest of the owl", seemingly arbitrarily picking and choosing what it felt like doing in that session and what it punted over to "next steps to actually get it running".

Claude is more prone to occasional "cheating" with mocked data or "TBD: make this an actual conditional instead of a hardcoded if True" stuff when it gets overwhelmed, which is annoying and bad. But it at least has strong task adherence to the user's prompt and doesn't make me write a lawyer-esque contract to close every loophole Codex will use to avoid doing work.


Are you using something like spec-kit?


Can / does Codex actually check docker logs and other things for feedback while iterating on something that isn't working? That is where the true magic of Claude comes in for me. Often things can't be one-shot, but being able to iteratively check logs, make an adjustment, rebuild the docker containers, send a curl, and confirm the fix is a huge improvement.
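In case it helps picture it, that loop is basically this, scripted (a rough sketch only; the compose service name "api" and the /health endpoint are made-up placeholders, not anything specific to my setup):

    import subprocess
    import urllib.request

    def rebuild():
        # Rebuild and restart the containers after a code change.
        subprocess.run(["docker", "compose", "up", "-d", "--build"], check=True)

    def tail_logs(lines=100):
        # Pull the most recent container logs back in as feedback.
        out = subprocess.run(
            ["docker", "compose", "logs", "--tail", str(lines), "api"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout

    def healthy(url="http://localhost:8080/health"):
        # The "send a curl" step: confirm the service actually answers.
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    rebuild()
    if not healthy():
        print(tail_logs())  # this output feeds the next adjustment

Whichever agent you use, the value is in that check-adjust-rebuild-verify cycle rather than any particular tooling.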


Yes, in this regard it's very similar. It works as an agent and does whatever you need it to do to complete the task. In comparison to Claude it tends to plan more and improvise less.


I'm on the same page here. I have seen this sentiment about Codex suddenly being good a few times now, so I booted Codex CLI (thinking-high) back up after a break and asked it to look for bugs. It promptly found five bugs that didn't actually exist. It was the kind of truly, impressively stupid mistake that I haven't seen Claude Code make essentially ever, and it made me wonder whether this is the sort of thing that's making people downplay the power of LLMs for agentic coding.


I asked Sonnet 4.5 to find bugs in the code; it found five high-impact bugs that, when I prompted it a second time, it admitted weren't actually bugs. It's definitely not just Codex.


In my case, Codex fixed a bug in one shot. It took 10 minutes to debug and find it.

Claude struggled for a long time and still didn't find it.


Same here. I tried Codex a few days ago for a very simple task (remove any references to X within this long text string) and it fumbled it pretty hard. Very strange.


Yeah, I'm in the same boat. Codex can't do this one task and constantly forgets what I've told it, yet I'm reading these comments saying how it's so great, to the point that I'm wondering if I'm the one taking the crazy pills. Maybe we're being A/B tested and don't know about it?


No. No one that's super-boosting the LLMs ever tells you what they're working on or gives any reasonable specifics about how and why it's beneficial. When someone does, it's a fairly narrow scope and typically in line with my experience.

They can save you some time by doing some fairly complex but routine tasks that you can write out in plain language instead of coding. To get good results you really need a lot of underlying knowledge yourself; essentially, I think of it as a translator. I can write a program in very good detail using normal language and then the LLM can convert it to code with reasonable accuracy.

I haven't been able to depend on it to do anything remotely advanced. They all make up API endpoints or methods or fill in data with things that simply don't exist, but that's the nature of the model.


You misread me. I'm one of the people you're complaining about. Claude Code has been great in my experience, and no, I don't have a GitHub repo of generated code for you to look at and tell me it's trivial and unadvanced and that a child could do it.

What I was saying was a comparison of my experience with Claude Code vs. Codex with GPT-5. CC is better than Codex in my experience, contrary to GP's comment.


Maybe, just maybe, people are lying on the internet. And maybe those people have a financial interest in doing so.


IMO gpt5-codex medium is much better as soon as the task becomes slightly complex, or the context grows a bit.

Sora 4.5 tends to randomly hallucinate odd/inappropriate decisions and goes to make stupid changes that have to be patched up manually.


Yes Sora hallucinates significantly more than Claude.

I find that Codex generally requires me to remove code to get to what I want, whereas with Claude I tend to use what it gives me and add to it. Whether that's through additional prompting or manual typing, I just find that Codex requires removal to get to the desired state and Claude requires adding to get there. I prefer adding incrementally to removing.


Curiously, you yourself did not provide an example where, in your experience, Claude Code was much better.


I cannot. We're all racing very hard to take full advantage of these new capabilities before they go mainstream, and to be honest, sharing problem domains that are particularly attractive would be sharing too much. Go forth and experiment. Have fun with it. You'll figure it out pretty fast. You can read my other post here about the kinds of problem spaces I'm looking at.


Ah, super secret problem domains that have been thoroughly represented in the LLM training data. Nice.


Why would you even comment that Codex CLI is potentially worth switching an enormous amount of spend over ($70k) and give literally 0 evidence of why it's better? That's all you've got? "Trust me bro"?


I'm seeing the downvotes. I'm sorry folks feel that way. I'm regretting my honesty.

Edit: I'd like to reply to this comment in particular but can't in a threaded reply, so will do that here: "Ah, super secret problem domains that have been thoroughly represented in the LLM training data. Nice."

This exhibits a fundamental misunderstanding of why coding agents powered by LLMs are such a game changer.

The assumption this poster is making is that LLMs are regurgitating whole cloth after being trained on whole cloth.

This is a common mistake among lay people and non-practitioners. The reality is that LLMs have gained the ability to program by learning from the code of others, much like a human would learn from the code of others and then be able to create a completely novel application.

The difference between a human programmer and an agentic coder is that the agent has much broader and deeper expertise across more programming languages, and understands more design patterns, more operating systems, more about programming history, and so on, and it uses all of this knowledge to fulfill the task you've set it. That's not possible for any single human.

It's important for the poster to take two realities on board. First, agentic coding agents are not regurgitating whole cloth from whole cloth; instead they are weaving new creations because they have learned how to program. Second, agentic coding agents have broader and deeper knowledge than any human that will ever exist, they never tire, and their mood and energy level never change. In fact, that improves continuously as the months go by and progress continues. This means that we, as individual practitioners or fast-moving teams, can create things that were never before possible for us without raising huge amounts of money, hiring large and very expensive teams, and then carrying the overhead of lining everyone up behind a goal AND dealing with the human issues that arise, including communication overhead.

This is a very exciting time. Especially if you're curious, energetic, and are willing to suspend disbelief to go and take a look.


I don't have any particular horse in this race, but looking at this exchange, I hope it's clear where the issue is coming from.

The original post states "I am seeing Codex do much better than Claude Code", and when asked for examples, you have replied with "I don't have time to give you examples, go do it yourself, it's obvious."

That is clearly going to rub folks (anyone) the wrong way. This refrain ("Where's the data?") pops up frequently on HN; if it's so obvious, giving one prompt where Codex is much better than Claude doesn't seem like a heavy lift.

In absence of such an example, or any data, folks have nothing to go on but skepticism. Replying with such a polarizing comment is bound to set folks off further.


We've all been hearing from people talking about how amazing AI coding agents are for a while now. Many skeptics have tried them out, looked into how to make good use of them, used modern agentic tools, done context engineering, etc. and found that they did not live up to the claims being made, at least for their problem domain.

Talk is cheap, and we're tired of hearing people tell us how it's enabling them to make incredible software without actually demonstrating it. Your words might be true, or they might be just another over-exaggeration to throw on the pile. Without details we have no way of knowing, and so many make the empirically supported choice.


I agree. It's pretty easy to put up or shut up.

I recently vibe-coded a video analysis pipeline with some related Arduino-driven machine control. The work was to prototype an experience on some 3D-printed hardware I've been skunking out.

By describing the pipeline and filters clearly, I had the analysis system generating useful JSON in an hour or so, including machine control simulation, all while watching TV and answering emails/Slacks. Notable misses were that the JSON fields were inconsistent and the Python venvs were inconsistent with the piped way I wanted the system to operate.

Small fixes.

Then I wired up the hardware, and the thing absolutely crapped itself: swapping libraries, trying major structural changes, and creating two whole new copies of the machine control host code (asking me each time along the way). This went on for more than three hours, with me debugging the mess for about 20 minutes before resorting to 1) ChatGPT, which didn't help, followed by 2) a few minutes of good old-fashioned googling on serial port behavior on the Mac, which, with an old Uno R3 that was sitting on the shelf, meant that I needed to use the cu.* ports instead of tty.*, something that Claude Code had buried deeply in a tangle of files.

Curious about the failure, I told Claude Code to stop being an idiot and use a web browser to research the problem of locking up specifically on the open operation. Thirty seconds later, and with some reflective swearing from Opus 4.1 (which I appreciate), I had the code I should have had three hours prior (along with other garbage code to clean up).
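For anyone who hits the same wall: on the Mac the tty.* device blocks on open while it waits for carrier detect, which a USB-serial Arduino typically never asserts, whereas the cu.* ("call-up") device opens immediately. Roughly the code I should have had up front looks like this (device path and baud rate are just illustrative; yours will differ):

    import glob
    import serial  # pyserial

    # On macOS, open the cu.* (call-up) device, not tty.*:
    # tty.* blocks on open waiting for carrier detect, which a USB-serial
    # Arduino (like the Uno R3 here) typically never asserts.
    ports = glob.glob("/dev/cu.usbmodem*")  # typical Uno path; illustrative
    if not ports:
        raise RuntimeError("no cu.usbmodem* device found")

    with serial.Serial(ports[0], 9600, timeout=2) as conn:
        conn.write(b"STATUS\n")  # placeholder machine-control command
        print(conn.readline().decode(errors="replace"))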

For my areas of sensing, computer vision, machine learning, etc., these systems are amazingly helpful if the algorithms can be completely and clearly described (e.g., Kalman filter to IoU, box blur followed by subsampling followed by split exponential filtering, etc.).
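To make that last chain concrete, it specs out to roughly the sketch below (with "split exponential" read as separate attack/decay smoothing constants; the kernel size, subsample step, and alpha values are placeholders):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def split_exp_filter(prev, new, alpha_up=0.5, alpha_down=0.05):
        # Asymmetric exponential smoothing: track increases quickly,
        # let decreases decay slowly.
        alpha = np.where(new > prev, alpha_up, alpha_down)
        return prev + alpha * (new - prev)

    def process(frames, k=5, step=2):
        """frames: iterable of 2-D grayscale arrays, e.g. decoded video."""
        state = None
        for frame in frames:
            blurred = uniform_filter(frame.astype(float), size=k)  # box blur
            reduced = blurred[::step, ::step]                      # subsample
            state = reduced if state is None else split_exp_filter(state, reduced)
        return state

    # Quick check on synthetic frames:
    demo = [np.random.rand(120, 160) for _ in range(10)]
    print(process(demo).shape)  # (60, 80)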

Attempts to let the robots work complex pipelines out for themselves haven’t gone as well for me.


I just had Claude Code convert all my personal projects over to be dockerized, then set up the deployment infra and scripts for everything, and finally move my server off of the nightmare nginx config file I was using.


Never regret honesty; it tends to lose its value completely if you only care about it when you have good news to deliver. If anything, regret the times when you didn't have something better appreciated to be honest about.

The easier threading-focused approach to the conversation might be to add the additional comment as an edit at the end of the original and reply to the child https://news.ycombinator.com/item?id=45649068 directly. Of course, I've broken the ability to do that by responding to you now about it ;).


Thanks. I wasn't able to reply in a thread earlier - I guess HN has a throttle on that. So I edited the comment above to add a few more thoughts. It's a very exciting time to be alive.


Just click on the time. Where yours says ‘two hours ago’ now, if you click on that you can reply directly to any sub comment in a thread.


lol, thanks.


You're getting downvoted because the amount of weight I place on your original comment is contingent on whether or not you're actually using AI to do meaningful work. Without clarifying what you're doing, it's impossible to distinguish you from one of those guys who says he's using AI to do tons of work, and then you peek under the hood and he's made like 15 markdown files and his code is a mess that doesn't do anything.

Well, that, and it’s just a bit annoying to claim that you’ve found some amazing new secret but that you refuse to share what the secret is. It doesn’t contribute to an interesting discussion whatsoever.


> I'm seeing the downvotes. I'm sorry folks feel that way. I'm regretting my honesty.

What honesty? We're not at the point of "the Godfather was a good/bad movie"; we're at "no, trust me, there's a really good movie called The Godfather".

Your honesty means nothing for an issue that isn't about taste or mostly about subjectivity. How useful AI is, and in what ways, is a technical discussion, and that's where the meat of the subject matter is. You've shared nothing on that front. I am not saying you have to, but obviously people are going to downvote you - not because they might agree or disagree, but because it contributes nothing different from every other AI hype man selling a course or something.


This is absurd. No one needs or wants your AI-generated answer that's a whole lot of nothing.


Comments like this reveal the magnitude of polarization around this issue in tech circles. Most people actually feel this kind of animosity towards AI, and so having comment threads like this even be visible on HN is unusual. Needless to say, all my comments here are handwritten. But the poster knows that, of course.



