Hacker News | lairv's comments

That's why I always disliked calling null the "billion dollar mistake": null and Option<T> are basically the same; the mistake is not checking it at compile time.
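A minimal sketch of that point in Python with a static checker like mypy (illustrative only; the function and names here are made up):

    from typing import Optional

    def find_user(user_id: int) -> Optional[str]:
        # May legitimately have no result; the type says so explicitly.
        return {1: "alice"}.get(user_id)

    name = find_user(2)
    # print(name.upper())   # a checker like mypy rejects this: None has no attribute "upper"
    if name is not None:
        print(name.upper())  # fine once the None case is handled

Either way the value can be absent; the difference is whether a checker forces you to handle that case before use.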


...and if everything were wrapped in Option<>.

If my grandmother had wheels, she'd be a bike.


Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely outside the training data.

Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says that the 3 fastest humans to solve this problem took 14min, 20min and 1h14min respectively.

Even though I expect this sort of problem to very much be in the distribution of what the model has been RL-tuned to do, it's wild that frontier models can now solve in minutes what would take me days.


I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s.

Sadly, the answer was wrong.

It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.

Still a useful tool though. It definitely gets the majority of the insights.

Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...


Terence Tao claims [0] contributions by the public are counter-productive since the energy required to check a contribution outweighs its benefit:

> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking

Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematician whose net contributions are actually negative, despite being impressive some of the time.

[0] https://www.youtube.com/watch?v=HUkBz-cdB-k&t=2h59m33s


It depends a lot on the ratios here. There's a fast flip between "interesting but useless" and "useful" once the tradeoff tips.

How fast can you check the contribution? How small of a part is it? An unsolicited contribution is different from one you immediately directed. Do you need to reply? How fast are followups? Multi-day back-and-forths are a pain; a fast directed chat is different. You don't have to worry about being rude to an LLM.

Then it comes down to how smart a frontier model is vs the people who write to mathematicians. The latter group will be filled with both smart, helpful people and cranks.


Unlike the general public, the models can be trained. I mean, if you train a member of the general public, you've got a specialist, who is no longer a member of the general public.


Unlike the general public though, these models have advanced dementia when it comes to learning from corrections, even within a single session. They keep regressing and I haven't found a way to stop that yet.

What boggles the mind: we have striven for correctness for so long, and suddenly being right 70% of the time and wrong the remaining 30% is fine. The parallel with self-driving is pretty strong here: solving 70% of the cases is easy; the remaining 30% are hard or maybe even impossible. Statistically speaking these models do better than most humans, most of the time. But they do not do better than all humans, they can't do it all of the time, and when they get it wrong they make such tremendously basic mistakes that you have to wonder how they manage to get things right at all.

Maybe it's true that with ever increasing model size and more and more data (proprietary data; the public sources are exhausted by now, so private data is the frontier where model owners can still gain an edge) we will reach a point where the models will be right 98% of the time or more, but what would be the killer feature for me is an indication of the confidence level of the output. Because no matter whether it's junk or pearls, it all looks the same, and that is more dangerous than having nothing at all.


A common resistor has a +/- 10% tolerance. A milspec one is 1%. Yet we have ways of building robust systems using such “subpar” components. The trick is to structure the system in a way that builds the error rate into the process and corrects for it. Easier said than done of course for a lot of problems but we do have techniques for doing this and we are learning more.
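A toy sketch of that structure in code (the unreliable_step and verify functions here are made up, just to show the error rate being absorbed by the process rather than trusted away):

    import random

    def unreliable_step(x):
        # Stand-in for a component that is right ~70% of the time.
        return x * 2 if random.random() < 0.7 else x * 2 + 1

    def verify(x, result):
        # Independent, cheaper check of the result.
        return result == x * 2

    def robust_step(x, attempts=5):
        # Build the error rate into the process: retry until a result
        # passes verification, and fail loudly otherwise.
        for _ in range(attempts):
            result = unreliable_step(x)
            if verify(x, result):
                return result
        raise RuntimeError("no verified result after retries")

    print(robust_step(21))  # 42, with failure probability ~0.3**5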


I think the real killer feature would be that they stop making basic mistakes, and that they gain some introspection. It's not a problem if they're wrong 30% of the time if they're able to gauge their own confidence like a human would. Then you can know to disregard the answer, or check it more thoroughly.


> It's not a problem if they're wrong 30% of the time if they're able to gauge their own confidence like a human would.

This is a case where I would not use human performance as the standard to beat. Training people to be both intellectually honest and statistically calibrated is really hard.


Perhaps, but an AI that can only answer like a precocious child who's spent years reading encyclopedias but has not learned to detect when it's thinking poorly or not remembering clearly is much less useful.


> the killer feature for me is an indication of the confidence level of the output.

I don't think I did anything special to ChatGPT to get it to do this, but it's started reporting confidence levels to me, eg from my most recent chat:

> In China: you could find BEVs that cost same or even less than ICE equivalents in that size band. (Confidence ~0.70)


I would counter that any computationally correct code that accelerates any existing research code base is a net positive. I don't care how that is achieved as long as it doesn't sacrifice accuracy and precision.

We're not exactly swimming in power generation and efficient code uses less power.


It's perhaps practical, though, to ask it to do a lot of verification and demonstration of correctness in Lean or another proof environment, both to get its error rate down and to speed up the review of its results. After all, its time is close to "free."
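As a trivial illustration of why that speeds up review: a machine-checked statement shifts the reviewer's job from re-deriving the argument to agreeing that the statement is the right one (a toy Lean 4 example, not tied to any particular result):

    -- Lean 4: accepted only if the kernel verifies the proof term.
    example : 2 + 2 = 4 := rfl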


But he actually uses frontier LLMs in his own work. Probably that's stronger evidence.


It is, but biased evidence, as he's both directing and checking that frontier LLM output, and not everyone is Terence Tao.


> It also returned 8 "sources"

well, there's your problem. it behaves like a search summary tool and not like a problem solver if you enable google search


Exactly this - and it's how ChatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.


The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.


I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild.


On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.


After ChatGPT accidentally indexed everyone's shared chats (and had a cache collision in their chat history early on) and Meta built a UI flow that filled a public feed full of super private chats... seems like a good move to use a battle-tested permission system.


Imagine the metrics though. "This quarter we've had a 12% increase in people using AI solutions in their Google Drive."


Google Drive is one of the bigger offenders in GSuite when it comes to “metrics-driven user-hostile changes”, and Google Meet is one of its peers.


In The Wire they asked Bunny to "juke the stats" - and he was having none of that.


Not a chance I'll ever click 'ok'. I'd love to be able to opt out of anything AI-related near my Google environment.


To clarify, the message above is what I got after giving it Google Drive access.


Not really, that's just basic access control. If you've used Colab or Cloud Shell (or even just Google Cloud in general, given the need to explicitly allow the usage of each service), it's not surprising at all.


Why does AI studio need access to my drive in order to run someone else's prompt? It's not a prompt for authentication with my Google account. I'm already signed in. It's prompting for what appears to be full read/write access to my drive account. No thanks.


Why is this sad? You should be rooting for these LLMs to be as bad as possible.


If we've learned anything so far it's that the parlor tricks of one-shot efficacy only get you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart in roughly the same way. Even when I've used Sonnet 4.5 with 1M token context the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these claims are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.


Same. I've been needing to update a userscript (JS) that takes stuff like "3 for the price of 1", "5 + 1 free", "35% discount!" from a particular site and then converts the price to a % discount and the price per item / 250 grams.

It's an old userscript so it is glitchy and only halfway works. I already pre-chewed the work by telling Gemini 3 exactly which new HTML elements it needs to match and which contents it needs to parse. So basically, the scaffolding is already there, the sources are already there, it just needs to put everything in place.

It fails miserably and produces very convincing-looking but failing code. Even letting it iterate multiple times does nothing, nor does nudging it in the correct direction. Mind you that JavaScript is probably the most trained-on language together with Python, and parsing HTML is one of the most common use cases.
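For reference, the core of the task is little more than this kind of string munging (sketched here in Python rather than the userscript's JS, using the promo formats quoted above):

    import re

    def promo_to_discount(text):
        # Return the effective discount fraction for a promo string, or None.
        text = text.lower()
        if m := re.search(r"(\d+)\s*%\s*discount", text):              # "35% discount!"
            return int(m.group(1)) / 100
        if m := re.search(r"(\d+)\s*for the price of\s*(\d+)", text):  # "3 for the price of 1"
            total, paid = int(m.group(1)), int(m.group(2))
            return 1 - paid / total
        if m := re.search(r"(\d+)\s*\+\s*(\d+)\s*free", text):         # "5 + 1 free"
            paid, free = int(m.group(1)), int(m.group(2))
            return free / (paid + free)
        return None

    print(promo_to_discount("3 for the price of 1"))  # ~0.667
    print(promo_to_discount("35% discount!"))         # 0.35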

Another hilarious example is MPV, which has very well-documented settings. I used to think that LLMs would mean you can just tell people to ask Gemini how to configure it, but 9 out of 10 times it will hallucinate a bunch of parameters that never existed.

It gives me an extremely weird feeling when other people are cheering that it is solving problems at superhuman speeds or that it coded a way to ingest their custom XML format in record time, with relatively little prompting. It seems almost impossible that LLMs can both be so bad and so good at the same time, so what gives?


1. Coding with LLMs seems to be all about context management. Getting the LLM to deal with the minimum amount of code needed to fix the problem or build the feature, carefully managing token limits and artificially resetting the session when needed so the context handover is managed, all that. Just pointing an LLM at a large code base and expecting good things doesn't work.

2. I've found the same with Gemini; I can rarely get it to actually do useful things. I have tried many times, but it just underperforms compared to the other mainstream LLMs. Other people have different experiences, though, so I suspect I'm holding it wrong.


The problem is that by that point it's much less useful in projects. I still like them, but when I get to the point of telling it exactly what to do I'm mostly just being lazy. It's useful in that it might give me some ideas I didn't consider, but I'm not sure it's saving time.

Of course, for short one-off scripts, it's amazing. It's also really good at preliminary code reviews. Although if you have some awkward bits due to things outside of your power it'll always complain about them and insist they are wrong and that it can be so much easier if you just do it the naive way.

Amazon's Kiro IDE seems to have a really good flow, trying to split large projects into bite-sized chunks. I, sadly, couldn't even get it to implement solitaire correctly, but the idea sounds good. Agents also seem to help a lot since they can just do things by trial and error, but company policy understandably gets complicated quickly if you want to provide the entire repo to an LLM agent and run 'user approved' commands it suggests.


From my experience vibe coding, you spend a lot of time preparing documentation and baseline context for the LLM.

On one of my projects, I downloaded a library’s source code locally, and asked Claude to write up a markdown file documenting how to use it, with examples, etc.

Like, taking your example of solitaire, I’d ask an LLM to write the rules into a markdown file and tell the coding one to refer to those rules.

I understand it to be a bit like mise en place for cooking.


It's kind of what Kiro does.

You tell it what you want and it gives you a list of requirements, which are in that case mostly the rules for Solitaire.

You adjust those until you're happy, then you let it generate tasks, which are essentially epics with smaller tickets in order of dependency.

You approve those and then it starts developing task by task where you can intervene at any time if it starts going off track.

The requirements and tasks it does really well, but the connection of the epics/larger tasks is where it mostly crumbles. I could have made it work with some more messing around, but I've noticed over a couple of projects that, at least in my tries, it always crumbles either at the connection of the epics/large tasks or when you ask it to do a small modification later down the line and it causes a lot of smaller, subtle changes all over the place. (You could say skill issue since I overlooked something in the requirements, but that's kind of how real projects go, so..)

It also eats tokens like crazy for private usage, but that's more of a 'playing around' problem. As it stands I'll probably blow $100 a day if I connect it to an actual commercial repo and start experimenting. Still viable with my salary, but still..


Honestly, in my biased, unscientific testing, what gives is that Gemini isn't actually all that good. I mean, it's fine, but it's not actually good.


> documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.

This is mostly because HA changes so frequently and the documentation is sparse. To get around this and increase the rate of correct answers, I give it access to the source code of the same version I'm running. Then I put instructions in CLAUDE.md on where to find the source and that it must use the source code.

This fixes 99% of my issues.


For this issue, additional Media Player storage locations, the configuration is actually quite old.

It does showcase that LLMs don't truly "think" when they're not even able to search for and find the things mentioned. But even then, this configuration has been stable for years and the training data should have plenty of mentions.


Feel like sharing that prompt? I have a feeling that the phrasing on the "must use source code" part needs to be just right.


It's not really magic: in my project folder I will git clone the source code of whatever I'm working on. I will put something in the local md file like:

Use ./home-assistant/core for the source code of home assistant, its the same version that I'm running. Always search and reference the source when debugging a problem.

I also have it frequently do deep dives into source code on a particular problem and write a detailed md file so it only needs to do that once.

"Deep dive into this code, find everything you can find about automations and then write a detailed analysis doc with working examples and source code, use the source code."


I am so glad that I asked! The deep dive method will actually help me on my current project. Thank you.


Note to self: strategy to defeat the terminator


It depends on your definition of safe. Most of the code that gets written is pretty simple: basic CRUD web apps, WP theme customization, simple mobile games… stuff that can easily get written by the current gen of tooling. That has already cost a lot of people a lot of money or jobs outright, and most of them probably haven't reached their skill limit as developers.

As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is upskill or get a new career. Rather than switch careers to something they have zero experience in, most will upskill, and that puts pressure on the moderately higher-skill job market, which has far fewer people; those people then start to upskill to outrun the implosion, which puts pressure on the tier above them, and so on. With even modest productivity gains in the whole industry, it's not hard for me to envision a world where general software development just isn't a particularly valuable skill anymore.


Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.

The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need an actual throat to choke to show your leadership skills.


I absolutely don't think vibe coding or barely supervised agents will replace coders, like outsourcing claimed to, and in some cases did and still does. And outsourcing absolutely affected the job market. If the whole thing does improve and doesn't turn out to be too wildly unprofitable to survive, what it will do is allow good-quality coders (people who understand what can and can't go without being heavily scrutinized) to do a lot more work. That is a totally different force than outsourcing, which to some extent assumed software developers were all basically fungible code monkeys at some level.


There's a lot to unpack here. I agree - outsourcing did affect the job market. You're just seeing the negative (US) side. If anything outsourcing was hugely beneficial to the Indian market where most of those contracts landed. My point was that it was sold as a solution that didn't net the value proposition it claimed. And that is why I've said AI is not immune to being cyclical, just like outsourcing. AI is being sold as worker replacement. It's not even close, and if it were then OpenAI, Anthropic and Google would have all replaced a lot of people and wouldn't be allowing you and me to use their tool for $20/month. When it does get that good we will no longer be able to afford using these "enterprise" tools.

With respect to profitability - there's none in sight. When JP Morgan [0] is saying that $650B in annual revenue is needed to make a paltry 10% on investment, there is no way any sane financial institution would pump more money into that sunk cost. Yet, here we are building billions of dollars in datacenters for what... Mediocre chat bots? Again, these things don't think. They don't reason. They're massive word graphs being used in clever ways with cute, humanizing descriptions. Are they useful for helping a human parse way more information than we can reason about at once? For sure! But that's not worth trillions in investment and won't yield multiples of the input. In fact I'd argue the AI landscape would be much better off if the dollars stopped flowing, because that would mean real research would need to be done in a much more efficient and effective manner. Instead we're paying individual people hundreds of millions of dollars who, and good for them, have no clue or care about what actually happens with AI because: money in the bank. No, AI in its current form is not profitable, and it's not going to be if we continue down this path. We've literally spent world-changing sums of money on models that are used to create art that will displace the original creators well before they solve any level of useful world problems.

Finally, and to your last point: "...good quality coders...". How long do you think that will be a thing with respect to how this is all unfolding? Am I writing better code (I'm not a programmer by day) with LLMs? Yes and no. Yes when I need to build a visually appealing UI for something. And yes when it comes to a framework. But what I've found is if I don't put all of the right pieces in the right places before I start I end up with an untenable mess into the first couple thousand lines of that code. So if people stop becoming "good quality programmers" then what? These models only get better with better training data and the web will continue to go insular against these IP stealing efforts. The data isn't free, it never has been. And this is why we're now hearing the trope of "world models". A way to ask for trillions more to provide millionths of a penny on the invested dollar.

[0] https://www.tomshardware.com/tech-industry/artificial-intell...


That ship has sailed long ago.

I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.


Confirmed less wrong psyop victim


Generally, any expert hopes their tool/paintbrush/etc is as performant as possible.


And in general I'm all for increasing productivity, in all areas of the economy.


To what goal?


To increase livings standards for the people.


How do you reconcile that with people getting fired because they're replaced with technology?


See https://en.wikipedia.org/wiki/Lump_of_labour_fallacy

Why does this need any reconciliation? That's working as expected: when productivity improves in some sectors, we don't need as much labour there as before, and thus it needs to be shuffled around. This can have all kinds of knock-on effects.

As long as the central bank is doing at least a halfway competent job, overall unemployment will stay low and stable. Ideally, you have people quit for a new job instead of getting fired, but in the grand scheme of things it doesn't make too much of a difference, as long as in aggregate they find new jobs.

An interesting example is furnished by the US between early 2006 and late 2007: hundreds of thousands of people left employment in construction, and during that same period, the overall US unemployment rate stayed remarkably flat (hovering around 4.5% to 4.7%). The US economy was robust enough to handle a housing construction bust.

(Of course, after this was all done and dusted, some people declared that house prices were too high and the public demanded that they be brought down. So obligingly in 2008 the Fed engineered a recession that accomplished exactly that..)


> As long as the central bank is doing at least a halfway competent job, overall unemployment will stay low and stable. Ideally, you have people quit for a new job instead of getting fired, but in the grand scheme of things it doesn't make too much of a difference, as long as in aggregate they find new jobs.

Two big ifs: the central bank is competent enough and people find new jobs.

Don't get me wrong: I am for progress and technological innovation. That's why we're working, to make our lives easier. But progress needs to be balanced, so that the changes it brings are properly absorbed by society.


> Two big ifs: the central bank is competent enough and people find new jobs.

That's only one 'if'. Well, the second, 'people finding jobs', is a given if you have a half-way competent central bank and regulations even slightly less insane than South Africa's.

But let's worry about technological unemployment once we actually see it. So far it has been elusive. (Even in South Africa, it's not technology but their own boneheaded policies that drive the sky-high unemployment. They ain't technologically more advanced than the rest of the world.)


How do you know we're not seeing technological unemployment? There have been quite a few layoffs, some of them attributed directly or indirectly to "AI".

Second, there are far fewer junior jobs in software development, again attributed to the advance of AI.


> As long as the central bank is doing at least a halfway competent job, overall unemployment will stay low and stable.

That’s... not at all a valid generalization. There’s all kinds of things that other actors can do to throw things too out of whack for the monetary policy tools typically available to central banks to be sufficient to keep things sailing nicely. One big danger here is bad action (or inaction in the face of exogenous crisis) by the main body of the government itself.


You'd think so, yes. But outside of wars, recessions caused by 'real' factors are surprisingly rare. They are almost all caused by central bank 'nominal' incompetence. (I'm using 'nominal' and 'real' here in the sense of 'to do with the number of zeros on your banknotes' vs 'actual goods and services and other real stuff in the economy'.)

One rare counter-example was perhaps Covid, where we had a real issue cause a recession.

That's not to say that real issues don't cause problems. Far from it! They just don't cause a recession, if the central bank is alert. The prototypical example is perhaps the UK economy after the Brexit referendum in 2016:

The leave vote winning was a shock to the British economy, but the Bank of England wisely let the Pound exchange rate take the hit, instead of tanking the economy trying to defend the exchange rate. As a result, British GDP (as eg measured in Euro) immediately shrank by a few percent and the expected path of future real GDP also shrank; but crucially: there was no recession nor its associated surge in unemployment.

For another example have a look at Russia in the last few years. Thanks to the very competent hands of Elvira Nabiullina at the Bank of Russia, the Russian economy has perhaps been creaking under the strain of war and sanctions but has not slid into recession.

Summary: real issues cause problems for the economy, but they don't have to cause a recession, if the central bank is alert. (That's in economies with a central bank. Central banks are actually more of an arsonist than a firefighter here.)


Star Trek communism.

There are two separate issues here: whether tech itself is bad, and whether the way it is deployed is bad. Better AI is, in principle, the kind of tech that can massively change the world for the better. In practice it is being deployed to maximize profits because that's what we chose to incentivize in our society above everything else, but the problem is obviously the incentives (and the people that they enable), not the tech itself.


Profit is fine. It's how society tells you that your customers value what you are producing more than it costs you to produce (after paying suppliers and workers etc). That's how you avoid the massive misallocations of Soviet communism.

(Well, the Soviets did have one sector that performed reasonably well, and that's partially because they set plenty of decent incentives there: weapons production and the military.)

Now you could say that the 'wrong' activities are profitable. I agree, and I am all for eg CO2 taxes, or making taxes on equity financing cheaper than those on debt and deposits (to incentivise companies, especially banks, to rely more on stocks than on debt, to decrease brittle leverage in the economy); or lowering subsidies for meat production or for burning food instead of eating it, etc.


Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.


How is it useful other than for people making money off token output? Continue to fry your brain.


They’re fantastic learning tools, for a start. What you get out of them is proportional to what you put in.

You’ve probably heard of the Luddites, the group who destroyed textile mills in the early 1800s. If not: https://en.wikipedia.org/wiki/Luddite

Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.

A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.

You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.


Arguably the Luddites don't get a bad enough rep. The lump of labour fallacy was as bad then as it is now or at any other time.

https://en.wikipedia.org/wiki/Lump_of_labour_fallacy


Again, that may at least in part be a function of how history was written. The Luddite wikipedia link includes this:

> Malcolm L. Thomas argued in his 1970 history “The Luddites” that machine-breaking was one of the very few tactics that workers could use to increase pressure on employers, undermine lower-paid competing workers, and create solidarity among workers. "These attacks on machines did not imply any necessary hostility to machinery as such; machinery was just a conveniently exposed target against which an attack could be made."[10] Historian Eric Hobsbawm has called their machine wrecking "collective bargaining by riot", which had been a tactic used in Britain since the Restoration because manufactories were scattered throughout the country, and that made it impractical to hold large-scale strikes.

Of course, there would have been people who just saw it as striking back at the machines, and leaders who took advantage of that tendency, but the point is it probably wasn’t as simple as the popular accounts suggest.

Also, there’s a kind of corollary to the lump of labor fallacy, which is arguably a big reason the US is facing such a significant political upheaval today: when you disturb the labor status quo, it takes time - potentially even generations - for the economy to adjust and adapt, and many people can end up relatively worse off as a result. Most US factory workers and miners didn’t end up with good service industry jobs, for example.

Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.


> [...] undermine lower-paid competing workers, and create solidarity among workers.

Nice 'solidarity' there!

> Most US factory workers and miners didn’t end up with good service industry jobs, for example.

Which people are you talking about? More specifically, when?

As long as overall unemployment stays low and the economy keeps growing, I don't see much of a problem. Even if you tried to keep everything exactly as is, you'll always have some people who do better and some who do worse; even if just from random chance. It's hard to blame that on change.

See eg how the draw down of the domestic construction industry around 2007 was handled: construction employment fell over time, but overall unemployment was low and flat. Indicating an orderly shuffling around of workers from construction into the wider economy. (As a bonus point, contrast with how the Fed unnecessarily tanked the wider economy a few months after this re-allocation of labour had already finished.)

> Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.

It would help immensely, if the Fed were more competent in preventing recessions. Nominal GDP level targeting would help to keep overall spending in the economy on track.


The Fed is capable of doing no such thing. They can soften or delay recessions by socializing mistakes and redistributing wealth using interest rates, but an absence of recessions would imply perfect market participants.


> [...] but an absence of recessions would imply perfect market participants.

No, not at all. What makes you think so? Israel (and to a lesser extent Australia) managed to skip the Great Recession on account of having competent central banks. But they didn't have any more 'perfect' market participants than any other economy.

Russia, of all places, also shows right now what a competent central bank can do for your economy---the real situation is absolutely awful on account of the 'special military operation' and the sanctions both financial and kinetic. See https://en.wikipedia.org/wiki/Elvira_Nabiullina for the woman at the helm.

See also how after the Brexit referendum the Bank of England wisely let the Pound exchange rate take the hit---instead of tanking the real economy trying to defend the exchange rate.

> They can soften or delay recessions by socializing mistakes and redistributing wealth using interest rates, [...]

Btw, not all central banks even use interest rates for their policies.

You are right that the central banks are sometimes involved in bail outs, but just as often it's the treasury and other more 'fiscal' parts of the government. I don't like 'Too big to fail' either. Keeping total nominal spending on a stable path would help ease the temptation to bail out.


Today, we have found better ways to prevent machines from crushing children, e.g., more regulation through democracy.


Obviously, some people loved that machines crushed kids in the past. Looks like they hope it happens again...


are you pretending to be confused?


I see millions of kids cheating on their schoolwork, and many adults outsourcing reading and thinking to GPUs. There's like 0.001% of people that use them to learn responsibly. You are genuinely a fool.


Hey, I wrote a long response to your other reply to me, but your comment seems to have been flagged so I can no longer reply there. Since I took the time to write that, I'm posting it here.

I'm glad I was able to inspire a new username for you. But aren't you concerned that if you let other people influence you like that, you're frying your brain? Shouldn't everything originate in your own mind?

> They don't provide any value except to a very small percentage of the population who safely use them to learn

There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that? Ban those things? Post exclamation-filled comments exhorting people not to use them? This comes back to what I said at the end of my previous comment:

You might want to make sure you understand what you’re trying to achieve.

Do you know the answer to that?

> A language model is not the same as a convolution neural network finding anomalies on medical imagining.

Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?

The last paragraph of your other comment was literally the Luddite argument. (Sorry I can't quote it now.) Do you know how to weave cloth? No? Your brain is fried!

The world changes, and I find it more interesting and challenging to change with it, than to fight to maintain some arbitrary status quo. To quote Ghost in the Shell:

All things change in a dynamic environment. Your effort to remain what you are is what limits you.

For me, it's not about "getting ahead" as you put it. It's about enjoying my work, learning new things. I work in software development because I enjoy it. LLMs have opened up new possibilities for me. In that 5 year future you mentioned, I'm going to have learned a lot of things that someone not using LLMs will not have.

As for being dependent on Altman et al., you can easily go out and buy a machine that will allow you to run decent models yourself. A Mac, a Framework desktop, any number of mini PCs with some kind of unified memory. The real dependence is on the training of the models, not running them. And if that becomes less accessible, and new open weight models stop being released, the open weight models we have now won't disappear, and aren't going to get any worse for things like coding or searching the web.

> Keep falling for lesswrong bs.

Good grief. Lesswrong is one of the most misleadingly named groups around, and their abuse of the word "rational" would be hilarious if it weren't sad. In any case, Yudkowsky advocated being ready to nuke data centers, in a national publication. I'm not particularly aware of their position on the utility of AI, because I don't follow any of that.

What I'm describing to you is based on my own experience, from the enrichment I've experienced from having used LLMs for the past couple of years. Over time, I suspect that kind of constructive and productive usage will spread to more people.


Out of respect for the time you put into your response, I will try to respond in good faith.

> There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that?

---There are many things in our society that are useful only to a small percentage of the population and that I would like to ban, or at least heavily regulate. Guns for example. A more extreme example would be cars. Many people drive 5 blocks when they could walk, to their (and everyone else's) detriment. Forget the climate, it impacts everyone (brake dust, fumes, pedestrian deaths). Some cities create very expensive tolls / parking fees to prevent this; this angers most people and is seen as irrational by the masses but is necessary and not done enough. Open free societies are a scam told to us by capitalists that want to exploit without any consequences.

--- I want to air-gap all computers in classrooms. I want students to be expelled for using LLMs to do assignments, as they would have been previously for plagiarism (that's all an llm is, a plagiarism laundering machine).

---During COVID there was a phenomenon where some children did not learn to speak until they were 4-5 years old, and some of those children were even diagnosed with autism. In reality, we didn't understand fully how children learned to speak, and didn't understand the importance of the young brain's need to subconsciously process people's facial expressions. It was Masks!!! (I am not making a statement on masks fyi) We are already observing unpredictable effects that LLMs have on the brain and I believe we will see similar negative consequences on the young mind if we take away the struggle to read, think and process information. Hell I already see the effects on myself, and I'm middle aged!

> Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?

--- I'm okay with technology replacing a radiologist!!! Just like I'm okay with a worker being replaced in an unsafe textile factory! The stakes are higher in both of these cases, and obviously in the best interest of society as a whole. The same cannot be said for a machine that helps some people learn while making the rest dependent on it. It's the opposite of a great equalizer; it will lead to a huge gap in inequality for many different reasons.

We can all say we think this will be better for learning, that remains to be seen. I don't really want to run a worldwide experiment on a generation of children so tech companies can make a trillion dollars, but here we are. Didn't we learn our lesson with social media/porn?

If Ubers were subsidized and cost only $20.00 a month for unlimited rides, could people be trusted to only use them when it was reasonable, or would they be taking Ubers to go 5 blocks, increasing the risk for pedestrians and deteriorating their own health? They would use them in an irresponsible way.

If there was an unlimited pizza machine that cost $20.00 a month to create unlimited food, people would see that as a miracle! It would greatly benefit the percentage of the population that is food insecure, but could they be trusted to not eat themselves into obesity after getting their fill? I don't think so. The affordability of food, and the access to it has a direct correlation to obesity.

Both of these scenarios look great on the surface but are terrible for society in the long run.

I could go on and on about the moral hazards of LLMs; there are many more outside of just the dangers to learning and labor. We are being told they are game changing by the people who profit off them.

In the past, empires bet their entire kingdoms on the words of astronomers and magicians who said they could predict the future. I really don't see how the people running AI companies are any different from those astronomers (they even say they can predict the future LOL!)

They are Dunning Kruger plagiarism laundering machines as I see it. Text extruding machines that are controlled by a cabal of tech billionaires who have proven time and time again they do not have societies best interest at heart.

I really hope this message is allowed to send!


Just replying that I read your post, and don't disagree with some of what you wrote, and I'm glad there are some people that peacefully/respectfully push back (because balance is good).

However, I don't agree that AI is a risk to the extreme levels you seem to think it is. The truth is that humans have advanced by use of technology since the first tool, and we are horrible predictors of what the use of these technologies will bring.

So far they have been mostly positive, and I don't see a long-term difference here.


The kids went out and found the “cheating engines” for themselves. There was no plot from Big Tech, and believe me academia does not like them either.

They have, believe it or not, very little power to stop kids from choosing to use cheating engines on their personal laptops. Universities are not Enterprise.


They're just exploiting a bug in the educational system: instead of testing whether students know things, we test whether they can produce a product that implies they know things. We don't interrogate them in person with questions to see if they understand the topic; we give them multiple-choice questions that can be marked automatically to save time.


Ok, so there’s a clear pattern emerging here, which is that you think we should do much more to manage our use of technology. An interesting example of that is the Amish. While they take it to what can seem like an extreme, they’re doing exactly what you’re getting at, just perhaps to a different degree.

The problem with such approaches is that it involves some people imposing their opinions on others, “for their own good”. That kind of thing often doesn’t turn out well. The Amish address that by letting their children leave to experience the outside world, so that their return is (arguably) voluntary - they have an opportunity to consent to the Amish social contract.

But what you seem to be doing is making a determination of what’s good for society as a whole, and then because you have no way to effect that, you argue against the tools that we might abuse rather than the tendencies people have to abuse them. It seems misplaced to me. I’m not saying there are no societal dangers from LLMs, or problems with the technocrats and capitalists running it all, but we’re not going to successfully address those issues by attacking the tools, or people who are using them effectively.

> In the past, empires bet their entire kingdom's on the words of astronomers and magicians who said they could predict the future.

You’re trying to predict the future as well, quite pessimistically at that.

I don’t pretend to be able to predict the future, but I do have a certain amount of trust in the ability of people to adapt to change.

> that's all an llm is, a plagiarism laundering machine

That’s a possible application, but it’s certainly not all they are. If you genuinely believe that’s all they are, then I don’t think you have a good understanding of them, and it could explain some of our difference in perspective.

One of the important features of LLMs is transfer learning: their ability to apply their training to problems that were not directly in their training set. Writing code is a good example of this: you can use LLMs to successfully write novel programs. There’s no plagiarism involved.


Hmm, so I read this today. By happenstance someone sent it to me, and it applies aptly to our conversation. It made me think a little differently about your argument and the Luddite persuasion altogether. And why we shouldn't call people Luddites (in a negative connotation)!!

https://archive.nytimes.com/www.nytimes.com/books/97/05/18/r...


> You should be rooting for these LLMs to be as bad as possible.

Why?


To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.

But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.


I’d love it if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.

I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.


I'm not explaining myself right.

Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.

Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.

I don't predict the future and I'm very skeptical of anybody who claims to do so, correctly predicting the present is already hard enough, I'm just saying that given the progress we've already made I would find plausible that a system like that could be made in a few years. The details of what it would look like are beyond my pay grade.

---

[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.


Yeah, it is often flagged as a brilliancy in game analysis if a GM makes a move that an engine says is bad but turns out to be good. However, it only happens in very specific positions.


Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?


Both, but perhaps more often neither.

From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".

So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.

That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.


This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.

Poker is funny because you have humans emulating human-beating machines, but that's hard enough to do that players who don't do this also win.


I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.


It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare these days. It was funny in certain positions a few years back to see the engine "change its mind", including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.

But mostly what happens is that a move isn't so good, but it isn't so bad either: the computer will tell you it is sub-optimal, but a human won't be able to refute it in finite time, and his practical (as opposed to theoretical) chances are reduced. One great recent example of that is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception, a move that the computer says is borderline incorrect, but it leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.


It can be either one. In closed positions, it is often the latter.


It's only the latter if it's a weak browser engine, and it's early enough in the game that the player had studied the position with a cloud engine.


> Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good.

Do you have any links? I haven't seen any such (forget GM, not even Magnus), barring the opponent making mistakes.


Here’s a chess stackexchange of positions that stump engines

https://chess.stackexchange.com/questions/29716/positions-th...

It basically comes down to “ideas that are rare enough that they were never programmed into a chess engine”.

Blockades or positions where no progress is possible are a common theme. Engines will often keep tree searching where a human sees an obvious repeating pattern.

Here’s also an example where 2 engines are playing, and DeepMind's engine finds a move that I think would be obvious to most grandmasters, yet Stockfish misses it https://youtu.be/lFXJWPhDsSY?si=zaLQR6sWdEJBMbIO

That being said, I’m not sure that this necessarily correlates with brilliancy. There are a few of these that I would probably get in classical time and I’m not a particularly brilliant player.


Stockfish totally dropped hand crafted evaluations in 2023.


It’s still the case that the evaluation model hasn’t seen enough examples of a blockade to be able to understand it as far as I can tell. Some very simple ones it can (in fact I’ve seen stockfish/alpha-zero execute quite clever blockades before). But there’s still a gap where humans understand them better.


It used to happen way more often with Magnus and classical versions of Stockfish from pre Alpha Zero/Leela Zero days. Since NN Stockfish I don't think it happens anymore.


Maybe he means not the best move but an almost equally strong move?

Because ya, that doesn't happen lol.


I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).

That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)


The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.

If you look on Youtube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.


> I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning.

If you want to see this against someone like Magnus, it is rare as super GMs do not spend a lot of time playing engines publicly.

But if you want to see them against a normal chess master, somewhere between master and international master, it is everywhere. E.g. this guy analyses his every match afterwards and you frequently hear "oh I would never see that line":

https://www.youtube.com/playlist?list=PLp7SLTJhX1u6zKT5IfRVm...

(start watching around 1000+ for frequently seeing those moments)


I recommend Matthew Sadler's Game Changer and The Silicon Road To Chess Improvement.


You explained yourself right. The issue is that you keep qualifying your statements.

> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...

> ... in tournament conditions.

I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?


I can wreck stockfish in chess boxing. Mostly because stockfish can't box, and it's easy for me to knock over a computer.


If it runs on a mainframe you would lose both the chess and the boxing.


Are there really boxing capable mainframes nowadays?

Otherwise I think the mainframe would lose because of being too passive


The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.


How are they faster? I don’t think any Elo report actually comes from participating in a live coding contest on previously unseen problems.


My background is more in math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.

Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.


Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low

I had had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.


ChatGPT solves this problem now as well with 5.1. Time for a new litmus test.


Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970


I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3), and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
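
For anyone curious what the numerical part looks like, here is a generic 4th-order Runge-Kutta sketch (illustrative only; the actual ODE from problem 970 isn't reproduced here, the example just integrates dy/dt = -y):

    import math

    def rk4(f, t0, y0, t1, steps=10_000):
        # Classic RK4: advance y from t0 to t1 in fixed steps of size h.
        h = (t1 - t0) / steps
        t, y = t0, y0
        for _ in range(steps):
            k1 = f(t, y)
            k2 = f(t + h / 2, y + h / 2 * k1)
            k3 = f(t + h / 2, y + h / 2 * k2)
            k4 = f(t + h, y + h * k3)
            y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            t += h
        return y

    # Toy check: dy/dt = -y, y(0) = 1 has exact solution exp(-t).
    print(rk4(lambda t, y: -y, 0.0, 1.0, 1.0), math.exp(-1.0))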


Was the explanation correct?


He has no idea because he's unfamiliar with the background.


I didn't understand the background before the explanation, but afterwards I did. It walked me through the mathematical steps, and each was logical and easy to follow if you have basic calculus knowledge.


It was. I asked it to give more details on parts of the derivation I didn't quite follow and it did that. Overall it was able to build from the ground up to the solution and solve it both numerically and analytically (for smaller values of x)


I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p


Is there a solution to this exact problem online, or just to related notions (the renewal equation etc.)? Anyway, it seems like nothing beats training on the test.


Are you sure it did not retrieve the answer using websearch?


gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.


Did it search the web?


Yeah, LLMs used to not be up to par for new Project Euler problems, but GPT-5 was able to do a few of the recent ones which I tried a few weeks ago.


Does it matter if it is out of the training data? The models integrate web search quite well.

What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.

They would surely add the latest Euler problems with solutions in order to show off in benchmarks.


You can disable search.

Just create a different problem if you don't believe it.


I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.


Is this a problem for which the (human) solution is well documented and known, and was learned during the training phase? Or is it a novel problem?

I personally think anthropomorphizing LLMs is a bad idea.


It definitely uses a lot of tooling. From the "thinking" trace:

> I'm now writing a Python script to automate the summation computation. I'm implementing a prime sieve and focusing on functions for Rm and Km calculation [...]


If used through the chat interface, are these models not doing some RAG?


So when does the developer admit defeat? Do we have a benchmark for that yet?


According to a bunch of philosophers (https://ai-2027.com/), doom is likely imminent. Kokotajlo was on Breaking Points today. Breaking Points is usually less gullible, but the top comment shows that "AI" hype strategy detection is now mainstream (https://www.youtube.com/watch?v=zRlIFn0ZIlU):

AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."


Every AI researcher calls it quits one YOLO run away from inventing a machine that turns all matter in the Universe into paperclips.



We need to wait and see. According to Google, they solved AI 10 years ago with Google Duo, but somehow they keep smashing records despite having the worst coding tool until Gemini 2.5. Google's internal benchmarks are irrelevant.


The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.

It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1.

This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.

To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.

The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).

This isn't an incremental gain, it's a step-change leap in reducing hallucinations.

And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.

That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).


I usually ask a simple question that ALL the models get wrong: a list of the mayors of my city [Londrina]. ALL the models (offline) get it wrong. And I mean all the models. The best I could get was o3, I believe, which said it couldn't give a good answer for that and told me to access the city website.

Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.

This should be a simple answer, because all the data is on Wikipedia, which the models are certainly trained on, but somehow most models don't manage to get it right, because... it's just an irrelevant city in a huge dataset.

But somehow, Gemini 3 did it.

Edit: I just asked "Cool places to visit in Londrina" (in Portuguese), and it was also 99% right, unlike other models, which just make stuff up. The only thing wrong here is that it mentioned sakuras by a lake... Maybe it confused them with Brazilian ipês, which are similar, and indeed the city is full of them.

It seems to have a visual understanding, imo.


Ha, I just did the same with my hometown (Guaiba, RS), a city 1/6th the size of Londrina, whose English Wikipedia page hasn't been updated in years and still lists the wrong mayor (!).

Gemini 3 nailed it on the first try, included political affiliations, and added some context on who they competed with and beat in each of the last 3 elections. And I just built a fun application with AI Studio, and it worked on the first shot. Pretty impressive.

(disclaimer: Googler, but no affiliation with Gemini team)


Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).

I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.


Funny, I just asked "Ask Brave", which uses a cheap LLM connected directly to its search engine, and it got it right without any issues.

It shows once again that for common searches, (indexed) data is king, and that's where I expect even a simple LLM connected directly to a huge indexed dataset to win against much more sophisticated LLMs that have to use agents for searching.


I asked Claude, and had no issues with the answer including mentioning the impeached Antonio Belinati...


thanks for sharing, very interesting example


Thanks for reporting these metrics and drawing the conclusion of an underlying breakthrough in search.

In his Nobel Prize lecture, Demis Hassabis ends by discussing how he sees all of intelligence as a big tree-like search process.

https://youtube.com/watch?v=YtPaZsasmNA&t=1218


The one thing I got out of the MIT OpenCourseWare AI course by Patrick Winston was that all of AI could be framed as a problem of search. Interesting to see Demis echo that here.


It tells me that the benchmark is probably leaking into the training data. Going to the benchmark site:

> Model was published after the competition date, making contamination possible.

Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.

Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.


This comment was written by an AI specifically instructed to be more concise than usual.


> To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

You say "probabilistic generation" like it's some kind of a limitation. What is exactly the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.


Your comment is AI generated


I'll grant you the style is like an LLM, but the thoughts seem a bit unlike one. I mean, the MathArena Apex results indicating a new discovery rather than more data is definitely a hypothesis.

Also panarky denies it.


> This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.

> To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

> You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.

Looks like AI slop.


It obviously is.


From my understanding, Google put online the largest RL cluster in the world not so long ago. It's not surprising they do really well on things that are "easy" to RL, like math or SimpleQA


Aren't you just describing tool calls?


You clearly AI generated this comment.


[flagged]


Hmmm, I wrote those words myself, maybe I've spent too much time with LLMs and now I'm talking like them??

I'd be interested in any evidence-based arguments you might have beyond attacking my writing style and insinuating bad intent.

I found this commenter had sage advice about how to use HN well, I try to follow it: https://news.ycombinator.com/item?id=38944467


I’ll take you at your word, sorry for the incorrect callout. Your comment format appeared malicious, so my response wasn’t an attempt at being “snarky”, just acting defensively. I like the HN Rules/Guidelines.


You mentioned "step change" twice. Maybe a once over next time? My favorite Mark Twain quote is (very paraphrased) "My apologies, had I more time, I would have written a shorter letter".


I thought the repetition was intentional.


This is happening to me too, and frankly I'm a little concerned. English is not my first language, so I use AI for checking and writing many things, and I spend a lot of time with coding tools. Now I sometimes need to make a conscious effort to avoid mimicking some LLM patterns...


“If you gaze long into an abyss, the abyss also gazes into you.”


Is that you Nietzsche? Or are you Magog https://andromeda.fandom.com/wiki/Spirit_of_the_Abyss


You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic choices of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.


I can sympathize with being mistakenly accused of posting LLM output, but as a reader, the above format of "It's not x, it's y" repeated multiple times for artificial dramatic emphasis, to make a pretty mundane point that could use a third of the length, grates on me like reading LinkedIn or marketing voice, whether it's AI or not (and it's almost always AI anyway).

I've seen fairly niche subreddits go from enjoyable and interesting to ruined by being clogged with LLM spam that sounds exactly like this, so my tolerance for reading it is incredibly low, especially on HN, and I'll just dismiss it.

I probably lose the occasional legitimate original observation now and then, but in a world where our attention is being hijacked by zero-effort spam everywhere you look, I just don't have the time or energy to avoid using that heuristic.


Also discounting the fact that people actually do talk like that. In fact, these days I have to modify my prose to be intentionally less LLM-like lest the reader thinks it's LLM output.


1) Models learn these patterns from common human usage. They are in the wild, and as such there will be people who use them naturally.

2) Now, given its for-some-reason-ubiquitous choice by models, it is also a phrasing that many more people are exposed to, every day.

Language is contagious. This phrasing is approaching herd levels, meaning models trained from up-to-the-moment web content will start to see it as less distinctly salient. Eventually, there will be some other high-signal novel phrase with high salience, and the attention heads will latch on to it from the surrounding context, and then that will be the new AI shibboleth.

It's just how language works. We see it in the mixes between generations when our kids pick up new lingo, and then it stops being in-group for them when it spreads too far... Skibidi, 6 7, etc.

A generation ago the internet put this on steroids. Now? Even faster.


Wow. Sounds pretty impressive.


The problem is these models are optimized to solve the benchmarks, not real world problems.


For those curious, here are the actual keywords (from https://docs.python.org/3/reference/lexical_analysis.html?ut... ):

Hard Keywords:

False, None, True, and, as, assert, async, await, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield

Soft Keywords:

match, case, _, type

I think nonlocal/global are the only hard keywords I barely use now; among the soft ones, I rarely use pattern matching, so 5 seems like a good estimate.
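
If you'd rather not count by hand, the standard library exposes both lists directly (keyword.softkwlist and issoftkeyword need Python 3.9+, and the exact soft-keyword set depends on the version, e.g. "type" arrived in 3.12):

    import keyword

    print(len(keyword.kwlist), keyword.kwlist)   # 35 hard keywords
    print(keyword.softkwlist)                    # e.g. ['_', 'case', 'match', 'type'] on 3.12
    print(keyword.iskeyword("match"), keyword.issoftkeyword("match"))  # False True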


I recall when they added "async" and it broke a whole lot of libraries. I hope they never introduce new "hard" keywords again.


Removing "print" in 3.0 helped their case significantly, as well.


I'm also curious what's the use case of this over Ray. Tighter integration with PyTorch/tensors abstractions?


That.

Also, it has RDMA. Last I checked, Ray did not support RDMA.

There are probably other differences as well, but the lack of RDMA immediately splits the world into things you can do with Ray and things you cannot do with Ray.


Not currently, but it is being worked on https://github.com/ray-project/ray/issues/53976.


From the docs ( https://meta-pytorch.org/monarch/index.html ):

Monarch is a distributed programming framework for PyTorch based on scalable actor messaging. It provides:

- Remote actors with scalable messaging: Actors are grouped into collections called meshes and messages can be broadcast to all members.

- Fault tolerance through supervision trees: Actors and processes form a tree, and failures propagate up the tree, providing good default error behavior and enabling fine-grained fault recovery.

- Point-to-point RDMA transfers: cheap registration of any GPU or CPU memory in a process, with one-sided transfers based on libibverbs

- Distributed tensors: actors can work with tensor objects sharded across processes

It seems like the goal of Monarch is to do what Ray does, but more tightly integrated with the Deep Learning/distributed training ecosystem?
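
For comparison, a toy version of the plain Ray actor model looks roughly like this (this uses the real Ray API, but it's only a sketch; the mesh broadcasting, supervision trees, and RDMA transfers described above have no direct equivalent here):

    import ray

    ray.init()

    @ray.remote
    class Counter:
        """Toy actor; Monarch's meshes group many actors and broadcast to them."""
        def __init__(self):
            self.n = 0

        def incr(self, k=1):
            self.n += k
            return self.n

    counters = [Counter.remote() for _ in range(4)]
    # "Broadcast" by hand: fan a call out to every actor and gather the futures.
    print(ray.get([c.incr.remote(10) for c in counters]))  # [10, 10, 10, 10]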


https://nautil.us/deep-learning-is-hitting-a-wall-238440/

Gary Marcus said that Deep Learning was hitting a wall 1 month before the release of DALL-E 2, 6 months before the release of ChatGPT, and 1 year before GPT-4, arguably 3 of the biggest milestones in Deep Learning.


Sam Altman said GPT-3 was dangerous and that OpenAI should be responsible for saving humanity.


Worth pointing out that no one who doesn't work at a frontier lab has ever seen a completely un-nerfed, un-bowdlerized AI model.


There are some base models available to the public today. Not on "end of 2025 frontier run" scale, but a few of them are definitely larger and better than GPT-3. There are some uses for things like that.

Not that the limits of GPT-3 were well understood at the time.

We really had no good grasp of how dangerous or safe something like that would be, or whether there was some subtle tipping point that could propel something like GPT-3 all the way to AGI and beyond.

Knowing what we know now? Yeah, they could have released GPT-3 base model and nothing bad would have happened. But they didn't know that back then.


But we know that ChatGPT 5 is better than anything un-nerfed and un-bowdlerized from 2 years ago. And it is not impressive.


I had a similar idea when the Apple Vision Pro came out, to be able to code while laying on a couch or bed fully relaxed, but I never got around to doing it. Neat!


> code while laying on a couch or bed fully relaxed

I wanted this so much I started programming on my phone with Termux. Yes, on a touch screen.


I wrote an entire graphical Go game-tree editor in Lisp, with a stylus, on a PalmPilot. I considered it an artistic expression.


I wrote my own Lisp interpreter in C inside Termux. My language might be the first to be born inside a smartphone.

> with a stylus

Respect.


I code this way using the Rayneo Air 3S Pro. Feels like I accomplish more this way because the display is on my head and I can relax.


I do the same with Viture Pro XR glasses using a Bluetooth keyboard. It's been great when I'm having neck/back issues that require lying down to recover. The downside is that XR glasses cause a bit more eye strain, which forces the short periodic breaks I should be taking anyway.


Do you just type with a regular keyboard then?


(I'm the one who posted the URL, not the author of the post.)

Julian Schrittwieser (the author of this post) has been in AI for a long time; he was on the core team that worked on AlphaGo, AlphaZero and MuZero at DeepMind, and you can see him in the AlphaGo movie. While that doesn't make his opinion automatically true, I think it makes it worth considering, especially since he's a technical person, not a CEO trying to raise money.

"extrapolating an exponential" seems dubious, but I think the point is more that there is no clear sign of slowing down in models capabilities from the benchmarks, so we can still expect improvements


Benchmarks are notoriously easy to fake. Also, he doesn't need to be a CEO trying to raise money to have an incentive to push this agenda/narrative: he has a huge stock grant from Anthropic that will go to $0 when the bubble pops.


If you refresh you get simpler ones, like the couple kissing


We know that they correctly implement their specification*


No, they are correct, because the deciders themselves are just a cog in the proof of the overall theorem. The specification of the deciders is not part of the TCB, so to speak.

