
Thank you <3

Thank you! <3

These are all built with React and CSS animations (or the Web Animations API where I needed it). I’m not very good at React so the code is a real mess. Two of the components also use three.js for the 3D bits.

For the stuff on my personal site, which simonw graciously linked to in another reply, you can see all the code behind my work at https://github.com/samwho/visualisations


Simon, you’re too kind. Thank you. <3

When I was writing this, GPT 5.1 was the latest and it got it right away. It’s the sequence of prime numbers fwiw :)

I was wondering about this when I was reading around the topic. I can’t personally think of a reason you would need to segregate, though it wouldn’t surprise me if they do for some sort of compliance reasons. I’m not sure though, would love to hear something first-party.

They absolutely are segregated.

With OpenAI at least you can specify the cache key and they even have this in the docs:

> Use the prompt_cache_key parameter consistently across requests that share common prefixes. Select a granularity that keeps each unique prefix-prompt_cache_key combination below 15 requests per minute to avoid cache overflow.
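
In SDK terms that ends up looking roughly like this (the model name, prompt, and key are placeholders, and I’m assuming the Python SDK passes prompt_cache_key straight through to the Responses API as the docs describe):

    from openai import OpenAI

    client = OpenAI()

    # A long, stable prefix you want cached (needs to be >= 1024 tokens
    # before caching kicks in at all).
    SYSTEM_PROMPT = "You are a support assistant for Example Corp. ..."

    def answer(question: str) -> str:
        response = client.responses.create(
            model="gpt-4o-mini",
            instructions=SYSTEM_PROMPT,       # shared prefix across requests
            input=question,
            # Same key for every request that shares this prefix; keep each
            # prefix + key combination under ~15 requests per minute.
            prompt_cache_key="support-bot-v1",
        )
        return response.output_text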


> Select a granularity that keeps each unique prefix-prompt_cache_key combination below 15 requests per minute to avoid cache overflow.

Why below a certain number? Usually with caches, a high number of requests keeps the cached entry from expiring or being evicted, no?


It needs to go to the same machine, and each machine can only handle so many requests.
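
Purely as a mental model (this is a guess at how provider-side routing might work, not anything OpenAI has published), you can picture it like this:

    import hashlib

    MACHINES = ["gpu-node-0", "gpu-node-1", "gpu-node-2", "gpu-node-3"]

    def route(prefix: str, prompt_cache_key: str) -> str:
        # Requests that share a prefix + key hash to the same machine,
        # because that's where the KV cache for the prefix lives.
        digest = hashlib.sha256((prompt_cache_key + prefix).encode()).hexdigest()
        return MACHINES[int(digest, 16) % len(MACHINES)]

    # Every request with this prefix + key lands on one node. That node can only
    # prefill so many requests per minute, so past some rate your traffic spills
    # onto nodes that don't have the prefix cached and you lose the cache hits.
    print(route("You are a support assistant for Example Corp. ...", "support-bot-v1"))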

Does anyone actually compute / use this key feature? Or do you rely on implicit caching? I wish HN had a comment with a poll feature.

It would be important to use for relatively high-traffic use cases.

Let's say you have a chatbot with hundreds of active users. Their requests could get routed to different machines, which would mean the implicit caching wouldn't work.

If you set the cache key to a user ID, then it's more likely each user's chat gets cached on subsequent requests.
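
Something like this, roughly (model and names are illustrative; I'm assuming the SDK accepts prompt_cache_key on chat completions as documented):

    from openai import OpenAI

    client = OpenAI()

    def chat_turn(user_id: str, history: list[dict], user_message: str) -> str:
        history.append({"role": "user", "content": user_message})
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,                    # each user's growing prefix
            prompt_cache_key=f"user:{user_id}",  # keep one user's requests together
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        return reply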


The only thing that comes to mind is some kind of timing attack. Send loads of requests specific to a company you’re trying to spy on, and if one comes back cached you know someone has sent that prompt recently. Expensive attack, though, with a large search space.
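
Very roughly, the probe would look something like this (the endpoint, model, and threshold are made up, and in practice you’d need lots of samples to separate a cache hit from ordinary latency jitter):

    import time
    from openai import OpenAI

    client = OpenAI()

    def looks_cached(candidate_prompt: str, threshold_s: float = 0.5) -> bool:
        start = time.monotonic()
        client.responses.create(
            model="gpt-4o-mini",
            input=candidate_prompt,   # the prompt you suspect someone else has sent
            max_output_tokens=16,     # keep generation short; prefill time is the signal
        )
        elapsed = time.monotonic() - start
        # A cache hit skips prefill for the matching prefix, so it comes back faster.
        return elapsed < threshold_s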

No, the search space is tiny: you can just attack one BPE token at a time! Stuff like password guessing is almost trivial when you get to do a timing attack on each successive character. So that lets you quickly exfiltrate arbitrary numbers of prompts, especially if you have any idea what you’re looking for. (Note that a lot of prompts are already public information, or you can already exfiltrate prompts quite easily from services and start attacking from there...)

Hill climbing a password would only be possible if intermediate KV cache entries were stored. To hill-climb "hunter2", you're going to try "a", "b", "c", etc., until you notice that "h" comes back faster. Then you try "ha", "hb" and so on.

But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"

If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.

That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.

Sharing caches between orgs would be an incredible misstep.


Right, you can’t actually guess a letter (byte) at a time, but you can guess a token at a time (I believe the vocabulary is 200,000 possible tokens in GPT-5). So you could send each of the 200,000 possible tokens, see which is cached, and then send 200,000 more to find the next cached token. Certainly less efficient, but well within the realm of a feasible attack.

It's a good call out re: tokens vs letters, but I think you might have misunderstood my point - you can't do it a token at a time unless the intermediate KV cache is stored after each token is generated.

This won't be the case in any non-toy implementation, as it would be unnecessary and slow.


Ah, fair enough. Anthropic caches at a block level (basically a single message), so for non-trivial messages this is really less of a concern, although I definitely understand why they still scope the cache to a single tenant.

Do any providers offer this level of granularity? Anthropic requires explicit cache markers, for example.

Anthropic requires explicit cache markers but will “look backwards” some amount, so you don’t need to land exactly on the split to get cached tokens.
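
For reference, the markers look roughly like this (shape taken from Anthropic’s prompt-caching docs; the model name and prompt text are placeholders):

    import anthropic

    client = anthropic.Anthropic()

    # Needs to be reasonably long (the docs quote a ~1024-token minimum for most
    # models) before caching kicks in at all.
    LONG_SYSTEM_PROMPT = "You are a contract-review assistant. ..."

    message = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Everything up to and including this block is eligible for caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarise the key risks."}],
    )
    print(message.content[0].text)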

I have come across claims that turning on caching means the LLM has a faint memory of what was in the cache, even for unrelated queries. If this is the case, it's completely unreasonable to share the cache, because of the possibility of information leakage.

This is absolutely 100% incorrect.

How would information leak, though? There’s no difference in the probability distribution the model outputs when caching vs not caching.

The probability distribution the model outputs is identical under identical conditions.

A local model running alone on your machine will 100% always return the exact same thing, the internal state will be exactly the same, and you can checkpoint or cache that state to avoid re-running to that point.

But… conditions can be different, and batching requests tends to affect other items in flight. I believe Thinking Machines had an article about how to make a request deterministic again without performance going to complete crap.

I tend to think of it this way (completely not what actually happens, though): what if you were to cache based on a tensor as the key? To generate a reasonably sized key, what loss of precision is acceptable while still retrieving the same cache entry, knowing that there is inherent jitter in the numbers of the tensor?

And then there's the ever-so-slight leak of information. But also multiplied, since there are internal KV caches for tokens and blah blah blah.
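
For a local model, the checkpointing idea above looks roughly like this (a sketch using Hugging Face transformers, with GPT-2 as a stand-in for whatever model you actually run):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # stand-in for your local model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    prefix_ids = tok("You are a helpful assistant. Answer briefly.\n", return_tensors="pt").input_ids

    with torch.no_grad():
        # Run the shared prefix once and keep its KV cache -- the "checkpoint".
        prefix_out = model(prefix_ids, use_cache=True)
        checkpoint = prefix_out.past_key_values

        # Continue from the checkpoint with a suffix, without re-running the prefix.
        suffix_ids = tok(" What is KV caching?", return_tensors="pt").input_ids
        out = model(suffix_ids, past_key_values=checkpoint, use_cache=True)
        next_token = out.logits[0, -1].argmax()
        print(tok.decode(next_token.item()))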


I wonder if there is valuable information that can be learned by studying a company's prompts? There may be reasons why some companies want their prompts private.

I realize cache segregation is mainly about security/compliance and tenant isolation, not protecting secret prompts. Still, if someone obtained access to a company’s prompt templates/system prompts, analyzing them could reveal:

- Product logic / decision rules, such as: when to refund, how to triage tickets

- Internal taxonomies, schemas, or tool interfaces

- Safety and policy guardrails (which adversaries could try to route around)

- Brand voice, strategy, or proprietary workflows

That is just off the top of my head.


With KV caching as it’s described there, it has to be a prefix match. OpenAI state in their docs that they don’t cache anything below 1024 tokens, and I’m sure I read somewhere that they only cache in 1024-token blocks (so 1024, 2048, 3072, etc.), but I can’t find it now.

There’s been some research into how to cache chunks in the middle, but I don’t think any of the providers are doing it yet because it needs the prompt to be structured in a very specific way.
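
As a mental model of the block granularity (the block size and minimum here are parameters taken from this discussion, not a claim about what any provider actually implements):

    BLOCK_SIZE = 128        # OpenAI's launch post described 128-token blocks
    MIN_CACHEABLE = 1024    # prompts shorter than this aren't cached at all

    def reusable_tokens(prompt_tokens: list[int], cached_prefix: list[int]) -> int:
        # Longest common prefix between this prompt and what's already cached.
        common = 0
        for a, b in zip(prompt_tokens, cached_prefix):
            if a != b:
                break
            common += 1
        if common < MIN_CACHEABLE:
            return 0
        # Only whole blocks are reused; the remainder gets recomputed.
        return (common // BLOCK_SIZE) * BLOCK_SIZE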


https://platform.openai.com/docs/guides/prompt-caching#requi...

> Caching is available for prompts containing 1024 tokens or more.

No mention of caching being in blocks of 1024 tokens thereafter.


At launch it was described as being in blocks of 128 tokens.

https://openai.com/index/api-prompt-caching/


Could you tell me what browser/OS/device you’re using? A few people have said this and I haven’t been able to reproduce it.

LibreWolf (a fork of Firefox), latest version.

The F12 console lists this:

Loading failed for the <script> with source “https://global.ketchcdn.com/web/v2/config/ngrok/ngrok_ketch_...”. prompt-caching:1:356 Response { status: 404, type: "default", url: "", redirected: false, ok: false, statusText: "Not Found", headers: Headers(1), body: ReadableStream, bodyUsed: false }

React Router caught the following error during render entry.client-BTJ7ChVH.js:8:64676 Response { status: 404, type: "default", url: "", redirected: false, ok: false, statusText: "Not Found", headers: Headers(1), body: ReadableStream, bodyUsed: false }

Uncaught Error: Minified React error #520; visit https://react.dev/errors/520 for the full message or use the non-minified dev environment for full errors and additional helpful warnings. chunk-G3INQAYP-D7BZozYw.js:4:2490 Rm https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... mu https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... Lm https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... t1 https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... A1 https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... Ba https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... Caused by: Response { … }


Another person had this problem as well and we couldn’t figure out what causes it. We suspect something to do with WebGL support. What browser/device are you using? Does it still break if you disable all extensions? I’d love to fix this.

It gives "D is not a function". This is on Firefox 146. Various extensions including uBlock Origin, but that doesn't seem to cause it. It also doesn't work in a private window.

It’s funny, I didn’t set out for that to be the case. When I pitched the idea internally, I wanted to scratch my own itch (what on earth is a cached token?) and produce a good post. But then I realised I had to go deeper and deeper to get to my answer and accidentally made a very long explainer.

Thanks for the post, it's near perfect in focus, detail and how it's written.

EDIT: You have some minor typos in the post (psuedocode)


Yay, glad I could help! The sampling process is so interesting on its own that I really want to do a piece on it as well.

Looking forward to it!
