So we can avoid a lot of this pain by not using constructors for thread_local variables, right? We can roll our own version of this using a thread_local bool, or just interpose thread spawning as you've done.
As far as why it's worse with fPIC/shared: there are a variety of TLS models: General Dynamic, Local Dynamic, Initial Exec, and Local Exec. And they have different constraints / generality. The more general models are slower. IIRC IE/LE won't work with shared library thread_locals, but it's been a while so don't quote me on that.
Generally agree that it seems like the compiler could in theory be doing a better job in some of these circumstances.
> I’m sure there’s some dirty trick or other, based on knowing the guts of libc and other such, which, while dirty, is going to work for a long time, and where you can reasonably safely detect if it stopped working and upgrade it for whatever changes the guts of libc will have undergone. If you have an idea, please share it!
Find the existing TLS allocations, hope there's spare space at the end of the last page, and just map your variables there using %fs-relative accesses?
Always fun to see another high performance tracing implementation. We do some similar things at work (thread-local ringbuffers), though we aren't doing function entry/leave tracing.
> in my microbenchmark I get <10 ns per instrumented call or return
A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
> While we're on the subject of snapshots - you can get trace data from a core dump by loading funtrace_gdb.py from gdb
Nice. We trace into a shared memory segment and then map it at start time, emitting a pre-crash trace during (next) program start. Maybe makes more sense for our use case (a specific server that is auto-restarted by orchestration) than a more general tracing system.
> A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
Exactly right, though coming from an embedded background it seems totally insane to me: there you get the cycle count in less than 1 ns, and writing to the buffer would be "the" performance problem (you could then e.g. avoid recording short calls as measured at runtime, but on x86 you'll already have spent too much time on the rdtsc for this to lower the overhead). There's also RDPMC, but it's not much faster, you need permissions(tm) to use it, and it stops counting on various occasions which I never fully understood.
Regarding prefetching - what do you do prefetching-wise that helps performance? All my attempts to do better than the simplest store instructions did nothing to improve performance (I tried prefetchw/__builtin_prefetch, movntq/_mm_stream_pi and vmovntdq/_mm_stream_si128, and all of them either didn't help or made things even slower).