So we can avoid a lot of this pain by not using constructors for thread_local variables, right? We can roll our own version of this using a thread_local bool, or just interpose thread spawning as you've done.
As far as why it's worse with fPIC/shared: there are a variety of TLS models: General Dynamic, Local Dynamic, Initial Exec, and Local Exec. And they have different constraints / generality. The more general models are slower. IIRC IE/LE won't work with shared library thread_locals, but it's been a while so don't quote me on that.
Generally agree that it seems like the compiler could in theory be doing a better job in some of these circumstances.
> I’m sure there’s some dirty trick or other, based on knowing the guts of libc and other such, which, while dirty, is going to work for a long time, and where you can reasonably safely detect if it stopped working and upgrade it for whatever changes the guts of libc will have undergone. If you have an idea, please share it!
Find the existing TLS allocations, hope there's spare space at the end of the last page, and just map your variables there using %fs-relative accesses?
Always fun to see another high performance tracing implementation. We do some similar things at work (thread-local ringbuffers), though we aren't doing function entry/leave tracing.
> in my microbenchmark I get <10 ns per instrumented call or return
A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
> While we're on the subject of snapshots - you can get trace data from a core dump by loading funtrace_gdb.py from gdb
Nice. We trace into a shared memory segment and then map it at start time, emitting a pre-crash trace during (next) program start. Maybe makes more sense for our use case (a specific server that is auto-restarted by orchestration) than a more general tracing system.
> A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
Exactly right, though coming from an embedded background it seems totally insane to me: there you get the cycle count in less than 1 ns, and writing to the buffer would be "the" performance problem (you could then e.g. avoid recording short calls as measured at runtime, but on x86 you'll already have spent too much time on the rdtsc for this to lower the overhead). There's also RDPMC, but it's not much faster, you need permissions(tm) to use it, and it stops counting on various occasions which I never fully understood.
Regarding prefetching - what do you do prefetching-wise that helps performance? All my attempts to do better than the simplest store instructions did nothing to improve performance (I tried prefetchw/__builtin_prefetch, movntq/_mm_stream_pi and vmovntdq/_mm_stream_si128, and all of them either didn't help or made things even slower).