Reminds me of a time at Quora in 2011 when we saw Python GC impact 99th-percentile server-side site speed. Drawing inspiration from HFT, where some companies disable JVM GC during trading hours and run it offline, I thought about taking some backends offline periodically so that GC would never happen on user requests. A simpler operational solution emerged, though: disable GC on user requests and run it only on a special "/_gc" endpoint. I then repurposed the frequent nginx/haproxy backend health checks to hit that endpoint, ensuring every backend ran GC frequently, with the time spent there impacting only the health-check requests rather than those of end users.
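A minimal sketch of that pattern, using Flask purely for illustration (the original setup presumably used a different stack): automatic GC is disabled at startup, and a collection only runs when the load balancer's health check hits a dedicated /_gc endpoint.

    # Hypothetical sketch: GC disabled for normal requests, run only on /_gc,
    # which the nginx/haproxy health check is pointed at.
    import gc
    from flask import Flask

    app = Flask(__name__)
    gc.disable()  # stop automatic cyclic GC; refcounting still frees most objects

    @app.route("/_gc")
    def run_gc():
        # Health checks hit this frequently, so every backend collects often,
        # and the pause is absorbed by the health check, not by a user request.
        collected = gc.collect()
        return "collected %d objects\n" % collected

    @app.route("/")
    def index():
        return "hello\n"  # user-facing requests never pay for a GC pause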
Thanks, I don't think I saw much impact at all in aggregate: our memory consumption on these web servers was dominated by objects we intentionally stored per request or globally, not by temporary/unreferenced Python objects.
Even while GC is delayed, Python (CPython at least) will still free most objects through reference counting; only circularly referenced objects stick around until the next GC run. So delaying GC doesn't keep piles of stack temporaries alive.
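A small CPython demo of that point (the Node class and the printed messages are just for illustration): with the cyclic collector disabled, a plain object is freed the moment its refcount hits zero, while a cycle lingers until an explicit gc.collect().

    import gc
    import weakref

    class Node:
        pass

    gc.disable()

    plain = Node()
    weakref.finalize(plain, lambda: print("plain object freed"))
    del plain                    # refcount hits zero: freed right away, no GC needed

    a, b = Node(), Node()
    a.other, b.other = b, a      # reference cycle
    weakref.finalize(a, lambda: print("cycle freed"))
    del a, b                     # refcounts never reach zero...
    print("cycle still alive")   # ...so nothing is freed yet

    gc.collect()                 # ...until the cyclic collector runs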
Theoretically, with your code structured right, you can disable the cyclic garbage collector outright: it only deals with reference cycles, which you can explicitly avoid by using the weakref module.
Not entirely sure how you'd go about writing code like that, but it's possible.
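One way such code might look (a made-up parent/child example): the back-reference is a weakref, so no cycle is ever created and the cyclic collector has nothing left to do.

    import weakref

    class Child:
        def __init__(self):
            self.parent = None

    class Parent:
        def __init__(self):
            self.children = []

        def add_child(self, child):
            child.parent = weakref.ref(self)  # weak back-reference: no cycle
            self.children.append(child)

    p = Parent()
    p.add_child(Child())
    del p  # everything is freed immediately by refcounting; the weakref just goes dead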
Nice examples and graphs. What really confuses me is the definition of "low-latency" nowadays. The meaning has been on a slippery slope in recent years. It used to refer to microseconds at HFT shops; then it came to mean web request latency on the scale of milliseconds. Now every GC-based language claims to be "low-latency" because its 90th/95th/99th-percentile GC pause is within 16.67ms/50ms/whatever. So today some HFT developers have coined terms like "ultra-low latency"[0] for their work.
On the flip side, I've also heard "low-latency" (or even "instant") used to mean "within 30 seconds", as an upgrade from "the mainframe runs a batch job every hour (or day)".
I don't think there's any way to consider a phrase like "low latency" without considering what you're talking about.
> What really confuses me is the definition of "low-latency" nowadays
There has never been a strict definition of “low latency”. It is heavily context-dependent. Something which is not “low latency” enough for live audio or a car’s steering controller might be plenty fast for an internet text chat service.
While not completely orthogonal, I feel like real-time (soft or hard) is distinct from low latency; a real-time system can still be relatively high-latency as long as the latency is bounded/controlled/predictable. Of course, in practice real-time systems generally strive for low latency too.
Absolutely. Think about turning a supertanker or something else huge. It must happen on time; it doesn't matter if it takes one or two seconds to initiate.
Not exactly. Hard real-time = missing a deadline is a major system fault. Hard real-time would be the anti-lock brakes on your car: did the cylinder not actuate within a given deadline? If so, disable the ABS feature since it's no longer deterministic, set an error code on the car's computer, and turn on an error light on the dashboard.
Soft real-time is video decoding: we need to render one frame every 1/30th of a second or so. If we miss our deadline, we either skip that frame and move on to the next, or pause to fill a buffer.
You can have a hard real-time system where the latency required is measured in seconds and a soft real-time system where the deadline is in nanoseconds.
Someone at Google once told me they worked on “real-time” search. Having come from graphics and video rendering I asked what that meant and she said it was how search may or may not respond to key presses.
Prior to HFT shops, embedded use cases might consider hundreds of microseconds to be the threshold for low latency (40 instructions on a typical 8-bit micro running at single-digit MHz come in under 100µs, with no cache misses to muck up the timing).
I’d say typically not; there's not much to reuse, as most of the FPGA code would be know-how, optimised down to the cycle level for each particular exchange.
How about basic frameworks like high-speed PCIe communication with the operating system, including an open-source driver for Windows and Linux? That should be algo-independent.
I’m not one of the FPGA engineers at my firm, so I can’t answer that with certainty, but one thing I know for sure is how strong the "not invented here" syndrome is in these kinds of shops, so you can guess...
Context is king. When you're working on the timescale of days, low latency can be within minutes. By the same measure, when you expect to get a new event every microsecond, then nanoseconds matter.
In this case, Gambit (the outfit who sponsored this work) are a stat arb shop doing HFT. So I'd say the label fits.
I'd really like to know the context in which the time for light to travel ~1 inch, or (generously) 1/200 of one CPU operation, or 1/100 of a cache read, is relevant to anything.
The more accurate the data captures are, the more accurate the models of that data can become. Also, for things like pure arbitrage (which virtually doesn't exist anymore due to the efficiency of markets with computers trading), those trades are Talladega Nights style, "you're first or you're last", winner-take-all.
As another poster commented, it is not uncommon in RF. Fiber optics along with microwave radios are common in the space. The goal is ultimately to get to the limit of what physics allows.
This could possibly be combined with multiprocessing to great effect? I'm imagining something like having a pool of workers executing tasks (/reacting to events/serving requests/etc.), and only running GC after a task has been completed, but before indicating readiness to the supervising process (/load balancer/etc.).
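A rough sketch of that idea, assuming a plain multiprocessing.Queue-based pool (all the names here are made up): each worker disables automatic GC, handles one task at a time, and collects explicitly after finishing, before pulling the next task from the supervisor.

    import gc
    import multiprocessing as mp

    def handle(task):
        return task * 2                # placeholder for the real work

    def worker(tasks, results):
        gc.disable()
        while True:
            task = tasks.get()
            if task is None:           # sentinel: shut down
                break
            results.put(handle(task))  # latency-sensitive part, no GC pauses
            gc.collect()               # pay the GC cost between tasks, before
                                       # asking the supervisor for more work

    if __name__ == "__main__":
        tasks, results = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
        for p in procs:
            p.start()
        for i in range(100):
            tasks.put(i)
        for _ in procs:
            tasks.put(None)            # one sentinel per worker
        done = [results.get() for _ in range(100)]
        for p in procs:
            p.join()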
Seems quite useful, allowing the developer to guide the garbage collector in the right direction by carefully placing statements that tell it when it is (not) OK to run. But make sure to add some comments describing the purpose of those statements, and how to profile your code to check that it's still working correctly. You don't want to accidentally stop garbage collection altogether, either.
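One possible way to keep those statements self-documenting and hard to misuse (not from the article, just a sketch): wrap the latency-critical section in a context manager so collection is always re-enabled, and optionally run, when the section ends.

    import gc
    from contextlib import contextmanager

    @contextmanager
    def no_gc(collect_after=True):
        """Suppress cyclic GC inside a latency-critical section."""
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()       # can't accidentally leave GC off for good
            if collect_after:
                gc.collect()      # pay the pause here, where timing is relaxed

    def handle_latency_critical_request():
        pass                      # placeholder for the real work

    with no_gc():
        handle_latency_critical_request()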
I'd be curious to see if any of the work done here could be applied back to the main CPython project. I doubt it could happen immediately -- at least, not with the way GC is currently implemented -- but PyPy has been a source of innovation for CPython in the past (see: new dict implementation).
CPython already has a lower-latency GC than PyPy, gc.disable() already works, and it allows manual memory management when needed.
Reference counting lets you (if needed) keep references to memory in your Python code and free them at the right spots.
This is PyPy becoming useful for a lot more production use cases, from web APIs with a latency SLA to audio and games. In many cases peak performance is not what matters; minimum performance is.
Yeah. However, you have the option not to pause when it matters. You can control where the memory management happens: keep references to the memory and call gc.disable(), then, when you're ready, let go of the references and re-enable the GC.
PyPy now lets you control where memory management happens, which makes it possible to control worst-case performance. For many production apps this is a big deal.
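A rough illustration of that approach (the function names are invented): keep strong references to objects you don't want freed in the hot path, then drop them and collect at a point where a pause is acceptable.

    import gc

    deferred = []                      # strong references: deallocation is deferred

    def process(payload):
        return len(payload)            # placeholder for the latency-sensitive work

    def handle_request(payload):
        gc.disable()                   # no cyclic GC during the request
        result = process(payload)
        deferred.append(payload)       # keep the data alive so refcounting
        return result                  # doesn't free it in the hot path

    def idle_point():
        deferred.clear()               # drop the references: memory is released here
        gc.enable()
        gc.collect()                   # and any cycles get cleaned up too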
You can never prevent object destruction in CPython at the end of a block (context); you can only prevent the GC pass that breaks reference cycles. If your class does crazy things at destruction, like time.sleep(10), and you create an instance of that class inside a function, CPython will pause when that function returns even if you call gc.disable().
You also cannot disable minor collections in PyPy, only major collections, but once the JIT kicks in, PyPy can avoid some of the object churn by optimizing instances away.
It's explained in the article that this solves one specific issue Gambit Research had: in some parts of their code they need to take action with very low latency (<10ms) and hence can't wait for the GC. This way they run the GC manually in other sections of the code where the timing requirements are relaxed.
It's a fairly common thing to do for e.g. games as well: disable the GC during all your code, then run it manually at the end of your frame. All you're doing is moving the GC runs to predictable points. You could even skip a collection if you've had a slow-to-render frame or two, evening out the spikes in frame rate.
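A sketch of that pattern for a game-style loop (the frame budget and function names are made up): automatic GC stays off, and a collection runs at the end of a frame only when there is headroom left, otherwise it is skipped.

    import gc
    import time

    FRAME_BUDGET = 1 / 60.0            # ~16.7 ms per frame

    def update_and_render():
        pass                           # placeholder for the per-frame work

    def run(frames=600):
        gc.disable()
        for _ in range(frames):
            start = time.perf_counter()
            update_and_render()
            elapsed = time.perf_counter() - start
            if elapsed < FRAME_BUDGET * 0.8:
                gc.collect()           # plenty of headroom: collect now
            # else: skip this frame's collection to avoid a visible hitch
            time.sleep(max(0.0, FRAME_BUDGET - (time.perf_counter() - start)))

    if __name__ == "__main__":
        run(120)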
The advantages of a typical GC are that it avoids the development cost of manual memory management and allows high throughput; the main disadvantage is usually latency spikes. This feature cuts the worst latency spikes by orders of magnitude, at only a small cost in cognitive burden and throughput. So if GC would have been a good tradeoff except for the latency, there is only a sliver of design space in which using this feature pushes throughput out of acceptable bounds. (And the cognitive burden is still way lower than with any other form of memory management.)
edit: added more details I remembered later