PyPy for low-latency systems (morepypy.blogspot.com)
226 points by shocks on Jan 3, 2019 | 50 comments


Reminds me of a time at Quora in 2011 when we saw Python GC impact 99th-percentile server-side site speed. Drawing inspiration from HFT, where some companies disable JVM GC during trading hours and run it offline, I thought about taking some backends offline periodically so GC would never happen on user requests. A simpler operational solution emerged, though: I just disabled GC on user requests and made it happen only on a special "/_gc" endpoint. I then dual-purposed the frequent nginx/haproxy backend health checks to hit that endpoint, ensuring all backends ran GC frequently, with the time spent there impacting only the health-check requests rather than end users.

edit: added more details I remembered later
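
A minimal sketch of the idea (assuming a Flask-style app; the route and wiring here are hypothetical, not the actual Quora code):

    import gc

    from flask import Flask

    app = Flask(__name__)

    # Disable automatic (cyclic) garbage collection for user requests;
    # reference counting still reclaims most short-lived objects.
    gc.disable()

    @app.route("/_gc")
    def run_gc():
        # Point the nginx/haproxy health check at this endpoint so the GC
        # pause is absorbed by health checks instead of end-user requests.
        collected = gc.collect()
        return "ok, collected %d objects\n" % collected

    @app.route("/")
    def index():
        return "hello\n"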


In the Ruby world this is called "out-of-band GC" and is supported by multiple application servers: http://tmm1.net/ruby21-oobgc/


This is a very interesting approach. What happened to memory footprint when you did this?


Thanks, I don't think I saw much impact at all in aggregate - memory consumption on these web servers was dominated by objects we intentionally stored per request or globally, not by temporary/unreferenced Python objects.


Even while GC is delayed, Python (CPython at least) will free most objects through reference counting. Only circularly referenced objects stick around until the next GC run, so stack temporaries and the like are still reclaimed promptly.
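
That's easy to see on CPython (a quick sketch):

    import gc
    import weakref

    class Node:
        pass

    gc.disable()

    # Acyclic object: freed immediately by reference counting.
    a = Node()
    ref_a = weakref.ref(a)
    del a
    print(ref_a())   # None -- reclaimed without the cyclic GC

    # Cyclic object: sticks around until gc.collect() runs.
    b = Node()
    b.self_ref = b
    ref_b = weakref.ref(b)
    del b
    print(ref_b())   # still alive -- the cycle keeps it referenced
    gc.collect()
    print(ref_b())   # None -- the cycle collector freed it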


Theoretically, with your code structured right, you can disable the cyclic garbage collector outright: it only deals with reference cycles, which you can avoid explicitly by using the weakref module.

Not entirely sure how you'd go about writing code like that, but it's possible.
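
One way it can look in practice (just a sketch): a parent/child structure where the back-reference is weak, so no cycle is ever created and refcounting alone frees the whole tree.

    import weakref

    class Parent:
        def __init__(self):
            self.children = []

        def add_child(self, child):
            # Strong reference parent -> child, weak reference child -> parent,
            # so dropping the Parent frees everything via refcounting alone.
            child.parent = weakref.ref(self)
            self.children.append(child)

    class Child:
        def __init__(self):
            self.parent = None

    p = Parent()
    c = Child()
    p.add_child(c)
    assert c.parent() is p   # dereference the weakref with a call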


Nice! I'll have to give this a try at some point if I run into GC related latency issues and see if it works on my systems.


Nice examples and graphs. What really confuses me is the definition of "low-latency" nowadays. The meaning has slid down a slippery slope in recent years: it used to refer to microsecond scale at HFT shops, then it came to mean web-request latency on the scale of milliseconds. Now every GC-based language claims to be "low-latency" because the 90th/95th/99th-percentile GC pause is within 16.67ms/50ms/whatever. So today some HFT developers have invented terms like "ultra-low latency"[0] to describe their work.

[0]: https://en.wikipedia.org/wiki/Ultra-low_latency_direct_marke...


On the flip side, I've also heard "low-latency" (or even "instant") to mean "within 30 seconds", as an upgrade from "the mainframe runs a batch job every hour (or day)".

I don't think there's any way to consider a phrase like "low latency" without considering what you're talking about.


Good point, the word "low" is itself relative.


> What really confuses me is the definition of "low-latency" nowadays

There has never been a strict definition of “low latency”. It is heavily context-dependent. Something which is not “low latency” enough for live audio or a car’s steering controller might be plenty fast for an internet text chat service.


A better term is soft or hard real-time.


While not completely orthogonal, I feel like real-time (soft or hard) is distinct from low latency; real-time can be still relatively high-latency as long as the latency is bounded/controlled/predictable. Of course generally in practice real-time systems strive to have low latency too.


Absolutely. Think about turning a supertanker or something else huge: it must happen on time, but it doesn't matter if it takes one or two seconds to initiate.


Not exactly. Hard real-time = missing a deadline is a major system fault. Hard real-time would be the anti-lock brakes on your car: did the cylinder fail to actuate within a given deadline? If so, disable the ABS feature since it's no longer deterministic, set an error code on the car's computer, and turn on an error light on the dashboard.

Soft real-time is video decoding: we need to render one frame roughly every 1/30th of a second. If we miss our deadline, we either skip that frame and move on to the next, or pause to fill a buffer.

You can have a hard real-time system where the required latency is measured in seconds and a soft real-time system where the deadline is in nanoseconds.


It's related, but not equivalent.

real-time == "bounded latency" (to a soft or hard degree).

low-latency implies no threshold, just low relative to something else.


I thought the difference between soft vs hard realtime wasn't one of magnitude, but context.

Soft: After X time, usability of the data degrades with age.

Hard: After X time, the data is useless (example: realtime car control systems; if you take too long the car has already crashed)


Oh, sorry, I mean that low latency is a synonym of real-time, either hard or soft depending on the application.


Someone at Google once told me they worked on “real-time” search. Having come from graphics and video rendering I asked what that meant and she said it was how search may or may not respond to key presses.

Whatever it was, it certainly wasn’t real-time.


That's been the case for operating systems, and it's quite effective.


Prior to HFT shops, embedded use cases might have considered hundreds of microseconds to be the threshold for low latency (40 instructions on a typical 8-bit micro running at single-digit MHz is under 100µs, with no cache misses to muck up timings).


In HFT shops, we usually categorise stuff as “hardware” (fpga) and “software” these days. Low-latency typically means sub-microsecond.


Do HFT shops use open-source code for FPGA dev?


I’d say typically not; there’s not much to use, as most of the FPGA code is know-how, optimised down to the cycle level for each particular exchange.


How about basic frameworks like high-speed PCIe communication with the operating system, including an open-source driver for Windows and Linux? That should be algo independent.


I’m not an FPGA engineer myself at my firm, so I can’t answer that with certainty, but one thing I know for sure is how strong the “not invented here” syndrome is in these kinds of shops, so you can guess...


A bit of competition is always fun; you can point them at the code/papers linked from https://forums.xilinx.com/t5/Xcell-Daily-Blog-Archived/Need-...


Context is king. When you're working on the timescale of days, low latency can be within minutes. By the same measure, when you expect to get a new event every microsecond, then nanoseconds matter.

In this case, Gambit (the outfit who sponsored this work) are a stat arb shop doing HFT. So I'd say the label fits.


HFT shops refer to things on nanosecond and sometimes even picosecond scale if you want to talk seriously about it.

Source: software engineer who's worked (or still works) for electronic trading firms for the last 10ish years.


I'd really like to know the context in which the time for light to travel ~1 inch or (generously) 1/200 of 1 cpu operation or 1/100 of a cache read is relevant to anything.


The more accurate data captures are, the more accurate models of the data can become. Also, for things like pure arbitrage (which virtually doesn't exist anymore due to the efficiency of markets with computers trading), those trades are Talladega Nights-style "you're first or you're last" winner-take-all.

As another poster commented, this is not uncommon in RF. Fiber optics along with microwave radios are common in the space. The goal, ultimately, is to get to the limit of what physics allows.


For RF work it's common to have to send a signal with nanosecond resolution for it to be received properly, due to time sharing.


This could possibly be combined with multiprocessing to great effect? I'm imagining something like a pool of workers executing tasks (/reacting to events/serving requests/etc.), and only running GC after a task is done but before indicating readiness to the supervising process (/load balancer/etc.).
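
Something along these lines, perhaps (a rough sketch with multiprocessing.Pool; the task function is made up, and collecting before the worker returns is only an approximation of "before indicating readiness"):

    import gc
    import multiprocessing as mp

    def init_worker():
        # Each worker disables automatic cyclic GC...
        gc.disable()

    def do_work(task):
        return task * 2          # stand-in for real task handling

    def handle_task(task):
        result = do_work(task)
        # ...and collects explicitly after the task is done, before the
        # worker asks the pool for the next one.
        gc.collect()
        return result

    if __name__ == "__main__":
        with mp.Pool(processes=4, initializer=init_worker) as pool:
            print(pool.map(handle_task, range(10)))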


Even a very simple mode with two processes that never have GC enabled at the same time would greatly improve things.
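
For example (a sketch, assuming each worker collects between requests and a shared lock keeps the collections from overlapping):

    import gc
    import multiprocessing as mp

    def worker_loop(gc_lock, n_requests):
        # Each worker disables automatic cyclic GC and only collects while
        # holding the shared lock, so at most one worker pauses at a time.
        gc.disable()
        for _ in range(n_requests):
            pass                 # handle one request here
            with gc_lock:
                gc.collect()

    if __name__ == "__main__":
        lock = mp.Lock()
        workers = [mp.Process(target=worker_loop, args=(lock, 100))
                   for _ in range(2)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()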


Seems quite useful, allowing the developer to guide the garbage collector in the right direction by carefully placing statements that tell it when it is (not) OK to run. But make sure to add comments describing the purpose of those statements, and how to profile your code to check that it's still working correctly. You don't want to accidentally stop garbage collection altogether, either.


This is a nice bit of progress and addresses a major concern about using PyPy in real-time systems.


I'd be curious to see if any of the work done here could be applied back to the main CPython project. I doubt it could happen immediately -- at least, not with the way GC is currently implemented -- but PyPy has been a source of innovation for CPython in the past (see: new dict implementation).


CPython already has a lower-latency GC than PyPy: gc.disable() already works and allows manual memory management when needed.

Reference counting allows you (if needed) to keep references to memory in your Python code and free them in the right spots.

This is PyPy becoming useful for a lot more production use cases, from web APIs with a latency SLA to audio and games. In many cases peak performance is not what matters; it's the minimum performance.
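
For example, the "keep references and free them in the right spots" pattern might look roughly like this on CPython (names here are made up):

    import gc

    gc.disable()         # no automatic cyclic collection

    # Keep strong references to per-request garbage so nothing gets freed
    # in the hot path; drop them later at a point we choose.
    deferred = []

    def handle_request(request):
        scratch = {"echo": request}   # stand-in for real per-request objects
        deferred.append(scratch)      # defer the refcount-driven free
        return scratch

    def idle_moment():
        deferred.clear()   # reference counting frees everything here...
        gc.collect()       # ...and the cycle collector handles the rest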


Refcounting comes with its own in-thread gc pauses whenever you exit a block or context and the local variables are collected.


Yeah. However, you have the option not to pause where it matters: you can control where the memory management happens. You can keep references to the memory and call gc.disable(); when you are ready, you let go of the references and re-enable the GC.

PyPy now lets you control where memory management happens, making it possible to control worst-case performance. For many production apps this is a big deal.


You can never prevent the refcount-driven collection in CPython at the end of a block (context); you can only prevent the GC that tries to break reference cycles. If your class does crazy things at destruction, like time.sleep(10), and you create an instance of it inside a function, then when that function returns you will pause CPython even if you called gc.disable().

You also cannot disable the minor collections in PyPy, only the major collections, but once the JIT kicks in PyPy can prevent some of the object churn by optimizing instances away.
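
A concrete illustration of the first point, on CPython:

    import gc
    import time

    class Noisy:
        def __del__(self):
            # Runs via reference counting when the last reference goes away,
            # regardless of gc.disable(); a slow __del__ pauses right here.
            time.sleep(1)

    def make_one():
        obj = Noisy()
        # obj's refcount drops to zero when the function returns

    gc.disable()
    start = time.time()
    make_one()
    print("returned after %.2fs" % (time.time() - start))  # ~1s despite gc.disable()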


Yeah. Avoiding slow things like classes, threads and adding time.sleep(10) is the trick.


But a new event could be submitted at any time.

This is certainly an improvement, but not a complete solution.


Ideally you have enough copies of the server process to handle the events that come when another process is running GC.

You already have to have enough of them to handle events while other workers are busy.


Relevant: "Blade: A Data Center Garbage Collector" (2015, https://arxiv.org/pdf/1504.02578.pdf)

Terrible title, but basically the same idea.


> Combining these two functions, it is possible to take control of the GC to make sure it runs only when it is acceptable to do so.

I'm conflicted on this. My gut tells me if I'm going to manually take control over the garbage collector I should reconsider my design decisions.

Disclaimer: I don't know the first thing about PyPy or Gambit Research, so presumably this is the right approach for them?


It's explained in the article that this solves one specific issue Gambit Research had: in some parts of their code they need to take action with very low latency (<10ms) and hence can't wait for the GC. This way they manually execute the GC in other sections of the code where timing requirements are relaxed.
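
If I'm reading the post right, the two functions quoted above are PyPy's gc.disable() and gc.collect_step(), so the pattern is roughly this (PyPy-only; see the post for the exact API):

    import gc

    gc.disable()   # on PyPy this stops automatic *major* collections
                   # (minor collections still happen)

    def react(event):
        pass                    # stand-in for the latency-critical work

    def handle_events(events):
        for event in events:
            react(event)        # latency-critical: no major-GC pauses here

    def quiet_period():
        # Run one incremental step of the major collection at a time we
        # control; call again in later quiet periods until it finishes.
        gc.collect_step()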


It's a fairly common thing to do for games as well, for example: disable the GC during all your code, then run it manually at the end of the frame. All you're doing is moving the GC runs to predictable points. You could even skip a collection if you've got a slow-to-render frame or two, evening out the spikes in framerate.
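
A sketch of that game-loop version (the frame budget and the skip threshold here are made up):

    import gc
    import time

    FRAME_BUDGET = 1 / 60          # seconds per frame

    gc.disable()

    def update_and_render():
        time.sleep(0.005)          # stand-in for real frame work

    def game_loop(frames):
        for _ in range(frames):
            start = time.perf_counter()
            update_and_render()
            elapsed = time.perf_counter() - start
            # Only pay for a collection when the frame finished with time
            # to spare; otherwise skip it and catch up on a faster frame.
            if elapsed < FRAME_BUDGET * 0.8:
                gc.collect()

    game_loop(120)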


The advantages of a typical GC are that it avoids the development costs of manual memory management and allows high throughput. The main disadvantage is usually latency spikes. Using this feature decreases the maximum latency spikes by orders of magnitude, with only a small cost in cognitive burden and throughput. If GC would have been a good tradeoff for you were it not for latency, there is only a sliver of design space in which using this feature would push throughput out of acceptable bounds. (And the cognitive burden is still way lower than with any other form of memory management.)


This is awesome!



