Huge pages are a good idea (evanjones.ca)
175 points by moreati on Jan 22, 2023 | 59 comments


Huge pages are an absolutely great idea. I will continue to complain about our eternal commitment to 4k pages, with all their pitfalls, which it appears we may be stuck with until the heat death of the universe. At the very minimum we could go to 16k pages, which are good for many reasons including, in particular, being able to have bigger VIPT cache sizes without needing an increase in associativity (and thus latency). Not the end of the world but a very solid win on top of the TLB wins.

But transparent hugepages continue to be a massive source of bugs, weird behaviors, and total system failures in my experience. I just got a bug report this week where a simple THP-enabled system spun out of control, with a kernel task locking the system at 100% CPU for minutes, reproducible with a 10-line mmap(2) test case. This is in combination with qemu/libvirt in a virtual machine, and it's possible the virtualization stack is just exposing bugs, but still: this is very well tested stuff! I'm not sure "Google enabled it fleet-wide, so it can be done" is very reassuring when most of us don't have fleet-ops/infra/kernel teams capable of dealing with this. The person who reported the bug said they started seeing odd behavior a month ago before boiling it down; it wasn't readily apparent at all. Is this just a massive footgun for our distro users? I dunno. Something that works in the p95 and then collapses horrifically in the p99 cases doesn't feel great. I try not to be superstitious about things like this, but yeah, it's weird.

Anyway. This reminds me I have to submit some patches to disable jemalloc in a few aarch64 packages so I can use them on Asahi Linux. 4k pages will haunt us until the end of time.


> Huge pages are an absolutely great idea. I will continue to complain about our eternal commitment to 4k pages, with all their pitfalls, which it appears we may be stuck with until the heat death of the universe.

I think it is useful to distinguish between merely larger regular pages (i.e. 16k or 64k pages on arm64) and extraordinary huge pages (2M on systems where regular pages are 4k). In the first case, the system uses these larger pages uniformly, as there is no smaller page.


You're completely right. But I need a place to launder^W air my related complaints, too. :)


If you think 2M pages are extraordinarily huge, what would you call 1G pages? :-)


I say they are extraordinary, because they are larger than regular page size as reported by sysconf(_SC_PAGESIZE).

Also, I meant "extraordinary (huge pages)", not "(extraordinary huge) pages".


Huge pages are not a great idea, increasing the page size from 4k is.


The part in the article about ARM64 Linux giving up on 64kb pages was particularly damning. Although, as it notes, Apple is at least using something other than 4kb pages.


I recently set up huge pages on my database server (MariaDB and Postgres) the recommended way, which required way too much rigamarole IMO. Add an option to the kernel command line to statically allocate a number of huge pages of a certain size. Create a group allowed to access huge pages. Configure that group's access to huge pages. Add mysqld etc. to the group. Mount the huge pages as a virtual filesystem in /dev/ for some reason. Add corresponding configuration to the database to tell it to use huge pages and where to get them.

This should all just be a single boolean flag in the database config telling it to use huge pages which it gets from mmap dynamically. Why is any of the filesystem, permission, static allocation malarkey necessary?


Huge pages need contiguous free physical memory. Without preallocating them at boot time, the chances of finding such a region to satisfy the allocation shrink as system uptime grows, especially for 1G pages, to the point where even services started later during boot might not get them, due to external fragmentation caused by 4k page allocations.

While I can see why special permissions are needed to grab them, the whole filesystem thingy is clunky as hell. I have no idea why they didn't put them by default in /sys or /proc.


> This should all just be a single boolean flag in the database config telling it to use huge pages which it gets from mmap dynamically. Why is any of the filesystem, permission, static allocation malarkey necessary?

FWIW, those bits shouldn't be necessary with postgres. If huge_pages is try or on, we'll specify MAP_HUGETLB (or (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT if huge_page_size is set). If mmap() fails we'll either error out (=on) or fall back to a non-huge allocation (=try).
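
For illustration, a minimal C sketch of that try-then-fall-back pattern (just a sketch of the behaviour described above, not PostgreSQL's actual code; the 64 MiB size is arbitrary):

  /* Try a huge-page-backed mapping, fall back to regular pages on failure. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  static void *alloc_shared(size_t len)
  {
      /* Explicitly huge-page-backed mapping. This only succeeds if the admin
       * reserved pages (e.g. via /proc/sys/vm/nr_hugepages) and len is a
       * multiple of the huge page size. A specific size could additionally be
       * requested with (log2(size) << MAP_HUGE_SHIFT) where headers provide it. */
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p != MAP_FAILED)
          return p;

      /* huge_pages=try behaviour: silently fall back to regular pages. */
      return mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  }

  int main(void)
  {
      size_t len = 64 * 1024 * 1024;   /* multiple of the 2 MiB huge page size */
      void *p = alloc_shared(len);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      printf("mapped %zu bytes at %p\n", len, p);
      return 0;
  }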

However, you do need to allocate huge pages at the system level for this to succeed. But that's indeed just a write to /proc (or /sys if you want more control): /proc/sys/vm/nr_hugepages, or /sys/kernel/mm/hugepages/hugepages-<size>kB/nr_hugepages.

One of the more awkward bits about the kernel config is that the values are specified in pages, so you need to do the byte-to-page conversion yourself :(

   echo $(( (32 * 1024*1024*1024) / (2 * 1024 * 1024) + 1)) | tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages && cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages


I remember when hard drives moved to 4k sectors. It seemed insane that they were still 512 bytes, and it seemed neat to have them the same size as memory pages. But 4k pages seem incredibly small today, and I would argue that any application suffering with larger pages is doing something wrong. To have performance problems you need to use large amounts of memory with small allocations, right? Not only that, but also free some of it to cause fragmentation?

It's strange to me that this is an issue so late in the game.


> I remember when hard drives moved to 4k sectors. It seemed insane that they were still 512 bytes, and it seemed neat to have them the same size as memory pages. But 4k pages seem incredibly small today

It's a problem with NVMe drives, especially large ones: we're just starting to see 4kn.

Even for a 4kn drive, the block size it reports (and the erase block size) is not the one used internally. There are some tools to infer the correct size using breakpoint detection in latency measurements.

There are also some simple rules of thumb: 2^16 (64k) for Samsung drives >1TB.


"4kn" = 4k native, in case anyone was wondering.


I was. Thanks!


After seeing this, I spent a half hour or so this morning and I was able to implement this in one of my programs, and now it runs to completion in about 85 percent of the time that it used to require. So, thanks!


Update: I believe I was fooling myself. I was running the new code, then the old code, and the new code was consistently beating the old code. But, when I reversed the order, and ran the old code first, then the new code, the old code beat the new code! Then I started monitoring CPU temperatures during the tests. Turns out the code to run first was starting with cooler CPUs and getting a little head start on the work with a longer time before thermal throttling kicked in. If I wait for things to cool down between runs, then the old code (no huge pages) consistently beats the new code (huge pages) but only by a very small amount. Benchmarking is hard.


That's why I typically randomize these things and run multiple times.


> The Linux kernel's implementation of transparent huge pages has been the source of performance problems.

I remember (admittedly years ago) spending a lot of time trying to debug server-crippling performance problems, ultimately learning that transparent huge pages were the cause.

Proceed with caution.


Things have changed since then; now you can do per-application, malloc-backed huge pages.


It's funny how this moves in circles: automatic merging into hugepages (THP) is added to the kernel. Some workloads suffer, so it gets turned off again in some distributions (not all, of course). But for many different workloads (some say the majority), THP is actually really, really beneficial, so support for it gets added to more and more mallocs over time.

It might have been more straightforward to teach the problematic workloads to opt out of THP once the problems were discovered.


The `madvise` setting of THP was never problematic for anyone. The Redis people made a huge stink about it because 1) they don't know what they are talking about and 2) the architecture of their program is really bizarre. Consequently it is now a widespread myth that you must set THP to `never` or else it is hazardous. But being widespread does not make it any less of a myth.
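
As a reminder of what the `madvise` setting means: THP only kicks in for regions an application explicitly flags. A minimal sketch (the 32 MiB arena is a made-up example, not any particular allocator's code):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
      size_t len = 32 * 1024 * 1024;   /* hypothetical 32 MiB heap arena */
      void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (arena == MAP_FAILED)
          return 1;

      /* Under the `madvise` THP setting, only regions flagged like this are
       * eligible to be backed by 2 MiB transparent huge pages. */
      if (madvise(arena, len, MADV_HUGEPAGE) != 0)
          perror("madvise(MADV_HUGEPAGE)");
      return 0;
  }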


I've built an efficient in-memory database. It has to be very low latency on batched queries. When I couldn't squeeze more perf out of it myself, I tried enabling large page support (dotnet on Windows) and instantly got a 10% perf increase.

It was not at all a pleasant experience due to lack of documentation as well as the flaky implementation. But I was surprised by how much overhead the TLB accounted for.


Sounds like a lot of the issues stem from the transparent aspect of huge pages: not all (un)mappings are rounded to the huge page size, and 4k pages are still supported. Has there been any consideration of non-transparent huge pages, where all that rounding does happen and all you get are huge pages?


macOS on Apple Silicon and iOS use 16kiB pages without lying to applications.

Linux has a MAP_HUGETLB constant, which you can pass to mmap() to opt your application into 2MiB or 1GiB pages. Unfortunately, last time I tried it, it required the system admin to enable support for that, which wasn't on by default (on Debian at least). So from the perspective of an ordinary application developer it's useless.


Same on Windows: your memory allocator can request 2MiB or 1GiB pages, and some allocators support this (mimalloc comes to mind), but the admin needs to change a group policy setting to allow it. It's still used by some products like SQL Server.
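
Roughly what that looks like on the allocator side, as a hedged sketch; it leaves out the AdjustTokenPrivileges call needed to actually enable the "Lock pages in memory" right in the process token:

  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      SIZE_T large = GetLargePageMinimum();        /* typically 2 MiB on x64 */
      if (large == 0)
          return 1;                                /* large pages unsupported */

      /* Fails with ERROR_PRIVILEGE_NOT_HELD unless SeLockMemoryPrivilege
       * ("Lock pages in memory") is granted and enabled for this process. */
      SIZE_T len = 64 * large;
      void *p = VirtualAlloc(NULL, len,
                             MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                             PAGE_READWRITE);
      if (p == NULL)
          printf("large-page allocation failed: %lu\n", GetLastError());
      return 0;
  }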


A few years ago, at least, GB pages were buggy under Windows. I'm not sure of the current situation.


I do not believe that transparent huge pages work with a 1GiB page size under any circumstances. Linux just flat out doesn't support that. The 1G TLBs on x86 are a completely separate hardware facility.


The default in current Debian is on:

  $ cat /sys/kernel/mm/transparent_hugepage/enabled 
  [always] madvise never
  $ grep HUGEPAGE /boot/config-5.10.0-20-amd64 
  CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
  CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
  CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
  CONFIG_TRANSPARENT_HUGEPAGE=y
  CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
  # CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set


> Unfortunately, last time I tried it it required the system admin to enable support in the system for that

Why??


NB This is even more important in VM scenarios, where second-level address translation means something needs to walk a guest-physical-to-system-physical map for each level of the guest-virtual-to-guest-physical map. So TLB locality becomes even more important, and using huge pages cuts down on a multiplier in resolving TLB misses.


People seem to be assuming that only malloc'ed memory is relevant. At least for Fortran, you want to allocate arrays on the stack (gfortran -fstack-arrays).

Apart from using larger/huge pages, you may want to take steps to minimize TLB misses. The Goto BLAS paper talks about that.


It's mind-warping terminology to call them huge pages. Memory sizes have increased 2000-4000x since x86 Linux picked 4 kB, so in relative terms 2 MB pages are still smaller than 4 kB pages were back then, and it should be a no-brainer to use them by default.


Memory sizes have increased, but so have the workloads that people run. It's not like our computers still only consume a few MBs of RAM and we have 1000x times more free memory than we used to.

Going from 4k to 2M doesn't waste a fixed amount of memory, it wastes a percentage of memory. So the amount of memory lost continues to scale with memory size. Losing 10% of your memory when you only have 32MB hurts just as much as losing 10% when you have 32GB, because we didn't get a 1000x increase in free RAM. Applications just use more RAM than they used to.

Now an increase to 8k, 16k, or even 64k might be defensible. And even that last one will draw some scowls from very serious people. But going straight to 2M pages by default will not be a good tradeoff. Especially since the current system with differently sized pages gets you most of the performance gain for a much more reasonable cost.


The percentage-of-memory idea isn't quite right. Consider what would happen in two computers, one with 4MB of RAM and one with 4GB of RAM. The former would waste a larger percentage moving from 4k to 2MB pages than the latter.


That's true, although the effect here is because the page size you picked (2MB) is close to the physical memory size (4MB).

Between a computer with 1GB and a computer with 64GB, any vaguely reasonable default page size you pick (whether that's 64k or 2M) is far from the total memory size, so the percentage wasted doesn't get much smaller as RAM continues increasing. I.e. the relative benefit of larger pages grows much slower than total system memory.

So in other words, my point is that default page size should not just grow linearly with memory size, the best tradeoff is something a little more complex. Saying that RAM grew 1000x, therefore page size should grow a lot is aiming too high too fast. Especially given that hardware already supports mixed page sizes (despite lackluster software support at the moment)


Note that more memory does not necessarily mean that everything is uniformly bigger, as there is a strongly non-uniform distribution of memory map sizes, and more memory most likely means that the large maps grow larger.

Take bash, for example: its RSS is about 4 MB, but it has 43 memory maps (on my system). One is >1MB, two are 512-1024 kB, most are very small. Each requires at least one page, so with mandatory 2 MB pages its RSS would be 86 MB instead of 4 MB.


Were there no downside it would be a default. Perhaps ask "isn't it a no-brainer to use large pages" rather than declare it is without saying why you think so.

Edit: I did a quick search. upside of less TLB thrashing, downside of potentially more disk thrashing as pages are paged in and out.



The problem with transparent huge pages is that they are pushed on applications that were developed with the expectation that memory management is based on pages of a fixed size; now they run on a system that reports 4k pages but sometimes substitutes 2M pages, in a not-really-transparent manner, causing unexpected memory consumption issues. If the default for THP were madvise-only, so that THP-aware applications could opt in, it would not cause such problems.


The easiest way to exploit THP, by far, is to link your program against TCMalloc and forget about it. Literally free money. Highly recommended.

https://github.com/google/tcmalloc


This is usually below the level of abstraction I am working on. I have questions. Before madvise, did people simply assume that a memory page is always 4kiB in size and build that assumption into so many programs? Is that why many programs break? Why did they assume that? Was there at least something like "size of int" or so around before madvise? And if so, why did they not use that?


I think most applications don't have any dependency on a specific page size. They use malloc (C) or new (C++) to allocate memory, which does not expose this constraint.

You need to care if you're using mmap directly to map files or other resources into the virtual memory address space. The default page size can be queried using, for example, sysconf() on Linux. I guess something like a garbage collector in a language runtime would also use mmap directly, as it's most likely to sidestep malloc/new.

An application would normally not use madvise, unless also using mmap for some special purpose.

It depends on the CPU architecture how flexible it is with different page sizes. For example, from what I recall, MIPS was extremely flexible and allowed any even power of two size for any TLB entry.

x86_64 only supports three different page sizes, 4 kB, 2 MB and 1 GB, and there are limitations wrt the number of TLB entries that can be used for the larger page sizes.

So, yeah, there are bound to be regressions if we just try to switch to 2 MB as a default, but I think it should be doable. Not all archs use 4 kB to begin with.


You can call sysconf(_SC_PAGESIZE) to get the page size. The problem with transparent huge pages is that you get the regular page size (4k) from sysconf(), but then the OS sometimes uses 2M instead.

Just switching to a larger regular page size (e.g. 64k) on platforms that support it would not have the problems associated with THP.
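
A tiny illustration of that distinction:

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* Reports the regular page size (4096 on x86_64), regardless of
       * whether THP later backs some regions with 2 MiB pages. */
      long page = sysconf(_SC_PAGESIZE);
      printf("regular page size: %ld bytes\n", page);
      return 0;
  }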


Is that a performance problem or a correctness problem?


We had an issue where memory consumption was several times higher with THP enabled. When you hit OOM, such a performance problem becomes a correctness problem.


The article has some examples of breakage, and these mostly involve an allocator other than the system allocator that hard-codes a 4k page size (Go, Chrome, jemalloc), at least for x86 and arm.

Why did they assume that? 4k pages are a feature of the memory management unit of the CPU. Optional support for large pages came to x86 with the Pentium in the mid-1990s. Presumably all x86 CPUs out there today have large page support, but the assumption of the 4k default is deeply ingrained.


The Redis example is a little different:

"fork() e.g. Redis: Calling fork marks all of the process's pages as copy-on-write. Then when a single byte on a page is modified, the page must be copied. Redis uses fork to create a read-only "snapshot" of memory, when writing a checkpoint to disk."

So before, that fork only forced copies of individual 4kB pages as they were modified. With huge pages, obviously, each copy is much larger.
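
A rough sketch of that amplification (made-up sizes, not Redis's actual code): after fork(), touching a single byte copies the whole containing page, so each copy is up to 512x larger when the region sits on 2 MiB THP pages.

  #define _GNU_SOURCE
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      size_t len = 256 * 1024 * 1024;              /* hypothetical 256 MiB dataset */
      char *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (data == MAP_FAILED)
          return 1;
      memset(data, 0xab, len);                     /* fault everything in */

      if (fork() == 0) {                           /* "snapshot" child */
          sleep(5);                                /* stands in for writing the checkpoint */
          _exit(0);
      }

      /* Parent keeps mutating: each touched location forces a copy-on-write
       * of its whole page: 4 KiB normally, 2 MiB if THP collapsed it. */
      for (size_t off = 0; off < len; off += 2 * 1024 * 1024)
          data[off]++;

      wait(NULL);
      return 0;
  }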


Interestingly, a common pattern: fork (and no exec) is also what Android does [0], not only to speed up Kotlin/Java app start-up but also to save on allocation of shared dex/bytecode (a couple dozen megabytes); code: https://archive.is/GMka3

[0] The state of ASLR on Android 5 (2015), https://archive.is/ADx65 (copperhead.co).


jemalloc does not assume a particular page size; it configures page size at build time. By default the configured page size is the actual page size (usually 4KiB), but optionally it's some power-of-two multiple of the actual page size. In the early days of jemalloc the page size was queried at startup rather than burned into the binary, but that dynamic flexibility was never of any use in practice, and it inhibited compiler optimizations.


There is also a great blog post that covers reliably allocating huge pages: https://mazzo.li/posts/check-huge-page.html


** EXCEPT FOR SPLUNK ** (or applications which perform I/O on small bits of data)

https://docs.splunk.com/Documentation/Splunk/9.0.3/ReleaseNo...

I have notes-to-myself which I can't find at the moment, but the TL;DR is: know what you're doing. If you disable it, make a note, so that if the server is re-purposed you don't hamstring an unwitting inheriting admin. I was up against the wall on a Linux indexer which was, for mysterious reasons, under-performing by a ridiculous percentage, with all kinds of crazy latency, until I disabled THP. This was years ago. If you disable it, do it via a systemd service so it can be discovered; people don't check /sys/kernel/ as a frequent place to look.

Edited: THP not TLB (the translation look-aside buffer).


If your application breaks due to THP, do not despair! There is prctl(PR_SET_THP_DISABLE), which allows it to be disabled on a per-process basis.
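
A minimal sketch of that opt-out (PR_SET_THP_DISABLE needs Linux 3.15 or newer):

  #include <stdio.h>
  #include <sys/prctl.h>

  int main(void)
  {
      /* Opt this process out of transparent huge pages; children created
       * with fork() inherit the flag. */
      if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
          perror("prctl(PR_SET_THP_DISABLE)");
      return 0;
  }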


The original title was: Huge Pages are a Good Idea


Indeed. Something significant is lost when the "Huge" is omitted. Memory pages being a good idea is essentially a tautology at this point, but the use of huge (much larger than 4k) memory pages being a universally good idea is much less obvious and more nuanced, and ultimately, potentially even untrue.


Sorry, I didn't mean to trim that.


You should still be able to edit the title? (And my guess is that you didn't actually trim that, but HN's automatic title editor removed it as a typical clickbait word.)


Seems odd to even counter clickbait so surgically, when other rules against any editorializing would catch that anyway.


Without the "huge", I was even more curious as I thought it was something about infinite scrolling vs pagination, so adding the word made it less of a clickbait title for me.

...and I suppose if you're not a fan of the trend of paginated content being shortened and split into more pages, then even with the "huge" added, it would still be clickbait as it wasn't what you're expecting.


It might accept a quoted "Huge Pages" — I don't know.

dang or hn mod might have to fix it



