Update: I believe I was fooling myself. I was running the new code and then the old code, and the new code consistently beat the old code. But when I reversed the order and ran the old code first, the old code beat the new code! Then I started monitoring CPU temperatures during the tests. It turns out that whichever code ran first started with cooler CPUs and got a small head start, with more time before thermal throttling kicked in. If I wait for things to cool down between runs, the old code (no huge pages) consistently beats the new code (huge pages), but only by a very small amount. Benchmarking is hard.
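For anyone wanting to avoid the same trap, here is a minimal sketch of the fairer procedure: rotate which variant goes first each round, and pause before every run. The names `rotated_orders`, `compare`, and `cooldown_s` are hypothetical, and the fixed sleep is only a stand-in for actually waiting until the CPU temperature drops, since reading temperatures is platform-specific.

```python
import statistics
import time
from typing import Callable, Dict, List


def rotated_orders(names: List[str], rounds: int) -> List[List[str]]:
    """Return one run order per round, rotating which variant goes first."""
    orders = []
    for i in range(rounds):
        shift = i % len(names)
        orders.append(names[shift:] + names[:shift])
    return orders


def time_once(fn: Callable[[], None]) -> float:
    """Wall-clock time of a single call, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def compare(
    candidates: Dict[str, Callable[[], None]],
    rounds: int = 6,
    cooldown_s: float = 60.0,  # assumption: long enough for the CPU to cool
) -> Dict[str, float]:
    """Median run time per variant, with rotated order and a pause before each run."""
    names = list(candidates)
    samples: Dict[str, List[float]] = {name: [] for name in names}
    for order in rotated_orders(names, rounds):
        for name in order:
            time.sleep(cooldown_s)  # crude cool-down; not a temperature check
            samples[name].append(time_once(candidates[name]))
    return {name: statistics.median(times) for name, times in samples.items()}
```

Calling `compare({"old": run_old, "new": run_new})` with your own two workload functions gives one median per variant. If the ranking still flips when you change `cooldown_s`, the difference is probably within the noise.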