
Agree on shorter, disagree on simpler. The hard part of understanding GPU code is knowing the reasons why the algorithms are the way they are. For example, why we use a split-k decomposition for a matrix multiplication, or why we are loading this particular data into shared memory at this particular time, with some overlapping subset going into registers.
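
To make that concrete, here is a minimal tiled-matmul sketch (illustrative only: TILE is an arbitrary choice and N is assumed to be a multiple of TILE), showing why the shared-memory staging is there in the first place:

    #define TILE 16

    // Each block stages a TILE x TILE slab of A and B in shared memory so
    // every element is fetched from global memory once per tile instead of
    // once per multiply-add.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // Walk the K dimension tile by tile. A split-k variant would also
        // divide this loop across blocks and reduce the partial sums at the
        // end, trading a second pass for more parallelism when K is large.
        for (int t = 0; t < N / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

None of that structure is visible once a DSL hides the tiling, which is exactly why the "why" is the hard part.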

Getting rid of the for loop over an array index doesn't make it easier to understand the hard parts. Losing the profiling and debugging tooling is absolutely not worth the tradeoff.
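
The loop being hidden is usually the boring part anyway. As a sketch (a saxpy-style kernel; the name and the grid-stride form are just the common idiom, not anyone's specific code):

    // The "for loop over an array index" that array DSLs elide. Hiding it
    // saves a few lines; it says nothing about tiling, occupancy, or memory
    // traffic, which is where the real effort goes.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            y[i] = a * x[i] + y[i];
        }
    }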

For me, I'd rather deal with JAX or Numba, and if that still wasn't enough, I'd jump straight to CUDA.

It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.



I don’t think it’s possible to write high-performance code without understanding how the hardware works. I just think that staring at code that coalesces your loads or swizzles your layouts for the hundredth time is a waste of screen space. Let the compiler do it, and when it gets it wrong you can bust out the explicit code you were going to write in CUDA anyway.
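
For a concrete example of the boilerplate in question, take a naive transpose (kernel name and shapes are illustrative): the reads below are coalesced, but each warp's writes land height elements apart. The standard fix is to stage a tile in shared memory, padded by one column to dodge bank conflicts, and that is exactly the sort of code a compiler can emit until it gets it wrong.

    // Naive transpose: reads are coalesced (adjacent threads read adjacent
    // elements of in), writes are not (adjacent threads write addresses
    // that are height floats apart).
    __global__ void transpose_naive(const float* in, float* out,
                                    int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            out[col * height + row] = in[row * width + col];
    }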



