Tangential question related to the example kernel: in GPU programming is it idio...

porridgeraisin · 2025-11-07T18:33:12 1762540392

They did allocate the output as empty (uninitialized), not zeroed:

>> out = torch.empty([m, n], dtype=x.dtype, device=x.device)

The accumulator, on the other hand, is initialized to zero because partial results get added into it:

>> acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
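To make the distinction concrete, here is a minimal pure-Python sketch of a tiled matmul. It is not Helion or PyTorch code; `tiled_matmul` and `GARBAGE` are names made up for illustration, with `GARBAGE` standing in for whatever an uninitialized buffer happens to hold.

```python
GARBAGE = float("nan")  # assumption: models the contents of uninitialized memory


def tiled_matmul(x, y, tile=2):
    m, k, n = len(x), len(y), len(y[0])
    # Like torch.empty: allocated, never zeroed.
    out = [[GARBAGE] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Like hl.zeros: must start at 0 because we += into it below.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                for i in rows:
                    for j in cols:
                        acc[i, j] += x[i][p] * y[p][j]
            # Plain store: each element of out is overwritten exactly once,
            # so its previous contents never matter.
            for (i, j), value in acc.items():
                out[i][j] = value
    return out
```

The output is only ever stored to, never read, so zeroing it first would be wasted work. The accumulator is read on every `+=`, so it has to start from a known value.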

> idiomatic

No. As far as I have seen, people generally try not to initialize memory if it's not necessary.

> overhead

There is the memory bandwidth cost, as you might expect. In addition, with high-level interfaces like PyTorch, writing torch.zeros(512, 512) launches a whole kernel (tens of microseconds) just for that line. So that's CPU -> GPU -> back to CPU, and then the next line goes to the GPU again to use that memory. If this is in a hot path, you make sure to avoid it. Ideally you want the second kernel to do the initialization itself, which is how you typically do it when you write CUDA C++ yourself. Helion, being a compiler, might be doing this optimization, but runtime-based torch clearly can't.
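A toy model of the launch-count argument, in plain Python rather than real GPU code. Everything here (`launch`, `launch_log`, the kernel functions) is a hypothetical stand-in, and it only counts launches; it says nothing about actual timings.

```python
launch_log = []  # hypothetical record of "kernel launches"


def launch(name, kernel, *args):
    """Record one launch, then run the kernel body."""
    launch_log.append(name)
    kernel(*args)


def fill_kernel(out, value):
    for i in range(len(out)):
        out[i] = value


def row_sum_accumulate(out, rows):
    # Adds into out, so out must already have been zeroed by a fill.
    for i, row in enumerate(rows):
        for v in row:
            out[i] += v


def row_sum_fused(out, rows):
    # Zeroes a local accumulator instead and stores the result once.
    for i, row in enumerate(rows):
        acc = 0.0
        for v in row:
            acc += v
        out[i] = acc


def two_launches(rows):
    out = [None] * len(rows)  # uninitialized allocation
    launch("fill", fill_kernel, out, 0.0)
    launch("row_sum", row_sum_accumulate, out, rows)
    return out


def one_launch(rows):
    out = [None] * len(rows)  # uninitialized allocation
    launch("row_sum_fused", row_sum_fused, out, rows)
    return out
```

Both versions produce the same result. The fused one skips the separate fill launch and the extra pass over `out`.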

saagarjha · 2025-11-07T18:31:33 1762540293

It saves a kernel launch and memory bandwidth for a fill kernel. If you’re going to overwrite the data anyway, why bother?