Tangential question related to the example kernel: in GPU programming is it idiomatic/standard to initialize the out array as zeros rather than empty? are the performance savings negligible?
No as far as I have seen they generally try to not initialize if its not necessary.
> overhead
There is the memory bandwidth point as you might expect. But additionally when using high level interfaces like pytorch, when you write torch.zeros(512, 512) in pytorch, it launches a whole kernel (tens of micros) just for that line. So that's cpu -> gpu -> back to cpu, and then it does the next line, where it goes to gpu again and uses that memory. So in these cases you make sure to avoid it if its in a hot path. Ideally you want the 2nd kernel to do the initialization itself. When you write cuda c++ yourself this is how you typically do it. Helion being a compiler might be doing this optimization, but runtime based torch can't clearly.