> I feel like I've said this a few times but for some reason you're not getting ...

> I feel like I've said this a few times but for some reason you're not getting it, not really sure what the hangup is?

I think the hangup is that you aren't understanding what a C++ future is, and why it solves this problem. Remember, I'm operating under the assumption that the GPU-programmer has figured out how to write a C++ future on the GPU.

I kinda have an overall idea of how one would be written, but I recognize that no one has done it yet. But I don't see any reason why a C++ Future couldn't be done. The real question I have is if C++ Futures can be implemented efficiently to make all of this worthwhile. (I've gone through a bunch of ideas in my brain... I think I've got one idea that is "efficient enough"... but many obvious implementations... like atomics or mutexes won't work well on a GPU. An implementation that is innately "task based cooprerative switching" is the only efficient implementation I can think of)

> Like I've been saying, you don't know what you want to q-search next until you've finished the last q-search. So you can't make a q-search queue because you couldn't add anything to the queue until the previous addition was evaluated.

Future<int> f = spawn(new q-search);

f.wait(); <---- Really simple actually. If "f" is blocked, then the current task goes into the "blocked" list. The first task from the "unblocked list" gets pulled off. If the "unblocked list" is empty, then this SIMD-thread idles until someone else in the NVidia warp (or AMD workgroup) calls f.wait() for another opportunity.

Yes, there is some divergence here, but the overall process of f.wait() is efficient enough that I don't think its a big penalty to have the warp stall.

You only run Q-searches that are on the "Running" list. You only need 32-running Q-searches to keep an NVidia warp busy, or 64-running Q-searches to keep an AMD workgroup busy.

Surely, there are at least 32 Q-searches that are unblocked at a time in a parallel chess search? Sure, not at ply 1, but the number of Q-searches that could be performed in parallel grows exponentially per ply, and also remember... an NVidia Warp will operate at 100% utilization with as little as 32-Q searches.

An AMD Workgroup is less efficient, needing 64 Q-searches before operating at 100% utilization. But this is still a small number in the scope of the exponentially increasing chess game tree.