Chess-playing neural network LC0 beats Stockfish in 100-game match

ganeshkrishnan · on May 28, 2019

It's really amazing and impressive what LC0 has achieved with almost no money and only community pooled training matches. They were struggling to pay $600 per month for the cloud server so have a baremetal server.

Pretty sure it can beat AlphaZero at this point of time.

umanwizard · on May 29, 2019

It's insane. Basically a hobby project and it obliterates other engines which are the result of 30+ years of computer chess research.

currymj · on May 29, 2019

I feel the opposite way: it's insane that Stockfish, running on CPUs and crude handmade heuristics, is able to keep up at all.

umanwizard · on May 29, 2019

Sure, Stockfish is a very impressive piece of technology, and I'm not trying to take away from that.

The way I see it, and what stands out as really cool to me is that we seem to be living through a phase transition to the third major era of chess, based on what type of player was strongest:

500 AD - 1997 AD: Humans.

1997 AD - 2019 AD: "Classical" minmax computer chess engines.

2019 AD - ???: Reinforcement learning engines.

It's extremely exciting to witness.

Just like the invention of the car didn't make horse races obsolete, I wonder if we'll see high-level computer chess competition split into two categories, one where using pre-trained models is not allowed.

jhncls · on May 29, 2019

Kasparov complained that Deep Blue had access to a myriad of his historical games, so it could prepare purposely just against one opponent. At the same time, Kasparov hadn't any access to train against Deep Blue's way of playing.

Stockfish had a similar disadvantage against Alphazero. With LC0 being more openly available, it can start improving its heuristics against this type of opponents.

ngcc_hk · on May 29, 2019

It is called zero because it only play with itself. Follow alpha go zero and leeks zero.

It has not studied human play. Unlike alpha go.

gamegoblin · on May 29, 2019

I agree with you in general, but to be perfectly fair, even the Zero-style AIs have a few hyperparameters such as block size and CPUCT value etc that can be tuned to produce engines that play differently. And if Stockfish is the measuring stick by which you gauge progress, you'll tune those values to produce good anti-stockfish nets.

That said, I think the effect here is probably negligible.

mot524 · on May 30, 2019

Except that LC0 wins decisively against all other engines too. You just hear about the wins vs. Stockfish because Stockfish is the best conventional engine so that's more newsworthy.

umanwizard · on May 29, 2019

Sure, but that doesn't really change my point, only the dates involved. You can debate how fair the Kasparov - Deep Blue match was, but even being maximally generous, by the mid 2000s it was completely impossible for a human to beat a computer in a serious, fair match.

mr_toad · on May 29, 2019

> it's insane that Stockfish, running on CPUs and crude handmade heuristics, is able to keep up at all.

Seems like another nail in the coffin for symbolic AI.

mindgam3 · on May 29, 2019

Only if you think skill in chess translates to actual intelligence. As a chess master, I tend to benefit from this myth, but it’s bogus.

The idea that computer chess is a stepping stone to common sense AGI is also fairly absurd to me.

Don’t get me wrong. Leela is a great project. Just see it for it is, an impressive demo showing how machine learning algs paired with massive compute has led to increased performance in a perfect-information game like chess. It’s not a “nail in the coffin” of anything other than possibly Stockfish’s reign as strongest chess player. No more, no less.

mda · on May 28, 2019

Pretty sure Alpha zero would destroy Leela.

ganeshkrishnan · on May 29, 2019

No, I don't think so. Remember that AlphaZero played stockfish 8 which is pretty outdated. Some of it's moves against SF was pretty amazing and you could be right but everything is speculative right now.

My only vote for LC0 is that it can win against the latest iteration of SF

sanxiyn · on May 29, 2019

Stockfish 8 evaluation was in 2017. In 2018, DeepMind re-did everything with Stockfish 9 and nothing changed.

https://deepmind.com/blog/alphazero-shedding-new-light-grand...

mitchellgoffpc · on May 28, 2019

Maybe so, but Deepmind isn’t training AlphaZero chess any further! Leela will hopefully just keep on getting better and better :)

mda · on May 29, 2019

Eh, If they want, they can train Alpha zero in one day Leela can be trained in 10 years. In any case indeed Leela's progress is impressive.

mot524 · on May 30, 2019

Training AlphaZero for more than 4 hours on DeepMind's system didn't improve its strength. Makes you wonder if that's as close to ideal chess as we're ever going to get...

mot524 · on May 30, 2019

I believe AlphaZero's CNN uses 40 residual layers and LC0 uses 20 layers. So in theory, AlphaZero could have twice as much chess knowledge. So that's a reason to believe that AlphaZero might be stronger than LC0. But there's no real way to know at this point.

nestorD · on May 29, 2019

I remember an annonce saying that the go branch of leela-zero had finaly reachen the level of facebook's network. It is arguably not Alpha zero but it shows that, if they have not reached the level of Alpha-zero yet, it is only a matter of time.

dragontamer · on May 28, 2019

LC0 and AlphaZero work on the following principles:

* GPU-based compute -- Deep Learning scales very well on GPUs, which are known to have superior parallelism over CPUs.

* MCTS -- "Monte Carlo Tree Search", which is a different tree-search compared to Minimax. MCTS seems closer to "human-like play", with regards to search order.

IMO, it is inelegant to use CNNs / neural nets as the evaluation layer. However, its difficult to imagine an evaluation function that would scale to GPUs. I'm curious how to use the SIMD-compute model to evaluate chess positions, but it doesn't seem like very much research exists on the subject.

GPUs have significantly higher compute power than CPUs. Chess positions can largely be represented by just a few 64-bit integers (!!). (An 8x8 bitboard is 64-bits, one bitboard per piece). So the compute-density of chess-evaluation is very very high, there just aren't any algorithms that are well suited to the GPU-architecture aside from deep learning right now.

-------

I really think MCTS is the superior search tree algorithm. The only issue is the evaluation function. Its certainly cool that Deep Learning / CNNs are being used as a "universal" GPU-function (self-learn / self-train so that a large CNN can "deeply" use a ton of GPU resources to make up its own evaluation function). But as far as I'm concerned, its a crutch.

----------

EDIT: If I had oceans of free time... I'd love to see what "translating" Stockfish's algorithms (ex: backwards pawn check. Bishop moves check. Paired Bishop advantage check. Etc. etc.) to a GPU would be like.

A GPU probably could evaluate a million boards for "backwards pawns". Then evaluate the million boards for "paired bishops". Then evaluate the million boards for "knight outpost". Etc. etc. In effect: calculate the millions of positions in parallel, by lining them up and batching them together.

Stockfish's parallelism is CPU-based, with lots of if-statements. Batching these checks into individual units would remove the if-statements and potentially make it amendable to SIMD-compute.

The tree-evaluation would try to evaluate ~1-million boards at a time, to keep the GPU busy. Maybe ~1-million boards per CPU-thread (somehow self-tuning the evaluation process for an optimal mix of CPU + GPU time used).

Again: if I had mountains of free time, lol.

stuartriffle · on May 29, 2019

I’m working on a brute force MCTS engine right now. It scales across SIMD (to 8x) and GPU (only CUDA so far).

https://github.com/StuartRiffle/Jaglavak

My GTX 780 can play about 1M games to completion per sec. Still under construction, but any advice is appreciated!

dragontamer · on May 29, 2019

I finished groking the CUDA stuff, looks good in general!

For those who want an overview: stuartriffle follows the MCTS algorithm, which has two parts. The MCTS is the general C++ code "TreeSearch" (which I haven't grok'd yet. The textbook UCT evaluation function is there, but there's some async stuff going on that I don't understand yet). For the evaluation function / heuristic, its a classic random-rollout code, which randomly plays Chess until it discovers a win, loss, or "draw" (which seems to be a "max moves == 0" case).

"Random Rollout" is the most naive implementation of MCTS, but its still an important demonstration showing that GPUs can indeed play chess within its shaders. The engine as a whole seems free of any obvious GPU-divergence issues, so I'd expect it to be reasonably efficient at random chess playing.

So chess engines CAN be ported to a GPU. From here on out, its a matter of experimenting with various methodologies to see what creates the most powerful chess player.

--------

I'm going to have to go through the MCTS part later. Its clearly some kind of parallel implementation that batches up nodes for MCTS evaluation.

The GPU probably can run some of the MCTS stuff faster than the CPU (ex: the UCT evaluation probably should be on the GPU: which could be done relatively easily by passing play-statistics to the GPU kernel. GPUs are very fast at square-roots and stuff, and you're already spinning up a GPU thread for the whole random-rollout thing, might as well evaluate the UCT value while you're still in GPU land. I think anyway)

But yeah, I haven't fully understood how you're doing the MCTS stuff yet. Its definitely interesting code though.

----------

EDIT: I'd personally want to figure out how to get the MCTS search into the GPU somehow. But doing the rollouts on the GPU is the "obvious" first step (at a minimum, you need to have a chess engine in the GPU before any further steps are possible), and its cool that you found a practical use of the random-play engine with the classic MCTS algorithm.

symplee · on May 29, 2019

Have you graphed elo to number of roll-outs? Where do the diminishing returns kick in?

How many elo points do you guess you'd gain if you improved to 10M games per sec? 10B?

ganeshkrishnan · on May 29, 2019

Would you be able to update the readme to add build instructions, how to run it etc?

HeWhoLurksLate · on May 29, 2019

Man, if I had enough money, I would totally ship you a fancier card (and new build, if necessary). That's really cool!

currymj · on May 28, 2019

CNN for Go at least seems justifiable. The use of a convolution captures the fact that a configuration of stones is mostly the same configuration of stones whether or not you shift it around the board a little bit.

This is much less true for chess, although it might still be true enough.

dragontamer · on May 28, 2019

There are a lot of configurations of pawns or knights that are highly desirable.

A "knight outpost" (a knight guarded by a a pawn) is a "local" board position that is well considered to be strong. Same with a bishop outpost. "Paired knights" (knights which can protect each other) can only rarely be taken by a rook or queen. Pawn-chains are good, while "backwards pawns" and "doubled pawns" are bad. "Rook behind pawn" is good, "Rook in front of pawn" is bad. King Opposition is good (https://en.wikipedia.org/wiki/Opposition_(chess))

Go is more uniform for sure. But there are plenty of opportunities for CNNs to work out in chess.

bonzini · on May 29, 2019

> "Paired knights" (knights which can protect each other) can only rarely be taken by a rook or queen.

Not entirely true for rooks, since taking 2 knights for a rook is advantageous. Paired knights are usually considered weak.

barrkel · on May 29, 2019

IMO it makes most sense to use NN for the evaluation layer. Human heuristics are unlikely to be as well tuned as a NN, and may encode unknown biases. A NN can cover the search space more dispassionately and discover relationships that don't occur to humans, and with a delicate relative balance of tradeoffs.

The evaluation of a position is to guess if it is more likely to lead to a final game outcome. The other way to determine this is to look further ahead, deeper into the search tree. The latter has significantly more information content than a heuristic. IMO it would almost certainly be a poor tradeoff of compute; the heuristic needs to be run over each possible move at each level of the search tree, so halving the cost of your heuristic is like doubling the amount of compute you use for tree evaluation. And you're talking about running heuristics over millions of boards. That's a lot of compute to leave on the table.

And I think NN are very good for boiling down large computations into a few serial matrix operations, efficiently packing in dependencies and associations. I don't think a rule-based heuristic could compete, algorithmically.

Don't forget that one of the great mysteries of human intelligence is that it works as well as it does while using neurons that operate incredibly slowly by comparison to digital computation. The can do it because NN bake in cached knowledge that permits good decisions to be made on very little data. Exactly what you want for evaluation.

nwallin · on May 28, 2019

> I'm curious how to use the SIMD-compute model to evaluate chess positions, but it doesn't seem like very much research exists on the subject.

I disagree very strongly with this statement. Evaluation heuristics have been studied extensively, but the overwhelming consensus is that the simpler the heuristic, the better. If you make your heuristic worse, but fast enough to give you +1 on your search depth, you will almost always be better. So our evaluation heuristics are almost trivial because thirty years of research has demonstrated that the trivial heuristics are the best ones.

SIMD doesn't help us for a few reasons. First, because bitboards are extremely sparse. You'll have at most 8 pawns in your pawn bitboards, 1 king in your king bitboards, and only under exceedingly rare circumstances will you have more than 2 pieces in your other 8 bitboards. Under no circumstances will you have more than 32 1s in your twelve bitboards, a density of one 1 per 24 zeroes.

Second, you have a lot of loops with depths determined at runtime. You won't know how many squares your bishop can move until you're evaluating the position. SIMD isn't very good at that.

> I really think MCTS is the superior search tree algorithm. The only issue is the evaluation function.

The issue is move ordering. In minimax + alpha-beta pruning + PVS + hash table move cache, you don't need a near-perfect heuristic to tell you which order to evaluate moves. If there are 20 legal moves, it's fine if the best move is the 4th-8th move you search, although alpha-beta is faster if you search the best move first, which is why PVS+hash table is such a strong improvement.

MCTS requires a good heuristic. If you have 20 legal moves, MCTS will evaluate maybe 2 of them, and if they're the 4-8th best moves, your AI will always be terrible, always. PVS+hash tables can't help you, because the only thing they can tell you was what move was good in the past.

There isn't a good way to sit down and get a human to write a good heuristic. NNs are really good at it though. IMHO, if you want MCTS, you either need NNs or something like it. There's no way to jury-rig classical evaluation functions into it. (by "classical evaluation functions" I mean everything from Deep Blue to Stockfish) Replacing NNs in this context will likely mean replacing NNs in every other context. This would be a good thing, a great thing even, but it would require a new paradigm in the AI/ML world, not just something unique to chess.

I actually agree with you about the elegance of MCTS and bitboards, and the inelegance of NNs and minimax. (and all of minimax's accompanying optimizations) But there's not really a good way forward.

dragontamer · on May 28, 2019

> SIMD doesn't help us for a few reasons. First, because bitboards are extremely sparse. You'll have at most 8 pawns in your pawn bitboards, 1 king in your king bitboards, and only under exceedingly rare circumstances will you have more than 2 pieces in your other 8 bitboards. Under no circumstances will you have more than 32 1s in your twelve bitboards, a density of one 1 per 24 zeroes.

You're thinking too traditionalist :-)

You're 100% correct in that the above is a "traditional" application of SIMD. However, I'm talking about a "non-traditional" SIMD approach. (or perhaps... "more traditional", closer to the original 1980s style SIMD that GPU programmers use in their pixel shaders)

NVidia calls it SIMT, but the concept existed since the dawn of the SIMD methodology.

I don't want to repeat what others have done. Give the first 2 pages of this paper a lookthrough for this "nontraditional" SIMD approach: http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/1...

My short summary is... the AMD Vega64 is effectively a 16384-wide SIMD unit. (or perhaps... 64 parallel CU clusters of 256-wide SIMD units).

What you're suggesting (which won't work) is to use the 16384 SIMD units to evaluate one bitboard. Of course this won't work, even though its the "traditional" approach to SIMD. Even breaking things down "per CU" so that each compute-unit's 256x shaders is "too much parallelism" to work one bitboard.

Instead, I suggest using the 16384 SIMD Units of Vega64 to evaluate 16384 different bitboards (!!). I'm very confident that Vega64's SIMD units are advanced enough to handle this kind of task... although the algorithmic details haven't really been figured out by anyone yet (well... that I'm aware of anyway).

---------

Every SIMD Shader of Vega64 has 256 (!!) 32-bit registers which can be used once per clock. My estimate is that it takes 25-registers (12x 64-bit boards for the white+black occupancy, + 1x32 register for misc purposes like enpassant, castling, etc. etc.). That leaves 231 registers left over for other calculations. (Occupancy 1). With luck, maybe Occupancy 4 or higher could be reached (limit of only 64x registers used), which would make RAM access more efficient.

The question I have is: how do you coordinate these 16,386 simultaneous threads of execution? How does this turn into an efficient tree search?

> Evaluation heuristics have been studied extensively, but the overwhelming consensus is that the simpler the heuristic, the better. If you make your heuristic worse, but fast enough to give you +1 on your search depth, you will almost always be better. So our evaluation heuristics are almost trivial because thirty years of research has demonstrated that the trivial heuristics are the best ones.

Exactly! Which is why I'm confident that these simple heuristics can be evaluated on each SIMD-shader of a modern GPU. 256x 32-bit registers leaves a LOT of room for these SIMD "shaders" to work with!

There's a lot of unknowns that I'd have to work with if I were to code something up in OpenCL (or more likely, ROCm). But a lot of the fundamental research and heuristics have already been done. Just no one has even tried writing them for a GPU yet.

The biggest issue is how to represent a search tree in a way that would be efficiently represented in a GPU. Neither MCTS nor Minimax have really been written in a GPU, although the concept of "iterative deepening", and other such simple searches HAVE been executed on a GPU-per shader basis.

I haven't really thought how the full details would work. But Vega64 has 4096 shaders operating at 1.5 GHz. If each shader spent 1000-clock cycles per bit-board evaluation, we're looking at 6.1 Billion nodes per second on a single GPU.

> hash table move cache

That wouldn't scale unfortunately: there's no way for 16386 different SIMD threads to access one shared cache. Maybe a shared cache would be split up to the 64-individual compute-units of the Vega64, perhaps in the small 64kB "shared memory region" (which can be accessed within 32-clock ticks worst-case of Vega64... 1-clock tick in ideal situations).

"Shared memory" can be shared with up to 1024 SIMD Threads. A 64kB hash table is extremely small however, giving you the idea of how little memory is available per SIMD-thread.

The idea of a "globally shared hash table" will have to be sacrificed in porting the code to a GPU effectively.

Or... at best... it would have to be shared on a per-1024 thread workgroup basis. There's no way 16386 threads banging on RAM (even a 512GBps RAM like HBM2) would scale. Heck, 1024-threads banging the 10,000GBps (aggregate) 64kB shared-memory region is probably still not going to scale very well... but that 64kB region is the only thing that is anywhere fast enough to do something like the traditional hash table move cache.

This 64kB "shared memory" is the GPU's greatest weapon to scale for parallelism, but there's so many chess algorithms that want to use it. The Rook-move "Sliding Piece Attacks" table with "Magic Bitboards" takes up 64kB already. (16kB for Bishop moves).

So figuring out which things will fit inside of that memory region is going to be a big problem for whoever writes the first GPU-based chess AI.

--------

> MCTS requires a good heuristic.

EDIT: Not really! The original MCTS go AIs searched randomly. "Random rollouts" was their heuristic.

"Random Rollouts" is how MCTS got its start. It was surprising how good MCTS performed with random results (and eventually, modern MCTS algorithms have CNN-based heuristics... but the original MCTS bots prove that "bad heuristics" is still good enough for MCTS).

781 · on May 28, 2019

If I understand correctly, you propose using GPUs so that traditional chess engines can search even more nodes per second.

Leela evaluates three orders of magnitude less nodes per second than Stockfish (70K vs 70M), yet it still won. You need to search smarter, not deeper, as this result shows.

dragontamer · on May 28, 2019

> You need to search smarter, not deeper, as this result shows.

This result pits a GPU with 10,000 GOPS worth of compute and 500GBps of memory bandwidth against a CPU with only 100 GOPS of compute and 50GBps memory bandwidth.

All I'm saying is: how do we test a "fair" comparison? Well... lets (somehow) rewrite the CPU algorithm to run on the GPU. Its an unsolved problem, maybe its completely unsolvable. But its at least a research-question worth pondering.

---------

The TL;DR is: I think its possible to port Stockfish's evaluation functions over to a GPU. However, there are numerous unsolved research problems that need to be solved if this were to happen. The hash-table needs to be reworked, the search tree needs to be reworked, "Lazy SMP" won't work on a GPU, etc. etc. Most of the "tree search" methods need to be completely rewritten from scratch to be GPU-based.

But the fundamental evaluation functions of Stockfish are brutally simple and look like they could work on a GPU shader to me.

781 · on May 28, 2019

Stockfish tests improvements by also playing random games:

> Changes to game-playing code are accepted or rejected based on results of playing of tens of thousands of games on the framework against an older "reference" version of the program

https://en.wikipedia.org/wiki/Stockfish_(chess)#Fishtest

Think about it, a human makes a small tweak, then they run a lot of games with it and decide if it's worth keeping or not.

That's basically a terrible implementation of a neural network's gradient descent.

Now imagine instead of a human coming up with a small tweak, you randomly search for it and take a holistic view of the board instead of micro-heuristics. Oh, you just invented AlphaZero :)

BTW, since chess is highly parallel, you can run Stockfish on a computer cluster. Using this approach you can equalize the GPU power and have a balanced computer power match (of course, it will be totally unbalanced power-consumption wise). I'm not convinced Stockfish will win with equal computing power.

dragontamer · on May 28, 2019

> Now imagine instead of a human coming up with a small tweak, you randomly search for it and take a holistic view of the board instead of micro-heuristics. Oh, you just invented AlphaZero :)

I'm fairly certain that the human brain does not work off of FP16 floating point numbers with back-propagated errors being calculated with differential equations to set the weights of our individual neurons. :-)

Artificial neural networks are fascinating self-learning machines. But remember: they're artificial. There's nothing "human" about LeelaZero, AlphaGo, or any other CNN. Especially because AlphaGo / LeelaZero are augmented with an exceptionally powerful MCTS search functionality (no human counts the number of positions they visit and "balances" each node... nobody does that. MCTS is an extremely powerful computer algorithm for search, also an artificial construct)

> I'm not convinced Stockfish will win with equal computing power.

Ehh? The results are 10-Leela / 8-Stockfish / 82 draws. It was an exceptionally close set of 100 games.

Note that Stockfish works with a global hash-table of chess positions it shares between threads. This methodology works with a "low" number of threads (ie: 16 threads, maybe even 64 threads). But it absolutely will not work at GPU-scale (~16,386+ SIMD threads on Vega64, or similar GPUs).

It is not going to be an easy job to "port" Stockfish properly to a GPU-based system, or even to a cluster of 100x racked up computers. How do you efficiently share a global hash table across 100x clustered computers?

I mean, you simply cannot. Stockfish simply isn't designed to scale that high. Stockfish is innately a single-node design that's constrained by the RAM.

---------

The fact of the matter is: modern systems need a higher-form of scaling. I think LeelaZero / AlphaZero are "cheating", in that they've found methodologies that allow the huge amount of GPU easily be used.

I think this is a wakeup call: that algorithms need to start looking at the GPU more carefully. Heavy compute definitely needs to start thinking about how to scale to 16,000+ threads and work on GPU-like systems.

781 · on May 29, 2019

You keep on missing the point.

Today's Stockfish running on a low end computer would still beat Stockfish from many years ago running on a much more powerful one, because it's evaluation function is significantly better.

Chess is not strictly about computing power, and neural-network evaluation functions are vastly better.

dang · on May 29, 2019

Would you please edit the uncivil bits out of your comments here? Lines like "You keep on missing the point" and "You know very well" just add acid to the mix and are against the site guidelines.

https://news.ycombinator.com/newsguidelines.html

dragontamer · on May 29, 2019

> You keep on missing the point.

I think you misunderstand. I absolutely understand your point. I'm explicitly rejecting your point. There's a crucial difference.

The evidence laid out in this test does not necessarily lead to the conclusion you have here. There are a number of confounding factors, that if I had more time... I'd like to investigate.

True, your point is ONE possible story for what is going on here. But alas, my instincts suggest something else is going on. I think CNNs have simply been extremely well optimized for the GPU platform, and that indeed, they are one of the few algorithms that run extremely well on a GPU.

I'm curious how a well-put together "classical" chess AI would work if it were ported to a GPU. I understand that no such chess AI has ever been written, but that doesn't change my curiosity.

-------

EDIT:

> Chess is not strictly about computing power, and neural-network evaluation functions are vastly better.

I just thought of a way that would test this assertion. Instead of porting Stockfish to a GPU, port LeelaZero to a CPU. Run the Neural Net on the same set of hardware, and see who wins.

That way, Stockfish keeps its (cannot be scaled) centralized hash table / lazy SMP algorithm, while LeelaZero runs at the same compute-power that Stockfish has.

fruit1234 · on May 29, 2019

It was 14 wins for leela and 7 wins for sf, which is pretty large for the level they are playing at. Anyway people have thought about using GPU for chess engine but it was difficult to make work. (https://chess.stackexchange.com/questions/9772/cpu-v-gpu-for...). GPU and CPU have fundamentally different architecture and comparing them using just ops per second without taking into account their capabilities is missing the point.

781 · on May 29, 2019

You know very well that LeelaZero was designed to run on a GPU, just like Stockfish was designed to run on a CPU.

You can also do the reverse, it's easy to naively make Stockfish run on a GPU, it will just be performing terribly bad since it's algos will not utilize the GPU properly.

dragontamer · on May 29, 2019

I mean, that's what needs to be done, for the test to work.

Either Stockfish's style of algorithm needs to be ported to a GPU (and I'm arguing its possible to do so efficiently. But a number of unsolved problems do have to be solved).

Or... Leela Zero needs to be ported to a CPU.

I'm not saying a naive port: I mean a port where the programmer spends a good bit of effort optimizing the implementation. That way its fair. Those are the two hypotheticals that can happen for a "fair" test.

I'm, personally, more interested in the case of Stockfish -> somehow ported to GPU, mostly because its never been done before. I mean, I don't want to do it, but if anyone ever did it, I'd be very interested in reading how they solved all of the issues. :-)

mot524 · on May 30, 2019

Implementing Stockfish's evaluation function on a GPU is almost certainly possible, but stupid. Stockfish is able to evaluate several million nodes per second. By the time it identified a position to evaluate and copied it to the GPU, it could have already finished evaluating it on the CPU.

But a GPU can evaluate many positions in parallel... great... not helpful here because you don't know which positions you need to evaluate.

This is a case of apples and oranges. CPUs are good at running conventional chess engines and GPUs aren't, that's just a fact and there's no way around it. GPUs are good at running some stuff like neural networks, so, great.

If you need to run a neural network engine (LC0) then run it on a GPU, if you need to run a conventional engine, then run it on a CPU. Which is what people are doing now. Trying to change this is nonsensical, and trying to argue about which has more compute power makes as much sense as arguing about whether a car can drive on water better than a boat can drive on land.

dmurray · on May 29, 2019

I'm hoping you're right and can show this in a couple of years as effectively a third way to build a world class chess engine.

I don't understand how anyone can tell you "AlphaZero beat Stockfish with a method that involves evaluating fewer positions, so any new engine should also evaluate fewer positions." Chess engine research is not finished: there are a lot of things still to try.

mot524 · on May 30, 2019

The problem is that tree searching is largely serial, in the sense of, you don't know the next position you want to search until you finish searching the previous position. So you can't just send a million positions to a GPU to evaluate in parallel because you have no idea which positions you'll want to evaluate.

So GPUs are pretty worthless for computer chess unless your algorithm evaluates positions with a deep CNN (which do run well on GPUs). It's not a question of how much time people have spent to investigate running chess algorithms on GPUs, it's a question of what kind of algorithms are better suited to CPUs vs. GPUs.

dragontamer · on May 31, 2019

Young Brothers Wait Concept: https://www.chessprogramming.org/Young_Brothers_Wait_Concept

Parallel searching is possible, but it just hasn't really been done on GPU hardware. Stockfish itself parallelizes to multiple cores (although its only really designed for less than 20 from my understanding).

-------

> The problem is that tree searching is largely serial, in the sense of, you don't know the next position you want to search until you finish searching the previous position.

That's not the issue. Searching the game tree in parallel is stupid easy. But this creates a lot of waste.

The chess minimax algorithm with alpha-beta pruning requires you to serially process nodes to know which nodes to "skip". There in lies the issue: of the three types of nodes (PV, CUT, and ALL nodes), CUT nodes have a best-case of two evaluations per CUT-node. While PV and ALL nodes have to have all of their children evaluated.

Roughly 1/2 of all nodes are "cut" nodes however, which means in the best case, you can double the search depth with the alpha-beta pruning algorithm.

As far as I can tell, the Young Brothers Wait Concept extends the minimax alpha-beta pruning algorithm to multiple cores, but it isn't as efficient as a fully serial processor.

Still, that's probably where I'd start if I were to seriously try to put this on a GPU.

-------

No matter how you look at it: a game tree grows exponentially. It only makes sense that exponentially growing trees would allow exponentially more work to be processed in parallel.

mot524 · on May 31, 2019

Yes, parallel search is possible, but inefficient. Searching with e.g. 32 cores already cuts efficiency in half, in most useful scenarios (i.e., where the best move changes), and it gets worse from there.

But regardless, it seems you have a fundamental misunderstanding of how GPUs work vs. CPUs. GPUs "cores" are designed to run the same instructions on huge amounts of data simultaneously. So they're great for e.g. Photoshop filters where you have to do the same operations to each pixel of an image, or neural networks where you have to do the same operations on each value of a matrix. What they can't do is have each core running its own sequence of branchy, serial code on its own data, which is what chess engines do.

There's a reason why CPUs and GPUs are different products, and Intel can charge $1500 for a CPU with 16 cores whereas you can get an Nvidia graphics card with 768 cores for $150. It's because these cores don't do the same things.

You're right that game trees grow exponentially and that's actually the reason why GPUs are worthless. You can get a GPU to execute every possible sequence of chess moves from a given position--people have done that, and it's phenomenally fast. But that has a branching factor of ~35, meaning that for a search of depth 8, you have to visit 35^8 positions = 2251 trillion. If you're doing a serial search with modern optimizations, the branching factor is closer to 5, which means you have to search ~390 thousand positions for a depth 8 search. So for this relatively shallow search, a serial algorithm is almost 6 million times more efficient than a hypothetical GPU algorithm. (And that discrepancy gets worse the deeper you search.) There isn't a GPU with 6 million times more cores than a CPU...

dragontamer · on May 31, 2019

> What they can't do is have each core running its own sequence of branchy, serial code on its own data, which is what chess engines do.

You're correct on one front, but mistake in an important case. GPUs only have 32x SIMD (on NVidia) or 64x SIMD (on AMD). Which means we only have to figure out how to scale the algorithm to ~32x SIMD to solve the thread divergence issue.

Each 32x (NVidia) or 64x (AMD) workgroup can be branching and looping on its own without any slowdowns.

I think you're underestimating the flexibility of GPGPU branching. Actually, I think everybody is. Its not an issue of solving the GPU divergence issue in general, its about solving thread-divergence at size 32 (NVidia) or size 64 (AMD).

As long as a set of 32-threads (NVidia) or 64-threads (AMD) take the same branches, then your utilization will remain high.

> If you're doing a serial search with modern optimizations, the branching factor is closer to 5, which means you have to search ~390 thousand positions for a depth 8 search. So for this relatively shallow search, a serial algorithm is almost 6 million times more efficient than a hypothetical GPU algorithm. (And that discrepancy gets worse the deeper you search.) There isn't a GPU with 6 million times more cores than a CPU...

I agree that a parallel search will always be less efficient than a serial search. There are more tricks to process in serial than in parallel.

But the "big gain" would be a work-efficient parallel algorithm for alpha-beta pruning. To some degree, it seems possible. Think of all the parallelism that can be fit in.

1. PV Nodes (Knuth Type 1) can be searched in parallel without any slowdown. Alpha-beta pruning requires all PV Nodes to be searched, so right here there's a major source of parallelism.

2. CUT Nodes (Knuth Type 2) need to be done in serial. However, there are millions of cut-nodes generated across an alpha-beta search. In theory, a GPU can batch up all of these CUT Nodes, and process them in parallel without slowdowns.

3. ALL Nodes (Knuth type 3) are also processed in parallel.

So serious question: has it been proven that a GPU cannot process Alpha-beta pruning in parallel? Or is it just that people haven't done it yet?

Not all tricks can be parallelized. But from the perspective of minimax + alpha-beta pruning, it seems possible that an algorithm (probably a very complex one) can be used to process all of the Type 1 / Type 3 nodes in parallel, while batching Type 2 nodes together and processing (different) Type 2 nodes in parallel.

Thread divergence would be O(3), since there are only three cases. All three cases have the same bulk code: they generate the potential moves of the node... the only question is if they spawn a new worker (in the case of ALL Nodes / PV nodes), or if they wait for Alpha/Beta values (in the case of CUT nodes)

Granted, I don't know how to implement a "future" in GPGPU land. But the general concept is pretty easy to see the parallelism if you imagine Alpha-Beta values as C++ futures and/or promises.

-------------------

A hypothetical GPU algorithm would be to process PV Nodes and ALL Nodes somehow in parallel. CUT Nodes can be evaluated in parallel, but must wait for the results of previous nodes to know if they are pruned or not. Still, the first result of any CUT Node must be processed, which leads to potential work that can be processed in parallel.

mot524 · on May 31, 2019

I think you misunderstand how "processing a node" works. It's a recursive algorithm. Even if you correctly identify an ALL node (which is not always easy) and say "hey process all these moves at the same time" then each move will result in a radically divergent search path that a GPU won't be able to handle.

Basically nothing in a parallel chess engine happens in lockstep. All the threads are always doing completely different things. One thread may be evaluating a position. One thread may be making a move. One thread may be taking back a move. One thread may be generating all pseudolegal moves. One thread may be generating captures. No thread is ever doing the same thing as another thread so there's no parallelism of the sort that can be exploited by a GPU.

You're likely looking at simplified diagrams of a search tree and seeing a circle that has lines that connect it to a layer of more circles and thinking "hey just do all those circles in parallel" but the way chess engines work is 100% different from that.

You're not the first person to have the idea to do alpha-beta search with a GPU. There have been many efforts, and several for chess specifically. They've all been kind of awkward and horrible and it's debatable whether or not you can even really call them alpha-beta. There are hundreds of people who make chess engines and it's not like they're all incompetent and never thought of using the GPU.

dragontamer · on May 31, 2019

> I think you misunderstand how "processing a node" works. It's a recursive algorithm. Even if you correctly identify an ALL node (which is not always easy) and say "hey process all these moves at the same time" then each move will result in a radically divergent search path that a GPU won't be able to handle.

I don't think there's as much divergence as you think there is. Someone down-thread already created a chess move-generator and evaluator: https://news.ycombinator.com/item?id=20035835

Its not the "chess" part that results in divergent thread execution, its maybe the "search part", at least with current algorithms.

Pawn, king, and knight movements won't have any thread divergence: these are pure bitboard manipulations without if-statements. Only sliding piece attacks (Bishop, queen, rook) seem complicated enough to cause divergence... but its mostly about iteration. One bitboard with 6 "bishops" (2x white bishop, 2x black bishop, 1x white queen, 1x black queen) will run diagonal sliding piece move generation 6x. But a bitboard with only 3 "bishops" will only run it 3 times. But I don't expect this kind of thread divergence to cause major issues, and most boards in a SIMD thread probably will have similar numbers of pieces.

> You're likely looking at simplified diagrams of a search tree and seeing a circle that has lines that connect it to a layer of more circles and thinking "hey just do all those circles in parallel" but the way chess engines work is 100% different from that.

Yes, that was my initial idea. But... the more I look into it, the more it seems possible. I'm just a little bit beyond that point, although its all just theory in my mind.

----------

Lets look at this implementation of alpha-beta: https://www.chessprogramming.org/Alpha-Beta#Negamax_Framewor...

Lets change it to the following:

    int alphaBeta( future<int> alpha, future<int> beta, int depthleft ) {
       if( depthleft == 0 ) return quiesce( alpha, beta );
       for ( all moves)  {
          score = Spawn(-alphaBeta( -beta, -alpha, depthleft - 1 )); // <--- Issue #1
          if( score >= beta ) // <---- Issue #2
             return beta;   //  fail hard beta-cutoff
          if( score > alpha ) // <---- Issue #3
             alpha = score; // alpha acts like max in MiniMax
       }
       return alpha;
    }

Issue #0: Futures/Promises -- No one has implemented futures / promises on GPGPUs yet. How can this be done efficiently on the GPU? Can the SIMD architecture be leveraged? Atomics could work, but they probably won't scale... I think something "innate" to the GPGPU architecture has to be done.

Issue #1: "Spawning" threads is pretty simple (but unconventional), even on a GPGPU. Create tasks that can be load-balanced on the SIMD GPU through the use of SIMD-Stacks and SIMD-queues to load-balance. Basically the same thing that GPGPU raytracing engines do (rays don't always "bounce", most rays miss all targets. Figuring out which rays hit a target and need to bounce... vs rays that miss and need to be terminated... is a solved problem for GPGPU programmers. Its basically task spawning). "Negation" (the -alpha, and -beta) shows that the future<int> needs to support negation, so that we don't block on that line. This line should execute in parallel.

Issue #2: score >= beta is a "hard synchronization" point, the only point where future::wait() needs to be called. However, most checks will be against a beta of -infinity (PV and ALL nodes would have a -infinity value). So this step could be skipped in a huge number of cases. In all other cases, a "blocked" thread will have to wait for its tree of future<ints> "score" value to be all finished executing.

Issue #3: This is effectively score = max(alpha, score). Across multiple iterations, this is max(alpha, score0, score1, score2, score3), etc. etc. Alpha starts as negative infinity: meaning a large number of nodes will always spawn.

The PV-nodes, CUT nodes, and ALL nodes don't need to be analyzed. You can see that this "task-based" definition spawns parallel-threads on PV-nodes and ALL nodes, while blocking on CUT nodes. This clearly becomes a (parallel SIMD) depth-first search.

---------

Of these issues, #0 (creating a future / promise) on the GPGPU is probably the hardest issue to solve. It is clear to me that this "future" needs symbolic execution (able to handle max(future, future), as well as negation -future). A difficult programming challenge, but not really insurmountable. It should be noted that if depth-first search is prioritized (if we can somehow guarantee that all "searchers" are on the left-most edge of all nodes), futures will take up a constant space roughly proportional to O(depth + size-of-parallel-cluster). (well, it seems clear to me anyway...)

The overall GPGPU task scheduling algorithm has been done in GPGPU Raytracers, even if people don't really realize it yet. http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

GPGPU Raytracers have their rays either hit a target, or miss a target. If they hit a target, they spawn a new ray (!!), if they miss, then they die. Those silly GPGPU Raytracer programmers have solved a very, very difficult issue but haven't been telling everyone about it. Lol. This clearly can serve as the basis of a task-scheudling / load-balancing core on GPGPUs.

-------

EDIT: Don't get me wrong. I recognize that I'm basically proposing that a GPU recursive task scheduler / mini-SIMD operating system with futures needs to be built to accomplish the job. Yeah, that seems difficult, but it doesn't seem impossible to me.

mot524 · on May 31, 2019

I know exactly how much divergence there is because I'm the author of a chess engine and I've spent weeks/months looking at traces of my engine's tree searches.

I think what you're not getting is what a chess engine search tree looks like. You're thinking it has nice layers of nodes, but really, most of the time, the engine is in the quiescence search, searching sequences of captures. The branching factor is extremely low and each branch terminates at an unpredictable depth. Instead of nice layers of nodes, imagine a lightning bolt that's constantly changing. You can't predict what you're going to have to do next.

GPUs would excel if you have to do the same thing to different boards all at the same time. But that's never what happens. Imagine you're in the q-search and you're at a node with two captures. Might as well search both at the same time? Okay, a GPU can make two moves at the same time and evaluate the resulting positions simultaneously no problem. Maybe one evaluation causes the branch to terminate and then that thread needs to undo the move, whereas the other branch needs to keep going and generate and search subsequent moves. This sort of divergence is the norm, not the exception, and it's impossible to handle in any sort of efficient way with a GPU.

It's great that people have made GPU software to visit all possible positions to depth N, and it's great that somebody is making a MCTS engine run on a GPU. Both are tricky to do, and significant accomplishments if they work correctly. But in terms of making an effective engine these are just the absolute first initial baby steps and they don't signify anything about the viability of the rest of the project.

dragontamer · on June 1, 2019

I appreciate the discussion.

> most of the time, the engine is in the quiescence search

That seems to be the best case for GPUs. Whenever a GPU SIMD thread finishes a quiescence search, just grab another position that is under quiescence search.

Frankly, the quiescence search part seems most ideally suited to GPUs out of all, as long as it is appropriate batched.

> Imagine you're in the q-search and you're at a node with two captures. Might as well search both at the same time.

Put them onto the q-search queue, which will be checked as GPU threads go idle.

The q-search problem is exactly identical to raytracers. GPUs don't know if a ray will bounce zero, once, or even 10 times. But that doesn't cause divergence because it's just about sticking rays onto a queue and evaluating them

Rays in raytracers can even spawn multiple rays in a bounce, such as subsurface scattering. This pattern is solved!

mot524 · on June 1, 2019

Like I've been saying, you don't know what you want to q-search next until you've finished the last q-search. So you can't make a q-search queue because you couldn't add anything to the queue until the previous addition was evaluated. I feel like I've said this a few times but for some reason you're not getting it, not really sure what the hangup is?

Alpha-beta in general is so sequential that basically all the algorithms to parallelize it focus on making each thread as independent and sequential as possible. For example, the PV-split algorithm (which generalizes into Young Brothers Wait) splits the tree as low as possible (eventually to the root node) so that all threads can act as independently as possible. There is no local parallelism to be exploited because every branch is radically different from the last branch, and dependent on the result of the last branch. Local parallelism is necessary for a GPU to be effective and conventional game tree search algorithms have none.

dragontamer · on June 1, 2019

> I feel like I've said this a few times but for some reason you're not getting it, not really sure what the hangup is?

I think the hangup is that you aren't understanding what a C++ future is, and why it solves this problem. Remember, I'm operating under the assumption that the GPU-programmer has figured out how to write a C++ future on the GPU.

I kinda have an overall idea of how one would be written, but I recognize that no one has done it yet. But I don't see any reason why a C++ Future couldn't be done. The real question I have is if C++ Futures can be implemented efficiently to make all of this worthwhile. (I've gone through a bunch of ideas in my brain... I think I've got one idea that is "efficient enough"... but many obvious implementations... like atomics or mutexes won't work well on a GPU. An implementation that is innately "task based cooprerative switching" is the only efficient implementation I can think of)

> Like I've been saying, you don't know what you want to q-search next until you've finished the last q-search. So you can't make a q-search queue because you couldn't add anything to the queue until the previous addition was evaluated.

Future<int> f = spawn(new q-search);

f.wait(); <---- Really simple actually. If "f" is blocked, then the current task goes into the "blocked" list. The first task from the "unblocked list" gets pulled off. If the "unblocked list" is empty, then this SIMD-thread idles until someone else in the NVidia warp (or AMD workgroup) calls f.wait() for another opportunity.

Yes, there is some divergence here, but the overall process of f.wait() is efficient enough that I don't think its a big penalty to have the warp stall.

You only run Q-searches that are on the "Running" list. You only need 32-running Q-searches to keep an NVidia warp busy, or 64-running Q-searches to keep an AMD workgroup busy.

Surely, there are at least 32 Q-searches that are unblocked at a time in a parallel chess search? Sure, not at ply 1, but the number of Q-searches that could be performed in parallel grows exponentially per ply, and also remember... an NVidia Warp will operate at 100% utilization with as little as 32-Q searches.

An AMD Workgroup is less efficient, needing 64 Q-searches before operating at 100% utilization. But this is still a small number in the scope of the exponentially increasing chess game tree.

dragontamer · on June 1, 2019

> Alpha-beta in general is so sequential that basically all the algorithms to parallelize it focus on making each thread as independent and sequential as possible. For example, the PV-split algorithm (which generalizes into Young Brothers Wait) splits the tree as low as possible (eventually to the root node) so that all threads can act as independently as possible. There is no local parallelism to be exploited because every branch is radically different from the last branch, and dependent on the result of the last branch. Local parallelism is necessary for a GPU to be effective and conventional game tree search algorithms have none.

I forgot to respond to this half.

Have you heard of Task-based parallelism? Pretty much any recursive algorithm can easily become parallel through the technique.

Alpha-beta tree searches very easily follow the task-based parallelism. Give this page a look: https://software.intel.com/en-us/node/506102

-------

The key element that ties it all together is the future / promise construct. https://en.wikipedia.org/wiki/Futures_and_promises

If you have a task-based parallelism model, with symbolically executed futures/promises (that... instead of waiting immediately on max(alpha, score)... where "alpha" and "core" are both futures... it can immediately return a symbol representing "max(alpha, score)" and carry on with the execution / evaluation. Lazily evaluating the future at a later time when alpha and/or score is ready.

Yeah, its a few advanced threading concepts here and there. But I've seen the primitives written in various frameworks. And from my understanding of GPU-programming, C++ Futures / Promises CAN be implemented in GPUs. Its just that no one has really done it yet.

From those perspectives, I keep looking at something as "sequential" as alpha-beta pruning and I don't see any major thread-divergence issues. Honest. I just don't think anyone has combined these 4 or 5 techniques together in one program yet.

In short:

* GPGPU programming / SIMD

* Task-based parallelism (like TBB): https://software.intel.com/en-us/node/506102

* Futures

* Symbolically-lazily executed symbols associated with the futures (only "Beta-cuttoff" absolutely requires a yield or stall. Everything else can be "lazily" executed at a later point in time, which should discover latent parallelisms in the program).

* Alpha-Beta pruning

-------

You keep saying "it cannot be done", but what I'm saying is "I've seen some new parallel primitives invented in the last 10 years. I think it can be done now". As long as someone ports these primitives over to the GPU.

EDIT: I have looked through the AMD Vega instruction set. Modern GPUs allow the programmer to know the program-counter, to set the program counter (aka a jump), and all that good stuff. Function-pointers, as long as a full AMD workgroup (64-threads) or NVidia warp (32-threads) do in fact work (!) on a modern GPU.

Yeah, there are issues like the CPU Stack (it won't translate to the GPU. So everything talked about here would have to be converted into iterative form). So I haven't solved all the issues yet. But the general thought process looks promising to my eyes

mot524 · on June 1, 2019

I've already spent a bunch of time explaining to you why this won't work. But hey, apparently you're the expert. If you want to reimplement Stockfish on a GPU such than it's faster than a CPU then nobody's stopping you. The source code is only ~10k lines and very well-written, clear, well-documented, and easy to understand. Go for it. You'll be famous as the first person to effectively implement alpha-beta on a GPU and as the author of the strongest conventional chess engine in history. So go for it.

dragontamer · on June 1, 2019

Regardless, I appreciated the discussion! Maybe I'll give it a whack. It all comes down to whether or not I could build "future" efficiently on a GPU.

I don't want to pretend I'm an expert at all. I've looked at symbolic manipulation and chess mostly as a hobby (since the Chess community has so many cool bitwise algorithms they've invented). And the GPU-community is small, but interesting to me. HPC programmers / C++ programmers in another bunch... information really isn't flowing as well between the "styles of compute", I'm just trying to connect a bunch of ideas that other people have thought up together.

mot524 · on June 2, 2019

Okay, as long as we're not arguing, I will tell you that any sort of fine-grained intra-thread communication is to be avoided like the plague in computer chess. Engines are able to search millions of positions per second... most multithreading frameworks (e.g., the ones that give you your "futures" and "promises") rely on locks that can halt threads for milliseconds at a time. It's absolutely disastrous for efficiency.

I believe Stockfish uses the idea of "Lazy SMP" for parallel search which literally means no intra-thread communication other than locks on hash table entries to avoid data corruption of the hash table. Anything more sophisticated has proven to be a loss. So any sort of discussion of tasks, futures, promises, or even Young Brothers Wait is going to be fairly pointless. Also, anything that relies on offloading data to the GPU for processing for more than a few milliseconds is going to be a net loss for obvious reasons. Meaning that even if you could evaluate a million positions in parallel, and you already (somehow) knew which positions you needed to evaluate, doing the evaluation on the GPU is probably still going to be a loss due to the time it takes to copy the positions to/from the GPU.

Really, if you were tasked to come up with an algorithm that was poorly suited for GPGPU computation, it might be computer chess.

That being said, it doesn't really matter. How much were you hoping to speed up Stockfish? AlphaZero still crushes Stockfish at 10:1 time odds. Even if you were able to make Stockfish 10x faster it wouldn't change the dynamic of CNN chess engines being superior to conventional engines.

dragontamer · on June 2, 2019

> Engines are able to search millions of positions per second... most multithreading frameworks (e.g., the ones that give you your "futures" and "promises") rely on locks that can halt threads for milliseconds at a time. It's absolutely disastrous for efficiency.

Yeah, that's what I mean as "efficiently create a future" on the GPU.

GPUs absolutely cannot spinlock, and atomics are way slower than CPU atomics. So mutex / spinlock / atomic based synchronization absolutely won't work.

So... what's my plan? Well... as I said, task-based cooperative multithreaded futures. I've talked about some ideas for how to implement it above... but I won't post another wall of text. Also: some advanced synchronization primitives are free on SIMD-based GPUs (ex: thread-barriers within a warp are literally free and compile down to a no-op). I think there's a rich set of efficient GPU synchronization primitives.

Lazy SMP works for Stockfish, but clearly won't work with 163,840 threads (Vega 64 at max thread occupancy). So I'd have to do something else. (163840 threads locking the RAM one-at-a-time to access a shared hash table? Probably not going to work)

> How much were you hoping to speed up Stockfish?

The Vega64 GPU has 4096 shaders at 1.5 GHz. 5k to 50k shader-cycles per node would give 122 Million nodes/second to 1220 Million nodes/second. Vega64 GPU also has 500 GBps of memory bandwidth of 8GBs of HBM2 RAM, with "64kB Shared Memory" per CU having an aggregate bandwidth of 9,000 GBps (which should lead to potentially better cross-thread synchronization structures within a workgroup)

Hard to say where a project would land, but probably within that range somewhere. It'd HAVE to land there to be successful (and hopefully closer to the 5k shader-cycles per node)

The only major instruction GPUs are missing is the pext and pdep instructions from x86. Sliding piece attacks may need a rework to be efficient on a GPU due to the very low amounts of L1 RAM available. Otherwise, GPUs support popcount, AND, XOR, etc. etc. that chess bitboard use.

BTW: Threadripper 1950x running Stockfish processes 2-million Nodes/Second, or roughly 32k core-cycles per node (4GHz 16-cores)). Which served as the basis of my estimate (+/- a magnitude)

mot524 · on June 2, 2019

I already told you that intra-thread synchronization is to be avoided but you're still talking about it (promises, futures, tasks, etc.) so I don't know what to say. Nothing you're saying is going to work like you think it is, but it seems like you're going to have to figure that out for yourself.

Clearly you want to talk more than you want to program. Maybe reimplementing Stockfish on a GPU is too daunting of a task. Why don't you get something simple like TSCP running on a GPU and see how far you get with that? TSCP is only about 2k lines of code and probably half of those lines are comments. With the state of GPU cores and compilers these days maybe you can get the C code running on a GPU core directly without much modification. TSCP also uses less than 64k of RAM so it should fit nicely in an Nvidia SM.

dragontamer · on June 2, 2019

> Clearly you want to talk more than you want to program.

And clearly you want to insult me, to the point where you've basically created an account on this website just to post what you've been posting. This isn't the first unnecessary swipe you've levied either. But... insults are fine as long as I'm getting useful information out of you.

Anyway, the insult to information ratio is off whack now. I guess its time for the discussion to die. I'm not here to argue dude, just discuss, but you keep wanting to turn this into an argument for some reason.

mot524 · on June 2, 2019

I told you that intra-thread communication will ruin chess engine efficiency, and in your reply, all you do is talk more about stuff that requires intra-thread communication (tasks, futures, promises, etc.). You might think that all I'm here to do is insult you, but from my point of view it seems like all you're here to do is ignore me, even though I'm the one with domain-specific knowledge. It's not a "discussion" if you're not going to consider anything I have to say.

dragontamer · on June 3, 2019

Look, I appreciate domain expertise and all. But seriously, level with me here.

Do you know how GPGPU raytracers work?

Rays are recursively defined: a ray starts at the screen (usually ~1000 points per pixel. A 1920 x 1080 picture with 1000 points per pixel is ~2-billion starting rays). When a Ray hits an object, it will spawn a new ray. When a Ray hits a light source (and the background is the most common light source) it will light up all the objects that were in its path (or its parent's rays path). The life of one ray may be very short (it hits the background), while the life of another ray may have 20+ bounces (if it hits a mirror, then a shiny object, then is subsurface-scattered to a 4th object, etc. etc.)

So we have an example that is both recursively defined (rays bouncing and spawning new rays), requires a bit of synchronization (each ray causes the 1920x1080 picture to light up in its own way), and is efficiently computed by GPUs.

You claim that I'm not listening to you. And of course from my perspective, you aren't listening to me. So give me a bone here. I've talked a LOT about GPGPU raytracers in the posts above, and its synchronization algorithm above already. Did you follow any of what I was talking about?

Or are you just deflecting to Chess Engines because that's what your expertise is? Because if you're honest about this discussion, you would have been talking about GPGPU Raytracers, and how its Raytracing Synchronization algorithm is applicable to the (similarly recursive) alpha-beta minimax tree (or I guess... from your perspective, why the two recursive definitions fail to line up).

--------

Summary: do you know how GPU Raytracers work. If so, then please reply in a way that demonstrates your knowledge. Because from my perspective, you've been avoiding the key point of my argument this entire time.

dagw · on May 29, 2019

I'd love to see what "translating" Stockfish's algorithms (ex: backwards pawn check. Bishop moves check. Paired Bishop advantage check. Etc. etc.)

Do you know if all these checks are documented somewhere (other than the source code)?

dragontamer · on May 29, 2019

https://www.chessprogramming.org/Programming

The "beginner chess engine" for the wiki is here: https://www.chessprogramming.org/CPW-Engine (focusing on readability instead of optimizations)

For the heuristics of stockfish themselves, see https://hxim.github.io/Stockfish-Evaluation-Guide/

mrob · on May 28, 2019

FIDE Candidate Master Kingscrusher (Tryfon Gavriel) has been posting analysis videos for some of these games on Youtube. LC0's win as black against the Trompowsky Attack was particularly impressive:

https://www.youtube.com/watch?v=rz51_wFgNrQ

ganeshkrishnan · on May 29, 2019

I have been following kingscrusher for years and I never knew his real name.

symplee · on May 29, 2019

Official website: http://lczero.org/

GitHub: https://github.com/LeelaChessZero/

Still a very active community project.

Didn't see the link posted here in the comments yet, definitely worth checking out. Most of the devs and serious contributors are active in the discord channel.

joe_the_user · on May 29, 2019

I cannot figure out what exactly is happening from the site. I see a table showing "Gull 3" (presumably using the LC0 enginer) with five wins and few other engines with 1-2 losses.

I can't see draws on the table. A lot of engines show no wins or losses.

All this matters somewhat considering that there was talk in AlphaZero match about whether the win percentage was realistic and whether the Stockfish was given it's full opening book. The thing is by the time you're 3000+ elo points, chess tends to be pretty drawish. One program might better than another but it can take while to get anything a draw.

bwasti · on May 29, 2019

I think that link expired and now shows the wrong games. Here's a link to the final game between LC0 and Stockfish https://cd.tcecbeta.club/archive.html?season=15&div=sf&game=...

ngcc_hk · on May 29, 2019

Some of the discussion seems forget that the lc0 use neural network. There is essential no logic and there is no mtcs in the core part. Just play.

And it is called zero ad it does not read any human game, like alpha go zero.

It is just play and play to train the neural network ... you may say it want to find a function by itself that play the game. And the function is in the weights.

mindgam3 · on May 28, 2019

Anyone have a direct link to the games? Current article link goes a generic TCEC landing page.

wces · on May 28, 2019

100 games: https://cd.tcecbeta.club/archive.html?season=15&div=sf&game=...

mindgam3 · on May 28, 2019

Thanks!

tim333 · on May 30, 2019

Anyone know how the LC0 vs Stockfish hardware compares? I believe a criticism of the Stockfish vs AlphaZero games was that AlphaZero had fancier hardware.

mot524 · on May 31, 2019

I believe LC0 was being run on dual RTX 2080 Tis and Stockfish was being run on "43 cores" which I take to mean a workstation with dual Intel 22-core processors. So you're talking about two $1200 graphics cards vs. two $3000+ CPUs. So you could make an argument that LC0 was running on inferior hardware, at least dollar-wise.

bonzini · on May 28, 2019

53-46, one game to go

Someone · on May 28, 2019

Last one is a draw. Weird ending, with one of them refusing to take a pawn because the resulting end game is in a table base and is 100% known to be a draw, and the computer evaluates the position where it doesn’t take the pawn as a 0.x% win and a 99.y% draw.

umanwizard · on May 28, 2019

https://cd.tcecbeta.club/archive.html?season=15&div=sf&game=...

Weird indeed. Can anyone explain why LC0 thought it had a chance? It seems obvious (thinking like a human) that white can't make progress: it can't move its rook off the h file without losing the pawn, and it can't promote its pawn without moving its rook. And it can't bring its king up to help the h-pawn without letting black's pawn promote.

LC0 evaluates the final position at 0.04, which I guess is close enough to 0 for it to agree to a draw, but I wonder why Stockfish is able to see that it's truly 0.00 and LC0 isn't.

Someone · on May 29, 2019

”it can't move its rook off the h file without losing the pawn”

Nitpick: I agree there’s nothing for white in that position, rook a6, b6 or c6 seem possible moves in the end position (if black takes the pawn, the rook gives chess on row 5 and takes black’s rook in the next move). From there, it could move the rook to row 5 to defend the pawn. If so, the rook can get away from the h file.

Also, if black’s king were to walk towards the black pawn, the rook has to move from the h file to give chess.

In general, LC0 seems to be way more optimistic in its scoring function than Stockfish. As an extreme, look at game 70. Stockfish sees it as a 100.0% draw after move 49, but play goes on until move 202 (!), with LC0 at some stage thinking it has a 25% chance at winning (around move 106)

umanwizard · on May 29, 2019

RE your nitpick: Fair point. I hadn’t seen the pinning tactic after 140. Ra6 Rxh5??? 141. Ra5+

As is probably obvious from that oversight, I am not a very strong player.

However, I still see no way for white to make progress. After 140. Ra6, black just walks its king towards the white pawn. White’s only options then are to give perpetual check with the rook, or to give up the pawn.

It’s plausible to me why lc0 doesnt see this — I guess it is just searching down random paths involving useless rook moves forever, and never gets to the end of the tree. I’m more curious about how Stockfish is able to prove it’s a draw.

Edit: after poking around the Stockfish code a bit, I think what's going on is that Stockfish sees that down every non-pruned path of the tree, either player has the ability to give check, so neither player can be ahead (basically each side has a perpetural check). Would love if any Stockfish expert reading this can comment on whether that's a valid understanding.

ClassyJacket · on May 28, 2019

I know nothing of this system other than reading this page, but my basic understanding is that Stockfish is more of a traditional algorithm that simulates possibilities and finds the most likely path to a win ('IF I move my knight here, they could do X or Y or Z...'), but LC0 is a neural network, which don't really process absolutes, just probabilities in terms of floating point numbers. It might be something to do with that.

nikki93 · on May 29, 2019

Stockfish's evaluation function is; though; ultimately a floating point accumulation of such decision processes. In fact the leaf level scores are themselves floating point.

eg. check out the scores here: https://github.com/official-stockfish/Stockfish/blob/master/...

... actually realized it uses an int typedef for the 'Score' type so it's "fixed-point" I guess.

ClassyJacket · on May 29, 2019

Sure, but it still makes sense that that number would still result in a flat zero - or be disregarded - since Stockfish is (presumably?) capable of evaluating and realizing the absolute 0 chance, no? I'm not claiming to have special knowledge of these algorithms but it makes at least intuitive sense to me.