Greatest chess players ever? Capa, Kramnik, Karpov, Kasparov, *in that order* (cuz 'puters don't lie!)
On May 9, 7:26 pm, (Dr A. N. Walker) wrote:
David Richerby wrote:
I don't *expect* anyone to do that. But anyone who might want to
check my results [or, in the present case, *G&B's* results] *can* so
modify their source code, and can thereby get exactly the same
results, even on different hardware [...]
But `reproducible' doesn't mean `anyone can reproduce exactly the same
data.' It means that anyone should be able to repeat the experiment
and produce consistent results. For example, the experiment of
rolling a die 100 times and calculating the mean score is
reproducible, [...],
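
The distinction drawn above, between getting exactly the same data and getting consistent results, can be sketched with the die example itself. This is only an illustration: the function name, the seeds and the number of repetitions are my own assumptions, not anything taken from G&B's experiment.

```python
import random
import statistics

def mean_of_rolls(n_rolls, seed=None):
    """Roll a fair six-sided die n_rolls times and return the mean score."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the rolls
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return statistics.mean(rolls)

# Exact reproducibility: the same seed gives the same data and the same mean.
first = mean_of_rolls(100, seed=1)
second = mean_of_rolls(100, seed=1)

# Statistical reproducibility: different seeds give different data,
# but the means cluster around the expected value of 3.5.
means = [mean_of_rolls(100, seed=s) for s in range(200)]
average_of_means = statistics.mean(means)
```

Here `first == second` holds exactly, while the entries of `means` differ from one another and merely scatter around 3.5.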
Right. But we have quite a good mathematical model of the
statistics of rolling dice; there is no such model for values of
chess positions. G&B's results are [or should be] *exactly*
reproducible, which doesn't mean that you *have* [or in the PP's
words that I *expect* you to have] done so, but if you have any
doubts about their results, you can. You could then use that as
a basis for all sorts of other experiments, such as using other
ply depths and other engines, or other conditions [such as seeing
what happens after (say) 30 seconds per position]. Would that
make a difference? I don't know, but if you have 36 computers
and a spare month available, feel free.
OK. But setting that aside for the moment: what settings do you use to
analyse and annotate your own games?
I would be prepared to bet it is nothing like as shallow as 12 ply
fixed + quiescence. I find even a 14 ply minimum sometimes misses things
that are intuitive to a human player going over the blundercheck-
annotated game. And that is in mere club level games annotated with a
much stronger engine than Crafty.
I agree with your original analysis that a fixed depth of 12 ply was
the least bad compromise for the data, time and kit available.
A better method would be to take the strongest program available
(i.e. Rybka) and have it score all the games, and only THEN apply
some selected algorithm to the standardized results.
And if I did the scoring again tomorrow, with a different "seed" to
the RNG, or with different contents in the cache/hash, or perhaps
after adding a new disc or RAM to my computer, let alone if you
tried to repeat my experiment, would you expect to see the same
results?
Well, if it has to be open source then Fruit 2.1 (~2780) might be
another alternative to try against Crafty (~2670). An extra 100 points
and a somewhat less materialistic evaluation would be closer to human GM
level play. Fruit 2.2.1 just about stumbled onto that tricky line that
Phil Innes sets so much store by engines not finding before I pulled
the plug.
I should hope so! At least, the results should be very close.
Perhaps. And perhaps someone can run the experiment, if they
have 36 computers and *two* months going spare. But one of the things
we have learned from computer chess is that intuition is a very poor
substitute for doing the experiment.
Agreed. But equally, when the experiment has a systematic error due to
using a relatively shallow (but reproducible) fixed-depth search to
score the moves played, it doesn't take much intuition to conclude that
an engine that cannot annotate club level games accurately at that
depth is completely out of its depth on superGMs.
It will penalise GMs who have formed plans extending beyond 12 ply if
there is no obvious gain made inside its quiescence horizon. And it
hardly ever sees material sacrifices for gains in positional advantage
or tempo.
expect to see dramatically different results from anything other than
increasing the size of the hash table (and, if necessary, the amount
of RAM) or the speed of the system,
So in other words you would be happy to see different results
if we ran the experiment again next year, with twice as much RAM, a
bigger cache, more disc, and the next version of the OS, even if it's
the same version of Rybka? Or do you just not see the value of an
experiment that *could* be re-run on *any* computer, from a 20-yo
PDP-11 [though it would take a long time ...] to whatever super-
duper system we might have 20 years hence [when it might run in
a few seconds rather than a CPU-year] and give identical results
as the basis for further experiments?
Exact reproducibility probably isn't so important here. Getting the
maximum accuracy of the move evaluation function for the limited
amount of time available is the key. Fixed depth does not do that.
[FWIW, I was recently running some 30-yo programs, partly out
of mere interest but also partly as a benchmark, and it was gratifying,
and a confidence boost in the compiler/OS, when results were identical,
and disturbing when some of the floating-point numbers in one test
were systematically different for reasons we have not yet managed to
track down.]
Usually something to do with rounding conventions or bugs in old trig
function implementations.
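
As a small illustration of how rounding alone can shift the last digits (not a diagnosis of the discrepancy described above), floating-point addition is not associative, so a compiler or library that merely reorders the same operations can change the printed result. The values below are arbitrary and assume ordinary IEEE 754 double precision.

```python
import math

# The same three numbers added in two different orders.
left = (0.1 + 0.2) + 0.3   # intermediate sum 0.1 + 0.2 is rounded first
right = 0.1 + (0.2 + 0.3)  # intermediate sum 0.2 + 0.3 is rounded first

# The results differ in the last bit, although they agree to within
# any sensible tolerance.
differs_exactly = left != right
agrees_approximately = math.isclose(left, right, rel_tol=1e-12)
```

A difference of this kind is harmless if results are compared with a tolerance, but it is enough to break a bit-for-bit comparison against a 30-year-old reference output.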
then the test is so totally
dependent on the initial conditions as to be worthless.
Very possibly it is. Perhaps it isn't. We won't know
unless someone runs the experiment. My own gut feeling is that
G&B's basic results are probably pretty stable, and would be
reproduced by any decent engine at any decent ply level, though
I would not be at all surprised if a few of the 70-80 centipawn
"blunders" turned out well at greater depth and a few non-blunders
turned out to be dubious. Swings, roundabouts. BICBW.
I don't think they are swings and roundabouts though. GM level games
are littered with precisely the sort of positions that chess engines
find really difficult to score accurately. And they usually occur at
pivotal moments.
Indeed. My take is that it is silly to think a computer (at this
point in time) can accurately determine who among the world
champions was "the greatest".
Then perhaps it's lucky that G&B weren't trying to determine that?
Then perhaps they shouldn't have written a summary claiming that this
is what they were trying to determine. ``Who is the best chess player
of all time? [...]
Firstly, "greatest" and "best" are far from synonymous, even
if you [as in the snippage] slip in a "strongest". There are factors
such as longevity, superiority over contemporaries, innovation,
triumph over adversity, competitiveness, ... to consider as well as
pure analytical chess-playing skill. And if you're going to quote
their initial question, you might perhaps have quoted also their
following sentence: "Chess players are often interested in this
question to which there is no well founded, objective answer, ...".
OK, perhaps the implication is that they should have stopped
there and then. But if historical Elo ratings are of interest, then
I see no reason why another objective measure of *something* need not
be. They *do* have an objective measure. It *does* seem that their
results correlate well with *some* quality that we can recognise in
the play of Capablanca, Petrosian, Tal, etc. Their methodology is
at least interesting, even if flawed.
Agreed. The experiment is worth repeating with a much stronger
engine.
The problem comes if we all
start to take it too seriously. It's a semi-amateur investigation
that resulted in a conference paper and a somewhat light-hearted
summary at a commercial chess site. I have no problem with that.
Nor do I. I think it has mostly found the players with the lowest
blunder rate, and done so fairly convincingly.
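
For what it's worth, here is a minimal sketch of what I mean by a blunder rate. It assumes a per-move centipawn loss is already available from some engine; the 75 centipawn threshold and the sample numbers are invented purely for illustration and are not G&B's figures.

```python
def blunder_rate(centipawn_losses, threshold=75):
    """Fraction of moves whose centipawn loss is at or above the threshold."""
    if not centipawn_losses:
        return 0.0
    blunders = sum(1 for loss in centipawn_losses if loss >= threshold)
    return blunders / len(centipawn_losses)

# Made-up losses for ten moves (engine's best score minus score of move played).
losses = [0, 10, 5, 80, 0, 120, 30, 0, 70, 15]

rate_at_75 = blunder_rate(losses, threshold=75)  # counts the 80 and the 120
rate_at_70 = blunder_rate(losses, threshold=70)  # also counts the borderline 70
```

The borderline 70 centipawn move is the sort of case that could flip either way at a greater search depth, which is why the choice of threshold and depth matters more than exact reproducibility.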
Regards,
Martin Brown