Greatest chess players ever? Capa, Kramnik, Karpov, Kasparov, *in that order* (cuz 'puters don't lie!)
On May 20, 7:20 pm, Ron wrote:
In article WHC3i.1755$xP.1292@trndny04,
"Chess One" wrote:
does the result make sense to strong chess players?
The problem here is what's known as confirmation bias.
Indeed. And now I think I know why their scoring methodology
using Crafty-12ply correlates so well with GM level play. It has
nothing at all to do with its strength or suitability for the task and
everything to do with its lack of adequate discrimination between
scoring the top N best moves in a complex position.
Elsewhere in the thread Andrew Walker suggested trying the "How Good
is Your Chess" scoring scheme against various chess engines. I
deliberately picked one that I thought might separate the goats from
the sheep. On reflection I think one including a nice long endgame
would be a much better choice.
The game chosen was N. Legky-E.Bacrot, French Team Championship 2007
(May 2007 edition Chess,UK)
It needs a few more people to do it with some other GM-scored/annotated
games to get some statistical significance, but the results
are *very* interesting. I also tried the weakest engine I can think of
to determine a floor.
Turing at 2min/ply (effective search 5ply + quiescence) made 35% of
the same moves as a GM!
All of the serious engines managed about 60-70% of the same moves as a
GM.
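For anyone wanting to repeat the count with another annotated game, here is a minimal Python sketch of the "same move as the GM" percentage. The function name `match_rate` and the engine choices in the example are invented for illustration; only the three GM moves come from the game fragment quoted in this post.

```python
def match_rate(engine_moves, gm_moves):
    """Fraction of positions where the engine's choice equals the GM's move."""
    if len(engine_moves) != len(gm_moves):
        raise ValueError("move lists must be the same length")
    if not gm_moves:
        raise ValueError("need at least one move")
    hits = sum(e == g for e, g in zip(engine_moves, gm_moves))
    return hits / len(gm_moves)


# Black's three moves from the game fragment quoted in this post.
gm = ["Re8", "Bf5", "Re7"]
# Placeholder engine choices, made up purely to exercise the function.
engine = ["Re8", "x1", "x2"]

rate = match_rate(engine, gm)
print(f"{rate:.0%}")  # 33%
```

Three moves are far too few to mean anything, of course; the point is only that the figure is a plain exact-match ratio with no credit for near misses.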
Crafty-12ply (and at deeper levels too) scores the top few moves
(sometimes 6+) as X +- 0.02.
It should be no surprise that one of those is the GMs choice of move.
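To make the lack-of-discrimination point concrete, here is a small Python sketch under my own assumptions: scores are in centipawns (so 0.02 pawns = 2), the move labels and numbers are invented, and "measured error" means best engine score minus the score of the move actually played, which is my reading of that kind of methodology rather than a quote from it.

```python
def near_best(scores, tolerance=2):
    """Moves whose score is within `tolerance` centipawns of the best move."""
    best = max(scores.values())
    return sorted(m for m, s in scores.items() if best - s <= tolerance)


def loss_for(scores, move):
    """Centipawns the played move gives up relative to the engine's top choice."""
    return max(scores.values()) - scores[move]


# Invented evaluations: a flat landscape and a sharply discriminated one.
flat = {"m1": 31, "m2": 30, "m3": 30, "m4": 29, "m5": 29, "m6": 29, "m7": -40}
sharp = {"m1": 120, "m2": 35, "m3": 30, "m4": 10, "m5": 5, "m6": 0, "m7": -40}

print(len(near_best(flat)), loss_for(flat, "m5"))    # 6 2
print(len(near_best(sharp)), loss_for(sharp, "m2"))  # 1 85
```

In the flat case almost any sensible move is charged next to nothing, so a player's average error mostly reflects how rarely they blunder outright, not how well they choose among the good moves.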
Other engines have more discrimination but are not always right in
their judgement. The scoring scheme penalised doing things the GM
thought was wrong and gave an 8 point bonus for seeing the brilliancy.
To my amazement none of the engines scored much more highly than I did
on this test.
Turing scored 21 = "strong club player" (but it was painful to watch
- couldn't find obvious winning combinations)
Crafty scored 34 = "county player"
CometB27 scored 39 = "national master" (~11ply; it was slightly
worse, 38, at 2min than at 1min)
Rybka scored 44 = "national master" (may have been penalised for
inhuman moves)
Shredder10 scored 40, 46 = "national master" (12ply & 2min results)
On at least one move the engines all appeared to have found a better
move than the GM annotator Daniel King (and were duly penalised for
it). And now the most striking thing of all.
All of the engines (without exception) found the same 3 key positional
moves impossible to get right.
r1br2k1/1p3pp1/p1p2q1p/2Qp4/1P1P3n/2N1P3/P3BPPP/R4R1K b - - 0 16
The game continued: 16. ... Re8 17. a4 Bf5 18. Qb6 Re7
All of these moves for black mystified the engines - they mostly see a
flat featureless landscape of similarly scored moves (exact values vary
each time you run the test) but it is especially flat at 12ply.
Only Rybka2.3.1, Shredder10 and CometB27!!! found the brilliancy
(extra 8 point boost).
The test agrees with what our intuition tells us, therefore it must be a
good and valid test. You use the test to check your intuition and your
intuition to check the test - it's perfectly circular reasoning.
But our intuition can be wrong.
It is a "correlation does not imply causation" problem. If you imagine
the most simple material balance at quiescent node evaluation function
and score GM play with that alone you will get a lot more than 50%
agreement! In fact it is very hard to find an engine that is weak
enough not to get at least a 33% agreement (perhaps Sanny's might).
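For reference, the static half of such an evaluation is tiny. The Python sketch below counts material straight from a FEN string using the conventional 1/3/3/5/9 piece values (my choice of values, and the name `material_balance` is made up); it does not include the quiescence search that would decide which nodes get evaluated.

```python
# Conventional piece values in pawns; the king is not counted as material.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def material_balance(fen):
    """White material minus Black material, read from the FEN placement field."""
    placement = fen.split()[0]
    balance = 0
    for ch in placement:
        value = PIECE_VALUES.get(ch.lower())
        if value is None:
            continue  # digits (empty squares) and '/' rank separators
        # Uppercase letters are White pieces, lowercase are Black.
        balance += value if ch.isupper() else -value
    return balance


# The position from the Legky-Bacrot game given in this post.
fen = "r1br2k1/1p3pp1/p1p2q1p/2Qp4/1P1P3n/2N1P3/P3BPPP/R4R1K b - - 0 16"
print(material_balance(fen))  # 0
```

Material is level in that position, so an evaluator like this one has nothing at all to separate the candidate moves there until something can be captured.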
If the 60-70% exact agreement shown in this game is roughly typical
for the middlegame tactical combinations with a few positional moves
the engines cannot grok then the overall 50% hit rate of Crafty12ply
may well be due to the engine being totally lost and confused in the
endgame.
You avoid confirmation bias by setting up rigorous standards for a test
before you run it, which was clearly not done in this case.
I will confess to a minor bias here in allowing the very weakest
engine, Turing, up to 10 minutes at times just to see if it would ever
latch on to the winning possibilities of certain tactical combinations
(the answer was still no so it didn't actually gain any benefit from
my indulgence).
I would like to see what FritzX scores on this test (or any similar
one)
Regards,
Martin Brown