May 16th 07, 03:50 PM, posted to rec.games.chess.misc, rec.games.chess.computer
Dr A. N. Walker
Greatest chess players ever? Capa, Kramnik, Karpov, Kasparov, *in that order* (cuz 'puters don't lie!)

In article .com,
Martin Brown wrote:
[...] So I'm guessing that when both sides think they
are winning [by something significant, not by 20cp or so], one
of them has overlooked something of tactical importance, which
is why, after a bit, it turns into a tactical win.

It started from the opening in this game. Shredder +0.90 Crafty -0.27
peaking at +1.40 vs -0.10 then converging a bit until the fateful
17. ... Be7 2.77 vs 0.6 then for a while the scores agreed before
again diverging to 1.24, 3.27.


Ah. Perhaps I have a misunderstanding about what you did
or said? I understood that you had played some fast[ish] games
between Shredder and Crafty, and then passed on to us Shredder's
analysis [at much longer time limits] of the game? So that, for
example, the analysis would have been/looked the same even if the
game had been a GM encounter at classical time limits or you vs
me at 5-min chess, or any other source? So were the black scores
not *Shredder's* evaluations rather than *Crafty's*? Otherwise, I
find the exact agreements, eg at 1.22 and later at 2.58, for several
moves very suspicious. And if so, then this is not an example of
both *sides* thinking they were winning, but rather of *Shredder*
thinking both sides were winning? [As you mentioned later, there
are parity problems in many engines that cause this, esp in gambits,
but if Shredder is particularly prone to it, it doesn't help its
case to be a reliable annotator.]

It was pretty clear in this game that
Crafty simply did not know which way was up!


Absolutely. Computers seem to be prone to that sort of
game, though. Once they don't understand a position, they tend
to go *really* pear-shaped.

For the sake of balance in a best of 3 engine match at this time
penalty was 1 win 1 draw 1 loss for each engine.


Don Beal used to say that you need matches of 100+ games
to find which engine is better -- he had cases where one side
was losing 17-0 or thereabouts but hauled back to win [and this
in the days before "learning"].
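
A rough sketch of why short matches say so little, under the simplifying assumption [mine, not Beal's] that games are independent and decisive, so the match score is binomial. The function name and the 1.96 factor for a roughly 95% interval are illustrative choices; draws would shrink the spread somewhat.

```python
import math

def score_standard_error(n_games, p=0.5):
    """Standard error of the score fraction after n_games independent,
    decisive games, each won with probability p (a simplifying assumption)."""
    return math.sqrt(p * (1 - p) / n_games)

for n in (3, 100, 1000):
    # Half-width of an approximate 95% interval around the true score fraction.
    half_width = 1.96 * score_standard_error(n)
    print(f"{n:5d} games: score fraction known to about +/- {half_width:.3f}")
```

On this model a 100-game match still leaves the score fraction uncertain by about ten percentage points either way, and a 3-game match tells you next to nothing.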

BTW is there a way to get the graphical display of time taken and
engine score shown in Chessbase window or does the game have to be put
into the playing window to see that info?


Pass. I've never entered games with that info in the
first place.

Remember that is Crafty working at roughly the same search depth
setting as was being used to judge the play of world champion chess
players. It may be a bit unfair to make it play the opening (where its
performance is very poor).


Ah. I assumed 8s/move wouldn't be enough to reach the
depth used by G&B [roughly 6h/game on 2.5GHz machines] ....

But that is still enough
to have some confidence in finding gross evaluation errors of 50cp or
more (which is what Crafty at 12 ply does).

Yes, but you still seem to be missing something. 100cp is a
pawn, and you can understand that very directly. 50cp is what? It
will matter if at some point we swap a 50cp advantage for a pawn-up
with 50cp compensation, but until then it's an arbitrary measure.

I was using that as an example.


Yes, but you still seem not to have understood! Look, suppose
some engine gives 1.23 as its evaluation. That means that somewhere
down the tree there is a position, reached by "best play" as far as
the current collection of static evaluations goes, which has a static
evaluation of 1.23. *That* evaluation is a sum of various factors --
+1.00 because we have an extra pawn, +0.17 because we control an open
file, +0.47 because of king safety, -0.13 because the opposing knight
is well-placed, +- this, that and the other, possibly including all
manner of complexity and joint factors, etc. Only the extra pawn is
"gold standard" currency. Everything else is there either because
BobH or some other programmer has decreed that an open file is worth
0.17 or because a "learning" program has currently settled on that
as the value. None of it is reliable [else we wouldn't need the tree
search at all], none of it seems to matter very much [or changing the
0.17 to 0.16 would dramatically change the strength of the program],
none of it relates very closely to how a GM would assess that position.
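
A minimal sketch of that sum, in Python. The four named terms and the 1.23 total are the figures above; `residual` is simply whatever "this, that and the other" would have to add up to, and the names are purely illustrative, not any engine's actual code.

```python
# Term values in pawns. Only the extra pawn is "gold standard" currency;
# the rest are numbers someone decreed or a learning program settled on.
TERMS = {
    "extra pawn": 1.00,
    "open file": 0.17,
    "king safety": 0.47,
    "opposing knight well placed": -0.13,
}

def static_eval(terms):
    """Static evaluation as a plain sum of weighted factors, in pawns."""
    return round(sum(terms.values()), 2)

named = static_eval(TERMS)
# Whatever the unnamed factors contribute to reach the quoted 1.23.
residual = round(1.23 - named, 2)

# Nudging the open-file weight from 0.17 to 0.16 moves the total by 0.01.
tweaked = static_eval({**TERMS, "open file": 0.16})
print(named, residual, round(named - tweaked, 2))
```

The four named terms alone come to 1.51, so the unnamed factors would have to total -0.28 to reach 1.23.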

The only merit of this whole scheme is that pragmatically it
works. You may recall the Beal&Smith result that a completely random
static evaluation works surprisingly well. So when you say that
Shredder and Crafty have a discrepancy of [eg] 50cp in some position,
what you mean is that Shredder has looked at many millions of lines,
99.999+% of which are utter rubbish by *any* standards, picked on one
line as best for both sides, chosen the leaf position in that line,
pulled a number more-or-less out of a hat for that position; that
Crafty has done the same [but almost certainly chosen different lines
as best in most of the positions]; and that the two numbers differ
by 0.5. The miracle of computer chess is that quite often the numbers
agree within 0.2 or so. But there is no useful, objective, value in
those numbers. Indeed, we already know that an evaluation of 1.23 is
wrong by either 1.23 [if the position is actually drawn] or by worse
than that if the position is won/lost and the "machine infinity" for
won positions is greater than 2.46. Go figure.
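
To make both halves of that concrete, here is a toy sketch in Python: a plain minimax over a hand-made two-ply tree, then the error arithmetic. The tree, its move labels and its leaf numbers are invented for illustration [only the 1.23 is taken from above], and real engines use alpha-beta with many refinements rather than bare minimax.

```python
def minimax(node, maximizing=True):
    """Return (value, line): the backed-up value and the moves leading to
    the leaf whose static evaluation it is."""
    if isinstance(node, (int, float)):  # a leaf: its static evaluation
        return node, []
    best_value, best_line = None, None
    for move, child in node.items():
        value, line = minimax(child, not maximizing)
        if best_value is None or (
            value > best_value if maximizing else value < best_value
        ):
            best_value, best_line = value, [move] + line
    return best_value, best_line

# Invented tree: our two candidate moves, each with two replies.
TREE = {
    "Nf3": {"d5": 1.23, "c5": 1.80},
    "e4": {"e5": 0.40, "c5": 2.10},
}

def eval_error(score, truth, infinity):
    """Distance between a reported score and the true game-theoretic value,
    where a won position is worth `infinity` (the "machine infinity")."""
    true_value = {"win": infinity, "draw": 0.0, "loss": -infinity}[truth]
    return abs(true_value - score)

value, line = minimax(TREE)
print(value, line)
print(eval_error(1.23, "draw", infinity=100.0))
print(eval_error(1.23, "win", infinity=100.0))
```

The root "evaluation" is just the number attached to one leaf, here the one reached by Nf3 d5. The error is 1.23 if the position is really drawn, and exceeds 1.23 for a won position as soon as the machine infinity exceeds 2.46.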

Looking more carefully at that game
there were long sustained periods where Crafty was more than 100cp off
the mark and about 10 moves where it was more than 200cp out (and in
the middlegame). This doesn't bode well for its ability to score GM
level play.


What matters is not that [esp as basically the difference over
those moves was whether Crafty in those positions was totally and
utterly lost or merely utterly and totally lost, and the G&B scheme
would have stopped counting by then], but whether Crafty/Shredder
mis-assesses the correct move ordering, and if so by how much.

[...] If Crafty12 is so rotten, it's been
amazingly lucky.

I don't think it is that rotten. Just that it misses a lot of the rare
but absolutely key GM moves and marks them down because it doesn't
understand them. It probably gets the 95% of the routine moves exactly
right, but it is the handful of other moves that make all the
difference.


If I am reading the annotations correctly, then in the game
you gave, Shredder and Crafty each played 16 moves out of the 41 that
were sub-optimal according to Shredder. That doesn't seem a handful
of other moves to me, and it makes it seem unlikely that a GM would
have played nearly all the moves, routine or otherwise, the same way
as Crafty. [Of course, most of the 16+16 were "in the noise", but
it still suggests that with a 10cp noise level, there is a lot of
scope for Crafty/Shredder to make quite different moves from GMs.
Indeed, G&B's Fig.7 shows most WCs playing the same move as C12
about 50% of the time.]

According to the metrics line Crafty was at average search depth 12.2
and Shredder at 13.1 (but in 1/3 the time) during this test match.


OK, but this is not helping your thesis! Summarising, we
now have that C12 deviates by around 0.35 from Shredder, by between
0.10 and 0.15 from almost all WCs, and by between 0.06 and 0.09 from
recent strong computers in matches vs humans, despite an expected
0.1 or so error from random noise. Accepting that Shredder/Rybka/
etc are technically stronger than Crafty, this nevertheless suggests
that Crafty is doing very well at emulating these other players and
engines, and by inference at assessing how good they are.

Can we define a set of rules then [...]


Something that might be interesting. There are books out
there with titles like "How Good is Your Chess" [and regular
articles in a number of magazines] where strong players have
annotated games with point scores ["Score 7 for Nxe5, 3 for Bg5,
-3 if you blundered by Re1, 1 for routine development by Nc3,
0 for anything else"]. We could set engines doing these tasks
at various rates, see how they score, see how they rate the
alternatives, and perhaps -- if someone would sink some time
and/or money into this -- get some GMs to comment both on the
original scoring and on the computer results.
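
A sketch of how the bookkeeping might go, in Python. The point values are the ones from the example annotation above; the function names, and the idea of one score table per position, are my own assumptions about how one would set it up.

```python
# Annotator's points for one position, from the example above.
# Any move not listed scores 0.
SCORES = {"Nxe5": 7, "Bg5": 3, "Re1": -3, "Nc3": 1}

def score_move(move, table):
    """Points the annotator awards for choosing `move` in one position."""
    return table.get(move, 0)

def score_game(choices, tables):
    """Total points for a sequence of chosen moves, one table per position."""
    return sum(score_move(move, table) for move, table in zip(choices, tables))

# An engine that picks Bg5 in this position earns 3 points.
print(score_game(["Bg5"], [SCORES]))
```

Running each engine at several time limits and comparing totals would then be a matter of collecting its chosen move per position and calling `score_game`.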

I still can't take centipawns seriously enough to want
to invest effort into tracking down 20cp discrepancies ....

--
Andy Walker, School of MathSci., Univ. of Nott'm, UK.
