| Tags: capa, chess, cuz, greatest, karpov, kasparov, kramnik, lie, order, players, puters |
#211
In article . com,
Martin Brown wrote: In positional terms, I trust my own judgement more than Fritz, so I'm really using the computer only for blunder-checking. If In that case it is certainly worth downloading and running something like Fruit2.2.1 (evaluation free for 14days) as a kibitzer to see the sort of things that you are missing. [...] What sorts of things do you think I am missing that Fruit [or any other strong engine] might show me? We are perhaps somewhat at cross-purposes, in that I'm primarily interested [when entering my own games] *only* in the stupid tactical things I've missed. If Fritz doesn't show it in 10 seconds, then it wasn't that stupid after all. [...] Roughly Crafty19.19 takes 1-3mins to reach 12ply in this mode but in 60s Shredder10 typically reaches 15-16ply in all but the most complex positions. OK, but different engines mean different things by "12 ply". [My own program would mean "anything from 6-ply upwards", as it has variable depth-reduction, depending how "interesting" a move is, and particularly boring moves count double; I know this is eccentric.] [...] The problem here is that Crafty is frequently out by more than 50cp on key variations and has been in all the GM games I have fed it so far. Hang about! The *true* value of any position is "won", "drawn" or "lost", so "out by 50cp" is meaningless except as an evaluation against some amorphous scale of "slight advantage", "definite advantage", "surely a winning advantage" and so on. If Crafty is getting the *material* wrong, that's probably serious, but otherwise 50cp simply means that Crafty has a different scale for positional edges. It's not, of itself, right or wrong *unless* it causes Crafty to lose a drawn position or lose/draw a won one. .... Admittedly the first two were engine showpieces but the second pair were randomly chosen high level games. 
You can see it happen most prominently in the longer game where it misses the crucial winning line and mis-scores a host of moves systematically wrong because it doesn't understand what is going on. ... But does it miss the crucial winning line because it has *tactical* shortcomings, or because it misunderstands how to play positionally? "Missing a winning line" sounds more like the former [or you might have said (eg) "misses the winning plan"]. How are you judging "systematically wrong"? Merely because a strong engine gives different numbers, or because [eg] Crafty gives the "wrong" number of centipawns to a positional feature? There is no objective meaning to be attached to "White is 1.23 centipawns ahead" other than "Rybka/Fruit/Crafty gives this as its evaluation". [...] I do think that a fair proportion of the "errors" that the G&B analysis says the GMs have made are in reality just the rms error of Crafty's evaluation which is something like 30cp multiplied by the number of times they do something that it doesn't expect. Possibly; and we won't know unless/until someone does the experiment. But in that case, the actual figures for most WCs of around 13cp/move, and less for the WCs who most of us would regard as the most "accurate" in their positional judgement and tactical awareness, are surprisingly low. Further, it's interesting that the strongest and best WCs, by reasonably common consent, are those whose judgement differs least from that of Crafty. It is time to turn the question around slightly. Can anyone find a GM level game where Crafty at 12ply avoids missing important winning lines and obtains reasonable blundercheck agreement to within say 20cp against any other top rated engine run for 60s/move? So far all the games I have tested have shown serious discrepancies (50cp). I don't think this is an interesting question *unless* we can produce an objective meaning [beyond Crafty/Rybka] of 20cp. -- Andy Walker, School of MathSci., Univ. of Nott'm, UK.
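The remark that different engines mean different things by "12 ply" can be made concrete with a small sketch. It assumes a simplified reading of the scheme described in the post: each move along a line is charged against a nominal depth budget, with "interesting" moves costing 1 and "boring" moves costing 2. The function name and the cost lists are hypothetical and are not taken from any real engine.

```python
def plies_reached(nominal_depth, move_costs):
    """Count how many real moves along one line fit into a nominal depth budget.

    move_costs holds the charge for each successive move on the line:
    1 for an "interesting" move, 2 for a "boring" one (an assumption
    made for this sketch, following the "boring moves count double" remark).
    """
    remaining = nominal_depth
    reached = 0
    for cost in move_costs:
        if cost > remaining:
            break  # the next move no longer fits into the budget
        remaining -= cost
        reached += 1
    return reached


all_interesting = [1] * 20
all_boring = [2] * 20
mixed = [1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1]

print(plies_reached(12, all_interesting))  # 12
print(plies_reached(12, all_boring))       # 6
print(plies_reached(12, mixed))            # 8
```

Under these assumptions a nominal "12 ply" search covers anywhere from 6 to 12 real plies on a given line, which is why a bare depth figure says little when comparing engines.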
|
#212
In article .com,
raylopez99 wrote: Don't confuse the PSEUDO-chess scientists and programmers answers on this thread with REAL answers. Keep in mind I program as a hobby, have an IQ of over 140, and am a successful and quite wealthy businessman. I've got to level with you. It always bothers me when somebody starts an answer to a simple question by attacking people who disagree with him and then puffing up his own irrelevant credentials. Now to get to the point of your questions: I don't know. Ok. But since you agree it's all intuition and hunches, don't you think you should be a little more open to the possibility of being wrong? Personally, I'm not convinced that 12 ply is anywhere near sufficient. I also think the folly of your insistence of normalization being sufficient is illustrated by the example that this analysis process would rate a program that beat crafty-at-12-ply as being worse than crafty-at-12-ply. Since all of the players evaluated would demolish crafty-at-12-ply, this strikes me as a near-fatal flaw. -Ron |
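Ron's objection, that a program able to beat the reference engine would still be rated below it, follows from how any "deviation from the reference" score is built. The sketch below is only an illustration with made-up evaluations and move names. It assumes the score is the average gap, in centipawns, between the reference engine's best move and the move actually played.

```python
def mean_loss(reference_evals, chosen_moves):
    """Average centipawn loss of the chosen moves, as judged by a reference engine.

    reference_evals: one dict per position, mapping move -> evaluation in cp
                     (hypothetical numbers, from the mover's point of view).
    chosen_moves:    the move actually played in each position.
    """
    losses = []
    for evals, move in zip(reference_evals, chosen_moves):
        best = max(evals.values())
        losses.append(best - evals[move])
    return sum(losses) / len(losses)


positions = [
    {"Nf3": 30, "Nh4": 10, "d5": 25},
    {"Rd1": 15, "Bxh7": -40},
]

# The reference engine always plays its own top choice, so its loss is zero.
reference_choice = [max(evals, key=evals.get) for evals in positions]
# Another player deviates; the judge cannot tell whether the deviation is better.
other_player = ["Nh4", "Bxh7"]

print(mean_loss(positions, reference_choice))  # 0.0
print(mean_loss(positions, other_player))      # 37.5
```

By construction no player can score better than the judge itself, even if the deviating moves would be preferred by a deeper search.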
|
#213
In article ,
JohnnyT wrote: Wow, that word. That is the key of the whole thing. Lack of blunder is by a long way, in my mind, and I think in many's, a long long way from the word "accuracy". And some of the questions had to do with #1 move correlation. Which again raises the question of "accuracy". And not of blunders. I think that Crafty-12 as an arbiter of accuracy, is a pretty tough row to hoe. The even bigger problem is leaping from "accuracy" to "greatness," which is the claim of some people. Accuracy may be a component of greatness (and, in fact, it certainly is) but I don't think very many people would consider it the sole defining quality of greatness. I think there are many other qualities, which I don't think even RayLopez can claim Crafty-12 is capable of measuring, which go into greatness. These include, but are not limited to: Combativeness, contributions to theory, originality, number of brilliancies, and performance under pressure. And I'd do this myself, if I had the tools to do so easily, but I'd be fascinated to see the Crafty-12 evaluations of Tal and Botvinnik based entirely on their first match. Does anybody have a way of setting up a computer to do this automatically? -Ron |
|
#214
On May 11, 11:06 am, raylopez99 wrote:
1) Would you feel equally confident if we only gave crafty 11 ply? 10? 8? 4? Where do you draw the line? What non-arbitrary criteria are you using to suggest that 12-ply is meaningful whereas 3 ply, obviously, would not be? 2) What objective criteria are you using to define "extremely close" such that you don't trust the computer's ability to rank players properly? I'm very curious to hear your answers to these questions. In truth, nobody in this thread really knows, and indeed further research is needed. But the burden of persuasion is on Camp #1 to make their case--that so called "positional sacrifice" positions are rather common in a game of chess and that chess is NOT largely tactics This is absurd. The burden of proof, if indeed there is one here, is on those who would maintain that a mediocre chess program can *accurately* rank the world champions. My view is that the greater the superiority of the chess engine over those it judges, the smaller this burden of proof becomes; the less important the dead-on accuracy of move-ranking becomes as a requirement to accurately rank the players relative to one another. An inferior program is good for one thing, though: spotting gross blunders in the games of the world champions. A perfect example would be the famous game where BOTH human players overlooked a simple tactic involving moves: 1. Q-h8+ (gives away the Queen) ...Kxh8, 2. Nxf7+ (capturing a pawn) ...K-moves, 3.NxQ, netting a pawn. A human is *far* more likely to make this oversight than even a mediocre program like Crafty_at_12 plys. (these are the assumptions behind their claims--I claim the contrary). History has shown otherwise. Nonsense. History has not yet shown that RL's ignorant assumption regarding tactics is anything other than just that: an assumption. The figure often quoted by GM Tarrasch (99%), was not an actual measurement, but only a way of making a point. 
For purposes of the present discussion, it would be far more useful to think of chess as "only" 90% tactics. Even here, it is not really the percentage which is the main point, but rather it is that tactics take precedence over many lesser things. It's akin to having nuclear missiles or, say, an aircraft carrier within range. Indeed, on the last point, Kramnik missed a mate in one last year. This was an anomaly; generally speaking, the world champions don't overlook mates-on-the-move in top-level play. Chess is largely tactics, and that's why it is fair to have a chess engine rate the champions. Ah, but "fairness" was never the issue here. What critics have focused on is the subject of *accuracy* (something RL obviously knows nothing about). You can make 30 brilliant "deep" positional moves in chess, have a clearly winning position, and still lose a chess game in a mate in one. Obsessing over GM Kramnik's fluke blunder, I see. The phrase "grasping at straws" leaps to mind. That is chess. A PC would score you poorly in such a game, even though you were "brilliant" up until your blunder Now that you mention it, GM Kramnik had already tossed away his win by that point; far from being rated as brilliant, I think his computer opponent would likely have ranked him as weaker than itself, since the human's moves did not as closely match its own quirks. What about Crafty? I imagine it would rank most other programs as superior to human GMs for similar reasons, including those who are objectively superior. (and perhaps unappreciated by the PC, though I have argued in this thread that PCs are in fact not so bad at rating positions that require positional moves, even exchange sacs). The question is not "are they bad" at rating positions, but rather, it is "are they good enough" to *accurately* rank the world champions. 
My view is that if they are good enough, it is probably programs such as Rybka and Shredder -- essentially, the ones rated at the top of the heap -- which are good enough for the job. And we need a decent sample size for this sort of approach to render meaningful data; at least one of the champs had only one single world championship match, of course, against but one single opponent; that is no way to do this sort of thing properly. In fact, Camp #1's arguments are better if we were trying to rate "correspondence chess" champions rather than OTB champions, since in correspondence chess tactics are much less important than deep positional moves. I wouldn't say that. It would be more accurate to say that in correspondence play, there are fewer gross tactical blunders and that in general, the tactics are deeper and better executed. In sum, it is a closer match to what the chess programs rate as perfect chess. But that was not the inquiry of the original article ranking of champions: it was for OTB world championship play. However, that said, I would not be surprised that even for correspondence chess players, rating such players with Fritz 5.31 at 5 seconds a move would give you a pretty clear indication of the best correspondence chess players, since good positional moves and good tactical moves are largely one and the same in chess Not true. In one of my current games at RedHot, I just chose to double my Rooks on the only open file, as opposed to snatching a free pawn and surrendering that file to the opponent, even though doing so would not make a whole lot of difference. The point is, grabbing at the maximum of material gain will most likely rate higher with a shallow search by a mediocre program, such as Fritz 5.31; yet my move is likely to result in an immediate resignation because it stomps out any imagined counter play and thereby underscores the fact that I am up the exchange for nothing and can win material almost at will. 
Suppose Crafty_12 ply penalizes either move as inferior to the other -- how does this style issue contribute to ranking the world champs *accurately*? (again, this goes to chess being 99% tactics). Personally, I think this obsession with the figure "99%" may be the root of the problem. If you can learn to accept that tactics are, let us say, higher in rank than positional play, but not so overwhelming as to merit such a figure as this "99%", then you will finally begin to grasp the issue. Are Generals and Colonels 99% of the military? Do airplanes and jets and missiles constitute 99% of the armed forces? No, but their greater weight may make it seem to be such. Suppose that GM Tarrasch had instead stated that "artillery is 99% of winning the war" -- would this mean that all other segments could then be dismissed as utterly irrelevant? Would you then sit down and just start counting up beans (here, artillery pieces)? To sum up, there are *at least* two problems with the approach taken: 1) sample size too small (except with GMs like Steinitz and Botvinnik) 2) 12 plys depth of search is likely insufficient -- help bot |
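The sample-size objection can be illustrated with the usual standard-error formula. This is a rough sketch, not a result from the thread: it assumes per-move errors are independent (which real games probably violate) and borrows the "something like 30cp" rms figure quoted elsewhere in the discussion purely as an illustrative input.

```python
import math


def standard_error(per_move_sd_cp, n_moves):
    """Standard error of a mean centipawn loss over n independent moves."""
    return per_move_sd_cp / math.sqrt(n_moves)


# 30.0 cp is an illustrative spread, not a measured value.
for n in (100, 400, 900):
    print(n, standard_error(30.0, n))
# 100 3.0
# 400 1.5
# 900 1.0
```

If two champions differ by only a few centipawns per move, a player with a single match on record may simply not have enough scored moves for the difference to stand out from the noise.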
|
#215
On May 14, 10:59 pm, help bot wrote:
In one of my current games at RedHot, I just chose to double my Rooks on the only open file, as opposed to snatching a free pawn and surrendering that file to the opponent, even though doing so would not make a whole lot of difference. The point is, grabbing at the maximum of material gain will most likely rate higher with a shallow search by a mediocre program, such as Fritz 5.31; yet my move is likely to result in an immediate resignation because it stomps out any imagined counter play and thereby underscores the fact that I am up the exchange for nothing and can win material almost at will. Suppose Crafty_12 ply penalizes either move as inferior to the other -- how does this style issue contribute to ranking the world champs *accurately*? Update: Contrary to my prediction that seizing the only available file with both Rooks would quite possibly result in a resignation, my opponent stubbornly insisted on "contesting" the file, thereby immediately *hanging* his Rook. My new, updated prediction? A resignation. (If it turns out that my opponent instead hangs his only remaining piece, a Bishop, I will swear off making predictions forever, and get a real job.) -- help bot |
|
#216
On May 14, 7:37 pm, (Dr A. N. Walker) wrote:
In article . com, Martin Brown wrote: In positional terms, I trust my own judgement more than Fritz, so I'm really using the computer only for blunder-checking. If In that case it is certainly worth downloading and running something like Fruit2.2.1 (evaluation free for 14days) as a kibitzer to see the sort of things that you are missing. [...] What sorts of things do you think I am missing that Fruit [or any other strong engine] might show me? We are perhaps somewhat I tend to like having a faster engine analyse the game at a deeper level. On paper there is little to choose between Fritz & Shredder but in reality they give a very different view of some positions where the strongest move is not a capture. at cross-purposes, in that I'm primarily interested [when entering my own games] *only* in the stupid tactical things I've missed. If Fritz doesn't show it in 10 seconds, then it wasn't that stupid after all. Or perhaps it was, but Fritz is blinded to it by a tempting-looking swapoff line. The sorts of things that it will predictably miss are the passive-looking single moves that set up a future forcing combination or longer term positional advantage. The one that Phil Innes challenged me with earlier in the thread being a canonical example. The key move there, Nh4, aiming for a weak spot on g6 where it becomes a major thorn in black's side, is beyond hope of Fritz ever seeing it. Rybka & Shredder both find it quickly and Fruit managed it after half an hour just before I lost patience with it. [...] Roughly Crafty19.19 takes 1-3mins to reach 12ply in this mode but in 60s Shredder10 typically reaches 15-16ply in all but the most complex positions. OK, but different engines mean different things by "12 ply". I agree that ply has a somewhat random meaning, but here we are talking of a notional full depth search with pruning to at least that depth with various extensions. It is in the choice of lines to extend that different engines can give radically different results.
Fritz & Shredder/Rybka are poles apart. Rybka doesn't show how far search extensions go. [My own program would mean "anything from 6-ply upwards", as it has variable depth-reduction, depending how "interesting" a move is, and particularly boring moves count double; I know this is eccentric.] I grant you that the notional evaluation displayed in blundercheck as n.nn/MM MM is decidedly +- 1 or 2. Presumably some miscounting of singular extensions or cache lookups. The problem here is that Crafty is frequently out by more than 50cp on key variations and has been in all the GM games I have fed it so far. Hang about! The *true* value of any position is "won", "drawn" or "lost", so "out by 50cp" is meaningless except as an evaluation against some amorphous scale of "slight advantage", "definite advantage", "surely a winning advantage" and so on. If Crafty is getting the *material* wrong, that's probably serious, but otherwise 50cp simply means that Crafty has a different scale for positional edges. It's not, of itself, right or wrong *unless* it causes Crafty to lose a drawn position or lose/draw a won one. If you run an engine engine match with the stronger engine penalised on time to give Crafty a chance you can watch as the game unfolds. Both sides claim to be winning for a while until one gets a deep tactical edge over the other. I did one last night which illustrates my point - here annotated here by the victor at 60s/move. [Event "AOI, Blitz:4'+2""] [Site "East Rounton"] [Date "2007.05.14"] [Round "1"] [White "Shredder 10"] [Black "Crafty 19.01"] [Result "1-0"] [ECO "D25"] [WhiteElo "9999"] [BlackElo "9999"] [Annotator "0.30;0.36"] [PlyCount "83"] [TimeControl "240+2"] {Intel(R) Pentium(R) 4 CPU 3.00GHz 2992 MHz W=13.1 ply; 354kN/s B=12.2 ply; 835kN/s; 2 TBAs} 1. d4 {Both last book move 0.30/16 12} Nf6 {0.36/12 27} 2. Nf3 {(Bf4) 0.27/15 11} d5 {0.32/12 26} 3. c4 {(e3) 0.27/16 11} dxc4 { (e6) -0.26/12 25} 4. e3 {(Nc3) 0.34/14 20} b5 {(Bf5) -0.21/12 24} 5. 
a4 { 0.79/14 13} c6 {-0.27/12 24} 6. axb5 {(Be2) 0.90/13 5} cxb5 {-0.11/13 23} 7. Nc3 {0.72/14 17} Qb6 {(Bd7) -0.26/12 23} 8. b3 {(Ne5) 1.01/13 12} e6 { (b4) -0.15/11 22} 9. bxc4 {last book move 0.81/13 15} b4 {(Bb4) 1.22/16 22} 10. c5 {(Qa4+) 1.22/16 14} Qb7 {1.22/14 21} 11. Rb1 {1.37/15 8} Nc6 {1.22/15 21} 12. e4 {(Bc4) 1.22/13 13} a6 {(Be7) 1.51/14 20} 13. Bc4 {(Bf4) 1.33/14 22} Qc7 {1.33/14 19} 14. Ne2 {(e5) 1.17/16 7} Nxe4 {0.66/14 21} 15. O-O {0.66/13 5} f5 {(Bb7) 1.79/15 19} ({Shredder 10:} 15... Bb7 16. Bf4 Qd7 17. Qb3 Be7 18. d5 exd5 19. Bxd5 Qf5 20. Bxc6+ Bxc6 21. Nfd4 Qxc5 22. Nxc6 Qxc6 23. Rfd1 {0.66/13} ) 16. Bf4 {1.65/13 11} Qd7 {2.01/15 24} 17. Bb3 {(Ne5) 1.77/13 6} Be7 { (Qb7) 2.58/16 18} ({Shredder 10:} 17... Qa7 18. Ba4 Bd7 19. Qb3 Qb7 20. Bxc6 Qxc6 21. Ne5 Qb5 22. Nxd7 Qxd7 23. Qxb4 Kf7 24. Qc4 Be7 {1.77/13}) 18. Ba4 { 2.58/14 5} Bf6 {2.58/16 21} 19. Rxb4 {2.58/14 6} Nxb4 {2.58/16 17} 20. Ne5 { 2.58/18 7} Bxe5 {2.58/18 17} 21. Bxd7+ {2.58/18 4} Bxd7 {2.58/18 5} 22. Bxe5 { 2.58/16 6} O-O {2.58/16 16} 23. f3 {2.55/17 3} Nf6 {2.55/15 16} 24. Nf4 { (Qd2) 2.51/15 5} a5 {2.59/15 15} 25. Qb3 {2.58/14 5} Ra6 {(Rfe8) 2.87/15 15} 26. Kf2 {(Re1) 2.86/14 7} Nfd5 {(Kf7) 2.86/14 15} 27. Nxd5 {2.96/16 1} exd5 { 2.97/16 14} 28. Ke3 {(Rc1) 2.96/14 3} Rg6 {(Bb5) 3.15/16 14} 29. Rg1 { (g3) 2.96/15 3} Bb5 {2.98/14 14} 30. g4 {(Kd2) 2.63/16 4} Re8 {(f4+) 3.06/14 14 } 31. h3 {(Kd2) 2.93/14 4} Bc4 {(fxg4) 3.65/16 13} 32. Qa4 {3.62/14 2} Nc6 { 5.63/19 15} ({Shredder 10:} 32... Rf8 33. Rb1 fxg4 34. hxg4 Nc6 35. Rb7 Rc8 36. Rb6 Kh8 37. f4 {3.62/14}) 33. gxf5 {5.63/17 2} Rxg1 {5.63/17 13} 34. Qxc6 { 5.63/15 2} Re1+ {(Kf7) 7.00/17 13} 35. Kf4 {7.12/18 3} Kf7 {7.00/18 12} 36. Qc7+ {(f6) 7.00/20 2} Re7 {7.00/18 37} 37. Qxa5 {7.00/16 2} Rc1 { (Rg1) 9.36/17 12} ({Shredder 10:} 37... Rg1 38. c6 g5+ 39. fxg6+ Rxg6 40. c7 Rc6 41. Qa7 Ba6 42. Qa8 Ree6 43. Qd8 h6 {7.00/16}) 38. Qd8 {9.41/16 2} Re8 { (Bb5) 16.78/17 22} ({Shredder 10:} 38... Rg1 39. Bd6 Rge1 40. Bxe7 Rxe7 41. 
c6 Re8 42. c7 Ba6 43. Qxd5+ Kf8 44. Qd6+ Re7 45. f6 gxf6 46. Qd8+ Kf7 {9.41/16}) 39. Qg5 {16.78/16 2} Rxe5 {17.52/16 10} 40. dxe5 {19.80/16 3} d4 {21.73/16 37} 41. Qd8 {20.98/14 3} h6 {#151/15 15} ({Shredder 10:} 41... h5 42. Qd7+ Kf8 43. c6 Bf7 44. e6 Bxe6 45. fxe6 Kg8 46. e7 g5+ 47. Kxg5 Re1 48. Qe8+ {20.98/14}) 42. Qd7+ {#150/19 2} 1-0 Both engines playing with no opening book. Shredder10 has found a plausible Queens Gambit sideline [D25] ab initio out to move 9. The evaluation looks like a picket fence for large parts of the game. Admittedly the first two were engine showpieces but the second pair were randomly chosen high level games. You can see it happen most prominently in the longer game where it misses the crucial winning line and mis scores a host of moves systematically wrong because it doesn't understand what is going on. ... But does it miss the crucial winning line because it has *tactical* shortcomings, or because it misunderstands how to play positionally? "Missing a winning line" sounds more like the former [or you might have said (eg) "misses the winning plan"]. How are you judging "systematically wrong"? Incapable of seeing deep enough to catch significant move refutations, or in some cases unable to see them at all no matter how much time it is given. There is no objective meaning to be attached to "White is 1.23 centipawns ahead" other than "Rybka/Fruit/Crafty gives this as its evaluation". I reckon the rms noise on most lines is always around 10cp no matter how deep you go. A few quiet lines may have smaller rms errors, but the active ones tend to bounce around a bit. But that is still enough to have some confidence in finding gross evaluation errors of 50cp or more (which is what Crafty at 12 ply does). [...] 
I do think that a fair proportion of the "errors" that the G&B analysis says the GMs have made are in reality just the rms error of Crafty's evaluation which is something like 30cp multiplied by the number of times they do something that it doesn't expect. Possibly; and we won't know unless/until someone does the experiment. But in that case, the actual figures for most WCs of around 13cp/move, and less for the WCs who most of us would regard as the most "accurate" in their positional judgement and tactical awareness, are surprisingly low. Further, it's interesting that the strongest and best WCs, by reasonably common consent, are those whose judgement differs least from that of Crafty. But they may well differ even less from the output of a stronger engine. Certainly of the GM games I have tried Crafty12ply on it has seen "improvements" that stronger engines at deeper ply levels can easily refute. If the GM makes the move Crafty expects it doesn't matter how wrong the evaluation is. The identity X + (-X) = 0 * is very helpful. It is only when the GM makes a different move that evaluation errors hurt the scoring. * Thanks to USPO, Xerox have a patent on this blindingly obvious identity as applied to JPEG decoding. It is time to turn the question around slightly. Can anyone find a GM level game where Crafty at 12ply avoids missing important winning lines and obtains reasonable blundercheck agreement to within say 20cp against any other top rated engine run for 60s/move? So far all the games I have tested have shown serious discrepancies (50cp). I don't think this is an interesting question *unless* we can produce an objective meaning [beyond Crafty/Rybka] of 20cp. Annotating a few more GM games with both Crafty12ply, your favourite engine 12ply, and your favourite engine 60s/move would go a long way to settling the dispute of whether or not Crafty12ply scoring is adequate. I have tried the experiment and so far found Crafty-12ply wanting.
YMMV. I think it is worth trying to agree a test protocol that could be used to produce say 100 top level games consistently annotated by multiple engines. Then we might be able to get some half decent stats. Hunches really don't cut it. Regards, Martin Brown
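As a starting point for that kind of protocol, the move comments in the game above can be pulled apart mechanically. The sketch below assumes the comments follow the pattern visible in the posted PGN, `{optional text eval/depth time}`. Reading the third field as seconds is an assumption, as is the sign convention of the evaluations. Mate scores such as `#151/15` and the bare `{0.66/13}` comments inside variations are deliberately skipped. The function names are hypothetical.

```python
import re

# Matches comments such as {0.30/16 12} or {(Bf4) 0.27/15 11}.
ANNOTATION = re.compile(r"\{[^}]*?(-?\d+\.\d+)/(\d+)\s+(\d+)\s*\}")


def parse_annotations(pgn_text):
    """Return (eval_cp, depth, time) for each comment carrying all three fields."""
    return [
        (round(float(ev) * 100), int(depth), int(time))
        for ev, depth, time in ANNOTATION.findall(pgn_text)
    ]


def big_swings(annotations, threshold_cp=50):
    """Indices where the evaluation differs from the previous one by >= threshold."""
    return [
        i
        for i in range(1, len(annotations))
        if abs(annotations[i][0] - annotations[i - 1][0]) >= threshold_cp
    ]


sample = (
    "1. d4 {Both last book move 0.30/16 12} Nf6 {0.36/12 27} "
    "2. Nf3 {(Bf4) 0.27/15 11} d5 {0.32/12 26} "
    "3. c4 {(e3) 0.27/16 11} dxc4 { (e6) -0.26/12 25}"
)

parsed = parse_annotations(sample)
print(parsed[0])           # (30, 16, 12)
print(big_swings(parsed))  # [5]
```

Consecutive comments alternate between the two engines, so a flagged swing may reflect a disagreement between them, a sign-convention difference, or a real change in the position. The sketch cannot tell these apart, and a real protocol would need to fix the convention first.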
|
#217
In article .com,
help bot wrote: In one of my current games at RedHot, I just chose to double my Rooks on the only open file, as opposed to snatching a free pawn [...]; yet my move is likely to result in an immediate resignation because it stomps out any imagined counter play and thereby underscores the fact that I am up the exchange for nothing and can win ^^^^^^^^^^^^^^^ material almost at will. *Now* he tells us! OK, you are the exchange up, control the only open file, and can win material at will. So -- purely a guess! -- Fritz, Crafty, Rybka and any other engine except possibly Sanny's is scoring your position at +3 or more? ... Suppose Crafty_12 ply penalizes either move as inferior to the other -- how does this style issue contribute to ranking the world champs *accurately*? ... In which case this position is outside the [-2..2] range, and is discarded by the G&B methodology, really for exactly the reasons you gave. So Crafty12 would not penalise your move. -- Andy Walker, School of MathSci., Univ. of Nott'm, UK. |
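The point about the [-2..2] range can be sketched as a filter applied before averaging. This is a simplified reading of what the post attributes to the G&B methodology, not a reproduction of it. Which evaluation decides whether a position is in range is not stated in the thread; the sketch assumes it is the engine's best-move evaluation. All numbers are hypothetical.

```python
def mean_loss_in_window(records, window_cp=200):
    """Mean centipawn loss over positions whose best evaluation lies in the window.

    records: iterable of (best_eval_cp, played_eval_cp) pairs.
    Positions with best_eval_cp outside [-window_cp, window_cp] are skipped
    (an assumption about how the range test is applied).
    Returns None if no position survives the filter.
    """
    losses = [
        best - played
        for best, played in records
        if -window_cp <= best <= window_cp
    ]
    if not losses:
        return None
    return sum(losses) / len(losses)


records = [
    (30, 30),      # played the engine's first choice
    (50, 20),      # small deviation in a balanced position
    (320, 150),    # already winning: a "safe" move costs 170 cp on paper
    (-250, -400),  # already losing
]

print(mean_loss_in_window(records))                    # 15.0
print(mean_loss_in_window(records, window_cp=10_000))  # 87.5
```

With the window applied, a consolidating move chosen in a position the engine already scores at +3 never enters the average, which is the sense in which Crafty12 "would not penalise" it.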
|
#218
In article .com,
Martin Brown wrote: [...] The one that Phil Innes challenged me with earlier in the thread being a canonical example. The key move there Nh4 aiming for a weak spot on g6 where it becomes a major thorn in black's side is beyond hope of Fritz ever seeing it. Rybka & Shredder both find it quickly [...] Yes, but I don't *need* Fritz to see it -- I need Fritz to confirm to me that Nh4 isn't getting trapped by ... g5 and/or that it wasn't doing something important relative to d4/e5 [not, of course, difficult in this particular position, but arguing much more generally]. That's why I think we are somewhat at cross purposes. You see Rybka/whatever as a strong player going through your game and pointing out the best moves; I see Fritz as a slightly annoying spectator saying [Harry Enfield voice:] "You can't do that, you've just dropped a piece" [/HE] as I look at the things I or my opponent might have done. If you run an engine engine match with the stronger engine penalised on time to give Crafty a chance you can watch as the game unfolds. Both sides claim to be winning for a while until one gets a deep tactical edge over the other. Sure. But you surely aren't claiming that Crafty is so stupid that it thinks doubled pawns are good and centralised pieces are bad? So I'm guessing that when both sides think they are winning [by something significant, not by 20cp or so], one of them has overlooked something of tactical importance, which is why, after a bit, it turns into a tactical win. I did one last night which illustrates my point - here annotated here by the victor at 60s/move. [Event "AOI, Blitz:4'+2""] [...] OK, so around 8-second chess, and an amusing crunch. It looks as though Crafty-sans-book has not the foggiest idea about developing and getting castled, and is also tactically unaware. But not relevant to the present debate! 
[There are some interesting discrepancies between evaluations on successive ply in the annotations, but these too are not that relevant, unless we find that Shredder is much more or less prone to these than other engines.] I reckon the rms noise on most lines is always around 10cp no matter how deep you go. A few quiet lines may have smaller rms errors, but the active ones tend to bounce around a bit. [In which case Capa/Kramnik's 10cp difference per move is startlingly good ....] But that is still enough to have some confidence in finding gross evaluation errors of 50cp or more (which is what Crafty at 12 ply does). Yes, but you still seem to be missing something. 100cp is a pawn, and you can understand that very directly. 50cp is what? It will matter if at some point we swap a 50cp advantage for a pawn-up with 50cp compensation, but until then it's an arbitrary measure. And even after that, it matters only if the implied equation "it's worth giving up the two bishops in order to win a doubled pawn" [or whatever] is so wrong that [eg] a won position is now drawn. GMs don't normally talk in those terms, nor about a 50cp advantage, but in terms of concrete material, specific positional pros and cons, and plans in a specific position. [...] Further, it's interesting that the strongest and best WCs, by reasonably common consent, are those whose judgement differs least from that of Crafty. But they may well differ even less from the output of a stronger engine. Possibly. But if G&B's table 3 is showing anything at all objective about Crafty, it is that Crafty12 plays "rather like" all the WCs except perhaps Steinitz, and much more like Capablanca and Kramnik than like other WCs. If Crafty12 is so rotten, it's been amazingly lucky. After all, in the game you showed above, Crafty10 [assuming that's roughly what it was managing in 8s] deviates by around 35cp/move from Shredder by G&B rules. 
So either there's a *huge* improvement between C10 and C12 [and C12 would agree almost exactly with Shredder] or else C12 is not only strong enough to assess WC play, but is actually closer than you might expect to emulating it. If the GM makes the move Crafty expects it doesn't matter how wrong the evaluation is. Yes, but this doesn't matter *anyway* unless it results in scoring moves in the wrong order -- and in that case, the GM should *not* be playing Crafty's move. You can't have it all ways! I think it is worth trying to agree a test protocol that could be used to produce say 100 top level games consistently annotated by multiple engines. Then we might be able to get some half decent stats. Hunches really don't cut it. Absolutely. But I don't think the stats will mean what you seem to think they mean. -- Andy Walker, School of MathSci., Univ. of Nott'm, UK. |
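The exchange about whether a wrong evaluation "matters" can be separated into two cases: an error that shifts every candidate move by the same amount, and an error that changes the order of the moves. The sketch below uses invented evaluations and helper names to show that only the second case changes a loss-based score.

```python
def loss(evals, played):
    """Centipawn gap between the engine's top choice and the move played."""
    return max(evals.values()) - evals[played]


def shifted(evals, offset_cp):
    """Apply the same evaluation error to every candidate move."""
    return {move: value + offset_cp for move, value in evals.items()}


# Hypothetical evaluations for one position, in centipawns.
evals = {"Nh4": 40, "Rd1": 25, "h3": 10}

# A uniform 50 cp error cancels out: the loss of each move is unchanged.
biased = shifted(evals, 50)
print(loss(evals, "Rd1"), loss(biased, "Rd1"))  # 15 15

# An error that reorders the moves does change the score.
misjudged = {"Nh4": 10, "Rd1": 25, "h3": 10}
print(loss(evals, "Nh4"), loss(misjudged, "Nh4"))  # 0 15
```

This matches both remarks in the thread: a uniform error cancels out of the difference, while an error that misorders the candidates penalises a player for choosing what may be the better move.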
|
#219
On May 15, 7:26 am, (Dr A. N. Walker) wrote:
In article .com, help bot wrote: In one of my current games at RedHot, I just chose to double my Rooks on the only open file, as opposed to snatching a free pawn [...]; yet my move is likely to result in an immediate resignation because it stomps out any imagined counter play and thereby underscores the fact that I am up the exchange for nothing and can win ^^^^^^^^^^^^^^^ material almost at will. *Now* he tells us! OK, you are the exchange up, control the only open file, and can win material at will. So -- purely a guess! -- Fritz, Crafty, Rybka and any other engine except possibly Sanny's is scoring your position at +3 or more? ... I wouldn't know. BTW, I recently downloaded a few of the free chess programs but have not yet been able to get any of them to work "as advertised" so I can analyze my games. One of these was Fritz 5.32, but its game analysis seems to just vanish into thin air. All in all, I would guess that your figure (+3 or more) is about right since all of my pawns were (yes, it's over; he resigned immediately after I captured his Rook for free) on the color opposite to his Bishop and therefore immune to capture so long as I kept his Rook at bay. Suppose Crafty_12 ply penalizes either move as inferior to the other -- how does this style issue contribute to ranking the world champs *accurately*? ... In which case this position is outside the [-2..2] range, and is discarded by the G&B methodology, really for exactly the reasons you gave. So Crafty12 would not penalise your move. Once again, you have demonstrated a complete, utter inability to read my comments *in context*. Look back at my original post. 
I was (obviously) replying to this comment by Ray Lopez: However, that said, I would not be surprised that even for correspondence chess players, rating such players with Fritz 5.31 at 5 seconds a move would give you a pretty clear indication of the best correspondence chess players, since good positional moves and good tactical moves are largely one and the same in chess To the idea that good positional moves and good tactical moves are *one and the same thing*. This silly notion is why I gave the example from my game where I had deliberately chosen a positional move over the sharper, tactical, material grab. Clearly, in this context, it would not matter if I had a dozen extra Queens; there *is* a substantial difference between positional and tactical moves. Among the world champions, those who tended toward the positional were often described as having a "dominating" style, while those who liked to live on the edge were often described as "aggressive", "dynamic", or perhaps more accurately, "reckless". :D ---- On the other subject, I strongly disagree that Fritz5.31 could *accurately* rank top correspondence players at 5 seconds per move (quick blunder check). This assumption relies on the silly idea that "chess is 99% tactics", and the remaining 1% is largely irrelevant. IMO, the remaining portion -- whether it be only 1% or many times that -- is not only relevant, but very *important*. ---- As for the G&B methodology, it was never described in any detail in any of the articles which I read by following the links earlier in this thread. Clearly, if I had wished to skewer their "methodology", I would probably want to know what it was. But having already learned that the reason for the sloppiness was a shortage of time and a complete disregard for quality work, I have no interest in further details regarding the authors' methodology. -- help bot |
|
#220
On May 15, 7:00 pm, (Dr A. N. Walker) wrote:
In article .com, Martin Brown wrote: [...] The one that Phil Innes challenged me with earlier in the thread being a canonical example. The key move there Nh4 aiming for a weak spot on g6 where it becomes a major thorn in black's side is beyond hope of Fritz ever seeing it. Rybka & Shredder both find it quickly [...] Yes, but I don't *need* Fritz to see it -- I need Fritz to confirm to me that Nh4 isn't getting trapped by ... g5 and/or that it wasn't doing something important relative to d4/e5 [not, of course, difficult in this particular position, but arguing much more generally]. That's why I think we are somewhat at cross purposes. Perhaps, but I still think that you might find trying another engine instead of Fritz an interesting experiment (and if you have a dual core CPU you can run a second engine effectively for free as long as you don't want to do anything else at the same time). Fruit2.2.1 is free for the first 14-day trial period and it shows one alternative game view. You see Rybka/whatever as a strong player going through your game and pointing out the best moves; Yes. I am having the engine look for deep tactical themes that I would like to be able to recognise and find in the future. Most of the time the engine agrees with me (and I am not a GM) but when it sees something I or my opponent missed completely that is interesting and worth a second look. Blundercheck is my preferred method. "Analysis" is too verbose and not so informative. I see Fritz as a slightly annoying spectator saying [Harry Enfield voice:] "You can't do that, you've just dropped a piece" [/HE] as I look at the things I or my opponent might have done. OK. If you run an engine-vs-engine match with the stronger engine penalised on time to give Crafty a chance you can watch as the game unfolds. Both sides claim to be winning for a while until one gets a deep tactical edge over the other. Sure. 
But you surely aren't claiming that Crafty is so stupid that it thinks doubled pawns are good and centralised pieces are bad? So I'm guessing that when both sides think they are winning [by something significant, not by 20cp or so], one of them has overlooked something of tactical importance, which is why, after a bit, it turns into a tactical win. It started from the opening in this game. Shredder +0.90 Crafty -0.27 peaking at +1.40 vs -0.10 then converging a bit until the fateful 17. ... Be7 2.77 vs 0.6 then for a while the scores agreed before again diverging to 1.24, 3.27. It was pretty clear in this game that Crafty simply did not know which way was up! For the sake of balance in a best of 3 engine match at this time penalty was 1 win 1 draw 1 loss for each engine. BTW is there a way to get the graphical display of time taken and engine score shown in Chessbase window or does the game have to be put into the playing window to see that info? I did one last night which illustrates my point - annotated here by the victor at 60s/move. [Event "AOI, Blitz:4'+2""] [...] OK, so around 8-second chess, and an amusing crunch. It looks as though Crafty-sans-book has not the foggiest idea about developing and getting castled, and is also tactically unaware. But not relevant to the present debate! Remember that is Crafty working at roughly the same search depth setting as was being used to judge the play of world champion chess players. It may be a bit unfair to make it play the opening (where its performance is very poor). [There are some interesting discrepancies between evaluations on successive ply in the annotations, but these too are not that relevant, unless we find that Shredder is much more or less prone to these than other engines.] A lot of engines have some move parity issues depending on turn to move. I reckon the rms noise on most lines is always around 10cp no matter how deep you go. 
A few quiet lines may have smaller rms errors, but the active ones tend to bounce around a bit. [In which case Capa/Kramnik's 10cp difference per move is startlingly good ....] I think that is probably because he tended to play down the lines with intrinsically small rms errors whereas some of the wilder more exciting players go for positions where Crafty practically grinds to a standstill at ply12 whilst it tries to figure out all the complications (and fails). But that is still enough to have some confidence in finding gross evaluation errors of 50cp or more (which is what Crafty at 12 ply does). Yes, but you still seem to be missing something. 100cp is a pawn, and you can understand that very directly. 50cp is what? It will matter if at some point we swap a 50cp advantage for a pawn-up with 50cp compensation, but until then it's an arbitrary measure. I was using that as an example. Looking more carefully at that game there were long sustained periods where Crafty was more than 100cp off the mark and about 10 moves where it was more than 200cp out (and in the middlegame). This doesn't bode well for its ability to score GM level play. [...] Further, it's interesting that the strongest and best WCs, by reasonably common consent, are those whose judgement differs least from that of Crafty. But they may well differ even less from the output of a stronger engine. Possibly. But if G&B's table 3 is showing anything at all objective about Crafty, it is that Crafty12 plays "rather like" all the WCs except perhaps Steinitz, and much more like Capablanca and Kramnik than like other WCs. If Crafty12 is so rotten, it's been amazingly lucky. I don't think it is that rotten. Just that it misses a lot of the rare but absolutely key GM moves and marks them down because it doesn't understand them. It probably gets the 95% of the routine moves exactly right, but it is the handful of other moves that make all the difference. 
After all, in the game you showed above, Crafty10 [assuming that's roughly what it was managing in 8s] deviates by around 35cp/move from Shredder by G&B rules. So either there's a *huge* improvement between C10 and C12 [and C12 would agree almost exactly with Shredder] or else C12 is not only strong enough to assess WC play, but is actually closer than you might expect to emulating it. According to the metrics line Crafty was at average search depth 12.2 and Shredder at 13.1 (but in 1/3 the time) during this test match. One other thing worrying about Crafty is that its accuracy does not improve with increased ply at anything like the rate of other modern engines. Its search seems to get stuck in the same local optimum ruts. A problem it shares with Fritz. I think it is worth trying to agree a test protocol that could be used to produce say 100 top level games consistently annotated by multiple engines. Then we might be able to get some half decent stats. Hunches really don't cut it. Absolutely. But I don't think the stats will mean what you seem to think they mean. Can we define a set of rules then, and let's see if enough volunteers can be mustered to get an agreed set of games from a recent match analysed with multiple engines. We need to set the cutoff for annotation - I normally use 8cp to avoid getting meaningless dross, but for this purpose and to make the test as close as possible to the G&B protocol I guess we try either zero or, if that isn't allowed, a 1cp window. Let's see if we can get a dataset first... Regards, Martin Brown |
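For anyone wanting to try the proposed protocol, here is a rough sketch of the sort of per-game statistic being argued about. It is only an illustration built on assumptions: the G&B method is not spelled out in this thread, so the [-2..2] pawn window (200cp), the treatment of the annotation cutoff (losses below it count as zero), and the function names `mean_cp_loss` and `rms_disagreement` are all hypothetical choices, not the authors' actual procedure. Evaluations are in centipawns from the mover's point of view.

```python
from math import sqrt

def mean_cp_loss(moves, window_cp=200, cutoff_cp=0):
    """Average centipawn loss per move.

    moves: list of (best_eval_cp, played_eval_cp) pairs, one per move,
    both taken from the same engine at the same depth.
    Positions whose best eval lies outside +/- window_cp are skipped
    (assumed reading of the [-2..2] rule mentioned in the thread).
    Losses smaller than cutoff_cp are counted as zero (assumed reading
    of the "cutoff for annotation").
    """
    losses = []
    for best, played in moves:
        if abs(best) > window_cp:
            continue  # game already decided, so ignore this move
        loss = max(0, best - played)
        if loss < cutoff_cp:
            loss = 0  # below the annotation threshold
        losses.append(loss)
    return sum(losses) / len(losses) if losses else None

def rms_disagreement(evals_a, evals_b):
    """RMS difference between two engines' evals of the same positions."""
    diffs = [a - b for a, b in zip(evals_a, evals_b)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical four-move sample, not taken from any real game.
sample = [(30, 30), (50, 10), (350, 100), (-20, -25)]
print(mean_cp_loss(sample))               # zero cutoff
print(mean_cp_loss(sample, cutoff_cp=8))  # 8cp annotation cutoff
```

Running the same games through several engines and comparing both numbers per engine would give the kind of dataset asked for above; whether the resulting stats mean what either side expects is a separate question.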