| Tags: capa, chess, cuz, greatest, karpov, kasparov, kramnik, lie, order, players, puters |
#191
Dr A. N. Walker wrote:
I don't *expect* anyone to do that. But anyone who might want to check my results [or, in the present case, *G&B's* results] *can* so modify their source code, and can thereby get exactly the same results, even on different hardware [...] But `reproducible' doesn't mean `anyone can reproduce exactly the same data.' It means that anyone should be able to repeat the experiment and produce consistent results. For example, the experiment of rolling a die 100 times and calculating the mean score is reproducible, even though two runs of the experiment are staggeringly unlikely to produce the same raw data and very unlikely to produce exactly the same mean. A better method would be to take the strongest program available (i.e. Rybka) and have it score all the games, and only THEN apply some selected algorithm to the standardized results. And if I did the scoring again tomorrow, with a different "seed" to the RNG, or with different contents in the cache/hash, or perhaps after adding a new disc or RAM to my computer, let alone if you tried to repeat my experiment, would you expect to see the same results? I should hope so! At least, the results should be very close. If you expect to see dramatically different results from anything other than increasing the size of the hash table (and, if necessary, the amount of RAM) or the speed of the system, then the test is so totally dependent on the initial conditions as to be worthless. A brief search didn't reveal what the actual hardware for the recent Kramnik-DeepFritz match was, but from the node-counts quoted it seems to have been roughly equivalent to 6x 2.5GHz PCs. I don't recall the exact details but it was off-the-shelf hardware. Expensive, top-of-the-range hardware but still off-the-shelf. Indeed. My take is that it is silly to think a computer (at this point in time) can accurately determine who among the world champions was "the greatest". Then perhaps it's lucky that G&B weren't trying to determine that? 
Then perhaps they shouldn't have written a summary claiming that this is what they were trying to determine. ``Who is the best chess player of all time? [...] we were interested in the chess players' quality of play regardless of the game score, which we evaluated with the help of computer analyses of individual moves made by each player. [...] We also give a carefully chosen methodology for using computer chess programs for evaluating the true strength of chess players.'' At least, I assume that the `greatest' World Champion is the same thing as the `strongest'. Dave. -- David Richerby Poetic Accelerated Wine (TM): it's www.chiark.greenend.org.uk/~davidr/ like a vintage Beaujolais but it's twice as fast and in verse! |
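The die-rolling example in this post can be sketched in a few lines of Python. This is only an illustration: the function name `mean_of_rolls` and the seed values are assumptions made for the sketch, not anything from the thread. It shows both senses of "reproducible" being argued over: runs with different seeds give consistent but not identical means, while a fixed seed replays exactly the same data.

```python
import random

def mean_of_rolls(n_rolls=100, seed=None):
    """Roll a fair six-sided die n_rolls times and return the mean score."""
    rng = random.Random(seed)  # private generator: a fixed seed replays the same rolls
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return sum(rolls) / n_rolls

# Statistical reproducibility: different seeds give similar, but generally
# not identical, means scattered around the expected value of 3.5.
means = [mean_of_rolls(seed=s) for s in range(5)]

# Exact reproducibility: the same seed gives exactly the same result.
assert mean_of_rolls(seed=42) == mean_of_rolls(seed=42)
```

A fixed-depth engine search plays the role of the fixed seed here: it trades away some realism for results that anyone can replay exactly.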
#192
In article ,
David Richerby wrote: [free, as you're in the UK] from your nearest decent library. Excepting that I don't live in the UK. Oh. Well, clearly we have to take your word for it, but don't you think it's a little misleading not only to have a UK address but also to describe yourself* as a PhD student at the University of Cambridge Computer Laboratory? We just assumed you were a little slow getting it .... OTOH, no-one with an interest in Tom Lehrer can be all bad. * "http://www.chiark.greenend.org.uk/~davidr/me.html" -- Andy Walker, School of MathSci., Univ. of Nott'm, UK. |
|
#193
Dr A. N. Walker wrote:
help bot wrote: [...] driving a F1 racing car. How tough can it be -- you just stay on the track and keep turning left (again and again and again). I would stay near the grass, so the other drivers could pass (again and again and again). That may or may not be a workable strategy at Indianapolis, but I strongly advise you not to try it in a F1 racing car. Esp not at Monaco, where the grass is in short supply [but the shops and the sea are quite near]. And right turns are in the majority... Dave. -- David Richerby Old-Fashioned Tool (TM): it's like a www.chiark.greenend.org.uk/~davidr/ handy household tool but it's perfect for your grandparents! |
|
#194
In article ,
David Richerby wrote: I don't *expect* anyone to do that. But anyone who might want to check my results [or, in the present case, *G&B's* results] *can* so modify their source code, and can thereby get exactly the same results, even on different hardware [...] But `reproducible' doesn't mean `anyone can reproduce exactly the same data.' It means that anyone should be able to repeat the experiment and produce consistent results. For example, the experiment of rolling a die 100 times and calculating the mean score is reproducible, [...], Right. But we have quite a good mathematical model of the statistics of rolling dice; there is no such model for values of chess positions. G&B's results are [or should be] *exactly* reproducible, which doesn't mean that you *have* [or in the PP's words that I *expect* you to have] done so, but if you have any doubts about their results, you can. You could then use that as a basis for all sorts of other experiments, such as using other ply depths and other engines, or other conditions [such as seeing what happens after (say) 30 seconds per position]. Would that make a difference? I don't know, but if you have 36 computers and a spare month available, feel free. A better method would be to take the strongest program available (i.e. Rybka) and have it score all the games, and only THEN apply some selected algorithm to the standardized results. And if I did the scoring again tomorrow, with a different "seed" to the RNG, or with different contents in the cache/hash, or perhaps after adding a new disc or RAM to my computer, let alone if you tried to repeat my experiment, would you expect to see the same results? I should hope so! At least, the results should be very close. Perhaps. And perhaps someone can run the experiment, if they have 36 computers and *two* months going spare. But one of the things we have learned from computer chess is that intuition is a very poor substitute for doing the experiment. 
If you expect to see dramatically different results from anything other than increasing the size of the hash table (and, if necessary, the amount of RAM) or the speed of the system, So in other words you would be happy to see different results if we ran the experiment again next year, with twice as much RAM, a bigger cache, more disc, and the next version of the OS, even if it's the same version of Rybka? Or do you just not see the value of an experiment that *could* be re-run on *any* computer, from a 20-yo PDP-11 [though it would take a long time ...] to whatever super- duper system we might have 20 years hence [when it might run in a few seconds rather than a CPU-year] and give identical results as the basis for further experiments? [FWIW, I was recently running some 30-yo programs, partly out of mere interest but also partly as a benchmark, and it was gratifying, and a confidence boost in the compiler/OS, when results were identical, and disturbing when some of the floating-point numbers in one test were systematically different for reasons we have not yet managed to track down.] then the test is so totally dependent on the initial conditions as to be worthless. Very possibly it is. Perhaps it isn't. We won't know unless someone runs the experiment. My own gut feeling is that G&B's basic results are probably pretty stable, and would be reproduced by any decent engine at any decent ply level, though I would not be at all surprised if a few of the 70-80 centipawn "blunders" turned out well at greater depth and a few non-blunders turned out to be dubious. Swings, roundabouts. BICBW. Indeed. My take is that it is silly to think a computer (at this point in time) can accurately determine who among the world champions was "the greatest". Then perhaps it's lucky that G&B weren't trying to determine that? Then perhaps they shouldn't have written a summary claiming that this is what they were trying to determine. ``Who is the best chess player of all time? [...] 
Firstly, "greatest" and "best" are far from synonymous, even if you [as in the snippage] slip in a "strongest". There are factors such as longevity, superiority over contemporaries, innovation, triumph over adversity, competitiveness, ... to consider as well as pure analytical chess-playing skill. And if you're going to quote their initial question, you might perhaps have quoted also their following sentence: "Chess players are often interested in this question to which there is no well founded, objective answer, ...". OK, perhaps the implication is that they should have stopped there and then. But if historical Elo ratings are of interest, then I see no reason why another objective measure of *something* need not be. They *do* have an objective measure. It *does* seem that their results correlate well with *some* quality that we can recognise in the play of Capablanca, Petrosian, Tal, etc. Their methodology is at least interesting, even if flawed. The problem comes if we all start to take it too seriously. It's a semi-amateur investigation that resulted in a conference paper and a somewhat light-hearted summary at a commercial chess site. I have no problem with that. -- Andy Walker, School of MathSci., Univ. of Nott'm, UK. |
|
#195
Dr A. N. Walker wrote:
David Richerby wrote: [free, as you're in the UK] from your nearest decent library. Excepting that I don't live in the UK. Oh. Well, clearly we have to take your word for it, but don't you think it's a little misleading not only to have a UK address but also to describe yourself* as a PhD student at the University of Cambridge Computer Laboratory? We just assumed you were a little slow getting it .... Hehe. Actually, I'm just a little slow at updating that page. It was true when I wrote it... *cough* Dave. -- David Richerby Surprise Old-Fashioned Watch (TM): www.chiark.greenend.org.uk/~davidr/ it's like a precision chronometer but it's perfect for your grandparents and not like you'd expect! |
|
#196
On May 9, 7:26 pm, (Dr A. N. Walker) wrote:
In article , David Richerby wrote: I don't *expect* anyone to do that. But anyone who might want to check my results [or, in the present case, *G&B's* results] *can* so modify their source code, and can thereby get exactly the same results, even on different hardware [...] But `reproducible' doesn't mean `anyone can reproduce exactly the same data.' It means that anyone should be able to repeat the experiment and produce consistent results. For example, the experiment of rolling a die 100 times and calculating the mean score is reproducible, [...], Right. But we have quite a good mathematical model of the statistics of rolling dice; there is no such model for values of chess positions. G&B's results are [or should be] *exactly* reproducible, which doesn't mean that you *have* [or in the PP's words that I *expect* you to have] done so, but if you have any doubts about their results, you can. You could then use that as a basis for all sorts of other experiments, such as using other ply depths and other engines, or other conditions [such as seeing what happens after (say) 30 seconds per position]. Would that make a difference? I don't know, but if you have 36 computers and a spare month available, feel free. OK. But without doing that for the moment. What settings do you use to analyse and annotate your own games? I would be prepared to bet it is nothing like as shallow as 12 ply fixed + quiescence. I find even 14 ply minimum sometimes misses things that are intuitive to a human player going over the blundercheck annotated game. And that is in mere club level games annotated with a much stronger engine than Crafty. I agree with your original analysis that fixed 12 ply was the least bad compromise for the data, time and kit available. A better method would be to take the strongest program available (i.e. Rybka) and have it score all the games, and only THEN apply some selected algorithm to the standardized results.
And if I did the scoring again tomorrow, with a different "seed" to the RNG, or with different contents in the cache/hash, or perhaps after adding a new disc or RAM to my computer, let alone if you tried to repeat my experiment, would you expect to see the same results? Well if it has to be open source then Fruit 2.1 (~2780) might be another alternative to try against Crafty (~2670). An extra 100 points and a bit less materialistic evaluation would be closer to human GM level play. Fruit 2.2.1 just about stumbled onto that tricky line that Phil Innes sets so much store by engines not finding before I pulled the plug. I should hope so! At least, the results should be very close. Perhaps. And perhaps someone can run the experiment, if they have 36 computers and *two* months going spare. But one of the things we have learned from computer chess is that intuition is a very poor substitute for doing the experiment. Agreed. But equally when the experiment has a systematic error due to using a relatively shallow fixed depth (but reproducible) searching to score the moves played it doesn't take much intuition to conclude that an engine that cannot annotate club level games accurately at that level is completely out of its depth on superGMs. It will penalise GMs that have formed plans extending beyond 12 ply if there is no obvious gain made inside its quiescence horizon. And it hardly ever sees material sacrifices for gains in positional advantage or tempo. If you expect to see dramatically different results from anything other than increasing the size of the hash table (and, if necessary, the amount of RAM) or the speed of the system, So in other words you would be happy to see different results if we ran the experiment again next year, with twice as much RAM, a bigger cache, more disc, and the next version of the OS, even if it's the same version of Rybka?
Or do you just not see the value of an experiment that *could* be re-run on *any* computer, from a 20-yo PDP-11 [though it would take a long time ...] to whatever super- duper system we might have 20 years hence [when it might run in a few seconds rather than a CPU-year] and give identical results as the basis for further experiments? Exact reproducibility probably isn't so important here. Getting the maximum accuracy of the move evaluation function for the limited amount of time available is the key. Fixed depth does not do that. [FWIW, I was recently running some 30-yo programs, partly out of mere interest but also partly as a benchmark, and it was gratifying, and a confidence boost in the compiler/OS, when results were identical, and disturbing when some of the floating-point numbers in one test were systematically different for reasons we have not yet managed to track down.] Usually something to do with rounding conventions or bugs in old trig function implementations. then the test is so totally dependent on the initial conditions as to be worthless. Very possibly it is. Perhaps it isn't. We won't know unless someone runs the experiment. My own gut feeling is that G&B's basic results are probably pretty stable, and would be reproduced by any decent engine at any decent ply level, though I would not be at all surprised if a few of the 70-80 centipawn "blunders" turned out well at greater depth and a few non-blunders turned out to be dubious. Swings, roundabouts. BICBW. I don't think they are swings and roundabouts though. GM level games are littered with precisely the sort of positions that chess engines find really difficult to score accurately. And they usually occur at pivotal moments. Indeed. My take is that it is silly to think a computer (at this point in time) can accurately determine who among the world champions was "the greatest". Then perhaps it's lucky that G&B weren't trying to determine that? 
Then perhaps they shouldn't have written a summary claiming that this is what they were trying to determine. ``Who is the best chess player of all time? [...] Firstly, "greatest" and "best" are far from synonymous, even if you [as in the snippage] slip in a "strongest". There are factors such as longevity, superiority over contemporaries, innovation, triumph over adversity, competitiveness, ... to consider as well as pure analytical chess-playing skill. And if you're going to quote their initial question, you might perhaps have quoted also their following sentence: "Chess players are often interested in this question to which there is no well founded, objective answer, ...". OK, perhaps the implication is that they should have stopped there and then. But if historical Elo ratings are of interest, then I see no reason why another objective measure of *something* need not be. They *do* have an objective measure. It *does* seem that their results correlate well with *some* quality that we can recognise in the play of Capablanca, Petrosian, Tal, etc. Their methodology is at least interesting, even if flawed. Agreed. The experiment is worth repeating with a much stronger engine. The problem comes if we all start to take it too seriously. It's a semi-amateur investigation that resulted in a conference paper and a somewhat light-hearted summary at a commercial chess site. I have no problem with that. Nor do I. I think it mostly has found the players with the lowest blunder rate fairly convincingly. Regards, Martin Brown |
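The two quantities argued over in this post, how far each played move falls short of the engine's preferred move and how often a player blunders, can be sketched as follows. This is a rough illustration only and not G&B's actual code: the function names, the 100-centipawn default threshold and the four example scores are all assumptions made for the sketch, and the scores are invented placeholders rather than real engine output.

```python
def move_losses(best_evals, played_evals):
    """Per-move loss in centipawns: score of the engine's best move minus
    the score of the move actually played, floored at zero."""
    return [max(0, best - played) for best, played in zip(best_evals, played_evals)]

def mean_loss(losses):
    """Average centipawn loss per move."""
    return sum(losses) / len(losses)

def blunder_rate(losses, threshold_cp=100):
    """Fraction of moves whose loss reaches the (assumed) blunder threshold."""
    return sum(1 for loss in losses if loss >= threshold_cp) / len(losses)

# Hypothetical placeholder scores for four moves, NOT real analysis.
best = [30, 10, 50, 0]
played = [30, -20, 50, -150]

losses = move_losses(best, played)   # [0, 30, 0, 150]
print(mean_loss(losses))             # 45.0
print(blunder_rate(losses))          # 0.25
```

Both figures depend entirely on the engine scores fed in, which is the crux of the objection above: if a shallow fixed-depth search misjudges a sacrifice or a long plan, the "loss" charged to that move is wrong, whatever is done with the numbers afterwards.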
|
#197
On May 10, 1:22 am, Martin Brown
wrote: Well if it has to be open source then Fruit 2.1 (~2780) might be another alternative to try against Crafty (~2670). An extra 100 points and a bit less materialistic evaluation would be closer to human GM level play. Fruit 2.2.1 just about stumbled onto that tricky line that Phil Innes sets so much store by engines not finding before I pulled the plug. Lots of commercial chess programs will have a database of "tricky" positions with "model" answers, just to fool people into rating them higher. Tricks of the trade. So in other words you would be happy to see different results if we ran the experiment again next year, with twice as much RAM, a bigger cache, more disc, and the next version of the OS, even if it's the same version of Rybka? Or do you just not see the value of an experiment that *could* be re-run on *any* computer, from a 20-yo PDP-11 [though it would take a long time ...] to whatever super- duper system we might have 20 years hence [when it might run in a few seconds rather than a CPU-year] and give identical results as the basis for further experiments? Exact reproducibility probably isn't so important here. Getting the maximum accuracy of the move evaluation function for the limited amount of time available is the key. Fixed depth does not do that. I disagree. Normalization, see one of my posts in this 200 post thread. Very possibly it is. Perhaps it isn't. We won't know unless someone runs the experiment. My own gut feeling is that G&B's basic results are probably pretty stable, and would be reproduced by any decent engine at any decent ply level, though I would not be at all surprised if a few of the 70-80 centipawn "blunders" turned out well at greater depth and a few non-blunders turned out to be dubious. Swings, roundabouts. BICBW. I don't think they are swings and roundabouts though. GM level games are littered with precisely the sort of positions that chess engines find really difficult to score accurately.
And they usually occur at pivotal moments. A pivotal moment is immaterial if you use normalization. As I explained in a post in this thread, the fact that a player enters a 60 move mating net set by his opponent, unseen by Crafty with a 14 ply move horizon, is immaterial since at some point Crafty will see the mating net (namely, 7 moves before checkmate) and rate the losing player lower than the winning player. OK, perhaps the implication is that they should have stopped there and then. But if historical Elo ratings are of interest, then I see no reason why another objective measure of *something* need not be. They *do* have an objective measure. It *does* seem that their results correlate well with *some* quality that we can recognise in the play of Capablanca, Petrosian, Tal, etc. Their methodology is at least interesting, even if flawed. Agreed. The experiment is worth repeating with a much stronger engine. Yes, agreed. As I posted 47.5 posts ago, for very close, nearly tied rankings, the stronger chess program might make a difference. But for clear demarcation breakpoints, such as between Capa and Kramnik versus Karpov and Kasparov, a stronger chess engine doesn't matter. The problem comes if we all start to take it too seriously. It's a semi-amateur investigation that resulted in a conference paper and a somewhat light-hearted summary at a commercial chess site. I have no problem with that. Nor do I. I think it mostly has found the players with the lowest blunder rate fairly convincingly. Regards, Martin Brown And, chess being 99% tactics (say many GMs, including Tarrasch or Teichmann), the player with the lowest blunder rate is often the best champion. Blunders = Function(overall strength). In fact, a study from a few years ago found that the difference in most moves between a patzer and a GM was not so much in the unexpected move made, but rather in the fact GMs blundered far less than a Class C player. RL |
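The "7 moves before checkmate" figure in this post is simple ply arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes a plain fixed-depth search counted from the mating side's move, ignoring the check extensions, quiescence search and pruning that real engines such as Crafty apply, so the true horizon may differ.

```python
def plies_to_mate(mate_in_moves):
    """Plies from the mating side's first move to checkmate:
    N moves by the attacker plus N - 1 replies by the defender."""
    return 2 * mate_in_moves - 1

def mate_visible(mate_in_moves, depth_plies):
    """True if a plain fixed-depth search of depth_plies reaches the mate."""
    return plies_to_mate(mate_in_moves) <= depth_plies

print(mate_visible(7, 14))    # True: mate in 7 needs 13 plies
print(mate_visible(8, 14))    # False: mate in 8 needs 15 plies
print(mate_visible(60, 14))   # False: far beyond the horizon
```

On this simplified model the engine recognises the loss only once it is seven moves away, so every earlier move that walked into the net is scored as if nothing were wrong. Whether that matters for the final ranking is exactly what the two sides of this exchange disagree about.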
|
#198
|
|||
|
|||
|
"Martin Brown" wrote in message oups.com... Agreed. But equally when the experiment has a systematic error due to using a relatively shallow fixed depth (but reproducible) searching to score the moves played it doesn't take much intuition to conclude that an engine that cannot annotate club level games accurately at that level is completely out of its depth on superGMs. I'd wager that this method would give generally meaningful results for club players, *despite* the fact that it will inaccurately analyze certain positions. It's a pity that the authors did not apply the method to the games of players with different Elo ratings. That's an easy and obvious extension that would have gone a long way to validating the worth of the method. The argument that the method is refuted by finding one position that the computer analyzes incorrectly is false. There are analogous issues in Elo rating: which games should be rated, and what is the significance of each game. |
|
#199
In article .com,
raylopez99 wrote: Yes, agreed. As I posted 47.5 posts ago, for very close, nearly tied rankings, the stronger chess program might make a difference. But for clear demarcation breakpoints, such as between Capa and Kramnik versus Karpov and Kasparov, a stronger chess engine doesn't matter. It appears, RayLopez, that you missed an earlier post of mine which had two questions related to this very point. Since I'm sure it was an innocent omission - it's easy to miss a single post in a long thread, I'll repeat the questions here. 1) Would you feel equally confident if we only gave crafty 11 ply? 10? 8? 4? Where do you draw the line? What non-arbitrary criteria are you using to suggest that 12-ply is meaningful whereas 3 ply, obviously, would not be? 2) What objective criteria are you using to define "extremely close" such that you don't trust the computer's ability to rank players properly? I'm very curious to hear your answers to these questions. -Ron |
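One way to make the "where do you draw the line?" question empirical would be to rank the same players at several search depths and measure how well the rankings agree. The sketch below uses Kendall's rank correlation for that. It is an assumption about how such a check might be set up, not something anyone in the thread ran, and the player labels and rankings are invented placeholders, not results.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same players.
    Each argument maps player -> rank (1 = best); ties are not handled."""
    players = list(rank_a)
    concordant = discordant = 0
    for p, q in combinations(players, 2):
        # Positive product: both rankings order this pair the same way.
        sign = (rank_a[p] - rank_a[q]) * (rank_b[p] - rank_b[q])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(players) * (len(players) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical placeholder rankings at two depths, NOT real data.
depth_12 = {"A": 1, "B": 2, "C": 3, "D": 4}
depth_8 = {"A": 1, "B": 3, "C": 2, "D": 4}

print(round(kendall_tau(depth_12, depth_8), 3))   # 0.667
```

A value near 1 across neighbouring depths would suggest the ranking has stabilised. A value that keeps falling as depth is reduced would show where shallow search stops being informative, which gives a non-arbitrary criterion of the kind asked for.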
|
#200
On May 10, 8:03 pm, Ron wrote:
In article .com, raylopez99 wrote: Yes, agreed. As I posted 47.5 posts ago, for very close, nearly tied rankings, the stronger chess program might make a difference. But for clear demarcation breakpoints, such as between Capa and Kramnik versus Karpov and Kasparov, a stronger chess engine doesn't matter. It appears, RayLopez, that you missed an earlier post of mine which had two questions related to this very point. Since I'm sure it was an innocent omission - it's easy to miss a single post in a long thread, I'll repeat the questions here. 1) Would you feel equally confident if we only gave crafty 11 ply? 10? 8? 4? Where do you draw the line? What non-arbitrary criteria are you using to suggest that 12-ply is meaningful whereas 3 ply, obviously, would not be? 2) What objective criteria are you using to define "extremely close" such that you don't trust the computer's ability to rank players properly? I'm very curious to hear your answers to these questions. -Ron I expect that his fixation on "normalization" has become so severe that RL would actually argue that 11 plys are good enough, that 10 plys are more than adequate, that 9 plys still work well, and that even 7 or 8 plys rank the players reasonably accurately. In his own infamous words, "a stronger chess engine doesn't matter". -- help bot |