| Source Data
|Most historical game collection CDs place justifiable emphasis on those games where the moves are actually known. For the purpose of rating calculations, the moves are irrelevant; all I care about is the result of each game. In fact, it is enough to know the overall head-to-head results from each event, so when all we have is a final match score, that's fine. Because of that emphasis on known moves, the widely available game collection CDs have not been as exhaustive as I would have liked, and so I had to cast my net wider, especially for 19th century results.
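|As a rough illustration of why the moves do not matter, here is a minimal Python sketch that collapses individual game results into one head-to-head score per pairing. The function name `head_to_head`, the tuple layout, and the player names are all hypothetical and the sample results are made up; this is not the code actually used to build the database.

```python
from collections import defaultdict


def head_to_head(results):
    """Collapse game results into one total score per pair of players.

    `results` is an iterable of (white, black, white_score) tuples, where
    white_score is 1.0 (White won), 0.5 (draw), or 0.0 (Black won).
    This layout is an assumption made for the example.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for white, black, white_score in results:
        # Store each pairing under a colour-independent key.
        a, b = sorted((white, black))
        score_a = white_score if a == white else 1.0 - white_score
        totals[(a, b)][0] += score_a
        totals[(a, b)][1] += 1.0 - score_a
    return {pair: tuple(score) for pair, score in totals.items()}


# Made-up results between two placeholder players.
games = [
    ("Player A", "Player B", 1.0),
    ("Player B", "Player A", 0.5),
    ("Player A", "Player B", 0.0),
]
scores = head_to_head(games)
```

|A match known only by its final score could be stored in the same form without any per-game records, since the output keeps nothing but each pairing's totals.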
|I put a lot of work into improving the quality of the 19th century data in my database. In this effort Anders Thulin was extremely helpful. He sent me an electronic version of some data from P Fenstra Kuiper's Hundert Jahre Schachzweikaempfe, 1967, the standard reference for match data, as well as Bachmann's 1922 book on the Teplitz-Schonau tournament which was an earlier attempt at the same reference. For tournament crosstables, of course, the incomparable four-volume set by Jeremy Gaige, covering 1851 through 1930, has been invaluable. Anders also compiled an extremely useful index of those Gaige books. I have merged data from Gaige's Volume I (1851 through 1900) with many other sources, to the point that I believe I have one of the most complete datasets in existence covering match and tournament results from the 19th century. The other source that was absolutely priceless in my efforts was Jerry Spinrad's web page containing chess results from 1835 through 1863, a period for which data is otherwise very difficult to obtain. I also used Levy and O'Connell's Oxford Encyclopedia of Chess Games, Volume 1 to nail down the earliest data, enabling me to have ratings going into 1843 with its famous Staunton vs Saint Amant matches. I also used La grande storia degli scacchi and various online sources. Bill Wall's chess timeline gave me the months for several events, which are extremely useful for my rating calculations.
|The 20th century data is a huge ongoing project and I don't know if it will ever end. As I said, with the combination of the Kuiper and Gaige data, I think I have pretty good coverage on paper through 1930, but getting it into the computer is another issue. Right now, my data from 1901 through 2002 is taken from the ChessBase BigBase 2003 (you can find the latest version of the ChessBase CDs here). I know that the Kuiper and Gaige data is more complete than the BigBase data, so it would clearly be an improvement to get that data in. I have already used the Anders Thulin index to create an electronic record of who played in each Gaige-reported tournament from 1901 to 1930, but actually entering the crosstable information is a daunting task even for me. Certainly, as you go later and later in time, it becomes impossible for one person (with a day job!) to go through everything manually, and so the later data will just be based on the ChessBase game collection for now.
|The 21st century data is yet another issue. ChessBase comes out with annual updates to the BigBase/MegaBase line, but a year is a little too long to wait, so I use the monthly ChessBase Magazine. That provides the great advantage that there is no need to map names from one database to another, since both are based on the ChessBase player names. The other option is to use the weekly game data generously provided by Mark Crowther's TWIC archives. That was my only option back when I was writing lots of pre-tournament prediction articles, because there can be a lag of more than a month before games make it into ChessBase Magazine, presumably because they want to annotate the games thoroughly. That's why it's March and you're seeing ratings only through the start of January.
|There is a further challenge of great importance, which is how to exclude inappropriate games. Offhand games, odds games, rapid games, blitz games, simultaneous events, exhibition events, correspondence events, thematic events... all of them have gotta go. This is particularly painful these days when many events are knockouts with rapid tiebreakers and so you actually have events that are partially valid and partially invalid. I have read about something called "Big Reference" which supposedly takes care of that for you, but I've never looked into it. Of course, there is also the issue of how to handle this on an ongoing (monthly/weekly) basis, so it's even more challenging. Even with the existing data, I know there are still lots of problems. When I could go through events manually (such as the 19th century data) it was possible to catch most of the undesired events. But even now as I casually browse through my site, I can see a Rice Gambit Lasker-Chigorin match, a SUI-Kasparov match, Karpov participating in a thematic Sicilian tournament... I know there's more that needs to be done! In any event, please email me if you have any suggestions or if you would like to help. If people are indeed interested in helping, I will be organizing projects to improve the source data even more:
|Read about future projects
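|The exclusion step described above could be sketched roughly as follows. This is a minimal illustration only: it assumes each game already carries a `kind` tag (a hypothetical field that the source collections do not necessarily provide), and the `Game` record and `rateable` function are invented for the example.

```python
from dataclasses import dataclass

# Kinds of games named above as inappropriate for rating purposes.
EXCLUDED_KINDS = {
    "offhand", "odds", "rapid", "blitz", "simultaneous",
    "exhibition", "correspondence", "thematic",
}


@dataclass(frozen=True)
class Game:
    event: str
    white: str
    black: str
    kind: str  # hypothetical tag, e.g. "classical" or "rapid"


def rateable(games):
    """Keep only the games whose kind is not on the excluded list."""
    return [g for g in games if g.kind not in EXCLUDED_KINDS]


# Made-up knockout event: a classical game followed by a rapid tiebreak.
games = [
    Game("Example Knockout", "Player A", "Player B", "classical"),
    Game("Example Knockout", "Player B", "Player A", "rapid"),
]
kept = rateable(games)
```

|Filtering game by game rather than event by event is what lets a knockout with rapid tiebreakers stay partially valid: the classical games survive and only the tiebreak games are dropped. The hard part in practice is assigning the `kind` tag in the first place, which this sketch simply takes as given.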