/edit: I'm editing in a summary of the proposed systems at the bottom of the post. - Obi
Note: I got the okay from Groudon80 to make this thread.
I've been thinking about making my own tournament, but because tournament applications are locked, I spent my time researching tournament formats and rating systems, in the hope of running a very fair and very informative tournament. I started on this journey thinking that the research would be simple, but it is not. Indeed, tournaments and ratings are a difficult mathematical problem, one that some people base their PhD theses on in an attempt to make the systems more fair.
And believe me, when I discovered this, I did indeed wish to keep it simple. But reading up on a few of the non-technical papers, I am convinced that at least a bit of sound theory is necessary in order to conduct a fair and proper tournament.
Overview of Tournament Formats
The naive approaches work, but they can be improved upon. Indeed, they are the simplest to understand and are the tournament formats that we would use. The naive tournament format is the Single Elimination Tournament, which is the vast majority of what this forum has to offer. Fortunately, research does indeed show that the naive "power of two" Single Elimination Tourney is actually optimal. There are a few issues, but I'll discuss those later.
Naive (sense 3): having or marked by a simple, unaffectedly direct style reflecting little or no formal training or technique.
If you want to adhere to the KISS principle, I suggest using Single Elimination. However, there are issues with the Single Elimination Tourney that must be addressed. The first one is that it is easy for the "true best player" to get unlucky and lose one round. The best player would then be knocked out, maybe early, maybe later, but either way, the "true" champion is not really known.
For example, Snow Cloak or Sand Veil has a 20% chance of activating, which translates to a 4% chance of activating 2 turns in a row. However, in the first round alone of a 16 person Tournament there are 8 games, and if each game has one such 4% chance, the chance of this happening somewhere in the round (or something similar, like a key critical-hit hax) is about 28% (1 - 0.96^8). I'm not only counting Snow Cloak hax here; this is roughly the same probability as Dynamic Punch hitting 4 or 5 times in a row. While hax does not happen often in a single game, when many people are playing many games, a "hax loss" becomes very likely somewhere along the bracket.
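To make the arithmetic above easy to check, here is a minimal sketch of the "at least once" calculation. It assumes every chance is independent and that each game contains exactly one such chance, which is my simplification; the function name is made up for illustration.

```python
def chance_at_least_once(p_event: float, n_chances: int) -> float:
    """Probability that an event with per-chance probability p_event
    happens at least once across n_chances independent chances."""
    return 1.0 - (1.0 - p_event) ** n_chances


# Two 20% evasion rolls in a row: 0.2 * 0.2 = 4%.
p_double_evasion = 0.20 * 0.20

# 8 = first-round games in a 16-player bracket,
# 15 = all games in a 16-player single elimination bracket.
for n in (1, 8, 12, 15):
    print(n, round(chance_at_least_once(p_double_evasion, n), 3))
```

Under these assumptions, 8 chances give roughly 28%, and it takes about 12 independent chances to reach roughly 39%, so the exact figure depends on how many hax opportunities you count per game.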
With that comes the Double Elimination Tournament. You have 2 chances. If you get haxxed once, you get sent to the loser's bracket, and if you keep winning there, you face the winners' bracket champion later on. And from this concept also stems the Triple Elimination Tournament, where you can get screwed 3 times before you are out. Needless to say, these systems are far more complicated than the Single Elimination Tournament, and they take more time to run. However, they are clearly more fair, as the best player has a better chance of actually winning, while lucky (but worse) players have a much better chance of losing. More on these later.
At the most extreme, we have Swiss Tournaments. Nobody is eliminated: all players get to play in every round, usually against someone with a similar record, and at the end the win/loss records are compared. Because everyone plays the same number of games, one unlucky loss hurts much less, so you get the most balance.
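Swiss events usually pair players who have similar records each round. Here is a rough sketch of that idea, not a real pairing engine: the function name and the player names are made up, it does not avoid rematches, and with an odd number of players the last one is simply left unpaired (a bye).

```python
def pair_swiss_round(wins: dict[str, int]) -> list[tuple[str, str]]:
    """Pair players for the next round by sorting on current wins
    (ties broken by name) and matching neighbours in that order."""
    ordered = sorted(wins, key=lambda name: (-wins[name], name))
    # Step through the sorted list two players at a time.
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


# Hypothetical standings after two rounds.
standings = {"Ann": 2, "Bob": 1, "Cat": 1, "Dan": 0, "Eve": 2, "Fay": 0}
print(pair_swiss_round(standings))
```

With these standings the two 2-0 players meet, the two 1-1 players meet, and the two 0-2 players meet, so everyone keeps playing someone close to their own level.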
The Approaches to Ratings
The naive approach to ratings is the Win / Loss system. That is to say, after a tournament is done, you tally up the win/losses for every player, and then publish it. The most wins is the best, and the most losses is the worst. It is easy to understand, and easy to apply. But there are major issues. It is not uncommon to have many people with the same win/loss record. We can expect many people to have 4 wins, 4 losses for example, and the win/loss record doesn't help to determine who is better here.
Also, someone who has an 8-2 win/loss record might be better than someone who has a 9-1 win/loss record. The 9-1 guy may have just gotten "lucky" by facing 9 easy opponents, while the 8-2 guy faced all hard opponents.
Chess players solved this problem a while ago, and their system has evolved to tackle new problems and challenges (and I might add... the win/loss system doesn't work there either). The current systems in Chess today are:
The Elo system (old but mature and battle tested for decades)
The Glicko system (Elo with modifications. Years of proven results)
The Glicko2 system (Glicko with modifications. State of the Art rating system)
The Elo system is explained in 49+ pages of detail over here. You can understand it by reading the first 20ish pages. Here's an executive summary (and yes, I'm making up words to help explain it):
The Elo system assumes every player has an "average ability" and a "played ability". The "played ability" is what you did during that round, while the "average ability" is how good you are on average. Professor Elo then actually turned this "average ability" into a number. So if your average ability is the same as someone else's average ability, you have a 50% chance of beating them. ("Beating them" in mathematical terms means that your "played ability" was better than their "played ability".) If your average ability is 35 points higher than someone else's average ability, then you have a 55% chance of winning. So on and so forth.
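The rating-gap-to-win-chance idea can be sketched in a few lines. This assumes the logistic curve with a 400-point scale, which is the form commonly used for Elo today; the paper itself may use a slightly different curve, and the function name is mine.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B
    (read it as A's win chance if you ignore draws)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings, a 35-point edge, and a 100-point edge.
for gap in (0, 35, 100):
    print(gap, round(expected_score(1500 + gap, 1500), 2))
```

Under this curve, equal ratings give 50%, a 35-point edge gives about 55%, and a 100-point edge gives about 64%.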
Elo then provided a method to make your score closer to the true score based on how you performed in a tournament. The formula is listed in that paper.
This system takes into account how good your opponent was. Let's say you have a 1500 rating, you play someone with a 1600 rating, and you win 45% of the time. While the opponent won more often than you, you will gain points while he loses points. This is so that you get a better estimate of your true rating. The 1600 should win about 64% of the time against you, and you should win only about 36% of the time. Therefore, you played better than expected, and he played worse than expected. Thus the points are adjusted accordingly.
The Glicko system goes ahead and realizes that, hey, I can only estimate someone's rating. So instead of giving a single number as an estimate, it gives a range. E.g., a new player can have a rating of 1200 to 1800, while a well-seasoned player will have a rating of 1800 to 1850. The more you play, the more precise the system gets with its estimation.
This way, if a 1550 to 1650 player faces a newbie who has a rating of 1200 to 1800, then the new player will gain many points when he wins, but the older player won't lose too many points. This is because the system is "unsure" of how good the new player is, so it won't penalize the older player's rating that much, while it is fairly sure how good the older player is. (Notice that his rating is between 1550 and 1650, while the newbie's range is 1200 to 1800.)
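Here is a small sketch of the 1500 vs 1600 example, using the usual Elo update of "K times (actual score minus expected score)". The K value of 16, the 20-game match length, and the function names are all my assumptions for illustration; they are not taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B on the logistic 400-point curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, opp_rating: float,
                   wins: int, games: int, k: float = 16.0) -> float:
    """New rating after a batch of games against one opponent:
    old rating + K * (actual wins - expected wins)."""
    expected_wins = games * expected_score(rating, opp_rating)
    return rating + k * (wins - expected_wins)


# Hypothetical 20-game set: the 1500 player wins 9 (45%), the 1600 wins 11.
low = updated_rating(1500, 1600, wins=9, games=20)
high = updated_rating(1600, 1500, wins=11, games=20)
print(round(low, 1), round(high, 1))
```

With these numbers the 1500 player was expected to win only about 7 of the 20 games, so winning 9 moves him up by roughly 29 points, and the 1600 player drops by the same amount even though he won the set.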
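The newbie vs veteran example can be sketched with the Glicko update formulas as I understand them from Glickman's description. Two assumptions to flag loudly: I treat each "range" as rating plus or minus 2 rating deviations (RD), so 1200 to 1800 becomes 1500 with RD 150 and 1550 to 1650 becomes 1600 with RD 25, and I skip the step that widens RD after a period of inactivity. The function names are mine.

```python
import math

Q = math.log(10) / 400.0


def g(rd: float) -> float:
    """Shrinks the impact of a game when the opponent's rating is uncertain."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def expected(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score against an opponent with rating r_opp and deviation rd_opp."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))


def glicko_update(r: float, rd: float, results):
    """results: list of (opp_rating, opp_rd, score), score is 1, 0.5 or 0.
    Returns (new_rating, new_rd) after one rating period."""
    inv_d2 = 0.0
    surprise = 0.0
    for r_opp, rd_opp, score in results:
        e = expected(r, r_opp, rd_opp)
        inv_d2 += Q ** 2 * g(rd_opp) ** 2 * e * (1.0 - e)
        surprise += g(rd_opp) * (score - e)
    denom = 1.0 / rd ** 2 + inv_d2
    return r + (Q / denom) * surprise, math.sqrt(1.0 / denom)


# Assumed mapping: range = rating +/- 2 * RD. The newbie beats the veteran once.
newbie = glicko_update(1500, 150, [(1600, 25, 1)])
veteran = glicko_update(1600, 25, [(1500, 150, 0)])
print(newbie, veteran)
```

Under these assumptions the newbie gains roughly 70 points from the one win while the veteran loses only about 2, and the newbie's range also narrows a little, which matches the behavior described above.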
The Glicko system is explained in detail here (requires a postscript reader). An example of the Glicko system in action is here.
------------
That is all the time I have for now. I'll post more of what I've learned later. Hopefully, we can all learn a little about tournament formats and make Smogon an even better place to competitively battle Pokemon. Yeah, I didn't get to discuss the research on Double Elimination Tourneys (Double-Elimination Tournaments: Counting and Calculating by Christopher T. Edwards) or Glicko2, among other things... perhaps I'll have time later.
/edit: Summary!
Elo: Everyone gets a rating. If you win against someone, you get points and they lose points. If their rating is higher than yours, a lot of points are exchanged. If you win against someone with a lower rating than yours, few points are exchanged.
Glicko: Mostly the same as Elo, except instead of having a set rating, you have a range of ratings. It basically says, "You are rated somewhere between these scores", which helps account for newer players having less fixed ratings. The more you battle, the narrower this range becomes.
Glicko2: Mostly the same as Glicko, except it adds another factor (the "volatility factor"), which is a measure of how consistent you are. If you win 50% of your matches, you are more consistent if you win every other match of 100 battles than if you win 50 matches, then lose 50 matches. - Obi
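To get a feel for why the order of wins and losses matters, here is a rough illustration. It is NOT the Glicko2 volatility calculation, just plain Elo updates (K of 32, every opponent fixed at 1500, both my assumptions) applied to the two 50% records described above, to show how differently the rating behaves.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B on the logistic 400-point curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rating_path(results, start=1500.0, opp=1500.0, k=32.0):
    """Rating after each game, for a list of True (win) / False (loss)."""
    rating = start
    path = [rating]
    for won in results:
        rating += k * ((1.0 if won else 0.0) - expected_score(rating, opp))
        path.append(rating)
    return path


def spread(path):
    """Gap between the highest and lowest rating ever reached."""
    return max(path) - min(path)


alternating = [i % 2 == 0 for i in range(100)]  # win, loss, win, loss, ...
streaky = [True] * 50 + [False] * 50            # 50 wins, then 50 losses

print(round(spread(rating_path(alternating))), round(spread(rating_path(streaky))))
```

Both players finish 50-50, but the alternating player's rating stays within a few dozen points while the streaky player's rating swings by hundreds. That kind of erratic movement is what a volatility factor is meant to capture.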