Elo, Glicko, and TrueSkill rating systems are most probably wrong

In summary: If it were as simple as you claim, then my engine would not have needed the extensive libraries it did. What I am saying is that the correct system of (w-l)/(w+d+l) or (w+0.5d)/(w+d+l) should be applied, instead of investigating whether confronting only players of about your own Glicko rating corrects things. Yes, I agree that a more comprehensive system that takes into account all wins, losses, and draws would be more accurate. However, the Elo system was designed to be simple and easy to calculate, and it has proven effective for ranking players in chess and other competitive games. It may not be perfect, but it serves its purpose.
  • #1
luckis11
A player has (wins+0.5*draws)/(wins+draws+losses) = (200+0.5*0)/(200+0+0) = 1.00, and another has (1200+0.5*0)/(1200+0+1000) ≈ 0.55. This rating shows correctly that the first player is much stronger than the second. Whereas Elo, Glicko and TrueSkill rate (roughly) the first at 800+200*8-0*8=2400 and the second at 800+1200*8-1000*8=2400, if the points with which they started is 800. In essence what Elo etc. count is wins-losses, whereas (w+0.5d)/(w+d+l) is equivalent to (w-l)/(w+d+l). They fail to show that the first is much stronger than the second; they say the two are of the same strength! It is WRONG and it needs to be replaced. Correct me if I am wrong, e.g. if these are not the points Glicko gives to these two players (then how many points does it give?), but am I wrong enough that the point I mentioned does not apply? All this assumes the two players faced opponents from all ranges of ratings (selected randomly), not opponents of almost the same rating, though the latter is what actually happens. They should face opponents from all ranges of ratings and not opponents of almost the same rating, because the latter makes it (at least for me) impossible to judge the system.
 
  • #2
Elo etc. are substantially more complicated than that.
luckis11 said:
In essence what Elo etc. count is wins-losses
This is not an accurate statement. The key point is that Elo etc. consider the skill of the opponent. Based only on a win and loss record you cannot say what the Elo rating is.

The other thing to consider is convergence. It is important that the ranking converge quickly to a useful rating.
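To make the opponent-dependence concrete, here is a minimal sketch of the standard Elo update in Python (the K-factor of 16 and the 400-point logistic scale are common textbook choices, not anything specific to this thread):

```python
def expected_score(rating: float, opp_rating: float) -> float:
    """Elo's expected score: an estimate of the probability of winning."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def elo_update(rating: float, opp_rating: float, score: float, k: float = 16.0) -> float:
    """Return the new rating. score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (score - expected_score(rating, opp_rating))

print(elo_update(1500, 1500, 1.0))  # beating an equal opponent: +8 points
print(elo_update(2400, 1500, 1.0))  # beating a far weaker opponent: only ~+0.09 points
```

The same win is worth wildly different amounts depending on who was beaten, which is exactly why the rating cannot be reduced to wins minus losses.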
 
  • #3
Just to expand on @Dale's accurate answer:

luckis11 said:
Elo, Glicko and TrueSkill rate (roughly) the first at 800+200*8-0*8=2400 and the second at 800+1200*8-1000*8=2400

This is not correct. Just knowing the number of wins/draws/losses is not enough to tell you the Elo rating. For example, beating an opponent rated 2500 will help your Elo much more than beating an opponent rated 1500. Also, since Elo ratings are designed to converge pretty quickly to your actual strength, your recent results will affect your current rating more, so the order in which you got your results matters too. And the same for Glicko (though I confess I haven't heard of TrueSkill).

Also, your first player has a perfect win rate, so you can't really estimate their strength well: all you can say is that they're probably much stronger than the opponents they played. I think Elo will give them a provisional rating of 400 points higher than the highest rated player they beat, but I could be misremembering this. Anyway, this problem doesn't really come up in practice since as you win games and gain rating, you'll be matched against stronger opponents until you have enough losses/draws to give a good estimate.

Edit: "ELO"->"Elo"
 
  • #4
Infrared said:
I confess I haven't heard of TrueSkill
It has been a while since I looked into this, but I believe that TrueSkill is just the Microsoft implementation of Glicko for the Xbox. It may have some modifications to handle team games.
 
  • #5
I am playing at chess.com and I get +8 points for a win and -8 points for a loss against a player with a rating almost the same as mine; +9 for a win and -7 for a loss against a player rated a bit higher than mine; +10 for a win and -6 for a loss against a player rated quite a bit higher than mine; +7 for a win and -9 for a loss against a player rated a bit lower than mine; and so on. So, on average it is +8 points for a win and -8 points for a loss. Thus the Glicko rating is wins-losses. Whereas the correct rating is (wins-losses)/(wins+draws+losses), which is equivalent to (wins+0.5*draws)/(wins+draws+losses).
 
  • #6
luckis11 said:
So, on average it is +8 points for a win and -8 points for a loss.
That is only true on average if you are both winning and losing against a variety of opponents whose skill levels lie both above and below yours. In the case in the OP the first player was only competing against weaker opponents and never losing, so the algorithm will not produce the simple average you suggest.

Again, your description of the Elo algorithm is oversimplified to the point that it doesn’t apply to the situation you described.
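As a hedged illustration of that point, using the same standard Elo update sketched earlier (K = 16, starting at 800 as in the OP): a player who does nothing but win does not collect a flat 8 points per game, because each win is worth less as the rating gap grows.

```python
def elo_update(rating: float, opp_rating: float, score: float, k: float = 16.0) -> float:
    expected = 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))
    return rating + k * (score - expected)

rating = 800.0
for _ in range(200):
    rating = elo_update(rating, 800.0, 1.0)  # 200 straight wins vs 800-rated opponents
print(round(rating))  # roughly 1290 -- nowhere near the OP's 800 + 200*8 = 2400
```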
 
  • #7
Your first sentence is correct. Your second sentence is wrong. Perhaps what you want to say is that since in chess you face opponents of almost the same Glicko (or similar) rating as yours, and not opponents from all ranges of ratings, my point does not apply and the investigation becomes complex. But what I am saying is that the surely correct system of (w-l)/(w+d+l) (or the equivalent (w+0.5d)/(w+d+l), which I prefer) and confronting players from all ranges of strengths should be applied, instead of investigating whether confronting only players of about the same Glicko rating as yours corrects (by coincidence or not) things.
 
  • #8
luckis11 said:
what I am saying is that the surely correct system of (w-l)/(w+d+l) (or the equivalent (w+0.5d)/(w+d+l), which I prefer) and confronting players from all ranges of strengths should be applied, instead of investigating whether confronting only players of about the same Glicko rating as yours corrects (by coincidence or not) things.
But your example in your OP does not make that point. Your first player played only against weaker opponents, so it is not a relevant counterexample. Furthermore, when you do analyze a scenario like your first player's, the Elo algorithm does not at all behave as you stated.

You are proposing an edge case that is not relevant to what you now claim and in that edge case you are incorrectly asserting what the Elo rating would be by oversimplifying the algorithm.

I coded the Glicko system in Mathematica several years back (for running pinewood derby tournaments). I can dig it up and run some simulations if you want. But I can tell you unambiguously that your statements about how it would behave in this edge case are flat out wrong.
 
  • #9
My OP meant that opponents were selected randomly, thus from all ranges of strengths, rather than paired by similar strength. Therefore my first player did not play only against weaker players.
 
  • #10
luckis11 said:
Therefore my first player did not play only against weaker players.
He won all of his matches, therefore he only played against weaker players, particularly if the game is mostly skill rather than luck.

Edit: I guess by "weaker" you may mean "weaker than average". I meant "weaker than him/her".
 
  • #11
Your claim in your OP that those two players will have the same rating is not correct. If they both played the same level of opposition and the results of the second player were roughly evenly distributed (i.e. not something like 1000 losses followed by 1200 wins), then the first player would have a much higher rating than the second.

For example, suppose both players start at 1500 and they play opponents rated around 1500 on average (some higher, some lower). When the second player begins to gain rating by winning more than losing, then they'll start to lose more for a loss than they gain for a win, and this will compensate for their slightly higher than even win rate, and the rating will stabilize.
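A short back-of-the-envelope version of this stabilization argument, under plain Elo assumptions (the 400-point logistic scale; the win rate below is just the OP's second player's 1200-1000 record used as an illustration): at equilibrium the expected gains and losses cancel, which happens exactly when the expected score E equals the true win rate p.

```python
import math

# Equilibrium: p * K * (1 - E) = (1 - p) * K * E  =>  E = p.
# In the Elo logistic model, E = p at a rating gap of 400 * log10(p / (1 - p)).
p = 1200 / 2200  # second player's win rate: 1200 wins, 1000 losses
gap = 400 * math.log10(p / (1 - p))
print(round(gap))  # about 32 points above the field, where the rating stabilizes
```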
 
  • #12
luckis11 said:
Whereas Elo, Glicko and TrueSkill rate (roughly) the first at 800+200*8-0*8=2400 and the second at 800+1200*8-1000*8=2400, if the points with which they started is 800.
Ok, so I found my Glicko code. In my code the starting score is 1500. I made a population of 200 players and had them compete 1000 times each to get a starting population of “background” players with established Glicko scores.

Then I made our two test players. I had them compete with random background opponents as you described. My mid tier player won 1203 games and lost 997 (the random selection made it not exactly 1200 wins and 1000 losses, but that is close enough). My top tier player won 200 and lost 0. Glicko scores were updated after each game.

At the end the mid tier player’s score was 1585, which is as expected just slightly better than average. At the end the top tier player’s score was 2776, which was the highest in the population.

So as I said above, your description of the Elo/Glicko computation is wrong. The actual calculation gives a substantial distinction between the two players. It is not remotely correct that "they say that they are of the same strength" as you claim.
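The original Mathematica code is not posted here, but the shape of the experiment is easy to reproduce. Below is a minimal Python sketch using plain Elo instead of full Glicko (the population size and game counts follow the description above; the K-factor, skill distribution, and seed are arbitrary choices of this sketch, so exact numbers will differ), and it separates the two test players just as clearly:

```python
import random

def elo_update(ra: float, rb: float, score_a: float, k: float = 16.0):
    """Zero-sum Elo update for one game; score_a is 1, 0.5 or 0."""
    ea = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
    return ra + k * (score_a - ea), rb - k * (score_a - ea)

random.seed(0)
# Background population: 200 players of varying true strength, ~1000 games each.
true_skill = [random.gauss(1500, 200) for _ in range(200)]
ratings = [1500.0] * 200
for _ in range(200 * 500):
    i, j = random.sample(range(200), 2)
    p_i = 1.0 / (1.0 + 10.0 ** ((true_skill[j] - true_skill[i]) / 400.0))
    score = 1.0 if random.random() < p_i else 0.0
    ratings[i], ratings[j] = elo_update(ratings[i], ratings[j], score)

# Two test players vs random background opponents (opponents' ratings frozen):
# mid tier: 1200 wins / 1000 losses; top tier: 200 wins / 0 losses.
for wins, losses in [(1200, 1000), (200, 0)]:
    r = 1500.0
    results = [1.0] * wins + [0.0] * losses
    random.shuffle(results)
    for s in results:
        r, _ = elo_update(r, ratings[random.randrange(200)], s)
    print(round(r))  # mid tier lands near the field average; top tier far above it
```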

luckis11 said:
confronting players from all ranges of strengths should be applied
As you can see above, if you randomly select players from the entire range the Glicko method gives good results. So broad-based pairings are acceptable for Glicko. However, such broad-based pairings are mandatory for your suggested method. That is an insurmountable weakness of your approach.

It is far more enjoyable and productive to play reasonably closely matched opponents. A novice vs a grandmaster is boring for the grandmaster and bewildering for the novice. It is unenjoyable and a waste of both players' time.
 
  • #13
Let me suppose that your 1585 and 2776 results are correct. Then how do you explain that in the >500 games I played, on average I was getting +8 points for a win and -8 points for a loss, thus what Glicko was counting is (initial amount)+8(wins)-8(losses), i.e. wins-losses?

As for the necessity to play against all ranges of strengths, consider this example from soccer: last season in the Premier League each team played 38 games, and in League Two each team played 46 games; suppose the champions of both leagues got (w+0.5d)/(w+d+l) = 0.75. With this data alone it is impossible to calculate how much stronger the Premier League champion is than the League Two champion. Some games between Premier League and League Two teams must be known. Now, there IS a modification of (w+0.5d)/(w+d+l) which gives more points when a team faced strong opponents and fewer points when a team faced weak opponents (strengths based on the 38- and 46-game results). I have found such a modification, but it needs a matrix of past results; however, I have constructed such a matrix. But in this case of 38+46 games the modification gives the same ratings as no modification, because each team in the Premier League and in League Two has faced opponents of all the strengths in its own league, i.e. the same set of strengths. Therefore the modification is useless for calculating which of the two champions is stronger. Thus it is reasonable to conclude that Elo etc. are also useless for calculating which of the two champions is stronger.
 
  • #14
luckis11 said:
Thus it is reasonable to conclude that Elo etc. are also useless for calculating which of the two champions is stronger.
If there were a premier league of chess, where the top 20 players played exclusively against each other, then it would be difficult to relate playing strength between divisions. But chess games are not segregated completely: the top players also play against ordinary GMs, who play regularly against IMs, who play regularly against experts, who play regularly against club players, etc.

There is enough overlap, therefore, to allow the relative strength of players across the spectrum to be estimated - even though an 1800 player will almost never play a 2600+ player.
 
  • #15
luckis11 said:
Then how do you explain that in the >500 games I played, on average I was getting +8 points for a win and -8 points for a loss, thus what Glicko was counting is (initial amount)+8(wins)-8(losses), i.e. wins-losses?
The comment following your “thus” is a false inference. Your average gain and average loss numbers in no way imply that your simplified formula represents the Glicko algorithm. For any player there will be some average gain and average loss. That doesn’t mean that the algorithm is simply wins - losses.

The Glicko algorithm is published and well known. Anyone, including you, can read the literature and see that your description is inaccurate. Your description is inconsistent with the literature on the calculation and demonstrably incorrect when applied to the scenario in your OP.
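Here is a small illustration of that fallacy, again under plain Elo assumptions rather than chess.com's actual Glicko parameters: a player paired mostly against near-equal opponents will show average gains and losses of roughly +8/-8 with K = 16, even though the update rule is nothing like wins minus losses.

```python
import random

def elo_delta(rating: float, opp: float, score: float, k: float = 16.0) -> float:
    return k * (score - 1.0 / (1.0 + 10.0 ** ((opp - rating) / 400.0)))

random.seed(1)
rating, gains, losses = 1500.0, [], []
for _ in range(500):
    opp = rating + random.uniform(-100, 100)  # close pairing, like a matchmaking pool
    score = 1.0 if random.random() < 0.5 else 0.0  # evenly matched games
    delta = elo_delta(rating, opp, score)
    (gains if delta > 0 else losses).append(delta)
    rating += delta

print(round(sum(gains) / len(gains), 1))    # roughly +8
print(round(sum(losses) / len(losses), 1))  # roughly -8
```

The near-symmetric averages are a consequence of the close pairings, not of the algorithm being wins minus losses.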

Are you unhappy with your rating? Is that why you are clinging to demonstrably false statements about the rating system?
 
  • #16
It's obvious that on average it is (initial amount)+8(wins)-8(losses). If I am wrong, you do not prove that I am wrong; you just say "read the literature". There is no need to understand the literature to conclude this. By the way, I hope that the people who decide which rating system is used do understand the literature, as it is as impenetrable as usual (the symbolism mathematicians use usually is).

And since you are curious whether I am happy with my present Glicko rating: I simply do not know whether I would have a better or worse rating under the surely correct system I propose. That is because, e.g., my wins, draws, and losses did not come from a random selection of opponents (random, thus from all ranges of strengths), but from opponents of similar Glicko rating to mine. I am unhappy with THAT: that the real strength of each player is unknown under the system used. At least, if they had chosen the opponents randomly and used Elo etc., then (w-l)/(w+d+l), equivalently (w+0.5d)/(w+d+l), or even (Glicko rating ≈ initial rating + 8*wins - 8*losses)/(number of games), would show something certain. Hey, I just realized this now: they do not need to change the rating they use, they only have to select the opponents randomly; then the surely correct rating would be there: (w-l)/(w+d+l), equivalently (w+0.5d)/(w+d+l). Both my rating and the Glicko rating would be there for everyone to see, and perhaps one could judge whether Glicko etc. are as correct as (w-l)/(w+d+l) is. I am impressed with the checkmate I have just done to you.

However, the rating I propose needs some modifications. For example, when the results are only 1 win in 1 game, that player has the best rating possible, while almost certainly he is not that strong a player. I have a solution, but it is based on intuition; the correct solution is extremely difficult to locate, as it is based on Bayesian inference and most probably needs a matrix of past results. And, e.g., I would give more weight to the more recent results.
 
  • #17
luckis11 said:
If I am wrong, you do not prove that I am wrong
In fact, I did completely prove it by actually running a full Glicko implementation. I thereby proved that it does not behave the way you claim.
 
  • #18
Dale said:
In fact, I did completely prove it by actually running a full Glicko implementation. I thereby proved that it does not behave the way you claim.
... checkmate!
 
  • #19
luckis11 said:
It's obvious that on average it is (initial amount)+8(wins)-8(losses).
By the way, as a follow-up I investigated this for the "background" population that I described above. Since I used a starting value of 1500 we don't expect the same numbers, but I was curious.

For the middle background player the average increase was 2.3 and the average decrease was -2.0. For the top quarter player it was 2.2 and -1.8. For the bottom quarter player it was 1.8 and -2.3. For the second-best player it was 2.2 and -2.5. For the second-worst player it was 2.8 and -2.2.

Now, all of those were over the full 1000 matches that established the background rankings. The process itself is highly dynamic so an overall average number is not that informative. Here is a plot of the ranking changes over time for the top quarter player.
[Attached plot "TopQuarter.png": rating changes over time for the top quarter player]

While the ranking is uncertain, the changes are large and rapid. As the ranking becomes more certain, the changes become smaller. This dynamic behavior is what actually drives the algorithm. Compare the averages over the full 1000 matches to the averages over the first 5% of the matches: for the middle player it was 30 and -16 in the first 5%, compared to 2.3 and -2.0 overall. For the top quarter player it was 29 and -13, compared to 2.2 and -1.8. Similarly for the others.

Now, you can always take any rating and win/loss history and generate an average increase and an average decrease. That is completely irrespective of the algorithm used to generate the ratings. So to simply look at an average (which can always be done) and claim that it tells you anything about the algorithm is nonsense. You cannot use such data to claim that "what Glicko was counting is (initial amount)+8(wins)-8(losses), i.e. wins-losses". If that were the case then the change would be consistent over time and consistent between players, neither of which is true.

This whole thread you have been misrepresenting the Glicko algorithm. It simply does not do what you say it does.
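For anyone who wants to check this against the published algorithm, the single-game Glicko-1 update is short enough to transcribe directly from Glickman's paper (this is a sketch of Glicko-1 only; a production system like chess.com's may differ in details such as RD floors, and the RD values below are purely illustrative). It makes the "large changes early, small changes later" behavior explicit:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scale constant

def g(rd: float) -> float:
    """Attenuation factor for the opponent's rating uncertainty."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd / math.pi) ** 2)

def glicko1_update(r: float, rd: float, r_opp: float, rd_opp: float, score: float):
    """One-game Glicko-1 update; returns (new rating, new rating deviation)."""
    e = 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))
    d_sq = 1.0 / (Q ** 2 * g(rd_opp) ** 2 * e * (1.0 - e))
    rd_sq_new = 1.0 / (1.0 / rd ** 2 + 1.0 / d_sq)
    return r + Q * rd_sq_new * g(rd_opp) * (score - e), math.sqrt(rd_sq_new)

# An uncertain new player (RD = 350) moves ~175 points on one win;
# an established player (RD = 50) moves ~7 points on the same result.
print(glicko1_update(1500, 350, 1500, 50, 1.0))
print(glicko1_update(1500, 50, 1500, 50, 1.0))
```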

luckis11 said:
they only have to select the opponents randomly
This is a terrible idea. Nobody wants to have grandmasters playing amateurs (except apparently you). I can hardly believe that you think that is a good idea. A good rating system has no need to do that, so if a proposed rating system requires that then it is de facto a terrible system.
 
  • #20
Please be as lucid (clear) as you can (that's what I do, and I edit my words many times to achieve this), as I guess I will have to read your last post 15 times before I perhaps understand it. Your last argument, that my system is terrible because it needs opponents to be selected from all ranges of strengths, is a childish one. Let me help you understand how self-evident it is that it is correct. In the past I wanted to place bets on soccer matches. So I needed a rating that shows correctly the strength of each soccer team, and then to construct a matrix in each box of which I entered the wins, draws, and losses gathered in the following way:
https://www.soccerstats.com/results.asp?league=england_2019&pmtype=resultsgrid
Suppose that Arsenal had rating a and Bournemouth had rating b. Arsenal (at home) beat Bournemouth, so I added 1 win to the box at row "rating a" and column "rating b". Repeating the process for all results, each box accumulated many wins, many draws, and many losses for the home team. Then, to estimate the win-draw-loss probabilities for the next future match in order to place my bet, I looked up the rating of the team playing at home, the rating of the team playing away, and what the box where these two ratings meet says. The box says, e.g., 70 wins, 20 draws, 10 losses; so the probability estimate is 70% for the home team to win, 20% for the draw, and 10% for the away team to win. And if the odds offered for the draw were 8, I would place my bet on the draw. Now, would I trust the Glicko etc. rating to construct my matrix and estimate probabilities this way in order to place bets with my little valuable money, or (w-l)/(w+d+l)? The latter, of course, because it is blindingly obvious that it is correct, whereas with Glicko etc. it is UNKNOWN AND IN DOUBT whether they are correct. And especially now that I have observed the (initial rating)+8*wins-8*losses behavior. You say I am wrong about this last observation, but at best the matter is in doubt.
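A minimal sketch of the matrix estimator described above, with invented team ratings and a coarse binning scheme purely for illustration (the construction is the poster's own; the code just shows its mechanics, including how thin the data in each box can be):

```python
from collections import defaultdict

# Each box, keyed by (home rating bin, away rating bin), accumulates
# [home wins, draws, home losses] from past results.
matrix = defaultdict(lambda: [0, 0, 0])

def rating_bin(rating: float, width: float = 0.1) -> int:
    """Bucket a (w + 0.5d)/(w + d + l) rating into coarse bins."""
    return int(rating / width)

def record_result(home_rating: float, away_rating: float, outcome: int) -> None:
    """outcome: 0 = home win, 1 = draw, 2 = away win."""
    matrix[(rating_bin(home_rating), rating_bin(away_rating))][outcome] += 1

def estimate(home_rating: float, away_rating: float):
    w, d, l = matrix[(rating_bin(home_rating), rating_bin(away_rating))]
    n = w + d + l
    return None if n == 0 else (w / n, d / n, l / n)

# Hypothetical usage: a home team rated 0.75 beat an away team rated 0.40.
record_result(0.75, 0.40, 0)
print(estimate(0.75, 0.40))  # (1.0, 0.0, 0.0) -- from a single game, so unreliable
```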
 
  • #21
luckis11 said:
Suppose that Arsenal had rating a and Bournemouth had rating b. Arsenal (at home) beat Bournemouth, so I added 1 win to the box at row "rating a" and column "rating b". Repeating the process for all results, each box accumulated many wins, many draws, and many losses for the home team. Then, to estimate the win-draw-loss probabilities for the next future match in order to place my bet, I looked up the rating of the team playing at home, the rating of the team playing away, and what the box where these two ratings meet says.
My friend has written a whole program to predict English Premier League football results based on this approach. It quickly became a lot better than his previous guesses. It's quite sophisticated now, although the next step would have to be non-results-based data, like injuries to key players, etc. Or a "change-of-manager" factor!
 
  • #22
luckis11 said:
And especially now that I have observed the (initial rating)+8*wins-8*losses behavior. You say I am wrong about this last observation, but at best the matter is in doubt.
We are done here. From your first post to your last you have been making false claims about how these systems work. You have had the algorithm correctly described to you, you have been referred to the literature on the topic, you have been shown the actual results and direct explicit evidence, and you have had your mathematical mistake explained. There is no room for rational or honest doubt.

On this site all posts must be consistent with the professional scientific literature. Your claim is not.

Your continued assertion of a demonstrated false claim has gone beyond the mistaken belief of an honest learner. Your claims are the assertions of someone with an agenda (frankly a very weird agenda) who is unwilling to change their opinion in the face of clear explicitly contradictory evidence.

That sort of behavior is tolerated on other social media platforms, but PF has a higher standard. Please read the rules before further posting, but this topic is closed.
 

1. What are the Elo, Glicko, and TrueSkill rating systems?

Elo, Glicko, and TrueSkill are all rating systems commonly used in competitive games and sports to rank players based on their performance. They use a mathematical formula to calculate a numerical rating for each player, which is used to determine their skill level and rank among other players.

2. How do these rating systems work?

These rating systems work by assigning a starting rating to each player, and then adjusting that rating after each game based on the outcome and the rating of their opponent. The amount the rating changes is determined by the system's formula, which takes into account the difference in ratings between the players and the expected outcome of the game.
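For example, in the Elo system the expected score and the rating adjustment take the standard published form

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),$$

where $R_A$ and $R_B$ are the two players' current ratings, $S_A$ is the game result for player A (1 for a win, 0.5 for a draw, 0 for a loss), and the K-factor sets how quickly ratings move. Glicko and TrueSkill elaborate on this idea by also tracking the uncertainty of each rating.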

3. Why are these rating systems considered to be wrong?

These rating systems have been criticized for not accurately reflecting a player's true skill level. This is because they are based on a limited amount of data and do not take into account external factors such as team dynamics, player fatigue, and luck. Additionally, simple implementations adapt only slowly to genuine improvement or decline in a player's skill over time; Glicko partly addresses this by increasing a player's rating deviation after periods of inactivity.

4. Can these rating systems be improved?

Yes, these rating systems can be improved by incorporating additional factors and data into the formula, such as team performance and individual player statistics. There are also ongoing efforts to develop new and more accurate rating systems that take into account a wider range of variables.

5. Are these rating systems still useful?

Despite their flaws, these rating systems are still widely used and can provide a general indication of a player's skill level. They can also be useful for creating balanced matchups and determining tournament seeding. However, it is important to recognize their limitations and not rely solely on these ratings to evaluate a player's abilities.
