The OU Algorithm revisited

X-Act · Sep 24, 2009

I'm going to be frank right away in this thread: the current OU algorithm is flawed. However, to save my ass, I'm going to say for now that the inherent problems in it are out of our control. Thankfully, ShoddyBattle 2 seems to be well underway in its production and should incorporate some changes, which is why I'm looking into the algorithm again.

First of all, an explanation of how the OU algorithm currently works is in order. What it does is the following: if a Pokemon has a probability of at least 0.5 of being present in at least one of twenty teams, then it is considered OU; otherwise, it is not.
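To make the rule above concrete, here is a minimal sketch of the usage cutoff it implies. It assumes the rule means "at least one of T independently drawn teams contains the Pokemon", so a per-team usage u qualifies when 1 - (1 - u)^T >= 0.5. The function names `ou_cutoff` and `is_ou` are made up for this illustration.

```python
def ou_cutoff(teams: int = 20, probability: float = 0.5) -> float:
    """Smallest per-team usage u satisfying 1 - (1 - u)**teams >= probability.

    Assumes the teams are independent draws (an assumption of this sketch).
    """
    return 1.0 - (1.0 - probability) ** (1.0 / teams)


def is_ou(usage: float, teams: int = 20) -> bool:
    """True if a Pokemon with this per-team usage meets the cutoff."""
    return usage >= ou_cutoff(teams)


# With teams=20 the cutoff comes out at roughly 3.4% usage.
print(f"{ou_cutoff(20):.4f}")
```

Changing `teams` from 20 to 15 or 25 moves the cutoff up or down, which is exactly why the choice of that number matters.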

Here are the shortcomings of the above method:

1) Why 'one in twenty'? Why not 'one in fifteen', or 'one in 25' or whatever? 20 was chosen only because the majority wanted that number, and that is not good enough.
2) Why are all Pokemon usages treated equally? Specifically, why is a Scizor usage from a 1500/350 player considered the same as a Scizor usage from a 1850/40 player? We are basically treating good and bad players in the same way here.

Let's tackle the second question first. The problem with not having weighted usages was summed up by Doug a few months ago. In short, he asked the following questions:

a) What does a player having a CRE of 1765 mean exactly? How good or bad is he?
b) Is the rating of a player even reliable when he might have several accounts belonging to him?

Because of these problems, he decided not to provide a weighted usage chart, because it wouldn't represent anything useful.

To this end, I researched the Glicko-2 system a bit and saw that its inventor provides a formula that gives the probability that a player beats another player. From this formula, I derived an excellent approximation of how good a player actually is, expressed as a percentage, which I called GLIXARE, short for 'GLIcko X-Act Rating Estimate', and hence Question a) above is solved.

As for Question b), this cannot be solved unless each person is allowed only one account on the ladder. Assuming that ShoddyBattle 2 will have measures for this to happen, we'll have Question b) solved as well, which would provide us with weighted usages... actually with excellent weighted usages. And hence Question 2) above is solved by GLIXARE + limiting the ladder to only one account per person. (Of course, cheaters would fuck this up, but I suppose that cheaters always have and always will exist.)
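As a rough illustration of what a weighted usage could look like, here is a sketch that weights each team by its owner's estimated strength. It does not use the GLIXARE formula, which isn't given in this post; as a stand-in it uses the expected-outcome formula from Glickman's Glicko description, evaluated against a hypothetical reference opponent rated 1500/350. The function names and the choice of reference opponent are assumptions made for this example.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant


def g(rd: float) -> float:
    """Glicko attenuation factor: shrinks rating gaps when RD is large."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def expected_score(r, rd, r_opp=1500.0, rd_opp=350.0):
    """Expected outcome of (r, rd) against a reference opponent.

    The 1500/350 reference opponent is an assumption of this sketch.
    """
    combined_rd = math.sqrt(rd ** 2 + rd_opp ** 2)
    return 1.0 / (1.0 + 10 ** (-g(combined_rd) * (r - r_opp) / 400.0))


def weighted_usage(teams):
    """teams: iterable of (rating, rd, [pokemon, ...]) tuples.

    Returns {pokemon: weighted share of teams containing it}.
    """
    totals = {}
    weight_sum = 0.0
    for rating, rd, members in teams:
        weight = expected_score(rating, rd)
        weight_sum += weight
        for pokemon in set(members):  # count each species once per team
            totals[pokemon] = totals.get(pokemon, 0.0) + weight
    return {p: t / weight_sum for p, t in totals.items()}
```

With this weighting, a Scizor used by the 1850/40 player counts for more than one used by the 1500/350 player, which is the behaviour Question 2) asks for.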

Now we tackle Question 1). It is obvious that we require an OU definition that has no subjectivity whatsoever. I recently began providing the overused leads of every metagame, and my definition of an 'overused lead' was the following:

An 'Overused Lead' is a member of the smallest group of Pokemon that, together, lead teams more often than not (mathematically, lead more than 50% of teams).
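Since every team has exactly one lead, the smallest such group is simply the most-used leads taken in descending order until their combined share passes 50%. A minimal sketch, where the function name `overused_leads` and the counts in the usage example are made up for illustration:

```python
def overused_leads(lead_counts: dict[str, int]) -> list[str]:
    """Return the most-used leads that together lead more than half the teams."""
    total = sum(lead_counts.values())
    chosen, covered = [], 0
    ranked = sorted(lead_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if covered * 2 > total:  # already strictly more than 50%
            break
        chosen.append(name)
        covered += count
    return chosen


# Invented counts, purely for illustration.
example = {"Azelf": 30, "Metagross": 25, "Aerodactyl": 20, "Swampert": 15, "Jirachi": 10}
print(overused_leads(example))
```

Here the first two leads cover 55 of 100 teams, so the list stops after them. If leads are tied at the boundary, the sort order decides which one is picked.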

The above definition is completely objective, and, in fact, when I used it, I got no complaints whatsoever. So why not try to apply this definition to an 'Overused Pokemon' as well?

An 'Overused Pokemon' is a member of the smallest group of Pokemon that, together, would recreate more than half of the teams.

The problem with the above definition is that we don't have team stats; we only have stats for individual Pokemon. Another problem with this definition is that, even if we had all the necessary team information, it's not that easy to extract the smallest set of Pokemon that makes up more than half the teams.
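To show why this is harder than the lead case, here is a brute-force sketch that works only on tiny inputs. It assumes full team lists are available and reads "recreate a team" as "every member of that team is in the group"; both are assumptions of this example, as is the name `overused_pokemon`.

```python
from itertools import combinations


def overused_pokemon(teams):
    """Smallest set of Pokemon that fully contains more than half the teams.

    Tries every group of size 1, 2, 3, ... so the running time grows
    combinatorially; this is a sketch, not a practical algorithm.
    """
    teams = [frozenset(team) for team in teams]
    species = sorted(set().union(*teams)) if teams else []
    for size in range(1, len(species) + 1):
        for group in combinations(species, size):
            chosen = set(group)
            covered = sum(1 for team in teams if team <= chosen)
            if covered * 2 > len(teams):  # strictly more than half
                return chosen
    return set()
```

Unlike leads, picking the individually most-used Pokemon is not guaranteed to give the smallest group here, because a team only counts once all of its members are included. When several groups of the same size qualify, this sketch returns whichever one it meets first.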

However, all is not lost. Although we don't have whole-team stats, we do have teammate statistics, which can provide an approximate snapshot of how the teams were. In fact, I tried to write a program that attempts to regenerate how the teams were, given the teammate stats and regular stats, but it doesn't work properly yet. :( (Currently it gets stuck in an infinite loop.) When that program works, I then need to find the minimum number of Pokemon that make up more than half the teams, which requires another algorithm.

What I'll actually do, however, is to see whether the Pokemon output by this algorithm are fewer or more in number than those for the current OU algorithm. If I find out that it is always fewer (or more) by roughly the same percentage, we could just as well edit the T variable in the old algorithm (the number of teams, currently twenty) to align it with the new, but extremely slow, algorithm. Then we can finally say that the value of T chosen for our OU algorithm is not subjective, but found from (rather intense) research and tests.

That is my long-term plan for this algorithm. It's a long and tedious road, I know, but I'll try to go through it, given also the limited amount of time I have to do this.

Of course, if people are willing to help me, I'm all ears. I'm also all ears if people want to suggest anything, have some insight into relevant algorithms, or want to otherwise comment on any aspect of what I wrote above. And that's the purpose of this thread.

Cathy · Sep 25, 2009

Just posting that I plan to sometime in the next week or so collect the data (i.e. the list of teams that were used) necessary to work out OU using this definition.

X-Act · Sep 25, 2009

Thanks a lot Colin. I'll try to see how OU with the new definition goes when I have the data.

I spoke with Tangerine on IRC yesterday about this and he has some reservations about this new method, which I completely understand. I kinda want to confirm that this method works myself before I commit to it.

lati0s · Sep 29, 2009

Just curious, how big of a change would you expect this method to be over the old one? Would it be enough to move a significant number of Pokemon into/out of UU?

X-Act · Sep 30, 2009

My prediction is that this method would lower the number of OU Pokemon. I don't know by how much exactly, though.
