In Part 1, I introduced Bill James’ Pythagorean expectation for estimating a team’s expected win percentage in Major League Baseball. Here in Part 2, I discuss how it can be improved [1].

As mentioned at the end of Part 1, Pythagorean expectation and Wins Above Replacement (WAR) use runs as the single measuring stick for calculating expected win percentage (for teams) and value (for players), respectively. By this logic, two teams that score the same average runs per game (RPG) should split their games against each other evenly, and two players with equal WAR have equal value. But could other factors be at play?
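For reference, the basic Pythagorean estimate for a head-to-head matchup depends on nothing but the two teams’ RPG. Below is a minimal Python sketch, assuming the common form W = R1^x / (R1^x + R2^x), where Team 1’s runs allowed in the matchup are Team 2’s runs scored. The function name `pythagorean_wpct` is hypothetical, and the exponent x is left as a parameter (2 in James’ original; 1.83 is a widely used refinement).

```python
def pythagorean_wpct(r1, r2, exponent=2.0):
    """Expected winning percentage for Team 1 against Team 2, where r1 and r2
    are the two teams' average runs per game. exponent=2 is James' original form."""
    return r1**exponent / (r1**exponent + r2**exponent)


# Teams with equal RPG split evenly, whatever the exponent.
print(pythagorean_wpct(5.0, 5.0))        # 0.5
print(pythagorean_wpct(5.0, 5.0, 1.83))  # 0.5
# A 5.0 RPG team against a 4.5 RPG team.
print(round(pythagorean_wpct(5.0, 4.5), 3))  # 0.552
```

Because only the two averages enter the formula, any teams that share the same RPG are rated exactly .500 against one another, no matter how their runs are distributed from game to game.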

Consider three teams, each of which scores an average of 5.0 RPG: Team A, which always scores 5 runs (5/5/5); Team B, which scores 4 runs two-thirds of the time and 7 runs one-third of the time (4/4/7); and Team C, which scores 6 runs two-thirds of the time and 3 runs one-third of the time (6/6/3). Pythagorean expectation predicts that any two of these teams should split their head-to-head games evenly, yet none of them do. Treating each team’s three listed scores as equally likely, every matchup has nine equally likely pairings, and it is easy to check that Team A beats Team B 6 games out of 9, Team B beats Team C 5 games out of 9, and Team C beats Team A 6 games out of 9!
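The nine pairings can be enumerated directly. Here is a short Python sketch, assuming each team’s three listed scores are equally likely and games are independent; `head_to_head` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

# Each team's three equally likely game scores, as listed above.
TEAMS = {"A": (5, 5, 5), "B": (4, 4, 7), "C": (6, 6, 3)}


def head_to_head(team1, team2):
    """Return team1's (win, tie, loss) probabilities over all 3 x 3 pairings."""
    pairs = list(product(TEAMS[team1], TEAMS[team2]))
    n = len(pairs)
    wins = sum(a > b for a, b in pairs)
    ties = sum(a == b for a, b in pairs)
    return Fraction(wins, n), Fraction(ties, n), Fraction(n - wins - ties, n)


for t1, t2 in [("A", "B"), ("B", "C"), ("C", "A")]:
    win, tie, _ = head_to_head(t1, t2)
    print(f"{t1} beats {t2}: {win} (ties: {tie})")
```

It prints 2/3 (that is, 6/9), 5/9, and 2/3, with no ties in any matchup, matching the counts above.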

This rock-paper-scissors example shows that “better team” is not a transitive relation: A beats B and B beats C, yet C beats A. It immediately follows that no single numerical parameter, such as RPG, is sufficient to determine which team is better in a head-to-head matchup, because ranking teams by any single number is always transitive. Although simplistic, this example shows that we sometimes must look beyond RPG to determine which team is better. The shape of the run distribution can also be an important factor: Team A is perfectly consistent, Team B scores below its average more often but has a long tail above it (positive skew), while Team C scores above its average more often but has a long tail below it (negative skew).
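The skew directions can be checked with the usual population skewness (mean cubed deviation divided by the cube of the standard deviation), under which a long tail above the mean is positive. This is a sketch using Python’s `statistics` module; the helper name `skewness` is illustrative.

```python
from statistics import fmean, pstdev

# Each team's three equally likely scores, from the example above.
TEAMS = {"A": (5, 5, 5), "B": (4, 4, 7), "C": (6, 6, 3)}


def skewness(scores):
    """Population skewness: mean cubed deviation over the SD cubed.
    Positive means a long tail above the mean; None when the scores never vary."""
    mu = fmean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return None
    return fmean((x - mu) ** 3 for x in scores) / sd**3


for name, scores in TEAMS.items():
    skew = skewness(scores)
    skew_text = "undefined" if skew is None else f"{skew:+.3f}"
    print(f"Team {name}: mean {fmean(scores):.1f}, SD {pstdev(scores):.3f}, skew {skew_text}")
```

Team B comes out at +0.707 and Team C at -0.707. Both have the same standard deviation (about 1.414 runs), so in this toy case the standard deviation alone does not separate them. Team A’s skewness is undefined because its score never varies.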

In practice, baseball run distributions tend to have positive skew, like Team B above: many games a little below average, and fewer games above average but with larger deviations from it. For example, a team that averages 5.0 RPG may have many games with 3 or 4 runs, but a few with 10 or more. Since Team A beats Team B, a general rule is that teams that are more consistent, with less positive skew in their run distribution, should win more games than their RPG alone would indicate. A team with smaller skew also tends to have a smaller standard deviation, so standard deviation can be used as a second parameter, in addition to RPG, to help rank teams.

Teams A, B, and C above are very unrealistic. A more realistic (but still idealized) example is the following. Let Team HR hit only home runs, Team 2B only doubles, Team 1B only singles with runners advancing one base, and Team 1B+ only singles with runners always advancing from first to third. Then Team HR scores on every hit, Team 2B scores on the second hit of an inning, Team 1B+ on the third, and Team 1B on the fourth. Once any of these teams has scored a run in an inning, it scores another run on each subsequent hit in that inning.

If we assume these teams all average 5.0 RPG, the HR team must have a batting average (AVG) of .156, the 2B team .284, the 1B+ team .373, and the 1B team .441. Pythagorean expectation says they should all split their games against each other evenly, but do they? The following table shows each team’s winning percentage against each of the others, along with each team’s standard deviation.

In each case, the team with the smaller standard deviation is favored to win more than 50% of the time, and the larger the difference in standard deviations, the more it is favored. Looking at the run distributions (below), we see that the HR team has the least skew, the 2B team is next, followed by the 1B+ team, and the 1B team has the most skew.

The difference in skew makes sense: the 1B team needs more hits in an inning to score, but once it gets there, each subsequent hit scores a run, and with its higher batting average, that happens often. So it tends to have more big innings than the HR team.
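The batting averages and the standard-deviation ordering can be reproduced with a short calculation. The sketch below assumes every plate appearance is either a hit (with probability equal to AVG) or an out, so the number of hits before the third out follows a negative binomial distribution. It also assumes nine independent innings and splits tied games evenly, since extra innings are not modeled. All names are hypothetical, and the win shares it prints need not match the table exactly.

```python
from math import comb, sqrt

# Hits needed for the first run of an inning under each team's idealized rules.
HITS_TO_FIRST_RUN = {"HR": 1, "2B": 2, "1B+": 3, "1B": 4}
MAX_HITS = 40  # truncation point; 40+ hits in one inning is vanishingly rare here


def inning_pmf(p, k):
    """Runs in one inning when each PA is a hit with probability p, else an out.
    h hits before the third out has probability C(h+2, 2) * p^h * (1-p)^3
    (negative binomial) and scores max(h - k + 1, 0) runs."""
    pmf = {}
    for h in range(MAX_HITS):
        runs = max(h - k + 1, 0)
        pmf[runs] = pmf.get(runs, 0.0) + comb(h + 2, 2) * p**h * (1.0 - p) ** 3
    return pmf


def mean(pmf):
    return sum(x * px for x, px in pmf.items())


def sd(pmf):
    mu = mean(pmf)
    return sqrt(sum((x - mu) ** 2 * px for x, px in pmf.items()))


def avg_for_rpg(k, target_rpg=5.0, innings=9):
    """Bisect for the AVG that yields target_rpg over nine innings."""
    lo, hi = 0.0, 0.9
    for _ in range(50):
        mid = (lo + hi) / 2
        if innings * mean(inning_pmf(mid, k)) < target_rpg:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def convolve(a, b):
    """Distribution of the sum of two independent run counts."""
    out = {}
    for x, px in a.items():
        for y, py in b.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out


def game_pmf(p, k, innings=9):
    """Runs per game as the sum of nine independent innings."""
    inning = inning_pmf(p, k)
    game = {0: 1.0}
    for _ in range(innings):
        game = convolve(game, inning)
    return game


def win_share(pmf1, pmf2):
    """P(team 1 outscores team 2) plus half of P(tie); splitting ties is an assumption."""
    win = tie = 0.0
    for x, px in pmf1.items():
        for y, py in pmf2.items():
            if x > y:
                win += px * py
            elif x == y:
                tie += px * py
    return win + tie / 2


avgs = {team: avg_for_rpg(k) for team, k in HITS_TO_FIRST_RUN.items()}
games = {team: game_pmf(avgs[team], k) for team, k in HITS_TO_FIRST_RUN.items()}
for team, pmf in games.items():
    print(f"{team:>3}: AVG {avgs[team]:.3f}, {mean(pmf):.2f} RPG, SD {sd(pmf):.2f}")
teams = list(games)
for i, t1 in enumerate(teams):
    for t2 in teams[i + 1:]:
        print(f"{t1} vs {t2}: {t1} win share {win_share(games[t1], games[t2]):.3f}")
```

Under these assumptions the solved averages round to the .156, .284, .373, and .441 quoted above, and the per-game standard deviations come out in the order HR < 2B < 1B+ < 1B.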

Of course, no team produces only one kind of hit. A more accurate model would let a team hit singles, doubles, triples, and home runs, take walks, and let runners sometimes take extra bases on a hit (e.g., go from first to third on a single) or advance one base on an out (e.g., on a sacrifice bunt or sacrifice fly). Using a Markov chain analysis, one can give a team a particular set of batting and base-running characteristics and then calculate its average RPG and the shape of its run distribution. Then, by playing different teams against one another, one can determine what modification to the Pythagorean expectation is needed to account for the shape of the run distribution.
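A stripped-down version of such a Markov chain fits in a few dozen lines. The sketch below tracks (outs, bases, runs) through an inning and returns the distribution of runs scored. It is a simplification, not the model from the paper. The event probabilities in `EXAMPLE_PROBS` are made up for illustration. On a hit, runners advance exactly as many bases as the batter, walks move only forced runners, and the model has no extra bases, productive outs, or double plays.

```python
from collections import defaultdict

# Made-up per-plate-appearance probabilities, for illustration only.
EXAMPLE_PROBS = {"OUT": 0.68, "BB": 0.08, "1B": 0.15, "2B": 0.05, "3B": 0.005, "HR": 0.035}
HIT_BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def advance(bases, event):
    """Apply a walk or hit to a base mask (bit 0 = first, bit 1 = second,
    bit 2 = third) and return (new_mask, runs_scored).
    Simplification: on a hit every runner moves as many bases as the batter."""
    if event == "BB":  # walks move only forced runners
        if not bases & 0b001:
            return bases | 0b001, 0
        if not bases & 0b010:
            return bases | 0b011, 0
        if not bases & 0b100:
            return 0b111, 0
        return 0b111, 1  # bases loaded: the runner on third is forced home
    n = HIT_BASES[event]
    runs, new = 0, 0
    for base in (1, 2, 3):
        if bases & (1 << (base - 1)):
            if base + n >= 4:
                runs += 1
            else:
                new |= 1 << (base + n - 1)
    if n == 4:
        runs += 1  # home run: the batter scores too
    else:
        new |= 1 << (n - 1)
    return new, runs


def inning_run_pmf(probs, max_pa=60):
    """Runs-per-inning distribution from forward iteration of the
    (outs, bases, runs) Markov chain; mass still live after max_pa PAs is dropped."""
    live = {(0, 0, 0): 1.0}
    done = defaultdict(float)
    for _ in range(max_pa):
        nxt = defaultdict(float)
        for (outs, bases, runs), pr in live.items():
            for event, pe in probs.items():
                if event == "OUT":  # outs never move runners in this sketch
                    if outs == 2:
                        done[runs] += pr * pe
                    else:
                        nxt[(outs + 1, bases, runs)] += pr * pe
                else:
                    new_bases, scored = advance(bases, event)
                    nxt[(outs, new_bases, runs + scored)] += pr * pe
        live = nxt
    return dict(done)


def expected_rpg(probs, innings=9):
    pmf = inning_run_pmf(probs)
    return innings * sum(r * pr for r, pr in pmf.items())


pmf = inning_run_pmf(EXAMPLE_PROBS)
print(f"P(scoreless inning) = {pmf[0]:.3f}")
print(f"Expected RPG = {expected_rpg(EXAMPLE_PROBS):.2f}")
```

As a sanity check, setting the probabilities to home runs 5/32 and outs 27/32 recovers the HR team’s 5.0 RPG. A per-game distribution follows by convolving nine independent innings.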

It turns out that slugging percentage (SLG) is the best additional parameter to use, even better than standard deviation. SLG is the total number of bases on hits (one for a single, two for a double, and so on) divided by the number of at-bats. For example, our HR team above has an SLG of .625, and the 2B team .567. For the same RPG, a team with a higher SLG tends to have a smaller standard deviation and will win more games. The new, improved winning percentage formula for Team 1 playing Team 2 is

for the current run environment where an average team scores about 4.5 RPG; R stands for RPG and S for SLG. This can be compared to the (modified) Pythagorean expectation formula

The net effect of the new formula is that each additional .080 in team SLG (for the same RPG) adds about one win over a 162-game season. Thus “.080 SLG = 1 win” can be used as a supplement to the standard sabermetric valuation “10 runs = 1 win.” For reference, the current range of team SLG in Major League Baseball is about .110, so at equal RPG even the largest gap between teams is worth only about 1.4 wins (.110/.080). The effect is small, but if you are looking for any small advantage, including SLG in your evaluation of players (above and beyond WAR itself) might be worth considering.
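Both SLG and the win conversions above are simple arithmetic. Here is a sketch with hypothetical helper names. Treating the “10 runs = 1 win” and “.080 SLG = 1 win” rules as additive is an assumption made here for illustration.

```python
def slugging(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases on hits divided by at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


RUNS_PER_WIN = 10.0  # "10 runs = 1 win"
SLG_PER_WIN = 0.080  # ".080 SLG = 1 win" at the same RPG, over 162 games


def extra_wins(run_diff=0.0, slg_diff=0.0):
    """Hypothetical helper: combine both rules of thumb additively (an assumption)."""
    return run_diff / RUNS_PER_WIN + slg_diff / SLG_PER_WIN


# The HR team: 5 home runs every 32 at-bats (AVG 5/32, about .156).
print(slugging(0, 0, 0, 5, 32))  # 0.625
# An SLG gap equal to the full MLB range of about .110:
print(f"{extra_wins(slg_diff=0.110):.1f} wins")  # 1.4 wins
```

For teams that hit only singles, such as the 1B and 1B+ teams, SLG simply equals AVG.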

The idea that a more consistent scoring team wins more can be applied in a variety of ways. For example, if consistent offense is better, consistent pitching is worse (all other things being equal). A pitcher who is inconsistent, with many good games and a few very bad ones, would be better for a team than a more consistent pitcher who allows the same runs per game: the inconsistent pitcher tends to lose the bad games, but there are fewer of them.

[1] This article is a very condensed version of the paper I presented at the 2010 MIT Sports Analytics Conference, which was chosen as the best paper submitted by an academic.