Barry Bonds is on the verge of breaking Hank Aaron’s all-time home run record in Major League Baseball. It’s inevitable. In fact, by the time you read this it might already have happened. But, eh, nobody seems to care much. Well, nobody outside of the Bay Area or Bristol, Connecticut.
This probably stems from the fact that hardly anybody not being paid by Barry seems to like Barry very much and that people think he’s cheated his way to the record through the use of various anabolic steroids, growth hormones (both human and cow apparently), and a couple of BALCO products known as “the cream” and “the clear.”
And of course flaxseed oil.
But other than once for amphetamines, Bonds has never tested positive for any PEDs. Still, this hasn’t ended the speculation.
For a few reasons there is still rampant suspicion that Bonds’ inflated power numbers come from better living through chemistry. First, there is his involvement with BALCO, Victor Conte, and trainer Greg Anderson. Then there’s this. Finally, just look at Barry. There has clearly been a dramatic change in his physique from his earliest days as a skinny kid in a Pittsburgh Pirate uniform to the Giant (both senses) he is today.
To quote Stuart Mackenzie, “Look at the size that boy’s heed. I’m not kidding, it’s like an orange on a toothpick.”
Still, most of the reporting remains focused on the feds, the remnants of the BALCO scandal, the toothless investigation by the Mitchell committee, or some other informant du jour. It’s almost as if the media and MLB are both waiting for someone to hand over a smoking gun registered to Barry with his prints on the still-warm grip (“Hey look, Godot!”). What there hasn’t been much of is an examination, from a purely statistical standpoint, of the probability that what Barry has done was legit.
What follows is an attempt to do just that. It’s an analysis of Bonds’ record-breaking 2001 single-season mark of 73 home runs. It’s long. Sorry. But, statistically, it’s interesting because Bonds really didn’t hit 73 home runs in 2001. Or at least you can almost prove it.
Admittedly that’s a bit of an irresponsible use of the word “prove.” The numbers don’t prove anything. And Bonds actually did hit 73 baseballs that cleared the fence during that season. What the numbers do show is that it was so improbable that it would almost be more rational to believe it didn’t happen.
Chicks Dig the Long Ball
It’s difficult to pinpoint exactly when steroids or other PEDs became enough of a problem to compromise the historical integrity of baseball’s stats, but the strike-shortened season of 1994 provides a decent enough breaking point. The numbers from that year really don’t mean much as the season was halted in August. And the following year, players started knocking the ball across zip codes.
In the entire history of baseball from 1876 to 1993, only two players—Babe Ruth in 1927 and Roger Maris in 1961—had ever hit 60 or more home runs in a single season. From 1995 to 2004 it happened six times.
Additionally, from the inception of the game through the 1993 season only 123 players had hit 40 or more home runs in a season. It was actually fewer players because, for example, Babe Ruth did it 11 times. So to be technical it’s 123 player-seasons. From 1995 through the 2004 season, there were 93 player-seasons of 40 or more home runs.
Put another way, the first 100 years of baseball had about 125 40-plus home run seasons. At the current rate, the next 100 years will have about 900.
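If you want to check that back-of-the-envelope math, here is a minimal Python sketch. It uses only the counts quoted above; scaling both eras to a per-100-seasons rate is my own framing for the comparison, and the variable names are just illustrative.

```python
# Counts quoted in the text; the per-100-seasons scaling is an assumption
# made here so the two eras can be compared on the same footing.
early_seasons = 1993 - 1876 + 1    # 1876 through 1993 inclusive
early_40hr_seasons = 123
recent_seasons = 2004 - 1995 + 1   # 1995 through 2004 inclusive
recent_40hr_seasons = 93

early_rate = early_40hr_seasons / early_seasons * 100
recent_rate = recent_40hr_seasons / recent_seasons * 100

print(f"Before 1994: about {early_rate:.0f} 40-HR seasons per 100 seasons")
print(f"1995 to 2004: about {recent_rate:.0f} 40-HR seasons per 100 seasons")
```

Either way you slice it, the recent pace is many times the historical one.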
So clearly, despite baseball’s love of its own timelessness, something about the game changed. The ball is juiced. The players are just bigger and stronger. The ballparks are smaller. Expansion has diluted the pitching. In 1993 the Colorado Rockies franchise had its inaugural season, so since then, National League teams would have been launching shots in the thin Denver air. There has been all kinds of theorizing, but even every team playing every pitch of every game in Denver probably wouldn’t get you to 900.
What Bonds did, though, was so statistically outlandish that it can’t reasonably be explained by any or all of those (except perhaps *ahem* “bigger and stronger”). To demonstrate this, a little statistics lesson is necessary. It’s pretty straightforward as there are only three things you need to understand: a mean, a standard deviation, and a normal distribution.
If you never took a basic stats class or even went to college but you’ve watched The Price Is Right, you’re two-thirds of the way home.
The mean is just the arithmetic average that you learned in grade school. Add up all of your data points, and divide the sum by the number of points there are. Cake.
Standard deviation is more or less a mathematical measure of how “spread out” your data are from the mean. As a quick example, look at these two sets of ten numbers:
A) 25, 28, 29, 21, 20, 17, 29, 33, 24, 24
B) 10, 0, 190, 3, 7, 28, 4, 2, 5, 1
Both sets have the same average of 25, but set B has a much larger standard deviation because the data are much more spread out from that average. If you calculate it out (using the sample formula that most spreadsheets default to), A has a standard deviation of about 4.9, while set B has one of about 58.5. That’s a relatively large difference.
(The formula for standard deviation is not given here, but for the intellectually inquisitive, it’s not hard to look up. Or if you are even lazier than I am, any stats package or spreadsheet application is likely to be able to do it in a couple of keystrokes.)
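For the curious, here is a short Python sketch that runs the numbers on sets A and B using the standard library’s statistics module. Note that there are two common flavors of standard deviation: the “sample” version divides by n - 1 and the “population” version divides by n. The sketch prints both, since which one you get depends on the tool and the button you press.

```python
import statistics

set_a = [25, 28, 29, 21, 20, 17, 29, 33, 24, 24]
set_b = [10, 0, 190, 3, 7, 28, 4, 2, 5, 1]

for name, data in (("A", set_a), ("B", set_b)):
    mean = statistics.mean(data)
    sample_sd = statistics.stdev(data)        # divides by n - 1
    population_sd = statistics.pstdev(data)   # divides by n
    print(f"Set {name}: mean={mean}, "
          f"sample SD={sample_sd:.2f}, population SD={population_sd:.2f}")
```

Whichever flavor you use, set B’s spread comes out more than ten times that of set A.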
The last concept, the normal distribution, is probably more familiarly identifiable as a “bell curve.” If that doesn’t do it for you, either peek down the page or think of the game Plinko from The Price Is Right. If you drop hundreds of pucks down the center of the Plinko board, most of them would end up bunched around the middle, some spread to the sides about the middle, and a few would be out toward the tail ends. Each puck would represent a data point, and cumulatively the pucks would look “normally” distributed. There is a mathematical function to describe the normal distribution, but even printing it here would probably put you to sleep. Just know that it is indeed shaped kind of like a bell and it is symmetrical around the middle.
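If you’d rather see the Plinko idea than take it on faith, here is a rough Python simulation. It is a simplified sketch, not the real game: it assumes every peg is a fair coin flip between left and right and ignores the walls of the board. The function name and the numbers of pucks and rows are made up for illustration.

```python
import random
from collections import Counter

def drop_pucks(n_pucks=1000, n_rows=12, seed=0):
    """Drop pucks through rows of pegs; each peg sends a puck left (0) or right (1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    slots = Counter()
    for _ in range(n_pucks):
        # The final slot is just the number of rightward bounces.
        slot = sum(rng.randint(0, 1) for _ in range(n_rows))
        slots[slot] += 1
    return slots

slots = drop_pucks()
for slot in range(13):
    # One '#' per five pucks gives a crude sideways histogram.
    print(f"{slot:2d} {'#' * (slots[slot] // 5)}")
```

The bars should pile up around the middle slots and thin out toward both ends, which is the bell shape in question.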
But there is a really important phenomenon of normally distributed data (and also really elegant, given how it combines all three relevant concepts here): About 68% of all your data points (your Plinko pucks) will be within one standard deviation of each side of the mean; and about 95% of your data will be within two standard deviations from the mean.
So look at the graph above. It’s of a normal distribution. The blue represents plus/minus one standard deviation, the pinkish represents plus/minus two. So for normally distributed data, almost everything sits within 2 standard deviations. If you go out to three (the green)? That’s 99.73% of your data. And that’s pretty much everything.
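Those percentages aren’t magic, and you can reproduce them in a few lines of Python. This sketch leans on a standard identity: for a normal distribution, the share of data within k standard deviations of the mean equals erf(k / sqrt(2)). The helper name share_within is my own.

```python
import math

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```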
Don’t Know Much About History?
Back to Bonds. Again, he hit 73 home runs at the age of 37. Historically, as most baseball players have gotten into and past their mid-30s, they experience a drop in hitting power. That translates into fewer homers.
So, after going through the data, I found at least 38 baseball players who had three characteristics that made them similar to Bonds.
First, they were sluggers. This was kind of an arbitrary definition. But for these purposes, a slugger was someone who had at least one season of 30 home runs or more. There’s nothing particular to that number other than that a 30-home run season is generally considered a solid benchmark of a power hitter.
Second, they played until at least the age of 37. That’s kind of self-explanatory. Bonds turned 37 in his record-setting year, so we are comparing him to players of the same age.
Third, in the season when the player turned 37, he was still an everyday player. Again, another arbitrary benchmark was chosen to define “everyday,” but if the player appeared in around 120 games (about 75% of the season) this was sufficient.
There are a couple of exceptions to these three criteria. Roberto Clemente is included even though his largest single-season home run total was 29. Mickey Mantle retired before age 37, so his HR total from age 36 is used. Also a few players did not play much at age 37 (Ott, Killebrew, Winfield). This is presumably because of injuries. Their numbers from age 36 were also used.
Additionally, with just a couple of exceptions, the players in the data set had a 30-HR season before 1994. This is so that the sample is (hopefully) free from having any players on PEDs in it.
In other words we are trying to get a non-steroid sample, and compare what Bonds did to that to show how statistically absurd his 73 home runs would be if he were indeed clean. Yes, players in the Seventies might have been on speed or cocaine. That’s a flaw that can’t be avoided and whether being wired on blow helps you hit more home runs or not is a separate argument.
Just meeting all the above criteria leaves you with a roster that reads like roll call at the Hall of Fame: Willie Mays, Mickey Mantle, Reggie Jackson, Babe Ruth, Ted Williams, Andre Dawson, Ernie Banks, Mike Schmidt, Hank Aaron, etc.
But by the time those guys hit age 37, collectively their power was clearly declining from their career highs. And the average number of home runs hit by the 38 players in the sample during the season they turned 37 was 22.32. For the same data the standard deviation was 9.19.
Just think about this qualitatively for a second. For decades of baseball you’ve got all of these great hitters, just legends. And when they get old, they start hitting 22 home runs, 11 home runs, 29 home runs, etc. The only two players with remotely gaudy numbers at that age were the game’s two greatest home run hitters: Ruth with 41 and Aaron with 47. Then suddenly here comes this one guy and, when he gets to be that old, he hits 73. That’s more than a 50% increase over the next best guy!
Looking at it that way, just on the surface it seems a little outlandish.
Using that mean and standard deviation calculated above—22.32 and 9.19 respectively—and the normal distribution we can calculate precisely how outlandish it is.
Relative to that same data set, Bonds’ home run total of 73 was 5.51 standard deviations from the mean.
Actually that should read: 5.51 STANDARD DEVIATIONS FROM THE MEAN!!!!
So, what does that mean?
Well, remember from the diagram above that more than 99% of your data should be within 3 standard deviations of the mean in a normal distribution. So, based upon the statistics of players who played before the time PEDs were thought to have become a problem in baseball, we could play hundreds and hundreds and hundreds more years of baseball and we would expect to find hardly anyone at age 37 hitting more than the mean plus 3 standard deviations, or about 50 home runs in a season. (This is kind of borne out in that even the greatest home run hitters ever managed only 41 and 47.)
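The arithmetic behind those two numbers is simple enough to check yourself. This Python sketch takes the mean (22.32) and standard deviation (9.19) quoted above as given, since the underlying 38-player data set isn’t reproduced here, and computes both the 3-standard-deviation ceiling and how far out 73 sits. The variable names are illustrative.

```python
mean_hr = 22.32   # average HR at age 37 for the sample, as quoted in the text
sd_hr = 9.19      # standard deviation for the same sample, as quoted
bonds_hr = 73

three_sd_ceiling = mean_hr + 3 * sd_hr
z_score = (bonds_hr - mean_hr) / sd_hr

print(f"Mean plus 3 standard deviations: {three_sd_ceiling:.1f} home runs")
print(f"Bonds' 73 sits {z_score:.2f} standard deviations above the mean")
```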
Bonds is another 2-plus standard deviations away! There simply aren’t outliers that far out. Just liars.
Okay, cheap shot. Yes. But we can put an actual number on the improbability.
Take that 5.51 number (for those of you who have had anything beyond a basic stats class, you might also know this as a Z-score) and compare it to what’s called a standard normal. A standard normal is just a normal distribution with a mean of zero and a standard deviation of one. By using a standard normal distribution and a Z-score you can calculate the probability of what Bonds did. Ready?
It’s so remote it doesn’t even exist.
Sort of. Going up from zero, most standard normal tables stop at Z-scores of 3.89. Huh?
What that means is once you get something that is roughly 4 or more standard deviations from the mean, statistically it’s so close to a zero-probability event that statisticians don’t even bother. Remember that almost all of the data points should be within 3 standard deviations, and the farther out you go the smaller the tails get.
Bonds was 5.51. Think about that for a second.
In fact the two textbooks I own with Z-score tables both stopped short of 4. I had to go search online for a table with values large enough.
Once you get to 5.51 standard deviations, the probability of a 37-year-old hitting 73 homers is .000000019 (that’s seven zeros before getting to that one-nine).
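You don’t actually need a table that goes out that far. Here is a hedged Python sketch that computes the upper-tail probability of a standard normal directly, using the identity P(Z > z) = 0.5 * erfc(z / sqrt(2)). It assumes, as the whole argument does, that home run totals at age 37 follow a normal distribution with the mean and standard deviation quoted earlier. The helper name upper_tail is my own, and the last digit of the result depends on how much you round the Z-score before looking it up.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Mean and standard deviation are the figures quoted in the text.
z = (73 - 22.32) / 9.19
p = upper_tail(z)

print(f"Z-score: {z:.2f}")
print(f"Tail probability: {p:.2e}")
print(f"Roughly 1 in {1 / p:,.0f}")
```

The result should land in the same neighborhood as the table value: a couple of chances in a hundred million, or one in tens of millions.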
To put it another way, from a statistical standpoint, almost 53 million sluggers need to play meaningful baseball through the age of 37 before you would expect to see one guy who hits 73 home runs in a season.
Fifty. Three. Million.
So far, in the 125-year history of the game, there have been about 50 sluggers playing meaningful baseball after age 37.
Again. Fifty-three million. Compare that to fewer than fifty.
If you think Bonds is 1-in-a-million, you are off by a factor of 53.