Numbers Don’t Lie, or Do They? Simpson’s Paradox Explains
December 2, 2009
Posted by dataduchess in Uncategorized.
Tags: baseball, data, graphs, statistics, unemployment, WSJ
The Numbers Guy over at The Wall Street Journal had a really interesting article today. He explains a concept called Simpson’s Paradox: a pattern that holds within every subgroup can reverse once the groups are combined, so aggregated data is sometimes misleading. For example,
… in both 1995 and 1996, Derek Jeter of the New York Yankees had a lower batting average for each season than David Justice, then of the Atlanta Braves.
Combining the two years, however, Mr. Jeter had a better average. The paradox resulted from the fact that in 1995 Mr. Jeter had only 48 at-bats with a .250 average while Mr. Justice had more at-bats (411) with a .253 average. The following year, Mr. Jeter had 582 at-bats with a .314 average while Mr. Justice had only 140 at-bats with a higher average of .321, pushing the two-year average in Mr. Jeter’s favor.
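To check the arithmetic, here is a quick sketch in Python. The at-bats and averages come straight from the quote above; the hit counts are not in the quote, so the code reconstructs them by rounding at-bats times average, which is an assumption on my part.

```python
# (at_bats, batting_average) per season, taken from the quoted passage
seasons = {
    "Jeter": {1995: (48, 0.250), 1996: (582, 0.314)},
    "Justice": {1995: (411, 0.253), 1996: (140, 0.321)},
}


def combined_average(player_seasons):
    """Pool hits and at-bats across seasons, then divide once."""
    total_hits = 0
    total_at_bats = 0
    for at_bats, average in player_seasons.values():
        # Assumption: hits are reconstructed by rounding at_bats * average,
        # since the quote gives only at-bats and averages.
        total_hits += round(at_bats * average)
        total_at_bats += at_bats
    return total_hits / total_at_bats


jeter = combined_average(seasons["Jeter"])
justice = combined_average(seasons["Justice"])

print(f"Jeter: {jeter:.3f}, Justice: {justice:.3f}")
# prints "Jeter: 0.310, Justice: 0.270"
```

Justice wins each season on its own, but most of his at-bats fall in his weaker year and most of Jeter’s fall in his stronger one. The combined figure is a weighted average, and the weights flip the result.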
Other examples of the paradox can be found in all types of data, from air travel delay statistics and medical procedure success statistics, to education and unemployment data.
In the graph below, you can see that the unemployment rate for each of the separate groups is higher now than it was in 1983. But because the group with the lower rate is now so much bigger, the overall unemployment rate is lower than it was in 1983.
Confused? Don’t worry about it. The lesson here is to be wary of “hard data,” and to remember that statistics can be spun to support opposing arguments. This WSJ graph shows that unemployment is both better than in 1983 and worse. It all depends on which point you want to make.