Welcome to Anscombe's Quartet, a classic lesson in statistical analysis. Created by the statistician Francis Anscombe in 1973, the quartet demonstrates why visual data exploration is crucial in statistics. It consists of four datasets whose summary statistics are nearly identical but which look vastly different when plotted. The lesson is that summary statistics without visualization can be misleading.
Here are the four datasets that make up Anscombe's Quartet. Each dataset contains eleven data points with x and y coordinates. Datasets one, two, and three share the same x values, the whole numbers from four to fourteen, and differ only in their y values. Dataset four is particularly interesting: ten of its points have the same x value of eight, and the remaining point sits at nineteen. While the raw numbers look quite different, they share remarkable statistical properties.
Here we see the remarkable statistical similarities between all four datasets. The mean of x is exactly nine point zero for all datasets. The mean of y is seven point five zero for all, to two decimal places. The sample variance of x is exactly eleven point zero across all datasets. The sample variance of y is approximately four point one two five. The correlation coefficient between x and y is about zero point eight one six for all datasets. Most surprisingly, to the precision shown, they all have the same linear regression line: y equals three point zero zero plus zero point five zero zero times x. An analysis based on these summary statistics alone would conclude that the datasets are essentially identical, which sets up the dramatic contrast that visualization will soon reveal.
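For anyone following along in code, here is a minimal sketch that recomputes these figures. The data values are not read out in this narration; they are taken from Anscombe's published 1973 table, and the names `QUARTET` and `summarize` are made up for illustration.

```python
from statistics import mean, variance

# Values from Anscombe's 1973 table (an assumption of this sketch; the
# narration does not list them). Datasets I to III share one x column.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics quoted in the narration for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)                 # sum of squares in x
    syy = sum((b - my) ** 2 for b in y)                 # sum of squares in y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx                                   # least-squares slope
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),                           # sample variance (n - 1)
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,                  # Pearson correlation
        "intercept": my - slope * mx,
        "slope": slope,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows should agree with each other to about two decimal places, which is exactly the point: nothing in this output hints at how different the datasets are.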
Now comes the dramatic moment of truth. When we plot all four datasets, we see completely different patterns despite their nearly identical statistics. Dataset one shows a clear linear relationship with points scattered around the regression line. Dataset two reveals a smooth quadratic curve, clearly not linear at all. Dataset three shows a perfect linear relationship except for one dramatic outlier that pulls the regression line away from the other points. Dataset four shows ten points stacked vertically at x equals eight, which on their own say nothing about how y depends on x, plus one influential point at x equals nineteen that creates the correlation by itself. The same regression line fits all four plots statistically, but each plot tells a completely different story visually.
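As a rough numeric companion to the plots, not a replacement for them, the sketch below lists each dataset's residuals from the shared line y equals three plus zero point five x, ordered by x. The data values come from Anscombe's 1973 table rather than from this narration, and `residuals_by_x` is a hypothetical helper name.

```python
# Values from Anscombe's 1973 table (an assumption of this sketch).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def residuals_by_x(x, y, intercept=3.0, slope=0.5):
    """Residuals y - (intercept + slope * x), listed in order of increasing x."""
    return [round(yi - (intercept + slope * xi), 2) for xi, yi in sorted(zip(x, y))]


for name, (x, y) in QUARTET.items():
    print(name, residuals_by_x(x, y))
```

Dataset two's residuals are negative at both ends and positive in the middle, the signature of a curve that a straight line cannot follow, while dataset three has a single residual far larger than the rest. A scatter plot shows the same things at a glance.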
Let's analyze each dataset in detail. Dataset one represents the ideal case for linear regression, showing a clear linear relationship with points scattered randomly around the regression line. Dataset two follows a smooth quadratic curve, demonstrating that a straight-line model is the wrong choice for this data pattern. Dataset three is an almost perfect linear relationship among ten of its points, but the one outlier at coordinates thirteen, twelve point seven four pulls the fitted line away from them. Dataset four is particularly interesting: ten points share the same x value, so there is no relationship to measure among them, and the single influential point at nineteen, twelve point five creates the entire correlation. Understanding these patterns helps us recognize when outliers and influential points can mislead our analysis.
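To see how much those two points matter, one option is to refit each dataset with the point left out. This is a minimal sketch under the same assumption as before: the data values come from Anscombe's 1973 table, and `fit_line` is an illustrative helper, not part of any library.

```python
from statistics import mean

# Datasets III and IV from Anscombe's 1973 table (an assumption of this sketch).
DATASET_3 = list(zip(
    [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
))
DATASET_4 = list(zip(
    [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
))


def fit_line(points):
    """Least-squares (slope, intercept, r), or None when x or y never varies."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # a regression line is undefined without spread
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5


# Dataset III with and without the outlier at (13, 12.74).
print("III, all points:    ", fit_line(DATASET_3))
print("III, outlier removed:", fit_line([p for p in DATASET_3 if p != (13, 12.74)]))

# Dataset IV with and without the influential point at (19, 12.50).
print("IV, all points:     ", fit_line(DATASET_4))
print("IV, point removed:  ", fit_line([p for p in DATASET_4 if p != (19, 12.50)]))
```

With the outlier removed, dataset three's remaining ten points fall almost exactly on a shallower line with a correlation very close to one. With the influential point removed, every x in dataset four equals eight, so the helper returns None: there is no line left to fit.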