Where did exploratory data analysis come from and who is John Tukey?

Ian Littlejohn
Jun 6, 2024

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John Wilder Tukey is one of the pioneers of data science. He is credited with a number of developments, made significant contributions to statistical practice and data analysis in general, and established many of the key foundations of what came to be known as data science.

He received his doctorate in mathematics at Princeton in 1939, became director of the Statistical Techniques Research Group when it was set up there in 1956, and was the first head of the Department of Statistics, which he founded at Princeton in 1965.

He is credited with coining the terms "software" and "bit". In a 1947 memo he abbreviated the term "binary information digit" to "bit", and in an article published in the American Mathematical Monthly in 1958 he introduced the word "software" to describe the programs on which electronic calculators ran.

During the 1960s, he challenged the dominance of what he called "confirmatory data analysis" (statistical analyses driven by rigid mathematical configurations). He argued for a more flexible approach: exploring data with an open mind to see what structures and information it might contain. He called this "exploratory data analysis".

"This is my favorite part about analytics: Taking boring flat data and bringing it to life through visualization."

Tukey also realized the importance of computer science to EDA. While much of his work focused on displays that could be drawn by hand, he recognized that computer graphics would be much more effective. He conceived PRIM-9, the first program for viewing multivariate data, which was released in 1974.

In 1970 he created the box plot (also called the box-and-whisker plot), and in 1977 he published a book titled "Exploratory Data Analysis", which presented the box plot to the world.

At the time of publishing his book, he felt that too much emphasis was being placed on statistical hypothesis testing and that more emphasis needed to be placed on using data to suggest hypotheses to test. He said that confusing the two types of analyses and using them on the same set of data could lead to systematic bias due to the issues inherent in testing hypotheses suggested by the data.

Exploratory data analysis involves analyzing and visualizing data to understand its key characteristics, uncover patterns, locate outliers and identify relationships between variables. A well-executed EDA should provide good initial insights and enable you to move on to data modeling and deeper data analysis.
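To make the idea of "locating outliers" concrete, here is a minimal sketch of the numbers behind a box plot: the five-number summary and a simple outlier check. It rests on some assumptions that are not in the text above. The readings are made-up values, the function names are hypothetical, and the cutoff of 1.5 times the interquartile range is the common convention for box-plot whiskers rather than something this article prescribes.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions would give slightly different numbers.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    k=1.5 is the usual box-plot convention (an assumption here).
    """
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample data, invented purely for illustration.
readings = [12, 14, 14, 15, 16, 17, 18, 19, 21, 58]

print(five_number_summary(readings))
print(flag_outliers(readings))
```

With these sample readings, the single large value stands apart from the rest, which is exactly the kind of thing a box plot makes visible at a glance. Whether such a point is an error or the most interesting observation in the data is a question for the analyst, not the arithmetic.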

"Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise."
