Rectangles and rainbows may not be the best way to visualize scientific data. Scientists and communicators are changing their thinking about the most appropriate way to represent data without bias.
Scientists and science communicators, we need to talk. It’s time to change our thinking about graphs.
One of the finest complements to any scientific study is a great visualization. A well-chosen graphic not only draws a reader in with aesthetic appeal but also shares information that makes the findings more understandable.
Sometimes it’s a compelling photo of the site of the study or a portrait of a device or subject. But when the take-home message is “how much?” the standby is a plain old graph.
Although standard graphic techniques like the bar graph and pie chart are widely accepted and understood, scientists are changing the way they think about conventional methods of visualizing data.
A quartet of editorials in a recent issue of the journal Nature Methods explains how simple graphing techniques can lead readers to inappropriate interpretations. Instead, the authors assert, less common (but more meaningful) visualization techniques should be more widely used.
Possibly the most common way to compare numbers visually is with a bar graph. When used appropriately, a bar graph uses the relative size of rectangles to represent the number of times something happened. Larger rectangles mean something happened more times.
Over time, the scientific community started accepting the use of bar graphs not just to display counts but also to visualize average values.
Let’s say you want to know if your cat’s body temperature goes up when it’s purring. So, you design an experiment, and for 10 days, you take your cat’s temperature at rest, pet the cat until it starts purring, and take its temperature again.
To display the results, many people would type their measurements into a spreadsheet, compute the average before and after petting, and make a bar graph of the averages. Although software like Microsoft Excel makes this task simple, representing the data this way may not be best. It may even be misleading.
Instead, the writers in Nature Methods say, a plotting technique like a box plot or its cousins the violin plot and the bean plot should be used to visualize summary data about multiple measurements. The problem with using a bar graph to display the average of several measurements is that even one spurious measurement can skew the average.
Let’s say that of your 10 measurements, nine are small numbers but one is very big, perhaps because your thermometer was broken or because your cat had a fever. The one big number might throw off the entire average. Using a bar graph, you might never know this was the case.
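To make that concrete, here is a minimal Python sketch. The temperature readings are hypothetical numbers invented for illustration (they are not data from any study), and the median is brought in only as a point of comparison, since it is the center line a box plot draws.

```python
from statistics import mean, median

# Hypothetical cat temperatures in degrees Celsius: nine plausible readings
# plus one spurious reading (say, from a broken thermometer).
readings = [38.4, 38.5, 38.6, 38.5, 38.4, 38.6, 38.5, 38.5, 38.6, 45.0]
typical = readings[:-1]  # the nine readings without the spurious one

mean_without = mean(typical)   # average of the nine typical readings
mean_with = mean(readings)     # average once the spurious reading is included
median_with = median(readings) # middle value, spurious reading included

print(f"mean of nine typical readings: {mean_without:.2f}")
print(f"mean with spurious reading:    {mean_with:.2f}")
print(f"median with spurious reading:  {median_with:.2f}")
```

With these made-up numbers, the single bad reading pulls the mean above every one of the nine typical readings, while the median barely moves. A bar whose height is the mean would hide that completely.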
Box plots, by contrast, tell us about the distribution of measurements in a study. Are they closely clustered together, or are they spread out? Are there individual measurements that are extremely dissimilar to the rest of the measurements?
Using a rectangle with “whiskers” to represent the main cluster of measurements and dots to represent “outlier” points, a box plot communicates much more information in the same space as a bar graph. Let’s consider the following example:
These two graphs represent the exact same made-up data (I made these with a random number generator). At left, a simple bar graph — ubiquitous in scientific reporting — displays the average value of a few items.
Here, the red bar isn’t very scary but the blue bar is highly scary. The green bar is somewhere in the middle—it definitely looks worse than the red bar. At right, the exact same data is shown using box plots. This tells us a lot more about the samples in this study.
It turns out that the red and green items are almost exactly the same. There are just two outlier points in the green column that were completely skewing the mean in the bar graph. The reader may draw a completely different conclusion from the box plot than from the bar graph, even though these are two visualizations of the exact same data.
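For readers who want to see where the box, whiskers, and outlier dots come from, here is a rough Python sketch. It assumes one common convention (often attributed to Tukey) in which points more than 1.5 times the interquartile range beyond the box are drawn as outliers; plotting packages differ on the details, and the figures in this post may not have used exactly this rule. The `red` and `green` samples are hypothetical stand-ins, not the random numbers behind the graphs above.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the numbers a box plot would draw for `values`.

    Assumes the 1.5 * IQR rule for outliers, which is a common
    convention but not the only one.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),    # lowest point still inside the fences
        "whisker_high": max(inside),   # highest point still inside the fences
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
        "mean": sum(values) / len(values),  # what a bar graph would show
    }

# Hypothetical samples: identical except for two extreme values in `green`.
red = [4, 5, 5, 6, 6, 7, 7, 8]
green = [4, 5, 5, 6, 6, 7, 30, 32]

red_summary = box_plot_summary(red)
green_summary = box_plot_summary(green)
print(red_summary)
print(green_summary)
```

In this toy case the two samples share the same median and the same lower quartile, but the mean of `green` is nearly double the mean of `red` because of the two flagged outliers. That is the same pattern as the red and green items in the example.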
So, if box plots are more appropriate than bar graphs at representing data, why are bar graphs used so often? One plausible explanation is that there are barriers to visualization alternatives. For example, spreadsheet software like Microsoft Excel can make bar graphs but not box plots, so many researchers may be inclined to use a bar graph out of convenience, rather than choosing a better representation of their data.
The bar graph isn’t the only visualization that is commonly misused, however. Another visualization mistake that has been called “harmful” is the rainbow color map, one of the most common color maps in use today. Color maps are used when scientists want to display information that changes with position.
If you have a bunch of measurements with some kind of spatial relationship, you can visualize them by changing the color of each point based on its value. You’ve probably seen this technique used in situations like choropleth maps that use color to display parameters like population density or surface temperature.
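Under the hood, a color map is just a rule that turns a number into a color. Here is a minimal Python sketch of the simplest such rule, a straight-line blend between two colors. The function name `value_to_color`, the default endpoint colors, and the population densities are all hypothetical choices made for illustration; real plotting packages offer many more carefully designed maps.

```python
def value_to_color(value, vmin, vmax,
                   low_rgb=(255, 245, 240), high_rgb=(103, 0, 13)):
    """Linearly map `value` in [vmin, vmax] to an RGB color.

    `low_rgb` is used for vmin and `high_rgb` for vmax; both defaults
    are arbitrary light-to-dark reds chosen for this sketch.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)  # rescale to the range 0..1
    t = min(1.0, max(0.0, t))           # clamp values outside the range
    return tuple(round(lo + t * (hi - lo)) for lo, hi in zip(low_rgb, high_rgb))

# Hypothetical population densities (people per square kilometer).
densities = {"A": 12, "B": 150, "C": 480, "D": 900}
vmin, vmax = min(densities.values()), max(densities.values())
colors = {name: value_to_color(d, vmin, vmax) for name, d in densities.items()}
print(colors)
```

Every design decision in that rule (the endpoint colors, the range, whether the blend is a straight line) changes how the same numbers look on the page.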
Rainbow color maps use our good friend Roy G. Biv to display measurement values. Blue (cold) colors are used to represent low values, and red (hot) represents high values.
It’s a popular choice, in part because brightly colored rainbows are visually appealing. But it may also be a popular choice simply because it’s the default color map in many graphical plotting software packages.
The problem with the rainbow color map is that it visually stratifies information, affecting the viewer’s interpretation. Observers mentally distinguish blue from yellow from red from all the other colors, potentially leading to false interpretations that the colors represent different “zones.” Consider the following visualizations:
These images represent the walls of a blood vessel in a human. The same vessel with the same data is plotted at left and right, and the only difference is the color map. This is a gross oversimplification, but the color of the vessel is related to the likelihood that atherosclerosis will occur at any given point.
Look at the image at left where the arrow is pointing. It looks like those wall values are changing pretty dramatically between the dark blue and the light blue zones. But in the image at the right below the arrow, the same interpretation is unlikely. That’s because the difference in values between the dark blue and the light blue is actually very small; the abrupt change in color makes it look much larger than it is.
If I played around with the assignment of color transitions in that color mapping, I could make the line between the two shades of blue appear to move (as well as the transition between any two other colors) even though the data is exactly the same. In the image at right, the difference between dark red and light red is much more subtle, and there is much less ability to play games with the color map to alter readers’ interpretations.
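Here is a small Python sketch of that game. It treats the rainbow as a handful of distinct color bands, which is a deliberate simplification of how viewers tend to read it rather than how the map is actually drawn. The band names, breakpoints, and wall values are all hypothetical.

```python
from bisect import bisect_right

# Simplified rainbow: five named bands, ordered from low to high values.
BANDS = ["dark blue", "light blue", "green", "yellow", "red"]

def band_for(value, breakpoints):
    """Return the band a value falls into, given four ascending breakpoints."""
    return BANDS[bisect_right(breakpoints, value)]

def first_boundary(values, breakpoints):
    """Index of the first sample whose band differs from the one before it."""
    labels = [band_for(v, breakpoints) for v in values]
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            return i
    return None

# Hypothetical values along the vessel wall: a slow, steady increase.
wall_values = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24]

# Two color assignments that differ only in where dark blue becomes light blue.
breaks_a = [0.15, 0.40, 0.60, 0.80]
breaks_b = [0.21, 0.40, 0.60, 0.80]

boundary_a = first_boundary(wall_values, breaks_a)
boundary_b = first_boundary(wall_values, breaks_b)
print(boundary_a, boundary_b)
```

The wall values never change, and they rise by the same small amount at every step, yet the apparent edge of the dark blue "zone" lands at a different spot depending on where I put the first breakpoint.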
Visualizations are critical tools for both scientists and science communicators. They help us interpret findings in a much more engaging fashion than a wall of numbers written into a boring data table. But visualizations come with their share of pitfalls. Seemingly innocuous design choices can have major impacts on viewers’ interpretation of the data set.
I’m happy to see that the scientific community is making an effort to better define best practices for plotting and sharing information.