February 17, 2020

Top Five: Ways to Mislead with Data Visualizations

By John Emery

Presenting data visually (as a bar or line chart, a scatter plot, a map, and so on) is a powerful storytelling technique. Instead of sifting through hundreds or thousands of individual data points, we can aggregate them to show trends, patterns, or outliers. Tools like Tableau and Power BI give analysts and designers unprecedented ability to explore massive amounts of data and become visual storytellers. That ease of access, unfortunately, makes it very easy to mess up, mislead, or flat-out lie with data.

Unscrupulous people can manipulate data to tell the story they want to tell, and non-data-savvy viewers may not know the difference. In an era of 24/7 news and rampant misinformation, being data literate is more important than ever.

In this post, I’ll go over five common and easy ways to mislead with data visualizations. I made all of the examples with Tableau and simple Excel spreadsheets. None of these examples took me more than a couple of minutes to put together — that’s how easy it is to mislead with our modern tools.

1 - Abbreviated Axes

Bar charts are among the most used chart types. When we see a bar chart, we compare the lengths or heights of the bars to one another to come to some conclusion. They are easy to build and almost everybody has an understanding that bigger bars = more stuff.

In the bar chart above, we can easily compare the visitation statistics for these selected National Parks. The scale on the y-axis runs from 0 to 2.2 million. The scale on the chart below, however, starts at 1.4 million…

… and makes it look like Cuyahoga Valley and Gateway Arch receive significantly more visitors than the others.

By truncating the y-axis, I’ve narrowed the range the scale covers, which magnifies the apparent differences between bars. This trick can mislead viewers into believing differences are greater than they actually are.
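To put a number on that magnification, here is a minimal Python sketch. The function name and the visitor counts are made up for illustration (they are not the figures from the charts above); only the 1.4 million axis start comes from the example.

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b
    when the axis begins at axis_start instead of zero."""
    if min(a, b) <= axis_start:
        raise ValueError("both values must be above the axis start")
    return (a - axis_start) / (b - axis_start)


# Hypothetical visitor counts, chosen only to illustrate the effect.
park_a = 2_000_000
park_b = 1_600_000

honest = drawn_height_ratio(park_a, park_b)                # axis starts at 0
truncated = drawn_height_ratio(park_a, park_b, 1_400_000)  # axis starts at 1.4 million

print(f"Zero-based axis: bar A is {honest:.2f}x the height of bar B")
print(f"Truncated axis:  bar A is {truncated:.2f}x the height of bar B")
```

With these assumed numbers, a real difference of 25% is drawn as a bar three times as tall.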

2 - Dualing Data

A dual-axis chart can be an excellent visualization choice to show two related series of data that don’t share the same scale. They are often used where one set of values may be very large while the other is very small.

A dual-axis chart with synchronized axes.

In the chart above, we can see that the number of home runs hit during a season is quite a bit larger than the number of intentional walks. We can also see that home runs have spiked in recent years while walks have been trending downward for over a decade. Because I synchronized the axes we can trust that apples-to-apples comparisons are accurate and we can make further conclusions from there. In the chart below, however…

A misleading dual-axis chart with non-synchronized axes.

… I did not synchronize the axes. As with the truncated axis we saw above, this exaggerates differences between values. It also makes it look like there used to be more walks than home runs, which we know from the first chart has never been the case.

This misuse of dual-axis charts appeared during a Congressional hearing in 2015 with an added twist: there were no axis labels on the chart. With no labels, it is impossible to say what’s truly going on without digging into the raw data. This is a very nefarious way of obscuring data to support a given agenda. Please don’t do this.
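The crossover illusion from non-synchronized axes is easy to reproduce with a little arithmetic. The sketch below is an assumption-laden illustration: the season totals and axis ranges are hypothetical, and `axis_position` is a helper I made up to stand in for what a charting tool does when it places a point.

```python
def axis_position(value, axis_min, axis_max):
    """Fraction of the plot height (0 = bottom, 1 = top) where a value is drawn."""
    return (value - axis_min) / (axis_max - axis_min)


# Hypothetical season totals, not the real figures from the charts.
home_runs = 5000
walks = 1200

# Synchronized: both series share one 0-6000 range.
sync_hr = axis_position(home_runs, 0, 6000)
sync_walks = axis_position(walks, 0, 6000)

# Non-synchronized: each series gets its own (hypothetical) axis range.
unsync_hr = axis_position(home_runs, 4000, 7000)
unsync_walks = axis_position(walks, 800, 1400)

print(f"Synchronized:     HR at {sync_hr:.2f}, walks at {sync_walks:.2f}")
print(f"Non-synchronized: HR at {unsync_hr:.2f}, walks at {unsync_walks:.2f}")
```

Under these assumptions, the walks line is drawn above the home run line on the non-synchronized chart, even though the underlying value is less than a quarter as large.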

3 - Confusing Charts

The different chart types we know and love today weren’t invented just to look pretty. Bar charts are excellent at comparing a single measure (such as sales, or weight, or temperature) across various dimensions (such as companies, or teams, or widgets). Line charts show the change of a measure over time. Scatter plots are great at showing the relationship between two measures. Pie charts show parts of a whole.

The choice of chart type is very important. When you use an inappropriate chart type the results can be highly misleading or confusing. Take the pie chart below…

A very confusing pie chart.

… what’s going on here? Each slice of the pie is the same size, but the labels show differing values which add up to 193%.

This is a classic example of data taken from a poll leading up to the 2012 US Presidential election. Using a pie chart, in this case, is wrong because each person polled was asked their opinion about each candidate. We shouldn’t expect the favorability values to sum to 100% — this cross-section of data doesn’t represent a part-to-whole relationship, so a pie chart is a terrible choice.

In this example, a trusty bar chart would have been far superior. In addition to changing the chart type, the labeling needs to be fixed so viewers know what they are seeing. Let’s give this a shot…

I elected to visualize all four categories of favorability per candidate. Notice that I also rewrote the title to indicate:

  1. Who was polled for these numbers: Republican voters
  2. That each person polled was asked their opinion of each candidate

The data for this chart is good and interesting. By providing adequate labeling and depicting the data properly we can tell a much clearer story.
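A quick sanity check before reaching for a pie chart is to confirm the values actually form a whole. Here is a minimal sketch, assuming the shares are stored as percentages in a dictionary. The candidate names are placeholders, and the three values are illustrative numbers chosen only so that they sum to the 193% mentioned above.

```python
def is_part_to_whole(shares_percent, tolerance=1.0):
    """Return True if the shares sum to roughly 100%, i.e. a pie chart is plausible.

    The tolerance (an assumed default) allows for rounding in published figures.
    """
    return abs(sum(shares_percent.values()) - 100.0) <= tolerance


# Placeholder names and illustrative values that sum to 193.
favorability = {"Candidate A": 70, "Candidate B": 63, "Candidate C": 60}

# A hypothetical single-choice poll, where each respondent picks exactly one option.
single_choice = {"Candidate A": 45, "Candidate B": 35, "Candidate C": 20}

print(is_part_to_whole(favorability))   # each respondent rated every candidate
print(is_part_to_whole(single_choice))  # each respondent chose one candidate
```

If the check fails, the data is not a part-to-whole breakdown, and a bar chart is usually the safer choice.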

4 - Choropleth Coloring

We all love maps. Maps are often colorful, insightful, and easy to read. Choropleth maps, which color geographic regions based on some measurement, are common due to their ease of use and familiarity.

In this section, I want to show two ways that these maps can mislead:

  1. Absolute vs. relative coloring
  2. Binning

4a - Absolute vs. Relative Coloring

Oftentimes, when we color a choropleth map by some absolute value (number of traffic accidents, subscribers to a website), the result more-or-less mirrors a choropleth map of the population at large…

… as seen in these two maps above. As it turns out, the number of assaults is highly correlated with the total population. Intuitively, this makes sense: unless someone in Loving County, TX (population ~134) is really stabby, there will probably be fewer assaults there than in Miami-Dade County, FL (population ~2.7 million).

A better and often more interesting way to view this data is on a relative basis. Let’s map assaults by county again, but this time color by per capita assaults…

… this paints a much different picture. We can now see that assaults are more evenly distributed throughout the country. Instead of making it look like a certainty that you’ll be assaulted when you visit Southern California, now you can worry about visiting Shreveport, Louisiana.
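The switch from absolute to relative coloring is just a division. The sketch below uses the approximate populations mentioned above, but the assault counts are hypothetical placeholders, not real statistics; `per_100k` is a helper name I chose for illustration.

```python
def per_100k(count, population):
    """Convert a raw count into a rate per 100,000 residents."""
    return count / population * 100_000


# Populations are the approximate figures from the text.
# The assault counts are hypothetical, for illustration only.
counties = {
    "Loving County, TX": {"population": 134, "assaults": 1},
    "Miami-Dade County, FL": {"population": 2_700_000, "assaults": 10_000},
}

rates = {
    name: per_100k(c["assaults"], c["population"])
    for name, c in counties.items()
}

for name, rate in rates.items():
    print(f"{name}: {counties[name]['assaults']:,} assaults, {rate:.0f} per 100k")
```

With these made-up counts, the large county dominates the absolute map while the tiny county has the higher rate. It also hints at a caveat of per capita maps: in very small populations, a single incident can swing the rate dramatically.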

4b - Binning

US Presidential elections can be divisive and controversial.[citation needed] The most common way that election maps are drawn is by coloring each county by which party won: red for Republicans and blue for Democrats. This leads to maps like this…

… which makes it seem like Republican voters absolutely dominated Democratic voters across the country.

The raw data tells a different story: the Democratic candidate, Hillary Clinton, actually received more votes than her Republican counterpart, which is impossible to tell from the map above without outside knowledge. Another issue with the map above is that it views things as black and white (or red and blue) and ignores the fact that some people in Blue counties voted Red and vice versa.

A more nuanced view of the choropleth map with relative coloring

This map is much more nuanced. It colors counties according to the proportion of the county that voted Republican or Democratic. Additionally, I abstracted each county as a circle and sized the circles by the total number of votes cast in that county. While a view like this is not perfect, it tells a clearer story than our absolutist map above.
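To see how winner-take-all binning can hide the overall result, consider this small sketch. The four counties and their vote totals are entirely hypothetical; they are not taken from any real election.

```python
# (county name, Republican votes, Democratic votes); all values hypothetical.
counties = [
    ("Rural A", 6_000, 4_000),
    ("Rural B", 5_500, 4_500),
    ("Rural C", 7_000, 3_000),
    ("Urban D", 200_000, 300_000),
]

# Winner-take-all binning: each county is simply red or blue.
red_counties = sum(1 for _, rep, dem in counties if rep > dem)
blue_counties = len(counties) - red_counties

# Raw totals: what the binned map cannot show.
total_rep = sum(rep for _, rep, _ in counties)
total_dem = sum(dem for _, _, dem in counties)

# Relative coloring: Republican share of the vote, plus total votes for sizing.
shares = {name: (rep / (rep + dem), rep + dem) for name, rep, dem in counties}

print(f"Counties won: {red_counties} red, {blue_counties} blue")
print(f"Votes cast:   {total_rep:,} Republican, {total_dem:,} Democratic")
for name, (rep_share, votes) in shares.items():
    print(f"{name}: {rep_share:.0%} Republican of {votes:,} votes")
```

In this toy example the binned map would be three-quarters red, while the Democratic candidate has more votes overall. Coloring by share and sizing by total votes surfaces both facts.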

5 - Horrible Histograms

I want to end this post by talking for a minute about histograms. They’re great at showing the distribution of data by counting the frequency of certain values within predetermined bins.

Histograms can be misleading, however. With improper binning…

… the true shape of the data can be obscured. With a bin size of 20, it looks like we have a large spike between 80 and 100 and that’s it. But then we reduce the bin size to 10 and see a second spike come out of nowhere between 120 and 130. And then with a bin size of 1, we see that there are actually three peaks within the data.

There are plenty of ways to calculate the appropriate bin size (Tableau will do it for you automatically if you ask nicely). To avoid the mistake of misleading or obscuring data, investigate different bin sizes to make sure the story you’re telling is truthful and meaningful.
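Trying several bin sizes takes only a few lines of code. This is a minimal sketch using hypothetical integer data with three clusters. It is not the dataset behind the charts above, and `histogram` is a helper written for this example rather than any tool's built-in.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data with clusters around 85, 95, and 125.
values = [
    83, 84, 85, 85, 85, 86, 87,
    90,
    93, 94, 95, 95, 95, 96, 97,
    105, 115,
    123, 124, 125, 125, 125, 126, 127,
]

for width in (20, 10, 1):
    print(f"bin width {width}: {histogram(values, width)}")
```

With a bin width of 20, the clusters around 85 and 95 collapse into a single bar. A width of 10 separates them, and a width of 1 shows all three peaks individually.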
