Many of us know the debt that comes along with undergraduate degrees. Some of you may still be paying yours down. But what about graduate degrees? A recent article from the Wall Street Journal examined the discrepancies between debt incurred in 2015–16 and the income earned two years later.
The designers used dot plots for their comparisons, which narratively reveal themselves through a scrolling story. The author focuses on the differences between the University of Southern California and California State University, Long Beach. This screenshot captures the differences between the two in both debt and income.
Some simple colour choices guide the reader through the article and their consistent use makes it easy for the reader to visually compare the schools.
From a content standpoint, these two series, income and debt, can be combined to create an income to debt ratio. Simply put, does the degree pay for itself?
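That ratio is trivial to compute. A minimal sketch, with made-up figures standing in for the WSJ's median debt and median income numbers:

```python
# Hypothetical figures for illustration only; the WSJ piece reports
# median debt for 2015-16 graduates and median income two years later.
programs = {
    "School A, MFA": {"median_debt": 120_000, "median_income": 45_000},
    "School B, MBA": {"median_debt": 60_000, "median_income": 110_000},
}

for name, d in programs.items():
    # A ratio above 1 means annual income exceeds the debt taken on.
    ratio = d["median_income"] / d["median_debt"]
    print(f"{name}: income-to-debt ratio = {ratio:.2f}")
```

With numbers like these, the first programme's degree very much does not pay for itself, while the second's does.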
What’s really nice from a personal standpoint is that the end of the article features an exploratory tool that allows the user to search the data set for schools of interest. Better still, the tool isn’t limited to graduate degrees; you can search for undergraduate degrees as well.
Below the dot plot you also have a table that provides the exact data points, instead of cluttering up the visual design with that level of information. And when you search for a specific school through the filtering mechanism, you can see that school highlighted in the dot plot and brought to the top of the table.
Fortunately my alma mater is included in the data set.
Unfortunately you can see that the data suggests that graduates with design and applied arts degrees earn less (as a median) than they spend to obtain the degree. That’s not ideal.
Overall this was a really nice, solid piece. And probably speaks to the discussions we need to have more broadly about post-secondary education in the United States. But that’s for another post.
Credit for the piece goes to James Benedict, Andrea Fuller, and Lindsay Huth.
Winter is coming? Winter is here. At least meteorologically speaking, because winter in that definition lasts from December through February. But winters in Philadelphia can be a bit scattershot in terms of their weather. Yesterday the temperature hit 19ºC before a cold front passed through and knocked the overnight low down to 2ºC. A warm autumn or spring day to just above freezing in the span of a few hours.
But when we look more broadly, we can see that winters range just that much as well. And look the Philadelphia Inquirer did. Their article this morning looked at historical temperatures and snowfall and whilst I won’t share all the graphics, it used a number of dot plots to highlight the temperature ranges both in winter and across the whole year.
The screenshot above focuses attention on the range in January and July and you can see how the range between the minimum and maximum is greater in the winter than in the summer. Philadelphia may have days with summer temperatures in the winter, but we don’t have winter temperatures in summer. And I say that’s unfair. But c’est la vie.
Design wise there are a couple of things going on here that we should mention. The most obvious is the blue background. I don’t love it. As it stands, the blue dots that represent colder temperatures recede into and blend with the background, especially around that 50ºF mark. If the background were white or even a light grey, we would be able to see the full range of the temperatures clearly, without the illusory gap that appears in those January temperature observations.
Less visible here is the snowfall. If you look just above the red dots representing the range of July temperatures, you can see a little white dot near the top of the screenshot. The article has a snowfall effect with little white dots “falling” down the page. I understand how the snowfall fits with the story about winter in Philadelphia. Whilst the snowfall is light enough to not be too distracting, I personally feel it’s a bit too cute for a piece that is data-driven.
The snowfall is also an odd choice because, as the article points out, Philadelphia winters do feature snowfall, but on days when precipitation falls, snow accounts for less than a third of them, with rain and wintry mixes accounting for the vast majority.
Overall, I really like the piece as it dives into the meteorological data and tries to accurately paint a portrait of winters in Philadelphia.
And of course the article points out that the trend is pointing to even warmer winters due to climate change.
Credit for the piece goes to Aseem Shukla and Sam Morris.
First, a brief housekeeping thing for my regular readers. It is that time of year, as I alluded to last week, where I’ll be taking quite a bit of holiday. This week that includes yesterday and Friday, so no posts. After that, unless I have the entire week off—and I do on a few occasions—it’s looking like three days’ worth of posts, Monday through Wednesday. Then I’m enjoying a number of four day weekends.
But to start this week, we have Game 6 of the World Series tonight between the Atlanta Braves and the Houston Astros. That should be the Braves vs. the Red Sox, but whatever. If you want your bats to fall asleep, you deserve to lose. Anyways, rest in peace, RemDawg.
Yesterday the BBC posted an article about baseball, which is weird at first because baseball is far more an American sport, played in relatively few other countries. Here’s looking at you, Japanese gold medal for the sport earlier this year. Nevertheless I fully enjoyed having a baseball article on the BBC homepage. But beyond that, it also combined baseball with history and with data and its visualisation.
You might say they hit the sweet spot of the bat.
There really isn’t much in the way of graphics, because we’re talking about work from the 1910s. So I recommend reading the piece; it’s fascinating. Overall it describes how Hugh Fullerton, a sportswriter, determined that the 1919 White Sox had thrown the World Series.
Fullerton, long story short, loved baseball and he loved data. He went to games well before the era of Statcast and recorded everything from pitches to hits and locations of batted balls. He used this to create mathematical models that helped him forecast winners and losers. And he was often right.
For the purposes of our blog post, he explained in 1910 how his system of notations worked and what it allowed him to see in terms of how games were won and lost. Below we have this screen capture of the only relevant graphic for our purposes.
In it we see the areas where the batter is likely safe or out depending upon where the ball is hit. Along the first and third base foul lines we see thin strips of what all baseball fans fear: doubles or triples down the line. If you look closely you can see the dark lines become small blobs near home plate. We’ve all seen those little tappers off the end of the bat that die, effectively a bunt.
Then in the outfield we have the two power alleys in right- and left-centre. When your favourite power hitter hits a blast deep to the outfield for a home run, it’s usually in one of those two areas.
We also have some light grey lines, which mark where batted balls are more likely to get through the infielders. We are talking ground balls up the middle and between the middle infielders and the corners. Of course this was baseball in the early 20th century. And while, yes, shifting was a thing, it was nowhere near as prevalent. Consequently defenders usually lined up in their regular positions, and these grey lines correspond to those defensive alignments.
Finally the vast majority of the infield is coloured another dark grey, representing how infielders can usually soak up any groundball and make the play.
The whole article is well worth the read, but I loved this graphic from 1910 that explains (unshifted) baseball in the 21st century.
I will try to get to my weekly Covid-19 post tomorrow, but today I want to take a brief look at a graphic from the New York Times that sat above the fold outside my door yesterday morning. And those who have been following the blog know that I love print graphics above the fold.
Of the six-column layout, you can see that this graphic gets three, in other words half a page’s width, and the accompanying column of text for the article brings this to nearly 2/3 of the front page.
When we look more closely at the graphic, you can see it consists of two separate parts, a scatter plot and a line chart. And that’s where it begins to fall apart for me.
The scatter plot uses colour to indicate the vote share that went to Trump. My issue with this is that the colour isn’t necessary. If you look at the top for the x-axis labelling, you will see that the axis represents that same data. If, however, the designer chose to use colour to show the range of the state vote, well that’s what the axis labelling should be for…except there is none.
If the scatter plot used proper x-axis labels, you could easily read the range on either side of the political spectrum, and colour would no longer be necessary. I don’t entirely understand the lack of labelling here, because on the y-axis the scatter plot does use labelling.
On a side note, I would probably have added the US unvaccination rate as a benchmark, to see which states sit above and below the US average.
Now if we look at the second part of the graphic, the line chart, we do see labelling for the axis. But what I’m not fond of here is that the line for counties with large Trump shares significantly exceeds the maximum range of the chart. And then for the 0.5 deaths per 100,000 line, the dots mysteriously end short of the end of the chart. It’s not as if the line would have overlapped with the data series. And even if it did, that’s the point of an axis line: so the user can know when the data has exceeded an interval.
I really wanted to like this piece, because it is a graphic above the fold. But the more I looked at it in detail, the more issues I found with the graphic. A couple of tweaks, however, would quickly bring it up to speed.
Last week I wrote about how CBS News’ coverage of the California recall election featured a misleading graphic. In particular, the graphic created the appearance that the results were closer than they really were.
This week we had another election and, sadly, I find that I have to write the same sort of piece again. Except this time we are headed north of the border to Canada.
I was watching CBC coverage last night and I noticed early on that the vote share bar chart looked off given the data points. Next time it popped up I took a screenshot.
First we need to note these are three-dimensional and the camera angle kept swinging around—not ideal for a fair comparison. This was the most straight-on angle I captured.
Second, at first glance, we have the Conservative share at a little more than 3/4 the Liberal vote share. That looks to be about right. Then you have the New Democratic Party (NDP) at roughly half the vote of the Conservatives. And the bar looks about half the height of the blue Conservative bar. Checks out. Then you have the People’s Party of Canada at roughly 1/4 the amount of NDP votes. But now look at the bar’s height. The purple bar is nearly the same height as the orange bar.
Clearly that is wrong and misleading.
The problem, I think, is that the designers artificially inflated the height of the bars to include the labels and data points for the bars. The designers should have dropped the labelling below the bars and let the bars only represent the data.
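The suspected mechanism is easy to reproduce in a few lines. The vote shares below are made up for illustration, but the logic holds: if every bar receives a fixed chunk of extra height to house its label, the ratios between the shorter bars collapse.

```python
# Hypothetical vote shares (percent) standing in for the CBC figures.
shares = {"LIB": 32.0, "CON": 34.0, "NDP": 18.0, "PPC": 5.0}

LABEL_PAD = 40  # fixed pixels added to every bar for its label (the suspected cause)
SCALE = 4       # pixels per percentage point

def honest_height(value):
    # Bar height proportional to the data, nothing else.
    return value * SCALE

def inflated_height(value):
    # Bar height plus a constant pad, which distorts small values the most.
    return value * SCALE + LABEL_PAD

# Honestly drawn, the PPC bar is about 28% the height of the NDP bar...
print(honest_height(shares["PPC"]) / honest_height(shares["NDP"]))    # ~0.28
# ...but with the fixed pad it appears more than half as tall.
print(inflated_height(shares["PPC"]) / inflated_height(shares["NDP"]))  # ~0.54
```

The constant pad adds proportionally more to the smallest bars, which is exactly the pattern visible in the screenshot.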
I created the following graphic to show how the chart should have looked.
Here you can more clearly see how much greater the NDP’s vote was than the People’s Party’s. The labelling falls below the charts and doesn’t distort the height comparison between the bars. In some respects, it wasn’t even close. But the original graphic made it look otherwise.
I just wish I knew what the designers were thinking. Why did they inflate the bars? Like with the CBS News graphic, I hope it wasn’t intentional. Rather, I hope it was some kind of mistake or even ignorance.
Credit for the original piece goes to the CBC graphics department.
One of the long-running critiques of Fox News Channel’s on-air graphics is that they often distort the truth. They choose questionable if not flat-out misleading baselines and scales, and adjust other elements to create differences where they don’t exist or to smooth over problematic issues.
But yesterday a friend sent me a graphic that shows Fox News isn’t alone. This graphic came from CBS News and looked at the California recall election vote totals.
If you just look at the numbers, 66% and 34%, we can see that 34 is almost half of 66. So why does the top bar look more like 2/3 of the length of the bottom? I don’t actually know the intent of the designer who created the graphic, but I hope it’s more ignorance or sloppiness than malice. I wonder if the designer simply said: 66%, well, that means the top bar should be, like, two-thirds the length of the bottom.
The effect, however, makes the election seem far closer than it really was. For every yes vote, there were almost two no votes. And the above graphic does not capture that fact. And so my friend asked if I could make a graphic with the correct scale. And so I did.
One really doesn’t need a chart to compare the two numbers. And I touch on that with the last point, using two factettes to simply state the results. But let’s assume we need to make it sexy, sizzle, or flashy. Because I think every designer has heard that request.
A simple scale of 0 to 66 could work and we can see how that would differ from the original graphic. Or, if we use a scale of 0 to 100, we can see how the two bars relate to each other and to the scale of the total vote. That approach would also have allowed for a stacked bar chart as I made in the third option. The advantage there is that you can easily see the victor by who crosses the 50% line at the centre of the graphic.
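As a quick sanity check on any of those options: on a linear scale with a zero baseline, bar length is simply proportional to the value, so the Yes bar must come out at about half the No bar. A minimal sketch (chart width is arbitrary):

```python
no_share, yes_share = 66, 34  # the reported recall results, in percent
CHART_WIDTH = 300             # pixels for the full scale; any value works

def bar_length(value, scale_max=100):
    # Linear scale from a zero baseline: length is proportional to the value.
    return CHART_WIDTH * value / scale_max

# The ratio of lengths equals the ratio of values, regardless of chart width.
print(bar_length(yes_share) / bar_length(no_share))  # ~0.52
```

Any honest linear mapping, whether scaled to 66, to 100, or stacked, preserves that roughly one-to-two relationship; only the original's eyeballed bars break it.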
Basically doing anything but what we saw in the original.
Credit for the original goes to the CBS News graphics department.
A few weeks back, a good friend of mine sent me this graphic from Statista that detailed the global beer industry. It showed how many of the world’s biggest brands are, in fact, owned by just a few of the biggest companies. This isn’t exactly news to either my friend or me, because we both worked in market research in our past lives, but I wanted to talk about this particular chart.
At first glance we have a tree map, where the area of each “squarified” shape represents, usually, the share of the total. In this case, the share of global beer production in millions of hectolitres. Nothing too crazy there.
Next, colour will often represent another variable; for market share you might often see greens or blues to reds that represent the recent historical growth or forecast future growth of that particular brand, company, or market. Here, however, is where the chart begins to break down. Colour does not appear to encode any meaningful data. It could have been used to encode the region of origin of the parent company. Imagine blue represented European companies, red Asian, and yellow American. We would still have a similarly coloured map, sans purple and green, but the colour would actually mean something.
But we also need to look at the data the chart communicates. We have the production in hectolitres, or the shape of the rectangle. But what about that little rectangle in the lower right corner? Is that supposed to be a different measurement or is it merely a label? Because if it’s a label, we need to compare it to the circles in the upper right. Those are labels, but they change in size whereas the rectangles change only in order to fit the number.
And what about those circles? They represent the share of total beer production. In other words the squares represent the number of hectolitres produced and the circles represent the share of hectolitres produced. Two sides of the same coin. Because we can plot this as a simple scatter plot and see that we’re really just looking at the same data.
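The redundancy is mechanical: share is just volume divided by the global total, so plotting one against the other can only ever produce points along a straight line through the origin. A sketch with assumed, illustrative volumes:

```python
# Hypothetical production volumes (millions of hectolitres), for illustration.
volumes = {"AB InBev": 530, "Heineken": 220, "Carlsberg": 110, "CR Snow": 105}
GLOBAL_TOTAL = 1900  # assumed global production, same units

# Share is volume over the total: a constant multiple of the volume,
# so share vs. volume is perfectly linear by construction.
shares = {name: vol / GLOBAL_TOTAL for name, vol in volumes.items()}
for name in volumes:
    print(name, volumes[name], f"{shares[name]:.1%}")
```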
We can see that there’s a pretty apparent connection between the volume of beer produced and the share of volume produced—as one would (hopefully) expect. The chart doesn’t tell us much more than that there are three tiers within the Big Six of breweries. AB InBev sits in its own top tier and Heineken forms a second, separate tier. Carlsberg and China Resources Snow Breweries are very competitive with each other, and just behind them are Molson Coors and Tsingtao—but those four could all be grouped into a third tier.
Another way to look at this would be to disaggregate the scatter plot into two separate bar charts.
You can see the pattern in terms of the shapes of the bars and the resulting three tiers is broadly the same. You can also see how we don’t need colour to differentiate between any of these breweries, nor does the original graphic. We could layer on additional data and information, but the original designers opted not to do that.
But I find that the big glaring miss is that the article makes the point that despite the boom in craft beer in recent years, American craft beer is still a very small fraction of global beer production. The text cites a figure that isn’t included in the graphic, probably because the two come from different sources. But with a bit more research we could probably fit American craft breweries into the data set, and we’d get a resultant chart like this.
This more clearly makes the point that American craft beer is a fraction of global beer production. But it still isn’t a great chart, because it’s looking at global beer production. Instead, I would want to be able to see the share of craft brewery production in the United States.
How has that changed over the last decade? How dominant are these six big beer companies in the American market? Has that share been falling or rising? Has it been stable?
Well, I went to the original source and pulled down the data table for the Top 40 brewers. I took the Top 15 in beer production, all above 1% share in 2020, and then plotted that against the change in their beer production from 2019 to 2020. I added a benchmark of global beer production—down nearly 5% in the pandemic year—and then coloured the dots by the region of origin. (San Miguel might not seem to fit in Asia by name, but it’s from the Philippines.)
What mine does not do, because I couldn’t find a good (and convenient) source, is show which top brands belong to which parent companies. That’s probably buried in a report somewhere. But whilst market share data and analysis used to be my job, as I alluded to in the opening, it is no longer and I’ve got to get (virtually) to my day job.
After a rainy weekend in Philadelphia thanks to Hurricane Henri, we are bracing for another heat wave during the middle of this week. Of course when you swelter in the summer, you seek out shade. But as a recent article in the Philadelphia Inquirer pointed out, not all neighbourhoods have the same levels of tree cover, or canopy.
From a graphics standpoint, the article includes a really nice scatter plot that explores the relationship between coverage and median household income. It shows that income correlates best with lack of shade rather than race. But I want to focus on a screenshot of another set of graphics earlier on in the article.
I enjoyed this graphic in particular. It starts with a “simple” map of tree coverage in Philadelphia and then overlays city zip codes atop that. Two zip codes in particular receive highlights with bolder and larger type.
Those two zip codes, presumably the minimum and maximum or otherwise broadly representative, then receive call outs directly below. Each includes an enlarged map and then the data points for tree cover, median income, and then Black/Latino percentage of the population.
I don’t think the median income needs to be in bar chart form here, especially given the bars do not line up so that you can easily compare the zip codes. The numbers would work well enough as factettes or perhaps a small dot plot with the zip codes highlighted could work instead.
Additionally, the data labels would be particularly redundant if a small scale were used instead. That would work especially well if the median income were moved to the lowest place in the table and the share charts were consolidated in one graphic. Conceptually, though, I enjoy the deep dive into those two zip codes.
Then I wanted to highlight some great design work on the maps. Note how in particular for Chestnut Hill, 19118, the outline of the zip code is largely in a thicker, black stroke than the rest of the map. At the upper right, however, you have two important roads that define the area and the black stroke breaks at those points so the roads can be clearly and well labelled. The other map does the same thing for two roads, but their breaks are shorter as the roads run perpendicular to the border.
Overall this was just a great piece to read and I thoroughly enjoyed the graphics.
Every four years (or so) I have to confess that I think fondly back upon my former job, because I worked with a few wonderful colleagues of mine on some data about the Olympics. And the highlight was that we had a model to try and predict the number of medals won by the host country as we were curious about the idea of a host nation bump. In other words, do host countries witness an increase in their medal count relative to their performance in other Olympiads?
We concluded that host nations do see a slight bump in their total medal count and we then forecast that Team GB (the team for Great Britain and Northern Ireland) would win a total of 65 medals. We reached 64 by the final day and it wasn’t until the women’s modern pentathlon, maybe the last event, that Team GB won a silver medal, bringing its total to 65, exactly in line with our forecast.
Of course we also looked at the data for a number of other things, including if GDP per capita correlated to Olympic performance. We also looked at BMI and that did yield some interesting tidbits. But at the end of the day it was the medal forecast that thrilled me in the summer of 2012.
So yeah, today’s a shameless plug for some old work of mine. But I’m still proud of it two olympiads later.
Another day, more cases of coronavirus and Covid-19. So let’s take a look at Sunday’s data as there were some interesting things going on.
First, let’s dispense with Virginia. The state is enhancing its reporting structure, and so they admit the data is likely an underestimate of the present situation in Virginia. So here’s Virginia, nothing really changed.
Moving on, we have Pennsylvania. Here we are beginning to truly see the disparity between the cities in the southeast and southwest, namely Philadelphia and Pittsburgh, and the T that describes what is sometimes called Pennsyltucky. (Though it also includes cities like Harrisburg, the state capital.) The point is that the T of Pennsylvania has yet to suffer greatly from the outbreak. Of course, it’s also the part of the state least equipped to deal with a pandemic.
New Jersey is just bad. One can make the argument that South Jersey is hanging on. (Though I will touch on that later with an idea for today’s afterwork work.) Bergen County in the northeast is likely to surpass 10,000 cases on its own today. And that will put it above most states.
Delaware is tough because it sits as a small state next to several much larger ones. But, the numbers seem to indicate the outbreak is still worsening. Though in terms of geographic spread, there’s little to say other than that New Castle County, home to Wilmington, in the north is the heart of the state’s outbreak.
Illinois is a fascinating state, because of how dissimilar it is compared to Pennsylvania, a state which has a similar number of people.
The map shows that geographic spread still has a little way to go before reaching every county in the state. But the outbreak has been there longer than in Pennsylvania. And most of the darker purples are concentrated in the northeast, in Chicago and its collar counties. Compare that to Pennsylvania above where you will see dark purple scattered across the cities of its eastern third, e.g. Allentown and Scranton, and in the western parts near Pittsburgh. This too could be worth exploring in depth in the future.
Lastly I want to get to the cases curves charts. Here we look at the daily new cases in each state.
And unfortunately Sunday’s numbers will impact the Virginia curve, but overall it looks as if the state is worsening. I would argue that Illinois, which appears to be bending towards a steadier condition, is likely in a weird weekly pattern where it appears to stabilise on weekends and then resumes reporting infections come Monday. Pennsylvania might well be flattening its curve, but I would want to see a few more days’ worth of data before stating that more definitively. Let’s give it to Wednesday or Thursday.
And then in New Jersey we have a fascinating trend. The curve of increasing number of cases has clearly broken. But it also is not shrinking. Instead, it seems to be more of a plateau. And in that case, the outbreak in New Jersey is not getting worse, but it’s also not getting any better. At least not numerically. However, the goal of flattening the curve is to create a slower, more steady increase in case numbers to help hospitals cope with surge volumes. So good news?