Low Expectations

Today the 2021 Major League Baseball season begins its playoffs. Tomorrow we get the Los Angeles Dodgers and the St. Louis Cardinals. That the Dodgers, the team with the second-best record in all of baseball, need to play a one-game play-in is dumb, but a subject for perhaps another post. Tonight, however, is the American League (AL) Wildcard game and it features one of the best rivalries in baseball if not American sports: the Boston Red Sox vs. the New York Yankees.

Full disclosure, as many of you know, I’m a Sox fan and consider the Yankees the Evil Empire. But at the beginning of the year, the consensus around the sport was that the Yankees would win first place in their division and be followed by the Tampa Bay Rays or the Toronto Blue Jays. The Red Sox would place fourth and the lowly Baltimore Orioles fifth. The Red Sox, as the consensus went, were, after gutting their team of top-flight talent and a no-good, rotten, despicable 2020 showing, nowhere near ready to reach the playoffs. The Yankees were an unstoppable offensive juggernaut.

When the 2021 season ended Sunday night, as the dust around home plate settled, the Rays dominated the AL East to take first. But it was the Red Sox that finished second and the Yankees who took third. Whilst the two teams had the same record, in head-to-head match-ups the Red Sox won more games than the Yankees, 10–9. Not bad for a team that everyone thought couldn’t make the playoffs and would be in fourth place.

That got me thinking though, how wrong were our expectations? After doing some Googling to find individual reports and finding a Red Sox twitter account (@RedSoxStats) that captured as many preseason forecasts as he could, I was ready to make a chart. The caveat here is that we don’t have data for all beat writers, who cover the Red Sox exclusively or almost exclusively on a daily basis, or even national media writers, who cover the Red Sox along with the rest of the sport and its teams. For example, ESPN polled 37 of its writers, but all we know is that 0 of 37 expected the Red Sox to make the playoffs. I don’t have a single estimate for the number of wins, which obviously determines who gets into said playoffs, for those 37 forecasts. Others, like CBS Sports, broke down each of their five writers’ rankings for the division and all five had the Red Sox finishing fourth. But again, we don’t have numbers of wins. So in a sense, if we could get numbers from back in the winter and early spring, this chart would look even crazier with the Red Sox being even more outperform-ier than they do here.

Dirty water

We should also remember that during September, in the lead-up to the playoffs, the Red Sox were struggling with a Covid-19 outbreak that put nearly half their starting roster on the Injured List (IL). The Sox had the backups to the backups starting alongside the backups, some of whom then also went on the IL with Covid-19, leading to signings of players who, despite being integral to the September success, are not eligible to play in the playoffs due to when they signed. José Iglesias brought some 2013 magic to be sure. Earlier in the year, MLB would postpone games when significant numbers of players were unavailable, but the Red Sox, for whatever reason, had to play every game. And there were instances where players started the game, but their tests came back positive mid-game and they had to be removed from the field.

I’m not certain where I stand on how much managers influence the win-loss record in baseball. But if the Sox manager, Alex Cora, doesn’t at least get some nods for being manager of the year, I’ll be truly shocked.

The Red Sox are not a great team. This is not the 2018 behemoth, but rather an early rebuild for a hopefully competitive team in 2023. Their defence is not great. They lack depth in the rotation and the bullpen. I, for one, never doubted their offence—2020 surely had to have been a pandemic fluke. But I had serious questions about their starting rotation. Ultimately the rotation proved itself to be…adequate. And while they played through Covid-19 and kept their heads above water in September, the last few weeks were, at times, hard to watch. The Yankees swept them at Fenway, site of tonight’s game, just last weekend. Of late, the Yankees have been the better team. And all year long, the Red Sox played less competitively than I’d like against the other teams that made the playoffs.

I don’t expect them to win let alone make the World Series, but nobody expected them to be here anyway. Maybe they still have a few more surprises in them. After all, anything can happen in October baseball.

Credit for the piece is mine.

Updated DNA Ethnicity Estimates

Earlier this year I posted a short piece that compared my DNA ethnicity estimates provided by a few different companies to each other. Ethnicity estimates are great cocktail party conversations, but not terribly useful to people doing serious genealogy research. They are highly dependent upon the available data from reference populations.

To put it another way, if nobody in a certain ethnic group has tested with a company, there’s no real way for that company to place your results within that group. In the United States, Native Americans are known for their reluctance to participate and, last I heard, they are under-represented in ethnicity estimates. Fortunately for me, Western European population groups are fairly well tested.

But these reference populations are constantly being updated and new analysis being performed to try and sort people into ever more distinct genetic communities. (Although generally speaking the utility of these tests only goes back a handful of generations.)

Last night, when working on a different post, I received an email saying Ancestry.com had updated their analysis of my DNA. So naturally I wanted to compare this most recent update to last September’s.

Still mostly Irish

Sometimes when you look at data and create data visualisation pieces, the story is that there is very little change. And that’s my story. The actual number for my Irish estimate remained the same: 63%. I saw a slight change to my Scottish and Slavic numbers, but nothing drastic. My trace results changed, switching from 2% from the Balkans to 2% from Sweden and Denmark. But you need to take trace results with a pretty big grain of salt, unless they are of a different continent. Broadly speaking, we can be fairly certain about results at a continental level, but differences between, say, French and Germans are much harder to distinguish.

The Scottish part still fascinates me, because as far back as I’ve gone, I have not found an identifiable Scottish ancestor. A great-great-grandfather lived for several years in Edinburgh, but he was the son of two Ireland-born Irish parents. I also know that this Scottish part of me must come from my paternal lines as my mother has almost no Scottish DNA and she would need to have some if I were to have inherited it from her.

Now for about half of my paternal Irish ancestors, I know at least the counties from which they came. My initial thought, and still best guess, is that the Scottish is actually Scotch–Irish from what is today Northern Ireland. But I am unaware of any ancestor, except perhaps one, who came from or has origins in Northern Ireland.

The other thing that fascinated me is that despite the additional data and analysis, the ranges (the degree of uncertainty, to look at it another way) increased for most of the ethnicities. You can see the light purple rectangles are actually almost all larger this year compared to last. I can only wonder if this time next year I’ll see any narrowing of those ranges.

Credit for the piece is mine.

The Pandemic of the Unvaccinated

Get your shots.

It’s pretty much that simple. But for just under half the country, it’s not getting through. So I went looking for some data on the breakdown of Covid-19 cases by vaccinated and unvaccinated people.

I found an analysis by the Kaiser Family Foundation (KFF), a non-profit that focuses on health and healthcare issues. They collected the data made available by 24 states—not all states provide a breakdown of breakthrough cases—and what we see across the country is pretty clear. If you want more details on their methodology, I highly recommend you check out their analysis.

Breakthrough cases

In all but Arizona and Alaska, vaccinated people account for less than 4% of Covid-19 cases. In most of these states, it’s less than 2%. For the states that we regularly cover here—Pennsylvania, New Jersey, Delaware, Virginia, and Illinois—we have New Jersey, Delaware, and Virginia represented in the data set.

Delaware leads the three with vaccinated people accounting for just 1% of Covid-19 cases. Virginia is 0.7% and New Jersey is just 0.2%. In other words, in New Jersey almost nobody vaccinated is catching Covid-19 over the observation period.

And when we look at the vaccinated population, we can see what breakthrough events—cases, hospitalisations, and deaths—they are experiencing.

In almost all states, less than 0.5% of vaccinated people are getting Covid-19. Only in Arkansas do we see a number greater than that: 0.54%. In no state do we have more than 0.6% of vaccinated people requiring hospitalisation. And with that number so low, it won’t surprise you that in no state do we have more than 0.01% of vaccinated people dying.

In other words, the rapidly climbing numbers of new cases and slowly rising deaths that we looked at yesterday, that’s almost all in people who haven’t yet gotten vaccinated.

Get your shots.

Credit for the piece is mine.

Olympic Recap/Retro

Every four years (or so) I have to confess that I think fondly back upon my former job, because I worked with a few wonderful colleagues of mine on some data about the Olympics. And the highlight was that we had a model to try and predict the number of medals won by the host country as we were curious about the idea of a host nation bump. In other words, do host countries witness an increase in their medal count relative to their performance in other Olympiads?

We concluded that host nations do see a slight bump in their total medal count and we then forecast that we expected Team GB (the team for Great Britain and Northern Ireland) to win a total of 65 medals. We reached 64 by the final day and it wasn’t until the women’s pentathlon when, in maybe the last event, Team GB won a silver medal bringing its total to 65, exactly in line with our forecast.

Probably the most Olympics I’ve ever watched.

Of course we also looked at the data for a number of other things, including if GDP per capita correlated to Olympic performance. We also looked at BMI and that did yield some interesting tidbits. But at the end of the day it was the medal forecast that thrilled me in the summer of 2012.

So yeah, today’s a shameless plug for some old work of mine. But I’m still proud of it two Olympiads later.

If you’d like to see some of the pieces, I have them in my portfolio.

Credit for the piece is mine.

Easing Back into Normalcy

Happy Friday, all. Apologies for the lack of posting yesterday, I wasn’t feeling well and sitting in front of my computer typing stuff up wasn’t happening. But now the weekend is nearly upon us and to get in the mood I wanted to share this great dot plot from xkcd. It captures something I’ve definitely been thinking about.

Hopefully crossing most of these off in the next few weeks/months.

For example, on 3 March 2020, I had a friend over to my flat for drinks and to watch the Super Tuesday Democratic primary results come in. Tomorrow, if all goes according to plan, will be the first time I’ve had company over in 15 months.

In essence we have check boxes of the normal things we did in the before times and we’re just checking them off one by one until we can feel normal again.

Just please don’t contract a novel bat virus again.

Credit for the piece goes to Randall Munroe.

Choose Your Own FiveThirtyEight Adventure

In case you weren’t aware, the US election is in less than a week, five days. I had written a long list of issues on the ballot, but it kept getting longer and longer so I cut it. Suffice it to say, Americans are voting on a lot of issues this year. But a US presidential election is not like many other countries’ elections in that we use the Electoral College.

For my non-American readers, the Electoral College, very briefly, was created by the country’s founding fathers (Washington, Jefferson, Adams, Franklin, et al.) to do two things. One, restrict selection of the American president to a class of individuals who theoretically had a broader/deeper understanding of the issues—but who also had vested interests in the outcome. The founders did not intend for the American people to elect the president. The second feature of the Electoral College was to prevent the largest states from dominating smaller states in elections. Why else would Delaware and Rhode Island surrender their sovereignty to join the new United States if Virginia, Pennsylvania, and New York make all the decisions? (The founders went a step further and added the infamous 3/5 clause, but that’s another post.)

So Americans don’t elect the president directly and larger states like California, New York, and Texas, have slightly less impact than smaller states like Wyoming, Vermont, and Delaware. Each state is allotted a number of Electoral College votes and the key is to reach 270. (Maybe another time I’ll get into the details of what happens in a 269–269 tie.) Many Americans are probably familiar with sites like 270 To Win, where you can determine the outcome of the election by saying who won each state. But, even though the US election is really 50 different state elections, common threads and themes run through all those states and if one candidate or another wins one state, it makes winning or losing other states more or less likely. FiveThirtyEight released a piece that attempts to link those probabilities and help reveal how decisions voters in one state make may reflect on how other voters decide.

The interface is fairly straightforward—I’m looking at this on a desktop, though it does work on mobile—with a bunch of choices at the top and a choropleth map below. There we have a continually divergent gradient, meaning the states aren’t grouped into like bins but we have incredibly subtle differences between similar states. (I should also point out that Maine and Nebraska are the two exceptions to my above description of the Electoral College. They divide their votes by congressional district: whoever wins a district gets that district’s Electoral College vote, and then the overall state winner receives the remaining two votes.)
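For the curious, the Maine and Nebraska rule described above can be sketched in a few lines of Python. This is only my illustration of the rule, not anything from FiveThirtyEight; the function name and the placeholder candidate names are made up for the example.

```python
from collections import Counter


def allocate_by_district(district_winners, statewide_winner):
    """Split a state's electoral votes the way Maine and Nebraska do.

    district_winners: one entry per congressional district, naming
        whoever won that district (hypothetical input format).
    statewide_winner: whoever won the state overall.
    """
    # One electoral vote for each congressional district.
    votes = Counter(district_winners)
    # The remaining two votes go to the overall state winner.
    votes[statewide_winner] += 2
    return votes


# A two-district state (like Maine) that splits its districts.
print(allocate_by_district(["Candidate A", "Candidate B"], "Candidate A"))
```

In every other state the tally is simpler: the statewide winner takes all of that state's votes.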

Below that we have a bar chart, showing each state, its more/less likely winner, and the 270 threshold. Below that, we have what I’ve read/heard described as a ball plot. It represents runs of the simulation. As of Thursday morning, the current FiveThirtyEight model says Trump has an 11-in-100 chance of winning and Biden, conversely, an 89-in-100 chance.

But what happens when we start determining the winners of states?

Well, for my non-American readers, this election will feature a large number of voters casting their ballots early. (I voted early by mail, and dropped my ballot off at the county election office.) That’s not normal. And I cannot emphasise this next point enough. We may not know who wins the election Tuesday night or by the time Americans wake up on Wednesday. (Assuming they’re not like me and up until Alaska and Hawaii close their polls. Pro-tip, there’s a potentially competitive Senate race in Alaska, though it’s definitely leaning Republican.)

But, some states vote early and/or by mail every year and have built the infrastructure to count those votes, or the vast majority of them, on or even before Election Day. Three battleground states are in that group: Arizona, Florida, and North Carolina. We could well know the result in those states by midnight on Election Day—though Florida is probably going to Florida.

So what happens with this FiveThirtyEight model if we determine the winners of those three states? All three voted for Trump in 2016, so let’s say he wins them again next week.

We see that the states we’ve decided are now outlined in black. The remainder of the states have seen their colours change as their odds reflect the set electoral choice of our three states. We also now have a reset button that appears only once we’ve modified the map. I’m also thinking that I like FiveyFox, the site’s new mascot? He provides a succinct, plain language summary of what the user is looking at. At the bottom we see what the model projects if Arizona, Florida, and North Carolina vote for Trump. And in that scenario, Trump wins in 58 out of 100 elections, Biden in only 41. Still, it’s a fairly competitive election.

So what happens if by midnight we have results from those three states showing that Biden has managed to flip them? After all, as of Thursday morning, he’s leading very narrowly in the opinion polls.

Well, the interface hasn’t really changed. Though I should add below this screenshot there is a button to copy the link to this outcome to your clipboard if, like me, you want to share it with the world or my readers.

As to the results, if Biden wins those three states, Trump has less than a 1-in-100 chance of winning and Biden a greater than 99-in-100.
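I don’t know how FiveThirtyEight implements this under the hood, but one plausible way to get numbers like these is to keep only the simulation runs that match the states you’ve locked in and then count how often each candidate wins among what’s left. Here is a minimal sketch of that idea in Python. The function name, the data layout, and the toy three-state example are all my own assumptions for illustration, not real model output.

```python
def win_probability(runs, electoral_votes, candidate, locked=None, threshold=270):
    """Share of simulation runs the candidate wins, given locked-in states.

    runs: list of dicts mapping state -> winner for one simulated election.
    electoral_votes: dict mapping state -> number of electoral votes.
    locked: dict mapping state -> winner for the states we have decided.
    Returns None when no run matches the locked-in states.
    """
    locked = locked or {}
    # Keep only the runs that agree with every state we have decided.
    matching = [
        run for run in runs
        if all(run[state] == winner for state, winner in locked.items())
    ]
    if not matching:
        return None
    # Count the matching runs in which the candidate reaches the threshold.
    wins = 0
    for run in matching:
        total = sum(ev for state, ev in electoral_votes.items()
                    if run[state] == candidate)
        if total >= threshold:
            wins += 1
    return wins / len(matching)


# Toy data: three made-up states and four made-up runs, needing 4 of 7 votes.
toy_votes = {"A": 3, "B": 2, "C": 2}
toy_runs = [
    {"A": "X", "B": "X", "C": "X"},
    {"A": "X", "B": "Y", "C": "Y"},
    {"A": "Y", "B": "X", "C": "X"},
    {"A": "Y", "B": "Y", "C": "Y"},
]
print(win_probability(toy_runs, toy_votes, "X", threshold=4))
print(win_probability(toy_runs, toy_votes, "X", locked={"B": "X"}, threshold=4))
```

In the toy data, states B and C always move together, so locking in B tells you about C as well. That is the same kind of linkage the piece reveals between real states.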

This is a really strong piece from FiveThirtyEight and it does a great job to show how states are subtly linked in terms of their likelihood to vote one way or the other.

Credit for the piece goes to Ryan Best, Jay Boice, Aaron Bycoffe and Nate Silver.

Mask Up

Well, we made it to Friday. But, if you’ve been following me on the social, you’ll know that Covid is beginning to spread once again in Pennsylvania, New Jersey, Delaware, Virginia, and Illinois. I live in a tower block and I can say that many of my neighbours are no longer wearing masks indoors. Yet mask-wearing is the easiest defence we have against the spread of the coronavirus. So let’s take a look at the most effective types of masks, thankfully charted by xkcd.

Credit for the piece goes to Randall Munroe.

The Covid Recession’s Continuing Impact on Youth

Earlier this week, some of the work my team does was published. We produced a one-page summary of a far larger and more comprehensive (relative to the scope of the summary) survey of consumers during the Covid Recession. I will spare you the details of recreating existing templates from scratch and the design decisions that went into that bit—neither insignificant nor unsubstantial—and rather focus on the one graphic we designed.

The broad thrust of the summary is that while overall we are beginning to see some job recovery, the recovery is uneven and, in fact, those below the age of 36 are getting hit pretty hard (my words, not the authors’). While in some industries the young are recovering in good numbers, in other industries, industries with a larger share of the youth population, young people are still losing jobs. Then we broke those top line numbers out by industries in the below graphic captured by screenshot.

How different age groups in different industries are faring in the recession.

There are a couple of things from a design side to discuss. We had about two or three days from when we started the project to develop some ideas and then execute and produce the summary. And as I noted above, that also included quite a bit of time in emulating existing documents and building ourselves a new template should we need to do something similar in the future.

But for that graphic in particular, there’s one thing I wanted to highlight: the lack of values on the axis. The challenge here was that the data displayed is people not working. And when we compared this time period (Wave 3) to the earlier waves, we were looking for declines. And so if we were going to say that the 36+ are gaining construction jobs, that would be a -2% value, and the youth figure would be about a -13% increase. If you are doing a bit of a double-take at a negative increase, so did the team. Ultimately, we used the data to generate the chart, but then opted for qualitative labelling on the axes. They simply point out that in one direction, youth are either gaining or losing jobs, and the same for the 36+. To reinforce this idea, we also added some descriptors in the far corner of each quadrant that said whether the age groups were gaining or losing jobs.
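The sign flip is easier to see written out. Below is a rough Python sketch of the qualitative labelling idea, assuming each value is the change in the share of people not working between waves, so that a negative number means jobs gained. The function names and label wording are hypothetical and are not what we used to build the chart.

```python
def job_trend(change_in_not_working):
    """Turn a change in the not-working share into a plain-language label."""
    # Fewer people not working means the group is gaining jobs.
    if change_in_not_working < 0:
        return "gaining jobs"
    if change_in_not_working > 0:
        return "losing jobs"
    return "no change"


def quadrant_label(youth_change, older_change):
    """Describe which quadrant of the chart an industry falls into."""
    return (f"Under 36: {job_trend(youth_change)}; "
            f"36+: {job_trend(older_change)}")


# The construction example from the text: about -13 for youth, -2 for 36+.
print(quadrant_label(-13, -2))
```

Labelling by direction sidesteps the awkward "negative increase" entirely, which is the reason the axes went qualitative.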

Despite the unusual design decisions I took in the graphic, I’m really proud of this piece especially given its tight turnaround. It shows in almost real-time how fractured the recovery—is this a recovery?—is at this point.

Credit for the piece goes to the team on this, Tom Akana, Kate Gamble, Natalie Spingler, and myself.

Red Sox Starting Rotation: A Dumpster Fire in a Dumpster Fire Year

Baseball for the Red Sox starts on Friday. Am I glad baseball is back? Yes?

I love the sport and will be glad that it’s back on the air to give me something to watch. But the way it’s being done boggles the mind. Here today I don’t want to get into the Covid, health, and labour relations aspect of the game. But, as the title suggests, I want to look at a graphic that looks at just how bad the Red Sox could be this (shortened) year. And over at FiveThirtyEight, they created a model to evaluate teams’ starting rotations on an ongoing basis.

The Red Sox are just bad.
Look at the Red Sox, one of the worst in baseball.

Form wise, this isn’t too different from what we looked at yesterday. It’s a dot plot with the dots representing individual pitchers. The size of the dots represents their number of total starts. This is an important metric in their model, but as we all know size is a difficult attribute for people to compare and I’m not entirely convinced it’s working here. Some dots are clearly smaller than others, but for most it’s difficult for me to clearly tell.

Colour is just tied to the colour of the teams. Necessary? Not at all. Because the teams are not compared on the same plot, they could all be the same colour. If, however, an eventual addition were made that plotted the day’s matchups on one line, then colour would be very much appropriate.

I like the subtle addition of “Better” at the top of the plots to help the user understand the constructed metric. Otherwise the numbers are just that, numbers that don’t mean anything.

Overall a solid piece. And it does a great job of showing just how awful the Red Sox starting rotation is going to be. Because I know who Nate Eovaldi is. And I’ve heard of Martin Perez. Ryan Weber I only know from his largely pitching in relief last year. And after that? Well, not on this graphic, but we have Eduardo Rodriguez who had corona and, while he has recovered, nobody knows how that will impact people in sports. There’s somebody named Hall who I have never heard of. Then we have Brian Johnson, a root-for-the-guy story of beating the odds to reach the Major Leagues but who has been inconsistent. Then…it is literally a list of relief pitchers.

We dumped the salary of Mookie Betts and David Price and all we got was basically a tee-shirt saying “We still need a pitcher or three”.

Credit for the piece goes to Jay Boice.

Consumer Payment Methods During the Corona Times

Okay, so we’re going to post some more of my work today, but it’s not about cases and deaths. Instead, I took some data produced by my colleagues and thought that it could do with a small transformation from a table into a chart. The original table can be found in their report on consumer payment options during the Covid-19 pandemic.

After setting the kettle on for some tea this morning, I started on their Table 1. Thirty minutes later and a cup of Irish Breakfast consumed, I had transformed it into this:

Obviously I changed the language/title a little bit. But the original was too long and didn’t fit. Also this is my blog, so my rules. The visualisation improves upon the table in a number of ways, but tables do have their place. Tables are great for organising information. Find a column header and a row header and you can get any specific data point. But, if you want to make a comparison between two data points or several of them, a chart is the way to go. Now, you may lose some precision. For example, do I know to the decimal point or to the tenths even what one of those dots represents? Nope. But at a glance, can I see which dots are below the overall respondents? Yep. It’s abundantly clear that those earning less than $40,000 per year have a greater availability of debit cards than the other groups shown.

And after all, I couldn’t have made this graphic without that table.

Full disclosure, as alluded to above, I work at the Federal Reserve Bank of Philadelphia. But I had nothing to do with the data, report, or presentation thereof.

Credit for the graphic is mine. Credit for the data goes to the folks over at the Consumer Finance Institute.