Random ramblings of an anonymous software engineer. Contains occasional profanity. Personal opinions, not related to employer.

Japan's COVID-19 Reports - 140KBs of Unadulterated Incompetence

These are some thoughts on the daily reports issued by the Japanese Ministry of Health, Labor, and Welfare. I'll be using data from the March 27 COVID-19 status report and the English Version.

Setting aside that these reports seem to be manually written in HTML by a human being (unlike other places that offer a dashboard, e.g. South Korea), there are some serious issues with this report format. I am writing this on March 29, but I will explain why I am using a report from March 27 at the end.

Sums? Where we are going, we don't need sums to work.

First, this is the breakdown table at the top of the report. I'm using the version from the English page, for legibility reasons. Here is an illustration of the legibility problem, using a grain of sesame:

Compare this with a Wikipedia page, with the same sesame grain.

COVID-19 numbers in Japan, March 27 2020

The first problem here is that none of the really important tables are actually text-based tables - they are images. Images with no alt text, so you get dead silence if you try to access the information with a screen reader. I won't go into the details of why this is bad, as that's enough material for a whole new post, but you can read more on this here.

On the other hand, it's extremely detailed, which is nice - but how do these numbers break down? The answer is, not in a very straightforward way.

So this table, confusingly enough, is a quasi-hierarchy. It has these levels:

  1. Level 1: PCR tested positive, PCR tested
  2. Level 2: With no symptoms, With symptoms, Under confirmation of the symptom, Death (this is special, see below)
  3. Level 3: w/Symptoms - Already discharged, w/Symptoms - Need in-patient treatment, w/o Symptoms - Already discharged, w/o Symptoms - Need in-patient treatment
  4. And so forth.

Tested negative is implicit. It's presumably pcr_tested - pcr_tested_positive. Ideally, pcr_positive and pcr_negative would have been level 1, with level 0 being pcr_tested.
Now, confusingly enough, here is how you sum the level 2 cells to get PCR tested positive:

  • pcr_tested_positive = with_no_symptoms + with_symptoms + under_confirmation

Remember to not include death, because this is a quasi-hierarchy. The table tempts you to, but it's excluded. (The quasi-nature is because it is under the PCR tested positive umbrella on level 2.) So what about integrating level 3?

That's simple. You don't. Because no matter how hard you try, the numbers won't add up. (e.g. Give it a try - you'll end up with 129 != 131 and 1147 != 1191. We'll need one of these numbers later.) Level 4 adds up though, so the plot thickens.
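A minimal sketch of that check in Python, using only the figures quoted in this post for one row of the summary table. The field names are made up for illustration (they are not the Ministry's), and the level 3 values are the sums of the cells, not the individual cells.

```python
# Figures quoted in this post for one row of the March 27 summary table.
# Field names are illustrative, not the Ministry's.
level2 = {
    "with_no_symptoms": 131,
    "with_symptoms": 1191,
    "under_confirmation": 27,
}
pcr_tested_positive = 1349

# What you get when you add up the level 3 cells under each level 2 cell.
level3_sums = {
    "with_no_symptoms": 129,
    "with_symptoms": 1147,
}


def level2_adds_up(level2, expected_total):
    """True if the level 2 cells (deaths excluded) sum to the level 1 total."""
    return sum(level2.values()) == expected_total


def level3_gaps(level2, level3_sums):
    """How many people go missing between level 2 and level 3, per cell."""
    return {name: level2[name] - total for name, total in level3_sums.items()}


print(level2_adds_up(level2, pcr_tested_positive))
print(level3_gaps(level2, level3_sums))
```

Level 2 passes, level 3 leaves a gap in both cells, and nothing in the report says where those people went.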

So, up next - do the rows integrate nicely? Fortunately - yes. Does the breakdown matter to the average Joe? Not really. Unless you are a healthcare official, the details really don't matter - the average person only needs the summed numbers to compare how bad the situation is with other countries, and to judge how careful they should be when leaving their homes.

Here is a simplified version I made that shows only what matters to the average person.

Simplified Japanese COVID-19 Statistics

According to this data, the positive case mortality rate is around 3.21%.

The statistically disappearing ghost ship passengers

Moving on to the next chart, we can see some interesting patterns here.

Hospitalization and Discharge, dated March 26

First, there is a new number that was not disclosed in the previous table - 2059, and 672 respectively. So what are these? 2059 is a sum including the cruise ship passengers. Even worse, in the Japanese version even this table is missing - and only provides a separate table with the cruise ship numbers, and completely omits the summed count.

Cruise ship numbers in Japanese page, dated March 26

Why? Nobody knows. The problem here is that these people are no longer on the ship; they have actually landed on Japanese soil. So the real number of positive cases on Japanese soil is actually 2059, and not 1387 as the previous table suggests. It is also worth noting that 603 of the 672 people have been discharged, and nobody has the slightest idea where these people are as of today.

On top of that, "cured" cases reported by the government in press releases (but not this report, so the numbers aren't off by 600 people) include patients from the cruise ship, so if you calculate the ratio of cured to infected, it's on a different magnitude. Number magic! Deaths on the other hand have not been summed, so if you are a Japanese citizen who happened to die of COVID-19 complications after getting off of the Diamond Princess, you have not contributed to the mortality rate. Yay for statistics!
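For reference, the arithmetic behind those numbers as a small Python sketch. The variable and function names are my own, and the values are only the ones quoted above.

```python
# Values quoted in this post; names are illustrative.
domestic_positive = 1387   # what the first table suggests
cruise_positive = 672      # cruise ship cases
cruise_discharged = 603    # cruise ship cases already discharged


def combined_positive(domestic, cruise):
    """Positive cases on Japanese soil once the ship is counted."""
    return domestic + cruise


def percentage(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


print(combined_positive(domestic_positive, cruise_positive))
print(percentage(cruise_discharged, cruise_positive))
```

Whether a discharged cruise ship passenger lands in the numerator, the denominator, both, or neither is exactly the thing the report never spells out.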

Moving on to the next nit, there is a bubble that says "from severe to moderate/mild symptoms" with a value next to it, but crossing two cells. What does this mean? Nobody knows, and there is no explanation why it crosses two cells on the page either.

The previous table is dated "12:00, Mar. 27", while this table is dated "18:00 Mar. 26". That is an 18-hour difference between two adjacent tables - yet the total cases are exactly the same. So either all the hospitals are clocking out exactly at 18:00 and halting all testing, or something is very wrong.
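If you want to double check the gap, a quick Python sketch does it. The year is taken from the report date, and both timestamps are assumed to be in the same time zone.

```python
from datetime import datetime

# Timestamps printed on the two adjacent tables (assumed same time zone).
summary_table = datetime(2020, 3, 27, 12, 0)
hospital_table = datetime(2020, 3, 26, 18, 0)

gap_hours = (summary_table - hospital_table).total_seconds() / 3600
print(gap_hours)
```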

Look Woody, infected people everywhere. EVERYWHERE.

Now, after roughly two screens' worth of summarized information about the local situation, they move on to a static table breakdown of the global situation with no visualization. This section takes up 53.47% of the vertical pixel real estate of the entire report. Sure, it's useful information - but not as a table with no sorting and no plot. Here is the real-estate breakdown visualized:

What the fuck Japan.

Here is the thing - there are dozens of places that do this better than this page, so the citizens here can look the numbers up there. Even if English literacy is not a thing in Japan, I'm sure Ctrl+F and typing a country name in English for specific cases is not rocket science for anyone who has finished mandatory public education. Why they allocate so much space to information very few will read is a mystery - if they really want to publish it, this should probably be a separate post.

I'm not sure what the intent of this is. Is it to show how terrible the rest of the world is and how well Japan is managing the situation? I don't know.

I normally don't sum things, but when I do - they don't add up.

Now, we finally move on to the meat of the report. Local regional breakdowns. Why this is after the global numbers is beyond me, but at this point it seems like anything goes.

Japan regional breakdown, no date disclosed

The table columns are in the order of municipality, patients, currently in-hospital, discharged, and dead.

Now to add more consistency to the report, we have yet another unsortable static table. You can also see here that unlike the previous table, it has been sorted by the second column, which is well, consistent with none of the tables we have seen so far - but at this point, who cares.

This time, the sum of patients includes the dead. Try it yourself.

  • 227 = 194 + 28 + 5

There is no rhyme or reason behind the inconsistencies between the first summary table and the regional breakdown. (Not to mention that none of the tables have matching columns for baseline comparison, because that seemed like a good idea at the time to someone at Kasumigaseki.) There isn't much more to write about here; common sense seems to be a scarce resource among those involved in the report.

Moving on, here is the grand total.

Wait, what? Where did 1191 come from? When is this table from? Well, nobody knows - it's not written anywhere. Let's add those up, like we did on the first row of the same table.

  • 828 + 319 + 46 = 1193
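The same row check can be written down once and run against every row. Here is a rough Python sketch with the two rows quoted in this section. The class and field names are mine, and the first row is left unnamed because only its numbers are quoted here.

```python
from typing import NamedTuple


class RegionRow(NamedTuple):
    # Column order follows the table: patients, in-hospital, discharged, dead.
    name: str
    patients: int
    in_hospital: int
    discharged: int
    dead: int


def row_gap(row):
    """Breakdown sum minus the reported patient count (0 means it adds up)."""
    return (row.in_hospital + row.discharged + row.dead) - row.patients


rows = [
    RegionRow("first row", 227, 194, 28, 5),
    RegionRow("grand total", 1191, 828, 319, 46),
]

for row in rows:
    print(row.name, row_gap(row))
```

The first row comes out at 0, the grand total is off by 2.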


What we do know is that this is yet another set of new numbers that doesn't match anything we have seen so far. Or does it? Remember they noted 無症状病原体保有者を除く (excluding those who are asymptomatic) above? Well, let's try that - here are some assumptions we will make based on what the government was probably thinking.

  1. Awaiting is considered asymptomatic, because that lowers the count. No symptoms, right?
  2. Asymptomatic is obviously asymptomatic.
  3. The dead have no symptoms, obviously.
  4. Remember we pretend the ghost ship passengers don't exist? They don't. They don't exist. Shhh.

Ah, but there is 1191, which is the "with symptoms" column sum in the first table, right? But what about awaiting and dead and all of that? Maybe there is another way to compute this.

After trying a bunch of combinations from the table above, the conclusion is that this number might also come from this function.

Coefficients used by magic function

  • sum_local = pcr_positive - with_no_symptoms - awaiting_symptoms
  • 1191 = 1349 - 131 - 27

It turns out - this magic function works for all rows. That means the hierarchy is represented in yet another confusing form, but let's not go too deep into that.
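Spelled out as code, the magic function is a one-liner. This is only a sketch of my guess at the formula, not anything the Ministry has documented, and it is checked here against the single row quoted above.

```python
def sum_local(pcr_positive, with_no_symptoms, awaiting_symptoms):
    """Guessed formula: positives minus asymptomatic minus awaiting confirmation.

    Deaths do not appear anywhere in it.
    """
    return pcr_positive - with_no_symptoms - awaiting_symptoms


# The row quoted in this post.
reported_total = 1191
computed_total = sum_local(pcr_positive=1349, with_no_symptoms=131, awaiting_symptoms=27)
print(computed_total == reported_total)
```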

I have no idea what they did about the dead, but the dead do not seem to be part of the equation. Why they chose this particular subset is beyond me.

Incompetence? That's our Scrapeshield.

To add insult to injury, if you want to use these reports as a foundation for analysis, you are in for a surprise. (At least I was.)

Here are some issues that I encountered:

  1. The report formats constantly seem to change
  2. Images, as noted above
  3. No semantics or meaningful selectors in the markup
  4. Multiple report types
  5. No URL patterns

The reason why I used a two-day-old report on March 29 is simple: there was no full report yesterday or today. I'm guessing the Tokyo lockdown means that nobody can get to work, and PCs are too expensive to buy for home use, so the people at the Ministry of Health can enjoy a nice weekend with hoarded pasta.

(4) in particular is interesting - the government releases multiple types of reports, depending on the day of the week. Here are the different types: (you can see the full list here)

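  • 新型コロナウイルス感染症の現在の状況と厚生労働省の対応について ("On the current status of the novel coronavirus infection and the response of the Ministry of Health, Labor, and Welfare"): Full report. Only released on weekdays. Summary table, cruise ship table, international table, followed by regional breakdown table.
  • 新型コロナウイルス感染症の現在の状況について ("On the current status of the novel coronavirus infection"): Weekend/public holiday edition. Summary table, cruise ship table, and international table. No local regional breakdown.
  • 新型コロナウイルスに関連した患者等の発生について ("On the occurrence of patients and others related to the novel coronavirus"): Released daily. Least useful, easiest to parse. Contains only a delta of new confirmed cases (no status, unlike the regional breakdown in the full report), broken down by region.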
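If you are scraping, you end up encoding this schedule yourself. A rough Python sketch of what that looks like, with made-up labels for the three report types. It assumes the weekday/weekend split described above and deliberately ignores public holidays, which would need a separate calendar.

```python
from datetime import date

# Labels are illustrative shorthand for the three report types listed above.
FULL = "full report"
WEEKEND = "weekend/holiday edition"
DELTA = "daily delta"


def expected_reports(day):
    """Guess which report types to look for on a given day.

    Assumption: public holidays are NOT handled here and are treated
    like any other weekday.
    """
    # date.weekday() is 0 for Monday through 6 for Sunday.
    if day.weekday() < 5:
        return [FULL, DELTA]
    return [WEEKEND, DELTA]


print(expected_reports(date(2020, 3, 27)))  # a Friday
print(expected_reports(date(2020, 3, 29)))  # a Sunday
```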

There is still no delta report for today as of now (March 29, 16:16) and as the report release time is not noted on the posts, I'm not sure when to expect it.

(UPDATE: I see it now, as of 19:30. I'm suspecting the 18:00 timestamp from earlier is probably related to this. Just a guess.)


So, so far I've been complaining about this report type with no actionable feedback - which is bad. I doubt the Japanese government will read this post and actually take any sensible action, but here are some suggestions.

  1. Make two reports. A summary (not more than two pages) report for citizens, and another detailed one for health professionals and formal use.
  2. Remove the international table. It's not useful to 99% of the audience out there. We have WHO data for that - that's what hyperlinks are for. If you want to do your own international edition (which I believe you should not, considering the quality of your existing reports) please do it as a separate report.
  3. Accompany reports with the raw data used. Even better, provide a public data feed for people to take away and throw into a tool of their choice - you might get a nice dashboard or trend report for free from someone who is bored enough.
  4. If you can't do (3), at least add some selectors to your HTML reports that can be used to pull the data out.
  5. Make the data field availability as consistent as possible. Don't suddenly add and remove fields.
  6. Make the format of the report and data points available consistent every day.
  7. Don't invent magic equations. People notice when the numbers don't match up. If you invent an equation, disclose how you ended up with that number.
  8. Release reports and data regularly, and communicate when this will be. If you cannot, communicate that too.
  9. Make the report URLs predictable, so people can scrape if needed.
  10. Please consider accessibility when making these reports available. Sure - OCR technology has advanced, but that is not a valid excuse.
  11. Enough with the bloody PDFs. Tabular data in the worst case can be released as CSV or XLS and nobody will complain. Maybe that grumpy guy who still uses his 25 year old PC-98 might, but f$#k him.
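To show how little it takes, here is a sketch of the regional table as CSV using nothing but the Python standard library. The column names are my own suggestion, not an existing format, and the two rows are the ones quoted earlier in this post (the first row is left unnamed on purpose).

```python
import csv
import io

# Suggested column names; hypothetical, not an existing Ministry format.
FIELDS = ["region", "patients", "in_hospital", "discharged", "dead"]

rows = [
    {"region": "first_row", "patients": 227, "in_hospital": 194, "discharged": 28, "dead": 5},
    {"region": "total", "patients": 1191, "in_hospital": 828, "discharged": 319, "dead": 46},
]


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Parse the CSV text back, converting the numeric columns to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (value if key == "region" else int(value)) for key, value in record.items()}
        for record in reader
    ]


csv_text = to_csv(rows)
print(csv_text)
print(from_csv(csv_text) == rows)
```

A file like this next to each report would make every sum in this post a five-minute check instead of an afternoon of squinting at images.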