Post by Nadica (She/Her) on Nov 19, 2024 3:45:33 GMT
How our team at Our World in Data became a global data source on COVID-19 - Published Nov 18, 2024
By: Saloni Dattani, Edouard Mathieu and Lucas Rodés-Guirao
Our small team made COVID-19 data clear, reliable, and accessible to a global audience. This is how it happened.
Before COVID-19, Our World in Data (OWID) was a small team with large ambitions.
We made data and research on global issues like poverty, climate, and global health accessible to the general public. But our team took a gradual approach — we updated our charts manually, usually on an annual basis. We provided context on statistics, cleared common misconceptions, and communicated critical insights from the data.
When COVID-19 spread across the world, our small team suddenly pivoted to compiling daily data that was of global importance. OWID became the primary global source for many COVID-19 indicators in a few short months. Our datasets powered the dashboards of large media organizations and became a crucial resource for journalists, governments, academic researchers, and the wider public.
In this retrospective article, we look back on how our small team faced the large, sudden demands of global COVID-19 data and how we adapted our processes to make this work part of our mission — to make data and research on the pandemic transparent and accessible to a wide audience.
Before the pandemic
Our World in Data was launched in 2014 by Max Roser, who built it with a few part-time colleagues in his spare time.
Over the years, more team members joined — including our colleagues Esteban Ortiz-Ospina, an economist and now co-director; Joe Hasell, who worked on global poverty and is now head of product and design; and Hannah Ritchie, who took charge of environmental research and is now deputy editor.
The authors of this article, Edouard Mathieu, Saloni Dattani, and Lucas Rodés-Guirao, joined the team in March 2020, February 2021, and April 2021, respectively, and the team has grown much larger since the pandemic.
Our World in Data aimed to cover a wide range of topics, but with a small team of only eight people at the start of 2020, we placed more focus on areas like inequality, poverty, global health, climate change, and agriculture. We updated datasets yearly, often through manual effort.
Our work centered on gathering and visualizing data from trusted research organizations and presenting it to the public in a digestible format. This small-scale approach changed drastically during the COVID-19 pandemic.
The drive to tackle COVID-19
In early 2020, as large outbreaks spread from China to Italy, the team saw a looming threat and a massive gap in available data to address it.
First, the available data suggested that growth rates were high, and if no action was taken, the pandemic would only slow down on its own after around two-thirds of the population had been infected (the “herd immunity threshold”). By this time, many countries would see vast numbers of deaths and healthcare workers stretched beyond their limits.
Second, the world needed to track the spread of COVID-19 worldwide in real-time, in a centralized, accurate, and constantly updated fashion.
But official data sources were patchy at best. Early on, the World Health Organization (WHO) provided daily updates only in an online spreadsheet. These updates often had critical entry errors, such as global totals that didn’t match the sum of every country and cumulative deaths that were lower than the previous day.
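The two kinds of entry errors mentioned above can be caught with simple consistency checks. The following is only a minimal sketch with made-up numbers; the function names and the data layout are assumptions for illustration and are not taken from OWID's actual pipeline.

```python
def find_cumulative_drops(series):
    """Return the indices where a cumulative count is lower than the day before."""
    return [i for i in range(1, len(series)) if series[i] < series[i - 1]]


def total_mismatch(reported_global_total, country_totals):
    """Return the reported global total minus the sum of the country totals.

    A result of 0 means the two figures agree.
    """
    return reported_global_total - sum(country_totals.values())


# Made-up example data (not real figures).
cumulative_deaths = [10, 14, 13, 20]
print(find_cumulative_drops(cumulative_deaths))     # [2]
print(total_mismatch(100, {"A": 60, "B": 35}))      # 5
```

A cumulative series should never decrease, so any index returned by the first check points at a day worth re-reading in the source report.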
Other data publication sites, like Worldometers, lacked transparency about their sources or contradicted official figures. There was nowhere to easily compare trends between countries or visualize the epidemic’s progression over time.
With this in mind, our team’s work shifted almost entirely to COVID-19 in late February 2020 — compiling official data sources, communicating the trends and limitations of available data, and clearing up misconceptions.
Early work on COVID-19
As the pandemic took hold across countries, the initial work was very challenging.
We launched the first version of our COVID-19 page in early March 2020, initially embedding other dashboards published by Johns Hopkins University, the WHO, and the University of Oxford. At the time, the OWID Grapher tool, which powers our charts, could not handle daily data, which prevented us from visualizing it ourselves.
It quickly became apparent that many of these alternative dashboards focused on the latest cumulative estimate, making it difficult to interpret trends over time.
Understanding how the pandemic changed over time was crucial. Realizing this, our software engineers Breck Yunits and Daniel Gavrilov overhauled the Grapher tool to handle daily updates so the team could match the pandemic’s fast pace.
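The difference between a latest cumulative figure and a trend over time can be shown in a few lines. This is a hedged sketch rather than the Grapher's implementation: it assumes a plain list of cumulative totals, and the helper names and the window length are chosen for illustration.

```python
def daily_new(cumulative):
    """Convert cumulative totals into daily increments (the first day is kept as-is)."""
    return [cumulative[0]] + [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
    ]


def rolling_mean(values, window=7):
    """Trailing average over `window` days; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up cumulative case counts.
cumulative_cases = [1, 3, 6, 10]
new_cases = daily_new(cumulative_cases)
print(new_cases)                          # [1, 2, 3, 4]
print(rolling_mean(new_cases, window=2))  # [None, 1.5, 2.5, 3.5]
```

Smoothing the daily increments makes it easier to see whether an outbreak is accelerating or slowing, which a single cumulative number cannot show.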
Hannah Ritchie and Max Roser spent many early mornings manually transcribing the numbers of cases and tests conducted from WHO reports while outbreaks grew worldwide. They often faced mismatched totals, confusing formats, and outdated numbers.
They also wrote pages to explain these indicators and how to interpret them and compare statistics across sources.
This was crucial because many misconceptions were common. For example, the number of cases was often misinterpreted as the number of infections, even though testing rates were limited and many infections had not been confirmed.
Similarly, early calculations of the case fatality rate (CFR) were often flawed. They underestimated the actual mortality risk because of the delays between cases and deaths, limited testing, and lack of death registration in some countries.
To address these, our team also compiled a global dataset on COVID-19 testing rates.
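As a rough illustration of the delay problem in early CFR calculations described above, the sketch below compares a naive CFR (deaths to date divided by cases to date) with a version that divides by the cases confirmed some days earlier. The numbers are invented, and the lag value and function names are assumptions for the example, not the method used by OWID.

```python
def naive_cfr(cum_deaths, cum_cases, day):
    """Deaths to date divided by confirmed cases to date."""
    return cum_deaths[day] / cum_cases[day]


def lagged_cfr(cum_deaths, cum_cases, day, lag):
    """Deaths to date divided by cases confirmed `lag` days earlier.

    This crudely accounts for the delay between confirmation and death.
    """
    if day - lag < 0:
        raise ValueError("not enough history for the chosen lag")
    return cum_deaths[day] / cum_cases[day - lag]


# Made-up data for a fast-growing outbreak (cases double every day).
cases = [100, 200, 400, 800]
deaths = [0, 2, 4, 8]

print(naive_cfr(deaths, cases, day=3))          # 0.01
print(lagged_cfr(deaths, cases, day=3, lag=2))  # 0.04
```

While case counts are growing quickly, the naive ratio is pulled down by recent cases whose outcomes are not yet known. Neither figure corrects for limited testing or incomplete death registration.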
Building the world’s datasets on COVID-19 testing and vaccination
Although most of the data we present on Our World in Data is republished from other sources with credit rather than compiled from national sources by our team, there were two prominent exceptions during COVID-19: testing and vaccination data.
Early in the pandemic, it became clear to us that testing rates were essential to correctly interpret the number of cases. Without sufficient testing, the number of cases would give a very limited picture of how rapidly new outbreaks were growing and where.
However, no global dataset on testing rates was available. So, in March 2020, our colleagues — starting with Joe Hasell and Esteban Ortiz-Ospina — began to build one, aiming to include as many countries as possible.
This was very challenging. Countries shared daily test counts in hard-to-process formats, including PDFs and HTML tables. Some countries counted “people tested”, while others counted “swabs tested”, creating inconsistent indicators across nations. This was complicated by the fact that many people would go on to be tested more than once.
This confusion also applied to the types of tests. Some countries only reported PCR tests, while others only reported antibody tests. More confusingly, some combined numbers from both types of tests.
Edouard Mathieu, now our head of data and research, joined our team in March 2020 to manage this growing data pipeline. The rest of the team, especially former team members Cameron Appel and Daniel Gavrilov, helped him compile global testing data, eventually building a dataset of more than 130 countries and territories. Cameron became an all-around contributor, helping across multiple areas of COVID-19 data.
The chart below shows cumulative testing rates across countries and clarifies which indicators are used. Our team spent many months collecting this data, contacting national health organizations to clarify these differences, and updating it.
We published a peer-reviewed article presenting this database in the journal Nature Scientific Data.
ourworldindata.org/our-world-in-data-covid-19-testing-dataset-published-by-nature
By late 2020, after it became clear that vaccines would soon become available to the general public, our team contacted other international health institutions to understand whether they had plans to compile this data internationally. However, none of them planned on creating a global vaccination dataset.
Edouard Mathieu persuaded the team that we should step up, as we had already improved our processes, and suggested that collecting the data ourselves would fill an essential data gap that the world needed.
The map below shows the first vaccination data point we added: on December 8, 2020, the first person was vaccinated outside of a clinical trial in the United Kingdom. Finally, we could show a positive indicator, focusing on how we could handle the pandemic and reduce the number of lives lost.
You can see how this evolved over time by clicking on the “Play timelapse” button, or from the line chart, which both show how more and more countries began vaccinating and reporting this data over time.
Although the team was better prepared, collecting vaccination data was even more challenging. Data formats varied even more widely, from HTML tables and PDFs to press releases and even video announcements.
While we were able to automate some parts of these data extraction procedures, our team also had to watch daily videos of press conferences to note down the number of daily vaccinations from some countries.
Our vaccination dataset quickly became the only global source for COVID-19 vaccination statistics, including 210 countries and territories.
It was widely adopted by major organizations including the WHO, and served as the foundation for understanding vaccine distribution and equity worldwide.
We published a peer-reviewed article presenting this database in the journal Nature Human Behaviour.
ourworldindata.org/covid-vaccinations-nature
The COVID-19 Data Explorer
As data complexity grew, our team introduced the COVID-19 Data Explorer, a powerful new chart tool that allowed users to easily switch between indicators and countries, and explore and track the progression of the pandemic in a much more accessible way.
This tool, with its daily updates and user-friendly design, became the go-to resource for millions of people worldwide to keep up with daily updates on COVID-19.
It expanded and allowed users to explore a wide range of indicators — cases, deaths, testing rates, hospitalizations, excess mortality, vaccination rates, mobility trends, and viral strains — and compare them side by side. You can explore the current version of the COVID-19 Data Explorer online.
Our charts and data were widely used by news outlets such as The Guardian, BBC, The Financial Times, The Economist, The Spectator, Reuters, CNN, and The New York Times, academic researchers, health ministers, and political leaders of many countries, including both US Presidents Donald Trump and Joe Biden.
Open source and public data provision
Our datasets across topics have been downloadable and transparent for many years, but the open-source access we provided to our COVID-19 data and Grapher tool on GitHub became essential for maintaining global data on the pandemic.
This transparency was important because data needed to be compiled across many countries with different data collection and publishing procedures, and because those procedures occasionally changed. Web pages might be moved, data formats might be changed, and simple processing steps — such as what time the data was updated — weren’t explained.
Our GitHub repository enabled users worldwide to contribute to COVID-19 data.
The first chart below shows the number of users who contributed to our repository each week. Contributions spiked at the beginning of 2021 when the vaccine rollout began across countries, and our dataset became the only source of international data on vaccination rates.
With our open-source dataset, users could help identify data sources in regions where our team couldn’t access direct information, help translate official sources, flag changes and potential data errors (which we passed on to other institutions), and suggest improvements to the data pipeline.
The second chart shows that more than 700 users worldwide contributed to our data repository, with many submitting issues, pull requests, code reviews, or adding comments to help us improve our dataset. In total, they made more than 7,000 contributions.
Even now, anyone can contribute or browse each update we made to the dataset, which has been updated more than 31,000 times since it launched.
This collaborative approach made our data more transparent, maintainable, up-to-date, and far less error-prone than if we had published it in static reports.
Some contributors also joined our team. In April 2021, the team hired Lucas Rodés-Guirao, who had already contributed as a volunteer, to improve our processes on GitHub. By the end of the year, he was handling all our coronavirus data pipelines and updates.
With a growing team and user base, we were able to streamline processes and improve automation. The result was a faster, more accurate pipeline that allowed the team to focus on new work as the pandemic progressed.
Communication and public outreach
With our data powering dashboards and the team tracking trends daily, it became critical to communicate that information clearly. Our team spent many hours writing digestible explanations of how to properly interpret the figures and making these notes clear on the charts.
We also received questions and feedback from users, journalists, and officials, who used the data for policy decisions and public announcements, and we clarified indicators that could be easily misinterpreted.
Hannah Ritchie explained complex statistics in plain language on platforms like the BBC’s More or Less radio podcast and presented at the Royal Statistical Society’s evidence session on the pandemic. Max Roser spoke at the UK Parliament’s Science & Technology Committee about COVID-19 data and policies and the pandemic situation around the world. We communicated directly with the public and with those leading pandemic responses.
Edouard Mathieu published a commentary article in Nature, explaining how governments and international organizations could improve their data formats and publication processes. Charlie Giattino wrote about interpreting estimates of excess mortality and the number of infections. Edouard Mathieu and Max Roser wrote one of our most-viewed articles to explain visually how death rates were higher among those unvaccinated.
Our Twitter presence became a central avenue for rapid updates, with users flagging issues and officials from some countries directly contacting the team to clarify updates.
Our colleagues, particularly Esteban Ortiz-Ospina, reviewed direct public feedback — through our site’s feedback form, email, GitHub, and Twitter — for hours each day, to help ensure our data was clear and transparent and could be quickly improved if there were any issues.
Conclusion
Our work on COVID-19 data helped us see the importance of collaboration firsthand. People around the world helped us add new data, translate information, flag errors, and build a more accurate picture of the pandemic worldwide.
It underscored the need for open-source tools and automated workflows, which allowed us to respond quickly without sacrificing quality. By transitioning from largely manual processes to more streamlined systems, we made it possible to track crucial data more efficiently and reliably.
Unrestricted funding was also essential to initiating this project. Although we received dedicated funding for COVID-19 work much later, we were able to pivot swiftly to COVID-19 data because we had support from unrestricted donations. It gave us the flexibility to address the pressing needs of the moment.
Ultimately, our unique collaboration of researchers, programmers, and data scientists made it possible for us to communicate research and pandemic trends to the public in a clear, accessible, and maintainable way. Our experience showed how impactful a small, adaptable team could be, providing clarity and transparency at a time when the world needed it most.