Open Data: Myths and Reality

Technological developments need to be fully understood by policymakers, or we won’t be able to keep public policy current and effective

Giuseppe Sollazzo
10 min read · Oct 4, 2019

This is a tidied-up version of a talk I gave on 3.10.2019 to the UCL School of Public Policy alongside 360 Giving’s Katherine Duerden and Open Ownership’s Thom Townsend. It was for the series of seminars “Policy and Practice” run by Prof Robert Hazell.

From healthcare automation to open data

Although I’m a civil servant now, my background is as a technologist. Technology is what drove me to open data and open government. A long time ago I used to work in healthcare automation, taking manual hospital labs into fully automated operations.

Let’s take the automation of blood counts as an example.

In a manual lab, a technician spends a few hours a day at a microscope, using a clicker counter to count all the white and red cells they can see. They then transcribe the number onto a piece of paper and hand it to a porter, who eventually delivers it by hand to the doctor who requested the test. This process is inefficient and, most importantly, it’s fertile ground for potentially life-threatening transcription errors.

In an automated lab, by contrast, each test tube comes with a bar code identifying the patient and the tests that have been requested. The tube travels on a conveyor belt, much like in a sushi restaurant, and is picked up automatically by each machine, which knows from the bar code which tests to run. The results are sent automatically to a digital data store. Doctors can see these results in real time, be assured that they are reliable as long as the machines are configured properly, and quickly ask other clinicians for second opinions.

Introducing automation and a digital data store produced efficiency and reduced the likelihood of transcription errors to zero; but what’s important is that automation also increased the number, speed, and frequency of the tests that could be performed, and this multiplied the volume of data available about each patient. Thanks to automation the data is no longer on pieces of paper but on digital platforms; and data on digital platforms can be searched, compared, analysed, and charted.

This huge amount of data also had a massive side effect: it enabled serendipity, which produced innovative applications. There’s one example of this serendipitous innovation that I remember very well and of which I’m fond: the creation of a system of alerts for superbug infections. People who are hospitalised are more at risk of dangerous infections from bacteria that have developed resistance to antibiotics. To make sure the right antibiotics are being used, these patients are tested daily for their response to different types of antibiotics. Traditionally, this testing was used to make a reactive decision: the doctor would check a patient’s clinical file, see that they were not responding to the treatment, and decide to change it.

Having this data on a digital platform made the response proactive (and potentially life- and money-saving): by aggregating data about multiple patients — seeing that several patients were all struggling with the same infection while on the same set of antibiotics — a detection system was developed that could pick up the signals of a developing superbug hotbed much earlier. The system sent alarms when a set of conditions was met and triggered a change of treatment. This little piece of life-saving innovation was made possible by data.
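To give a flavour of the kind of logic involved, here is a minimal sketch (in Python) of how such an aggregation-and-alert rule might look. It is not the system I worked on: it simply assumes a stream of susceptibility results (patient, ward, organism, antibiotic, resistant or not) and raises an alert when several patients on the same ward stop responding to the same antibiotic within a short window. The fields, names and thresholds are all illustrative assumptions.

    from collections import defaultdict
    from datetime import datetime, timedelta

    # Illustrative thresholds only: how many patients on one ward must show
    # resistance to the same antibiotic, and within what time window, before
    # an alert is raised. A real system's rules would be clinically defined.
    MIN_PATIENTS = 3
    WINDOW = timedelta(days=2)

    def detect_hotbeds(results):
        """results: an iterable of dicts such as
        {"patient_id": "P123", "ward": "W4", "organism": "MRSA",
         "antibiotic": "flucloxacillin", "resistant": True,
         "tested_at": datetime(2019, 10, 1, 9, 30)}
        Returns a list of (ward, organism, antibiotic, patient_ids) alerts."""
        resistant = defaultdict(list)
        for r in results:
            if r["resistant"]:
                key = (r["ward"], r["organism"], r["antibiotic"])
                resistant[key].append((r["tested_at"], r["patient_id"]))

        alerts = []
        for key, hits in resistant.items():
            hits.sort()  # chronological order
            # Slide a time window over the resistant results for this
            # ward/organism/antibiotic combination.
            for start, _ in hits:
                patients = {pid for t, pid in hits if start <= t <= start + WINDOW}
                if len(patients) >= MIN_PATIENTS:
                    alerts.append((*key, sorted(patients)))
                    break
        return alerts

The point is not the code itself but what makes it possible: once the individual results live on a digital platform, looking across patients becomes a few lines of logic rather than a heroic manual effort.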

From this experience I learned three things.

The first is that improving the data layer with automation and cleaner processes makes operations safer, more efficient and more effective. In the data world we often like to use the plumbing metaphor: fixing the water pipes enables a better flow of water, and improving the data infrastructure does the same for data.

The second is that once the groundwork on the data infrastructure has been performed, better data in greater volume allows new ideas and applications to be tested. Aggregating and linking data can be an incredible innovation trigger.

The third is that making the data openly available — without overly restrictive conditions — to developers and researchers makes innovation happen.

And this is how I got into Open Data.

From Open Data to Open Government

What links open data and open government? I believe that Open Data is a great enabler of open government. Clare Moriarty, who was Permanent Secretary at DEFRA at a time when they were releasing thousands of datasets, put it rather brilliantly. She once said “open government is about more than open data […] but, for some reason, they seem to go together”. I find this very true.

A good example is the adoption of the Open Contracting Data Standard across the public sector, something the UK Government did as part of its commitments to the Open Government Partnership. The standard defines a way to disclose information about every phase of the procurement process. It has brought transparency by means of open data, not just in the UK but around the world. It has been used in Colombia to expose a cartel among suppliers of school meals; it has been used by the World Bank to assess investments — for example, asking why a country was building a football stadium while its hospitals were falling down; and it has been used in British local government to detect the early signs of overspending.
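To make the idea concrete, here is a much-simplified, invented example of the kind of record such a standard structures: a single identifier for the whole procurement process, with the tender, award and contract stages attached to it. The field names below are indicative only, not the exact OCDS schema.

    # A much-simplified, invented record shaped loosely on the Open
    # Contracting idea: one identifier for the whole procurement process,
    # with each stage (tender, award, contract) attached to it. Field
    # names are indicative only, not the exact OCDS schema.
    release = {
        "ocid": "ocds-example-000001",     # one ID followed across all stages
        "buyer": {"name": "Example Borough Council"},
        "tender": {
            "title": "School meals catering, 2019-2021",
            "value": {"amount": 1_200_000, "currency": "GBP"},
        },
        "awards": [
            {"supplier": {"name": "Example Catering Ltd"},
             "value": {"amount": 1_150_000, "currency": "GBP"}},
        ],
        "contracts": [
            {"status": "active",
             "period": {"start": "2019-09-01", "end": "2021-08-31"}},
        ],
    }

    # Because every stage shares the same identifier, anyone can follow a
    # procurement from tender through award to contract, and compare many
    # such records to spot unusual patterns.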

Opening up data this way has often been the vehicle to make the concepts of openness and transparency understood and practiced in government.

Fixing the government plumbing — but what is the plumbing?

We need to fully understand what fixing the plumbing means for government. It means a number of highly technical things: using unique identifiers in datasets; using common data standards to share data between different platforms; and agreeing on the definitions of the things we want to represent, while realising that our mental, common-sense definition of something doesn’t necessarily fit the data definition of the same thing.

For example, try to build a list of GP addresses. On the surface this is easy: put a few names of GPs alongside the addresses of their surgeries. But we soon realise that these concepts are rather fuzzy. When I say “I go to my GP”, is the GP a person or a practice? It can be both, and the answer is not the same for every use. For the General Medical Council, a GP is a person; for the Care Quality Commission, it is a place; and, to complicate things, there are also freelance GPs who work for multiple surgeries, as well as mobile, roaming practices with no fixed establishment. How do we build such a dataset and, more importantly, how do we maintain it? If people and companies are going to build services using this data, they need to know that they can depend on it in the future, not just today.
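One hedged sketch of how that ambiguity might be resolved in a dataset: model practitioners and practices as separate entities, each with its own stable unique identifier, and link them with a relation that lets one GP work at several surgeries. The names, identifiers and fields below are invented for illustration, not any official model.

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative model only: separate the two meanings of "GP" into two
    # entities, each with its own stable unique identifier, and use a linking
    # record to capture freelance GPs who work across several surgeries.

    @dataclass
    class Practitioner:              # "GP" as a person (the GMC's view)
        practitioner_id: str         # unique identifier for the person
        name: str

    @dataclass
    class Practice:                  # "GP" as a place (the CQC's view)
        practice_id: str             # unique identifier for the organisation
        name: str
        address: Optional[str] = None    # None for mobile, roaming practices

    @dataclass
    class Engagement:                # one GP can work at many surgeries
        practitioner_id: str
        practice_id: str

    # Invented example data: one freelance GP linked to two practices,
    # one of which has no fixed establishment.
    smith = Practitioner("PRAC-001", "Dr A. Smith")
    riverside = Practice("SURG-010", "Riverside Surgery", "1 High Street")
    roaming = Practice("SURG-011", "Roaming Clinic")
    links = [
        Engagement("PRAC-001", "SURG-010"),
        Engagement("PRAC-001", "SURG-011"),
    ]

Whatever shape the model takes, the hard part is not writing it down once but keeping the identifiers and links maintained over time, which is exactly the question of dependability raised above.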

Fixing the government plumbing means asking these questions in order to build reliable services based on data, and understanding that we might have to reconsider our innate definitions: data represents things in the world, but it is not those things.

Innovation by data has delivered, but not as much as we would like

Innovation means building services that change people’s understanding of their surroundings or affect their actions, making the complex simple.

In this sense, one of the most impactful data releases has been the Environment Agency’s real-time data on river gauge levels. Using this data, services like the GaugeMap website were launched, making the data immediately more accessible to people in areas affected by flooding and enabling them to assess their flood risk in real time and take all sorts of important decisions based on it.

But similar examples of government data used to innovate are few and far between.

Two policy challenges to open data

First of all, we know that there are legal and regulatory constraints on opening up data. But these are not always as clear-cut as they may seem.

Think about the companies register. Its publication nurtured organizations like Open Corporates and enabled investigations like the Panama Papers. We can all agree that releasing this data increased transparency and corporate accountability. However, the companies register contains personal details about directors, including highly personal data such as part of their date of birth. This is data that can easily be used for fraud. The question is: where do we set the bar for the publication of personal data?

With GDPR, the superbug-detection example above would probably lie in a legal grey area today. And what about the ethics of that use of data? Looking at the past ten years, we have learned that legality and ethics can be relatively fluid concepts.

The second policy issue is the potential for conflicts between the data producer and the data consumer. Think about TfL. TfL releases open data that enables independent journey-planning services to flourish, apps like CityMapper or Transit. However, TfL is also the delivery arm of the Mayor of London’s air quality and traffic policies. Some of these apps are now starting to offer their own transport services, such as CityMapper’s on-demand bus. We could easily end up in a situation in which a transport operator uses TfL open data to deliver services that conflict with TfL’s own traffic policy.

How to solve this conundrum without being obtrusive is one of the key challenges for public open data today.

Policy-makers need to become techies (or vice versa)

We need to be aware that things in technology develop and change at a fast pace. Ten years ago, when we began the journey of open government data, we didn’t have Cambridge Analytica, we didn’t have facial recognition, we didn’t have deep fakes. These technological developments have created new issues in ethics and privacy, and have even shed new light on the legitimate use of existing datasets. Policy-making is becoming intrinsically technical, and we need to become better at understanding difficult technical concepts if we want to keep policy-making current and effective.

A few questions and points of discussion were raised during the evening. Here I touch upon a few of them.

Data for operations, data for accountability. I think that the drive to open up data needs to consider that there are two broad types of datasets: accountability data, such as financial data, and operations data, such as bus timetables. I’ve started to believe that the two are interesting to entirely different audiences. Accountability data is aimed at transparency activists, who form the famous “army of armchair auditors”, except it’s not an army, it’s a niche. Operations data is for a much larger audience, the general public, sometimes mediated by apps. This split has, for me, two key consequences:

  • the quality test: publication rules should be different for the two groups. Accountability data should be published as is, because the committed and highly professional group of people interested in it will be able to assess its quality and suggest corrections when appropriate. Operations data, on the other hand, needs to be at an acceptable quality level before it gets published: the general public will get upset if they head to a pharmacy that is no longer where it’s supposed to be, or wait half an hour for a bus that never appears.
  • the transparency test: does publishing the data increase transparency? I believe that accountability data should always be released: once we have agreed that transparency is important, it’s hard to argue against it. Operations data, however, might be subject to different treatment, and I recognise this is a legitimate view. People like me would argue that operations data should be released in every case; others would disagree; but what matters is that operations data does not always increase transparency, while accountability data does. Therefore, we shouldn’t use the transparency test to argue for the release of operations data, because it doesn’t necessarily further the case for publication.

How can you argue that Britain is #1/2/3 in openness if we’re so bad at it? I’m not a big believer in open data league tables that pit countries against each other, because it’s not necessarily a level playing field. In transport we always hear that the Netherlands or Finland are great examples of transport data. And, well, they are; except we’re talking about countries that don’t really compare, in terms of number of trips, passengers, and complexity of the transport system, with Britain as a whole (maybe they do with London, and London is probably in a good place compared to them). That is not to deny that some countries have taken open data and really made it a priority in all their activities: publishing public documents using data standards, adding SLAs to the publication, and supporting businesses with the release of key infrastructural datasets. It’s good that we are made aware that some countries are going all in; but in a country which already has some innate complexity, due to the peculiar constitutional setting of four countries sharing some aspects of their legal and administrative systems, things are always going to be complex.

Open data presents security risks of people being identified. Yes, but I’m not sure what the point is. If I were a criminal and wanted to identify people, all I’d have to do is walk into a council and check the electoral roll. Or buy the edited electoral register, which is probably good enough if I intend to commit fraud. There is plenty of personal data out there; and there is also plenty of non-personal data that tells you a lot about people: for example, the Price Paid data released by the Land Registry doesn’t have anything personal in it, except that you now know how wealthy your neighbours are. There is the famous case in Germany where the association of family-owned businesses tried to block the publication of company ownership data, claiming “risks of kidnapping”. Well, I’m not entirely sure the kidnappers don’t already know who to kidnap. But in general, talking about security makes sense provided we define the threats we’re trying to mitigate against; without identifying threats, the only way to be entirely secure with data is to lock it in a closed room inside a Faraday cage.

The privacy of company directors isn’t important. On the other hand, we often hear that “it’s right to publish personal data of company directors because by their nature they have a higher status in society and therefore higher responsibility, while the average Joe doesn’t get the same benefits as a company director”. Although I don’t disagree with the premise, there is a reality check to be made about the real lives of company directors. In a country where many are self-employed, your average Joe is very likely to be a company director. Sure, this does not excuse them from the duties of being a company director, but I’d be very careful about being aggressive with their data, because they might be in a weaker position than you’d think.
