Working with messy open data

A case study of working with open data found in unpredictable shapes and sizes on the web. By the end, you will know what major issues to look out for when working with open data and what steps you can take to ensure you're able to analyse it afterwards.

Background

The ESRC funded a year-long project to collect data from various public bodies in the UK and present experiences of obtaining and cleaning the data through workshops, presentations and website media. There was also the chance to analyse the data if it could be cleaned to an appropriate degree. The types of data we collected were:

  • grants made by large grant makers (‘grantmakers’)
  • local authority grant information
  • local authority expenditure
  • Clinical Commissioning Group (CCG) expenditure

The issues we faced

The difficulties we faced were:

  • missing unique identifiers for organisations
  • matching organisations reported in the data to known organisations
  • non-standardised naming conventions
  • non-standardised formatting
  • varying file types
  • expenditure in PDF format

The actions we took

The first step for every type of data was to download it and convert it into a standardised format. We then had to clean it so that we could interpret not only the text and numbers but also the identity of the organisations named in it. This was not an easy task and accounted for much of the work. For example, the grantmakers data had to be translated into the fields of the 360Giving Standard, and the CCG data often had to be converted from PDF documents to Excel files. The success of that conversion depended entirely on the quality of the Excel formatting before the CCG converted the file to PDF.
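To make the standardising step concrete, here is a minimal sketch in Python of renaming a publisher's column headers to a common schema and normalising amounts and dates. It is not the project's own code. The source headers in COLUMN_MAP, the date formats and the sample values are hypothetical, and the target column names are illustrative and should be checked against the 360Giving Standard before use.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical mapping from publishers' headers to a common schema.
# Target names are illustrative; check them against the 360Giving Standard.
COLUMN_MAP = {
    "grant recipient": "Recipient Org:Name",
    "organisation": "Recipient Org:Name",
    "amount (£)": "Amount Awarded",
    "value": "Amount Awarded",
    "date awarded": "Award Date",
    "date": "Award Date",
}

# Date formats we assume might appear in the raw files.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y")


def parse_amount(text):
    """Turn strings such as '£1,250.00' into a float; return None if unparseable."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_date(text):
    """Try each assumed format and return an ISO 8601 date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardise(csv_text):
    """Rename known columns, normalise amounts and dates, drop unknown columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for header, value in raw.items():
            if header is None:  # surplus cells beyond the header row
                continue
            value = value or ""  # short rows yield None for missing cells
            target = COLUMN_MAP.get(header.strip().lower())
            if target == "Amount Awarded":
                row[target] = parse_amount(value)
            elif target == "Award Date":
                row[target] = parse_date(value)
            elif target:
                row[target] = value.strip()
        rows.append(row)
    return rows
```

Values that cannot be parsed come back as None rather than being guessed, so they can be reviewed by hand later.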

Because there was no standard format for expenditure reports, the largest part of working with this open data was matching the names of recipients found in the downloaded spreadsheets to registers of known organisations. This required writing computer programs that estimated, from the registers, who each recipient was likely to be. This work could have been avoided had each payment been published with a recognised unique identifier for the recipient, such as a charity number, a company number, or even a 360Giving identifier.
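The sketch below illustrates the general idea of name matching in Python. It is not the method the project used, which is not described here. It assumes a small in-memory register, uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure, and picks an arbitrary threshold of 0.85. The register entries, identifiers and abbreviation list are all hypothetical. When a payment carries a recognised identifier, the lookup is direct and no estimate is needed.

```python
import re
from difflib import SequenceMatcher

# Hypothetical register: identifier -> registered name.
REGISTER = {
    "CHARITY-0001": "Example Community Trust",
    "CHARITY-0002": "Northside Youth Association",
    "COMPANY-0003": "Acme Cleaning Services Limited",
}

# Common variations to collapse before comparing (illustrative, not exhaustive).
ABBREVIATIONS = {"ltd": "limited", "assoc": "association", "&": "and"}


def normalise(name):
    """Lower-case, strip punctuation, expand abbreviations, drop 'the'."""
    words = re.sub(r"[^a-z0-9& ]", " ", name.lower()).split()
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return " ".join(w for w in words if w != "the")


def match_recipient(name, identifier=None, threshold=0.85):
    """Return (register_id, score), or (None, best_score) if below threshold."""
    # A recognised unique identifier makes the match exact.
    if identifier in REGISTER:
        return identifier, 1.0
    target = normalise(name)
    best_id, best_score = None, 0.0
    for reg_id, reg_name in REGISTER.items():
        score = SequenceMatcher(None, target, normalise(reg_name)).ratio()
        if score > best_score:
            best_id, best_score = reg_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score
```

Calling match_recipient("ACME Cleaning Services Ltd.") resolves to the hypothetical COMPANY-0003 entry, because both names normalise to the same string. Names that score below the threshold return None so that they can be checked manually instead of being matched wrongly. Comparing every name against a full national register in this way would be slow, so a real pipeline would need an index or a blocking step.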

Actions

  1. Download the data from websites
  2. Clean and standardise the data
  3. Identify recipients by matching names against registers of known organisations

Positive outcomes

After all the effort made to standardise each payment, we were left with a lot of information about the network of procurement relations between public bodies in the UK. We were able to produce maps of payments, which demonstrated the potential to understand the identity of recipients as well as the value of payments made by public bodies. This web application is an example of what can be achieved once the data has been cleaned and identities confirmed.

Negative outcomes

The time it took to clean the data so that it was readable and could be linked to other data was enormous. The name matching alone took a full-time programmer many months to find and apply the best method. Furthermore, because expenditure reports uploaded to websites have not been standardised, it remains uncertain how much of the data can be cleaned and therefore how much of the overall picture can be established. We can say only so much about the relationships between public bodies and their suppliers, although we did make a good start on a comprehensive account of them.

Lessons learnt

The top three things we learnt on the project are:

  1. identifying people and organisations is the biggest task
  2. standardising the data takes a long time
  3. a shared data repository should be used for all raw data

Page last edited Nov 30, 2016
