The need for data lineage – the ability to trace how data moves through your organisation, and how it gets changed (and by whom) during that process – has been well-established in IT departments for years. It is standard fare when it comes to data integration tools, data governance and so on. However, it is far less commonly applied when it comes to end-user computing (EUC) software such as spreadsheets and Access databases.
That’s an issue, particularly when it comes to regulatory compliance and, especially, with respect to the GDPR (General Data Protection Regulation). Essentially, the problem is that data in one spreadsheet, or EUC resource, is often copied or merged into other spreadsheets or resources. In addition, that data may have been sourced from elsewhere – either externally (a marketing list, say) or internally (from a database or application) – and data may also be exported from your EUC environment into a database or application, or perhaps a data lake. From there, the data may be processed or analysed and passed on yet again, even back to another spreadsheet.
The problem for regulatory compliance is that you need to track these movements of data. If a customer exercises their right to erasure under the GDPR, then their personal data needs to be removed from all of the places where it resides. Simply erasing the data from the original source will not be enough, so you need to know about every place that has been touched by this data.
The traditional approach to this problem has usually relied on using discovery tools to find out where private data resides and then using data matching technology to bring information together. This allows you to understand that this data element in this place reflects the same customer as the data in that place. In other words, you aim to find all the private data and then join it together. The problem with this approach is that it is both time-consuming and expensive.
Even supposing that relevant tools have the ability to work with spreadsheets (most were designed to work with relational databases), this is not an efficient approach. If you do not know which spreadsheets contain private data, you have to look at them all. And there could be tens of thousands of spreadsheets to look at, so the scale of the problem tends to be much greater than it is for databases. A far better approach is to identify private data as it first enters any particular spreadsheet and then follow it – using data lineage – as it moves across your organisation. In the event of a request for erasure, you know all the places to go to.
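To make the lineage-following idea concrete, here is a minimal sketch in Python. It assumes that every copy, merge or export of private data is recorded as a source-to-target link at the moment it happens; the `LineageGraph` class, its method names and the file names are all hypothetical illustrations, not the interface of any particular product.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical in-memory record of where private data has been copied."""

    def __init__(self):
        # Maps each location to the set of locations that received data from it.
        self._downstream = defaultdict(set)

    def record_copy(self, source, target):
        """Note that data moved from `source` to `target`."""
        self._downstream[source].add(target)

    def locations_touched_by(self, origin):
        """Return `origin` plus every location reachable from it.

        A breadth-first walk with a `seen` set, so that data passed back
        to an earlier spreadsheet (a cycle) does not cause an endless loop.
        """
        seen = {origin}
        queue = deque([origin])
        while queue:
            current = queue.popleft()
            for nxt in self._downstream.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


# Illustrative (made-up) data movements.
graph = LineageGraph()
graph.record_copy("marketing_list.csv", "leads.xlsx")
graph.record_copy("leads.xlsx", "crm_db.customers")
graph.record_copy("leads.xlsx", "regional_summary.xlsx")
graph.record_copy("crm_db.customers", "data_lake/customers")
graph.record_copy("data_lake/customers", "churn_report.xlsx")
graph.record_copy("churn_report.xlsx", "leads.xlsx")  # passed back again
graph.record_copy("budget.xlsx", "finance_db")        # unrelated data

# Every place an erasure request starting at leads.xlsx would need to visit.
erasure_targets = graph.locations_touched_by("leads.xlsx")
```

In this sketch an erasure request becomes a single graph walk from the point where the private data first entered, rather than a scan of every spreadsheet in the organisation. The unrelated `budget.xlsx` chain is never visited.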
Of course, you are likely to be already storing a lot of sensitive data, so you also need to be able to track historic data lineage: having recognised that something is private data, where did it come from? So, you actually need two things from products providing data lineage capability: the ability to uncover existing lineage and the ability to monitor it going forward. This is fundamental for good governance regardless of any particular regulatory regime.
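The historic question ("where did this come from?") is the same walk run in the opposite direction. The following is a rough sketch, assuming that past data movements have already been uncovered and are available as a list of (source, target) pairs; the function name `upstream_sources` and the sample events are hypothetical.

```python
from collections import defaultdict


def upstream_sources(copy_events, location):
    """Return every location that fed data, directly or indirectly, into `location`.

    `copy_events` is an iterable of (source, target) pairs describing
    data movements that have already been discovered.
    """
    # Invert the movements: for each target, which sources fed it?
    upstream = defaultdict(set)
    for source, target in copy_events:
        upstream[target].add(source)

    ancestors = set()
    stack = [location]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            # Skip anything already visited so circular copies terminate.
            if parent != location and parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors
```

For example, given the events `("marketing_list.csv", "leads.xlsx")`, `("leads.xlsx", "crm_db.customers")` and `("crm_db.customers", "churn_report.xlsx")`, calling `upstream_sources(events, "churn_report.xlsx")` returns the three locations upstream of the report. Monitoring then amounts to appending new events to the same list as data continues to move.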