There are NO complete and comprehensive data quality or data governance tools. This is because none of them cater to the quality of data in spreadsheets or the governance thereof. Given the prevalence of spreadsheets and the costs of calculation and other errors (see http://www.eusprig.org/horror-stories.htm for some examples) this means that any company seriously interested in data assurance will need a separate solution for managing spreadsheets.
Just as with data in databases, there are three phases to spreadsheet management: discovery, quality and governance. The first thing that you need to do is to discover what spreadsheets exist and where they reside on the network. However, it is not enough to treat this as a one-off exercise because new spreadsheets are created all the time. This means that you need discover new spreadsheets on an iterative basis.
Secondly, while finding spreadsheets is one thing you will also need to collect as much information (metadata) as possible about these spreadsheets. This is for a variety of purposes: to understand who owns this spreadsheet, who has access rights to it and what those rights are; to support versioning; and to identify links (broken links are a major cause of spreadsheet issues).
Once you know what spreadsheets you have, the next issue is to determine which are the most critical. There are two basic elements in this. Firstly, how complex is the spreadsheet? How likely is the spreadsheet to be broken or have errors in it? And, secondly, there is materiality: does the information in the cells of the spreadsheet suggest that this is an important spreadsheet?
This is important for assessing the risk that any particular spreadsheet poses. In this context it is worth commenting that a number of spreadsheet management offerings integrate with GRC (governance, risk and compliance) products such as IBM OpenPages, RSA Archer and MetricStream.
As far as quality is concerned, these are better described as errors. There are tools available that are solely targeted at errors. Common features used for this purpose, include:
- Spreadsheet comparisons either between two versions of the same spreadsheet or different spreadsheets
- Formula mapping, which is the ability to see how formulas have been copied (or not) across cells
- Precedent and dependent mapping, to see relationships and references across spreadsheets
- Detection of formula and other errors such as text in a data field, a sum that is adding up non-numeric fields or range checking
- Facilities to understand formulas more easily
- Data lineage, “where did this number come from?”
- Circular reference detection
- Fraudulent uses of spreadsheets
As far as governance is concerned this means putting in place a management framework for the provisioning, development and versioning of spreadsheets, plus the governance of what users are allowed to do with spreadsheets. That is, can they change this value, amend that formula, and so on?
Historically, the process of putting discovered spreadsheets under management control meant moving them to and hosting them in a central repository or database. However, more modern tools do not mandate this process and allow control without the disruption of moving spreadsheets.