Go to main content
CNC logo

School of Computer Science Intranet

HIgh PERformance computing and STAtistical Disclosure control (HIPERSTAD)

HIPERSTAD is a three year EPSRC funded project (finishing March 2007) and involves a collaboration between three departments at the University of Manchester: The Centre for Novel Computing (CNC) in the School of Computer Science, the Department of Computation and the Cathie Marsh Centre for Census and Survey Research (CCSR). The collaboration has been set up to improve the performance of computational search techniques that are currently used to protect the confidentiality of datasets, such as those released by the Office of National Statistics in London. High performance computing plays an important role in this work due to the size and complexity of the datasets involved.

Award Holders

John Gurd, CNC

Len Freeman, CNC

John Keane, Dept. of Computation

Mark Elliot , CCSR



Researchers

Anna Manning, Research Fellow, CNC

Usman Qamar, PhD Student, Dept. of Computation



Visiting Professor

David Haglin, Minnesota State University

Background


Improvements and innovations in computer processing power, disk storage and networks have led to dramatic increases in the ability to accumulate and analyse personal data. It is now routine to merge information stored in independent databases, to access and process data on-line and to use data mining tools for automatic and semi-automatic exploration and pattern discovery. In parallel with these developments there are mounting concerns about the threat to personal privacy, with a call to data collectors and providers to ensure a high level of security, sometimes with the request that the analysis of individual-level data be prevented altogether. However, in some circumstances it can be essential for databases to be considered at the level of the individual in order for any significant benefit to be derived from the investigation. Medical research often requires the use of patient records to gain knowledge about the nature of serious diseases; effective government policy-making can require the analysis of individual responses in a census; customer-level information can be essential to businesses when making marketing decisions and predictions. A control mechanism that allows sufficient evaluation of personal data while simultaneously protecting the confidentiality of individual records must therefore be applied.

If personal data is made available, even in an anonymised form, there is a risk of individuals being identified using statistical disclosure through the matching of known information with the anonymised data, resulting in material specific to those individuals being revealed. This can occur via the actions of a database intruder, the unscrupulous behaviour of a researcher and in many other ways.

The problem of preventing statistical disclosure is approached by first estimating the probability of a certain individual being identified (the risk of disclosure) and, secondly, by applying statistical disclosure control (SDC), variously recoding, masking and perturbing the data in order to reduce the statistical disclosure risk.

In this project we concentrate on the identification of individual records with a high risk of disclosure. The records belonging to certain individuals have a significant chance of being identified as their contents, or attributes, are unique and therefore have the potential to be matched directly with details (including names and addresses) from another database. An illustration of a `risky' record of this type is a sixteen-year-old widow in a population survey. A record can contain more than one such unique pattern and its classification often depends on the number and size of such attribute sets (referred to as uniques) that it contains. The general term for records possessing a high risk of disclosure due to the nature of the uniques that they include is a special unique record.

The ability to comprehensively locate and grade such records would lead to more efficient disclosure control of released data but in order to carry out an exhaustive search of this nature all possible attribute sets must be checked (directly or indirectly) for uniqueness, a process which is combinatorially explosive. The importance of speed also depends on how many times an algorithm has to be applied. SDC algorithms often have to be used repetitively on a dataset as different masking techniques are tested for their efficiency.

An ESRC grant (ESRC R000223630) has funded the development of a technique specifically designed for the special uniques problem and a sequential algorithm, SUDA (Special Unique Detection Algorithm), has been designed and implemented. SUDA, which was inspired by techniques used in association rule discovery, employs a two-stage approach: firstly, all uniques (up to a user-specified size) are located at record level and, secondly, the size and distribution of uniques within each record is used to grade its `riskiness'. In order to streamline the search SUDA has been structured around the observation that `Every superset of a unique attribute set is itself unique' (as every set is bounded by the size of its subsets). Only unique attribute sets without any unique subsets (minimal uniques) are considered in order to avoid the use of redundant information and to keep the classification process as focused as possible; the smaller the number of attributes contained in a unique pattern the more `risky' it is considered to be and therefore it is important to know if a unique is minimal. SUDA has been developed for discrete data (both numerical attributes and numerically coded categorical data) but can also accept continuous data if it is transformed into a discrete form beforehand (via multiplication by factors of 10 and/or rounding up).

The application of techniques from computer science to the special uniques problem within the SUDA project has enabled a far more sophisticated understanding of special uniques and a new risk classification system is being developed. This project has also greatly increased the depth of risk assessment possible. However, due to the demanding levels of execution time and data storage required to check all minimal uniques in stage one of SUDA's application, this algorithm is restricted to small datasets, a problem that forms the motivation for the present project. The identification of minimal uniques from a dataset is a problem that is believed to be NP-Complete and further methods for reducing the search space are required together with faster processing so that larger datasets can be handled. We propose to enhance the above bottom-up algorithm by investigating the use of graph theory for identifying the outer parameters of the search space via a top-down approach. Methods for parallelising the resulting, hybrid, bottom-up/top-down algorithm will be investigated. The most promising candidates will be implemented and compared in order to reduce the time required to conduct an exhaustive search for uniques.

The HIPERSTAD project

The aim of the project is to design, implement and compare high performance algorithms for protecting the confidentiality of anonymised survey-type data. The project has the following objectives:
  • To develop further methods for reducing the search space of uniques using a top-down approach based on graph theory.
  • To produce an exhaustive combined bottom-up/top-down sequential search method for finding all uniques.
  • To design and implement methods for parallelising the sequential algorithm.
  • To integrate the search algorithm with existing techniques for detecting and grading special uniques.

Relevant publications:

M.J. Elliot, A.M. Manning and R. W. Ford, 'A computational algorithm for handling the special uniques problem'(PDF), International Journal of Uncertainty, Fuzziness and  Knowledge Based Systems 10(5) (Oct 2002).

A. M. Manning and D. J. Haglin. A new algorithm for finding minimal sample uniques for use in statistical disclosure assessment. In Proceedings of the fifth IEEE International Conference on Data Mining (ICDM'05), pages 290-297, Houston, Texas, U.S.A., November 2005.

M. J. Elliot, A. M. Manning, K. Mayes, J. Gurd and M. Bane. 'SUDA: a program for identifying and grading special uniques'. In Proceedings of the Joint United Nations Economic Commission for Europe (UN-ECE) and European Statistics (Eurostat) Worksession on Statistical Confidentiality, Geneva, November 2005.


Organisations using SUDA:

The Office of National Statistics, London

The Australian Bureau of Statistics

Statistics New Zealand

Statistics Quebec