LibGuides: Open science practices at Centria University of Applied Sciences: Identifiable material and anonymisation

Identifiable data and anonymization

The processing of identifiable data requires special care.

This page briefly introduces the concepts of identifiable data and the measures related to the processing of identifiable data. The content of this page is based on the Data Management Manual of the Finnish Social Science Data Archive, which provides comprehensive guidelines for the processing of identifiable data in general, as well as instructions for the anonymization of quantitative and qualitative data.
Data Archive: Data Management Guidelines, Anonymisation and Personal Data

Personal data and identifiability

Personal data should only be collected to the extent necessary for the research. Personal data should not be collected just in case it might be useful. There should always be a research need for collecting personal data.

According to the definition in the EU General Data Protection Regulation, personal data means any information relating to an identified or identifiable natural person.

A natural person is considered identifiable if they can be identified directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. Research data may also include identifying information about the research subjects' close associates or other third parties. Information that identifies them is also always considered personal data.

Identifying information includes:
- Direct identifiers: full name, personal identification number, email address based on the person's name, and biometric identifiers (fingerprint, facial image, voice, iris, palm shape, handwritten signature).
- Strong indirect identifiers: for example, postal address, telephone number, car registration number, rare job title, very rare illness, or various unique codes such as student ID.
- Indirect identifiers: information that is not sufficient for identification on its own, but when combined with other information, can enable the identification of a person. The most common indirect identifiers are background variables in quantitative data and background information on individuals in qualitative data, such as gender, age, education, occupational status, household composition, income, marital status, language, nationality, ethnic background, workplace or school, and variables related to the area of residence, such as postal code, city district, or municipality. A date can also be an indirect identifier.

Identifiable data may be used for scientific research when the processing is appropriate, planned, and objectively justified, and when there is a legal basis for processing the data (e.g., the consent of the research subject or research in the public interest).

Processing of identifiable information

The processing of identifiable research data must be systematic and careful. The privacy of research subjects must not be compromised, for example, by careless storage of data or unprotected electronic transfers.

General safeguards for the processing of personal data include pseudonymisation, anonymisation, and storage restriction.

Pseudonymisation

Pseudonymisation is the removal or replacement of identifying information in the data with pseudonyms or codes. The information needed to link the codes back to individuals is stored separately from the data and protected by organisational and technical measures. Organisational measures refer to a secure physical environment for the data and administratively restricted and monitored access rights. Technical measures refer to secure storage solutions. Pseudonymous data becomes anonymous when the separately stored identifying information (code key, personal data, and information on how the modified values were formed) is destroyed.
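
As a minimal sketch of this idea, assuming the data is held as a list of dictionaries with a hypothetical `name` field, the Python example below replaces each name with a random code and returns the code key as a separate object. The function name `pseudonymise`, the field names, and the code format are illustrative assumptions and are not taken from the Data Archive's guidelines.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace the identifier in each record with a random code.

    Returns (pseudonymised_records, code_key). The code key maps each
    code back to the original identifier and is meant to be stored
    separately from the pseudonymised data.
    """
    code_for = {}       # original identifier -> code
    used_codes = set()  # guards against the (unlikely) repeated code
    pseudonymised = []

    for record in records:
        identifier = record[id_field]
        if identifier not in code_for:
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            code_for[identifier] = code

        # Copy the record without the identifier and attach the code.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["code"] = code_for[identifier]
        pseudonymised.append(cleaned)

    code_key = {code: identifier for identifier, code in code_for.items()}
    return pseudonymised, code_key


# Hypothetical example data (assumption, not real research data).
records = [
    {"name": "Anna Example", "age": 34},
    {"name": "Ben Sample", "age": 51},
    {"name": "Anna Example", "age": 34},
]
data, code_key = pseudonymise(records)
```

Here `data` would be used for analysis, while `code_key` would be saved to a separate, access-restricted location. The same person always receives the same code, so records can still be linked within the data.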

Anonymisation

There is no such thing as completely anonymous data. However, anonymisation can be used to achieve a result where individual persons cannot be identified on the basis of the data provided or by combining the data with other information. The data is therefore anonymous if it can no longer be linked to the original personal data by reasonable means.

There is no ready-made procedure for anonymising research data that is suitable for all data. Anonymisation must always be planned on a case-by-case basis, taking into account the characteristics of the data (age, sensitivity, size of the respondent group, detail of the content), the operating environment (who uses the data and where, what external information is currently available, physical storage) and usability (how anonymity and usability can be combined so that the data is usable for research purposes after anonymization).

The following questions can be used to understand the anonymization process in both quantitative and qualitative data:
- What direct or indirect identifiers does the data contain?
- Does the data contain unique or rare observations?
- What combinations of data could make a person identifiable?
- Is there any external data available that could be combined with the data in such a way that the observations/research subjects could be identified?
- How will the data be used, and which characteristics of the data should be preserved and which can be "sacrificed" in the anonymisation process?
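
For quantitative data, the questions about rare observations and identifying combinations can be explored with a simple count. The sketch below, in Python, counts how many records share each combination of indirect identifiers and lists the combinations that occur fewer times than a chosen threshold. The function name `rare_combinations`, the field names, and the threshold of 3 are assumptions made for illustration; a suitable threshold depends on the data and its context, and a count like this does not replace case-by-case assessment.

```python
from collections import Counter


def rare_combinations(records, indirect_identifiers, threshold=3):
    """Return combinations of values shared by fewer than `threshold` records."""
    counts = Counter(
        tuple(record[field] for field in indirect_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < threshold}


# Hypothetical example data (assumption, not real research data).
survey = [
    {"gender": "F", "age_group": "30-39", "municipality": "A"},
    {"gender": "F", "age_group": "30-39", "municipality": "A"},
    {"gender": "F", "age_group": "30-39", "municipality": "A"},
    {"gender": "M", "age_group": "70-79", "municipality": "B"},
]
rare = rare_combinations(survey, ["gender", "age_group", "municipality"])
print(rare)  # {('M', '70-79', 'B'): 1}
```

A combination that appears only once or twice is a candidate for coarsening (for example, wider age groups or larger regions) or removal.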

Storage restriction

Personal data that is not necessary for conducting the research should be deleted as soon as possible. For example, names, addresses, and similar identifiers needed during the data collection phase should be destroyed as soon as they are no longer necessary for the research. Similarly, personal identification numbers needed for linking data can be destroyed once the linking has been completed.
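
As a small illustration, the Python sketch below produces a copy of the data without the fields that are no longer needed. The function name `drop_fields` and the field names are hypothetical. Note that this only removes the fields from the working copy: the original files, backups, and any other copies containing the identifiers would also need to be destroyed.

```python
def drop_fields(records, fields_to_remove):
    """Return copies of the records without the listed fields."""
    remove = set(fields_to_remove)
    return [
        {key: value for key, value in record.items() if key not in remove}
        for record in records
    ]


# Hypothetical example data (assumption, not real research data).
collected = [
    {"name": "Anna Example", "address": "Example Street 1", "answer": 4},
    {"name": "Ben Sample", "address": "Sample Road 2", "answer": 2},
]
research_data = drop_fields(collected, ["name", "address"])
```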