How To Make Your Data FAIR?
-
Structuring And Naming Files And Folders
An important part of good data management is data organization. Take time to plan your file and folder structures and naming conventions. It is good practice to establish a clear file naming system from the start of the project, so that a file's content can be understood from its name alone. Some research fields also have their own instructions and guidelines for organizing data.
- Create and agree on a system for naming files and folders, and apply it consistently
- Try to organize files logically using folders and subfolders rather than including all files in a single folder
- Avoid very deep folder structures, since they can be difficult to handle
- If your data is time-sensitive, and logically organized by time periods, it could be useful to organize files by time-specific folders, such as YYYY-MM-DD
- Use meaningful, unique file and folder names
- Keep file and folder names as short as possible but meaningful; 25 characters is usually considered the maximum.
- Dates in YYYY-MM-DD format allow you to sort and search your files chronologically
- Avoid special characters such as % & / \ : ; * . ? < > ^ ! " ( ) and Scandinavian letters
- Indicate the version number with 'V' or 'version' and three digits (or four if you have a large number of versions), e.g. 001, 002, …, 201, 202 (not 1, 2, 21).
- Use underscores (_) instead of spaces
- If a file or folder name includes a personal name, give the surname first, followed by the first name
- However, be very careful with personal data when naming files and folders.
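As an illustrative sketch of these conventions (the project and file names are hypothetical), the following Python snippet assembles a compliant file name with a YYYY-MM-DD prefix, underscores instead of spaces, a zero-padded version number, and none of the characters to avoid:

```python
from datetime import date

def make_filename(project, content, version, ext, when=None):
    """Build a file name following the conventions above:
    YYYY-MM-DD date prefix, underscores instead of spaces,
    and a zero-padded three-digit version number."""
    when = when or date.today()
    stem = f"{when.isoformat()}_{project}_{content}_V{version:03d}"
    # Replace spaces with underscores; drop characters to avoid
    for bad in ' %&/\\:;*?<>^!"()':
        stem = stem.replace(bad, "_" if bad == " " else "")
    return f"{stem}.{ext}"

print(make_filename("fairdemo", "survey raw", 2, "csv", date(2024, 5, 3)))
# → 2024-05-03_fairdemo_survey_raw_V002.csv
```

Zero-padding the version number matters because plain numbers sort as text (1, 10, 2, …), while 001, 002, …, 010 sort in the intended order.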
-
Using Open & Interoperable Formats
All digital information is stored in some structured form: a file format is a standard way that information is encoded for storage in a computer file. An open format is a file format for storing digital data, defined by a published specification usually maintained by a standardization organization, that can be used and implemented by anyone. Examples include CSV or TSV for tabular data, TXT and JSON for structured information, and TIFF or PNG for images. These formats are stable over time and widely supported, so the data remain usable even if your lab changes software or if a proprietary tool becomes unavailable.
In contrast, many proprietary formats — such as .xlsx files, instrument-specific binary files, or formats tied to commercial microscopes or spectrometers — can only be opened by a particular program, often requiring a licence. They might work today, but they are not sustainable long term and cannot be processed easily by machines or combined with other datasets.
When organizing, storing and publishing data, it is important to create coherent, intelligible and transparent entities that are easy to access and reuse. Open formats make this possible, since they can be opened and used with commonly available open tools.
-
✅ Examples of Open & Interoperable Formats
- Tabular data: .csv, .tsv
- Text data: .txt, .md
- Structured data: .json, .xml
- Images: .tiff, .png
- Documents (archiving): PDF/A (.pdf)
- Code and scripts: .py, .R, .ipynb, .sh
- Audio: .wav, .flac
- Video: .mp4 (H.264)
- Chemistry: .cif, JCAMP-DX (.jdx)
- Life sciences: .fasta, .pdb
- Earth & environmental sciences: NetCDF (.nc), HDF5 (.h5)
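As a small, hypothetical example using only the Python standard library, the same records can be saved in two of the open formats listed above, CSV for tabular use and JSON for structured use:

```python
import csv
import json

# A small, hypothetical measurement table
rows = [
    {"sample_id": "S001", "temperature_c": 21.5},
    {"sample_id": "S002", "temperature_c": 22.1},
]

# Tabular data as CSV: readable by virtually any tool
with open("measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "temperature_c"])
    writer.writeheader()
    writer.writerows(rows)

# The same records as JSON: convenient for structured or nested data
with open("measurements.json", "w") as f:
    json.dump(rows, f, indent=2)
```

Either file can be opened decades from now with free tools, which is exactly the sustainability argument made above.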
-
Documenting Your Data
Data are only useful if others (and your future self) can understand them. Proper documentation explains how the data were collected, processed, and structured, and under which conditions they can be reused. Without documentation, even high-quality datasets may become unusable or misleading. Data documentation comprises a variety of documents describing all the data used in a project: what the data are, how they were collected, what the abbreviations mean, and how the data have been modified.
README files and data dictionaries
At minimum, every dataset should be accompanied by a README file that describes its content and structure. This document typically includes:
- Project title, authors, contact details, and relevant identifiers (e.g. ORCID, grant number).
- Description of the dataset and its purpose.
- File inventory with explanations of file names, formats, and organisation.
- Methods used to collect or generate the data.
- Instructions for reuse, including licences and citations.
For complex datasets, a data dictionary or codebook is recommended. It provides definitions of variables, units of measurement, coding schemes, and any transformations applied during processing. This is especially important for survey data, interviews, or large tabular datasets.
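A data dictionary can itself be stored in an open format. The sketch below (variable names, units, and descriptions are hypothetical) writes a minimal dictionary as CSV, one row per variable, using the Python standard library:

```python
import csv

# A minimal, hypothetical data dictionary: one row per variable,
# giving its type, unit of measurement, and a plain-language description.
dictionary = [
    {"variable": "sample_id", "type": "string",
     "unit": "", "description": "Unique sample identifier"},
    {"variable": "temperature_c", "type": "float",
     "unit": "degrees Celsius", "description": "Measured temperature"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["variable", "type", "unit", "description"])
    writer.writeheader()
    writer.writerows(dictionary)
```

Keeping the dictionary next to the dataset (and listing it in the README) lets a reuser interpret every column without guessing.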
✅ Best practices
- Document your data continuously during the project, not only at the end.
- Use clear and consistent file names.
- Save metadata and documentation in open, sustainable formats (TXT, CSV, PDF/A).
- Deposit both the dataset and its documentation together in the same repository.
💡 Key message: Good documentation and metadata transform data into a long-term scientific resource. They make your research more transparent, ensure that others can correctly interpret and reuse your results, and increase the visibility and impact of your work.
-
[Example] README.txt
-
Choosing a Repository
Choosing the right repository is a strategic decision, as it determines the visibility, sustainability and interoperability of your datasets.
The recommended order of priority is the following:
- Thematic repository (disciplinary) – Always the preferred option, as it is trusted by the research community and uses recognised standards and metadata schemas. Examples include DataRe (physical sciences), IFREMER Dataverse (marine science), Data INRAE (agronomy), Nakala (social sciences and humanities), Ortholab (phonetics), or e-ReColNat (biodiversity).
- Institutional repository – If no disciplinary repository exists, datasets can be deposited in Recherche Data Gouv (RDG), which is both the French national platform and the institutional node for the University of Bordeaux.
- General-purpose repository – If neither a thematic nor an institutional repository is available, a generalist solution may be used. Options include Zenodo (operated by CERN, fully OpenAIRE-compatible), Figshare or Dryad.
In Horizon Europe, it is strongly recommended to deposit data in an OpenAIRE-compatible repository. This ensures that datasets are automatically integrated with the European project monitoring system, can be directly linked to your grant number, and demonstrate compliance with open science requirements. To verify a repository's compatibility, you may consult the re3data.org registry or the list of repositories validated for Open Research Europe. To guarantee European visibility and integration with OpenAIRE, make sure that the chosen repository is indexed by OpenAIRE; if it is not yet harvested, contact its support team or your data steward. Use related identifiers to cross-link your outputs (e.g. dataset DOI ↔ publication in HAL ↔ software in Software Heritage).
💡 In practice, at the time of deposit, make sure to indicate the title of the dataset, the responsible persons and the connection with the project (especially the Horizon Europe grant number).
-
Tutorial: depositing a dataset in Recherche Data Gouv repository
-
Using Persistent Identifiers (PIDs)
A PID is a globally unique, documented, and permanent identifier that provides a direct link to an online resource. PIDs should be machine-actionable and, for instance, resolve to a web page that represents the content. The assignment of a PID is a useful quality criterion for online publications. In the case of a dual book publication (open access and print), most publishers assign a separate ISBN for each edition.
Each type of entity in the research workflow can—and should—have its own PID. For example:
🧔 ORCID identifies researchers.
📒 DOIs identify publications and datasets.
🏨 ROR IDs identify organisations.
🛠 PIDinst identifies scientific instruments.
💻 SWHIDs identify software versions archived in Software Heritage.
For purely digital publications, DOIs (Digital Object Identifiers) are suitable identifiers. Usually one DOI is assigned per publication, though additional DOIs for individual chapters, figures, and tables are sometimes considered desirable. DOIs are also used for data publication.
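A quick way to sanity-check identifiers in your own metadata is to verify that a DOI has the expected "10.&lt;registrant&gt;/&lt;suffix&gt;" shape. The sketch below checks only the general syntax; it does not verify that the DOI actually resolves:

```python
import re

# A DOI has the shape "10.<registrant>/<suffix>", e.g. 10.5281/zenodo.123456
# (10.5281 is Zenodo's registrant prefix; the suffix here is made up).
# This checks only the general syntax, not whether the DOI resolves.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(identifier: str) -> bool:
    return bool(DOI_PATTERN.match(identifier.strip()))

print(looks_like_doi("10.5281/zenodo.123456"))  # True
print(looks_like_doi("zenodo.123456"))          # False
```

To check that a DOI really resolves, paste it after https://doi.org/ in a browser.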
-
Adding Metadata
Beyond human-readable documentation, metadata are structured information that describe the dataset for machines and repositories. They make data findable, interoperable, and reusable in line with the FAIR principles. Metadata typically include:
- Title, author(s), affiliation(s).
- Abstract or description.
- Keywords.
- Dates of data collection and publication.
- Version number.
- File formats.
- Persistent identifiers (e.g. DOI, ORCID, ROR).
- Licence and access conditions.
- Funding information – mandatory for all Horizon Europe projects.
Many research communities have established metadata standards (e.g. Dublin Core, DataCite, DDI for social sciences, Darwin Core for biodiversity, TEI for digital humanities). Using such standards ensures that your dataset can be harvested by catalogues and reused across platforms. In practice, you don't need to invent anything: repositories such as Zenodo, HAL, or Recherche Data Gouv provide metadata forms; you simply fill them in and link them to identifiers such as DOIs or ORCID.
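As an illustration, a minimal machine-readable metadata record can be written as JSON. The field names below are loosely modelled on DataCite-style metadata, and every value is a hypothetical placeholder:

```python
import json

# A minimal metadata record, loosely modelled on DataCite-style fields.
# All values below are hypothetical placeholders, not real identifiers.
metadata = {
    "title": "Example survey dataset",
    "creators": [{"name": "Doe, Jane", "orcid": "0000-0000-0000-0000"}],
    "description": "Placeholder abstract describing the dataset.",
    "keywords": ["FAIR", "example"],
    "publication_date": "2024-05-03",
    "version": "1.0.0",
    "formats": ["text/csv"],
    "identifier": {"type": "DOI", "value": "10.1234/example"},
    "license": "CC-BY-4.0",
    "funding": {"funder": "European Commission",
                "grant_number": "000000000"},  # placeholder grant number
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

In practice a repository's submission form generates a record like this for you; the point is that each bullet above maps to a concrete, machine-readable field.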
✅ Good metadata are the bridge between your data and the people who might reuse them — including your future self. If the metadata are clear, your data become truly FAIR.
-
Choosing A Licence: Why And How?
When you share your data, scripts, or documents, you must clearly indicate the conditions under which they may be reused. This is the purpose of a licence: it specifies what others are allowed to do with your files while protecting your intellectual property rights.
Without an explicit licence, your work remains legally “all rights reserved,” even if it is freely accessible online. This severely limits reuse and therefore reduces its impact. A licence functions as an authorisation of use that you grant to others. It allows you to define what is permitted — such as reuse, modification, redistribution, or commercial use — while ensuring that you are acknowledged as the author and that your results are shared under clear conditions.
📎Creative Commons (CC) licences
Creative Commons licences are the most commonly used for data, images, publications and documents. They are standardised, internationally recognised, and easy to use. Here are the main variants:
- CC BY (Attribution): allows any reuse, including commercial use, provided that the author is credited. This is the most recommended licence for scientific data.
- CC BY-SA (Attribution – Share Alike): allows reuse, provided that any modifications are redistributed under the same licence.
- CC BY-NC (Attribution – Non-Commercial): allows reuse for non-commercial purposes only.
- CC0 (Public Domain): allows any reuse without restriction, even without citation. The author waives their rights to the extent permitted by law.
In Horizon Europe projects, CC BY for data or CC0 for metadata are strongly recommended in order to ensure openness and reusability.
📎The Etalab 2.0 licence
The Etalab 2.0 licence is the open licence recommended by the French government for public sector data. It allows very broad reuse, including for commercial purposes, provided that the source of the data is cited. It is compatible with European law and with the FAIR principles. It is particularly well suited for datasets produced by public research laboratories or funded by public money.
How to indicate the licence?
It is important to clearly indicate the licence when depositing your files. The best practices include adding a LICENSE.txt file in your directory or repository (for code or data), mentioning the licence in the README.txt file, selecting the licence during the submission process in the chosen repository (such as RDG, Zenodo or Nakala), and specifying the licence in your related publications or in your thesis manuscript.
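Adding a LICENSE.txt file can be as simple as the following sketch; the text written here is a minimal pointer to CC BY 4.0, not the full legal text:

```python
from pathlib import Path

# Record the chosen licence alongside the data (file name as recommended
# above; the licence statement is a minimal pointer, not the full text).
Path("LICENSE.txt").write_text(
    "This dataset is distributed under the Creative Commons Attribution 4.0\n"
    "International licence (CC BY 4.0).\n"
    "Full text: https://creativecommons.org/licenses/by/4.0/\n"
)
```

Mention the same licence in the README.txt and in the repository submission form so that all three statements agree.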
💡 Key takeaway: Choosing a clear and open licence that protects your authorship while authorising reuse makes it easier to comply with funder requirements and increases the visibility, traceability, and impact of your work.
-
✅ [Checklist] Checking data before submitting
- The datasets to be shared have been correctly selected
- Ethical principles are respected
- Dissemination rights have been checked
- The terms of access have been defined
- Files are organised and explicitly named
- Files are in permanent, open formats
- The size of your files does not exceed the repository's maximum
- The data are described and documented using standards where possible
- A unique and persistent identifier is assigned to the data and source code
- The data are assigned a licence
Source: doranum.fr
-
Additional Resources:
-
Data Sharing and Management Snafu in 3 Short Acts
-
[File] Selecting a trustworthy subject-specific repository for self-depositing data
-
[File] Recommendations for the Adoption of Persistent Identifiers in Higher Education and Research in France
-
[File] Metadata, standards and formats
-
[Link] RDA COVID-19 Working Group. Recommendations and Guidelines on data sharing
-
Self-Assessment Quiz
-
[Self-assessment quiz] How to Make Your Data FAIR
-