Preparing a data set for publication
The FAIR data principles are a set of community-developed principles for sharing data. FAIR stands for Findable, Accessible, Interoperable, and Reusable. [Association of European Research Libraries]
Data documentation & metadata
In most cases the terms data documentation and metadata can be used interchangeably. Both help others understand raw data in detail and allow other researchers to discover, use, and properly cite your research data. It is important to start documenting your data at the very beginning of the research project. This could include:
- making notes of all file formats, workflow details, information about how the data will be recorded and processed;
- explanation of codes, variables, and abbreviations;
- planning where the data will be stored in the short and long term so that other researchers can find and re-use your data.
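The explanation of codes, variables, and abbreviations mentioned in the list above is often kept as a codebook file stored next to the data. The following is a minimal sketch in Python, not a prescribed format: the variable names, units, and codes are hypothetical examples.

```python
import csv
import io

# Hypothetical codebook entries: one row per variable in the data set.
CODEBOOK = [
    {"variable": "site_id", "description": "Sampling site identifier",
     "unit": "", "codes": ""},
    {"variable": "temp_c", "description": "Water temperature",
     "unit": "degrees Celsius", "codes": "-999 = missing"},
    {"variable": "sex", "description": "Sex of specimen",
     "unit": "", "codes": "F = female; M = male; U = unknown"},
]

FIELDS = ["variable", "description", "unit", "codes"]


def codebook_to_csv(entries):
    """Return the codebook as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


if __name__ == "__main__":
    print(codebook_to_csv(CODEBOOK))
```

Saving the returned text as a plain CSV file keeps the codebook readable without any particular software.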
Metadata is data about data. It is metadata that makes your research data discoverable by a search engine. Metadata generally contains several elements, such as:
- identifier (DOI)
- date created
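As a minimal sketch, the two elements listed above could be recorded in a small machine-readable file stored alongside the data. The field names, the title, and the DOI below are hypothetical placeholders; a real deposit would follow the metadata standard required by the chosen repository.

```python
import json
from datetime import date


def build_metadata(identifier, date_created, **extra):
    """Build a small metadata record; extra keyword arguments become extra fields."""
    record = {
        "identifier": identifier,
        "date_created": date_created.isoformat(),  # ISO 8601, e.g. 2020-01-15
    }
    record.update(extra)
    return record


# Placeholder values for illustration only.
record = build_metadata(
    "doi:10.5555/example-dataset",
    date(2020, 1, 15),
    title="Example survey data",
)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```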
Metadata standards consist of elements specific to your research area or discipline. Many disciplines adopt their own metadata standards, tailored to the particular needs of the research area. The diagram below shows some metadata standards.
Examples of metadata standards by Stanford University Libraries:
Software applications come and go. Proprietary formats created by software are typically controlled by a company and might therefore restrict the use of research data. For this reason, it is recommended to archive research data in open formats.
Type of Data
Data analysis and cleanup
A software package used for statistical analysis, including but not limited to descriptive statistics (cross tabulation, frequencies, descriptives, descriptive ratio statistics) and bivariate statistics (means, ANOVA, t-test, correlation). Available for Mac, Windows and Unix. This software is available through UNB Virtual Lab for UNB students.
Software for advanced analytics, multivariate analysis, business intelligence, data management, and predictive analytics.
Free, open source integrated development environment for R. It lets you view R code, graphs, data, and output results simultaneously. Input data can be in CSV, SPSS or SAS formats. The software is available for Mac, Windows and Linux.
Formerly Google Refine, this is a tool for working with large data sets: cleaning them and transforming them from one format into another. Free and open source.
For reading ...
"Everyone needs a data-management plan. They sound dull, but data-management plans are essential, and funders must explain why." Nature 555, 286 (2018) doi: 10.1038/d41586-018-03065-z
available in UNB library ...
Working with sensitive data
If you work with sensitive data, your work might need to be overseen by one of UNB's Research Ethics Boards (REBs). There are two REBs at UNB, one for each campus: REB Fredericton campus and REB Saint John campus. Here you can find all the necessary ethics forms with instructions on how to apply.
Analysis and tools for preparing sensitive data for sharing
Amnesia is a free data anonymization tool that transforms relational and transactional databases into a dataset where formal privacy guarantees hold. Amnesia not only removes direct identifiers such as names and SSNs, but also transforms secondary identifiers such as birth date and zip code so that individuals cannot be identified in the data. Amnesia supports k-anonymity and km-anonymity.
Data anonymization using Amnesia
Amnesia implements data anonymization techniques from the field of Privacy Preserving Data Publishing (PPDP). The key idea in anonymization is that identifying information is removed from the published data, so that no sensitive information can be attributed to a person. The anonymization procedure is not limited to removing direct identifiers that might exist in a dataset, e.g. a person’s name or Social Security Number; it also covers secondary information, e.g. age or zip code, that might lead indirectly to the true identity of an individual. This secondary information is referred to as quasi-identifiers.
To better understand how secondary information can be used to re-identify a person, consider the following example. A publisher that owns medical data of patients wants to publish an anonymized version of the data. The data are superficially anonymized by removing direct identifiers, e.g. names and social security numbers, but descriptive information such as the zip code of the patient’s residence and the patient’s age remains. An adversary who wants to identify the patients in the anonymized data may have access to such descriptive information from other sources, e.g. a voter registry. Re-identification can be achieved by matching the descriptive information (zip code, age) in the anonymized data against the public registry. If a single match is produced for a given combination, then a patient can be accurately identified. The sparser the data are, the more unique combinations exist, and the easier it is for an adversary to locate unique records that correspond to specific individuals.
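The re-identification risk in the example above can be illustrated with a short k-anonymity check: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below is written in plain Python and does not use or reproduce Amnesia's own implementation or interface; the patient records, the 10-year age bands, and the 3-digit zip prefixes are hypothetical choices made for illustration.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "age")

# Hypothetical records with direct identifiers (names, SSNs) already removed.
PATIENTS = [
    {"zip": "13053", "age": 28, "diagnosis": "flu"},
    {"zip": "13068", "age": 29, "diagnosis": "asthma"},
    {"zip": "13053", "age": 21, "diagnosis": "flu"},
    {"zip": "14850", "age": 47, "diagnosis": "diabetes"},
    {"zip": "14853", "age": 43, "diagnosis": "asthma"},
]


def key(record):
    """Return the record's combination of quasi-identifier values."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def group_sizes(records):
    """Count how many records share each quasi-identifier combination."""
    return Counter(key(r) for r in records)


def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(size >= k for size in group_sizes(records).values())


def generalize(record):
    """Coarsen quasi-identifiers: 3-digit zip prefix and 10-year age band."""
    low = (record["age"] // 10) * 10
    return {**record, "zip": record["zip"][:3] + "**", "age": f"{low}-{low + 9}"}


if __name__ == "__main__":
    print(is_k_anonymous(PATIENTS, 2))  # every (zip, age) pair is unique
    coarse = [generalize(r) for r in PATIENTS]
    print(is_k_anonymous(coarse, 2))
    print(dict(group_sizes(coarse)))
```

In the raw table every (zip, age) pair is unique, so each record could be matched to a single entry in an external registry. After generalization the records fall into two groups of three and two, so a match on zip and age no longer singles out one patient.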
Creating data management plan: templates & examples
What is a data management plan (DMP)?
A DMP is a document outlining how you handle (organize, store, and share) your research data both during the project and after the project is completed.
Guidelines on how to prepare DMPs:
Online tools for preparing data management plans (DMPs)
Public DMP examples created using various online tools and shared publicly by their owners.
- DMPTool Public Plans
- Case studies:
Monash University Australia
DaMaRO project, University of Oxford UK
- Examples of full DMP documents and templates from NC State University and Himmelfarb Health Sciences Library libguide
Repositories for data
Where to deposit research data?
Research data should be submitted to institutional, discipline-specific, community-recognized repositories where possible, or to generalist repositories if no suitable community resource is available.
- The University of New Brunswick (UNB) Library provides online access to a data-sharing platform to support the RDM needs of academic researchers. UNB Dataverse uses local institutional resources to store research data without outsourcing them to another location.
- FRDR: a platform for Canadian researchers to deposit and share research data, and to facilitate discovery of research data in Canadian repositories. This is a collaborative project between Portage Network, CARL and Compute Canada. FRDR utilizes Compute Canada resources to store research data as well as Globus services to transfer files and search for information. FRDR is particularly suitable for archiving and sharing large data sets (300 GB or 25,000 files).
Some repositories on this page may only accept data from those funded by specific sources, or may charge for hosting data. Be aware of any deposition policies for your chosen repository. The list includes the following disciplines and areas:
- Biological Sciences
  - Nucleic acid sequence
  - Protein sequence
  - Molecular & supramolecular structure
  - Omics (functional genomics, metabolomics, proteomics)
  - Taxonomy & species diversity
  - Mathematical & modelling resources
  - Cytometry & immunology
  - Organism-focused resources
- Health Sciences
- Chemistry & chemical biology
- Earth & environmental sciences
- Physics, astrophysics & astronomy
- Social sciences
- Generalist repositories
The list includes subjects and areas such as:
- Biomedical Sciences
- Marine Sciences
- Model organisms
- Physical Sciences
- Social Sciences
- Structural Databases
- Taxonomic & Species Diversity
- Unstructured and/or Large Data
re3data.org - Registry of Research Data Repositories
- A global registry of research data repositories covering different academic disciplines.
GitHub (a development platform for sharing source code, including open source projects)
Dat - A distributed data community
Dat is a peer-to-peer platform for publishing datasets both large and small. Its design borrows concepts from distributed revision control systems, allowing multiple users to contribute changes and updates to a dataset while retaining authorship information and preserving older versions. Dat was initially funded by the Knight Foundation under an initiative that "seeks to increase the traction of the open data movement by providing better tools for collaboration." The Try Dat section of the project site contains a detailed tutorial that covers creating, publishing, and updating a dataset. Reference datasets are also provided in a number of formats, including a CSV on recent earthquakes, a JSON file of recently published DOIs, and Bionode format genomics data. The tutorial covers installing Dat on Windows, macOS, and Linux. Dat is free software, distributed under the BSD license, with source code available on GitHub.
A Digital Object Identifier, or DOI (DOI System), is a unique persistent identifier for a published digital object such as a book, article, study or dataset. The word 'persistent' means that it never changes. The idea behind a persistent identifier is that it doesn't break when a website gets updated.
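In practice a DOI is turned into a link by prefixing it with the resolver address https://doi.org/, which redirects to wherever the object currently lives. The helper below is a small sketch of that step; the function name and the accepted input prefixes are assumptions made for illustration, and it does not percent-encode unusual characters that some DOIs contain.

```python
RESOLVER = "https://doi.org/"

# Common ways a DOI is written before the "10." prefix (illustrative list).
KNOWN_PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "doi:")


def doi_to_url(doi):
    """Return a resolver URL for a DOI written with or without a prefix."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return RESOLVER + cleaned


if __name__ == "__main__":
    # DOI of the Nature editorial cited in the reading list on this page.
    print(doi_to_url("doi: 10.1038/d41586-018-03065-z"))
```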
How to obtain a DOI for a data set?
A DOI is created by a publishing organization, not by an individual. Many data repositories can publish research data and assign a DOI to a data set. This data DOI can then be used to cite your data set in a publication. View the list of data repositories to choose the one most appropriate for the type of data you deal with, and carefully read their Terms and Conditions, as some repositories may charge for their services.
Here is a list of selected data repositories where a DOI can be assigned free of charge to a dataset:
How to use a data DOI?
By properly citing the data and including the DOI, you're giving proper credit to the creators who conducted the research and providing the scholarly community a clearer picture of the impact of the research.
APA 6th edition:
Refer to the Publication Manual of the American Psychological Association, 6th edition (2010), pp. 210-211 (data set) and p. 212 (unpublished raw data) [UNB Library: BF76.7 .P83 2010b; OCLC:316736612].
Author. (Year). Title of data set (version number). Location: Name of creator.
Author. (Year). Title of data set (version number). Retrieved from http://
Raw data (unpublished, untitled work):
Author. (Year). [Description of study topic]. Unpublished raw data.
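When many data sets need to be cited, the "Retrieved from" pattern above can be filled in from a metadata record. The function below is only a sketch of that pattern in plain text (it does not handle italics or multiple authors), and the author, title, version, and DOI in the usage example are hypothetical.

```python
def apa6_dataset_citation(author, year, title, version, url):
    """Format: Author. (Year). Title of data set (version number). Retrieved from URL"""
    # Avoid a doubled period when the author string already ends with an initial.
    author = author.rstrip(".")
    return f"{author}. ({year}). {title} ({version}). Retrieved from {url}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(apa6_dataset_citation(
        "Doe, J.",
        2020,
        "Example survey data",
        "Version 1.0",
        "https://doi.org/10.5555/example-dataset",
    ))
```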
RDM Training & Other Resources
- RDM dictionary from CASRAI (Consortia Advancing Standards in Research Administration Information)
Research Data Management (RDM) is an emerging service at UNB Libraries, focused on providing support for data management planning, storage and publishing. RDM is an increasingly important part of research and scholarly communications. Our website is currently under construction, and we are working on creating content and services relevant to our research community. Please contact RDM Services for details or to ask what we have to offer.
A registry for online learning resources focusing on research data management. It was created in a collaboration between the U.S. Geological Survey's Community for Data Integration, the Earth Science Information Partners (ESIP), and DataONE.
- provides links to resources from a wide range of websites and organizations, sorted by topic and audience.
- Google Dataset Search (beta-version)