Preparing a data set for publication
I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
Backups will rarely, if ever, be done.
If this sounds like your current data management strategy, this libguide is for you!
The FAIR data principles are a set of community-developed principles for sharing research data. FAIR stands for Findable, Accessible, Interoperable, and Reusable. [Association of European Research Libraries]
Data documentation & metadata
In most cases the terms data documentation and metadata can be used interchangeably. Both help others understand your raw data in detail and allow other researchers to discover, use, and properly cite your research data. It is important to start documenting your data at the very beginning of the research project. This could include:
- making notes of all file formats, workflow details, and information about how the data will be recorded and processed;
- explaining codes, variables, and abbreviations;
- planning where the data will be stored in the short and long term so that other researchers can find and re-use your data.
Metadata is data about data; it is metadata that makes your research data discoverable by a search engine. A metadata record generally contains several elements, such as:
- identifier (e.g., a DOI)
- date created
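To make these elements concrete, here is a minimal sketch of a metadata record as it might be stored alongside a dataset. The field names follow a Dublin Core-like convention and all values (including the DOI) are invented for illustration:

```python
import json

# A minimal, Dublin Core-style metadata record for a hypothetical dataset.
# Field names and values are illustrative, not taken from a formal standard.
record = {
    "identifier": "doi:10.1234/example.5678",   # hypothetical DOI
    "title": "River Water Quality Measurements, 2019",
    "creator": "Doe, Jane",
    "dateCreated": "2019-06-01",
    "format": "text/csv",
    "license": "CC-BY-4.0",
}

# Serializing to JSON gives a plain-text record that search engines
# and repository harvesters can index.
print(json.dumps(record, indent=2))
```

A real record would use whichever element set your repository or discipline standard requires; the point is that each element is a simple, machine-readable key-value pair.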
Metadata standards consist of elements specific to a research area or discipline, and many disciplines adopt their own metadata standards tailored to the particular needs of that area. The diagram below shows some metadata standards.
List of standards in your field from the Digital Curation Centre (DCC):
Examples of metadata standards by Stanford University Libraries:
Standards-based metadata is generally preferable, but where no appropriate standard exists, writing “README” style metadata is an appropriate strategy. A README file provides information about a data file and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data.
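As a starting point, a README for a dataset typically covers the items below. This skeleton is only a suggested outline, not a formal template:

```
Title of the dataset
Author / principal investigator, with contact information
Date(s) of data collection and geographic location
File list: name and brief description of every file
Methods: how the data were collected and processed
Variable definitions, units, and codes used for missing data
License and recommended citation for the dataset
```

Keep the README as a plain-text file in the top-level folder of the dataset so it survives format changes and is the first thing a re-user sees.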
Other recommendations from BioMedCentral:
Software applications come and go, and proprietary formats are typically controlled by the company that created them, which can restrict future use of your research data. It is therefore recommended to archive research data in open formats.
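As a simple illustration of moving data into an open format, the sketch below writes a set of records to plain CSV using only the Python standard library. The records and filename are invented for the example; in practice the input would come from whatever tool produced your data:

```python
import csv
import json

# Hypothetical records, as they might be exported from a proprietary tool.
records = json.loads(
    '[{"site": "A", "ph": 7.1}, {"site": "B", "ph": 6.8}]'
)

# Write the records to CSV, an open plain-text format readable by
# virtually any software, now and in the future.
with open("water_quality.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "ph"])
    writer.writeheader()
    writer.writerows(records)
```

CSV, JSON, and plain text are safe archival choices because they can be opened with nothing more than a text editor.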
Type of Data
Data analysis and cleanup
A software package used for statistical analysis, including, but not limited to, descriptive statistics (cross tabulation, frequencies, descriptives, descriptive ratio statistics) and bivariate statistics (means, ANOVA, t-tests, correlation). Available for Mac, Windows, and Unix, and accessible to UNB students through the UNB Virtual Lab.
Advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics.
A free, open-source integrated development environment (IDE) for R. It lets you view not only R code but also graphs, data, and output results simultaneously. Input data can be in CSV, SPSS, or SAS formats. The software is available for Mac, Windows, and Linux.
OpenRefine (formerly Google Refine) is a free, open-source tool for working with large data sets: cleaning them and transforming them from one format into another.
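The kind of cleanup a tool like OpenRefine automates, such as trimming stray whitespace and merging case variants of the same value, can be sketched in a few lines of Python. The values below are invented for illustration:

```python
# Messy near-duplicate values of the kind OpenRefine's clustering detects.
raw_values = ["New Brunswick", " new brunswick ", "NEW BRUNSWICK", "Nova Scotia"]

def normalize(value: str) -> str:
    """Collapse internal/surrounding whitespace and title-case the value."""
    return " ".join(value.split()).title()

# After normalization the three variants merge into one canonical value.
cleaned = sorted({normalize(v) for v in raw_values})
print(cleaned)  # → ['New Brunswick', 'Nova Scotia']
```

OpenRefine does this interactively and at scale, with an undo history, which is why it is preferred over ad-hoc scripts for large or collaborative cleanups.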
Sensitive data: As open as possible, as closed as necessary
If you work with sensitive data, your work might need to be overseen by a UNB Research Ethics Board (REB). There are two REBs at UNB, one for each campus: REB Fredericton and REB Saint John. Here you can find all the necessary ethics forms, with instructions on how to apply.
Analysis and tools for preparing sensitive data for sharing
Amnesia is a free data anonymization tool that transforms relational and transactional databases into datasets where formal privacy guarantees hold. Amnesia not only removes direct identifiers such as names and SSNs, but also transforms secondary identifiers such as birth date and zip code, so that individuals cannot be identified in the data. Amnesia supports k-anonymity and km-anonymity.
Data anonymization using Amnesia
Amnesia implements data anonymization techniques from the field of Privacy Preserving Data Publishing (PPDP). The key idea in anonymization is that identifying information is removed from the published data, so that no sensitive information can be attributed to a person. The procedure is not limited to removing direct identifiers that might exist in a dataset, e.g., a person's name or Social Security Number; it also covers secondary information, e.g., age or zip code, that might indirectly reveal the true identity of an individual. This secondary information is referred to as quasi-identifiers.

To see how secondary information can be used to re-identify a person, consider the following example. A publisher who owns medical data about patients wants to release an anonymized version of that data. The data are superficially anonymized by removing direct identifiers, e.g., names and social security numbers, but descriptive information such as each patient's zip code and age remains. An adversary who wants to identify the patients behind the anonymized records may have access to such descriptive information from other sources, e.g., a voter registry. Re-identification is achieved by matching the descriptive information (zip code, age) in the anonymized data against the public registry: if a given combination produces a single match, the patient can be accurately identified. The sparser the data, the more unique combinations exist, and the easier it is for an adversary to locate records that correspond to specific individuals.
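The k-anonymity property that Amnesia enforces can be checked directly: a table is k-anonymous over a set of quasi-identifiers if every combination of their values appears at least k times. The sketch below uses invented records to show the check; it is an illustration of the property, not of Amnesia's internals:

```python
from collections import Counter

# Invented patient records: "zip" and "age" are quasi-identifiers,
# "diagnosis" is the sensitive attribute.
records = [
    {"zip": "E3B", "age": 34, "diagnosis": "flu"},
    {"zip": "E3B", "age": 34, "diagnosis": "asthma"},
    {"zip": "E3B", "age": 35, "diagnosis": "flu"},
]

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(n >= k for n in counts.values())

print(is_k_anonymous(records, ["zip", "age"], 2))  # → False: ("E3B", 35) is unique
```

The third record fails the check because its (zip, age) pair is unique, exactly the situation an adversary with a voter registry could exploit; generalizing the age (e.g., to a 30-39 bracket) is the kind of transformation Amnesia applies to restore the guarantee.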
UNDER CONSTRUCTION ...
Creating data management plan: templates & examples
What is a data management plan (DMP)?
A DMP is a document outlining how you handle (organize, store, and share) your research data both during the project and after the project is completed.
Guidelines on how to prepare DMPs:
Online tools for preparing data management plans (DMPs)
Public DMP examples created using various online tools and shared publicly by their owners.
- DMPTool Public Plans
- Case studies:
Monash University Australia
DaMaRO project, University of Oxford UK
- Examples of full DMP documents and templates from NC State University and Himmelfarb Health Sciences Library libguide
Where to deposit research data?
Research data should be submitted to institutional (UNB Dataverse), discipline-specific, community-recognized repositories where possible, or to generalist repositories if no suitable community resource is available.
A tool to assist in identifying FAIR-aligned repositories is available from DataCite and can be found at https://repositoryfinder.datacite.org :
Below is a list of available repositories, with links.
- The University of New Brunswick (UNB) Library provides online access to a data-sharing platform to support the RDM needs of academic researchers. UNB Dataverse uses local institutional resources to store research data without outsourcing it to another location.
- a platform for Canadian researchers to deposit and share research data, and to facilitate discovery of research data in Canadian repositories. It is a collaborative project between the Portage Network, CARL, and Compute Canada. FRDR uses Compute Canada resources to store research data, along with Globus services to transfer files and search for information. FRDR is particularly suitable for archiving and sharing large data sets (300 GB or 25,000 files).
Getting started with FRDR
To get started with the FRDR demo, go to https://demo.frdr.ca/ and attempt to log in (in the header menu). From there you'll be prompted to create an account. You can use a Google, ORCID, or Compute Canada account, or you can create a new account with Globus. (Note: logging in with a university account is not currently supported, but it is planned for the future.)
Once you have an account, select the "Deposit Data" button on the FRDR demo homepage; you'll see a message asking you to email support for permission to use the demo. (This process applies only while FRDR is in limited production. Once an administrator receives your email, they can add you to the FRDR depositor group.)
In the demo you can perform test submissions, get a sense of the metadata form to fill out, and upload some test data through your browser or a Globus transfer. To upload large datasets to FRDR using Globus, or to download large data files or entire datasets, you will need to install Globus Connect Personal.
Some useful links:
FRDR documentation: https://www.frdr.ca/docs/en/home/
Globus Connect Personal download: https://www.globus.org/globus-connect-personal
An example of data record in FRDR: https://www.frdr.ca/repo/handle/doi:10.20383/101.0111
Discipline-specific data repositories suggested by Nature.com
Some repositories on this page may only accept data from those funded by specific sources or may charge for hosting data. Be aware of any deposition policies for your chosen repository. The list includes the following disciplines and areas:
- Biological Sciences
- Nucleic acid sequence
- Protein sequence
- Molecular & supramolecular structure
- Omics (functional genomics, Metabolomics, Proteomics)
- Taxonomy & species diversity
- Mathematical & modeling resources
- Cytometry & immunology
- Organism-focused resources
- Health Sciences
- Chemistry & chemical biology
- Earth & environmental sciences
- Physics, astrophysics & astronomy
- Social sciences
- Generalist repositories
The list includes subjects and areas such as:
- Biomedical Sciences
- Marine Sciences
- Model organisms
- Physical Sciences
- Social Sciences
- Structural Databases
- Taxonomic & Species Diversity
- Unstructured and/or Large Data
re3data.org - Registry of Research Data Repositories
- a global registry of research data repositories covering different academic disciplines.
GitHub (an open development platform for sharing source code)
Dat - a distributed data community
Dat is a peer-to-peer platform for publishing datasets both large and small. Its design borrows concepts from distributed revision control systems, allowing multiple users to contribute changes and updates to a dataset while retaining authorship information and preserving older versions. Dat was initially funded by the Knight Foundation under an initiative that "seeks to increase the traction of the open data movement by providing better tools for collaboration." The Try Dat section of the project site contains a detailed tutorial covering creating, publishing, and updating a dataset, as well as installing Dat on Windows, macOS, and Linux. Reference datasets are also provided in a number of formats, including a CSV on recent earthquakes, a JSON file of recently published DOIs, and Bionode-format genomics data. Dat is free software, distributed under the BSD license, with source code available on GitHub.
Take a look at this excellent infographic, developed by our colleagues from UBC:
A Digital Object Identifier (DOI) is a unique, persistent identifier for a published digital object such as a book, article, study, or dataset. 'Persistent' means that it never changes: a persistent identifier doesn't break when a website is reorganized or updated.
How to obtain a DOI for a data set?
A DOI can be created by publishing organizations, not by individuals. Many data repositories can publish research data and assign a DOI to a data set; this DOI can then be used to cite your data set in a publication. View the list of data repositories to choose the one most appropriate for the type of data you deal with, and carefully read their Terms and Conditions, as some repositories may charge for their services.
Here is a list of selected data repositories where a DOI can be assigned free of charge to a dataset:
How to use data DOI?
By properly citing the data and including the DOI, you give proper credit to the creators who conducted the research and give the scholarly community a clearer picture of the research's impact.
APA 6th edition:
Refer to the Publication Manual of the American Psychological Association, 6th edition (2010), pp. 210-211 (data set) and p. 212 (unpublished raw data) [UNB Library: BF76.7 .P83 2010b; OCLC: 316736612].
Author. (Year). Title of data set (version number). Location: Name of the creator.
Author. (Year). Title of data set (version number). Retrieved from http://
Raw data (unpublished, untitled work):
Author. (Year). [Description of study topic]. Unpublished raw data.
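The templates above are mechanical enough to assemble programmatically, which is handy when citing many datasets at once. The helper below is hypothetical (the function name, parameters, and example values are invented), and it renders the DOI as a resolver URL, which is one common way of including a DOI in a citation:

```python
# Hypothetical helper: assemble an APA-style data-set citation from its parts.
def cite_dataset(author, year, title, version, doi):
    """Return an APA 6th-style citation string for a published data set."""
    return f"{author} ({year}). {title} (Version {version}). https://doi.org/{doi}"

# Example with invented values.
citation = cite_dataset(
    author="Doe, J.",
    year=2019,
    title="River water quality measurements",
    version="1.0",
    doi="10.1234/example.5678",
)
print(citation)
```

Always check the exact punctuation and element order against the style manual your publisher requires.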
RDM training, 101 Readings & other resources
- RDM dictionary from CASRAI (Consortia Advancing Standards in Research Administration Information)
- The Data Librarian's Handbook by the University of Edinburgh
Research Data Management (RDM) is an emerging service at UNB Libraries, focused on providing support for data management planning, storage, and publishing. RDM is an increasingly important part of research and scholarly communication. Our website is currently under construction as we create content and services relevant to our research community. Please contact RDM Services for details and/or to ask what we have to offer.
A registry for online learning resources focusing on research data management. It was created in a collaboration between the U.S. Geological Survey's Community for Data Integration, the Earth Sciences Information Partnership (ESIP), and DataONE.
Data Carpentry develops and teaches workshops on the fundamental data skills needed to conduct research. Their mission is to provide researchers with high-quality, domain-specific training covering the full lifecycle of data-driven research.
- provides links to resources from a wide range of websites and organizations, sorted by topic and audience.
- Google Dataset Search (beta-version)
This instance of The Art of Literary Text Analysis is built in Jupyter Notebooks using the Python scripting language. Other programming choices are available, and many conceptual aspects of the guide are relevant regardless of the language and implementation.
- Data-Planet (trial version)
Data Planet is a large, dynamic repository providing access to massive amounts of statistical data, combined with descriptive content and a robust suite of visualization, search, and analysis capabilities on a single platform. It is the largest repository of standardized and structured statistical data, with 52 billion data points and 6.2 billion datasets, covering 16 broad subject areas including agriculture, finance, criminal justice, education, energy, government, health, housing, business, trade, military, environment, population, employment, and transportation.
Users can access the search guides at: https://data-planet.libguides.com/?b=s
- World Bank Open Data (open, no subscription)
- Harvard Data Science Initiative (HDSI) (open source community platform)
The initiative's Harvard Data Science Review (HDSR) combines features of a premier research journal, a leading educational publication, and a popular magazine, providing a centralized, authoritative, and peer-reviewed publishing community that serves the growing profession.
RDM 101 READING ...
"Everyone needs a data-management plan. They sound dull, but data-management plans are essential, and funders must explain why." Nature 555, 286 (2018) doi: 10.1038/d41586-018-03065-z
available in UNB library ...