Day 3: Open Data#
By Neuromatch Academy & NASA
Content creators: NASA, Leanna Kalinowski, Hlib Solodzhuk, Ohad Zivan
Content reviewers: Leanna Kalinowski, Hlib Solodzhuk, Ohad Zivan, Shubhrojit Misra, Viviana Greco, Courtney Dean
Production editors: Hlib Solodzhuk, Konstantine Tsafatinos, Ella Batty, Spiros Chavlis
Tutorial Objectives#
Estimated timing of tutorial: 2 hours
This day focuses on the practice and application of open science for data. It provides a ‘how to’ process for finding and assessing open data for use, for making open data, and for sharing open data. The step-by-step flows are easy to follow and can be used as checklists after you complete the day. Some of the key topics discussed include: data management plans, the process for assessing data for reuse, creating a plan for making data, including choosing open formats and adding documentation, and the considerations for sharing data and making your data citable.
Section 1: Introduction to Open Data#
This section defines open data, its benefits, and the practices that enable data to be open. In addition, the section takes a closer look at how FAIR applies to open data as well as at the critical role of metadata. It wraps up with a brief discussion on how to plan for open data in the scientific workflow and tasks guided by the use, make, share framework.
Introduction#
Data drives science forward. Data are stored electronically to enable further analysis and research. Digital technologies integrated into every aspect of modern scientific research have led to the production of large amounts of data.
Open data is an essential pillar of open science. In many ways, open data is a natural expansion of open science beyond scholarly publications to include digital research outputs. It has since become an integral part of the open science movement as open data allows anyone to see, use, and verify published results. Open data makes science more accessible, inclusive, and reproducible. In order to support this, data needs to be made available in formats that others can use, include metadata that describes the data, and provide helpful documentation. When made available, open data enables new discoveries and unforeseen uses.
Definition and Considerations of Open Data#
Data are any type of information that is collected, observed, or created in the context of research. Today, data are increasingly stored electronically in a digital format.
Data includes:
Primary (raw) data – Primary data refers to data that are directly collected or created by researchers. Research questions guide the collection of the data. Typically, a researcher will formulate a question, develop a methodology and start collecting the data. Some examples of primary data include:
Responses to interviews, questionnaires, and surveys.
Data acquired from recorded measurements, including remote sensing data.
Data acquired from physical samples and specimens, which form the basis of many studies.
Data generated from models and simulations.
Secondary & Processed data – Secondary data typically refers to data that is used by someone different from who collected or generated the data. Often, this may include data that has been processed from its raw state to be more readily usable by others.
Published data – Published data are the data shared to address a particular scientific study and/or for general use. While published data can overlap with primary and secondary data types, we have “published data” as its own category to emphasize that such datasets are ideally well-documented and easy to use.
Metadata – Metadata is a special type of data that describes other data or objects (e.g., samples). They are often used to provide a standard set of information about a dataset to enable easy use and interpretation of the data.
The term open data is defined in the open data handbook from the Open Knowledge Foundation:
“Open data are data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share alike.”
Open Data Handbook from the Open Knowledge Foundation
When talking about data in the context of this day, we focus on the data that you are preparing to share, such as data affiliated with a scientific publication, regardless of what type that is. While you could share (and many do) laboratory notebooks, preliminary analyses, intermediate data products, drafts of scientific papers, plans for future research, and similar items, these aren’t usually required by funding agencies or institutions and thus won’t be in focus for this day.
To quote from a published paper about data reuse, researchers are mostly looking for data that is “comprehensive, easy to obtain, easy to manipulate, and believable.” For these criteria to be fulfilled, the data should:
Be sufficiently described with appropriate metadata, which greatly affects open data reusability. There is no one-size-fits-all for metadata, as its collection is guided by your data.
Have the appropriate license, copyright, and citation information.
Include appropriate access information.
Be findable in an accredited or trustworthy resource.
Be accompanied by a history of changes and versioning.
Include details of all processing steps.
Not all data can be shared, or shared with all of this information, for a variety of reasons. However, the more information that is shared about a dataset, the more reliable and reusable it becomes.
Benefits of Open Data#
Data underpins almost all of science. Openly sharing data with others enables reproducibility, transparency, validation, reuse, and collaboration. The impacts of open data include facilitating:
Greater Good
Data plays a significant role in our day-to-day lives. Open data, in particular, has played a key role. If you pause and think about it, you may realize that open data are not only common in our society, but you might have benefited from and used open data yourself.
Each country or territory often provides open access to a variety of socioeconomic information about the population, community, and business in its jurisdiction. These data are often called census survey data, which may include the aggregated statistics of gender, race, ethnicity, education, income, and health data of a community. These data are often used to understand the composition of a local neighborhood and are critical to inform decisions on resource allocation to ensure the quality of life for the community.
Case Study: Open Data Helps Provide Life-Saving Information in the Face of Climate Change
The changing climate poses a significant risk to our daily lives and has been responsible for intensifying drought, increasing flooding, and devastating fire incidents worldwide. Open data are therefore critical in providing life-saving information to adapt to the changing climate and to assess the climate risks where we live. Government agencies have provided public access to long-term weather and climate information for decades (e.g., the National Oceanic and Atmospheric Administration in the U.S., the UK Met Office, and the European Centre for Medium-Range Weather Forecasts). More recently, organizations have begun developing value-added open-data products to advise society on the risks of a changing climate. One recent example is the flood and fire risk assessment for the United States developed by the non-profit First Street Foundation.
Policy Change
Case Study: Predicting Climate Change Effects in Arctic Communities
Open data can lead to policy change that directly impacts the lives of communities, such as Arctic communities that will be among the first to suffer from climate-driven change. A study in Nature employed OpenStreetMap data to help produce maps of projected environmental changes in the Arctic. These maps helped emphasize the need for adaptation-based policies at community and regional levels, so that communities are not caught unprepared by a suddenly and dramatically worsening situation fueled by climate change.
Global Emergency Response
Case Study: COVID-19
The COVID-19 pandemic demonstrated to the world, in real time, how the collective movement of researchers sharing their data (such as coronavirus genome data) can lead to an unprecedented number of discoveries in a relatively short amount of time. This directly supported rapid vaccine development efforts and the timely control of COVID-19 infections. These insights will continue to pay off, with this research spurring future developments.
Data sharing has many benefits and can aid access to knowledge. However, it is important to consider where the data has come from, who should have a say in its interpretation and use, and how the data can be shared responsibly.
Citizen Science
Case Study: Water Quality Testing in Beirut
A citizen scientist is a citizen or amateur scientist who collaborates with professional researchers to help gather or interpret data on a broader spatial and temporal scale than the researchers could achieve on their own. This collaboration lets members of the public engage in scientific pursuits that ultimately benefit them, and allows research to be conducted on a grander scale than might be possible with professional researchers alone. Citizen science is gaining popularity and recognition as a valuable contribution to scientific advancement.
For example, volunteer citizen scientists in Beirut were recruited from 50 villages to help test water quality (source: “Contextualizing Openness: Situating Open Science”). These volunteers were trained to be able to conduct the tests, and in turn, not only was the data collected to inform the scientific advancements, but the citizen scientists had the opportunity to learn to better manage their water resources and were able to improve conditions, creating a mutually beneficial interaction.
Open Data and Equitable Sharing of Knowledge
Free distribution of knowledge increases participation in science. Open data are central to fostering science that is inclusive and diverse, with direct and relevant benefits to impacted individuals and communities. This integration with communities is particularly important in the mission towards the equitable sharing of knowledge.
In a research ecosystem where knowledge is a commodity, with the main currency in the form of published papers and hoarded datasets, exclusion from research can limit scientific progress and negatively impact community outcomes. Those excluded from traditional science resources are often from low and lower-middle-income countries. Opening our data in an inclusive and easily reusable way is one step toward the purposeful inclusion of underrepresented groups in science.
Case Study: Recognition and Compensation for the Work of African Ebola Researchers
During the West African Ebola outbreak from 2014-2016, West African researchers actively worked to collect blood sample data to better understand the Ebola virus and to help put a stop to the rapid spread of the virus. However, most of the blood samples were sent overseas to the US and Europe, where researchers used those data samples to author papers about Ebola. According to the paper “Science under fire: Ebola researchers fight to test drugs and vaccines in a war zone”, “This frustrated researchers in the countries ravaged by the virus, who had hoped that studying aspects of the epidemic would strengthen their ability to respond to future infectious-disease outbreaks.”
By fostering a global research culture of transparency and validation, in which the work of under-represented groups is celebrated and compensated, we can create a sustainable model that ensures under-represented communities (such as women, Indigenous scholars, and non-Anglophone scholars) have a voice in how the global and nuanced narrative of science is developed.
Open data that are purposefully inclusive and open to scrutiny benefit scientific innovation by allowing for a more diverse and robust scientific process that draws on multiple perspectives. This openness also allows for the early identification of mistaken insights, as well as early intervention against unforeseen harm to impacted communities.
Open data allows non-traditional researchers to contribute to scientific development and bring their unique insights to the table. Alongside these benefits, open data requires careful consideration of the potential harms that result from failing to provide due credit or to consult with potentially vulnerable and/or marginalized communities. The next section, “Using Open Data”, discusses important considerations for the responsible management, collection, and use of open data by all stakeholders.
Benefits to You#
Open data also benefits your research and career. For starters, you are your own future collaborator!
Doing open science not only lets other people understand and reproduce your results, but lets you do so as well! Implementing open science principles like good documentation and version control helps you, potential collaborators, and everyone else to understand your results. In 2 hours, 2 weeks, or 2 years, you will still be able to understand what you did.
Specific benefits of opening data for you as an individual:
You will never lose access to your previous work, no matter which institution you are affiliated with. Many researchers move between institutions and organizations, and by having your data publicly accessible in repositories, you will always have access to them.
Your data can be cited and you will get credit.
Publications that include links to data are cited more, according to a 2020 study.
Implementing best practices for open science can strengthen your funding proposals. Funding agencies are realizing that openly sharing research provides more return on their investment. Well-documented research products also demonstrate the quality of your work, which helps with public communication and can also attract quality collaborators. Everybody prefers to work with people who are reliable and do a good job.
Challenges of Open Data#
While open data has many benefits, there can also be challenges to its creation and use. Throughout this day, we discuss many of these challenges and possible solutions. In this section, we discuss a few of the most common concerns along with actions to mitigate them.
Case Study: Are There Any Harms to Open Data?
Open data practices can, when implemented carelessly, further marginalize or exploit small-scale and community-driven initiatives, as in the case of African researchers who received neither due credit nor compensation for their genome sequencing during the COVID-19 pandemic. This is further explored in the next section, where we introduce ways of mitigating the harms that can result from thoughtless and irresponsible sharing of data.
Restrictions on Sharing Data#
Some data should only be shared very carefully or not at all. Reasons not to share can include:
Data includes a country’s military secrets or violations of national interests.
Data includes private medical information or an individual’s personally identifiable data.
Indigenous/cultural/conservation concerns.
Data includes intellectual property.
It is important to be familiar with the policies around the sharing of your data and policies from your funding agency, institution, or laws around data protection. These are further discussed in later days.
Applying FAIR Principles#

Image by Patrick Hochstenbach, CC0 1.0; image illustrates each FAIR principle
The vast majority of data today are shared online. Although FAIR was introduced in Day 2, Section 3, some additional details are provided below. FAIR principles help researchers make better use of their scientific data and engage a broader audience with it. FAIR data are more valuable for science because they are easier to use. Data can be FAIR regardless of whether they are openly shared. If data are openly shared, being FAIR aids reuse and expands the scientific impact of the data.
FAIR principles are not comprehensive implementation instructions for every type of data; rather, they offer general guidance to improve shareability and reusability. Sometimes it takes a group effort and/or a long production process to make data and results FAIR. The process starts in the planning stage of a research project. Depending on the size and type of project, a well-coordinated open science and data management plan is often needed for full compliance with FAIR.
Up-to-date information about FAIR Principles can be found at the GO FAIR Initiative website.
Metadata’s Central Role in Applying FAIR#
Metadata is important for search engines to find data and for people to be able to easily compare what is returned.
Metadata is essential to the implementation of FAIR Principles and enables the data to be used by machines in an automated fashion.
The richer and more self-describing metadata are, the better they will be handled by anyone who is interested in your data.
Licensing Data#
A license is a legal document that tells users how they can use a particular dataset. If you don’t license your dataset, others can’t/shouldn’t re-use it - even if you want them to! It is imperative to understand the licensing conditions of a dataset before data reuse. Without a good understanding of what a license allows, data users may run into copyright infringement or other intellectual property issues.
To ensure open reuse of your data, you can use an open license. An open license has language that describes the user’s ability to access, reuse, and redistribute the dataset. There are many types of data licenses that are open to varying degrees, and these will be discussed further in the section “Making Open Data”.
Key Takeaways#
In this section, you learned:
Open data is an essential pillar of open science. Openly sharing data with others enables reproducibility, transparency, validation, reuse, and collaborations.
Several challenges to creating open data exist, but most have straightforward mitigation measures.
FAIR principles can be applied to data to make them more open.
Open-data principles and tasks are used throughout the entire scientific workflow.
Section 2: Using Open Data#
In this section, you will learn how to discover, assess, and cite an open data set. You start by exploring repositories and learning about the issues and considerations for searching datasets. You then learn how to determine if the dataset is suitable for your use by learning what to review in documentation, licenses, and file formats. The section wraps up with a discussion about the importance of citing the datasets and how to read and follow citation instructions.
Open data isn’t always simple to use in your research. Sometimes there are multiple versions of the same dataset, so learning how to discover, assess, and then use open data will save you time.
As an example, look at the monthly average carbon dioxide data from Mauna Loa Observatory in Hawaii. This is a foundational dataset for climate change. Not only is it one of the first observational datasets that clearly showed anthropogenic impacts on the Earth’s atmosphere, it constitutes the longest record of direct measurements of carbon dioxide in the atmosphere. These observations were started by C. David Keeling of the Scripps Institution of Oceanography in March 1958 at a facility of the National Oceanic and Atmospheric Administration [Keeling, 1976].

If you want to make this figure yourself, or use the data for some other purpose, first you will want to find the data. If you search for this dataset, or any data, chances are that you will find a number of different sources. How do you decide which data to use?
If you start with Google and search for “Mauna Loa carbon dioxide data,” you will find a lot of results. Here are just some of them:

How do you decide which one to use? In this section, we will cover how to find, assess relevance, and use open data.
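Once you have downloaded a copy of the dataset, reading it takes only a few lines of code. The sketch below parses sample lines written in the style of NOAA's monthly mean CO2 file; the column layout and the values shown are assumptions for illustration, so consult the real file's header comments before reusing this.

```python
# Sample lines in the style of NOAA's monthly mean CO2 file for Mauna
# Loa. Real files begin with '#' comment lines describing the columns;
# the layout and values below are assumptions for this sketch.
sample = """\
# year,month,decimal date,average,deseasonalized,ndays,sdev,unc
1958,3,1958.2027,315.71,314.44,-1,-9.99,-0.99
1958,4,1958.2877,317.45,315.16,-1,-9.99,-0.99
1958,5,1958.3699,317.51,314.71,-1,-9.99,-0.99
"""

def read_monthly_co2(text):
    """Return (decimal_date, average_ppm) pairs, skipping comment lines."""
    records = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # ignore header comments and blank lines
        fields = line.split(",")
        records.append((float(fields[2]), float(fields[3])))
    return records

records = read_monthly_co2(sample)
print(records[0])  # the March 1958 monthly mean
```

With the parsed pairs in hand, producing the figure is a matter of plotting decimal date against the monthly average with your plotting library of choice.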
Discovering Open Data#
Open data can be discovered by accessing data repositories, search portals, and publications. A wide variety of these resources are available. A key step is identifying the appropriate search terms for your application. Learning community-specific nomenclature and standards can accelerate your search.
Where to Start Your Search#
There are multiple pathways to finding research data, and you should become practiced in all of them.

People You Know (Online or In-person!)#
What is the first and best way to find research data? Ask your community, including your research advisor, colleagues, team members, and people online. Knowing where to find reliable, good data is as much a skill and art as any lab technique. You learn this skill set by working with professionals in your field. There is no one source, no one method.
Publications#
Datasets are often attached to scholarly publications in the form of supplementary material. Publication search engines can enable the discovery of relevant publications that you can then use to find data from a particular publication.
Data Search Portals#
Data can also be found utilizing a wide variety of search portals, including:
Generic data search portals
Generic data search portals enable discovery of a wide variety of data. Not built for specific disciplines, they serve a broader audience. This type of search portal collects and makes data findable. They are not sources of scientific data. These are aggregation services that emphasize quantity, not necessarily quality. This is where citizen scientists often go to find data, and it’s a great way for non-professionals to get involved in science.
Examples include:
- Google Dataset Search
- DataCite Commons
Discipline-specific data search portals
Discipline-specific data search portals enable the discovery of specific types of data. They generally are tailored to meet their community’s needs.
Examples include:
- NASA Earthdata
- CERN
- NCBI National Center for Biotechnology Information
- EMBL's European Bioinformatics Institute
- ICPSR (Inter-university Consortium for Political and Social Research)
- NOAA Climate Data Online
- USGS EarthExplorer
- Open Science Data Cloud (OSDC)
National and international data search portals
National and international data search portals enable discovery of data produced by or funded by national and international organizations.
Examples include:
- Data.gov (United States)
- data.europa.eu (European Union)
Repositories#
A common way to share and find open data is through data repositories. Many repositories host open data with persistent identifiers, clear licenses and citation guidelines, and standard metadata.
Note that some of the search portals listed above are also repositories, but not all of them are; some are simply catalogs of information about the data rather than storage locations for the data themselves.
General repositories
General repositories are not designed for a specific community and are accessible to everyone.
Examples include:
- Zenodo
- Figshare
- Dryad
- Open Science Framework (OSF)
See the Generalist Repository Comparison Chart – a tool for additional repositories and guidance. Dataverse has also published a comparative review of eight data repositories.
Domain-specific repositories
Specialized repositories (typically for specific data subject matter) provide support and information on required standards for metadata and more.
Some examples are:
- OpenNeuro: Open access platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data
- FCP/INDI: Functional Connectomes Project / International Neuroimaging Data-sharing Initiative
- ABCD Study: Adolescent Brain Cognitive Development Study Data Repository
Institutional repositories
Many universities and organizations support research data and software management with repositories, known as institutional repositories, to aid their researchers with compliance requirements.
National repositories
National repositories aggregate data and make it available to the public. Data stored in these repositories are often produced by the government.
Challenges with Data Repositories#
Any single repository, search engine or publication search will not have access to all available open data.
Search terms may not be consistent across sources or fields of science.
It is essential to become familiar with the standard nomenclatures and appropriate metadata terms for your application.
There is no sure-fire recipe. You may have to try numerous terms and data sources before finding relevant data.
Assessing Open Data#

Using open data for your project is contingent on a number of factors including quality of data, access and reuse conditions, data findability, and more. A few essential elements that enable you to assess the relevance and usability of datasets include (adapted from the GODAN Action Open Data course):
Practical Questions
Is the data well described?
Is the reason the data is collected clear? Is the publisher’s use for the data clear?
Are any other existing uses of the data outlined?
Is the data accessible?
Is the data timestamped or up to date?
Will the data be available for at least a year?
Will the data be updated regularly?
Is there a quality control process?
Technical Questions
Is the data available in a format appropriate for the content?
Is the data available from a consistent location?
Is the data well-structured and machine-readable?
Are complex terms and acronyms in the data defined?
Does the data use a schema or data standard?
Is there an API available for accessing the data?
What tools or software are needed to use this data?
Social Questions
Is there an existing community of users of the data?
Is the data already relied upon by large numbers of people?
Is the data officially supported?
Are service level agreements available for the data?
Is it clear who maintains the data and who can be contacted about it?
Many of these questions can be answered by viewing a dataset’s documentation and metadata, as well as its format and license, all of which are discussed further in the next section, “Making Open Data”.
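When you are weighing several candidate datasets, it can help to encode the questions above as a simple checklist in code. The structure and scoring below are a hypothetical illustration for this course, not a community standard.

```python
# Hypothetical encoding of the dataset-assessment checklist above.
# The questions are drawn from the text; the scoring is illustrative.
CHECKLIST = {
    "practical": [
        "Is the data well described?",
        "Is the data accessible?",
        "Is the data timestamped or up to date?",
    ],
    "technical": [
        "Is the data well-structured and machine-readable?",
        "Does the data use a schema or data standard?",
    ],
    "social": [
        "Is it clear who maintains the data?",
    ],
}

def assess(answers):
    """Given a {question: bool} mapping, return the fraction answered 'yes'."""
    questions = [q for group in CHECKLIST.values() for q in group]
    yes = sum(1 for q in questions if answers.get(q, False))
    return yes / len(questions)

# Example: a dataset that is well described, accessible, and machine-readable.
score = assess({
    "Is the data well described?": True,
    "Is the data accessible?": True,
    "Is the data well-structured and machine-readable?": True,
})
print(f"{score:.0%}")  # 3 of 6 checklist questions answered yes
```

Recording answers this way makes it easy to compare candidate datasets side by side and to document why you chose one over another.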
Activity 1: Discover Data#
Estimated time for activity: 10 minutes. It is an individual activity.
In this activity, visit the websites and repositories linked above where open data are located. Find a specific dataset that suits your current research or studies, or one that you simply find interesting; it might be the start of your next scientific journey! Please save the links to the datasets, as you will need to revisit them in the next activity.
Using Open Data#
Review Citing Guidelines#
Many datasets and repositories explain how they’d prefer to be cited. The citation information often includes:
Authors and their institutions
Title
ORCID
DOI
Version
URL
Creation date
Additional fields may also be specified

This is an example of a simple CITATION.cff file. Source: GitHub
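A CITATION.cff file is written in plain YAML. A minimal sketch might look like the following; every name, identifier, and date here is a placeholder:

```yaml
# Minimal CITATION.cff sketch; all values below are placeholders.
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
title: "Example Dataset"
authors:
  - family-names: "Doe"
    given-names: "Jane"
    orcid: "https://orcid.org/0000-0000-0000-0000"
version: "1.0.0"
doi: "10.0000/example.dataset"
date-released: "2023-01-15"
url: "https://example.org/dataset"
```

Placing a file like this at the top level of a repository lets platforms such as GitHub surface the citation information automatically.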
Most datasets require (at a minimum) that you list the data’s producers, name of the archive hosting the data, dataset name, dataset date, and DOI when citing data.
Case Study: Citing Open Data#
Example from Global Carbon Budget 2019
Global Carbon Project. (2019). Supplemental data of Global Carbon Budget 2019 (Version 1.0) [Data set]. Global Carbon Project. https://doi.org/10.18160/gcp-2019
Example from CRCNS - Collaborative Research in Computational Neuroscience
Ranulfo Romo, Carlos D. Brody, Adrián Hernández, and Luis Lemus. (2016). Single-neuron spike train recordings from macaque prefrontal cortex during a somatosensory working memory task. CRCNS.org. http://dx.doi.org/10.6080/K0V40S4D
Key Takeaways#
The following are the key takeaways from this section:
Relevant data may be found in a variety of locations and may require some trial and error to find.
Carefully assess data before using it for your project.
Data citation is important when using data.
Section 3: Making Open Data#
In this section, you learn the criteria and tasks needed to ensure that the datasets you make are open and reusable. The section starts with topics on selecting open data formats and how to include metadata, README files, and version control for your data. It wraps up with a discussion on open licenses for data.
Selecting Data Formats and Tools for Interoperability#
Data Format Considerations#
Preferred data formats are community supported, machine-readable, non-proprietary, modifiable, and open. It might seem like there are as many data formats as there are different types of data. When you think about selecting a data format, consider the following:
Is the format compatible with your data type, shape, and size?
Does the data format have adequate metadata support?
Are tools readily available for reading the data format, or are specialized tools required?
Is the data format routinely used in your field? Community standards ensure compatibility, interoperability, and ease of use when exchanging or sharing data among researchers or organizations of the same community.

Investigate if your funding agency, institutions, and/or data repository has additional requirements for or guidance on data formats.
Non-Open Data Formats#
A non-open (unsupported and closed/proprietary) data format refers to a file format that is not freely accessible, standardized, or widely supported by different software applications. Here are some examples of closed/proprietary data formats:
Adobe Photoshop (.psd): The default proprietary file format for Adobe Photoshop, a popular image editing software.
AutoCAD Drawing (.dwg): A proprietary data format used for computer-aided design (CAD).
Microsoft Word (.doc/.docx): A proprietary file format used to store word processing documents.
Software applications that can read but not create DOC, PSD, or DWG formatted data usually do not fully support all the features, layers, specifications, and inner workings of the original file.
Some challenges of using data in non-open formats include:
Trouble opening the file due to compatibility issues.
The need to install additional software or converters, leading to frustration and inconvenience.
Initial setbacks can dampen enthusiasm for using your data.
Converting the data to a universal format can cause unique formatting or features to translate poorly, making the data lose part of its value.
New open-data policies can limit the sharing of proprietary data, as such formats are often incompatible with easy distribution.
Open Data Format Examples#
Some examples of open data formats include:
| Format | Benefits |
| --- | --- |
| Comma Separated Values (CSV) | Simplicity, readability, compatibility, easy data exchange. |
| Hierarchical Data Format (HDF) | Efficient storage and retrieval of data, compression, multi-dimensional support. |
| Network Common Data Form (NetCDF) | Self-describing and portable; efficient data subsetting (extracting specific portions of large datasets); standardization and interoperability. |
| Investigation-Study-Assay (ISA) model for life science studies | Structured data organization; data integration and interoperability among experiments; reproducibility and transparency. |
| Flexible Image Transport System (FITS) | A standard for astronomical data; flexible and extensible metadata and image headers; efficient data compression and archiving of large datasets. |
| Common Data Format (CDF) | Self-describing format readable across operating systems, programming languages, and software environments; multidimensional data and metadata inclusion. |
By embracing open standards, authors can avoid unnecessary barriers and maximize their chances of making data useful to their communities.
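As a quick illustration of why open formats lower barriers, the sketch below writes and re-reads a tiny table using only Python's standard csv module; the values are illustrative, not real observations.

```python
import csv
import io

# CSV is an open, tool-agnostic format: anything written here can be
# read back by any language or spreadsheet program. Values are
# illustrative, not real observations.
rows = [
    {"site": "Mauna Loa", "year": 1958, "co2_ppm": 315.71},
    {"site": "Mauna Loa", "year": 2000, "co2_ppm": 369.71},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["site", "year", "co2_ppm"])
writer.writeheader()
writer.writerows(rows)

# Reading the data back requires nothing beyond the CSV format itself.
parsed = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(parsed[0]["co2_ppm"])  # note: CSV stores values as text, so "315.71"
```

Note one trade-off visible here: CSV stores everything as text, so downstream users must convert numeric columns themselves, which is one reason richer self-describing formats like NetCDF exist.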
Making the Data Reusable Through Documentation#
Adding Documentation and Metadata for Reusability#
Metadata and data documentation describe data so that we and others can use and better understand data. While metadata and documentation are related, there is an important distinction. Metadata are structured, standardized, and machine-readable. Documentation is unstructured and can be any format (often a text file that accompanies the data).
To better understand documentation and metadata, let’s take an example of an online recipe. Many online recipes start with a long description and history of the recipe, and perhaps cooking or baking tips for the dish, before listing ingredients and step-by-step cooking instructions.
The ingredients and instructions are like metadata. They can be indexed and searched via Google and other search engines.
The descriptive text that includes background and context for the recipe is like documentation. It is more free-form and not standardized.
We already discussed metadata earlier in this module, but it’s important enough that we will repeat ourselves a little bit! We will also discuss other types of documentation, like README files.
Metadata: for Humans and Machines#
Metadata can facilitate the assessment of dataset quality and data sharing by answering key questions. It is also the primary way users will find information about your dataset. It includes key information on topics, such as:
How data were collected and processed
What variables/parameters are included in the dataset
How the variables relate to one another
Who collected the data (science team, organization, etc.)
How and where to find the data (e.g., DOI)
How to cite the data
Which spatio-temporal region/time the data covers
Any legal, guideline, or standard information about the data
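The items above can be captured in a machine-readable form. Here is a minimal sketch in Python; the field names and values are illustrative placeholders, not any specific metadata standard:

```python
import json

# Hypothetical metadata record; the field names and values are illustrative
# placeholders and do not follow any particular metadata standard.
metadata = {
    "title": "Surface temperature observations, Station A",
    "creators": [{"name": "Jane Doe"}],
    "variables": [{"name": "air_temperature", "units": "K"}],
    "temporal_coverage": {"start": "2020-01-01", "end": "2020-12-31"},
    "license": "CC0-1.0",
    "citation": "Doe, J. (2021). Surface temperature observations, Station A.",
}

# Serializing as JSON keeps the record both human- and machine-readable.
text = json.dumps(metadata, indent=2)
assert json.loads(text) == metadata  # round-trips without loss
```

In practice you would use the fields required by your domain's metadata standard or your chosen repository, rather than inventing your own.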
Why Add Metadata?#
Metadata enhances the searchability and findability of data by allowing both humans and machines to read and interpret datasets. Benefits of creating metadata for your data include:
Helps users understand what the data are and if/how they can use/cite it.
Helps users find the data, particularly when metadata is machine-readable and standardized.
Can make analysis easier with software tools that interpret standardized metadata (e.g. Xarray).
To be machine-readable, the metadata needs to be standardized. See an example of a community-accepted standard for labeling climate datasets with the CF Conventions.
There are also software packages that can read metadata and enhance the user experience significantly as a result. For instance, Xarray is an open-source, community developed software package that is widely used in the climate and biomedical fields, among many others. According to their website, “Xarray makes working with labeled multi-dimensional arrays in Python simple, efficient, and fun!”. It’s the “labeled” part where standardized metadata comes in! Xarray can interpret variable and dimension names without user input, making the workflow easier and less prone to making mistakes (e.g. users don’t have to remember which axis is “time” - they just need to call the axis with the label “time”).
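The benefit of labeled dimensions can be sketched without xarray itself. The toy helper below (purely illustrative, not xarray's API) selects data along a dimension by name instead of by position, so the user never has to remember which axis is which:

```python
# Toy illustration (not xarray itself) of label-based access: instead of
# remembering that axis 0 is "time", we look the axis up by name.
data = [
    [14.1, 14.3],   # time step 0: readings at stations 0 and 1
    [14.0, 14.5],   # time step 1
    [13.9, 14.2],   # time step 2
]
dims = ("time", "station")

def take(values, dim_names, dim, index):
    """Select one index along the named dimension of a 2-D list."""
    axis = dim_names.index(dim)
    if axis == 0:
        return values[index]
    return [row[index] for row in values]

first_timestep = take(data, dims, "time", 0)    # [14.1, 14.3]
station_zero = take(data, dims, "station", 0)   # [14.1, 14.0, 13.9]
```

Standardized metadata is what lets real tools like xarray attach names such as "time" automatically, without the user supplying them by hand.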
Many standards exist for metadata fields and structure to describe general data information. Use a standard from your domain when applicable, or one that is requested by your data repository.
Metadata Tagging Best Practices#
Useful and informative metadata:
Uses standards that are commonly used in your field.
Complies with FAIR Principles.
Is as descriptive as possible.
Is self-describing.
Remember, the more metadata you add, the easier it will be for users of your data to use it effectively. When in doubt:
Seek and comply with repository/community standards.
Investigate open science online resources for metadata, e.g., Turing Way.
Accompanying Documentation#
When creating your data, in addition to adding metadata, it is a best practice to create a document that users can refer to. This document can take the form of a README file, a user guide, or even a quick start guide (or all three).
README and other documentation files can include information such as:
Contact information
Information about variables
Information about uncertainty
Data collection methods
Versioning and license references
Information about the structure and file naming of the data
References to publications that describe the dataset and/or its processing
The intent is to help users quickly understand how they might use the data and to answer any commonly asked questions about your data. You can read more information and view a README template along with an example (particularly relevant for the medical sciences) at this Harvard Medical School website.
Data Versioning Guidelines#
Establish a versioning schema for your data: a method for keeping track of iterations of data that supports tracking changes and reverting to a previous revision.
Proper versioning generates a changed copy of a data object that is uniquely labeled with a version number. This enables users to track changes and correct errors.
Proper versioning preserves data quality and provenance (the origin, history, and processing steps that lead to the dataset) by:
Providing a record of traceability from the data’s source through all aspects of its transmission, storage, and processing to its final form.
Saving data files at key steps along the way.
Enabling downstream verification/validation of original findings.
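A versioning schema can be as simple as embedding a version number in the file name. As a rough sketch, the `name_vMAJOR.MINOR.ext` pattern below is just one illustrative convention, not a standard:

```python
import re

def next_version(filename):
    """Bump the minor version in a name like 'dataset_v1.2.csv'.

    The name_vMAJOR.MINOR.ext scheme is one illustrative convention;
    adapt it to whatever your repository or team uses.
    """
    m = re.match(r"(.+_v)(\d+)\.(\d+)(\.\w+)$", filename)
    if m is None:
        raise ValueError("filename does not follow the name_vMAJOR.MINOR.ext scheme")
    stem, major, minor, ext = m.groups()
    return f"{stem}{major}.{int(minor) + 1}{ext}"

print(next_version("dataset_v1.2.csv"))  # dataset_v1.3.csv
```

Each saved version is then a uniquely labeled copy, which is what allows users to trace changes and revert when an error is found.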
Making the Data Reusable Through Licensing#

Image source: xkcd.com
Data is the intellectual property of the researcher(s), or possibly of their funder(s) or supporting institution(s). That does not mean it cannot be used by other researchers (with appropriate attribution).
“By applying a license to your work, you make clear what others can do with the things you’re sharing, and also the conditions under which you’re providing them (like cite you). You can also require others who copy your work to do things in return.”
If you don’t license your work, others can’t/shouldn’t re-use it - even if you want them to. As mentioned previously in this module, a license is a legal document that tells users how they can use the dataset. It is important to understand the licensing conditions of a dataset before data reuse to avoid any copyright infringement or other intellectual property issues.
A dataset without a license does not necessarily mean that the data is open, and using a license-less dataset may pose an ethical dilemma. The best path forward is to contact the data creator, get explicit permission, and suggest they apply a license.
Understanding when and where the license applies is crucial. For example, data created using US Government public research funds is, by default, in the public domain. However, that only applies to the jurisdiction of the United States. In order for this to apply internationally, data creators need to select an open license.

There are several different types of licenses that build on each other. Creative Commons (CC) licenses are often used for datasets. CC0 (also known as "public domain") allows for the most reuse because it places the fewest restrictions on what users can do. Although the CC0 license does not explicitly require citation, you should still follow community best practices and cite the data source. CC-BY is another common license for scientific data; it requires citation. From there, you can add restrictions around commercial use, the ability to adapt or modify the data, or requirements to share under the same license. Each added restriction reduces usability, and other scientists may be unable to use the data because of institutional or legal constraints. Funding agencies may require the use of a specific license. For public agencies, this is often CC0 or CC-BY, to maximize their return on investment and ensure the widest possible reuse.
Case Study: Data Licenses and Reuse#
Here is an example of how a data license can affect reuse. The Coupled Model Intercomparison Project Phase 6 (CMIP6) consists of the "runs" from around 100 distinct climate models produced across 49 different modeling groups. This is the data used to understand what our future climate might look like. You have probably seen images that use this data in articles about Earth's changing climate and how it may impact our lives. Previous versions of these data were licensed CC-BY-NC-SA (attribution-noncommercial-sharealike).

Figure citation: IPCC, "Framing and Context." In: Global Warming of 1.5°C. An IPCC Special Report, 2020.
This meant that any commercial use was restricted. Insurance companies, global corporations, and other organizations that wanted to use the data commercially had to do their own modeling, or simply decided not to develop resources related to climate projections (such as fire risk, flooding risk, and how they may affect transportation, commerce, and where we live). This directly impacted the reuse of this data and created additional work. The latest version of CMIP data is moving to CC-BY because of the negative impacts of the -NC-SA restrictions.
Activity 2: Following License#
Estimated time for activity: 5 minutes. It is an individual activity.
Take a look at the datasets you saved in the previous activity and the license for this data. What can you do with the data? Are you allowed to use it freely in your research? Could you found a startup based on a model built from this data?
Key Takeaways#
Following are the key takeaways from this section:
It is best practice to create an open data management plan that includes open-related topics.
A critical step to making open data is evaluating and selecting open data formats.
Always add documentation that enables other researchers to assess the relevance and reusability of your product. This includes metadata, README files, and version control details.
It is important to assign an open license to your data to enable reuse.
Section 4: Sharing Open Data#
In this section, you will learn about the practice of sharing your data. The discussion starts with a review of the sharing process and how to evaluate if your data are sharable. Next, you take a look at ensuring your data is accessible with a closer look at repositories and the lifecycle of data accessibility, from selecting a repository to maintaining and archiving your data. The section then discusses some steps to make the data as reusable as possible and concludes with a section about considering who will help with the data-sharing process.
Data Sharing Process Overview#
Sharing data is a critical part of increasing the reproducibility of results. Whether it’s new data we collect ourselves or data that we process in order to do our analysis, we end up sharing some form of data. We need to think about what data we will share and how to best ensure that it will be open and usable by others.
Data sharing should typically be done through a long-term data center or repository, which will be responsible for ingesting, curating, and distributing/publishing your open data. You are responsible for providing information/metadata to help make your data readily discoverable, accessible, and citable. The cost of archiving and publishing data should also be considered.
In general, sharing your open data requires the following steps:
Make sure your data can be shared
Select or identify a repository to host your data
Work with your repository to follow their process and meet their requirements
Make sure your data is findable and accessible through the repository and is maintained and archived
Request a DOI for your data set so that it is easily citable
Choose a data license
Sometimes, you may be able to work with a well-staffed repository that will handle many of these steps for you (for instance, if you are working with NASA mission data). Otherwise, it is your responsibility to follow the above steps to share your data openly.
How to Enable Reuse of Data#
Obtaining a DOI#
Individuals cannot typically request a DOI (digital object identifier) themselves but rather have to go through an authorized organization that can submit the request, such as:
The data repository
Your organization
The publisher (if the data set is part of a publication)
Data makers should provide summary information for the DOI landing page(s) if required. Data sharers should accommodate data providers' suggestions, comply with DOI guidelines, and create the landing page(s). If possible, reserve a DOI ahead of creating your data.
Ensuring Findability#
Repositories handle the sharing, distribution, and curation of data. Additional services they may provide include:
The assignment of a persistent identifier (like a DOI) to your data set.
The indexing and/or registration of your data and metadata in various services so that they can be searched and found online (e.g., through search engines).
The provision of feedback to data makers to help them optimize their metadata for findability.
Coordinating with data makers to ensure metadata refers to the DOI.
Ensuring the DOI is associated with a landing page with information about your data.
Making it Easy to Cite Your Data#
The goal is to make it easy to cite your data. Best practices include:
Include a citation statement that includes your DOI.
Different repositories and journals have different standards for how to cite data. If your repository encourages it, include a CITATION.cff (Citation File Format) file with your data that explains how to cite it.
Clearly identify the data creators and/or their institution in your citation.
This allows users to follow up with the creators if they have questions or discover issues.
Include ORCID of data authors where possible in the citation.
Now that your data are at a repository and have a citation statement and DOI, publicize it to your users and remind them to cite your data in their work!
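A citation file can be a small, structured text file. Below is a sketch of a minimal CITATION.cff in the Citation File Format; the author name, DOI, and version values are placeholders, not real identifiers:

```python
# Sketch of a minimal CITATION.cff (Citation File Format) for a dataset.
# The author, DOI, and version values are placeholders, not real identifiers.
cff_text = """\
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
type: dataset
title: "Example surface temperature dataset"
authors:
  - family-names: Doe
    given-names: Jane
doi: 10.0000/placeholder
version: 1.0.0
"""

# In practice, save this next to the data as CITATION.cff so that
# repositories and tools can pick it up automatically.
```

Because the format is standardized, repositories and tools (such as GitHub's citation widget) can generate formatted citations from it automatically.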
Who is Responsible for Sharing Data#
Sharing data openly is a team effort. An important part of planning for open data is planning and agreeing to roles and responsibilities of who will ensure implementation of the plan.
So what needs to be done? Documenting these roles and responsibilities in your Data Management Plan will help your team stay organized and do science faster! A well-written, detailed plan should include:
Who Will Move Data to a Repository
Once you are ready to send your data to your repository, find the repository’s recommendations for uploading data. Determine who will work with your repository to accomplish the following types of activities:
Provide information on data volume, number of files, and nature (e.g., revised files).
Check that the file name follows best practices.
Determine how the data will be moved (especially when files are large).
Check the data! Verify the integrity of the data, metadata, and documentation transfer.
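A common way to verify transfer integrity is to compare checksums before and after the move. Here is a minimal sketch using Python's standard library; the file name and contents are illustrative:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 16):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo: write a small file, record its checksum before "transfer",
# and verify the checksum again afterwards.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dataset.csv")
    with open(path, "wb") as f:
        f.write(b"station,temperature_K\nA,287.1\n")
    before = sha256_of(path)
    after = sha256_of(path)   # in practice, recomputed at the repository side
    assert before == after    # a mismatch would indicate a corrupted transfer
```

Many repositories publish checksums alongside hosted files for exactly this purpose; recomputing them after upload or download confirms nothing was corrupted in transit.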
Who Will Develop the Data Documentation and Metadata
Determine who will work with your repository and inventory the transferred data, metadata, and documentation. This role might include the task of populating any required metadata in databases to make the data findable.
You may be able to accomplish some of these tasks through a repository’s interface. However, some types of repositories may require you to interact with their administration teams. For this role, determine who will:
Provide suggestions to organize data content and logistics.
Develop the metadata.
Develop the documentation (e.g., README file or report).
Extract metadata from data files, metadata files (if applicable), and documentation to populate the metadata database and request additional metadata as necessary.
Who Will Help With Data Reuse
Once the repository has made your data available, someone from your team must test access to the data (its accessibility) and distribution methods (its findability). If possible, identify who will work with your repository to optimize/modify tools for intuitive human access and standardize machine access. This role requires someone who can:
Clearly communicate the open protocols needed for the data/metadata.
Provide actual data use cases to data publishers to optimize/modify data distribution tools based on available metadata.
Understand the access protocol(s) and evaluate their implications for accessibility, both for targeted communities and for the user community at large.
Who Will Develop Guidance on Privacy and Cultural Sensitivity of Data
Sharing data should be respectful of the communities that may be involved. This means thinking about privacy issues and cultural sensitivities. Who on your team will identify and develop guidance on:
Privacy concerns and approval processes for release - is the data appropriately anonymized?
How to engage with communities that data may be about.
How data can be correctly interpreted.
Any data restrictions that may be necessary to ensure the sharing is respectful of the community the data involves, e.g., collective and individual rights to free, prior, and informed consent in the collection and use of such data, including the development of data policies and protocols for collection.
Key Takeaways#
The following are the key takeaways from this section:
When and if to share data? Determine at what point in a project it makes the most sense to share our data. Remember, not all data can or should be shared.
Where to share data? Sharing in a public data repository is recommended, and there are many types of repositories to choose from.
How to enable reuse? Ensure appropriate, community-accepted metadata, assign a DOI, and develop a citation statement to make sure it can be easily found and cited.
Who helps share data? There are many steps in making and sharing data, and it’s important to think about who will be responsible for each step.
Section 5: From Theory to Practice#
In this section, you will get some practice writing a data management plan. You will then learn how you can get involved in open data communities. You will also learn about resources you can start to use and training you can take to start your journey with open data.
Writing an Open Science and Data Management Plan#
The process, responsibilities, and factors to consider when creating an open science and data management plan have been presented throughout this module. Let us review the common elements of DMPs relevant to open data:

| Question | DMP element |
| --- | --- |
| What? | Data formats and (where relevant) standards |
| When? | When and if to share data |
| Where? | The intended repositories for archived data |
| How? | How the plan enables reuse of the data |
| Who? | Roles and responsibilities of the team members in implementing the DMP |
Two great places to start are DMPTool and DMPonline. You will need to create a free login to use these tools, but both websites walk researchers through the steps of writing a DMP. There are even some existing DMP templates stored within DMPTool, such as the NASA Planetary Sciences Division's DMP template.
There are also public examples of data management plans at DMPTool public plans and DMPonline public plans.
If you are applying for funding, it is almost guaranteed that there will be specific requirements detailed in the funding opportunity. For example, the funder may require a certain license or use of a specific repository. Make sure to cross-reference your plan with these requirements!
Activity 3: Review a data management plan#
Estimated time for activity: 15 minutes. It is a group activity.
Take a look at the examples of a public data management plan from University of North Carolina at Chapel Hill and from University of California San Diego.
If the direct link for the University of California San Diego doesn't work for you, please visit the following link, scroll to the end of the page, and open the DMP Example Psych.doc document.
Answer the following questions regarding these DMPs in a group:
What: Data formats and (where relevant) standards.
When: When and if to share data.
Where: The intended repositories for archived data.
How: How the plan enables reuse of the data.
Who: Roles and responsibilities of the team members in implementing the DMP.
Which of the DMPs do you like more? Why?
Open Data Communities and You#
Getting Involved with Open Data Communities#
There are numerous ways to get involved with and support open data communities, including starting your own community.
Repositories
- Contribute to open data repositories
- Many repositories have user committees to provide them with advice and feedback (and are often looking for volunteers to serve)
- Subscribe to repository mailing lists and social media accounts
Standards committees
- Volunteer to serve on a standards committee
- Provide input to the standards committee
- Subscribe to the mailing lists focused on standards
Conferences, workshops, and special sessions
- Organize a gathering around open data
- Participate in a gathering around open data
Additional Resources#
Resources for More Information#
In addition to the resources listed elsewhere in this training, the community resources below are excellent sources of information about Open Data.
References and Guides:
Opportunities for More Training About Open Data#
In addition to the resources listed elsewhere in this training, the community resources listed below provide excellent information on Open Data.
Additional training:
Key Takeaways#
Now that you have completed the section, you should be able to start your journey with open data:
You now know the steps and have practice writing a sample data management plan.
There are a variety of ways to get involved in the open data community.
There are numerous resources available to get more information and take more training about open data.
Summary#
After completing this day, you should be able to:
Describe the meaning and purpose of open data, its benefits, and how FAIR principles are used.
Recall methods to assess the reusability of data based on its documentation and cite the data as instructed.
Implement an open data management plan, select open data formats, and add the needed documentation, including metadata, README files, and version control, to make the data reusable and findable.
Evaluate whether your data should and can be shared.
Recall practices to make data more accessible, including registering an affiliated DOI and including citation instructions in the documentation.