Day 2: Open Tools and Resources#

By Neuromatch Academy & NASA

Content creators: NASA, Leanna Kalinowski, Hlib Solodzhuk, Ohad Zivan

Content reviewers: Leanna Kalinowski, Hlib Solodzhuk, Ohad Zivan, Shubhrojit Misra, Viviana Greco, Courtney Dean

Production editors: Hlib Solodzhuk, Konstantine Tsafatinos, Ella Batty, Spiros Chavlis


Tutorial Objectives#

Estimated timing of tutorial: 2 hours

This day is designed to help you get started on your journey to practicing open science. It offers an introductory view of the concepts and resources that are fundamental to open science. The bridge between the concepts and the practice of the concepts is something called the use, make, share framework. There are many methods and models that define how to get started with open science. The use, make, share framework was constructed to help you immediately assign purpose to the concepts and tools that are covered in this module as well as in the entire course curriculum. All of the information that you learn here will be addressed in more detail as you participate in other days but can also be applied immediately after completing this tutorial.


Section 1: Introduction to the Process of Open Science#

In this section, you will review the definition of several common terms in the context of open science, including research products, data, software, and results. In addition, you will read examples that demonstrate how these open-science tools are used in practice. The section wraps up with an example of how one group openly shared their data, results, software, and paper.

Definition of Research Products#

Scientific knowledge, or research products, take the form of data, software, and results.

Within these research products are additional types of products, such as methodologies, algorithms, and physical artifacts.

What is Data?#

In general, data are pieces of information about a subject, including theoretical truths, raw measurements, or highly processed values.

There can even be data about data, called metadata. In our lessons, when we talk about data, we are referring to scientifically or technically relevant information that can be stored digitally and accessed electronically, such as:

  • Information produced by missions and experiments, including calibrations, coefficients, and documentation.

  • Information needed to validate scientific conclusions of peer-reviewed publications.

Open data can have many characteristics, including rich and robust metadata, and be made available in a range of formats. These characteristics are detailed later in this day and even further in the day on Open Data.

What is Code?#

Many scientists write source code to produce software that analyzes data or models observations. Code is a set of instructions, written in a programming language, that humans can read and write. Software is a collection of programs, data, and other information that a computer system uses to perform specific tasks. Scientists write and use many different types of software as part of their research.

General Purpose Software – Software produced for widespread use, not specialized scientific purposes. This encompasses both commercial software and open-source software.

Operational and Infrastructure Software – Software used by data centers and large information technology facilities to provide data services.

Libraries – Generic tools that implement well-known algorithms, provide statistical analysis or visualization, etc., which are incorporated into other software categories.

Modeling and Simulation Software – Software that either implements solutions to mathematical equations given input data and boundary conditions or infers models from data.

Analysis Software – Software developed to manipulate measurements or model results to visualize or gain understanding.

Single-use Software – Software written for use in unique instances, such as making a plot for a paper or manipulating data in a specific way.

Some of the tools that you can use to develop software are introduced in Day 4. Understanding how to find and use others’ code, create your own, and share it is an important part of advancing science and is covered in the day on Open Code.

What are Results?#

Results capture the different research outputs of the scientific process. Publications are the most common type of result, but results include a number of other products as well. Both data and software can be considered types of results, but when we discuss results here, we will focus on the other types. Results can include the following:

  • Peer-reviewed publications

  • Computational notebooks

  • Blog posts

  • Videos and podcasts

  • Social media posts

  • Conference abstracts and presentations

  • Forum discussions

You may already be familiar with the research life cycle, but still unfamiliar with the types of results that can be shared openly throughout this process. When sharing results, we strive to be as open as possible, with the goal of increasing reproducibility, accessibility, and inclusion of our science. Throughout the research lifecycle, there are multiple opportunities to openly share different results that can lead to new collaborations and lines of inquiry. Additional details on the scope of open results are shared in Day 5 – Open Results.

Using Tools for Open Science in Practice#

The following subsection explores different tools and resources available to researchers for using, making, and sharing open science. As mentioned, it is important to think about how to integrate open science principles across all stages of the research process. Here is an overview of one way the various pieces might work together.

The Components of Open Science#

The four principal components of open science can be organized in a pyramid of openly-shared research products.

The research paper, closely tied to the results, sits at the top of the pyramid and summarizes how you’ve combined your software and your data to produce your results.

The practice of sharing these components can occur at varying degrees of completeness. For the following guidance on how to share components of open science, we simplify the range of completeness to “good”, “better”, and “best.” This range reflects one’s commitment to sharing open science at all steps in the research process and to all of its products.

Sharing Open Data#

Data can be easily shared through many different services - the best way for scientific data to be shared is often through a long-term data repository that will both preserve your data and make it discoverable. The image provides some of the considerations when sharing the data through Zenodo, a generalist data repository. These considerations would be similar for other data repositories. See Day 3 - Open Data for more details on sharing open data.

Practices for Open Data.

Sharing Open Code#

Open code is often shared through an online version-control platform that allows others to contribute to the software and provides a history of its changes. For example, many researchers choose to post code files on GitHub with a BSD 3-Clause license, which permits others to contribute to and reuse the software. Steps to preserve code and make it discoverable are discussed in Day 4 - Open Code.

Practices for Open Code.

Sharing an Open Paper#

Researchers can choose to publish in a journal with an open-access license. Researchers can search for open-access journals through the Directory of Open Access Journals (DOAJ). (See Day 5 - Open Results).

Sharing Open Results#

When sharing results, include the methodology used to produce them (i.e., the “provenance”) directly with your software. Software tends to evolve over time, while the outputs of a given analysis should remain consistent. Sharing your methodology therefore helps others reproduce older results with newer software, even as the exact steps change while the software evolves.

An Open Science Project Example#

Here is an example of how one group openly shared their data, results, software, and paper; all with their own unique identifiers. Note that data and software can each have multiple identifiers, enabling others to cite all versions or one unique version.

Data, Results (Paper), and Software of a Particular Study.

Data, Results, Software.

Key Takeaways#

In this section, you learned:

  • Scientific knowledge, or research products, take the form of: data, software, and results.

  • In general, data are pieces of information about a subject, including theoretical truths, raw measurements, or highly processed values.


Section 2: General Tools for Open Science#

This section introduces you to commonly used tools in open science. It starts by providing a brief introduction to open science tools and describes persistent identifiers, one of the most common open science tools in use, which ensure reproducibility, accessibility, and recognition of scientific products. This is followed by descriptions of other common open science tools that are applicable regardless of your field of study. The section wraps up with a description of open science and data management plans, a key component of sharing your science throughout the research process.

Introduction to Open Science Tools#

The word “tools” refers to any type of resource or instrument that can be used to support your research. In this sense, tools can be a collection of useful resources that you might consult during your research, software that you could use to create and manage your data, or even human infrastructure such as a community network that you join to get more guidance and support on specific matters.

In this context, open science tools are any tools that enable and facilitate openness in research, and support responsible open science practices. It is important to note that open science tools are often open source and/or free to use, but not always.

Open science tools can be used for:

  • Discovery - Tools for finding content to use in your research.

  • Analysis - Tools to process your research output, e.g. tools for data analysis and visualization.

  • Writing - Tools to produce content, such as Data Management Plans, presentations, and preprints.

  • Publications - Tools to use for sharing and/or archiving research.

  • Outreach - Tools to promote your research.

Persistent Identifiers#

A digital persistent identifier (or “PID”) is a “long-lasting reference to a digital resource” that is machine-readable and uniquely points to a digital entity, according to ORCID. Examples of persistent identifiers used in science are described below.

ORCID#

ORCID logo.

An “Open Researcher and Contributor ID” (ORCID) is a persistent identifier that distinguishes a researcher from every other researcher. Following are some key details about ORCIDs.

It is a free, nonproprietary numeric code that:

  • Uniquely and persistently identifies authors and contributors of scholarly communication.

  • Is used similarly to how tax ID numbers are used for tax purposes.

ORCIDs are used to link researchers to their research and research-related outputs. An ORCID is a 16-digit number that uniquely identifies a researcher and is integrated with certain organizations (such as some publishers) that add research products (for example, a published paper) to an individual’s ORCID profile. ORCIDs are meant to last throughout one’s career and help avoid confusion when information about a researcher changes over time (e.g., a career change or name change).

Many publishers, academic institutions, and government bodies support ORCID. In 2023, ORCID reported over 1,300 member organizations and more than 9 million accounts in active use. You can connect your ORCID record with your professional information (affiliations, grants, publications, peer review, and more).
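
ORCID iDs are not arbitrary strings: the final character is a checksum computed from the first 15 digits, which lets software catch most typos. The snippet below is a minimal illustrative sketch, assuming the ISO 7064 MOD 11-2 rule that ORCID describes in its support documentation, and it uses a sample iD that appears in ORCID's own documentation.

```python
def orcid_checksum(base_digits: str) -> str:
    """Compute the final character of an ORCID iD (ISO 7064 MOD 11-2).

    `base_digits` is the first 15 digits of the identifier, without dashes.
    """
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    return "X" if result == 10 else str(result)


# Sample iD from ORCID's documentation; check that it ends with the right character.
candidate = "0000-0002-1825-0097"
digits = candidate.replace("-", "")
print(orcid_checksum(digits[:15]) == digits[15])  # prints True if the checksum matches
```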

Digital Object Identifiers (DOI)#

DOI logo.

A DOI is a persistent identifier used to cite data, software, journal articles, and other types of media (including presentation slides, blog posts, videos, logos, etc.).

Unlike dynamic transient URLs, DOIs are static pointers to documents on the internet. Since a DOI is static, each new version of data or software that you want to cite will need a new DOI. Some DOI providers allow for one DOI to point to “all versions” and a series of individual DOIs for each specific version. Individuals cannot typically request a DOI themselves, but rather have to go through an authorized organization that can submit the request.

Making a DOI for your product ensures its longevity! This means that if you cite a DOI in a research paper, you can be confident that future readers will be able to follow that citation to its source, even if websites have completely changed in the meantime.

For example, the DOI: 10.5067/TERRA-AQUA/CERES/EBAF-TOA_L3B004.1 will always resolve to a web page that explains what the CERES_EBAF-TOA_Edition4.1 data set is and how to download it. (See the screenshot below if you’re curious about what this dataset actually is!)

The DOI system is an international standard maintained by the International Organization for Standardization (ISO); individual DOIs are issued and maintained through registration agencies.

Citations Using DOIs#

DOI example.

DOIs make citing research products easier and more useful.

Data repositories will typically instruct you on the exact way to cite their data, which includes the correct DOI. For example, let’s take a look at the CERES_EBAF-TOA_Edition4.1 data set mentioned above. This is an example from the Atmospheric Science Data Center’s (ASDC) website.

Activity 1: Find and Resolve a DOI#

Estimated time for activity: 5 minutes. It is an individual activity.

In this activity, you will search for the DOI of a dataset or piece of software that you use, and then use the DOI website to “resolve” the DOI name. “Resolving” means that you will be taken to the information about the product designated by that particular DOI. (A programmatic version of the same lookup is sketched after the steps below.)

  1. Find the DOI for a dataset or software you use often.

    1. This should be listed either in the citation file or on the website where that data/software is published.

    2. If you can’t find a DOI, you can instead locate the DOI listed on this page: https://asdc.larc.nasa.gov/project/CERES/CERES_EBAF-TOA_Edition4.1

  2. Go to https://www.doi.org/ and scroll down to the bottom of the page to “TRY RESOLVING A DOI NAME”.

  3. Copy and paste the DOI you found into the form called “TRY RESOLVING A DOI NAME”.

  4. Click Submit.

  5. The page should automatically redirect you to a page that explains and contains the cited data.
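
If you prefer to script this lookup, the same resolution can be done programmatically. The sketch below assumes the third-party requests package is installed; it follows the redirect that doi.org issues for the CERES DOI mentioned above.

```python
import requests

# The CERES_EBAF-TOA_Edition4.1 DOI mentioned above.
doi = "10.5067/TERRA-AQUA/CERES/EBAF-TOA_L3B004.1"

# Asking https://doi.org/<DOI> to resolve the identifier follows the same
# redirect that the "TRY RESOLVING A DOI NAME" form uses.
response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)

print(response.status_code)  # 200 if the landing page was reached
print(response.url)          # the landing page the DOI currently points to
```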

Examples of PIDs in Action#

DOI and ORCID summary.
  • The necessity for a persistent identifier (PID) begins when a researcher writes code. To make the code searchable, the researcher uploads their code to a repository and registers a DOI for their script. Now, others can review and use the code and cite it properly.

  • A workshop planning committee collaboratively authors a paper that summarizes the results of a workshop. They collect the ORCIDs of everyone who participated in the workshop and include them in the paper. Finally, they publish in an academic journal that automatically assigns the paper a DOI.

  • A community scientist attends an online conference and gives a short talk. They deposit their slides in an online repository, then create a DOI to enable easy sharing with colleagues and straightforward citation.

Useful Open Science Tools#

Metadata#

Metadata are data that describe your data, either accompanying your data as a separate file or embedded in your data file. They are often used to provide a standard set of general information about a dataset (e.g., data temporal/spatial coverage or data provider information) to enable easy use and interpretation of the data.

Metadata is essential to the implementation of FAIR Principles because it makes data searchable in an archive, provides context for future use, and presents a standard vocabulary.

Metadata can be more readily shared than data - it usually does not contain restricted information, and it is much smaller than the entire data set.

Purpose of Metadata#

Metadata can facilitate the assessment of dataset quality and support data sharing by answering key questions about the data, such as:

  • How data were collected and processed.

  • What variables/parameters are included in the dataset.

  • What the variables mean and how they relate to one another.

  • Who collected the data (science team, organization, etc.).

  • How and where to find the data (e.g., DOI).

  • How to cite the data.

  • Which spatio-temporal region/time the data covers.

  • Any legal, guideline, or standard information about the data.

Metadata enhances the searchability and findability of data by allowing machines to read and interpret datasets.

According to The University of Pittsburgh, “A metadata standard is a high-level document which establishes a common way of structuring and understanding data, and includes principles and implementation issues for utilizing the standard.”

Many standards exist for metadata fields and structures to describe general data information, and different domains have their own commonly used standards. It is a best practice to use a standard that is common in your domain, when applicable, or that is requested by your data repository.

Types of Metadata#

There are different types/categories of metadata addressing different purposes:

Descriptive Metadata

Descriptive metadata can contain information about the context and content of your data, such as variable definitions, data limitations, measurement/sampling descriptions, abstract, title, and subject keywords.

Structural Metadata

Structural metadata are used to describe the structure of the data (e.g., file format, the dataset hierarchy, and dimensions).

Administrative Metadata

Administrative metadata explains the information used to manage the data (e.g., when and how it was created, which software, and the version of the software used in data creation).
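
To make the three categories concrete, here is a minimal, hypothetical metadata record written as a sidecar JSON file. The field names are illustrative only and do not follow any particular formal standard; a real record should use the standard preferred by your domain or repository.

```python
import json

# Hypothetical metadata record for a small dataset; field names are illustrative.
metadata = {
    "descriptive": {
        "title": "Surface temperature transect, Site A",
        "abstract": "Hourly surface temperature measured along a 2 km transect.",
        "keywords": ["temperature", "field campaign"],
    },
    "structural": {
        "file_format": "CSV",
        "variables": {"time": "UTC timestamp", "temp_c": "air temperature (degrees Celsius)"},
    },
    "administrative": {
        "created": "2023-07-01",
        "creator": "Example Research Group",
        "software": "sensor-logger v1.2",
        "license": "CC-BY-4.0",
    },
}

# Save the record as a separate sidecar file that accompanies the data.
with open("dataset_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```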

Documentation#

Documenting the production and management of your science benefits both you and those who might use your data, code, or results in the future. You are your own best collaborator: documentation can save you from a headache when you need to reference or reuse your work in six months, or when you try to recall meticulous details about your process later on. Properly documented research products are far easier for others to use.

Types of documentation include (many of which will be expanded upon later in this course):

Data

Summary of the data (e.g., as a README file or user guide) that answers questions such as:

  • What are known errors for these data?

  • How can this data be used?

  • How were the data collected?

  • Associated publications: How did others use these data?

Software

  • README files: Basic installation and usage instructions.

  • Inline comments in code: Annotations on code components.

  • Release notes: What is new in this version?

  • Associated publications: How did others use this software?

Results

  • Associated publications: What was the research process?

  • Packages of data and software for regenerating results.
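
As a small illustration of the “inline comments in code” and README-style documentation listed above, here is a hypothetical, self-contained Python function whose docstring and comments record what it expects, what it returns, and how it works. The function itself is only an example.

```python
def detrend(series):
    """Remove a linear trend from a 1-D sequence of measurements.

    Documented here so future users (including future you) know what the
    function expects and returns.

    Parameters
    ----------
    series : list of float
        Evenly spaced measurements.

    Returns
    -------
    list of float
        The input with its best-fit line subtracted.
    """
    n = len(series)
    xs = range(n)
    # Ordinary least-squares slope and intercept, written out explicitly
    # so the method is transparent without extra dependencies.
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, series)]


print(detrend([1.0, 2.0, 3.0, 4.0]))  # a perfectly linear series detrends to zeros
```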

Repositories#

Repositories are storage locations for data, results, code, and compiled software, providing the most common way to share and find each of these components. In general, you want to use a long-term repository that will independently host and store your data, making sure that it is both shared and preserved. Different kinds of repositories serve different purposes. For example, Zenodo acts as an archiving repository for individual version releases of data, software, and publications.

Different types of repositories:

  • General repositories

  • Domain-specific repositories

  • Institutional repositories

  • National repositories

Users should select repositories based on their needs. See the sections in the rest of this day and Days 3-5 for more details.

Pre-registration#

Pre-registration is the process by which a researcher documents their research plans in an open-access format prior to the start of a project. This provides a locked, time-stamped proof of the origin of a concept. Pre-registration is currently more widely adopted by certain disciplines, particularly the social sciences.

Types of Pre-Registration Include:

Standard Pre-registration

An investigator documents their plans in writing and submits them to a pre-registration service. This documents the researcher’s plans prior to undertaking the research and provides investigators and reviewers with a way to distinguish a priori hypotheses from post-hoc exploratory analyses. The document may be kept private for some period of time but is usually made public upon submission of the manuscript for publication.

Registered Reports

An investigator writes a manuscript describing the motivation for a study and a detailed description of the methods and submits it to a journal for peer review prior to undertaking the research. The manuscript is reviewed based on the importance of the research question and the quality of the methods. If accepted, the journal agrees to publish the paper regardless of the results, assuming that there are no problems with the implementation of the methods.

Registered Replication Report

A type of registered report in which the investigators wish to attempt to replicate a particular published finding, usually involving multiple research sites.

Sharing Grant Proposals

Another way to document and timestamp research plans and concepts is to share funded grant proposals publicly. This has the added benefit of making the funding process more transparent and providing examples of successful grant proposals for other researchers, particularly those in their early career stage.

Why is Pre-Registration Important?#

  • It forces the researcher to plan and think through both why and how they are pursuing their research question.

  • It provides the researcher with a way to determine whether a hypothesis was truly held a priori, versus relying upon memory.

  • It forces the researcher to think through their analysis plan in more detail, potentially surfacing issues that could influence the design of the study.

  • It helps prevent unethical manipulation of data analyses and project design to yield statistically significant results.

  • It helps prevent selective reporting of measures.

When Can/Should One Pre-Register Their Research?#

A planned research activity can be pre-registered at any point, as long as the particular activity being registered has not started. However, there are several points at which registration is most common:

  • Prior to the collection of data for a project

  • Prior to analysis of an existing or openly available dataset

Source: Registration — Stanford Psychology Guide to Doing Open Science (poldrack.github.io)

A 2023 Nature survey on researcher attitudes towards open science practices found that about 88% of respondents favor sharing data or code online, while only 58% support pre-registration. This more moderate support suggests that limited awareness of pre-registration's benefits, along with lingering concerns about the practice, remain obstacles to wider adoption.

Open Science and Data Management Plans#

To successfully use, make, and share science openly, we need an open science and data management plan (OSDMP).

  • From day 1, establish a plan for management, preservation, and release of data, software, and results.

  • This plan is your blueprint for open science - refer to your plan often to ensure you succeed in your goal of openness.

We’ll discuss each component (data, software, & results) when we cover each topic.

Note: Many funding opportunities (e.g., NASA ROSES) require an OSDMP as part of your proposal. For more information on NASA Science Mission Directorate’s (SMD’s) policies, please see NASA Guidance on Management Plans and Open Source Science Guidance for Researchers.

Design Your Science to be Open#

Funding organizations and agencies around the world are beginning to require open science plans. In this course, we will focus on the NASA Open Science and Data Management Plan. These plans are not unique to NASA. Knowing how to write one for NASA should prepare you for almost any funding opportunity.

The OSDMP describes how the scientific information that will be produced from scientific activities will be managed and made openly available. Specifically, a plan should include sections on data management, software management, and publication sharing. If your study has other types of outputs, such as physical samples, hardware, or anything else, you should include those in the plan. An OSDMP helps researchers think about the details of how they plan to share results.

A well-written OSDMP can help you win funding because it demonstrates your skills at doing open science!

Research artifacts plan.

Example sections to include in an OSDMP:

  1. Data Management Plan (DMP)

  2. Software Management Plan (SMP)

  3. Publication sharing

  4. Other open science activities

  5. Roles and responsibilities

Each of these sections should address the following questions (a minimal outline sketch follows the list):

  • What?

    • Description of types of materials that will be produced

  • When?

    • The schedule for archiving and sharing

  • Where?

    • The repository(ies) and archives that will be used to share materials

  • How?

    • The details of how to enable reuse of materials (e.g., licensing, documentation, metadata)

  • Who?

    • Roles and responsibilities of the team members
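
As a rough illustration of how those questions map onto the example sections, here is a hypothetical OSDMP outline expressed as a simple Python data structure. Every entry is a placeholder rather than official wording; a real plan would be written as prose following the funder's template.

```python
# Illustrative OSDMP skeleton; all values are placeholders for a hypothetical project.
osdmp_outline = {
    "data_management_plan": {
        "what": "Gridded model output (~50 GB) in NetCDF with standard metadata",
        "when": "Deposited within 6 months of the associated publication",
        "where": "A long-term data repository such as Zenodo",
        "how": "Open license, README, and metadata to enable reuse",
        "who": "Data lead: postdoc; oversight: PI",
    },
    "software_management_plan": {
        "what": "Analysis scripts and a processing pipeline in Python",
        "when": "Released with each paper submission",
        "where": "GitHub for development, archived releases on Zenodo",
        "how": "BSD 3-Clause license, documentation, versioned releases",
        "who": "Software lead: graduate student",
    },
    "publication_sharing": "Preprints plus open-access journal articles",
    "other_open_science_activities": "Open workshops; pre-registration where applicable",
    "roles_and_responsibilities": "PI is accountable for the plan; all members contribute",
}

print(osdmp_outline["data_management_plan"]["where"])
```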

Data Management Plan#

Most major research foundations and federal government agencies now require scientists to file a data management plan (DMP) along with their proposed research plan. Data and other elements, such as code and publications, have their own lifecycles and workflows, which need to be described in the plan. DMPs are a critical aspect of open science and help keep you and your collaborators informed and on track throughout the data management lifecycle.

Successful DMPs typically state clearly how the FAIR and CARE principles will be applied.

The data management lifecycle is typically circular. Research data are valuable and reusable long after the project’s financial support ends. Data reuse can extend beyond our own lifetimes. Therefore, when designing a project or supporting an existing corpus of data, we need to remain cognizant of what happens to the data after our own research interaction ends.

Data management plans typically include the following:

  • Descriptions of the data expected to be produced from the proposed activities, including types of data to be produced, the approximate amount of each data type expected, the machine-readable format of the data, data file format, and any applicable standards for the data or associated metadata.

  • The repository (or repositories) that will be used to archive data and metadata arising from the activities and the schedule for making data publicly available.

  • Description of data types that are subject to relevant laws, regulations, or policies that exclude them from data sharing requirements.

  • Roles and responsibilities of project personnel who will ensure implementation of the data management plans.

Software Management Plan#

Software management plans describe how software will be managed, preserved, and released as part of the scientific process. This helps ensure transparency and reproducibility in the scientific process. Day 4 on Open Code shares more details about the importance of sharing code as part of the scientific process.

General components of a software management plan:

  • Description of the software.

  • Repository(ies) and archive(s) in which software will be shared.

  • Sharing guidelines.

  • Personnel roles and responsibilities.

  • Any community-specific information of note.

At a minimum, a software management plan for SMD-funded (NASA Science Mission Directorate) research should include:

  • Description of the software expected to be produced from the proposed activities, including types of software to be produced, how the software will be developed, and the addition of new features or updates to existing software. This can include the platforms used for development, project management, and community-based best practices to be included such as documentation, testing, dependencies, and versioning.

  • The repository(ies) that will be used to archive software arising from the activities and the schedule for making the software publicly available.

  • Description of software that is subject to relevant laws, regulations, or policies that exclude it from software sharing requirements.

  • Roles and responsibilities of project personnel who will ensure implementation of the software management plan.

Open Science Plan#

The OSDMP should also describe other open processes as part of the plan. This includes the types of publications that are expected to be produced from the activities, including peer-reviewed manuscripts, technical reports, conference materials, and books. The plan should also outline the methods expected to be used to make the publications publicly accessible.

This section may also include a description of additional open science activities associated with the project. This may include:

  • Holding scientific workshops and meetings openly to enable broad participation.

  • Pre-registering research plans in advance of conducting scientific activities.

  • Providing project personnel with open science training or enablement (if not described elsewhere in a proposal).

  • Implementing practices that support the inclusion of broad, diverse communities in the scientific process as close to the start of research activities as possible (if not described elsewhere in a proposal).

  • Integrating open science practices into citizen science activities.

  • Contributions to or involvement in open-science communities.

Publications Plan#

A plan for publications is a crucial piece of the OSDMP. A publications plan should include the following features:

  • It describes how results will be managed, preserved, and released - in other words, how you will communicate your findings.

  • It includes plans for conference talks, whitepapers, peer-reviewed journal articles, books, and other such documents.

  • It is written in compliance with any rules and regulations within your organization, as well as those of your funding source.

  • As with the data and software plans, it serves as a foundational framework for your project from start to finish.

Examples of Requirements for Open Science Management Plans#

Globally, organizations and agencies are moving towards open science and beginning to require plans as part of funding. Here are just some of them:

Examples include funding agencies in the USA as well as global institutes.

Key Takeaways#

In this section, you learned:

  • The definition of open science tools, common examples, and which parts of the scientific workflow they can support.

  • The definition and purpose of persistent identifiers. The usefulness of ORCIDs and DOIs in the scientific process.

  • Examples of useful and common open science tools such as metadata, documentation, repositories, and pre-registration.

  • The steps for writing an open science and data management plan.


Section 3: Tools for Open Data#

This section discusses the concepts, considerations, and tools for making data and results open. It starts with a closer look at the FAIR principles and how they apply to data. The section includes an introduction to plans, tools, data formats, and other considerations related to making data and sharing the results based on that data.

Introduction to Open Data#

Scientific data is any type of information that is collected, observed, or created in the context of research. It can be:

  • Primary – Raw from measurements or instruments.

  • Secondary – Data processed through analysis and interpretation.

  • Published – Final format available for use and reuse.

  • Metadata – Data about your data.

It is everything that you need to validate or reproduce your research findings, as well as what is required for the understanding and handling of the data.

The following sections discuss ways to ensure that data are fully utilized and accessible to as many people as possible. These best practices center on community frameworks and tools that help researchers manage and share open data.

FAIR Principles#

Just like driving on the road, if everyone follows agreed-upon rules, everything goes much smoother. The rules don’t need to be exactly the same for every region but share common practices based on insights about safety and efficiency.

For example, maybe you drive on the left side of the road or the right side. Either is fine, those sort of details are for different communities to decide on. However, there are overarching guidelines shared by communities across the globe, such as the rule to drive on the road, not the sidewalk, use a turn signal when appropriate, adhere to lights at intersections that direct traffic, and follow speed limits. Some communities may implement stricter rules than others or practice them differently, but these guidelines help everyone move around safely through a common understanding of how to drive on roads. For scientific data, these guidelines are called the Findable, Accessible, Interoperable, Reusable, or “FAIR” principles. They do to data what their title suggests. That is, these principles make it possible for others (and yourself) to find, get, understand, and use data correctly.

Findable

To be Findable:

  • Data and results are assigned a globally unique and persistent identifier.

  • Data are described with rich metadata.

  • Metadata clearly and explicitly includes the identifier of the data it describes.

  • Data and results are registered or indexed in a searchable resource.

Current Enabling Tech:

Accessible

To be Accessible:

  • Data and results are retrievable by their identifiers using a standardized communication protocol.

  • The protocol is open, free, and universally implementable.

  • The protocol allows for an authentication and authorization procedure, where necessary.

  • Data and results are publicly accessible and licensed under the public domain.

    • Metadata are accessible, even when the data are no longer available. Data and metadata will be retained for the lifetime of the repository.

    • Metadata are stored in high-availability database servers.

Current Enabling Tech:

Note that Microsoft Exchange Server and Skype are examples of proprietary protocols. As always, it is necessary to balance accessibility with security concerns, which may impact the chosen protocol.

Interoperable

To be Interoperable:

  • Data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

  • Data use a known, standardized data format.

  • Data use vocabularies that follow FAIR principles.

  • Data include qualified references to other (meta)data.

Current Enabling Tech:

  • Zenodo uses JSON Schema as internal representation of metadata and offers export to other popular formats such as Dublin Core or MARCXML.

  • For certain terms, Zenodo refers to open, external vocabularies, e.g., license (Open Definition), funders (FundRef), and grants (OpenAIRE).

  • Each referenced external piece of data is qualified by a resolvable URL.

Reusable

To be Reusable:

  • Data are richly described with a plurality of accurate and relevant attributes.

  • Data are released with a clear and accessible data usage license.

  • Data are associated with detailed provenance.

  • Data meet domain-relevant community standards.

Current Enabling Tech:

  • The metadata record contains a minimum of DataCite’s mandatory terms, with optionally additional DataCite recommended terms and Zenodo’s enrichments.

  • Zenodo is not a domain-specific repository, yet through compliance with DataCite’s Metadata Schema, metadata meets one of the broadest cross-domain standards available.

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018, doi:10.1038/sdata.2016.18 (2016).

These are high-level guidelines, and much like open science, implementation is nuanced. Sometimes, it takes a group effort and/or a long production process/funding to make data and results FAIR. For other datasets, it could be more straightforward. A well-coordinated data management plan is needed for full compliance with FAIR, and the details of this will be discussed further in Day 3 – Open Data.

Tools to Help with Planning For Open Data Creation#

Data Management Plan#

The previous section describes the requirements of a data management plan (DMP). Below are two open science resources to get you started on creating a data management plan:

DMPTool

The DMPTool in the US helps researchers by featuring templates that list a funder’s requirements for specific directorate requests for proposals (RFPs). The DMPTool also publishes other open DMPs from funded projects that can be referenced to improve your own. Similarly, the Research Data Management Organizer (RDMO) enables German institutions and researchers to plan and carry out the management of their research data.

ARGOS

ARGOS is used to plan research data management activities of European and nationally funded projects (e.g., Horizon Europe, CHIST-ERA, the Portuguese Foundation for Science and Technology - FCT). ARGOS produces and publishes FAIR, machine-actionable DMPs that contain links to other outputs (e.g., publications, data, and software) and minimizes the effort of creating DMPs from scratch by introducing automation into the writing process. OpenAIRE provides a guide on how to create a DMP.

Data Repositories#

A data repository is a digital space to house, curate, and share research outputs. Data repositories were originally used to support the needs of research communities. Examples of data repositories include:

  • The Protein Data Bank is a data repository that catalogs 3D structures of proteins and nucleic acids.

  • GenBank, maintained by the National Institutes of Health, is a genetic sequence database that contains annotated, publicly available nucleic acid sequences.

  • The Image Data Resource is a public repository of microscopy bio-image datasets from published studies.

  • The Electron Microscopy Public Image Archive is a public resource for raw cryo-EM images.

  • OpenNeuro is an open platform for validating and sharing brain imaging data. The tools featured in OpenNeuro enable easy access, search, and analysis of annotated datasets.

Open science tools such as data repositories should implement FAIR principles, especially in the case of attribution of persistent identifiers (e.g., DOI), metadata annotation, and machine-readability.

Additional examples of data repositories and other open science tools include but are not limited to:

ZENODO

Zenodo is an example of a data repository that allows the upload of research data and creates DOIs. Its popularity among the research community is due to its simplified interface, support of community curation, and feature that enables researchers to deposit diverse types of research outputs, from datasets and reports to publications, software, and multimedia content.

DATAVERSE

The Dataverse Project is an open-source online application to share, preserve, cite, explore, and analyze research data, available to researchers of all disciplines worldwide for free.

DRYAD

The Dryad Digital Repository is a curated online resource that makes research data discoverable, freely reusable, and citable. Unlike previously mentioned tools, it operates on a membership scheme for organizations such as research institutions and publishers.

DATACITE

DataCite is a global non-profit organization that provides DOIs for research data and other research outputs on a membership basis.

OSF

The Open Science Framework is an open-source platform for sharing, managing, and collaborating on research.

Data services and resources for supporting research require robust infrastructure that relies on collaboration. An example of an initiative on the infrastructures of data services comes from the EUDAT Collaborative Data Infrastructure, a sustained network of more than 20 European research organizations.

Private companies also host and maintain online tools for sharing research data and files. Figshare, for example, is a free, open-access service operated by a private company. It provides DOIs for all types of files and recently developed a restricted publishing model to accommodate intellectual property (IP) rights requirements, which allows outputs to be shared only within a customized Figshare group (such as your research team) or with users in a specific internet address (IP) range. Additional advances include integration with code repositories such as GitHub, GitLab, and Bitbucket.

Additional research data repositories can be found in the publicly available Registry of Research Data Repositories. OpenAIRE, a hosted search engine, also provides a powerful search function for data and repositories. It features filters for country, type, and thematic area, and enables the download of data.

The amount of data, repositories, and different policies can be overwhelming. When in doubt about determining which repository is right for you, consult librarians, data managers, and/or data stewards in your institution, or check within your discipline-specific or other community of practice.

Activity 2: Explore Zenodo and Sign Up!#

Estimated time for activity: 8 minutes. It is an individual activity.

One of the most widely used general-purpose repositories at the moment is Zenodo. Review the following 4.5-minute video to get an overview of Zenodo, and then sign up for an account. You can use your ORCID to sign up if you have one.

Watch Video

Tools to Help with Using and Making Open Data#

Data Formats#

A useful file format is one that available software can read into memory. Think of the format as a tool for making data accessible. Easy-to-use formats feature:

  • A simple, easy-to-understand structure.

  • A clear and open specification for the format that is ideally not tied to a specific software product.

  • Open software libraries and APIs that can parse the format.

The formats that are considered the most interoperable against the criteria above include Comma Separated Values (CSV), Extensible Markup Language (XML), and JavaScript Object Notation (JSON). Other common formats for researchers include binary array-based formats like Network Common Data Form (NetCDF), Hierarchical Data Format (HDF), GeoTIFF, Flexible Image Transport System (FITS), and other formats designed for cloud storage and access like Zarr, Cloud Optimized GeoTIFF, and Parquet. Many of these formats have tools that check datasets for compliance and readability.
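
As a small illustration of why these plain-text formats are so interoperable, the sketch below converts a hypothetical CSV file into JSON using only the Python standard library; no specialized software is needed to read or write either format.

```python
import csv
import json

# Hypothetical example: convert a small CSV table into JSON, two of the
# plain-text formats named above, using only the Python standard library.
rows = []
with open("observations.csv", newline="") as f:
    for row in csv.DictReader(f):   # each row becomes a dict keyed by the header line
        rows.append(row)

with open("observations.json", "w") as f:
    json.dump(rows, f, indent=2)     # the same records, now as a JSON array
```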

Inspecting Data#

Modern data formats allow the storage of much more than mere data points. Once you adopt one of these standards (e.g., NetCDF), the contents of each file can be discovered with a variety of tools that map the primary data and/or display the associated metadata. Far more tools exist for inspecting data than can be mentioned here, but notable ones to start with include:

CSV, XML, JSON

These files can all be opened with most common text editors. There are also tools that can present these files in a more user-friendly way.

NetCDF, HDF, FITS

These files require special software tools to view their contents. Many of these tools will also visualize the data as well.

  • NetCDF and HDF: Most files are easily viewed using the Xarray open-source software library in Python or the ncdf4 library in R.

  • FITS: There are many options; see the linked list of tools for a starting point.

ZARR, COG, PARQUET

These files require special software tools to view their contents. Many of these tools will also visualize the data as well.

  • Zarr: Files are easily viewed using the Xarray open-source software library in Python or the pizzarr library in R.

  • COG: Files are viewed using the rioXarray open-source software library in Python or the terra library in R.

  • Parquet: Files are viewed using the Pandas open-source software library in Python or the Arrow library in R.
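
To make the list above concrete, here is a minimal sketch of inspecting two of these formats in Python. It assumes xarray (with a NetCDF backend), pandas, and a Parquet engine such as pyarrow are installed; the file names are hypothetical.

```python
import xarray as xr
import pandas as pd

# Hypothetical files; the point is that a few lines reveal both data and metadata.
ds = xr.open_dataset("model_output.nc")   # NetCDF (similar calls work for many HDF files)
print(ds)                                  # dimensions, variables, and global attributes
print(ds.attrs)                            # file-level metadata only

df = pd.read_parquet("catalog.parquet")    # Parquet table (requires pyarrow or fastparquet)
print(df.dtypes)                           # column names and types
print(df.head())                           # first few records
```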

FAIR Assessment#

How ‘FAIR’ is your data? Two groups, FAIRsharing.org and the Research Data Alliance (RDA), have developed the FAIR Metrics and the FAIR Data Maturity Model to help assess the ‘FAIR’-ness of a dataset. There are also open-source tools that help researchers assess their data:

AUSTRALIAN RESEARCH DATA COMMONS (ARDC)

Online questionnaire (manual), best for:

  • Triggering discussions at the initial stages of considering FAIR implementation

  • Identifying areas for improvement

Outputs include:

  • Progress bar for each FAIR principle

  • Aggregate bar for all

Link

FAIR-CHECKER

Automated via website or API, best for:

  • Scalability to many datasets

  • Identifying areas for improvement

Outputs include:

  • A chart with scores and details

Link

F-UJI

Automated via website or API, best for:

  • Scalability to many datasets

  • Detailed documentation of tool

Outputs include:

  • A report and chart with scores and details

Link

FAIR EVALUATION SERVICES

Automated via website or API, best for:

  • Scalability to many datasets

  • To generate a custom assessment

Outputs include:

  • A detailed report and chart

Link

Key Takeaways#

In this section, you learned:

  • The different types of scientific data, including primary, secondary, published, and metadata.

  • A list of open science practices to implement FAIR principles that make data and results easily accessible to a wide range of people.

  • Digital tools to help plan for making and sharing open data.


Section 4: Tools for Open Code#

This section introduces you to some useful tools for working with open code. You will learn the various tools available to develop, store, and share open code, from version control to code editing software to containers.

Introduction to Open Code#

In Section 3, we learned about useful tools for working with scientific data. Now, we will provide an overview of commonly used tools that help us write and run computer code to explore, analyze, and visualize our scientific data. Later in Day 4 – Open Code, we will discuss in greater detail what it means to make our code open and walk through the steps of how to find, create, and share open code.

Understanding how to work with scientific code is essential in the modern landscape of data-driven research. The tools presented in this lesson encompass a diverse array of resources designed to streamline, enhance, and optimize the process of developing, maintaining, and collaborating on code development for scientific research. They enable the creation of robust and efficient code, often leveraging the collective wisdom of the open-source community. In the pursuit of reproducibility and transparency, these tools can also facilitate the sharing and dissemination of scientific code, fostering collaboration and ensuring that the foundations of scientific research remain open and accessible to all.

The precedent for making code open is the Linux Operating System. Quick facts:

  • Started in 1991 by Linus Torvalds.

  • Almost immediately released for scrutiny.

  • Many eyes → Many bugs found → Many fixes.

You can read the full history at this link.

Tools for Version Control#

Version Control#

Version control is the practice of tracking and managing changes made to code or other types of files. You may be familiar with “Track changes” in software like Microsoft Word. This is a form of version control, though not one well-suited to working with code. Version control is considered standard practice in the software development community and simplifies the management of code through time.

The general way we use version control starts by initializing a folder on your computing platform with the version control system you are using. A version control system automatically tracks all changes made by contributors and allows you to work offline and return later with updates. You write code as you usually do in your code editor of choice. After you have written some code or made some updates to existing code, you then commit those changes to the version control system to create a sort of “checkpoint” that you can then revert back to later if necessary. Then, you add or update more code and commit changes again. Each commit requires you to add a short message which lets you briefly describe what changes were made. These messages serve as metadata that ensures collaborators, future users, and future you understand your development process at a point in time.

This may sound like a simple process, and in many ways it is! So why is it so important? Especially when it comes to coding, the ability to create a snapshot in time of a piece of code can be very helpful. For instance, you may have a piece of code that yields the intended result, but then you want to add a new function. You may choose to copy that code file so you don’t lose the current state, and then work in a new file. This can become cumbersome pretty quickly when you have multiple files that are different versions of the same piece of code. Or instead of creating a new file, you may write code for the new function directly in the original file, but now the code throws errors when you try to run it, and you can’t remember which lines you added since the last time the code ran without errors. By using version control, these problems are solved because we can revert back to the checkpoint when the code ran cleanly, and thereby avoid the need to create multiple copies to save the original piece of code.

There are many other features of version control systems, such as the concept of creating “branches” that allow you to work on new updates to a piece of code independently of and in parallel to the original piece of code. A branch is a deviation from the original code but can be merged back into the original code when desired. All of these concepts are even more useful when collaborating with others using version control platforms, a collaborative practice that will be discussed later in this lesson.
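
The basic commit-and-branch cycle described above can be sketched in a few steps. Most people type the equivalent git commands in a terminal; the version below uses the GitPython library purely so the example stays in Python, and the folder and file names are hypothetical.

```python
from git import Repo  # GitPython; the same steps map directly onto `git` commands

# Initialize version control in a (hypothetical) project folder.
repo = Repo.init("my-analysis")

# ... edit my-analysis/cleaning.py in your editor of choice ...

# Stage the change and create a "checkpoint" with a short message.
repo.index.add(["cleaning.py"])
repo.index.commit("Add first pass at data cleaning")

# Work on a new idea in parallel without disturbing the working version.
experiment = repo.create_head("faster-cleaning")   # create a branch
experiment.checkout()                               # switch to it

# Review the history of checkpoints at any time.
for commit in repo.iter_commits():
    print(commit.hexsha[:7], commit.message.strip())
```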

Types of Software Version Control#

There are two main styles of software version control systems:

Centralized

  • Singular “main” copy of the codebase

  • Must interact with a specific server

  • Example: Subversion (SVN)

Distributed (more popular)

  • Each developer’s system can retain a copy of the codebase

  • Examples:

    • Git

    • Mercurial

Using a distributed version control system like Git gives you more flexibility.

Example: Git#

The most popular version control system for software development is Git. Git is open-source and is commonly used in conjunction with web-based software hosting sites like GitHub and GitLab (more on these in the next section), which allow for collaboration and sharing of code. You can also use it on your local computer when writing your own code. Git is often run at the command line, but there are other interfaces for using Git as well, including GitHub Desktop and some code editors that have Git integration included (more on this later).

Comic about Git.

Source

Git is very powerful and widely used (according to a Stack Overflow developer survey, over 87% of developers use Git), but that doesn’t mean it is straightforward to learn. There are many good resources for learning Git (Neuromatch Academy hosts workshops for Git & GitHub as well). If you find Git confusing at first, know that you are not alone! (There’s even an XKCD comic about it!). For in-depth training on Git, please see the Software Carpentry lesson listed below: Version Control with Git: Summary and Setup (swcarpentry.github.io)

Version Control Platforms#

Version control platforms, typically web-based software hosting platforms, expand the usefulness of version control by allowing for a centralized location to store and collaborate on code, along with many other helpful features for code development and sharing.

Examples of Git-based version control and software development collaboration platforms:

  • GitHub: a Git-based platform that allows collaboration and code history tracking. Owned by Microsoft.

  • GitLab: a Git-based platform (and software) that offers DevOps and CI/CD functionalities, as well as the ability to self-host GitLab instances.

  • BitBucket: a platform that can host Git and Mercurial repositories. Owned by Atlassian.

GitHub is one of the most popular platforms, and so we will provide examples of how to use GitHub in the rest of this section. It is important to note that GitHub is where most open-source software packages are housed, and so if you are interested in getting more involved with the open-source software community, GitHub is an essential tool to learn how to use!

Example: GitHub#

GitHub is an online, cloud-based software repository hosting site that integrates with Git and offers many other features that help with code development, collaboration, testing, and releases. Before we dive into some of these features, it’s important to understand how GitHub acts as a remote repository when using version control systems like Git.

If we go back to the general idea of using version control systems, GitHub can be added to the picture as a remote repository that hosts code. After creating a “checkpoint” in Git, you can then upload a copy of the current snapshot of your code to GitHub. There are a few reasons you might want to do this, including:

  • To serve as a backup for your work (it is now stored on a remote server that you can access even if your computer dies).

  • To share your code with others (more on this later in this course).

  • To collaborate with others on your code. By uploading to GitHub, your code can be made accessible to others who might want to add features.
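
Continuing the hypothetical example from the version-control discussion, connecting a local repository to GitHub and uploading your checkpoints looks roughly like this (again via GitPython; most people would run the equivalent git remote, git push, and git fetch commands). The repository URL is a placeholder.

```python
from git import Repo

repo = Repo("my-analysis")  # the local repository from the earlier sketch

# Connect the local repository to a (hypothetical) empty repository created on GitHub.
origin = repo.create_remote("origin", "https://github.com/your-username/my-analysis.git")

# Upload the current branch so GitHub holds a copy of every checkpoint.
origin.push(refspec="HEAD:main")

# Later, download collaborators' changes.
origin.fetch()
```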

Let’s expand on some of GitHub’s collaboration tools. Some of these features include:

Issue Tracking

Keep track of feature requests, bugs, and other types of updates via GitHub Issues. GitHub also allows the use of labels and assigning people to tasks to help organize tasks.

Project Discussion Forums

GitHub allows for an online discussion forum where you can ask and answer questions and hold community discussions.

Contribution Tracking

GitHub has a straightforward way to keep track of suggested code contributions (called “Pull Requests”) from different people.

Code Review Tools

GitHub has a rich set of tools for reviewing and accepting (or denying) contributions from others (or yourself), such as in-line comments and easily viewable tracked changes to individual files.

Tailored Permissions

Choose who has the ability to update the code. This helps you feel confident that only those with permission can update code that you shared in GitHub, and also others feel safe to suggest updates without worrying that they might accidentally overwrite existing code.

All of these features excel at enabling asynchronous collaboration across teams. Most scientific open-source packages use GitHub for their primary code development. Note that there are many more GitHub features that we don’t go into here, including further collaboration support, automated workflows, and much more. To learn more about GitHub, a good place to start is GitHub's own documentation and guides.

Summary of Benefits of Using Version Control and Version Control Platforms#

  • Features the ability to rewind changes back to any committed point

  • Eases collaboration with others

  • Keeps a directory clean from clutter, with no need for multiple copies of files

  • Provides a targeted backup system for your work

Tools for Editing Code#

Integrated Development Environments (IDEs)#

An Integrated Development Environment (IDE) plays an important role in open code development by offering a comprehensive toolkit to researchers, scientists, and developers for editing code. It is a software application that streamlines the entire process of creating, testing, and managing code for scientific research and data analysis. By providing an all-in-one platform, an IDE allows researchers to write, debug, and optimize code more efficiently, fostering collaboration and reproducibility in open code science projects.

In open science, where transparency and accessibility are paramount, IDEs often incorporate version control systems like Git to facilitate collaboration and ensure that a research codebase is readily available for others to use and improve. Additionally, many IDEs integrate with data analysis and visualization tools. This makes it easier for scientists to analyze and interpret their data, ultimately contributing to the advancement of open code science practices.

If you were in a room with 10 developers and asked them each what their favorite code editor is, you would get many different responses. In this lesson, we will go over a few of the more popular varieties.

Source-Code Editing & Kernels – The Value of IDEs and Kernels#

IDEs can bring a lot of good tools to your efforts. It’s not just about editing code anymore. Modern, robust IDEs can do most of the things listed here, if not more. One can use an IDE without executing in a kernel; one can use a kernel without having developed code in an IDE. However, they can work hand-in-hand.

Integrated Development Environment (IDE)

  • Source code editing

  • Syntax highlighting

  • Error/bug warnings

  • Plugins

  • Debuggers

  • Memory management

  • Version control

  • Build automation

Kernel

  • Execution environment

  • Like a virtual machine

  • Isolates work area

  • Tailor settings

  • Easily replicable

IDE Example: Visual Studio Code#

One of the most popular IDEs today, Microsoft’s Visual Studio Code (or VS Code), is feature-rich without being clunky.

VS Code Interface.
  • It has a “dark mode” option which is easier on the eyes for long coding sessions.

  • It provides the basics, such as syntax highlighting and an integrated terminal window.

  • It also has a wealth of plugins for connecting to servers, version control systems, and troubleshooting. It has several linter plugins, which can analyze your code for bugs and errors and help your team code in a consistent “style.” This eases code maintenance down the road.

  • If your line of code has an obvious error in it, the IDE will produce a red squiggle, just as if you’ve spelled something wrong in a Word Document.

Below is an example of a developer who accidentally typed an equal sign when they should have typed a colon. VS Code caught the error, and when the developer hovered over the red squiggle, VS Code explained what the error was and offered to take them to further documentation.

VS Code Interface While Editing File.

Another useful feature in VS Code (as well as many other code editors) is Git Integration. Instead of using a Terminal window, you can just make a few clicks and easily integrate Git into your workflow!

From VS Code, you can:

  • Easily see modifications to your code.

  • Create a branch.

  • Upload your changes directly to GitHub.

  • Download changes from other team members to your local system.

IDE Example: RStudio#

While Visual Studio Code is a more generic IDE where you can use plugins to specialize it, there are also IDEs, such as RStudio, that have specialized features for specific languages right out of the gate.

Researchers conducting statistical analysis tend to use the coding languages of R and Python. RStudio has built-in tools for that very purpose, including data visualization.

RStudio Interface.


Plain Text Editors for Coding#

Most laptop or desktop computers that run standard operating systems (Windows, macOS, Linux) have multiple pre-installed plain-text editors that can be used for coding. It is beneficial to know how to use at least one because it makes editing scripts and files a quick process.

Pros:

  • Lightweight

  • Many distributed natively with OS

Cons:

  • No plugins to help find bugs, errors, etc.

  • May not have syntax highlighting

Computational Notebooks#

A computational notebook refers to a virtual, interactive computing environment that combines code execution, documentation, and data visualization in a single interface. These notebooks are widely used in data science and coding fields. Popular examples include Jupyter Notebooks and R Notebooks. They allow users to write and run code in a step-by-step manner, providing an efficient platform for data analysis, research, and collaborative coding, with the added benefit of integrating rich text (including equations), images, and charts for clear documentation and communication. Even this tutorial is shared via Jupyter Notebook.
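As a tiny, self-contained illustration (not taken from any particular lesson), a single notebook cell might hold a few lines of Python whose output appears directly beneath the cell, alongside explanatory text written in neighboring cells:

import numpy as np
import matplotlib.pyplot as plt

# In a notebook, the code in a cell runs when you execute the cell,
# and the figure it produces is rendered inline, directly below it.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Output rendered inline in the notebook")
plt.show()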

Example: Jupyter Notebook and JupyterLab#

Jupyter Logo.

Jupyter notebooks are open-source web applications that are widely used for creating computational documents. But before we dive into Jupyter Notebooks, we want to make it clear that Jupyter Notebooks are one of many platforms in the Jupyter ecosystem:

  • Jupyter Notebook – a self-contained, interactive programming interface that displays output inline with the inputs that produced it

  • JupyterLab – an in-browser user interface showing multiple windows for notebooks, terminals, and code editing

  • JupyterHub – middleware for running shared interactive computing environments, including JupyterLab and Jupyter Notebook, on shared computing infrastructure (such as the Cloud)

We will use Jupyter Notebook as an example of a computational notebook and discuss how JupyterLab is related to Jupyter Notebook. The following section on computing platforms will discuss JupyterHub.

This screenshot shows an example of a Jupyter Notebook that integrates rich text (with headers and links), equations, code, and the interactive output from those lines of code, including a plot. This screenshot makes it clear why this is called a computational notebook - it resembles a lab notebook that you may have written out by hand in school.

Jupyter Interface.

Source: Project Jupyter

Many programming languages are supported by Jupyter. Fun fact: the name “Jupyter” refers to the three core languages supported by Jupyter: Julia, Python, and R.

JupyterLab is a browser-based interactive development environment that supports Jupyter Notebooks and offers a more flexible interface with many useful features. One of these features is Git integration, as we saw for other IDEs like Visual Studio Code.

Since Jupyter Notebooks allow for the integration of code with visualizations and text, they can serve as a tool to carry out research projects and create easily shareable computational documents for education, collaboration, or science communication. With rich text capabilities, such as the use of headers, italics, links, and many more, you can create a readable document that contains runnable code. These are just some of the reasons why JupyterLab and Jupyter Notebooks are widely used across many disciplines, including computational research and data science.

For more information on Jupyter products and the Jupyter community, check out their website.

If you want to dive in, check out Project Pythia’s “Getting Started with Jupyter” lesson, geared toward scientists without the assumption of a programming background.

Activity 3: Run a Jupyter Notebook Yourself from the Browser#

Estimated time for activity: 15 minutes. It is an individual activity.

Let’s use an example from Project Pythia to showcase how computational notebooks can be used in science. Project Pythia is an education Hub for the geoscientific community. They have some great learning resources and example research notebooks that are developed and maintained by the community and are freely available.

In this activity, you will run pre-written Python code in a Jupyter Notebook from your browser to make plots related to the El Niño-Southern Oscillation (ENSO). You will use the open-source software package Xarray to read in sea surface temperature data from a global climate model (the Community Earth System Model - CESM) and create visualizations of ENSO events over roughly the last 20 years. The goal is to recreate the plot below, which shows the years and magnitude of the El Niño events in red and of the La Niña events in blue.

El Nino and La Nina plot.

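The notebook walks through every step in detail, but as a rough sketch of the kind of Xarray code involved - with hypothetical file and variable names, not the exact code from the lesson - the core calculation looks something like this:

import xarray as xr

# Open a (hypothetical) sea surface temperature file and select the
# Nino 3.4 region (5S-5N, 170W-120W, i.e. longitudes 190-240 on a 0-360 grid).
ds = xr.open_dataset("cesm_sst_data.nc")
nino34 = ds["tos"].sel(lat=slice(-5, 5), lon=slice(190, 240))

# Average over the region, then smooth with a 5-month rolling mean to
# highlight El Nino and La Nina events.
index = nino34.mean(dim=["lat", "lon"]).rolling(time=5, center=True).mean()

The real notebook includes additional steps (for example, computing anomalies relative to a climatology), but the overall pattern - open a dataset, subset it, reduce it, plot it - is the same.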

Follow These Steps:

  1. Navigate to the “Calculating ENSO with Xarray” lesson

  2. In the top right corner, hover your mouse over the rocket icon, and click on “Binder”. This will open the lesson as an executable Jupyter Notebook that runs on the Cloud. Note that it may take several minutes for the Notebook to get set up.

Rocket icon.
  3. After the Notebook loads, you should see something like the following. Note – this actually uses the JupyterLab view!

Expected interface.
  4. You can take a little time to breeze through the text and code in the Notebook, but keep in mind that this lesson assumes a lot of prior knowledge, so it’s ok if you don’t understand everything. You can still appreciate the nice plots you’re about to make!

  5. This notebook is missing one important line of code, so please add the following line at the top of the very first cell, above all of the import statements:

!pip install backports.tarfile

Thus, the first cell should look like this:

!pip install backports.tarfile
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import xarray as xr
from pythia_datasets import DATASETS
  6. You are now ready to run the notebook yourself! To do that, you can go to the “Run” menu in the upper left of the JupyterLab window and choose “Run All Cells”:

Run all Cells Command.
  7. This should only take a few seconds, and if you scroll down, you can view a couple of nice visualizations that you just created:

Expected plot (part 1). Expected plot (part 2).
  8. Take some time to look through the Notebook a bit more closely. You will see that there is text (including headers, links, and even a table right at the start!), code, and figures integrated together. This is just one example of how scientists use computational notebooks for their research.

You can peruse more of the Project Pythia Python learning resources via their Foundations Book, and you can view more advanced example research workflows in the geosciences that use computational notebooks (which they call “Cookbooks”) to see more examples of how notebooks are used in science. If you are interested in the geosciences, you can even contribute your own notebook if you have a notebook you’d like to share!

Computing Platforms#

We use the term “computing platform” to refer to the computational machine used to run code. There are many different computing platforms that you can choose from, each having its own pros and cons. Here is an overview of three computing options:

Personal Computer (e.g. a laptop)

Pros:

  • Convenient - Can run computations when and where you choose

  • Can tailor the software environment to be exactly what you need

  • Don’t have to share your computing resources

Cons:

  • Has limited computational power

  • Requires downloading data and software

High-Performance Computing (HPC)

Pros:

  • High computational power

Cons:

  • Typically owned and run by a particular institution - may need to be affiliated with that institution to gain access to their HPC

  • May have to wait significant amounts of time to run your code since they are typically shared across many people and groups

  • Need significant funds to build an HPC

Cloud Computing

Pros:

  • High computational power

  • Minimal wait times to run code

  • Typically accessible to anyone with an internet connection

  • On-demand pricing options - You only have to pay for what you use

Cons:

  • High cost per computation

  • Lack of transparency in costs - E.g. data usage can cost a significant amount from different Cloud regions, but it may not always be clear which region your data and compute are in

  • May require some extra knowledge in how Cloud computing works

Examples of Cloud providers:

  • Amazon Web Services (AWS)

  • Google Cloud

  • Microsoft Azure

Many data providers, especially of large datasets, are migrating their data to the Cloud to increase accessibility and to make use of the large storage capacity that the Cloud provides. For instance, NASA Earthdata (which houses all NASA Earth science data) is now using AWS to store the majority of its data. Many Cloud providers also have a number of publicly available datasets, including Google Cloud and AWS.

When choosing a computing platform, it is important to consider where your datasets are saved and how big they are. For instance, when working with small datasets, it is often preferable to use a personal computer, since data download will take minimal time and large computing resources likely aren’t needed. When working with large datasets, however, it is best to minimize the amount of downloading and uploading of data, as this can take significant amounts of time and internet bandwidth. If your large datasets are already stored on the Cloud, it is typically best to use Cloud resources for the computation as well; the same logic applies to data stored on an HPC system.

Additional Tools#

Software Repository vs Archive#

Software repositories and archives provide centralized locations to store and share software, but there are some important key differences between them that we will discuss in this section.

A software repository is a dynamic and collaborative space where developers work on the latest code, making it the heart of ongoing software development and version control. It houses actively maintained codebases, which encourages collaboration and continuous, often community-driven, improvement.

Conversely, a software archive is a static storage where stable and thoroughly tested software releases are kept. Users access these archives to obtain reliable versions of software, ensuring stability and reliability in their applications. Furthermore, package managers facilitate the installation of specialized software archives within an application or an operating system. Understanding the difference between these two is crucial for effective software development and distribution.

Storage and version control.

GitHub and Bitbucket, both built around Git, are popular choices for hosting software repositories.

Repository

  • Is a location for sharing code.

  • Often uses version control systems like Git, Mercurial, and Subversion to track changes.

  • Typically contains the latest development version (sometimes called the “master” or “trunk”) of a software project, which can be actively worked on by developers.

  • Used for collaborative software development and code sharing among a team or a community of developers.

Important note: strictly speaking, a repository is nothing more than a place for hosting code. These days, a version control system and a hosted repository often come bundled together, but it is important to understand the distinction: some websites are purely drop boxes for code executables or zip files of source code, with no version control at all.

Archive

  • Often used for distribution and long-term preservation of software.

  • A storage system that contains specific, stable releases or versions of software, compiled binary packages, or source code releases.

  • Users typically download software from an archive to install and use it on their systems.
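One way to feel the difference is to compare how you would obtain the same package from each kind of source. Using the Xarray package from the activity above as an example (the commands are written notebook-style, like the “!pip install” line earlier, and are illustrative rather than required):

!pip install xarray                                # a stable, tested release from PyPI, a software archive
!git clone https://github.com/pydata/xarray.git    # the actively developed repository, with full version history

The first command is what most users want; the second is what a contributor to the project would typically start from.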

Containers#

A software container is a standalone, executable package that includes everything needed to run a piece of software: the code, runtime, system tools, environment settings, and libraries. Containers are isolated environments that hold the application as well as anything needed to run it, ensuring consistency and portability across different computing environments. A container is a helpful tool that can provide efficiency, scalability, and ease of deployment. Some examples of widely used container and container-orchestration tools are Docker, Kubernetes, and Apache Mesos.

Key Takeaways#

In this section, you learned:

  • The usefulness of digital tools that manage and house open code and foster collaboration.

  • How version control systems like Git and platforms like GitHub can increase collaboration and management of code.

  • Some common tools for editing and running open code, including integrated development environments (IDEs) like Visual Studio Code and computational notebooks like Jupyter Notebooks.

  • The difference between software repositories and archives, and also how software containers can help with the sharing and reproducibility of code.


Section 5: Planning for Open Science: From Theory to Practice#

This section focuses on the tools available for sharing research products. It begins with tools for open publications and how to find them, then discusses tools for reproducibility, and closes with additional tools for open results, including project management and reference management. Journals are a tool for sharing your results, and these are discussed in more detail on Day 5 - Open Results.

Tools for Open Publications#

Pre-Prints#

Open science tools can be used for writing and for producing content such as data management plans, presentations, and pre-prints. Pre-prints are early versions of research papers that are shared publicly before they are published in scientific journals. In some fields they are shared prior to peer review, while in others they are shared only after peer review and prior to publication. They are a vital component of open science content creation, as they promote transparency, rapid dissemination of knowledge, and collaboration among researchers.

By sharing pre-prints, scientists can receive feedback from the global research community, refine their work, and rapidly communicate their findings. This accelerates the pace of scientific discovery and ensures that valuable research is accessible to a broader audience, which aligns with the principles of open science.

Pre-prints have gained particular significance during the COVID-19 pandemic, where they played a crucial role in rapidly sharing information about the virus and its effects, emphasizing their importance in advancing science and public health. Fundamentally, pre-prints are important to open science. Consider the following highlights:

  1. Rapid Dissemination: Pre-prints enable researchers to swiftly share their findings with the scientific community and the public, sometimes within days of completing their research. This swift dissemination is particularly beneficial when dealing with urgent or rapidly evolving topics.

  2. Peer Review: While pre-prints are not peer-reviewed, they often undergo a form of community review. Researchers and experts can provide feedback and constructive criticism, helping authors improve their work before formal journal publication.

  3. Variety of Fields: Pre-prints are not limited to any specific scientific discipline. They are used in fields ranging from medicine and biology to physics and social sciences, making them a versatile tool for disseminating research.

  4. Versions and Citations: Pre-prints can have different versions, and the final peer-reviewed paper may differ. Researchers are encouraged to cite pre-prints when discussing ongoing research, allowing for transparency in the academic discourse.

  5. Free Access: Pre-prints are typically freely accessible to anyone with an internet connection. This open access promotes equality and inclusivity in science, enabling researchers from various backgrounds and institutions to engage with the latest research.

  6. Not a Replacement for Peer Review: Although pre-prints are valuable tools for early sharing and collaboration, they are not a substitute for a formal peer-reviewed publication. Researchers and readers should examine pre-prints with the understanding that they have not undergone the rigorous peer review process that journals provide.

Pre-prints are typically hosted on dedicated pre-print servers for different scientific fields. Examples include: arXiv (physics, mathematics), bioRxiv (biology), medRxiv (medicine), and many others. These platforms help organize and facilitate pre-print sharing. The OSF provides a service for searching over multiple preprint servers.

Remember, pre-prints play a significant role in open science by promoting rapid, transparent sharing of research findings across various scientific domains. They offer a valuable platform for researchers to disseminate their work and gather feedback, ultimately advancing scientific knowledge.

Discover an Open Access Journal to Share Your Results#

A common way to share a paper is to pick a journal that is already fully open-access and adopt its license. One way to discover open journals is by using the Directory of Open Access Journals (DOAJ).

The DOAJ provides a searchable index of all known open-access journals and articles. Together with Sherpa Romeo, a related service that summarizes publishers’ open-access policies, it is a useful tool in the early stages of research planning, helping a researcher determine which journals to consider when the time comes to publish their results.

DOAJ Interface.

Activity 4: Identify an Open-Access Journal#

Estimated time for activity: 7 minutes. It is an individual activity.

To become more familiar with the DOAJ, visit https://doaj.org/ and search for Frontiers in Computational Neuroscience, published by Frontiers Media S.A. Once you select the journal, you can see the costs to publish, details about licensing, author retention rights, time to publication, and other details.

Once you have found the journal, answer the following questions:

  1. When did it begin publishing as open access?

  2. What license is used for the publications?

  3. What rights do the authors retain in their publications?

Note: If a journal does not offer any open access, it will not appear in the search results. Also, because the DOAJ has strict criteria for being listed in its directory, it is unlikely you will find predatory publishers listed here, either.

Tools for Reproducibility#

The National Academies Report (2019) defined reproducibility and replicability as follows:

  • Reproducibility means obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.

  • Replicability means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

The pursuit of reproducibility aims to ensure that researchers reach the same result when following the same steps, and to enable researchers to copy an environment and build upon a result by editing that environment to apply it to a similar problem. This gives others the ability to build directly on previous work and get more science out of the same amount of funding.

Tools to support reproducibility in research outputs:

  • Jupyter Notebooks - A web application for creating and sharing computational documents. It offers a simple, streamlined, document-centric experience.

  • Jupyter Books - Build beautiful, publication-quality books and documents from computational content.

  • R Markdown - Produces documents that are fully reproducible. Use a productive notebook interface to weave together narrative text and code to produce elegantly formatted output.

  • Binder - Create custom computing environments that can be shared and used by many remote users.

  • Quarto - Combine Jupyter notebooks with flexible options to produce production quality output in a wide variety of formats.
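The tools above are full-featured platforms, but the core idea behind computational reproducibility can be illustrated with a few lines of ordinary Python: fix anything random and record the software versions you used, so that someone rerunning the same code on the same data gets the same numbers. This is a generic sketch, not the workflow of any specific tool listed above.

import random
import sys

import numpy as np

# Fix the random seeds so a rerun produces identical "random" numbers.
random.seed(42)
np.random.seed(42)

sample = np.random.normal(size=1000)
print("mean of simulated sample:", sample.mean())

# Record the software versions alongside the result, so others can
# recreate the same computational environment.
print("python:", sys.version.split()[0])
print("numpy:", np.__version__)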

Note: As you might have noticed, a lot of open science tools require intermediate to advanced skills in data and information literacy and coding, especially if handling coding-intensive research projects. One of the best ways to learn these skills is through engaging with the respective communities, which often provide training and mentoring.

Additional Tools for Open Results#

Tools for Open Project Management#

Advances over the past few decades in tools that manage research projects and laboratories have helped meet the ever-increasing demand for speed, innovation, and transparency in science. Such tools are developed to support collaboration, ensure data integrity, automate processes, create workflows, and increase productivity.

Research groups can now use project management tools for highly specialized efforts. They use existing platforms or develop their own software to share materials within the group and manage projects and tasks. Platforms and tools finely tuned to researchers’ needs (and frustrations) are available, often built by scientists, for scientists. To explore a few examples, let’s turn to experimental science.

A commonly used term, and research output, is the protocol. According to the University of Delaware (USA) Research Guide for Biological Sciences, a protocol is “a predefined written procedural method in the design and implementation of experiments. Protocols are written whenever it is desirable to standardize a laboratory method to ensure successful replication of results by others in the same laboratory or by other laboratories.”

In a broader sense, protocols comprise documented computational workflows, operational procedures with step-by-step instructions, or even safety checklists.

Protocols.io is an online and secure platform for scientists affiliated with academia, industry and non-profit organizations, and agencies. It allows users to create, manage, exchange, improve, and share research methods and protocols across different disciplines. This resource can improve collaboration and recordkeeping, leading to an increase in team productivity and facilitating teaching, especially in the life sciences. In its free version, protocols.io supports publicly shared protocols, while paid plans enable private sharing, e.g., for the industry.

Some of the tools are specifically designed for open science with an open-by-design concept from ideation on. These tools aim to support the research lifecycle at all stages and allow for integration with other open science tools.

As an example, the Open Science Framework (OSF), developed by the Center for Open Science, is a free and open-source project management tool. The OSF supports researchers throughout their entire project lifecycle through open, centralized workflows. It captures different aspects and products of the research lifecycle, including developing a research idea, designing a study, storing and analyzing collected data, and writing and publishing reports or papers.

The OSF is designed to be a collaborative platform where users can share research objects from several phases of a project. It supports a broad and diverse audience, including researchers who might not have been able to access certain resources due to historic socioeconomic disadvantages. The OSF also contains other tools in its own platform. For example, Neuromatch Academy uses OSF for storing data for the courses.

“While there are many features built into the OSF, the platform also allows third-party add-ons or integrations that strengthen the functionality and collaborative nature of the OSF. These add-ons fall into two categories: citation management integrations and storage integrations. Mendeley and Zotero can be integrated to support citation management, while Amazon S3, Box, Dataverse, Dropbox, Figshare, GitHub, and OneCloud can be integrated to support storage. The OSF provides unlimited storage for projects, but individual files are limited to 5 gigabytes (GB) each.”

Center for Open Science

Best Practices for a Project Registry#

It is common for different types of outputs to be preserved in different places to optimize discovery and reuse. An up-to-date Project Registry provides a quick overview of all the outputs. Best practices for managing a Project Registry include:

  • Create and update a Project Registry in conjunction with preserving outputs (as described above) in the form of a spreadsheet or other type of list. This can be one registry for the entire project that is updated, or a new registry for each milestone.

  • Include in each registry entry a description of the object, a preferred citation, the persistent identifier (e.g., DOI), and any other useful information supporting the project. For outputs that do not have a persistent identifier, provide a URL and description.

  • Preserve the Project Registry as a project component. Many funders require in their yearly reports a list of both peer-reviewed publications and all project outputs. The Project Registry can be provided to the funder during the reporting process, or used as a tracking tool to assist with completing the report.

Managing Citations Using Reference Management Software#

Keeping track of every paper you reference, every dataset you use, and every software library you build on is critical. A single paper might cite dozens of references, and each new thing you produce only adds to that list. Reference management software can help you manage these references and automatically create a list of citations in whatever format you need (BibTeX, Word, Google Docs, etc.).

While you are writing up results, keeping track of references and creating a correctly formatted bibliography can be overwhelming. Management software can keep track of references and can be shared with colleagues who are also working on the document.

Some of the common capabilities of reference management software are:

  • Keep a database of article metadata

  • Import article metadata from PDFs

  • Track datasets and software versions and DOIs

  • Create formatted references and bibliography for many different journal styles

Examples of reference management software include:

  • Mendeley

  • EndNote

  • Zotero

Open Highlight: Zotero#

Zotero helps manage software, data, publication metadata, and citations through a drag-and-drop interface. Researchers can use the tool to automatically generate citation files (for example, in BibTeX format).

Pros:

  • Open Source

  • Drag and Drop PDFs to import metadata

  • Word + Browser plugins

  • Export citations to BibTeX

Zotero Interface.

Key Takeaways#

In this section, you learned:

  • Benefits of preprints and resources for open-access journals.

  • Tools for reproducibility and replication of your studies.

  • Additional tools that are available to help manage open results, including project management and reference management tools.


Summary#

Throughout this day, you learned about some of the concepts and tools that support the discovery and use of open research, that can be used to make data and software, and that can be used to share your results. These included:

  • The foundational elements of open science, which include research products such as data, code, and results.

  • Resources used to discover and assess research products for reuse, including repositories, search portals, publications, documentation, metadata, and licensing.

  • Making and sharing data that employs the FAIR principles by incorporating a data management plan, using persistent identifiers and citations, and utilizing the appropriate data formats and tools for making data and sharing results.

  • The use of the tools needed for the development of software, including source code, kernels, programming languages, third-party software, and version control.

  • The tools and documentation types used for publishing and curating open software.

  • Resources for sharing research products, including preprints, open-access publications, reference management systems, and resources to support reproducibility.