Existing data

Before investing time, effort and resources in generating new data, check what data already exist in your field. Reusing data increases its value and minimizes redundancy.

Sources of existing data

Bibliographic research

Start with a literature search. If you find a relevant publication whose data are not available, you can contact the authors and request access to their data. If the data cannot be obtained, or you do not find any relevant publication, you can look for existing data in repositories. Existing data can also be described in data papers. Data papers provide peer-reviewed descriptions of publicly available datasets or databases and link to the data source in repositories. They can be published in dedicated journals, such as Scientific Data, or as a specific article type in conventional journals.

Data repositories

Repositories or databases can also contain data that are not linked to any manuscript, article or paper. Repositories can be general, data-type-specific or discipline-specific.

ELIXIR Deposition Databases for Biomolecular Data

ELIXIR recommends the following databases for specific data types:

Functional genomics: ArrayExpress
Computational models of biological processes: BioModels
Descriptions and metadata about biological samples used in research: BioSamples
Descriptions of biological studies: BioStudies
Personally identifiable genetic and phenotypic data resulting from biomedical research projects: EGA
Electron microscopy density maps of macromolecular complexes and subcellular structures: EMDB
Nucleotide sequence information: ENA
Genetic variation data from all species: EVA
Molecular interaction data: IntAct
Metabolomics experiments and derived information: MetaboLights
Biological macromolecular structures: PDBe
Proteomics experiments and derived information: PRIDE

Other lists of recommended repositories

Scientific journals and communities have compiled a number of lists and registries of recommended repositories, searchable by discipline and other characteristics.

Nature - Scientific Data: Recommended Data Repositories
Repositories used by Flemish research infrastructures: https://mronocoroni.shinyapps.io/20200325/
FAIRsharing: Catalogue of databases
OpenAIRE: International search engines for academic and scientific research
DataCite: Registry of Research Data Repositories
Google Dataset Search or DataCite: search engines for locating datasets
The Omics Discovery Index (OmicsDI): knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics).

Before reusing existing data

Check if a licence is attached and if it allows you to reuse the data for your intended purpose.
Make sure that the dataset is well described with high quality metadata and documentation.
Verify the quality of the data. Look for evidence of quality control, or run your own quality checks before using the data.
Decide which version of the data you will use, if more than one exists.
- You can decide to always use the version that is available at the start of the project. In this case, you need to make sure that you and others who want to reproduce your results can still access that specific version at a later stage.
- You can update to the latest version whenever a new one comes out during your project. In this case, consider that you may need to redo all your calculations with the new version of the dataset, and make sure that everything stays consistent.
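One way to keep track of which version you actually used is to store a small provenance record next to the downloaded file. The sketch below is only an illustration, not a prescribed workflow: the function names and record fields are hypothetical, and it assumes the dataset is a single local file. It records the identifier, the version, the retrieval date and a SHA-256 checksum, so that you can later confirm the file is still the copy your results were based on.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path: Path, identifier: str, version: str, retrieved: date) -> dict:
    """Describe which copy of a dataset was used (field names are hypothetical)."""
    return {
        "file": path.name,
        "identifier": identifier,
        "version": version,
        "retrieved": retrieved.isoformat(),
        "sha256": sha256_of(path),
    }


def is_unchanged(path: Path, record: dict) -> bool:
    """Check that the file on disk still matches the recorded checksum."""
    return sha256_of(path) == record["sha256"]
```

The record can be saved as JSON alongside your analysis. If `is_unchanged` returns False after a fresh download, the dataset has probably been updated and your calculations may need to be redone.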

How to cite an existing dataset

Complete citation

Author(s), Year, Dataset Title, Identifier, Repository, Version.

Short citation

Identifier, Version (if applicable).
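If you cite many datasets, it can help to build both citation forms from the same structured fields. The following is a minimal sketch of the two patterns above, not an official citation style: the class name, the "; " separator between authors and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Fields of the complete citation pattern; version is optional."""
    authors: List[str]
    year: int
    title: str
    identifier: str
    repository: str
    version: Optional[str] = None

    def complete(self) -> str:
        """Author(s), Year, Dataset Title, Identifier, Repository, Version."""
        parts = [
            "; ".join(self.authors),  # author separator is an assumption
            str(self.year),
            self.title,
            self.identifier,
            self.repository,
        ]
        if self.version:
            parts.append(self.version)
        return ", ".join(parts) + "."

    def short(self) -> str:
        """Identifier, Version (if applicable)."""
        parts = [self.identifier]
        if self.version:
            parts.append(self.version)
        return ", ".join(parts) + "."


# Entirely made-up example values, for illustration only.
citation = DatasetCitation(
    authors=["Doe J", "Smith A"],
    year=2020,
    title="Example dataset",
    identifier="E-MTAB-NNNN",
    repository="ArrayExpress",
    version="version 1",
)
print(citation.complete())
print(citation.short())
```

Keeping the version optional mirrors the "(if applicable)" in the short form: when no version is given, it is simply left out of both citations.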

Identifiers are machine-readable alphanumeric strings provided by repositories. Identifiers can be:

Accession numbers
example: E-MTAB-NNNN
DOIs
example: doi: 10.1038/d41586-018-03071-1
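Because identifiers are machine readable, a script can tell the two kinds apart before you record them. The sketch below rests on assumptions: the DOI pattern is a commonly used approximation rather than a formal definition, the accession pattern only covers the ArrayExpress-style form shown above (E-, four capital letters, then digits), and the function names are hypothetical. Other repositories use other accession formats.

```python
import re

# Approximate patterns, assumed for illustration (see lead-in).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
ARRAYEXPRESS_PATTERN = re.compile(r"^E-[A-Z]{4}-\d+$")


def normalize_doi(raw: str) -> str:
    """Strip surrounding whitespace and a leading 'doi:' or doi.org URL prefix."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    return text


def identifier_kind(raw: str) -> str:
    """Return 'doi', 'accession' or 'unknown' for an identifier string."""
    if DOI_PATTERN.match(normalize_doi(raw)):
        return "doi"
    if ARRAYEXPRESS_PATTERN.match(raw.strip()):
        return "accession"
    return "unknown"


print(identifier_kind("doi: 10.1038/d41586-018-03071-1"))
print(identifier_kind("E-MTAB-1234"))  # made-up number, format illustration only
print(identifier_kind("E-MTAB-NNNN"))  # placeholder, not a real accession
```

The placeholder E-MTAB-NNNN is reported as unknown because NNNN stands for digits that a real accession number would contain.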