CINECA Synthetic Cohort EUROPE UK1 referencing fake samples

Please note: This synthetic data set (with cohort “participants” / “subjects” marked with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or results. The purpose of this dataset is to aid development of technical implementations for cohort data discovery, harmonization, access, and federated analysis. In support of FAIRness in data sharing, this dataset is made freely available under the Creative Commons Licence (CC-BY). Please ensure this preamble is included with this dataset and that the CINECA project (funding: EC H2020 grant 825775) is acknowledged. This dataset (CINECA_synthetic_cohort_EUROPE_UK1) consists of 2521 samples which have genetic data based on 1000 Genomes data (https://www.nature.com/articles/nature15393), and synthetic subject attributes and phenotypic data derived from UK Biobank (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001779). These data were initially derived using the TOFU tool (https://github.com/spiros/tofu), which generates random values based on the UK Biobank data dictionary. Categorical values were randomly generated based on the data dictionary, continuous variables were generated from the distribution of values reported by the UK Biobank showcase, and date / time values were random. Additionally, we split the phenotypes and attributes into 4 main classes: general, cancer, diabetes mellitus, and cardiac. We assigned the general attributes to all the samples, and the cardiac / diabetes mellitus / cancer attributes to a proportion of the total samples. Once the initial set of phenotypes and attributes had been generated, the data were checked for consistency and, where possible, dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight data, and age at death was derived from date of death and date of birth.
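As an illustration of the dependent attributes mentioned above, the following is a minimal Python sketch of how BMI and age at death might be recomputed from independently generated values. The function names, the record field names, and the units (centimetres, kilograms) are assumptions made for this example; they are not taken from TOFU or from the dataset itself.

```python
from datetime import date


def bmi(height_cm: float, weight_kg: float) -> float:
    """Body mass index in kg/m^2 from height (cm) and weight (kg)."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def age_at_death(date_of_birth: date, date_of_death: date) -> int:
    """Completed years of life between date of birth and date of death."""
    if date_of_death < date_of_birth:
        raise ValueError("date of death precedes date of birth")
    years = date_of_death.year - date_of_birth.year
    # Subtract one year if the birthday had not yet been reached
    # in the year of death.
    if (date_of_death.month, date_of_death.day) < (
        date_of_birth.month,
        date_of_birth.day,
    ):
        years -= 1
    return years


# Hypothetical FAKE record; the field names are illustrative only.
record = {
    "height_cm": 170.0,
    "weight_kg": 65.0,
    "date_of_birth": date(1950, 6, 15),
    "date_of_death": date(2010, 3, 1),
}
record["bmi"] = round(bmi(record["height_cm"], record["weight_kg"]), 1)
record["age_at_death"] = age_at_death(
    record["date_of_birth"], record["date_of_death"]
)
print(record["bmi"], record["age_at_death"])  # 22.5 59
```

Recomputing dependent fields in this way, rather than generating them independently at random, is what keeps a record internally consistent: a randomly drawn BMI would generally not match the randomly drawn height and weight.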
These data were then loaded to the development instance of BioSamples (https://www.ebi.ac.uk/biosamples/), which accessioned each of the samples. The genetic data are derived from the 1000 Genomes Phase 3 release (https://www.internationalgenome.org/category/phase-3/). The genotype data consist of a single joint-call VCF file with called genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated via PLINK for these samples and genotypes. A variety of errors have been introduced into the genotype data to mimic real data and to serve as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling, and low call rates for a randomly chosen subset of samples, as well as deviations from Hardy-Weinberg equilibrium and low call rates for a random selection of variants. Additionally, 40 samples have raw genetic data available in the form of both BAM and CRAM files, including unmapped data. The gender of the samples in the 1000 Genomes data has been matched to the synthetic phenotypic data generated for these samples. The genetic data were then linked to the synthetic data in BioSamples and submitted to EGA.
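Two of the introduced error types, low call rates and deviations from Hardy-Weinberg equilibrium, can be illustrated with a small Python sketch. It assumes genotypes are encoded as 0, 1 or 2 copies of the alternate allele with None for a missing call; this encoding, the function names, and the toy values are assumptions for this example and do not describe the files in the dataset. Production pipelines would typically use a dedicated tool and often an exact test rather than the chi-square approximation shown here.

```python
from typing import Optional, Sequence

# 0, 1 or 2 copies of the alternate allele; None marks a missing call.
Genotype = Optional[int]


def call_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of non-missing calls for one sample or one variant."""
    if not calls:
        raise ValueError("no calls supplied")
    return sum(g is not None for g in calls) / len(calls)


def hwe_chi_square(calls: Sequence[Genotype]) -> float:
    """Chi-square statistic (1 degree of freedom) comparing observed
    genotype counts at one biallelic variant with Hardy-Weinberg
    expectations."""
    observed = [sum(g == k for g in calls) for k in (0, 1, 2)]
    n = sum(observed)
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * observed[0] + observed[1]) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    return sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )


# Toy variant with half of its calls missing.
print(call_rate([0, None, None, 0]))  # 0.5

# Genotype counts in exact Hardy-Weinberg proportions (25 / 50 / 25).
balanced = [0] * 25 + [1] * 50 + [2] * 25
print(hwe_chi_square(balanced))  # 0.0

# No heterozygotes at all: a strong deviation from equilibrium.
skewed = [0] * 50 + [2] * 50
print(hwe_chi_square(skewed))  # 100.0
```

A quality control pipeline under test would be expected to flag the samples and variants whose call rate falls below its chosen threshold, and the variants whose Hardy-Weinberg statistic exceeds its chosen cut-off. For reference, 3.84 is the chi-square critical value at the 0.05 level with 1 degree of freedom, although genome-wide analyses generally apply far stricter thresholds.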

Additional Information

Field	Value
Data last updated	April 14, 2026
Metadata last updated	April 14, 2026
Created	April 14, 2026
Format	https://publications.europa.eu/resource/authority/file-type/VCF
License	No License Provided
Name	CINECA Synthetic Cohort EUROPE UK1 referencing fake samples
Media type	https://www.iana.org/assignments/media-types/application/vcf
Compress format
Package format
Size	5.7 MiB
Hash
Hash Algorithm
Rights	https://ega-archive.org/studies/EGAS00001002472
Availability
Status
License	https://creativecommons.org/licenses/by-sa/4.0/
Access URL	https://ega-archive.org/studies/EGAS00001002472
Download URL
Release date
Modification date
Retention period
Temporal resolution
Spatial resolution in meters
Language	http://id.loc.gov/vocabulary/iso639-1/en
Documentation
Conforms to
Applicable legislation	http://data.europa.eu/eli/reg/2025/327/oj
Access services
URI	https://fdp.gdi.dkfz.de/distribution/66d12c3f-d5ec-4851-89f1-3cf720f998bb
Access url	https://ega-archive.org/studies/EGAS00001002472
Applicable legislation	['http://data.europa.eu/eli/reg/2025/327/oj']
Distribution ref	https://fdp.gdi.dkfz.de/distribution/66d12c3f-d5ec-4851-89f1-3cf720f998bb
Has views	False
Id	e192e415-d1c2-4143-b689-67665721b5a2
Language	['http://id.loc.gov/vocabulary/iso639-1/en']
License	https://creativecommons.org/licenses/by-sa/4.0/
Mimetype	https://www.iana.org/assignments/media-types/application/vcf
Name translated	{'en': 'CINECA Synthetic Cohort EUROPE UK1 referencing fake samples', 'nl': ''}
Package id	34f98d0f-877f-4943-8b73-e83e1bf42498
Position	0
Rights	{'en': 'https://ega-archive.org/studies/EGAS00001002472', 'nl': ''}
Size	5.7 MiB
State	active
Uri	https://fdp.gdi.dkfz.de/distribution/66d12c3f-d5ec-4851-89f1-3cf720f998bb
