Sample Collection Principles

Overview

The International Genome Sample Resource (IGSR) was established at EMBL-EBI in January 2015. The resource was established with three main aims, to:

Ensure maximal usefulness and relevance of the existing 1000 Genomes data resources
Extend the resource for the existing populations
Expand the resource to new populations

The first aim will start with the remapping of the existing low coverage and exome data to the new version of the human reference assembly, GRCh38. The second aim will bring together other functional and sequence data that has been generated on the 1000 Genomes cell lines, such as the Geuvadis RNA-Seq data and the high coverage and long read data that the 1000 Genomes Structural Variant group is continuing to generate, in order to present a uniform analysis set. The final aim is to expand the resource to new populations; the IGSR has been funded to support the addition of new populations to the 1000 Genomes dataset and this document describes the principles of that process.

The IGSR recognises that the current 1000 Genomes Project samples do not reflect all populations. An important aim for the IGSR is to expand the populations represented in the collection and to ensure that the public data represents the maximum possible population diversity. This will ensure that the 1000 Genomes dataset remains a valuable open resource for the community over the next five years. The IGSR will work with the groups who were unable to contribute samples to the 1000 Genomes Project prior to the completion of sample collection, and will investigate collaborations with other groups to ensure that population diversity gaps are filled.

Here we propose a process for oversight of consent, sample collection, data production and data and cell line dissemination. The IGSR wishes to ensure that any new population cohorts collected and their associated data allow for public data release, and broad use of the data and samples. In this way, the data can be made available alongside the HapMap and 1000 Genomes Project samples and data.

The IGSR has no funds to support sample collection or data generation. This aspect of the project will need to be self-funded by sample collection groups or funded by third parties. The IGSR is funded to provide ethical review and data coordination, as well as analysis and integration of new population collections established and accepted into the IGSR.

In order for a population collection to be accepted as part of the IGSR, it must meet the following criteria:

Confirmation that the Consent, Ethics Review and Sampling Process meet the criteria established for the 1000 Genomes Project. We suggest that the applicant confirm to the P3G-IPAC that these criteria have been sought prior to sample collection, as was done by the Samples and ELSI subgroup for 1000 Genomes Project sample sets. P3G-IPAC will then provide approval for acceptance for deposit in IGSR. Once approved, a three-letter code would be provided, as for HapMap or 1000 Genomes Project sample sets;
Recommendation by the IGSR Geographical and Population Advisory Board (GPAB) that the population represents a valuable addition to the IGSR dataset and expands the global diversity found in the data;
Collection of primary sequence data and deposition in the public sequence archives (GenBank, ENA). The IGSR will confirm that these data meet the quantity and quality criteria already established as part of the 1000 Genomes Project. The IGSR will undertake alignment and variant calling, and provide an integrated set of haplotypes that will include the new sample collection within the existing 1000 Genomes data. The IGSR will provide unrestricted public access for download as well as interactive use; and,
If possible, establishment of cell lines and deposition of them in an approved cell line repository, either Coriell or another equivalent repository.

The following information covers useful details for groups planning to create new sample collections. We cover details about educating local Institutional Review Boards/Research Ethics Committees (IRB/RECs) about the types of procedures that were used successfully for sample collection from other populations, in a way that allowed public data release and broad use of data and samples. This commentary includes considerations with respect to cell line production.

Consent process

The 1000 Genomes Samples and ELSI group determined that all consent forms for the 1000 Genomes Project were required to explicitly state the following items. A similar review would be undertaken for any new populations being added to the IGSR.

Extensive individual data from the study of the samples (but no individual identifiers or medical information) would be made publicly available in scientific databases on the Internet;
Samples and data would be labeled by population and comparisons among individuals and populations would be made;
Individual samples could not be withdrawn, and once data from the study of samples had been put in the database, the data could not be withdrawn;
Samples and data would be made available to many researchers around the world;
Samples and data would be used not only for the 1000 Genomes Project, but also for many other future projects (including gene expression studies, studies of population history and relatedness, etc.);
Samples and data would be used by academic, commercial, and government entities, and if such uses resulted in the development of commercially valuable products, participants would not share in the proceeds;
No individual results from the study of the samples would be returned, although general information about the project and about interesting new findings emerging from genetic research is provided periodically to the researchers who collected the samples, who are encouraged to share it with community members;
Individuals or communities would not have an opportunity to pre-approve future uses of the samples. For samples stored at the Coriell Institute, all proposed uses would be assessed to ensure that the samples will be used only in ways that are consistent with the terms of the informed consent. Similar assessments are expected when samples are stored and distributed by other cell line repositories;
Cell lines would be made, making it possible to generate a potentially unlimited amount of DNA that may last indefinitely;
The consent process for collecting the 1000 Genomes samples is described at /sites/1000genomes.org/files/docs/Informed%20Consent%20Background%20Document.pdf
A copy of the consent template that was used by the 1000 Genomes Project can be seen at /sites/1000genomes.org/files/docs/Informed%20Consent%20Form%20Template.pdf

Sampling process

The samples should be collected from adults.
In order to achieve 100 unrelated cell lines from a population, it is recommended to collect primary material from at least 130 unrelated individuals. Data from mother-father-adult child trios are valuable for assessing data quality and producing haplotypes, so collecting samples in trios and establishing cell lines from the offspring is strongly encouraged.
All needed IRB/REC approvals should be obtained, as well as any other needed permissions, such as from health ministries for export of the samples to other countries.
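
The collection target above implies some attrition between primary material and established cell lines. As a rough planning sketch, the helper below scales the recommended ratio (130 individuals collected for 100 unrelated cell lines) to other targets. The function name and the assumption that the ratio scales linearly are ours for illustration and are not part of the IGSR guidance.

```python
# Ratio taken from the recommendation above: collect primary material from
# at least 130 unrelated individuals to achieve 100 unrelated cell lines.
RECOMMENDED_COLLECTED = 130
RECOMMENDED_CELL_LINES = 100


def individuals_to_collect(target_cell_lines: int) -> int:
    """Scale the 130:100 recommendation to another target.

    Assumption (not stated in the guidance): the ratio scales linearly.
    The result is rounded up to a whole individual.
    """
    if target_cell_lines <= 0:
        raise ValueError("target_cell_lines must be positive")
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    scaled = target_cell_lines * RECOMMENDED_COLLECTED
    return -((-scaled) // RECOMMENDED_CELL_LINES)


if __name__ == "__main__":
    for target in (90, 100, 150):
        print(target, "cell lines ->", individuals_to_collect(target), "individuals")
```

Actual attrition will depend on local conditions, so a group may reasonably choose a larger margin than this ratio suggests.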

Sample collection review process

Local IRB/RECs need to provide the approvals for the ethical appropriateness of the sample collection.
P3G-IPAC will then review the consent and sample collection processes, and approve them as appropriate for this set of samples, similar to what the 1000 Genomes Samples and ELSI Group did for the 1000 Genomes samples.

Cell lines

An important component of the value of the HapMap and 1000 Genomes Project samples is that there are immortalized cell lines available that can be used to obtain essentially unlimited quantities of DNA. This is desirable for new sample collections as well. However, costs are not trivial. If possible, blood samples should be made into cell lines and distributed by the Coriell Institute or other central cell line repositories to any legitimate researcher. Distribution should be subject to a Materials Transfer Agreement (MTA) that prohibits reproductive cloning and may restrict other activities including redistribution.

Data production

The samples (including from any adult children in trios) should be genotyped across the genome using a high-density genotyping array with over 500,000 markers.
At least 90 unrelated samples should be sequenced. This sequencing can follow one of two strategies: low coverage whole genome sequencing (minimum 3x aligned non-duplicated coverage per sample) combined with exome sequencing (at least 70% of exome target bases covered to 30x or higher), which is the strategy used by the main 1000 Genomes Project; or high coverage whole genome sequencing (minimum 30x aligned non-duplicated coverage per sample).
Due to possible cell line artifacts, in some cases it may be better to generate sequences directly from blood DNA. We ask that the source of the DNA be included in the metadata regardless of whether from cell line or blood.
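
The thresholds above can be expressed as a simple pre-submission check. The following is a minimal sketch, assuming that per-sample coverage figures have already been computed elsewhere and that all samples passed in are unrelated. The class, field and function names are hypothetical and do not correspond to any IGSR tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted from the data production criteria above.
MIN_UNRELATED_SEQUENCED = 90
MIN_LOW_COVERAGE_WGS = 3.0        # x, aligned non-duplicated
MIN_HIGH_COVERAGE_WGS = 30.0      # x, aligned non-duplicated
MIN_EXOME_FRACTION_AT_30X = 0.70  # fraction of exome target bases at >= 30x
MIN_ARRAY_MARKERS = 500_000       # "over 500,000" is read as strictly greater


@dataclass
class SampleSequencing:
    """Hypothetical per-sample summary; field names are illustrative only."""
    sample_id: str
    wgs_coverage: float
    exome_fraction_at_30x: Optional[float] = None  # None if no exome data


def meets_low_coverage_strategy(s: SampleSequencing) -> bool:
    """Low coverage WGS plus exome sequencing."""
    return (
        s.wgs_coverage >= MIN_LOW_COVERAGE_WGS
        and s.exome_fraction_at_30x is not None
        and s.exome_fraction_at_30x >= MIN_EXOME_FRACTION_AT_30X
    )


def meets_high_coverage_strategy(s: SampleSequencing) -> bool:
    """High coverage WGS alone."""
    return s.wgs_coverage >= MIN_HIGH_COVERAGE_WGS


def collection_meets_criteria(samples: List[SampleSequencing], array_markers: int) -> bool:
    """True if enough samples satisfy either strategy and the array is dense enough."""
    passing = [
        s for s in samples
        if meets_low_coverage_strategy(s) or meets_high_coverage_strategy(s)
    ]
    return array_markers > MIN_ARRAY_MARKERS and len(passing) >= MIN_UNRELATED_SEQUENCED
```

The sketch accepts a mix of the two strategies within one collection, which is a simplification on our part; the criteria describe the strategies as alternatives, and the IGSR makes the final assessment of data quantity and quality.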

Data release

All sequence and genotype data from these samples should be deposited in open-access public databases such as the ENA or GenBank; the IGSR will support groups in achieving this aim. All samples should be registered in the EMBL-EBI or NCBI BioSample archives and this should include attributes such as the sample and DNA type (blood, cell line) studied, and measures of data quality.
Groups are encouraged to not collect phenotype data from the people providing samples. If phenotype data are collected, the researchers are encouraged to not distribute the phenotype data widely, but to share those data only with close collaborators. The distribution of any phenotype data should follow the consent form and be approved by the local IRB/REC. Participants may be less likely to enroll and IRB/RECs may be less likely to approve of studies where the phenotypes as well as the sequence data are made publicly available.
Another option is to provide the phenotype data in a controlled-access database such as dbGaP or the EGA. Note that the goal of collecting these samples is to provide public genetic data as a genetic reference resource – sample numbers are not compatible with association studies on whole-person phenotypes.
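
Before registering samples, a group may find it useful to check that each record carries the attributes mentioned above. The sketch below is illustrative only: the field names, the example values and the validation rules are our assumptions and do not reflect the actual BioSample attribute schema at EMBL-EBI or NCBI.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# DNA sources named in the data release guidance above.
ALLOWED_DNA_SOURCES = {"blood", "cell line"}


@dataclass
class SampleRecord:
    """Hypothetical pre-submission record; not the BioSample schema."""
    sample_name: str
    population_code: str  # three-letter code issued on approval
    dna_source: str       # "blood" or "cell line"
    quality_measures: Dict[str, float] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record looks complete."""
        problems = []
        if len(self.population_code) != 3 or not self.population_code.isalpha():
            problems.append("population_code should be a three-letter code")
        if self.dna_source not in ALLOWED_DNA_SOURCES:
            problems.append("dna_source should be 'blood' or 'cell line'")
        if not self.quality_measures:
            problems.append("at least one measure of data quality is expected")
        return problems


if __name__ == "__main__":
    # Example values are invented for illustration.
    record = SampleRecord("SAMPLE001", "XYZ", "blood", {"mean_coverage": 31.2})
    print(record.validate())
```

Phenotype fields are deliberately left out of this record, in line with the guidance that any phenotype data should not be released publicly alongside the sequence data.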

Costs

The IGSR includes no funding for sample collection or data production. The IGSR will work with the sample collection group to create alignments and variant calls integrated with the main 1000 Genomes datasets once sequence data is submitted to the public archives. Groups wishing to contribute the samples and data need to cover the costs of sample collection process review, sample collection, cell line transformation, and sequence and genotype data production.

This document was approved on the 30th June 2015
