Any high-throughput sequence data (WGS, WXS, RNA-Seq, etc.) in BAM, CRAM, or FASTQ format. Sequence data should be submitted only after: 1) you have received an email with an attached sequence metadata file containing the registered subject and sample IDs, and consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and you have received an email asking you to upload sequences.
A dbGaP Sample is defined as the ID of the final preps submitted to dbGaP by a genotyping center, a sequencing group, or to an NCBI resource, such as GEO or GenBank. A single subject may be mapped to multiple samples, but a single sample should not be mapped to multiple subjects unless the samples are pooled.* For example, if one subject (SUBJECT_ID) provided one sample, and that sample was processed to generate 2 sequencing runs or 1 sequencing and 1 genotyping array run, the data file would show two rows, both using the same subject ID, but having 2 unique sample IDs.
*Please inquire about pooled samples if applicable. This would only apply to pooled samples that belong to consented subjects. If the samples are pooled from controls that are publicly available, there is no need for marking the pooled samples, and a single sample ID may be assigned.
Each sample should be submitted with a single, unique, de-identified sample ID. Sample IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the sample ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SAMPLE_ID in one file and SAMPLE_NAME in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP sample ID that will be included in the final dump files along with the submitted sample ID.
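The character and zero-padding rules above can be checked before submission. The following is a minimal sketch, not a dbGaP tool: the function name sample_id_problems is hypothetical, and it treats any all-digit ID as an "integer" for the zero-padding rule.

```python
import re

# Characters permitted by the rule above: English letters, digits, . - _ @ #
_ALLOWED = re.compile(r"[A-Za-z0-9.\-_@#]+")


def sample_id_problems(sample_id: str) -> list[str]:
    """Return the rule violations found in one submitted ID (empty list if none)."""
    if not sample_id:
        return ["ID is empty"]
    problems = []
    if not _ALLOWED.fullmatch(sample_id):
        # Spaces and any other character outside the allowed set land here.
        problems.append("contains characters outside letters, digits, . - _ @ #")
    # Assumption: an all-digit ID counts as an integer, so "007" is zero padded,
    # while a string such as "0AB12" is not.
    if sample_id.isascii() and sample_id.isdigit() and len(sample_id) > 1 and sample_id.startswith("0"):
        problems.append("integer ID is zero padded")
    return problems
```

For example, sample_id_problems("AB 12") reports the disallowed space, while sample_id_problems("0AB12") returns an empty list because the ID is a string rather than an integer.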
Subject IDs submitted to dbGaP may be randomly assigned or may be consecutive numbers without any identifying information (i.e., the submitted Subject ID should not be based on the study person ID or any personal identifiers such as subject's birth date, health record number, or name). The same applies to sample IDs.
A Dataset (DS) file is a rectangular table of data values, subject/sample IDs, and variables, to be submitted either in .txt or .xlsx format, with .txt being the preferred format. There are 5 types of datasets required for submission:
Each row contains phenotypes of one Subject or attributes of one Sample. Following the first row (column headers), each subsequent row will reflect data of one subject or sample, depending on the type of file.
The SSM is a mapping of SUBJECT_IDs (consented subjects and their phenotype data) to SAMPLE_IDs. This list of SAMPLE_IDs is an assertion of the samples that will be submitted in the molecular data. Open the templates under Phenotype_Data: 3a_SSM_DS.txt and 3b_SSM_DD.xlsx
The first column must be the IDs of the subjects. Enter only SUBJECT_IDs that are linked to SAMPLE_IDs with submitted molecular data. Each subject listed in the SUBJECT_ID column must either be consented (CONSENT>0) or be a publicly available control (CONSENT=0) in the Subject Consent DS. For SUBJECT_IDs with multiple types of molecular data (e.g., SNP array data, RNA expression data, sequencing data), use multiple rows with identical subject IDs but distinct sample IDs. See SUBJECT_ID in Glossary for full requirement details.
The second column must be the IDs of the samples. The de-identified SAMPLE_IDs in this column must be identical to those used in the molecular data (PLINK, VCFs, etc) and sequence metadata. Different sample runs or aliquots of the same sample should be identified by different SAMPLE_IDs, but the same SUBJECT_IDs. Likewise, intended duplicates should also be identified by different SAMPLE_IDs, but the same SUBJECT_IDs. Sample IDs mapping to a public NCBI resource (GEO, GenBank, public SRA) should also be included. The SAMPLE_ID column should not have any repeating IDs. See SAMPLE_ID in Glossary for full requirement details.
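The column rules for the SSM can be sketched as a small pre-submission check. This is an illustration only: check_ssm is a hypothetical helper, and it assumes the .txt file is tab-delimited with a single header row. Columns are read by position (first = subject, second = sample), so the check does not depend on the variable names chosen.

```python
import csv
import io
from collections import Counter


def check_ssm(text: str) -> list[str]:
    """Check SSM text: every row needs both IDs, and no SAMPLE_ID may repeat."""
    # Assumption: tab-delimited text; blank lines are ignored.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    data = rows[1:]  # skip the header row
    problems = []
    for n, row in enumerate(data, start=1):
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            problems.append(f"data row {n}: missing subject or sample ID")
    # A SUBJECT_ID may repeat (one subject, many samples); a SAMPLE_ID may not.
    counts = Counter(row[1] for row in data if len(row) >= 2 and row[1].strip())
    for sample_id, count in counts.items():
        if count > 1:
            problems.append(f"SAMPLE_ID {sample_id!r} is repeated {count} times")
    return problems
```

A subject with two samples (two rows, same subject ID, different sample IDs) passes, whereas two rows that reuse one sample ID are reported.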
Yes, the SAMPLE_ID can be the same as the SUBJECT_ID, as long as samples that belong to the same person share the same SUBJECT_ID. Please submit an SSM even if each person only has 1 sample and the IDs are identical for the SUBJECT_ID and SAMPLE_ID.
De-identified monozygotic twin IDs should indicate monozygotic twins and multiples of the same family. The MZ_TWIN_ID column should distinguish sample duplicates from samples of monozygotic twins. Monozygotic twins and multiples should be assigned the same MZ_TWIN_ID, FATHER_ID, and MOTHER_ID, but different SUBJECT_IDs. For dizygotic twins and all other individuals, the MZ_TWIN_ID column should be left blank. If you wish to identify dizygotic twins, an additional variable may be included in the subject phenotypes DS.
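The twin rule above can be expressed as a consistency check over pedigree rows. This is a rough sketch under assumptions: check_mz_twins is a hypothetical helper, each row is a dict, and the columns are named SUBJECT_ID, FATHER_ID, MOTHER_ID, and MZ_TWIN_ID as in the paragraph above.

```python
from collections import defaultdict


def check_mz_twins(rows: list[dict]) -> list[str]:
    """Check that rows sharing an MZ_TWIN_ID share parents but not SUBJECT_IDs."""
    groups = defaultdict(list)
    for row in rows:
        twin_id = (row.get("MZ_TWIN_ID") or "").strip()
        if twin_id:  # blank means "not a monozygotic twin or multiple"
            groups[twin_id].append(row)
    problems = []
    for twin_id, members in groups.items():
        subjects = [m["SUBJECT_ID"] for m in members]
        parents = {(m["FATHER_ID"], m["MOTHER_ID"]) for m in members}
        if len(set(subjects)) < len(subjects):
            problems.append(f"MZ_TWIN_ID {twin_id}: a SUBJECT_ID is listed more than once")
        if len(parents) > 1:
            problems.append(f"MZ_TWIN_ID {twin_id}: members do not share the same parents")
    return problems
```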
Metadata around the experiment or study and annotations that are necessary to reproduce any published table or analysis must be included with genomic data submissions. In particular, data pertinent to the interpretation of genomic data -- such as associated phenotype data (e.g., clinical information), exposure data, relevant metadata, and descriptive information (e.g., protocols or methodologies used) -- are expected to be shared. To avoid user questions, make sure to include self-reported RACE and relevant dates (e.g., birth, diagnosis, sample collection) written as years or normalized to a set point in time, along with any phenotypes, measured or collected data that are described in your Study Description. For the Subject Phenotypes, it would be data relevant to the individual person. For the Sample Attributes, it would be data relevant to the sample derived from the person. For instance, do not list the RACE variable in the Sample Attributes, since RACE is stable for a person across samples. For variables like TREATMENT, the right table depends on the study design: if the person was treated only once and data was collected, TREATMENT could belong in the Subject Phenotypes table; if TREATMENT was completed multiple times and a sample was extracted each time, it would be better to track TREATMENT in the Sample Attributes table.
The Sample Attributes DS includes measured and/or descriptive traits per individual sample (not person). A person may be represented by multiple samples. Therefore, the primary ID in this file is the SAMPLE_ID. Open the templates under Phenotype_Data: 6a_SampleAttributes_DS.txt and 6b_SampleAttributes_DD.xlsx
Only include SAMPLE_IDs that are listed in the subject sample mapping (SSM) DS and belong to SUBJECT_IDs that have CONSENT>0 in the subject consent (SC) DS. SAMPLE_IDs belonging to CONSENT=0 SUBJECT_IDs should not appear in the Sample Attributes DS file. The SAMPLE_ID should use the exact same syntax used for the SAMPLE_ID listed in the SSM. For example, '0AB12' is not the same as 'AB12', nor is '123-1' the same as '123_1'. Each SAMPLE_ID should be represented by 1 row of data in the DS. See SAMPLE_ID in Glossary for full requirement details.
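These cross-file rules can be sketched as a lookup against the SSM and the Subject Consent DS. The helper below is hypothetical, and it assumes the SSM has already been loaded as a SAMPLE_ID to SUBJECT_ID mapping and the consent file as a SUBJECT_ID to integer CONSENT mapping. IDs are compared as exact strings, so '0AB12' and 'AB12' are different samples.

```python
def check_sample_attribute_ids(
    attribute_sample_ids: list[str],
    ssm: dict[str, str],
    consent: dict[str, int],
) -> list[str]:
    """Check Sample Attributes SAMPLE_IDs against the SSM and consent values."""
    problems = []
    seen = set()
    for sample_id in attribute_sample_ids:
        if sample_id in seen:
            # One row per SAMPLE_ID; skip this check for longitudinal tables.
            problems.append(f"{sample_id}: more than one row")
            continue
        seen.add(sample_id)
        if sample_id not in ssm:  # exact string match, no normalization
            problems.append(f"{sample_id}: not listed in the SSM")
            continue
        subject_id = ssm[sample_id]
        if consent.get(subject_id, 0) <= 0:
            problems.append(f"{sample_id}: subject {subject_id} does not have CONSENT>0")
    return problems
```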
The NCBI BioSample database contains descriptions of biological source materials used in experimental assays. Each of your samples will be assigned a BioSample accession number and will thus be searchable through BioSample. The first three variables below must be included to provide meaningful data for each sample's BioSample entry. HISTOLOGICAL_TYPE should only be included if applicable.
Most institutes request all data pertinent to the interpretation of genomic data, such as clinical information, exposure data, and relevant metadata pertaining to the sample. Please note that the template provided (6a_SampleAttributes_DS.txt) is based on a cancer study, and the variables listed may be useful for cancer studies. However, if your study is not a cancer study, please do not include the cancer variables. Instead, submit additional sample attribute variables that will provide a greater understanding of the study. For example: sample collection date; sample extraction method and date; batch and center effects; sample plate or well number; sample run date; sample QA results; and sample affection status (e.g., psoriatic skin sample vs. non-psoriatic skin sample from a case subject who has psoriasis). Relevant dates (e.g., sample collection date) that are directly tied to a person should be written as years or normalized to a set point in time. Do not include months and days directly tied to the person, which are considered HIPAA sensitive. For the algorithm dbGaP uses to find HIPAA-sensitive dates, see: HIPAA.
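The two date options mentioned above (year only, or normalized to a set point in time) can be illustrated as follows. This is a sketch rather than dbGaP's own algorithm; the function names and the choice of reference date are assumptions for illustration.

```python
from datetime import date


def year_only(event: date) -> int:
    """Keep only the year, dropping the month and day."""
    return event.year


def days_since(event: date, reference: date) -> int:
    """Express a date as a day offset from a chosen reference point.

    The reference (for example a study baseline) is a hypothetical choice and
    should itself not be released as a full date.
    """
    return (event - reference).days


collection = date(2015, 3, 14)
baseline = date(2015, 1, 1)  # hypothetical reference point
print(year_only(collection))             # 2015
print(days_since(collection, baseline))  # 72
```

Either value can then replace the full sample collection date in the DS file.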
You may submit multiple sample attributes DS/DD. You may split sample attributes files to separate them by race/ethnicity, cohort, collection period, etc. Each of the sample attributes files should have SAMPLE_ID as the primary ID. The BioSample required variables should appear only once per SAMPLE_ID, and the values for the BioSample required variables should not conflict. For example, a SAMPLE_ID cannot be marked as both TUMOR and non-TUMOR. In this case, we would request that an additional SAMPLE_ID be created. If this is not possible, please contact the dbGaP phenotype curator.
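When sample attributes are split across several files, the "only once per SAMPLE_ID, no conflicts" rule can be checked across all of them together. The sketch below is illustrative: find_biosample_conflicts is a hypothetical helper, each file is assumed to be loaded as a list of row dicts, and the list of BioSample required variable names is passed in by the caller (IS_TUMOR in the usage note is only an example name).

```python
def find_biosample_conflicts(files: list[list[dict]], required_vars: list[str]) -> list[str]:
    """Report required variables that repeat or conflict for a SAMPLE_ID."""
    first_seen = {}  # (SAMPLE_ID, variable) -> first value encountered
    problems = []
    for rows in files:
        for row in rows:
            for var in required_vars:
                if var not in row:
                    continue  # this file does not carry the variable
                key = (row["SAMPLE_ID"], var)
                if key not in first_seen:
                    first_seen[key] = row[var]
                elif first_seen[key] != row[var]:
                    problems.append(
                        f"{key[0]}: conflicting values for {var} "
                        f"({first_seen[key]!r} vs {row[var]!r})"
                    )
                else:
                    problems.append(f"{key[0]}: {var} appears more than once")
    return problems
```

For instance, if one file marks sample S1 with IS_TUMOR "Y" and another marks it "N", the function reports a conflict for S1.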
When each SAMPLE_ID has a series of measurements, or the data is longitudinal, this table may list a SAMPLE_ID multiple times. We would treat this as a longitudinal dataset, where SAMPLE_ID + [variable] are the variables that make the row unique. Mark an "X" under the UNIQUEKEY column for those variables in the corresponding DD. In this case, we recommend submitting the BioSample required variables in a separate sample attributes DS/DD.
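Whether a chosen set of UNIQUEKEY variables really makes each row unique can be verified before submission. This is a minimal sketch; duplicate_keys is a hypothetical helper, rows are assumed to be dicts, and VISIT in the usage note stands in for whatever second key variable the study uses.

```python
from collections import Counter


def duplicate_keys(rows: list[dict], key_columns: list[str]) -> list[tuple]:
    """Return the key combinations that occur on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]
```

For example, with key columns ["SAMPLE_ID", "VISIT"], two rows for sample S1 at visits 1 and 2 are fine, but two rows for S1 at visit 1 are returned as a duplicate key, which indicates that an additional variable is needed in the unique key.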