
Downloading Data

The dbGaP Authorized Access System uses Aspera, a high-speed file transfer system, for client downloads. Aspera Connect must be installed on the machine used for downloading. Aspera Connect is an install-on-demand browser plugin, available for free on the Aspera website. On the software download page, make sure to select and install Aspera Connect rather than any other Aspera client product. Aspera Connect is available for Linux, Mac, and Windows platforms. In addition to the web user interface, Aspera Connect also includes a command-line ASCP executable utility.

The principal investigator (PI) of the project or downloaders designated by the PI can download the data as soon as the data access request is approved. The recommended way of downloading dbGaP data is using the “prefetch” utility available in the NCBI SRA toolkit.

The prefetch utility can download dbGaP non-SRA and SRA data files in bulk when a cart file is provided as an argument. It can also download the data of an individual SRA run when an individual SRR accession is provided as an argument. The documentation of prefetch can be found here.

The main steps of downloading with prefetch are as follows.

  1. Download and install Aspera Connect (see here for more information).
  2. Select and save data file information in a cart (.krt) file
    • (For SRA data download, in addition to bulk download with a cart file, prefetch can also be run with an individual SRA accession, which is often the preferred method for automated, script-driven downloads. See section 5 for more about this.)
    • Login to the dbGaP Authorized Access System using the eRA account login credentials. (Intramural NIH scientists and staff need their NIH email username and password).
    • Click on “My Requests” tab. The list of Approved Requests is under “Approved” sub-tab.
    • Find the table row of approved dataset, click on the link named “Request Files” in the “Actions” column.
    • On the “Access Request” page, different types of data files available for download are shown separately under different sub-tabs. To download non-SRA data, go to the “Phenotype and Genotype files” sub-tab and click on the “dbGaP File Selector” link. To download SRA data, go to the “SRA data (reads and reference alignments)” sub-tab and click on the “SRA RUN Selector” link.
    • Wait until the page has finished loading. Click on the “Help” icon at the top of the page to see instructions and information about the selector.
    • Add/remove files using the facets listed in the left panel facet manager. From the right panel file list, select/unselect files by checking/unchecking checkboxes in front of the file names.
    • Once the files are selected (checked), click on the “Cart File” button (on the upper part of the page) and save the cart file (.krt).
  3. Download and decrypt dbGaP data files
    • Download the latest version of the NCBI SRA Toolkit. Untar or unzip the downloaded toolkit file.
    • Follow this dbGaP Download Guide to download dbGaP data using the sratoolkit.
  4. Specific steps and commands
    • Before running the download commands below, make sure the dbGaP repository key (.ngc) and the cart files are ready.
    • Download a fresh dbGaP repository key (.ngc) file and reconfigure the toolkit with the command below.
    • $ /path-to-your-sratoolkit-installation-dir/bin/vdb-config -i
    • From the sratoolkit configuration interface, import the repository key.
    • Download dbGaP data files
    • Run the command below to download the files specified in the cart file.
    • $ /path-to-your-sratoolkit-installation-dir/bin/prefetch --ngc /path-to-ngc-file-dir/xxxxx.ngc /path-to-your-cart-file/xxxxx.krt
    • Please make sure the sratoolkit, ngc, and cart files are on the same disk drive.
    • Decrypt downloaded files
    • The downloaded dbGaP non-SRA files need to be decrypted before use. Run the command below to decrypt the files.
    • $ /path-to-your-sratoolkit-installation-dir/bin/vdb-decrypt --ngc /path-to-ngc-file-dir/xxxxx.ngc /path-to-top-level-download-dir/
  5. Compatibility issue with older versions of sratoolkit
    • If version 2.9.6 or older of the sratoolkit has been installed and used on the machine, the old toolkit settings need to be disabled before running the commands above, by renaming the settings file as shown below.
    • $ cd ~/.ncbi
    • $ mv user-settings.mkfg user-settings.mkfg.old
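
For script-directed downloads, the prefetch and vdb-decrypt commands in step 4 can be assembled programmatically. The following is a minimal Python sketch, not an official NCBI interface. It assumes the toolkit layout shown above (executables under `bin`) and uses only the `--ngc` option that appears in the commands on this page. The function names and example paths are hypothetical, and passing a single SRR accession together with `--ngc` is an assumption based on the note in step 2.

```python
from pathlib import Path
import subprocess


def build_prefetch_cmd(toolkit_dir, ngc_file, target):
    """Return the prefetch argument list.

    `target` is either a cart file path or a single SRA run accession.
    """
    prefetch = Path(toolkit_dir) / "bin" / "prefetch"
    return [str(prefetch), "--ngc", str(ngc_file), str(target)]


def build_decrypt_cmd(toolkit_dir, ngc_file, download_dir):
    """Return the vdb-decrypt argument list for a top-level download directory."""
    decrypt = Path(toolkit_dir) / "bin" / "vdb-decrypt"
    return [str(decrypt), "--ngc", str(ngc_file), str(download_dir)]


def download_and_decrypt(toolkit_dir, ngc_file, cart_file, download_dir):
    """Run prefetch on the cart file, then vdb-decrypt on the download directory.

    check=True makes subprocess.run raise CalledProcessError if either
    command exits with a non-zero status, so a failed download is not
    followed by a decryption attempt.
    """
    subprocess.run(build_prefetch_cmd(toolkit_dir, ngc_file, cart_file), check=True)
    subprocess.run(build_decrypt_cmd(toolkit_dir, ngc_file, download_dir), check=True)
```

Building the argument list separately from running it makes it possible to log or inspect the exact command before any data is transferred.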

Here is a video related to this topic. The recently improved user interface of the dbGaP Authorized Access System allows the principal investigator (PI) to designate one or more downloaders within the PI's institution. A downloader is an individual assigned by the PI to perform the time-consuming task of retrieving large data files. Downloaders can log in to the dbGaP system through their own accounts and download data. Downloads are limited to the datasets that have been approved for access and that the PI has assigned to the downloader. The following is how to assign downloaders to approved datasets within all or specific projects:

  1. Log in to the dbGaP Authorized Access System as a PI using the eRA login credentials. If the respective project has not yet been created, create the project and follow the steps to complete and submit the online application.
  2. Navigate to the “Downloader” page through the “Downloaders” tab. Search for the intended downloader by first name and last name using the search boxes.
Note: A downloader needs to have a valid NIH eRA Commons account or an NIH email account, and must have successfully logged into the dbGaP Authorized Access System at least once. The downloader's eRA account does not need to have a PI role, but it does need to be affiliated with the PI's institution.
  3. Confirm that the resulting user name is correct. Click on the name, select all projects or a specific project from the pull-down menu, and finally click on the “Set downloader” button to make the assignment. The downloader's name and the projects accessible to the downloader will be displayed on the page.
  4. The PI can use the “X” buttons in the “Remove Role” column of the downloader table to remove any downloader or any of a downloader's projects.

Here is a video related to this topic. A downloader has to be designated by the PI through the dbGaP system. Please see here for more details. Before being chosen as a downloader, the individual must:

  1. Have a valid NIH eRA Commons account affiliated with the same organization as the PI, or have an NIH email account. The eRA account does not need to have a PI role.
  2. Have already completed at least one successful login to the dbGaP Authorized Access System.

The download procedure is nearly the same for PI and for downloaders. Please see here for more details.

In most cases, the expiration interval of a download package is set to two months. You can always delete an expired package and order a new one if you need to download the same data again. The new download package can include some or all of the previously downloaded files. Please see here for more details.

No, the FTP interface is no longer available for downloading dbGaP data. Aspera Connect is the only option.


Decrypting and Extracting Data

The following instructions are nearly identical on all supported platforms.

  1. Different treatment of SRA and non-SRA data

    The data files distributed through dbGaP are all encrypted with NCBI's data encryption algorithm. These files have the file suffix “.ncbi_enc”, indicating that they are NCBI encrypted files. However, not all encrypted data need to be decrypted.

    The SRA (short-read-archive) data distributed through the dbGaP are encrypted but there is no need to decrypt them. The NCBI SRA toolkit can work directly on encrypted SRA data without decryption. Decrypted SRA data is in a binary format that is not human readable and can only be processed by the SRA toolkit anyway.

    You need the NCBI SRA Toolkit to work on SRA data. The SRA Toolkit is a collection of utilities that can dump, extract, and convert SRA data to different data formats. The vdb-decrypt utility included in the SRA Toolkit can be used to decrypt any encrypted dbGaP data.

    The dbGaP data other than SRA (non-SRA data) need to be decrypted before use. If you are only working on non-SRA data, you can download the NCBI Decryption Tool, which is a subset of the SRA Toolkit that includes only the utilities related to data decryption. If you already have the SRA Toolkit set up, you don't need to download the NCBI Decryption Tool, because the vdb-decrypt utility is already included.

    Both NCBI SRA Toolkit and NCBI Decryption Tool are available from here.

  2. The dbGaP repository key

    The dbGaP repository key is a project-wide security token required for configuring the NCBI SRA Toolkit and decryption tools. The key is provided in a file with the suffix “.ngc”. It can be obtained from two places in the PI's dbGaP account.

    1. The first place is the project page under the “My Projects” tab, through a link named “get dbGaP repository key” in the “Actions” column. The key downloaded from here is valid for all downloaded data under the project.
    2. The second place is the download page under the “Downloads” tab, through a link named “get dbGaP repository key” in the “Actions” column.
  3. Toolkit configuration and repository key import

    The NCBI Decryption Tool is a subset of the SRA Toolkit, and the steps for setting up the two tools are nearly identical. In either case, a dbGaP repository key for the respective dbGaP project should be downloaded from the PI's dbGaP account, and the tool should first be configured using “vdb-config”, a command-line utility available under the “bin” directory of the toolkit. See here for detailed instructions.

  4. Decrypting Non-SRA Data

    The non-SRA data distributed through dbGaP need to be decrypted before use. The tool named “vdb-decrypt”, included in both the NCBI SRA Toolkit and the NCBI Decryption Tool, performs the decryption.

    To decrypt non-SRA data, go to the dbGaP project directory (workspace) that was set up through the toolkit configuration and issue the vdb-decrypt command from a command line. It is important to remember that the command has to be run directly from the dbGaP project directory.

    A typical vdb-decrypt command should be like this:

    /path-to-your-sratoolkit-installation-dir/bin/vdb-decrypt --ngc /path-to-ngc-file-dir/xxxxx.ngc /path-to-top-level-download-dir/
  5. More about NCBI SRA Toolkit

    Please refer to the documentation of sra-toolkit for more about various utilities available under the sra-toolkit.
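
Since encrypted files carry the “.ncbi_enc” suffix, it can be useful to list which files under a download directory are still encrypted, for example before or after running vdb-decrypt. The following is a minimal Python sketch. The function name is hypothetical, and it assumes that files which have been decrypted no longer carry the suffix. SRA files that are intentionally left encrypted will also appear in the result.

```python
from pathlib import Path

# Suffix used for NCBI encrypted files, as described in item 1 above.
ENCRYPTED_SUFFIX = ".ncbi_enc"


def find_encrypted_files(download_dir):
    """Return a sorted list of files under download_dir ending in .ncbi_enc.

    The search is recursive, so files in nested study or consent-group
    subdirectories are included.
    """
    root = Path(download_dir)
    return sorted(p for p in root.rglob("*" + ENCRYPTED_SUFFIX) if p.is_file())
```

An empty list for a directory that contains only non-SRA data suggests that nothing is left to decrypt.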

Most of the sequencing data available through dbGaP are in SRA format. SRA data can be converted to BAM format using sam-dump combined with samtools. The sam-dump utility is included in the SRA Toolkit. More information about sam-dump is available here, and information about samtools can be found here.

Please see the section on the fastq-dump utility in the SRA Download Guide. If you have further questions regarding SRA (Short-Read-Archive) data, please contact NCBI's SRA group directly (sra@ncbi.nlm.nih.gov). They are better able to help with SRA-related issues.
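
The sam-dump and samtools combination is typically run as a pipeline, with the SAM text from sam-dump streamed into samtools. The following is a minimal Python sketch of that idea, assuming both executables are on the PATH and that sam-dump writes SAM with a header to standard output. The function names and the accession used in the usage note are hypothetical placeholders, and any dbGaP-specific options (such as a repository key) are left out; consult the sam-dump documentation for those.

```python
import subprocess


def build_bam_pipeline(accession, bam_path):
    """Return the two argument lists: sam-dump producer and samtools consumer."""
    sam_dump = ["sam-dump", accession]
    # "-b" requests BAM output, "-o" names the output file, "-" reads stdin.
    samtools = ["samtools", "view", "-b", "-o", bam_path, "-"]
    return sam_dump, samtools


def sra_to_bam(accession, bam_path):
    """Stream sam-dump output into samtools view to write a BAM file."""
    sam_dump, samtools = build_bam_pipeline(accession, bam_path)
    producer = subprocess.Popen(sam_dump, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(samtools, stdin=producer.stdout)
    # Close our copy of the pipe so sam-dump sees a broken pipe
    # if samtools exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            f"pipeline failed: sam-dump={producer_rc}, samtools={consumer_rc}"
        )
```

For example, `sra_to_bam("SRRxxxxxxx", "out.bam")` would run the equivalent of `sam-dump SRRxxxxxxx | samtools view -b -o out.bam -`.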


Data Sample and Subject ID Mapping

The dbGaP phenotype, genotype, and sequencing data (including BAM, SRA data, etc.) are often submitted and processed separately. As a consequence, the ID column headers used in different data files may follow different naming formats. The following information may help you map IDs across all data files.

  1. Phenotype subject, sample ID mapping
    The master mapping between subject and sample IDs can be found in the files that have the phrase “_Subject”, “_Sample”, or “_Pedigree” embedded in the file name. For example:
    • phs000094.v1.pht001136.v1.p1.Oral_Clefts_Subject.MULTI.txt
    • phs000094.v1.pht001138.v1.p1.Oral_Clefts_Sample.MULTI.txt
    • phs000094.v1.pht001137.v1.p1.Oral_Clefts_Pedigree.MULTI.txt

    In the authorized access system, these files are placed together with phenotype files in the file selection tree. The file selection tree can be found on the “Access Request” page under the “My Requests” tab.

  2. Genotype ID mapping
    The sample and subject ID mapping information of genotype files can be found in a file packed in the tarball that has the phrase “sample-info” embedded in the tarball name. For example:
    • phg000054.v1.p1.GENEVA_OralClefts.sample-info.MULTI.tar

    Please note that the ID column headers in the sample-info file may not be identical to those used in the master mapping files mentioned above. The corresponding IDs in the master mapping file should be easy to identify from the plain meaning of the ID headers in the genotype sample-info file.

  3. SRA sample ID mapping
    The SRA samples are given independent IDs at different stages of data processing, handling, and archiving, for different purposes. For example, most of the SRA samples distributed through dbGaP have a submitted_sample_id, sra_accession, sra_sample_id, and dbgap_sample_id. The mapping information for these IDs can be found in a manifest file available on the “Access Request” page. The following is how to locate the manifest file:

    Log in to the dbGaP account, go to the “My Requests” tab, find the data access request of interest in the request list, and click on the “Request Files” link in the “Actions” column. A manifest that contains the SRA sample ID mapping is available through a link named “Dataset Manifest”.
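
Once a mapping file or manifest has been downloaded, a lookup table between any two ID columns can be built with a few lines of code. The following is a minimal Python sketch. It assumes the file is tab-delimited with a single header row and that any comment lines start with “#”; both are assumptions about the file layout, and the function name and sample values are hypothetical. The column names are the ones listed in item 3 above.

```python
import csv


def load_id_map(text, key_column, value_column):
    """Build a {key_column: value_column} dict from tab-delimited text.

    Blank lines and lines starting with '#' are skipped before parsing,
    so the first remaining line is treated as the header row.
    """
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[key_column]: row[value_column] for row in reader}
```

For example, calling `load_id_map(manifest_text, "sra_accession", "submitted_sample_id")` would return a dictionary that translates each SRA accession in the manifest back to the sample ID supplied by the submitter.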

  1. dbGaP SampID
    The dbGaP Sample ID is a dbGaP assigned accession to the submitted SAMPID. Please see SAMPID for more information. The dbGaP SampID is included as a column in the final phenotype dump files whenever there is a submitted sample ID column.

  3. dbGaP SubjID
    The dbGaP Subject ID is a dbGaP assigned accession to the submitted SUBJID. Please see SUBJID for more information. The dbGaP SubjID is included as a column in the final phenotype dump files whenever there is a submitted subject ID column.

    The dbGaP Subject ID is unique across all dbGaP studies, which means that if a subject is known to have participated in multiple studies that have been submitted to dbGaP, the same dbGaP SubjID will be assigned to the individual across those studies, even though the submitted subject ID may be different.

  4. SUBJID
    The SUBJID is submitted subject ID and is included in the Subject Consent Data File, Subject Sample Mapping Data File, Pedigree Data File (if available), and all Subject Phenotype Data Files. A dbGaP Subject is defined as a single human person/individual/patient that arises from a single germline. Each subject has been assigned a single, unique, de-identified Subject ID. Subject IDs should be an integer or string value. Only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). In addition to the submitted Subject ID, dbGaP will assign a dbGaP Subject ID that will be included in the final phenotype dump files along with the submitted Subject ID.

  5. SAMPID
    The SAMPID is the submitted sample ID and is included in the Subject Sample Mapping Data File and Sample Attributes Data File. A dbGaP Sample is defined as the final preps submitted to dbGaP by a genotyping center, to the SRA group by a sequencing group, or to an NCBI resource, such as GEO or BioSamples. A single subject can have multiple samples, but a single sample cannot be mapped to multiple subjects. Each sample should be submitted with a single, unique, de-identified Sample ID. Sample IDs should be an integer or string value. Only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). In addition to the submitted Sample ID, dbGaP will assign a dbGaP Sample ID that will be included in the final phenotype dump files along with the submitted Sample ID. For example, if one patient (subject ID) gave one sample, and that sample was processed differently to generate two sequencing runs, or one sequencing run and one genotyping array, there would be two rows, both using the same subject ID but having two unique sample IDs. The SAMPIDs listed in the Subject Sample Mapping Data File should be identical to the samples found in the genotype and SRA data.

  6. SOURCE_SUBJID and SUBJ_SOURCE
    For subjects originating from a shared source (such as a public repository, consortium, institute, study, etc.) or for subjects with alias IDs, these 2 variables will be included in the Subject Consent Data File. The Subject Source (SUBJ_SOURCE) is the name of the third party source, public repository, consortium, institute, or study that corresponds to the subject. The Source Subject ID (SOURCE_SUBJID) is the de-identified alias Subject ID used in the public repository, consortium, institute, or study from where the subject has been obtained. The SOURCE_SUBJID maps to the SUBJID.

    For referencing HapMap subjects from Coriell, the SUBJ_SOURCE value is written as “Coriell.” The SOURCE_SUBJID should be written as the de-identified subject ID assigned by Coriell.

  7. SEX
    The gender variable can be included in a subject phenotype data file or in a pedigree file if a pedigree file is available.

  8. FAMID
    The family ID is found in the pedigree file if a pedigree file is available. FAMID is a column of de-identified Family IDs. The Family ID is also referred to as the Pedigree ID. The family ID should be the same for individuals belonging in the same biological family.

  9. FATHER and MOTHER
    Every individual father has a unique, de-identified Father ID; every individual mother has a unique, de-identified Mother ID. The Father ID and Mother ID are not identical. 0 (zero) or blank is filled in for founders or marry-ins (parents not specified) in a pedigree. Each unique Father ID and unique Mother ID is also listed in the SUBJID column of both the Pedigree Data File and the Subject Consent Data File.

  10. CONSENT
    Every subject that appears in a Subject Phenotype Data File must be a consented subject (so that his/her phenotypes can be used by approved Authorized Access Users), and every sample that appears in a Sample Attributes Data File must belong to a consented subject. The consent information is listed in the Subject Consent Data File. Each subject can only belong to a single consent group. The consents are determined by the submitter, their IRB, and their GPA (Genome Program Administrator), along with the DAC (Data Access Committee). All data is parsed into its respective consent groups for download.
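
Two of the rules above lend themselves to a quick automated check: the set of characters allowed in SUBJID and SAMPID values, and the rule that a single sample cannot be mapped to multiple subjects. The following is a minimal Python sketch of both checks. It is an illustration of the rules as stated on this page, not dbGaP's own validation, and the function names and sample values are hypothetical.

```python
import re
from collections import defaultdict

# Allowed characters: English letters, Arabic numerals, period, hyphen,
# underscore, at symbol, and pound sign. Empty IDs are rejected.
ID_PATTERN = re.compile(r"[A-Za-z0-9.\-_@#]+")


def is_valid_id(value):
    """Return True if value contains only the permitted ID characters."""
    return ID_PATTERN.fullmatch(value) is not None


def samples_with_multiple_subjects(pairs):
    """Find samples that are mapped to more than one subject.

    `pairs` is an iterable of (SUBJID, SAMPID) tuples, for example the rows
    of a subject-sample mapping file. Returns {SAMPID: sorted SUBJID list}
    for every sample that violates the one-subject-per-sample rule.
    """
    subjects_by_sample = defaultdict(set)
    for subject_id, sample_id in pairs:
        subjects_by_sample[sample_id].add(subject_id)
    return {
        sample_id: sorted(subjects)
        for sample_id, subjects in subjects_by_sample.items()
        if len(subjects) > 1
    }
```

A subject with several samples is allowed, so only the sample-to-subject direction is checked.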