Overview
^^^^^^^^

This directory and the subdirectory "initial/" contain the genome sequence
files from the original release in 2009 by NCBI (GCA_000001405.1) and some
related files. This original version and all subsequent changes are called
"hg19" at UCSC. All coordinates from the initial assembly will always be
valid on the "hg19" UCSC Genome Browser, as no changes were made to
existing sequences.

In 2020 we added new sequences from GRC patch release GRCh37.p13
(GCA_000001405.14), plus the revised Cambridge Reference Sequence (rCRS)
mitochondrial sequence. These can be found in the subdirectory
"p13.plusMT/" or its alias "latest/". See the section "Patches to hg19"
below.

Most readers of this text are looking for the file "latest/hg19.fa.gz".
There is one exception: if you need a file for a genome aligner, like BWA,
bowtie2, hisat2 or similar, please read the section "Analysis set" below
and look at the directory "analysisSet/".

The subdirectory "genes/" contains select gene transcript sets in GFF
format.

GRCh37 was produced and is updated by the Genome Reference Consortium:
    https://www.ncbi.nlm.nih.gov/grc

Differences from the NCBI files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two main differences compared to the NCBI files:

- the mitochondrial genome: since the release of the UCSC hg19 assembly,
  the Homo sapiens mitochondrion sequence (represented as "chrM" in the
  Genome Browser) has been replaced in GenBank with the record NC_012920,
  the revised Cambridge Reference Sequence (rCRS). We have not replaced the
  original sequence, NC_001807, as chrM in the hg19 Genome Browser.
  However, files in the subdirectory p13.plusMT include NC_012920 as
  "chrMT", in addition to the original "chrM".
- also, the FASTA files of NCBI's GCA_000001405.1 distributed at
  ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
  have different sequence identifiers ("NC_000001.10" for NCBI instead of
  "chr1" for UCSC) and the repeatmasking, expressed by lowercasing letters,
  was done with different RepeatMasker settings.

Please also read the notes on our hg19 overview page at:
    http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
The page explains the naming scheme of unplaced contigs and haplotypes,
e.g.
    HSCHR6_MHC_APD_CTG1 = GL000250.1 => "chr6_apd_hap1"
and the placement of the pseudo-autosomal (PAR) regions on chrX and chrY.

Analysis set
^^^^^^^^^^^^

The GRCh37/hg19 patch13 assembly contains not just the chromosome
sequences but also a mitochondrial genome, unplaced sequences, alternate
haplotypes and fix patches. Some of these sequences can confuse modern
aligners. The subdirectory analysisSet/ contains files with optimized
versions of the genome for these aligners or similar high-throughput
analysis programs. The README.txt file in that directory provides more
details.

Patches to hg19
^^^^^^^^^^^^^^^

The Genome Reference Consortium has been adding additional (short)
sequences since the initial release. We added these patches in 2020 but
keep the releases in separate directories:

- The initial/ subdirectory contains files for the initial release of
  GRCh37, without any patch release sequences.

- The p13.plusMT/ subdirectory contains files for GRCh37.p13 (patch release
  13) plus the rCRS mitochondrion sequence (NC_012920) as "chrMT". GRC
  patch releases do not change any previously existing sequences; they
  simply add new sequences for fix patches or alternate haplotypes that
  correspond to specific regions of the main chromosome sequences. The
  Genome Browser displays this expanded set of assembly sequences.

- The latest/ subdirectory contains files that do not include version
  indicators in their names, but are symbolic links to files in the most
  recent version subdirectory, i.e. p13.plusMT.

- Data files in the current directory are the same as files in the initial/
  subdirectory, i.e. they are from the initial GRCh37 release and do not
  include the patch sequences that are included in the Genome Browser.

Sequence names
^^^^^^^^^^^^^^

For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and NCBI
calls "NC_000001.10". The sequences are identical though. To map between
UCSC, Ensembl and NCBI names, use our table "chromAlias", available via our
Table Browser or as file:
    https://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromAlias.txt.gz

We also provide a Python command line tool to convert sequence names in the
most common genomics file formats:
    http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc

During genome assembly, reads are assembled into "contigs" (a few kbp
long), which are then joined into longer "scaffolds" of a few hundred kbp.
These are finally placed onto chromosomes, often manually, e.g. with FISH
assays. The .agp file below describes how these were placed onto
chromosomes.

The alternate haplotype (_hap) sequences were released with the initial
assembly; subsequent patches introduced fix sequences (_fix) and novel
sequences (_alt). For more information on patches see:
    http://genome.ucsc.edu/blog/patches/

The following list represents all the types of sequences found in the hg19
genome:

Chromosomes:
    - made from scaffolds placed onto chromosome locations, 95% of the
      genome file
    - format: chr{chromosome number or name}
    - e.g. chr1 or chrX, chrM for the (non-rCRS) mitochondrial genome.

Unlocalized scaffolds:
    - a sequence found in an assembly that is associated with a specific
      chromosome but cannot be ordered or oriented on that chromosome.
    - format: chr{chromosome number or name}_{sequence_accession}_random
    - e.g. chr17_gl000205_random

Unplaced scaffolds:
    - a sequence found in an assembly that is not associated with any
      chromosome.
    - format: chrUn_{sequence_accession}
    - e.g. chrUn_gl000223

Alternative haplotypes in initial GRCh37 release:
    - a sequence that provides an alternate representation of a locus found
      in the primary assembly. These sequences were present in the initial
      hg19 assembly release. They do not represent complete chromosome
      sequences. There are 9 present in the initial hg19 assembly. For more
      information on the 7 chr6 alternate haplotypes see the MHC Haplotype
      Project website:
      http://www.ucl.ac.uk/cancer/medical-genomics/mhc
    - format: chr{chromosome number or name}_{haplotype_name}_hap{haplotype_number_in_chromosome}
    - e.g. chr6_cox_hap2

Alternate loci scaffolds from patch releases:
    - a scaffold that provides an alternate representation of a locus found
      in the primary assembly. These sequences do not represent a complete
      chromosome sequence, although there is no hard limit on the size of
      the alternate locus; currently most are less than 1 Mb. In the
      context of hg19, all these sequences have been added through patch
      releases.
    - these sequences are not part of the files in the initial/ directory
    - format: chr{chromosome number or name}_{sequence_accession}_alt
    - e.g. chr12_gl877876_alt

Fix loci scaffolds:
    - a patch that corrects sequence or reduces an assembly gap in a given
      major release. FIX patch sequences are meant to be incorporated into
      the primary or existing alt-loci assembly units at the next major
      release.
    - these sequences are not part of the files in the initial/ directory
    - format: chr{chromosome number or name}_{sequence_accession}_fix
    - e.g.
      chrX_kb021648_fix

Files
^^^^^

Files included in this directory are from the initial 2009 release of the
genome; files for the most current patch version of the genome are in the
"latest/" subdirectory:

hg19.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12
    or less) are shown in lower case; non-repeating sequence is shown in
    upper case. Again, the most current version of this file is
    latest/hg19.fa.gz. For many types of analysis that include sequence
    comparisons, the files in the directory analysisSet are recommended,
    as these include fewer duplicates.

hg19.fa.masked.gz - based on hg19.fa.gz, "hard-masked" assembly sequence
    in one file. Repeats are masked by capital Ns; non-repeating sequence
    is shown in upper case.

hg19.fa.out.gz - RepeatMasker .out file. RepeatMasker was run with the -s
    (sensitive) setting.
    Jan 29 2009 (open-3-2-7) version of RepeatMasker
    RepBase library: RELEASE 20090120

hg19.fa.align.gz - RepeatMasker .align file. RepeatMasker was run with the
    -s (sensitive) setting.
    Jan 29 2009 (open-3-2-7) version of RepeatMasker
    RepBase library: RELEASE 20090120

hg19.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED
    format.

hg19.2bit - contains the complete human/hg19/GRCh37 genome sequence in the
    2bit file format. Repeats from RepeatMasker and Tandem Repeats Finder
    (with period of 12 or less) are shown in lower case; non-repeating
    sequence is shown in upper case. The utility program, twoBitToFa
    (available from the kent src tree), can be used to extract .fa file(s)
    from this file.
    A pre-compiled version of the command line tool can be found at:
        http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
    See also:
        http://genome.ucsc.edu/admin/git.html
        https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/userApps/README

hg19.agp.gz - Description of how the assembly was generated from
    fragments.

chromAgp.tar.gz - Description of how the assembly was generated from
    fragments, unpacking to one file per chromosome.

chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12
    or less) are shown in lower case; non-repeating sequence is shown in
    upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
    RepeatMasker was run with the -s (sensitive) setting.
    Using: Jan 29 2009 (open-3-2-7) version of RepeatMasker and
    RELEASE 20090120 of library RepeatMaskerLib.embl

chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED 5+
    format (one file per chromosome).

est.fa.gz - Human ESTs in GenBank. This sequence data is updated regularly
    via automatic GenBank updates.

md5sum.txt - checksums of files in this directory

mrna.fa.gz - Human mRNA from GenBank. This sequence data is updated
    regularly via automatic GenBank updates.

refMrna.fa.gz - RefSeq mRNA from the same species as the genome. This
    sequence data is updated regularly via automatic GenBank updates.

upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
    transcription starts of RefSeq genes with annotated 5' UTRs. This file
    is updated weekly so it might be slightly out of sync with the RefSeq
    data which is updated daily for most assemblies.

upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

xenoMrna.fa.gz - GenBank mRNAs from species other than that of the genome.

hg19.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.

hg19.gc5Base.wigVarStep.gz - ascii data wiggle variable step values used
    to construct the GC Percent track

hg19.gc5Base.wig.gz - wiggle database table for the GC Percent track.
    This is an older standard alternative to the current bigWig format of
    the track, sometimes useful for analysis.

hg19.gc5Base.wib - binary data to correspond with the gc5Base.wig file
    see also:
        http://genome.ucsc.edu/goldenPath/help/wiggle.html
    and
        http://genomewiki.ucsc.edu/index.php/Using_hgWiggle_without_a_database
    for a discussion of how to use the wig.gz and .wib files for
    interaction with the GC percent data values

hg19.chromAlias.txt - sequence name alias file, one line for each sequence
    name. First column is sequence name followed by tab separated alias
    names.

hg19.chromAlias.bb - bigBed file for alias sequence names, one line for
    each sequence name. The first three columns are the sequence in BED
    format, followed by tab-separated alias names. The .bb file is used by
    bedToBigBed as a URL to avoid having to download the entire
    chromAlias.txt file. From the usage message:
        -sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed to
        be a chromAlias bigBed file or a URL to a such a file (see above).
    More documentation is found here:
        https://genomewiki.ucsc.edu/index.php?title=Chrom_Alias

------------------------------------------------------------------

How to download
^^^^^^^^^^^^^^^

If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the files
via our website. To do so, ftp to hgdownload.soe.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/hg19/bigZips. To download multiple files, use the
"mget" command:

    mget ...
    - or -
    mget -a (to download all the files in the directory)

Alternate methods to ftp access:

Using an rsync command to download the entire directory:
    rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ .
For a single file, e.g. chromFa.tar.gz:
    rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz .

Or with wget, all files:
    wget --timestamping 'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/*'
With wget, a single file:
    wget --timestamping 'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz' -O chromFa.tar.gz

To unpack the *.tar.gz files, e.g. chromFa.tar.gz:
    tar xvzf chromFa.tar.gz
To uncompress the fa.gz files, e.g. hg19.fa.gz:
    gunzip hg19.fa.gz

All the files in this directory are freely available for public use.
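
After downloading, the files can be checked against md5sum.txt (listed in
the "Files" section above). The following Python sketch is one possible way
to do this; it is not part of the UCSC tools. It assumes that md5sum.txt
uses the two-column layout written by the md5sum program (checksum,
whitespace, file name) and that it sits in the same directory as the
downloaded files. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines into a {file name: checksum} dict.

    Assumption: each line is "<hex digest>  <file name>", the layout
    written by the md5sum program (a leading "*" marks binary mode).
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, name = parts
        expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def verify_downloads(directory, listing_name="md5sum.txt"):
    """Compare local files with the listing and return {name: status}.

    Status is "ok", "mismatch", or "missing" (file not present locally).
    """
    directory = Path(directory)
    expected = parse_md5_listing((directory / listing_name).read_text())
    report = {}
    for name, checksum in expected.items():
        target = directory / name
        if not target.is_file():
            report[name] = "missing"
        elif md5_of_file(target) == checksum:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report
```

For example, verify_downloads(".") returns a dictionary that maps each file
name listed in md5sum.txt to "ok", "mismatch" or "missing". Files that were
not downloaded are reported as "missing" rather than raising an error, so
the check also works after fetching a single file.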