Introduction
^^^^^^^^^^^^

This directory contains the Dec. 2011 (GRCm38/mm10) assembly of the
mouse genome (mm10, Genome Reference Consortium Mouse Build 38
(GCA_000001635.2)), as well as repeat annotations and GenBank sequences.
This assembly was produced by the Mouse Genome Sequencing Consortium and
the National Center for Biotechnology Information (NCBI).

For more information on the mouse genome, see:
    http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/
    http://www.ncbi.nlm.nih.gov/genome/52

Patches
^^^^^^^

mm10 has been updated with patches since its release in 2012. GRC patch
releases do not change any previously existing sequences; they simply add
new sequences for fix patches or alternate haplotypes that correspond to
specific regions of the main chromosome sequences. These sequences are
typically relatively short, on the order of a few dozen kbp. For most
users, the patches are unlikely to make a difference and may complicate
the analysis as they introduce more duplication.

The initial/ subdirectory contains files for the initial release of
GRCm38, which has 66 sequences, with no alternate sequences and no fix
sequences. It is the same as the parent download directory. This is
probably the best genome file for aligners and most analysis tasks; it is
comparable to the version called "analysisSet" for the human genome.

The p6/ subdirectory contains files for GRCm38.p6 (patch release 6).
It has 239 sequences including alternate and fix sequences. Note that
these patches include "strain-specific" sequences. You may want to check
with the authors of your aligner whether the software can recognize these
sequences.

The "latest/" symbolic link points to the subdirectory for the most
recent patch version.

mm10.* files in this directory are the same as files in the initial/
subdirectory, i.e. they are from the initial GRCm38 release and do not
include the patch sequences that are now included in the Genome Browser.
This means that old software that downloads these files will not report
different results.

Sequence names
^^^^^^^^^^^^^^

For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and
NCBI calls "NC_000067.6". The sequences are identical though. To map
between UCSC, Ensembl and NCBI names, use our table "chromAlias",
available via our Table Browser or as file:
    https://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/chromAlias.txt.gz

We also provide a Python command line tool to convert sequence names in
the most common genomics file formats:
    http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc

During genome assembly, reads are assembled into "contigs" (a few kbp
long), which are then joined into longer "scaffolds" of a few hundred
kbp. These are finally placed, often manually e.g. with FISH assays, onto
chromosomes. As a result, the mm10 genome sequence files contain these
types of sequences:

Chromosomes:
    - made from scaffolds placed onto chromosome locations, 95% of the
      genome file
    - format: chr{chromosome number or name}
    - e.g. chr1 or chrX, chrM for the mitochondrial genome.

Unlocalized scaffolds:
    - a sequence found in an assembly that is associated with a specific
      chromosome but cannot be ordered or oriented on that chromosome.
    - format: chr{chromosome number or name}_{sequence_accession}_random
    - e.g. chr5_JH584299_random

Unplaced scaffolds:
    - a sequence found in an assembly that is not associated with any
      chromosome.
    - format: chrUn_{sequence_accession}
    - e.g. chrUn_GL456379

Alternate loci scaffolds:
    - a scaffold that provides an alternate representation of a locus
      found in the primary assembly. These sequences do not represent a
      complete chromosome sequence although there is no hard limit on the
      size of the alternate locus; currently these are less than 1 Mb.
      These could either be NOVEL patch sequences, added through patch
      releases, or present in the initial assembly release.
    - format: chr{chromosome number or name}_{sequence_accession}_alt
    - e.g. chr6_GL456054_alt

Strain-specific alternate loci sequences:
    - these are alternate loci that are not different versions of the
      sequence in the same mouse population, but from other strains.

Fix loci scaffolds:
    - a patch that corrects sequence or reduces an assembly gap in a
      given major release. FIX patch sequences are meant to be
      incorporated into the primary or existing alt-loci assembly units
      at the next major release.
    - these sequences are not part of the files in the initial/ directory
    - format: chr{chromosome number or name}_{sequence_accession}_fix
    - e.g. chr5_KV575237_fix

Files
^^^^^

Files included in this directory:

mm10.2bit - contains the complete mouse/mm10 genome sequence
    in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats
    Finder (with period of 12 or less) are shown in lower case;
    non-repeating sequence is shown in upper case. The utility program,
    twoBitToFa (available from the kent src tree), can be used to extract
    .fa file(s) from this file. A pre-compiled version of the command
    line tool can be found at:
        http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
    See also:
        http://genome.ucsc.edu/admin/git.html
        http://genome.ucsc.edu/admin/jk-install.html

chromAgp.tar.gz - Description of how the assembly was generated from
    fragments, unpacking to one file per chromosome.

chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
    RepeatMasker was run with the -s (sensitive) setting.
    RepeatMasker library RELEASE 20110920
    April 26 2011 (open-3-3-0) version of RepeatMasker

chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep
    repeats with period less than or equal to 12, and translated into
    UCSC's BED 5+ format (one file per chromosome).

est.fa.gz - Mouse ESTs in GenBank. This sequence data is updated once a
    week via automatic GenBank updates.

md5sum.txt - checksums of files in this directory

mrna.fa.gz - Mouse mRNA from GenBank. This sequence data is updated
    once a week via automatic GenBank updates.

refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
    This sequence data is updated once a week via automatic GenBank
    updates.

upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
    transcription starts of RefSeq genes with annotated 5' UTRs.
    This file is updated weekly so it might be slightly out of sync with
    the RefSeq data which is updated daily for most assemblies.

upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

xenoMrna.fa.gz - GenBank mRNAs from species other than that of
    the genome. This sequence data is updated once a week via automatic
    GenBank updates.

mm10.agp.gz - Description of how the assembly was generated from
    fragments.

mm10.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.

mm10.chromAlias.txt - sequence name alias file, one line
    for each sequence name. First column is sequence name followed by
    tab separated alias names.

mm10.chromAlias.bb - bigBed file for alias sequence names, one line
    for each sequence name. The first three columns are the sequence in
    BED format, followed by tab-separated alias names. The .bb file is
    used by bedToBigBed as a URL to avoid having to download the entire
    chromAlias.txt file. From the usage message:
        -sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed
        to be a chromAlias bigBed file or a URL to a such a file (see
        above).
    More documentation is found here:
        https://genomewiki.ucsc.edu/index.php?title=Chrom_Alias

mm10.fa.align.gz - RepeatMasker .align file. RepeatMasker was run with
    the -s (sensitive) setting.
    2012-02-06 version of RepeatMasker

mm10.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.
    (again, the most current version of this file is latest/mm10.fa.gz)

mm10.fa.masked.gz - "Hard-masked" assembly sequence in one file.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

mm10.fa.out.gz - RepeatMasker .out file. RepeatMasker was run with
    the -s (sensitive) setting.
    2012-02-06 version of RepeatMasker

mm10.gc5Base.wigVarStep.gz - ASCII wiggle variable-step data values used
    to construct the GC Percent track

mm10.gc5Base.bw - binary bigWig data for the gc5Base track.

mm10.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep
    repeats with period less than or equal to 12, and translated into
    UCSC's BED format.

------------------------------------------------------------------

If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the
files via our website. To do so, ftp to hgdownload.cse.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/mm10/bigZips. To download multiple files, use
the "mget" command:

    mget ...
    - or -
    mget -a (to download all the files in the directory)

Alternate methods to ftp access.

Using an rsync command to download the entire directory:
    rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/ .
For a single file, e.g. chromFa.tar.gz
    rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz .
Or with wget, all files:
    wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/*'
With wget, a single file:
    wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz' -O chromFa.tar.gz

To unpack the *.tar.gz files:
    tar xvzf <file>.tar.gz
To uncompress the fa.gz files:
    gunzip <file>.fa.gz

------------------------------------------------------------------
This file last updated: 2020-04-20 - 20 April 2021
------------------------------------------------------------------
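
Example: classifying sequence names

The naming patterns listed under "Sequence names" can be checked
mechanically, for instance to count how many sequences of each type a
chrom.sizes file contains. The following is a minimal Python sketch, not
a UCSC tool. The function names classify_sequence_name and count_by_kind
are hypothetical, and the regular expressions assume that accessions
consist only of letters and digits, as in the examples above.

```python
import re

# Hypothetical helper, not part of any UCSC tool. The patterns follow the
# name formats described under "Sequence names"; accessions are assumed
# to contain only letters and digits.
_PATTERNS = [
    ("unplaced", re.compile(r"^chrUn_[A-Za-z0-9]+$")),
    ("unlocalized", re.compile(r"^chr[A-Za-z0-9]+_[A-Za-z0-9]+_random$")),
    ("alternate", re.compile(r"^chr[A-Za-z0-9]+_[A-Za-z0-9]+_alt$")),
    ("fix", re.compile(r"^chr[A-Za-z0-9]+_[A-Za-z0-9]+_fix$")),
    ("chromosome", re.compile(r"^chr([0-9]+|[XYM])$")),
]


def classify_sequence_name(name):
    """Return the sequence type implied by a UCSC-style sequence name."""
    for kind, pattern in _PATTERNS:
        if pattern.match(name):
            return kind
    return "unknown"


def count_by_kind(chrom_sizes_lines):
    """Tally sequence types from lines in chrom.sizes format (name, tab, size)."""
    counts = {}
    for line in chrom_sizes_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name = line.split("\t")[0]
        kind = classify_sequence_name(name)
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

With a local copy of mm10.chrom.sizes, a call such as
count_by_kind(open("mm10.chrom.sizes")) would return a dictionary of
counts per type. Names that match none of the patterns, such as the
Ensembl-style "1", are reported as "unknown".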