Introduction
^^^^^^^^^^^^

This directory contains the Dec. 2011 (GRCm38/mm10) assembly of the mouse genome
(mm10, Genome Reference Consortium Mouse Build 38 (GCA_000001635.2)), as well
as repeat annotations and GenBank sequences.

This assembly was produced by the Mouse Genome Sequencing Consortium
and the National Center for Biotechnology Information (NCBI).
For more information on the mouse genome, see the project websites:

    http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/
    http://www.ncbi.nlm.nih.gov/genome/52

Patches
^^^^^^^

mm10 has been updated with patches since its release in 2012.
GRC patch releases do not change any previously existing sequences; they simply add
new sequences for fix patches or alternate haplotypes that correspond to
specific regions of the main chromosome sequences. These sequences are typically
relatively short, on the order of a few dozen kbp. For most users, the patches
are unlikely to make a difference and may complicate the analysis as they
introduce more duplication.

The initial/ subdirectory contains files for the initial release of GRCm38,
which has 66 sequences, none of them alternate or fix sequences.
It is the same as the parent download directory. This is probably the
best genome file for aligners and most analysis tasks; for the human genome,
the comparable version is called "analysisSet".

The p6/ subdirectory contains files for GRCm38.p6 (patch release 6). 
It has 239 sequences, including alternate and fix sequences. Note that
these patches include "strain-specific" sequences. You may want to
check with the authors of your aligner whether the software can recognize
these sequences.

The "latest/" symbolic link points to the subdirectory for the most recent patch version.

mm10.* files in this directory are the same as files in the initial/
subdirectory, i.e. they are from the initial GRCm38 release and do not
include the patch sequences that are now included in the Genome Browser.
This means that existing software that downloads these files will keep
producing the same results.


Sequence names
^^^^^^^^^^^^^^

For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and NCBI
calls "NC_000067.6". The sequences are identical, though. To map between UCSC,
Ensembl and NCBI names, use our table "chromAlias", available via our Table
Browser or as a file:

    https://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/chromAlias.txt.gz

We also provide a Python command line tool to convert sequence names in the
most common genomics file formats:

    http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc

During genome assembly, reads are assembled into "contigs" (a few kbp long),
which are then joined into longer "scaffolds" of a few hundred kbp. These are
finally placed onto chromosomes, often manually, e.g. with FISH assays.
As a result, the mm10 genome sequence files contain these types of sequences:

Chromosomes: 
- made from scaffolds placed onto chromosome locations, 95% of the genome file
- format: chr{chromosome number or name} 
- e.g. chr1 or chrX, chrM for the mitochondrial genome.

Unlocalized scaffolds:
- a sequence found in an assembly that is associated with a specific
  chromosome but cannot be ordered or oriented on that chromosome.
- format: chr{chromosome number or name}_{sequence_accession}_random
- e.g. chr5_JH584299_random

Unplaced scaffolds: 
- a sequence found in an assembly that is not associated with any chromosome.  
- format: chrUn_{sequence_accession}
- e.g. chrUn_GL456379

Alternate loci scaffolds: 
- a scaffold that provides an alternate representation of a locus found
  in the primary assembly. These sequences do not represent a complete
  chromosome sequence although there is no hard limit on the size of the
  alternate locus; currently these are less than 1 Mb. These could either 
  be NOVEL patch sequences, added through patch releases, or present in the 
  initial assembly release.
- format: chr{chromosome number or name}_{sequence_accession}_alt
- e.g. chr6_GL456054_alt

Strain-specific alternate loci sequences:
- these are alternate loci that do not represent different versions of the
  sequence within the same mouse population, but come from other strains.

Fix loci scaffolds: 
- a patch that corrects sequence or reduces an assembly gap in a given
  major release. FIX patch sequences are meant to be incorporated into
  the primary or existing alt-loci assembly units at the next major
  release.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_fix
- e.g. chr5_KV575237_fix
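
The naming formats above can be used to sort sequence names into their
types. The following is a minimal Python sketch, not a UCSC tool: the
function name and the category labels are made up for this example, and
the patterns assume that accessions never contain an underscore.

```python
import re

# Patterns follow the naming formats listed above. The category labels are
# illustrative names chosen for this sketch. Order matters: the suffixed
# forms are tested before the plain chromosome form.
_PATTERNS = [
    ("unlocalized", re.compile(r"^chr[^_]+_[^_]+_random$")),
    ("alternate", re.compile(r"^chr[^_]+_[^_]+_alt$")),
    ("fix", re.compile(r"^chr[^_]+_[^_]+_fix$")),
    ("unplaced", re.compile(r"^chrUn_[^_]+$")),
    ("chromosome", re.compile(r"^chr[^_]+$")),
]


def classify_sequence_name(name):
    """Return the category of a UCSC-style sequence name, or "unknown"."""
    for label, pattern in _PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"


for example in ["chr1", "chrM", "chr5_JH584299_random", "chrUn_GL456379",
                "chr6_GL456054_alt", "chr5_KV575237_fix"]:
    print(example, classify_sequence_name(example))
```

A filter such as keeping only the names classified as "chromosome" is one
way to drop scaffolds and patch sequences before an analysis.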

Files
^^^^^

Files included in this directory:

mm10.2bit - contains the complete mouse/mm10 genome sequence
    in the 2bit file format.  Repeats from RepeatMasker and Tandem Repeats
    Finder (with period of 12 or less) are shown in lower case; non-repeating
    sequence is shown in upper case.  The utility program, twoBitToFa (available
    from the kent src tree), can be used to extract .fa file(s) from
    this file.  A pre-compiled version of the command line tool can be
    found at:
        http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
    See also:
        http://genome.ucsc.edu/admin/git.html
        http://genome.ucsc.edu/admin/jk-install.html

chromAgp.tar.gz - Description of how the assembly was generated from
    fragments, unpacking to one file per chromosome.

chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
    RepeatMasker was run with the -s (sensitive) setting.
    RepeatMasker library release 20110920
    April 26 2011 (open-3-3-0) version of RepeatMasker

chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED 5+
    format (one file per chromosome).

est.fa.gz - Mouse ESTs in GenBank. This sequence data is updated once a
    week via automatic GenBank updates.

md5sum.txt - checksums of files in this directory
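
A downloaded file can be checked against md5sum.txt along these lines.
This is a sketch in Python that assumes md5sum.txt uses the usual md5sum
output layout (hex digest, whitespace, file name); the function names are
made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_lines(lines):
    """Map file name -> expected digest from md5sum-style lines."""
    expected = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary-mode entries with a leading "*".
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def is_intact(path, name, expected):
    """True if the file at path matches the digest listed for name."""
    return name in expected and md5_of_file(path) == expected[name]
```

For example, after reading md5sum.txt with open("md5sum.txt") and passing
the file object to parse_md5sum_lines(), is_intact("mm10.fa.gz",
"mm10.fa.gz", expected) reports whether the download is complete.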

mrna.fa.gz - Mouse mRNA from GenBank. This sequence data is updated
    once a week via automatic GenBank updates.

refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
    This sequence data is updated once a week via automatic GenBank
    updates.

upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
    transcription starts of RefSeq genes with annotated 5' UTRs.
    This file is updated weekly so it might be slightly out of sync with
    the RefSeq data which is updated daily for most assemblies.

upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

xenoMrna.fa.gz - GenBank mRNAs from species other than that of 
    the genome. This sequence data is updated once a week via automatic 
    GenBank updates.

mm10.agp.gz - Description of how the assembly was generated from
    fragments.

mm10.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.
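
As an illustration of the two-column layout, the sketch below reads
chrom.sizes lines in Python and computes how much of the total length lies
on sequences without an underscore in their name (i.e. chromosomes). The
function names are made up for this example, and the sizes in the usage
lines are invented placeholders, not real mm10 values.

```python
def read_chrom_sizes(lines):
    """Parse chrom.sizes lines into a {sequence name: size} dict."""
    sizes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, size = line.split("\t")
        sizes[name] = int(size)
    return sizes


def fraction_on_chromosomes(sizes):
    """Fraction of bases on sequences whose names contain no underscore."""
    total = sum(sizes.values())
    placed = sum(size for name, size in sizes.items() if "_" not in name)
    return placed / total if total else 0.0


# Invented placeholder sizes, for illustration only.
demo = read_chrom_sizes(["chr1\t900\n", "chrUn_GL456379\t100\n"])
print(fraction_on_chromosomes(demo))
```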

mm10.chromAlias.txt - sequence name alias file, one line
    for each sequence name.  The first column is the sequence name,
    followed by tab-separated alias names.
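
Given that layout, a lookup from any alias to the UCSC name could be built
as sketched below in Python. This is not the chromToUcsc tool; the function
name is made up, and both the skipping of "#" header lines and the column
order in the usage lines are assumptions about the file.

```python
def build_alias_lookup(lines):
    """Map every alias (and the UCSC name itself) to the UCSC name."""
    lookup = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skipping "#" header lines is an assumption about the file layout.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        ucsc_name = fields[0]
        lookup[ucsc_name] = ucsc_name
        for alias in fields[1:]:
            if alias:  # a sequence may have no name in some naming scheme
                lookup.setdefault(alias, ucsc_name)
    return lookup


# The column order of this sample line is illustrative.
lookup = build_alias_lookup(["chr1\t1\tNC_000067.6\n"])
print(lookup["NC_000067.6"], lookup["1"])
```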

mm10.chromAlias.bb - bigBed file for alias sequence names, one line
    for each sequence name. The first three columns are the sequence in
    BED format, followed by tab-separated alias names.
    bedToBigBed can use the .bb file via its URL, which avoids having to
    download the entire chromAlias.txt file.  From the usage message:
        -sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed to be a
        chromAlias bigBed file or a URL to a such a file (see above).

More documentation is found here:
https://genomewiki.ucsc.edu/index.php?title=Chrom_Alias
    
mm10.fa.align.gz - RepeatMasker .align file.  RepeatMasker was run with the
    -s (sensitive) setting.
    2012-02-06 version of RepeatMasker

mm10.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or
    less) are shown in lower case; non-repeating sequence is shown in upper
    case. (again, the most current version of this file is latest/mm10.fa.gz)

mm10.fa.masked.gz - "Hard-masked" assembly sequence in one file.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.
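
The relation between the soft-masked and hard-masked files can be sketched
in a few lines of Python: lower-case bases become N. This is only an
illustration of the masking convention described above, with made-up
function names; it is not the procedure used to build mm10.fa.masked.gz.

```python
import string

# Translation table: every lower-case ASCII letter becomes "N".
_LOWER_TO_N = str.maketrans(string.ascii_lowercase, "N" * 26)


def hard_mask(sequence):
    """Replace soft-masked (lower-case) bases with N."""
    return sequence.translate(_LOWER_TO_N)


def soft_masked_fraction(sequence):
    """Fraction of bases that are lower case, i.e. annotated as repeats."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


print(hard_mask("ACGTacgtNN"))
print(soft_masked_fraction("ACgt"))
```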

mm10.fa.out.gz - RepeatMasker .out file.  RepeatMasker was run with the
    -s (sensitive) setting.
    2012-02-06 version of RepeatMasker

mm10.gc5Base.wigVarStep.gz - ASCII wiggle variable-step data values used
    to construct the GC Percent track.

mm10.gc5Base.bw - binary bigWig data for the gc5Base track.
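
For readers unfamiliar with the wiggle variable-step format, the Python
sketch below parses it from text lines. It assumes the standard layout of a
"variableStep chrom=... span=..." declaration followed by "position value"
lines and does not handle fixedStep sections. The function name and the
sample values are made up for this example.

```python
def parse_variable_step(lines):
    """Yield (chrom, start, span, value) from variableStep wiggle lines.

    Start positions are kept 1-based, as in the wiggle format.
    """
    chrom, span = None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=chr1 span=5".
            params = dict(field.split("=", 1) for field in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        start, value = line.split()
        yield chrom, int(start), span, float(value)


# Invented sample values, for illustration only.
sample = ["variableStep chrom=chr1 span=5", "3000001\t40", "3000006\t60"]
for record in parse_variable_step(sample):
    print(record)
```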

mm10.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED
    format.

------------------------------------------------------------------
If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the
files via our website. To do so, ftp to hgdownload.cse.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/mm10/bigZips. To download multiple files, use
the "mget" command:

    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)

Alternative methods to ftp access:

Using an rsync command to download the entire directory:
    rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/ .
For a single file, e.g. chromFa.tar.gz:
    rsync -avzP \
        rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz .

Or with wget, all files:
    wget --timestamping \
        'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/*'
With wget, a single file:
    wget --timestamping \
        'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz' \
        -O chromFa.tar.gz

To unpack the *.tar.gz files:
    tar xvzf <file>.tar.gz
To uncompress the fa.gz files:
    gunzip <file>.fa.gz
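
The fa.gz files can also be read without uncompressing them on disk. The
following Python sketch uses the standard gzip module and yields one
sequence at a time; the function name is made up for this example, and a
whole sequence is held in memory while it is processed.

```python
import gzip


def read_fasta_gz(path):
    """Yield (name, sequence) records from a gzip-compressed FASTA file."""
    name, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the name.
                header = line[1:].strip()
                name = header.split()[0] if header else ""
                chunks = []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

For example, iterating over read_fasta_gz("mm10.fa.gz") and printing
len(sequence) for each record lists the sequence lengths without creating
an uncompressed copy of the file.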

------------------------------------------------------------------
This file last updated: 2020-04-20 - 20 April 2021
------------------------------------------------------------------