This file is from:

    http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/README.txt

This directory contains compressed multiple alignments of 44 virus sequences.
These 44 sequences represent coronavirus strains in bat populations.

The 'reference' sequence for this collection is the sequence:

  NC_045512v2 - 2019-12-30 - Wuhan-Hu-1
  https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2

Description files in this directory:

  md5sum.txt - md5 sums to verify copied files
  wuhCor1.44way.nameList.txt - maps each accession name to its
                              sequence name and sample collection date

  wuhCor1.44way.nh - Phylogenetic tree used for the multiz alignment.
           The tree was calculated from 31-mer frequency similarity; the
           resulting distance matrix was joined by neighbor joining with
           the PHYLIP toolset:
           http://evolution.genetics.washington.edu/phylip.html
           'neighbor' command:
           http://evolution.genetics.washington.edu/phylip/progs.data.dist.html

  wuhCor1.multiz44way.maf.gz - alignments with gap annotation, labeled
                                with accession identifiers

  sequences/ - directory with files:
     https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/sequences/

  sequences/dnaFasta44.tgz - gzipped tar file for the DNA fasta, 44 sequences

  sequences/proteinFasta44.tgz - gzipped tar file for the proteins as obtained
                     from the genbank records, for example in:
                     https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
                     one .faa.gz for each sequence

  sequences/proteinTab44.tgz - the same proteins arranged in single lines
                     of the format:

  sequenceName.proteinName<tab>amino acids . . .

  One file for each of the sequences.

  This file format is convenient for extracting proteins of similar length
  from all the sequences.  For example, the longest protein in the reference
  sequence (the ORF1ab polyprotein) is over 7000 AAs, whereas the 'spike'
  protein is about 1270 AAs.  To collect every protein longer than 6000 AAs,
  after unpacking the tgz file into a directory:

  zcat *.faa.tab.gz | awk -F$'\t' 'length($2) > 6000' \
     | awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > longestProtein.faa

  You can drop that longestProtein.faa file into a multiple aligner such
  as 'COBALT'
     https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi
  to obtain a multiple alignment of that protein across these sequences.
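
  Before choosing a length threshold for a filter like the one above, it
  can help to list the protein lengths first.  The following is a minimal
  sketch, assuming the tgz file was unpacked so that the *.faa.tab.gz files
  sit in one directory.  The function name protein_lengths is hypothetical
  and not part of this distribution.

```shell
# List every protein with its amino-acid length, longest first.
# Assumption: $1 is the directory holding the unpacked *.faa.tab.gz files
# (default: current directory), each line being name<tab>amino acids.
protein_lengths() {
  dir="${1:-.}"
  gzip -dc "$dir"/*.faa.tab.gz \
    | awk -F'\t' '{ printf "%s\t%d\n", $1, length($2) }' \
    | sort -t"$(printf '\t')" -k2,2nr
}
```

  For example, 'protein_lengths . | head' shows the longest proteins, from
  which a suitable cutoff for the awk filter can be read off.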

For a description of multiple alignment format (MAF), see
http://genome.ucsc.edu/goldenPath/help/maf.html.
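
As a quick check of a downloaded alignment file, the 's' rows of a MAF file
(fields: s src start size strand srcSize text) can be tallied per source
sequence.  This is a minimal sketch rather than a full MAF parser; the
function name maf_rows_per_source is hypothetical.

```shell
# Count how many alignment rows ('s' lines) each source sequence contributes.
# $1 is a gzip-compressed MAF file; other line types (a, i, e, comments)
# are ignored.
maf_rows_per_source() {
  gzip -dc "$1" \
    | awk '$1 == "s" { n[$2]++ }
           END { for (src in n) printf "%s\t%d\n", src, n[src] }' \
    | sort
}
```

For example: maf_rows_per_source wuhCor1.multiz44way.maf.gz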

PhastCons conservation scores for these alignments are available at:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phastCons44way

PhyloP conservation scores for these alignments are available at:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phyloP44way

---------------------------------------------------------------
To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.

Via rsync:
rsync -avz --progress \
        rsync://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/ ./

Via FTP:
    ftp hgdownload.soe.ucsc.edu
    user name: anonymous
    password: <your email address>
    go to the directory goldenPath/wuhCor1/multiz44way

To download multiple files from the UNIX command line, use the "mget" command.
    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)
Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.
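
After downloading by either method, the files can be checked against
md5sum.txt.  The sketch below assumes GNU coreutils 'md5sum' is installed,
that md5sum.txt uses the standard 'checksum  filename' layout, and that the
listed paths are relative to the download directory.  The function name
verify_downloads is hypothetical.

```shell
# Verify downloaded files against the md5sum.txt in the same directory.
# $1 is the download directory (default: current directory).
# Returns non-zero if any listed file is missing or does not match.
verify_downloads() {
  dir="${1:-.}"
  ( cd "$dir" && md5sum -c md5sum.txt )
}
```

If only some of the files were downloaded, GNU md5sum's --ignore-missing
option can be added so that files absent from the directory are skipped.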

---------------------------------------------------------------
All the files in this directory are freely usable for any
purpose. For data use restrictions regarding the individual
genome assemblies, see http://genome.ucsc.edu/goldenPath/credits.html.
---------------------------------------------------------------