This file is from:

  http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz119way/README.txt

This directory contains compressed multiple alignments of 119 virus
sequences.  The 'reference' sequence for this collection is:

  NC_045512v2 - 2019-12-30 - Wuhan-Hu-1
  https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2

These 119 unique sequences were obtained from 3 sources:

1. NCBI Entrez search term: "SARS-CoV-2" produces 106 sequences as of
   2020-03-06.  40 of these sequences were exact duplicates of other
   sequences in this set, and 21 of these sequences were fragments of
   gene sequences.  The duplicates and fragments are not included in
   the list of 119 sequences.
2. "coronaviridae" sequences (55 sequences) obtained from:
     https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118
   selected to show "RefSeq nucleotides"
3. 12 additional unique sequences were obtained from:
     https://bigd.big.ac.cn/ncov/release_genome
   All the other sequences available there were copies of the
   NCBI/GenBank sequences.

Description files in this directory:

md5sum.txt - md5 sums to verify copied files
wuhCor1.119way.nameList.txt - relating the accession name to sequence name,
    and sample collection date
wuhCor1.119way.nh - Phylogenetic tree used for multiz alignment.
    The phylogenetic tree was calculated from 31-mer frequency similarity,
    with neighbor joining applied to the resulting distance matrix using
    the PHYLIP toolset:
      http://evolution.genetics.washington.edu/phylip.html
    'neighbor' command:
      http://evolution.genetics.washington.edu/phylip/progs.data.dist.html
wuhCor1.multiz119way.maf.gz - alignments with gap annotation and
    accession identifiers
sequences/ - directory with files:
    https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz119way/sequences/
sequences/dnaFasta119.tgz - gzipped tar file for the DNA fasta, 119 sequences
sequences/proteinFasta119.tgz - gzipped tar file for the proteins as obtained
    from the GenBank records, for example in:
      https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
    one .faa.gz for each sequence (not all protein sequences are
    available, only 107 sequences present)
sequences/proteinTab119.tgz - the same proteins arranged in single lines of
    the format:
      sequenceName.proteinName<tab>amino acids . . .
    One file for each of the sequences.  Not all of the 119 sequences have
    these protein records; there are 107 protein records here.
    This file format is convenient for extracting proteins of similar
    length from all the sequences.  For example, the longest protein
    (the ORF1ab replicase polyprotein) is over 6000 AAs.  After unpacking
    the tgz file into a directory:

      zcat *.faa.tab.gz | awk -F$'\t' 'length($2) > 6000' \
        | awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > orf1abProtein.faa

    You can drop that orf1abProtein.faa file into a multiple aligner such
    as 'COBALT':
      https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi
    to obtain a multiple alignment of that protein for 99 of these
    sequences.

For a description of multiple alignment format (MAF), see
http://genome.ucsc.edu/goldenPath/help/maf.html.
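
The length filter shown for the proteinTab119.tgz files can be generalized
to a length window, which may help when the protein of interest is not the
longest one.  The following is only a sketch: the function name
tab_to_fasta_by_length and its two arguments (minimum and maximum length) are
hypothetical and not part of this directory, and it assumes the
tab-separated "name, amino acids" layout described above.

```shell
# Hypothetical helper (not provided in this directory).
# Reads "sequenceName.proteinName<TAB>aminoAcids" lines on stdin and prints,
# in FASTA format, the records whose amino acid length is within [min, max].
#   $1 = minimum length (inclusive)
#   $2 = maximum length (inclusive)
tab_to_fasta_by_length() {
    awk -F'\t' -v min="$1" -v max="$2" \
        'length($2) >= min && length($2) <= max { printf ">%s\n%s\n", $1, $2 }'
}
```

For example, after unpacking the tgz file into a directory, the filter for
proteins longer than 6000 AAs could be written as (the output file name and
the upper bound are arbitrary):

    zcat *.faa.tab.gz | tab_to_fasta_by_length 6001 1000000 > selected.faa
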
PhastCons conservation scores for these alignments are available at:
  http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phastCons119way

PhyloP conservation scores for these alignments are available at:
  http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phyloP119way

---------------------------------------------------------------
To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.

Via rsync:

    rsync -avz --progress \
      rsync://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz119way/ ./

Via FTP:

    ftp hgdownload.soe.ucsc.edu
    user name: anonymous
    password: <your email address>
    go to the directory goldenPath/wuhCor1/multiz119way

To download multiple files from the UNIX command line, use the "mget" command.

    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)

Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.

---------------------------------------------------------------
All the files in this directory are freely usable for any purpose.
For data use restrictions regarding the individual genome assemblies, see
http://genome.ucsc.edu/goldenPath/credits.html.
---------------------------------------------------------------
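
After copying files with rsync or FTP, the md5sum.txt file listed among the
description files can be used to check the copies.  The following is a
minimal sketch, not an official procedure: the function name verify_download
is hypothetical, and it assumes GNU coreutils md5sum and that md5sum.txt uses
the usual "<md5>  <filename>" line format with paths relative to the
download directory.

```shell
# Hypothetical helper (not provided in this directory).
# Checks every file listed in md5sum.txt inside the given download directory.
#   $1 = download directory (defaults to the current directory)
# Returns 0 when all listed files match, non-zero otherwise.
# --quiet suppresses the "OK" line for each file that verifies.
verify_download() {
    dir="${1:-.}"
    ( cd "$dir" && md5sum --quiet -c md5sum.txt )
}
```

Usage, from the directory that received the download:

    verify_download ./

If only some of the files were downloaded, the check reports the missing
ones as failures; recent versions of GNU md5sum accept --ignore-missing to
skip them.
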