This file is from:
    http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/README.txt

This directory contains compressed multiple alignments of 44 virus
sequences.  These 44 sequences represent coronavirus strains in bat
populations.

The 'reference' sequence for this collection is the sequence:
    NC_045512v2 - 2019-12-30 - Wuhan-Hu-1
    https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2

Description files in this directory:

md5sum.txt - md5 sums to verify copied files
wuhCor1.44way.nameList.txt - relates the accession name to the sequence
    name and sample collection date
wuhCor1.44way.nh - phylogenetic tree used for the multiz alignment.
    The tree was calculated from 31-mer frequency similarity, with
    neighbor joining applied to the resulting distance matrix using the
    PHYLIP toolset:
        http://evolution.genetics.washington.edu/phylip.html
    'neighbor' command:
        http://evolution.genetics.washington.edu/phylip/progs.data.dist.html
wuhCor1.multiz44way.maf.gz - alignments with gap annotation, with accession
    identifiers
sequences/ - directory with files:
    https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/sequences/
sequences/dnaFasta44.tgz - gzipped tar file for the DNA fasta, 44 sequences
sequences/proteinFasta44.tgz - gzipped tar file for the proteins as obtained
    from the genbank records, for example in:
        https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
    one .faa.gz for each sequence
sequences/proteinTab44.tgz - the same proteins arranged in single lines
    of the format:
        sequenceName.proteinName<tab>amino acids . . .
    One file for each of the sequences.  This format file is convenient
    for extracting proteins from all the sequences that have a similar
    length.
    For example, the 'spike' protein is over 1200 AAs (1273 in the
    reference sequence).  After unpacking the tgz file into a directory:

        zcat *.faa.tab.gz \
          | awk -F$'\t' 'length($2) > 1200 && length($2) < 1500' \
          | awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > spikeProtein.faa

    The 1500 upper bound is an approximate cutoff that keeps out the
    ORF1a and ORF1ab polyproteins, which run to several thousand AAs;
    adjust both bounds if the spike length in some strains falls outside
    this range.

    You can drop that spikeProtein.faa file into a multiple aligner such as
    'COBALT':
        https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi
    to obtain a multiple alignment of that protein for these sequences.

For a description of multiple alignment format (MAF), see
http://genome.ucsc.edu/goldenPath/help/maf.html.

PhastCons conservation scores for these alignments are available at:
    http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phastCons44way

PhyloP conservation scores for these alignments are available at:
    http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phyloP44way

---------------------------------------------------------------
To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.

Via rsync:

    rsync -avz --progress \
        rsync://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/ ./

Via FTP:

    ftp hgdownload.soe.ucsc.edu
    user name: anonymous
    password:
    go to the directory goldenPath/wuhCor1/multiz44way

To download multiple files from the UNIX command line, use the "mget"
command.

    mget ...
    - or -
    mget -a (to download all the files in the directory)

Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.

---------------------------------------------------------------
All the files in this directory are freely usable for any purpose.
For data use restrictions regarding the individual genome assemblies, see
http://genome.ucsc.edu/goldenPath/credits.html.
---------------------------------------------------------------
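After an rsync or FTP download, the md5sum.txt file listed among the
description files can be used to check the copied files.  The following is
a minimal sketch, not part of the original instructions.  It assumes the
GNU coreutils 'md5sum' command is installed and that the paths listed in
md5sum.txt are relative to the download directory.  The function name
verify_download is hypothetical.

```shell
# Sketch: check downloaded files against md5sum.txt.
# Assumption: GNU coreutils md5sum; paths in md5sum.txt are relative to
# the download directory given as the first argument (default: ".").
verify_download() {
  dir="${1:-.}"
  # Run in a subshell so the caller's working directory is unchanged.
  # md5sum -c prints one status line per file and exits non-zero if any
  # file fails the check or cannot be read.
  ( cd "$dir" && md5sum -c md5sum.txt )
}
```

Run it as 'verify_download ./' from the directory that received the rsync
copy.  If only some of the files were downloaded, entries for the missing
files are reported as failures.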