This directory contains phylogenetic trees of fully public SARS-CoV-2 sequences
updated daily:

* public-latest.all.masked.pb[.gz]
  Protobuf file for use with usher --load-mutation-annotated-tree

* public-latest.all.masked.vcf.gz
  Variant Call Format (VCF) file containing mutations in public sequences,
  generated from public-latest.all.masked.pb with matUtils extract.
  Missing or ambiguous bases have been imputed by usher to the most
  parsimonious base value [ACGT] at the time each sequence was placed in
  the tree.

* public-latest.all.nwk.gz
  Newick tree file (usher's uncondensed-final-tree.nh output)

* public-latest.metadata.tsv.gz
  Information about each public sequence e.g. collection date, location,
  Nextstrain clade and Pango lineage. Dates and locations are not available
  for some sequences.

* public-latest.all.masked.ShUShER.pb.gz
  Size-limited version of public-latest.all.masked.pb.gz, randomly
  downsampled to 6000000 sequences in order to prevent ShUShER from
  exceeding web browser memory limits.

* public-latest.version.txt
  A brief description including date, sources and number of sequences.

Previous versions of the files are available in year/month/day directories:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2021/

The trees encoded in the Newick and protobuf files are derived from releases of
Rob Lanfear's sarscov2phylo (https://github.com/roblanf/sarscov2phylo),
pruned to include only public sequences aggregated from GenBank, COG-UK, and
the China National Center for Bioinformation, mapped to GISAID EPI_ISL_* IDs
used in the sarscov2phylo tree files. The trees have also been re-rooted to
Wuhan/Hu-1 (GenBank MN908947.3, RefSeq NC_045512.2), and nodes with no
associated mutations have been collapsed. Sequences released after the final
sarscov2phylo release (Nov. 13, 2020) were added to the tree using UShER.

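As an illustration of working with the metadata file listed above, the
following is a minimal Python sketch that tallies the values found in one
column of a gzipped tab-separated file. It assumes only that the file has a
header row; the function name count_column_values and the "(missing)" label
are hypothetical, and the column name must be taken from the actual header
of the downloaded file.

```python
import csv
import gzip
from collections import Counter


def count_column_values(path, column):
    """Count how often each value appears in one column of a gzipped TSV.

    Hypothetical helper for illustration; `column` must match a name in the
    file's header row, which is not specified in this README.
    """
    counts = Counter()
    # "rt" opens the gzip stream as text; newline="" leaves line handling to csv.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text in TSV fields.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(
                f"column {column!r} not found; header is {reader.fieldnames}"
            )
        for row in reader:
            # Empty fields (e.g. unavailable dates or locations) are grouped
            # under a single placeholder label.
            counts[row[column] or "(missing)"] += 1
    return counts
```

For example, calling count_column_values("public-latest.metadata.tsv.gz", name)
with name set to the header's lineage column would give a per-lineage count
of public sequences. The function raises KeyError and reports the header it
found if the requested column is absent.
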
A file that maps GISAID EPI_ISL_* IDs to public sequence IDs may be downloaded
from
https://github.com/CDCgov/SARS-CoV-2_Sequencing/blob/master/files/epiToPublic.tsv.gz

GenBank sequences and metadata may be downloaded from
https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049

COG-UK sequences and metadata may be downloaded from the "Latest Sequence Data"
section of
https://www.cogconsortium.uk/tools-analysis/public-data-analysis-2/

The China National Center for Bioinformation offers additional sequences and
metadata:
https://bigd.big.ac.cn/ncov/release_genome

A more extensive archive of public sequence trees is available in
year/month/day directories (note that methods changed slightly over time and
the trees may be lower quality than the more recent trees on
hgdownload.soe.ucsc.edu):
https://hgwdev.gi.ucsc.edu/~angie/UShER_SARS-CoV-2/2020/
https://hgwdev.gi.ucsc.edu/~angie/UShER_SARS-CoV-2/2021/

The DD-MM-YY release labels used in 2020/10/ and 2020/11/ subdirectories
correspond to sarscov2phylo releases:
https://github.com/roblanf/sarscov2phylo/releases

This work is made possible by the open sharing of genetic data by research
groups from all over the world. We gratefully acknowledge the authors and the
originating laboratories where the clinical specimen or virus isolate was first
obtained and the submitting laboratories, where sequence data have been
generated and submitted to public databases, on which this research is based.

If you use these files please cite McBroome et al. (2021). A Daily-Updated
Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees.
https://academic.oup.com/mbe/article/38/12/5819/6361626
Please also acknowledge the submitters of SARS-CoV-2 sequences to public
databases.