Gene identification
-------------------

- Gene calling was performed with Prodigal v2.6.3 (Hyatt et al., 2010), and marker genes were identified and aligned using HMMER v3.1b1 (Eddy, 2011). Marker genes and their corresponding HMMs are from the Pfam v27 (Finn et al., 2014) and TIGRFAMs v15.0 (Haft et al., 2003) databases.

Tree inference
--------------

- The bacterial reference tree is inferred with FastTree v2.1.10 under the WAG model from the concatenated alignment of 120 ubiquitous bacterial genes (Parks et al., 2018).
- The archaeal reference tree is inferred with IQ-Tree v1.6.9 under the PMSF model from the concatenated alignment of 122 ubiquitous archaeal genes (Parks et al., 2018), using FastTree v2.1.10 to infer an initial guide tree.

Identifying 16S rRNA sequences
------------------------------

- Sequences are identified using nhmmer v3.1b2 (Wheeler and Eddy, 2013) with the 16S rRNA model (RF00177) from the Rfam database (Kalvari et al., 2018).

Average nucleotide identity
---------------------------

Average nucleotide identity (ANI) and alignment fraction (AF) values were calculated with FastANI v1.3 (Jain et al., 2018).

Updating GTDB species representatives
-------------------------------------

Each GTDB species is defined by a single representative genome, and species assignments are established by considering the ANI and AF to these representative genomes (Parks et al., 2020). Species representatives are re-evaluated with each GTDB release, with an emphasis on retaining existing representatives so they can serve as effective nomenclatural type material. However, the goal of stable representatives must be balanced against the desire to use high-quality genomes as representatives, the incorporation of changing taxonomic opinion, and the correction of identified errors in genome classification or assembly.

GTDB representatives are updated according to two primary principles: i) representatives should be assembled from the type strain of a species whenever possible, and ii) representatives should only be replaced by assemblies of suitably higher overall quality. These two principles are quantitatively defined by the balanced ANI score (BAS), given by:

    BAS = 0.5 * (ANI score) + 0.5 * (quality score)

where the ANI score is 100 - 20 * (100 - ANI to current representative) and the quality score is defined by the criteria given in Table 1. An existing representative is only replaced if the BAS of the candidate genome exceeds the BAS of the current representative by at least 10. Intuitively, the BAS achieves the goal of stable representatives by requiring a new representative to be of increasingly higher quality (as defined by the quality score) the more dissimilar it is from the current representative (as defined by the ANI score). Representatives are also updated when a genome assembly is removed from NCBI or when the underlying assembly is updated at NCBI.

TABLE 1. Criteria used to establish assembly quality score

CRITERIA: SCORE
Type species of genome: 100,000
Effective type strain of species according to NCBI: 10,000
NCBI representative of species: 1,000
Complete genome: 100
CheckM quality estimate: completeness - 5*contamination
MAG or SAG: -100
Contig count: -5 * (no. contigs/100)
Undetermined bases: -5 * (no. undetermined bases/10,000)
Full length 16S rRNA gene: 10

Updating name of GTDB species clusters
--------------------------------------

The names assigned to GTDB species clusters are re-evaluated with each GTDB release, with an emphasis placed on nomenclatural stability. However, names are changed in some cases to reflect changes in taxonomic opinion and/or to correct identified errors in GTDB or NCBI assignments.
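
Returning to the balanced ANI score (BAS) and Table 1 from the section on updating species representatives, the replacement rule can be sketched in Python as follows. This is a minimal illustration and not the GTDB implementation: the `Assembly` class, its field names, and the function names are hypothetical. It also assumes that the contig and undetermined-base penalties use real-valued (not integer) division, that the Table 1 scores are summed, and that the current representative has an ANI of 100 to itself.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    """Hypothetical container for the per-genome values listed in Table 1."""
    type_species_of_genome: bool = False       # first row of Table 1
    ncbi_effective_type_strain: bool = False
    ncbi_representative: bool = False
    complete_genome: bool = False
    completeness: float = 0.0                  # CheckM estimate (percent)
    contamination: float = 0.0                 # CheckM estimate (percent)
    is_mag_or_sag: bool = False
    contig_count: int = 0
    undetermined_bases: int = 0
    full_length_16s: bool = False


def quality_score(a: Assembly) -> float:
    """Sum the Table 1 criteria (assumes the rows are additive)."""
    score = 0.0
    if a.type_species_of_genome:
        score += 100_000
    if a.ncbi_effective_type_strain:
        score += 10_000
    if a.ncbi_representative:
        score += 1_000
    if a.complete_genome:
        score += 100
    score += a.completeness - 5 * a.contamination
    if a.is_mag_or_sag:
        score -= 100
    # Assumption: real-valued division; the text does not say whether to round.
    score -= 5 * (a.contig_count / 100)
    score -= 5 * (a.undetermined_bases / 10_000)
    if a.full_length_16s:
        score += 10
    return score


def ani_score(ani_to_current_rep: float) -> float:
    """ANI score = 100 - 20 * (100 - ANI to current representative)."""
    return 100 - 20 * (100 - ani_to_current_rep)


def balanced_ani_score(ani_to_current_rep: float, a: Assembly) -> float:
    """BAS = 0.5 * (ANI score) + 0.5 * (quality score)."""
    return 0.5 * ani_score(ani_to_current_rep) + 0.5 * quality_score(a)


def should_replace(current: Assembly, candidate: Assembly,
                   ani_candidate_to_current: float) -> bool:
    """Replace only if the candidate's BAS is at least 10 above the current one."""
    # Assumption: the current representative has ANI = 100 to itself.
    current_bas = balanced_ani_score(100.0, current)
    candidate_bas = balanced_ani_score(ani_candidate_to_current, candidate)
    return candidate_bas - current_bas >= 10


# Example with made-up values: a fragmented draft versus a complete genome.
current = Assembly(completeness=95.0, contamination=2.0, contig_count=200)
candidate = Assembly(complete_genome=True, completeness=99.0,
                     contamination=1.0, contig_count=1, full_length_16s=True)
replace = should_replace(current, candidate, ani_candidate_to_current=98.0)
```

In this made-up example the candidate is replaced in (`replace` is `True`) because its much higher quality score outweighs the ANI penalty incurred at 98% ANI. A candidate of only slightly better quality at 99.5% ANI would not clear the threshold of 10.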

Species clusters containing one or more genomes assembled from the type strain of a species are named after the species with nomenclatural priority (Parker et al., 2019), with the generic and specific names changed as necessary to reflect any genus-level reclassifications in the GTDB. Species names identified as synonyms are provided as a separate file in the GTDB repository and updated each release.

Species clusters without a type strain genome are assigned names via a majority voting approach based on NCBI species assignments regarded as correct under the GTDB framework. A genome is considered to have an erroneous NCBI species assignment if a genome assembled from the type strain of this species exists and resides in a different GTDB species cluster. A cluster is assigned a name by majority voting if >50% of genomes in the cluster with a GTDB-validated NCBI name are from a single species and >50% of all genomes with this species classification are in the cluster. Otherwise, the species cluster is assigned an alphanumeric or Latin-suffixed placeholder name. To maximize the stability of GTDB names, placeholder names are not updated to new placeholder names (e.g., Bacillus sp002153395 to B. subtilis_A or vice versa), even if an updated placeholder name might better reflect the current classification of genomes within a cluster.

Species clusters containing an assembly from the type strain of a subspecies, or a subspecies satisfying the majority voting criteria, will have the subspecies name promoted to the specific name of the cluster in cases where a placeholder name would otherwise be required.

Additional information
----------------------

Please consult the following GTDB publications for additional information:

Parks, D.H., et al. (2018). A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, 36: 996-1004.

Parks, D.H., et al. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea.
Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.

Chaumeil, P.-A., et al. (2019). GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848: https://doi.org/10.1093/bioinformatics/btz848.

REFERENCES
----------

Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS Comput Biol 7: e1002195.

Finn RD, et al. 2014. Pfam: The protein families database. Nucleic Acids Res 42: D222-230.

Haft DH, Selengut JD, White O. 2003. The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371-373.

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11: 119.

Jain C, et al. 2018. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9: 5114.

Kalvari I, et al. 2018. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 46: D335-D342.

Parker et al. 2019. International Code of Nomenclature of Prokaryotes. IJSEM 60: doi: 10.1099/ijsem.0.000778.

Parks DH, et al. 2017. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2: 1533-42.

Wheeler TJ, Eddy SR. 2013. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29: 2487-2489.