Ensemble workflow for protein structure prediction

Ensemble workflow for structure prediction of SARS-CoV-2 nsp3. Case-by-case structure prediction protocols are determined by finely parsing each protein sequence using information about the positions of intrinsically disordered regions (IDRs), transmembrane (TM) regions, signal peptides, and templates.


To date, partial or full structures of five proteins from SARS-CoV-2 have been experimentally solved. In view of the urgency to understand the molecular machinery of SARS-CoV-2, we used an ensemble workflow to generate structural models of all unsolved structural and mature nonstructural viral proteins. Because the performance of protein structure prediction methods varies with the complexity of the target, each protein sequence was carefully analyzed to select the best combination of state-of-the-art prediction methods. The resulting models are therefore intended to offer the highest resolution these methods can achieve and as much information as possible about the overall shape of each protein. Here, we provide a synopsis for each of the 27 mature viral proteins along with their structural models and additional important information, such as variability relative to SARS-CoV-1 and potential functional relevance to SARS-CoV-2.

Case-by-case protocols were generated based on a profile extracted from each sequence, consisting of two main factors:

  1. Primary sequence-based information. Residues within conserved domains were identified using Pfam (Finn et al. 2014), and intrinsically disordered regions were identified using IUPred2 (Mészáros, Erdos, and Dosztányi 2018), which relies on the composition of amino acid segments and their tendency to form stable structural motifs. TMHMM (Krogh et al. 2001), which is based on a hidden Markov model, was used to predict helical transmembrane regions. No β-barrel transmembrane proteins are present in SARS-CoV-2.
  2. Availability of experimentally determined structures. PSI-BLAST was used to identify homologs with partial or full structures available in the Protein Data Bank (PDB) that could be used as templates for modeling.
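
As an illustration of how per-residue predictions could be turned into the region annotations that make up such a profile, the following is a minimal sketch, not the authors' code. It assumes the disorder predictor output has already been parsed into a list of per-residue scores between 0 and 1, and it uses a cutoff of 0.5, which is a common convention rather than a value stated here. The function name `regions_above` and the `min_len` parameter are hypothetical.

```python
from typing import List, Tuple


def regions_above(
    scores: List[float], threshold: float = 0.5, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs of residues scoring above threshold.

    `threshold` and `min_len` are illustrative assumptions, not values
    taken from the workflow described in the text.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # first residue of the run currently being tracked
    for i, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at residue i - 1; keep it if it is long enough.
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_len:
        regions.append((start, len(scores)))
    return regions


if __name__ == "__main__":
    toy_scores = [0.1, 0.6, 0.7, 0.2, 0.9]  # made-up values for illustration
    print(regions_above(toy_scores))
```

The same helper could be applied to any per-residue score track, for example posterior probabilities from a transmembrane predictor, provided a suitable cutoff is chosen.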

Several highly conserved SARS-CoV-1 proteins have been solved experimentally and are available for our analysis. To maximize the accurate transfer of information from these structures, amino acid substitutions were analyzed to identify those likely to affect protein conformation. Examples of such changes are the replacement of a hydrophobic side chain by a charged amino acid in the protein core, or a substitution to proline (a helix “breaker”) within a helical structure. When no such substitutions are found and the protein has more than 70% identity to the template, loops and substitutions are locally modeled (LM) using the Rosetta remodel (Huang et al. 2011) and fixbb (Hu et al. 2007; Kuhlman and Baker 2000) applications, respectively. Comparison of recently released crystallographic structures with models generated from carefully analyzed protein sequences, with LM applied to selected regions, suggests that this approach is effective (Prates et al. 2020). Achieving high local resolution, especially at sites of substrate/ligand binding, can considerably improve the results of subsequent molecular docking studies aimed at identifying small-molecule candidates. Although ensemble docking approaches are often applied to account for the conformational flexibility of the protein target, refining the binding site based on structural information from homologs in the holo form, if available, is more suitable for identifying functional complexes.
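
The two example substitution types mentioned above can be expressed as a simple screen over a pairwise alignment. The sketch below is illustrative only and is not the analysis used in this work: the residue groupings (`HYDROPHOBIC`, `CHARGED`), the function name, and the inputs (sets of buried and helical positions taken from the template structure) are all assumptions made for the example.

```python
from typing import List, Set, Tuple

# Simplified residue classes chosen for illustration (assumption).
HYDROPHOBIC = set("AVILMFW")
CHARGED = set("DEKR")


def disruptive_substitutions(
    template: str, target: str, buried: Set[int], helix: Set[int]
) -> List[Tuple[int, str, str, str]]:
    """Flag substitutions that may alter conformation.

    `template` and `target` are aligned sequences of equal length.
    `buried` and `helix` hold 1-based alignment positions that are in the
    protein core or in a helix of the template structure, respectively.
    """
    if len(template) != len(target):
        raise ValueError("aligned sequences must have equal length")
    flagged = []
    for pos, (old, new) in enumerate(zip(template, target), start=1):
        if old == new or "-" in (old, new):
            continue  # skip identical positions and alignment gaps
        if pos in buried and old in HYDROPHOBIC and new in CHARGED:
            flagged.append((pos, old, new, "hydrophobic-to-charged in core"))
        elif pos in helix and new == "P":
            flagged.append((pos, old, new, "proline in helix"))
    return flagged


if __name__ == "__main__":
    # Toy alignment with made-up structural annotations.
    for hit in disruptive_substitutions("ALKVE", "ADKPE", {2}, {3, 4, 5}):
        print(hit)
```

In practice, the buried and helical positions would have to come from a solvent-accessibility and secondary-structure analysis of the template, which this sketch takes as given.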

Homology-based modeling is typically the optimal approach for cases in which the identity to the template is above 30%. The fragment-based (FB) approach of the I-TASSER (Yang et al. 2015) workflow was used in cases where the identity to the template was in the range of 30-70%, and to provide an alternative model to LM in regions of proteins harboring substitutions that would be expected to significantly affect protein conformation. To predict structures for proteins that do not have a crystal structure of a homolog available, we applied the trRosetta (Yang et al. 2020) workflow. Based on benchmarks of the Critical Assessment of Techniques for Protein Structure Prediction (CASP13), trRosetta was designed to achieve sound performance for modeling novel folds by using a deep residual network to predict inter-residue distances and orientations that guide energy minimization. In Prates et al. 2020, we use the analysis of nsp3, the largest mature protein of SARS-CoV-2, as an example to describe the workflow (Figure).
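
The selection rules described above can be summarized as a small decision function. This is a sketch of one reading of the text, not the authors' implementation: the function name and return labels are hypothetical, a template with less than 30% identity is treated like a missing template (an assumption, since the text does not cover that case), and a highly conserved protein with a disruptive substitution is assumed to receive both an LM and an FB model.

```python
from typing import List, Optional


def choose_protocols(
    identity_pct: Optional[float], has_disruptive_substitution: bool
) -> List[str]:
    """Pick modeling protocols from template identity and substitution analysis.

    `identity_pct` is the percent identity to the best template, or None
    when no homolog with a solved structure was found.
    """
    # No usable template: deep-learning-guided modeling (trRosetta).
    # Treating identity below 30% this way is an assumption of this sketch.
    if identity_pct is None or identity_pct < 30:
        return ["trRosetta"]
    if identity_pct > 70:
        if has_disruptive_substitution:
            # FB model offered as an alternative to LM for affected regions.
            return ["LM (Rosetta remodel/fixbb)", "FB (I-TASSER)"]
        return ["LM (Rosetta remodel/fixbb)"]
    # Identity between 30% and 70%: fragment-based modeling.
    return ["FB (I-TASSER)"]


if __name__ == "__main__":
    # Made-up inputs covering each branch.
    for identity, disruptive in [(85.0, False), (85.0, True), (45.0, False), (None, False)]:
        print(identity, disruptive, choose_protocols(identity, disruptive))
```

Returning a list rather than a single label keeps the case where two alternative models are produced for the same protein explicit.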




Finn, Robert D., Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy, Andreas Heger, et al. 2014. “Pfam: The Protein Families Database.” Nucleic Acids Research 42 (Database issue): D222–30.

Huang, Po-Ssu, Yih-En Andrew Ban, Florian Richter, Ingemar Andre, Robert Vernon, William R. Schief, and David Baker. 2011. “RosettaRemodel: A Generalized Framework for Flexible Backbone Protein Design.” PLoS ONE 6 (8): e24109.

Hu, Xiaozhen, Huanchen Wang, Hengming Ke, and Brian Kuhlman. 2007. “High-Resolution Design of a Protein Loop.” Proceedings of the National Academy of Sciences of the United States of America 104 (45): 17668–73.

Krogh, A., B. Larsson, G. von Heijne, and E. L. Sonnhammer. 2001. “Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes.” Journal of Molecular Biology 305 (3): 567–80.

Kuhlman, B., and D. Baker. 2000. “Native Protein Sequences Are Close to Optimal for Their Structures.” Proceedings of the National Academy of Sciences of the United States of America 97 (19): 10383–88.

Mészáros, Bálint, Gábor Erdos, and Zsuzsanna Dosztányi. 2018. “IUPred2A: Context-Dependent Prediction of Protein Disorder as a Function of Redox State and Protein Binding.” Nucleic Acids Research 46 (W1): W329–37.

Prates, Erica Teixeira, Michael R. Garvin, Mirko Pavicic, Piet Jones, Manesh Shah, Christiane Alvarez, David Kainer, et al. 2020. “Functional Immune Deficiency Syndrome via Intestinal Infection in COVID-19.” bioRxiv. https://doi.org/10.1101/2020.04.06.028712.

Yang, Jianyi, Ivan Anishchenko, Hahnbeom Park, Zhenling Peng, Sergey Ovchinnikov, and David Baker. 2020. “Improved Protein Structure Prediction Using Predicted Interresidue Orientations.” Proceedings of the National Academy of Sciences of the United States of America 117 (3): 1496–1503.

Yang, Jianyi, Renxiang Yan, Ambrish Roy, Dong Xu, Jonathan Poisson, and Yang Zhang. 2015. “The I-TASSER Suite: Protein Structure and Function Prediction.” Nature Methods 12 (1): 7–8.