|dc.description.abstract||Proteins perform a tremendous array of finely tuned functions that are not only critical in living organisms but can also be used for industrial and medical purposes. The ability to rationally design these molecular machines could provide a wealth of opportunities, for example to improve human health and to expand the range and reduce the cost of many industrial chemical processes. The modularity of a protein sequence, combined with many degrees of structural freedom, yields a problem that can frequently be best tackled using computational methods. These computational methods, which include bioinformatics analysis, molecular dynamics, empirical forcefields, statistical potentials, and machine learning approaches, amongst others, are collectively known as Computational Protein Design (CPD). Here CPD is examined from the perspective of four different goals: successful design of an intended structure, the prediction of folding and unfolding kinetics from structure (kinetic stability in particular), engineering of improved stability, and prediction of binding sites and energetics.
A considerable proportion of protein folds, and the majority of the most common folds ("superfolds"), are internally symmetric, suggesting emergence from an ancient repetition event. CPD, an increasingly popular and successful method for generating de novo folded sequences and topologies, suffers from exponential scaling of complexity with protein size. Thus, the overwhelming majority of successful designs are of relatively small proteins (< 100 amino acids). Designing proteins composed of repeated modular elements allows the design space to be partitioned into more manageable portions. Here, a bioinformatics analysis of a "superfold", the beta-trefoil, demonstrated that formation of a globular fold via repetition was not only an ancient event, but is also an ongoing means of generating diverse and functional sequences. Modular repetition also promotes rapid evolution for binding multivalent targets in the "evolutionary arms race" between host and pathogen. Finally, modular repetition was used to successfully design, on the first attempt, a well-folded and functional beta-trefoil, called ThreeFoil.
Improving protein design requires understanding the outcomes of design and not simply the 3D structure. To this end, I undertook an extensive biophysical characterization of ThreeFoil, with the key finding that its unfolding is extraordinarily slow, with a half-life of almost a decade. This kinetic stability grants ThreeFoil near-immunity to common denaturants as well as high resistance to proteolysis. A large-scale analysis of hundreds of proteins, together with coarse-grained modelling of ThreeFoil and other beta-trefoils, indicates that high kinetic stability results from a folded structure rich in contacts between residues distant in sequence (long-range contacts). Furthermore, an analysis of unrelated proteins known to have similar protease resistance demonstrates that the topological complexity resulting from these long-range contacts may be a general mechanism by which proteins remain folded in harsh environments.
Despite the remarkable kinetic stability of ThreeFoil, it has only moderate thermodynamic stability. I sought to improve this in order to provide a stability buffer for future functional engineering and mutagenesis. Numerous computational tools that predict stability change upon point mutation were used, and 10 mutations were made based on their recommendations. Despite claims of >80% accuracy for these predictions, only 2 of the 10 mutations were stabilizing. An in-depth analysis of more than 20 such tools shows that, to a large extent, while they are capable of recognizing highly destabilizing mutations, they are unable to distinguish between moderately destabilizing and stabilizing mutations.
Designing protein structure tests our understanding of the determinants of protein folding, but useful function is often the final goal of protein engineering. I explored protein-ligand binding using molecular dynamics for several protein-ligand systems involving both flexible ligand binding to deep pockets and more rigid ligand binding to shallow grooves. I also used various levels of simulation complexity, from gas-phase, to implicit solvent, to fully explicit solvent, and various simulation strategies, from simple equilibrium simulations that interrogate known interactions to more complex energetically biased simulations that explore diverse configurations and yield novel information.||en