qb3

 

Background and Significance

 

Introduction

The mission of the NIH is to fund and perform basic and applied research that will lead to significant improvements in human health. Our understanding of biological processes at the molecular and cellular levels has advanced enormously. The power of genetics has been extended to a multitude of organisms and throughout the stages of development. From these basic advances, we have an impressive armamentarium of therapeutic agents to treat both chronic and acute diseases, as well as an evolving philosophy of how to approach fundamental challenges of the human condition such as aging. The pharmaceutical and biotechnology industries have produced many new drugs. However, as the constantly increasing costs of health care show, drug discovery is effective but not efficient.

The first step towards more efficient and comprehensive treatment of human health problems is provided by the rising tide of data, including gigabytes of genomic sequences, tens of thousands of macromolecular structures, and terabytes of functional data. The second step must be found in the analysis and use of these data, aiming to control, modify, and design biological systems. The NIH Roadmap is a unique opportunity to develop integrated tools and research programs to model biomolecular processes at the genomic level.

Confluence of Biological Data and Computing Power

Genome-wide and even pan-genomic projects are hallmarks of modern biology [1,2]. These projects have resulted in large and often comprehensive lists of components of important biological systems and their properties [3]. For example, genome sequencing has produced hundreds of complete genomic sequences, including the human genome [4-6]. Functional genomics and proteomics have determined properties of thousands of genes and proteins, such as gene expression patterns [7], protein localizations [8], ligand-binding preferences of proteins [9], and physical interactions between proteins [1,10-12]. Consortia of scientists are also mapping genetic variation among individuals of the same species, including humans [13]. Finally, structural biology, together with structural genomics, is on its way to characterizing the three-dimensional structures of most protein domain families [14-17].

In parallel, developments in computing hardware continue to follow Moore's law, doubling performance every two years. Computer clusters with thousands of nodes and hard disk storage of tens of terabytes are now feasible for relatively small groups of investigators, such as our proposed Center. The individual computing nodes of these clusters now have the speed and memory of earlier large servers. Their connectivity matches the high speed connections of the previous generation of hardware. The combination of cluster computing and faster nodes has increased the computing power available to researchers by two orders of magnitude relative to their share of a system in a national supercomputer center only a few years ago.

Broad Goal

We need to move from a list of system components to the functionality of the system as a whole. The critical link is the quantitation of interactions within individual compartments (e.g., single organelles or cells) and across the full spectrum of genomic variation. We need to store, organize, visualize, and analyze the data, so that we can control, modify, and design the biological systems. The confluence of information provided by experimental biology and the advances in computing hardware provides, for the first time, a realistic opportunity to make significant progress towards this broad goal.

Bottom-Up Approach

The activity of a biological system is determined in large part by interactions involving proteins. As powerful as the experimental approaches are, it is not feasible to find and characterize all relevant interactions by experiment. There are simply too many proteins and potential ligands to measure a large proportion of their interactions by experimental methods, which can be expensive, slow, inaccurate, or inapplicable. The biological sensitivities and complexities of proteins make them notoriously difficult to study in the laboratory. Therefore, we need to augment experiment with a reductionist, bottom-up computational approach that suggests component interactions from simple lists of components and their properties. Only a bottom-up approach will allow us to gain mechanistic insights, to learn about functional implications of structural forms, and to control, modify, and design systems at the molecular level.

Our Center will provide a computational framework to combine experimental data with computation, and thereby achieve greater coverage, accuracy, resolution, and efficiency in the mapping of protein-ligand and protein-protein interactions than either computation or experiment could achieve on its own.

On the computational side, we will combine the best of physics and bioinformatics. We will use both the first principles of molecular mechanics and statistical rules extracted from databases for the key steps in our pipeline, including protein structure modeling, identifying binding sites, and docking. Proteins and their complexes should be understood in terms of both physics and evolution, which involves three-dimensional modeling and docking using physical energy functions and comparative analysis of complexes from an evolutionary point of view.

Structural biology is a great unifying discipline of biology. Structure-based mapping of protein-ligand and protein-protein interactions may be the way to bridge the gaps between genome sequencing, functional genomics, proteomics, and systems biology.

Scope

The numbers of potential small molecule ligands, proteins, nucleic acids, polysaccharides, and their binding sites are very large. Therefore, the scope of the mapping of even just protein-ligand and protein-protein interactions is huge, almost unbounded. As a first step, we will focus our activities on interactions involving only proteins and their small molecule ligands and leave studies of nucleic acids and other biopolymers for the future.

The human genome alone contains approximately 30,000 genes. Some of these genes encode several proteins, as a result of alternative splicing and varying post-translational modifications. Each protein may have several binding sites for other proteins and small ligands. Each binding site may interact with several partners in different environments and during the life of the protein. There are active sites that chemically convert bound substrates, binding sites that trigger regulatory changes in the structure or dynamics of the protein, and interacting sites used for assembling larger machines or structural forms. Binding sites on the same protein can differ among individuals from the same species, which can be important for the drug resistance of pathogens and explains the different responses of patients to the same drug. When we expand our view to include the genomes of biomedically relevant organisms, such as pathogens, model animals, and plants, the number of binding sites becomes staggering. There are over a million protein sequences in the current sequence databases, and correspondingly many millions of binding sites.

The number of potential small ligands is even larger than that of the protein binding sites. There is essentially no upper limit on the variety of small chemicals. For example, databases of human metabolites list tens of thousands of small molecules, drug companies routinely scan millions of compounds in their search for drugs, and combinatorial chemistry libraries may contain billions of different compounds.

In summary, there are millions of binding sites on proteins in the protein sequence databases. These sites need to be located and their binding partners need to be identified among thousands of other proteins and millions of chemical compounds. Thus, automation and large-scale capability are necessary, even to cover islands of the entire "interaction" space.
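To give a rough sense of this scale, the following sketch estimates the cost of exhaustively evaluating every site-compound pair. Only the orders of magnitude (millions of binding sites, millions of compounds) are taken from the text above. The cost of one second per pair, the cluster size of 1,000 nodes, and the function name `screening_cost` are illustrative assumptions, not figures or methods from this proposal.

```python
# Back-of-envelope estimate of the protein-ligand "interaction" space.
# Assumptions (hypothetical, for illustration only):
#   - 1 second of compute per site-compound evaluation
#   - a cluster of 1,000 nodes with perfect parallel scaling

SECONDS_PER_YEAR = 365 * 24 * 3600


def screening_cost(n_sites, n_compounds, seconds_per_pair, n_nodes):
    """Return (pairs, cpu_years, wall_clock_years) for an exhaustive screen."""
    pairs = n_sites * n_compounds
    cpu_years = pairs * seconds_per_pair / SECONDS_PER_YEAR
    wall_clock_years = cpu_years / n_nodes
    return pairs, cpu_years, wall_clock_years


pairs, cpu_years, wall_clock_years = screening_cost(
    n_sites=1_000_000,       # "millions of binding sites" (lower bound)
    n_compounds=1_000_000,   # "millions of chemical compounds" (lower bound)
    seconds_per_pair=1.0,    # assumed cost of one evaluation
    n_nodes=1_000,           # assumed cluster size
)

print(f"{pairs:.1e} pairs")
print(f"{cpu_years:,.0f} CPU-years in total")
print(f"{wall_clock_years:.1f} years on the assumed cluster")
```

Even under these optimistic assumptions, an exhaustive screen would take on the order of tens of thousands of CPU-years, which is why the interaction space can only be covered in selected islands.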

Of course, we cannot study all possible interactions. Our plan is to provide tools and to select questions that can be addressed in significant detail and rigor to provide practical answers to specific problems.

Our software and hardware infrastructure will produce results for many years to come, extending the impact of the program beyond the duration of the funding period.

Significance

This proposal is ambitious, extending far beyond detailed studies of individual systems. Our pipeline will be both automated and applicable on a large scale. As a result, it will offer many advantages over current one-at-a-time strategies. First, there are potentially great economies of scale because the component databases can be prepared once and updated as needed, independently of the number of applications. Second, modular design will permit new algorithms and new software packages to be added and tested in a facile and uniform way. Third, for the first time, statistics will be gathered over all relevant protein interactions, offering major improvements in signal-to-noise ratio in identifying both promiscuous ligands that lack specificity and highly selective agents that might be superior candidates for further study. We recognize that the lasting value of our platform comes from its applications to important problems chosen by biological experts. Therefore, we have included a selection of these as key projects in both Core 1&2 and Core 3.

The Center will have an impact on critical problems in the biomedical sciences: molecular recognition, elucidation of structures of proteins and their complexes, cell trafficking and signaling, functional annotation of open reading frames, organization of complex cell pathways, drug target discovery, lead compound discovery and optimization for treatment of human disease, prediction of drug-drug interactions, functional pharmacogenomics, and pharmacogenetics.

Our proposal is timely because of the confluence of new genomic data, new expansions in the scales of experiment, and new computational methods. The proposed Center is well suited to the current RFA. It would be very difficult to find support for such an enterprise through traditional mechanisms. Our goal is ambitious, but the prize will be more than commensurate with the effort invested.

 


Copyright 2003-2004 CCPR