The mission of the NIH is to fund and perform basic and applied research
that will lead to significant improvements in human health. Our understanding
of biological processes at the molecular and cellular levels has advanced
enormously. The power of genetics has been extended to a multitude of organisms
and throughout the stages of development. From these basic advances, we
have an impressive armamentarium of therapeutic agents to treat both chronic
and acute diseases, as well as an evolving philosophy of how to approach
fundamental challenges of the human condition such as aging. The pharmaceutical
and biotechnology industries have produced many new drugs. However, as the
constantly increasing costs of health care show, drug discovery is effective
but not efficient.
The first step towards more efficient and comprehensive treatment of human
health problems is provided by the rising tide of data, including gigabytes
of genomic sequences, tens of thousands of macromolecular structures, and
terabytes of functional data. The second step must be found in the analysis
and use of these data, aiming to control, modify, and design biological systems.
The NIH Roadmap is a unique opportunity to develop integrated tools and
research programs to model the biomolecular processes on a genomic level.
Genome-wide and even pan-genomic projects are hallmarks of modern biology [1,2].
These projects have resulted in large and often comprehensive lists of
components of important biological systems and their properties [3]. For
example, genome sequencing has produced hundreds of complete genomic sequences,
including the human genome [4-6]. Functional genomics and proteomics have
determined properties of thousands of genes and proteins, such as gene
expression patterns [7], protein localizations [8], ligand-binding preferences
of proteins [9], and physical interactions between proteins [1,10-12].
Consortia of scientists are also mapping genetic variation among individuals
of the same species, including humans [13]. Finally, structural biology,
together with structural genomics, is on its way to characterizing the
three-dimensional structures of most protein domain families [14-17].
In parallel, developments in computing hardware continue to follow Moore's
law, doubling performance every two years. Computer clusters with thousands
of nodes and hard disk storage of tens of terabytes are now feasible for
relatively small groups of investigators, such as our proposed Center. The
individual computing nodes of these clusters now have the speed and memory
of earlier large servers. Their connectivity matches the high-speed connections
of the previous generation of hardware. The combination of cluster computing
and faster nodes has increased the computing power available to researchers
by two orders of magnitude relative to their share of a system in a national
supercomputer center only a few years ago.
We need to move from a list of system components to the functionality of
the system as a whole. The critical link is the quantitation of
interactions within individual compartments (e.g., single organelles
or cells) and across the full spectrum of genomic variation. We need to
store, organize, visualize, and analyze the data, so that we can control,
modify, and design the biological systems. The confluence of information
from experimental biology with the advances in computing hardware
provides, for the first time, a realistic opportunity to make significant
progress towards this broad goal.
The activity of a biological system is determined in large part by interactions
involving proteins. As powerful as the experimental approaches are, it is
not feasible to find and characterize all relevant interactions by experiment.
There are simply too many proteins and potential ligands to measure a large
proportion of their interactions by experimental methods, which can be expensive,
slow, inaccurate, or inapplicable. The biological sensitivities and complexities
of proteins make them notoriously difficult to study in the laboratory.
Therefore, we need to augment experiment with a reductionist, bottom-up
computational approach that suggests component interactions from simple
lists of components and their properties. Only a bottom-up approach will
allow us to gain mechanistic insights, to learn about functional implications
of structural forms, and to control, modify, and design systems at the molecular
level.
Our Center will provide a computational framework to combine experimental
data with computation, and thereby achieve greater coverage, accuracy, resolution,
and efficiency in the mapping of protein-ligand and protein-protein interactions
than either computation or experiment could achieve on its own.
On the computational side, we will combine the best of physics and bioinformatics.
We will use both the first principles of molecular mechanics and statistical
rules extracted from databases for the key steps in our pipeline, including
protein structure modeling, identifying binding sites, and docking. Proteins
and their complexes should be understood in terms of both physics and evolution,
which involves three-dimensional modeling and docking using physical energy
functions and comparative analysis of complexes from an evolutionary point
of view.
Structural biology is a great unifying discipline of biology. Structure-based
mapping of protein-ligand and protein-protein interactions may be the way
to bridge the gaps between genome sequencing, functional genomics, proteomics,
and systems biology.
The numbers of potential small molecule ligands, proteins, nucleic acids,
polysaccharides, and their binding sites are very large. Therefore, the
scope of the mapping of even just protein-ligand and protein-protein interactions
is huge, almost unbounded. As a first step, we will focus our activities
on interactions involving only proteins and their small molecule ligands
and leave studies of nucleic acids and other biopolymers for the future.
The human genome alone contains approximately 30,000 genes. Some of these
genes encode several proteins, as a result of alternative splicing and
varying post-translational modifications. Each protein may have several
binding sites for other proteins and small ligands. Each binding site may
interact with several partners in different environments and during the
life of the protein. There are active sites that chemically convert bound
substrates, binding sites that trigger regulatory changes in the structure
or dynamics of the protein, and interacting sites used for assembling larger
machines or structural forms. Binding sites on the same protein can differ
among individuals of the same species, which can be important for the
drug resistance of pathogens and can explain the different responses of patients
to the same drug. When we expand our view to include the genomes of biomedically
relevant organisms, such as pathogens, model animals, and plants, the number
of binding sites becomes staggering. There are over a million protein sequences
in the current sequence databases, and correspondingly many millions of
binding sites.
The number of potential small ligands is even larger than that of the protein
binding sites. There is essentially no upper limit on the variety of small
chemicals. For example, databases of human metabolites list tens of thousands
of small molecules, drug companies routinely screen millions of compounds
in their search for drugs, and combinatorial chemistry libraries may contain
billions of different compounds.
In summary, there are millions of binding
sites on proteins in the protein sequence databases. These sites
need to be located and their binding partners need to be identified
among thousands of other proteins and millions of chemical compounds.
Thus, automation and large-scale capability are necessary, even to cover
islands of the entire "interaction" space.
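
The scale argument above can be made concrete with a back-of-the-envelope calculation. The following is only an illustrative sketch: the protein and ligand counts are the order-of-magnitude figures quoted in the text, while the number of sites per protein, the docking throughput, and the cluster size are assumptions chosen for the example, not measured values.

```python
# Rough size of the protein-ligand "interaction" space.
# Counts marked "assumption" are illustrative and do not come from the text.

PROTEIN_SEQUENCES = 1_000_000        # text: "over a million protein sequences"
SITES_PER_PROTEIN = 3                # assumption: "several" sites per protein
CANDIDATE_LIGANDS = 1_000_000        # text: "millions of compounds" (lower bound)
DOCKINGS_PER_SECOND_PER_NODE = 1.0   # assumption: hypothetical throughput
NODES = 1_000                        # text: "thousands of nodes" (lower bound)

SECONDS_PER_YEAR = 365 * 24 * 3600


def pair_count(proteins: int, sites_per_protein: int, ligands: int) -> int:
    """Number of (binding site, ligand) pairs in an exhaustive scan."""
    return proteins * sites_per_protein * ligands


def cluster_years(pairs: int, rate_per_node: float, nodes: int) -> float:
    """Wall-clock years needed to evaluate every pair once on the cluster."""
    seconds = pairs / (rate_per_node * nodes)
    return seconds / SECONDS_PER_YEAR


pairs = pair_count(PROTEIN_SEQUENCES, SITES_PER_PROTEIN, CANDIDATE_LIGANDS)
years = cluster_years(pairs, DOCKINGS_PER_SECOND_PER_NODE, NODES)
print(f"{pairs:.1e} site-ligand pairs, about {years:.0f} cluster-years")
```

Under these assumed numbers an exhaustive scan involves 3 x 10^12 site-ligand pairs and would occupy the hypothetical cluster for roughly a century, which illustrates why only selected islands of the space can be covered, and only with automation.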
Of course, we cannot study all possible interactions. Our plan is to provide
tools and to select questions that can be addressed with significant detail
and rigor, yielding practical answers to specific problems.
Our software and hardware infrastructure will produce results for many
years to come, extending the impact of the program beyond the duration of
the funding period.
This proposal is ambitious, extending far beyond detailed studies of
individual systems. Our pipeline will be both automated and applicable
on a large scale. As a result, it will offer many advantages over
current one-at-a-time strategies.
First, there are potentially great economies of scale because the component
databases can be prepared once and updated as needed, independently
of the number of applications. Second, modular design will permit new
algorithms and new software packages to be added and tested in a facile
and uniform way. Third, for the first time, statistics will be gathered
over all relevant protein interactions, offering major improvements
in signal-to-noise ratio in identifying both promiscuous ligands that lack
specificity and highly selective agents that might be superior candidates
for further study. We recognize that the lasting value of our platform
comes from its applications to important problems chosen by biological
experts. Therefore, we have included a selection of these as key projects
in both Core 1&2 and Core 3.
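
As a minimal sketch of the third advantage, the fragment below shows how statistics gathered across many proteins could separate promiscuous ligands from selective ones. The score scale, the threshold, the 50% promiscuity cutoff, and all ligand and protein names are hypothetical and chosen only for illustration; they do not describe the Center's actual pipeline or data.

```python
from collections import defaultdict

# Toy predicted binding scores, keyed by (ligand, protein). Invented values.
scores = {
    ("ligA", "P1"): 0.9, ("ligA", "P2"): 0.8, ("ligA", "P3"): 0.7, ("ligA", "P4"): 0.2,
    ("ligB", "P1"): 0.1, ("ligB", "P2"): 0.95, ("ligB", "P3"): 0.3, ("ligB", "P4"): 0.2,
    ("ligC", "P1"): 0.1, ("ligC", "P2"): 0.2, ("ligC", "P3"): 0.1, ("ligC", "P4"): 0.3,
}


def hit_counts(scores, threshold):
    """For each ligand, count the proteins it binds at or above the threshold."""
    counts = defaultdict(int)
    for (ligand, _protein), score in scores.items():
        counts[ligand] += int(score >= threshold)
    return dict(counts)


def classify(counts, n_proteins, promiscuous_fraction=0.5):
    """Label each ligand by how widely it binds across the protein panel."""
    labels = {}
    for ligand, hits in counts.items():
        if hits == 0:
            labels[ligand] = "non-binder"
        elif hits == 1:
            labels[ligand] = "selective"
        elif hits / n_proteins >= promiscuous_fraction:
            labels[ligand] = "promiscuous"
        else:
            labels[ligand] = "intermediate"
    return labels


n_proteins = len({protein for _ligand, protein in scores})
counts = hit_counts(scores, threshold=0.6)
labels = classify(counts, n_proteins)
print(labels)
```

A ligand screened against a single target cannot be labeled in this way; the distinction becomes visible only when the same ligand is scored against the whole panel, which is the point of gathering statistics over all relevant interactions.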
The Center will have an impact on critical
problems in biomedical sciences: molecular recognition, elucidation
of structures of proteins and their complexes, cell trafficking and
signaling, functional annotation of open reading frames, organization
of complex cell pathways, drug target discovery, lead compound discovery
and optimization for treatment of human disease, prediction of
drug-drug interactions, functional pharmacogenomics, and pharmacogenetics.
Our proposal is timely because of the confluence of new genomic data, new
expansions in the scales of experiment, and new computational methods. The
proposed Center is well suited to the current RFA. It would be very difficult
to find support for such an enterprise through traditional mechanisms. Our
goal is ambitious, but the prize will be more than commensurate with the effort
invested.