To see how the interactions of biology and mathematics may proceed in the future, it is helpful to map the present landscapes of biology and applied mathematics.
The biological landscape may be mapped as a rectangular table with different rows for different questions and different columns for different biological domains. Biology asks six kinds of questions. How is it built? How does it work? What goes wrong? How is it fixed? How did it begin? What is it for? These are questions, respectively, about structures, mechanisms, pathologies, repairs, origins, and functions or purposes. The former teleological interpretation of purpose has been replaced by an evolutionary perspective. Biological domains, or levels of organization, include molecules, cells, tissues, organs, individuals, populations, communities, ecosystems or landscapes, and the biosphere. Many biological research problems can be classified as the combination of one or more questions directed to one or more domains.
In addition, biological research questions have important dimensions of time and space. Timescales of importance to biology range from the extremely fast processes of photosynthesis to the billions of years of the evolution of life on Earth. Relevant spatial scales range from the molecular to the cosmic (cosmic rays may have played a role in evolution on Earth). The questions and the domains of biology behave differently on different temporal and spatial scales. The opportunities and the challenges that biology offers mathematics arise for two reasons. First, the units at any given level of biological organization are heterogeneous. Second, the outcomes of their interactions (sometimes called “emergent phenomena” or “ensemble properties”) on any selected temporal and spatial scale may be substantially affected by the heterogeneity and interactions of biological components at lower and higher levels of biological organization and at smaller and larger temporal and spatial scales (Anderson 1972, 1995).
The landscape of applied mathematics is better visualized as a tetrahedron (a pyramid with a triangular base) than as a matrix with temporal and spatial dimensions. (Mathematical imagery, such as a tetrahedron for applied mathematics and a matrix for biology, is useful even in trying to visualize the landscapes of biology and mathematics.) The four main points of the applied mathematical landscape are data structures, algorithms, theories and models (including all pure mathematics), and computers and software. Data structures are ways to organize data, such as the matrix used above to describe the biological landscape. Algorithms are procedures for manipulating symbols. Some algorithms are used to analyze data, others to analyze models. Theories and models, including the theories of pure mathematics, are used to analyze both data and ideas. Mathematics and mathematical theories provide a testing ground for ideas in which the strength of competing theories can be measured. Computers and software are an important, and frequently the most visible, vertex of the applied mathematical landscape. However, cheap, easy computing increases the importance of theoretical understanding of the results of computation. Theoretical understanding is required as a check on the great risk of error in software, and to bridge the enormous gap between computational results and insight or understanding.
The landscape of research in mathematics and biology contains all combinations of one or more biological questions, domains, timescales, and spatial scales with one or more data structures, algorithms, theories or models, and means of computation (typically software and hardware). The following example from cancer biology illustrates such a combination: the question “How does it work?” is approached in the domain of cells (specifically, human cancer cells) with algorithms for correlation and hierarchical clustering.
Gene expression and drug activity in human cancer.
Suppose a person has a cancer. Could information about the activities of the genes in the cells of the person's cancer guide the use of cancer-treatment drugs so that more effective drugs are used and less effective drugs are avoided? To suggest answers to this question, Scherf et al. (2000) ingeniously applied off-the-shelf mathematics, specifically, correlation—invented nearly a century earlier by Karl Pearson (Pearson and Lee 1903) in a study of human inheritance—and clustering algorithms, which apparently had multiple sources of invention, including psychometrics (Johnson 1967). They applied these simple tools to extract useful information from, and to combine for the first time, enormous databases on molecular pharmacology and gene expression (http://discover.nci.nih.gov/arraytools/). They used two kinds of information from the drug discovery program of the National Cancer Institute. The first kind of information described gene expression in 1,375 genes of each of 60 human cancer cell lines. A target matrix T had, as the numerical entry in row g and column c, the relative abundance of the mRNA transcript of gene g in cell line c. The drug activity matrix A summarized the pharmacology of 1,400 drugs acting on each of the same 60 human cancer cell lines, including 118 drugs with “known mechanism of action.” The number in row d and column c of the drug activity matrix A was the activity of drug d in suppressing the growth of cell line c, or, equivalently, the sensitivity of cell line c to drug d. The target matrix T for gene expression contained 82,500 numbers, while the drug activity matrix A had 84,000 numbers.
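The layout of the two matrices can be sketched in a few lines of Python. This is only an illustration of the data structure described above: the labels (`gene_0`, `drug_0`, `cell_line_0`) are hypothetical, the entries are placeholder zeros rather than measurements, and nested lists are an assumed representation, not the format of the NCI databases.

```python
# Dimensions as stated in the text; all labels below are hypothetical.
N_GENES, N_DRUGS, N_CELL_LINES = 1375, 1400, 60

cell_lines = [f"cell_line_{c}" for c in range(N_CELL_LINES)]  # shared columns
genes = [f"gene_{g}" for g in range(N_GENES)]                 # rows of T
drugs = [f"drug_{d}" for d in range(N_DRUGS)]                 # rows of A

# T[g][c]: relative abundance of the mRNA transcript of gene g in cell line c.
T = [[0.0] * N_CELL_LINES for _ in genes]
# A[d][c]: activity of drug d in suppressing the growth of cell line c.
A = [[0.0] * N_CELL_LINES for _ in drugs]


def entry_count(matrix):
    """Total number of entries in a matrix stored as a list of rows."""
    return sum(len(row) for row in matrix)


print(entry_count(T))  # 1,375 genes x 60 cell lines = 82500
print(entry_count(A))  # 1,400 drugs x 60 cell lines = 84000
```

Because both matrices are indexed by the same 60 cell lines, any row of T can be paired entry by entry with any row of A.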
These two matrices have the same set of column headings but have different row labels. Given the two matrices, precisely five sets of possible correlations could be calculated, and Scherf et al. calculated all five. (1) The correlation between two different columns of the activity matrix A led to a clustering of cell lines according to their similarity of response to different drugs. (2) The correlation between two different columns of the target matrix T led to a clustering of the cell lines according to their similarity of gene expression. This clustering differed very substantially from the clustering of cell lines by drug sensitivity. (3) The correlation between different rows of the activity matrix A led to a clustering of drugs according to their activity patterns across all cell lines. (4) The correlation between different rows of the target matrix T led to a clustering of genes according to the pattern of mRNA expressed across the 60 cell lines. (5) Finally, the correlation between a row of the activity matrix A and a row of the target matrix T described the positive or negative covariation of drug activity with gene expression. A positive correlation meant that the higher the level of gene expression across the 60 cancer cell lines, the higher the effectiveness of the drug in suppressing the growth of those cell lines. The result of analyzing several hundred thousand experiments is summarized in a single picture called a clustered image map (Figure 1). This clustered image map plots gene expression–drug activity correlations as a function of clustered genes on the horizontal axis and clustered drugs on the vertical axis, where only the 118 drugs with “known function” are shown (Weinstein et al. 1997).
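The five sets of correlations can be illustrated with a minimal sketch in plain Python. It makes several assumptions that are not from Scherf et al.: the data are small random stand-ins for the real matrices, the clustering is a simple average-linkage procedure, and the distance between two items is taken to be 1 - r, one common choice. The function names are illustrative only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlate_rows(X, Y):
    """Correlate every row of X with every row of Y."""
    return [[pearson(x, y) for y in Y] for x in X]


def transpose(M):
    """Swap rows and columns so that columns can be correlated as rows."""
    return [list(column) for column in zip(*M)]


def average_linkage(D):
    """Agglomerative clustering on a distance matrix D.

    Returns the merges in order as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(D[a][b] for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Synthetic stand-ins (random numbers, not NCI data), much smaller than the
# real 1,375 x 60 target matrix and 1,400 x 60 activity matrix.
random.seed(0)
n_genes, n_drugs, n_lines = 8, 6, 10
T = [[random.gauss(0, 1) for _ in range(n_lines)] for _ in range(n_genes)]
A = [[random.gauss(0, 1) for _ in range(n_lines)] for _ in range(n_drugs)]
A_cols, T_cols = transpose(A), transpose(T)

five_sets = {
    1: correlate_rows(A_cols, A_cols),  # cell lines, by drug response
    2: correlate_rows(T_cols, T_cols),  # cell lines, by gene expression
    3: correlate_rows(A, A),            # drugs, by activity pattern
    4: correlate_rows(T, T),            # genes, by expression pattern
    5: correlate_rows(A, T),            # drug activity versus gene expression
}

# Assumed distance for clustering: 1 - r, so highly correlated drugs are close.
drug_distance = [[1 - r for r in row] for row in five_sets[3]]
drug_merges = average_linkage(drug_distance)

for key, C in five_sets.items():
    print(key, len(C), "x", len(C[0]))
```

Sets (1) to (4) are square and symmetric because each correlates a matrix with itself. Set (5) is rectangular, with one row per drug and one column per gene, and it is this drug-by-gene table that a clustered image map displays after its rows and columns have been reordered by clustering.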
What use is this? If a person's cancer cells have high expression for a particular gene, and the correlation of that gene with drug activity is highly positive, then that gene may serve as a marker for tumor cells likely to be inhibited effectively by that drug. If the correlation with drug activity is negative, then the marker gene may indicate when use of that drug is contraindicated.
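The reasoning in the preceding paragraph can be written as a small decision rule. The sketch below is hypothetical: the threshold of 0.5, the gene and drug names, and the correlation values are all assumptions chosen for illustration, not values from Scherf et al., and the output is a hypothesis to test rather than clinical guidance.

```python
def marker_hypothesis(correlation, expression_is_high, threshold=0.5):
    """Turn a gene-drug correlation into a hypothesis about one tumor.

    `threshold` is an assumed cutoff for calling a correlation strong."""
    if not expression_is_high:
        return "no prediction from this gene"
    if correlation >= threshold:
        return "candidate marker: drug likely to inhibit"
    if correlation <= -threshold:
        return "candidate marker: drug may be contraindicated"
    return "correlation too weak to suggest a hypothesis"


# Made-up correlations for hypothetical gene-drug pairs.
examples = {
    ("gene_X", "drug_P"): 0.72,
    ("gene_X", "drug_Q"): -0.64,
    ("gene_X", "drug_R"): 0.10,
}

for (gene, drug), r in examples.items():
    print(gene, drug, r, marker_hypothesis(r, expression_is_high=True))
```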
While important scientific questions about this approach remain open, its usefulness in generating hypotheses to be tested by further experiments is obvious. It is a very insightful way of organizing and extracting meaning from many individual observations. Without the microscope of mathematical methods and computational power, the insight given by the clustered image map could not be achieved.
Figure 1. Clustered Image Map of Gene Expression–Drug Activity Correlations
Plotted as a function of 1,376 clustered genes (x-axis) and 118 clustered drugs (y-axis). From http://discover.nci.nih.gov/external/CIM_example3/cgi_user_matrix.html (updated 27 April 2000; accessed 7 October 2004). This image is more recent than the published image (Scherf et al. 2000). Used by permission of John N. Weinstein.