by Steen Knudsen. John Wiley & Sons (2002). 144 pages. ISBN 0471224901. US$44.95
DNA microarrays have revolutionized biology. Instead of studying one gene or one protein at a time, scientists are now studying many simultaneously. This global approach has created many new opportunities to study human disease. For example, a number of microarray studies have demonstrated the existence of different clinical subtypes of cancer with different prognoses from those identified by other methods (Alizadeh et al., 2000; Bittner et al., 2000). Many biologists have jumped on this bandwagon and started performing their own microarray experiments. However, data analysis can often be confusing because, on the one hand, the field is evolving quickly and, on the other, modern data mining techniques can appear daunting and intractable. Although there are several microarray books on the market, and a few are dedicated to data analysis (Leung, 2002), there is no single book tailored to biologists.
A Biologist's Guide to Analysis of DNA Microarray Data is a good starting point for biologists new to data analysis. Written by Steen Knudsen, the book is composed of 14 chapters. It starts with an introductory chapter explaining the main principles and usage of DNA microarrays. This is followed by a chapter presenting an overview of data analysis, in which all the methods are summarized in a simple flowchart. This useful chart clearly shows the basic workflow of microarray data analysis and includes the experimental setups.
Most contemporary data analysis methods are discussed in chapters 3-8, in which the underlying principles are illustrated with vivid and simple examples. In chapter 3, Knudsen describes the basic data analysis methods, including scaling the measurements in the sample and the control, calculating the change in expression level of a gene and determining the significance of that change by using Student's t-tests, ANOVA or non-parametric tests. Potential problems, such as outliers and multiple testing, are discussed. Chapters 4-8 introduce various data processing and mining methods: principal component analysis for dimensionality reduction; cluster analysis, including hierarchical clustering, K-means clustering and self-organizing maps; various distance measures and their effects on data clustering; normalization methods to correct systematic biases; mining functions of orphan proteins and regulatory relationships between genes; reverse engineering of regulatory networks by time-series and steady-state approaches; and constructing molecular classifiers, including nearest neighbor, neural networks and support vector machines. The constraints of these data analysis methods are emphasized and discussed in detail. This is very helpful for those without a strong background in statistics, as the limitations of statistical analysis methods are often overlooked.
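To give a flavor of the chapter 3 workflow, the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the book (which uses Awk and R): all function names are hypothetical, the scaling is a simple global rescaling to the control's mean intensity, the significance test is an exact permutation test standing in for the non-parametric tests mentioned, and multiple testing is handled with a Bonferroni correction, one common choice among several.

```python
import itertools
import math
from statistics import mean


def scale_to_control(sample, control):
    """Globally rescale one array so its mean intensity matches the control's.

    Assumption: a single scaling factor is adequate; real data often needs
    more elaborate normalization.
    """
    factor = mean(control) / mean(sample)
    return [x * factor for x in sample]


def log2_fold_change(sample_reps, control_reps):
    """log2 ratio of mean sample to mean control intensity for one gene."""
    return math.log2(mean(sample_reps) / mean(control_reps))


def permutation_p_value(sample_reps, control_reps):
    """Exact two-sided permutation test on the difference of group means.

    Enumerates every way of relabelling the pooled replicates, so it is
    only practical for the small replicate counts typical of arrays.
    """
    pooled = list(sample_reps) + list(control_reps)
    n = len(sample_reps)
    observed = abs(mean(sample_reps) - mean(control_reps))
    hits = total = 0
    for idx in itertools.combinations(range(len(pooled)), n):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


if __name__ == "__main__":
    # Made-up replicate intensities for one gene, for illustration only.
    treated = [10, 11, 12, 13]
    control = [1, 2, 3, 4]
    print(log2_fold_change(treated, control))
    print(permutation_p_value(treated, control))
```

With four replicates per group, the smallest p-value this test can return is 2/70, which illustrates why a multiple-testing correction across thousands of genes is so demanding when replicates are few.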
In chapter 9, Knudsen discusses the various considerations that need to be taken into account when selecting the appropriate probes for arrays. In chapter 10, the limitations of expression analysis are outlined. In particular, microarray expression studies (transcriptomics) focus primarily on gene expression and neglect many other aspects of cellular dynamics, such as alternative splicing, protein translation, post-translational modifications and degradation. Users need to be very cautious before drawing bold conclusions from their expression data. The genotyping array, a close relative of the expression array, is briefly discussed in chapter 11. The discussion is largely concentrated on the author's interest in neural network sequence prediction.
Cell biologists often want to know which software is best for microarray data analysis. Chapters 12 and 13 provide a quick overview of the issues related to the choice of software. Commercial software often gives a false sense of security: it has inherent limitations, such as making implicit analysis assumptions for you. Therefore, Knudsen advocates the use of open source or free software for data analysis. There are a few important take-home messages regarding software: standardizing the data format will greatly assist data sharing and comparability; learning a scripting language like Awk or Perl will allow you to manipulate your data with ease; and learning an open source statistical language, such as R, will allow you to run different analyses. In addition, R has numerous extension modules (libraries) written specifically for microarray data analysis, and almost all are free. A great feature of this book is that it shows a number of simple Awk scripts and R commands for various statistical analyses. The reader can therefore follow this code step by step to experience command-line-driven programs first hand.
There are some drawbacks to this book. Firstly, the background to the various statistical analyses is only briefly discussed; therefore, it requires some statistical training to appreciate many of the chapters. This conflicts with the book's objective of guiding biologists without special training through the analysis step. However, this is an unavoidable trade-off to make the book easier to read. Secondly, the book does not place enough emphasis on experimental design, which could significantly affect the data analysis in the later stages of the experiment. Without thorough planning and an understanding of the analysis methods, microarray analysis risks being a 'fishing expedition'. But with a careful and critical approach, experiments can be quite the opposite. Finally, some of the chapters in this book are just too short to be justified as such. For example, chapters 9-11 are only between three and six pages long. More discussion of the issues raised would be welcome, even in an introductory text.
Nonetheless, this book is a good starting point for cell biologists who are interested in analysis of DNA microarrays. It provides a background to microarray data analysis and a quick overview of the current trends. A Biologist's Guide to Analysis of DNA Microarray Data does a marvelous job of introducing biologists to the realm of genomic data analysis.