Publications

2002

Kivi M, Liu X, Raychaudhuri S, Altman R, Small P. Determining the genomic locations of repetitive DNA sequences with a whole-genome microarray: IS6110 in Mycobacterium tuberculosis. J Clin Microbiol. 2002;40(6):2192–8.
The mycobacterial insertion sequence IS6110 has been exploited extensively as a clonal marker in molecular epidemiologic studies of tuberculosis. In addition, it has been hypothesized that this element is an important driving force behind genotypic variability that may have phenotypic consequences. We present here a novel, DNA microarray-based methodology, designated SiteMapping, that simultaneously maps the locations and orientations of multiple copies of IS6110 within the genome. To investigate the sensitivity, accuracy, and limitations of the technique, it was applied to eight Mycobacterium tuberculosis strains for which complete or partial IS6110 insertion site information had been determined previously. SiteMapping correctly located 64% (38 of 59) of the IS6110 copies predicted by restriction fragment length polymorphism analysis. The technique is highly specific; 97% of the predicted insertion sites were true insertions. Eight previously unknown insertions were identified and confirmed by PCR or sequencing. The performance could be improved by modifications in the experimental protocol and in the approach to data analysis. SiteMapping has general applicability and demonstrates an expansion in the applications of microarrays that complements conventional approaches in the study of genome architecture.
Raychaudhuri S, Schütze H, Altman R. Using text analysis to identify functionally coherent gene groups. Genome Res. 2002;12(10):1582–90.
The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
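The evaluation figure quoted above (sensitivity at 100% specificity) can be illustrated with a minimal sketch. This is not the neighbor divergence method itself; it only shows how such a figure could be computed from coherence scores, assuming higher scores mean more coherent groups. The function name and all scores below are hypothetical.

```python
def sensitivity_at_full_specificity(positive_scores, negative_scores):
    """Fraction of positives that score strictly above every negative.

    Placing the threshold at the highest negative score means no negative
    is called positive, i.e. specificity is 100%.
    """
    threshold = max(negative_scores)
    hits = sum(1 for score in positive_scores if score > threshold)
    return hits / len(positive_scores)


# Made-up scores for illustration only (not data from the paper).
functional_group_scores = [9.1, 7.4, 6.8, 3.2, 8.0]
random_group_scores = [1.0, 2.5, 4.0, 3.9]

result = sensitivity_at_full_specificity(functional_group_scores, random_group_scores)
print(result)  # 0.8
```

Here four of the five hypothetical functional groups score above the best random group, so the sensitivity at 100% specificity is 0.8.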

2001

Chang, Raychaudhuri, Altman. Including biological literature improves homology search. Pac Symp Biocomput. 2001;:374–83.
Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature is successful in increasing the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The performance of the modified algorithm achieved 32% recall with 95% precision, while the original one achieved 33% recall with 84% precision; the literature similarity requirement preserved the sensitive characteristic of the PSI-BLAST algorithm while improving the precision.
Raychaudhuri, Sutphin, Chang, Altman. Basic microarray analysis: grouping and feature reduction. Trends Biotechnol. 2001;19(5):189–93.
DNA microarray technologies are useful for addressing a broad range of biological problems - including the measurement of mRNA expression levels in target cells. These studies typically produce large data sets that contain measurements on thousands of genes under hundreds of conditions. There is a critical need to summarize this data and to pick out the important details. The most common activities, therefore, are to group together microarray data and to reduce the number of features. Both of these activities can be done using only the raw microarray data (unsupervised methods) or using external information that provides labels for the microarray data (supervised methods). We briefly review supervised and unsupervised methods for grouping and reducing data in the context of a publicly available suite of tools called CLEAVER, and illustrate their application on a representative data set collected to study lymphoma.
Altman, Raychaudhuri. Whole-genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol. 2001;11(3):340–7.
Measuring the expression of most or all of the genes in a biological system raises major analytic challenges. A wealth of recent reports uses microarray expression data to examine diverse biological phenomena - from basic processes in model organisms to complex aspects of human disease. After an initial flurry of methods for clustering the data on the basis of similarity, the field has recognized some longer-term challenges. Firstly, there are efforts to understand the sources of noise and variation in microarray experiments in order to increase the biological signal. Secondly, there are efforts to combine expression data with other sources of information to improve the range and quality of conclusions that can be drawn. Finally, techniques are now emerging to reconstruct networks of genetic interactions in order to create integrated and systematic models of biological systems.

2000

Raychaudhuri, Stuart, Liu, Small, Altman. Pattern recognition of genomic features with microarrays: site typing of Mycobacterium tuberculosis strains. Proc Int Conf Intell Syst Mol Biol. 2000;8:286–95.
Mycobacterium tuberculosis (M. tb.) strains differ in the number and locations of a transposon-like insertion sequence known as IS6110. Accurate detection of this sequence can be used as a fingerprint for individual strains, but can be difficult because of noisy data. In this paper, we propose a non-parametric discriminant analysis method for predicting the locations of the IS6110 sequence from microarray data. Polymerase chain reaction extension products generated from primers specific for the insertion sequence are hybridized to a microarray containing targets corresponding to each open reading frame in M. tb. To test for insertion sites, we use microarray intensity values extracted from small windows of contiguous open reading frames. Rank-transformation of spot intensities and first-order differences in local windows provide enough information to reliably determine the presence of an insertion sequence. The nonparametric approach outperforms all other methods tested in this study.
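The two features named in the abstract, rank-transformed spot intensities and first-order differences within a local window of contiguous open reading frames, can be sketched as follows. This is only an illustration of the feature extraction under assumed conventions (average ranks for ties, a window of five ORFs); it does not reproduce the paper's discriminant analysis, and the function names and intensity values are hypothetical.

```python
def rank_transform(values):
    """Replace each value by its rank (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def first_differences(sequence):
    """Differences between consecutive elements."""
    return [b - a for a, b in zip(sequence, sequence[1:])]


def window_features(intensities, center, half_width=2):
    """Ranks and first-order differences in a window of contiguous ORFs.

    The window size (half_width) is an assumption made for this sketch.
    """
    ranks = rank_transform(intensities)
    lo = max(0, center - half_width)
    hi = min(len(ranks), center + half_width + 1)
    window = ranks[lo:hi]
    return window, first_differences(window)


# Made-up spot intensities for seven contiguous ORFs.
intensities = [120.0, 95.0, 110.0, 900.0, 850.0, 130.0, 100.0]
window, diffs = window_features(intensities, center=3)
print(window)  # [1.0, 3.0, 7.0, 6.0, 5.0]
print(diffs)   # [2.0, 4.0, -1.0, -1.0]
```

Working on ranks rather than raw intensities makes the features insensitive to the overall scale of a given array, which is one plausible reason for the transformation when data are noisy.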
Raychaudhuri, Stuart, Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput. 2000;:455–66.
A series of microarray experiments produces observations of differential expression for thousands of genes across multiple conditions. It is often not clear whether a set of experiments are measuring fundamentally different gene expression states or are measuring similar states created through different mechanisms. It is useful, therefore, to define a core set of independent features for the expression states that allow them to be compared directly. Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and can be used to simplify the analysis and visualization of multidimensional data sets. We show that application of PCA to expression data (where the experimental conditions are the variables, and the gene expression measurements are the observations) allows us to summarize the ways in which gene responses vary under different conditions. Examination of the components also provides insight into the underlying factors that are measured in the experiments. We applied PCA to the publicly released yeast sporulation data set (Chu et al. 1998). In that work, 7 different measurements of gene expression were made over time. PCA on the time-points suggests that much of the observed variability in the experiment can be summarized in just 2 components--i.e. 2 variables capture most of the information. These components appear to represent (1) overall induction level and (2) change in induction level over time. We also examined the clusters proposed in the original paper, and show how they are manifested in principal component space. Our results are available on the internet at http://www.smi.stanford.edu/project/helix/PCArray.
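The setup described in this abstract, with conditions as variables and genes as observations, can be sketched in a few lines. The example below is a minimal illustration using power iteration with deflation on the covariance matrix; it is not the authors' implementation, the helper names are hypothetical, and the expression values are made up (they are not the yeast sporulation data).

```python
import math


def center_columns(rows):
    """Subtract each column (condition) mean from every row (gene)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the columns (conditions)."""
    n, p = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(p)] for a in range(p)]


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def principal_components(rows, k=2, iterations=500):
    """Top-k components and the fraction of variance each explains.

    Assumes the data have at least k directions of nonzero variance.
    """
    cov = covariance(center_columns(rows))
    p = len(cov)
    total_variance = sum(cov[j][j] for j in range(p))
    components, fractions = [], []
    for _component in range(k):
        # Arbitrary non-uniform start vector (an assumption of this sketch).
        v = normalize([float(j + 1) for j in range(p)])
        for _step in range(iterations):
            v = normalize(mat_vec(cov, v))
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        fractions.append(eigenvalue / total_variance)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(p)]
               for a in range(p)]
    return components, fractions


# Made-up data: 6 genes (rows) measured at 4 time points (columns).
expression = [
    [0.1, 0.9, 2.1, 3.0],
    [0.0, 1.1, 1.9, 3.1],
    [3.0, 2.0, 1.1, 0.0],
    [2.9, 2.1, 0.9, 0.1],
    [1.5, 1.4, 1.6, 1.5],
    [0.2, 0.1, 0.2, 0.1],
]
components, fractions = principal_components(expression, k=2)
print([round(f, 3) for f in fractions])
```

In this toy data set the genes differ mainly in their overall level and in their trend over time, so two components account for most of the variance. For real data a numerical library's SVD or eigendecomposition routine would be the more robust choice.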

1997

Van Liew, Raychaudhuri. Stabilized bubbles in the body: pressure-radius relationships and the limits to stabilization. J Appl Physiol (1985). 1997;82(6):2045–53.
We previously outlined the fundamental principles that govern behavior of stabilized bubbles, such as the microbubbles being put forward as ultrasound contrast agents. Our present goals are to develop the idea that there are limits to the stabilization and to provide a conceptual framework for comparison of bubbles stabilized by different mechanisms. Gases diffuse in or out of stabilized bubbles in a limited and reversible manner in response to changes in the environment, but strong growth influences will cause the bubbles to cross a threshold into uncontrolled growth. Also, bubbles stabilized by mechanical structures will be destroyed if outside influences bring them below a critical small size. The in vivo behavior of different kinds of stabilized bubbles can be compared by using plots of bubble radius as a function of forces that affect diffusion of gases in or out of the bubble. The two ends of the plot are the limits for unstabilized growth and destruction; these and the curve's slope predict the bubble's practical usefulness for ultrasonic imaging or O2 carriage to tissues.
Raychaudhuri, Younas, Karplus, Faerman, Ripoll. Backbone makes a significant contribution to the electrostatics of alpha/beta-barrel proteins. Protein Sci. 1997;6(9):1849–57.
The electrostatic properties of seven alpha/beta-barrel enzymes selected from different evolutionary families were studied: triose phosphate isomerase, fructose-1,6-bisphosphate aldolase, pyruvate kinase, mandelate racemase, trimethylamine dehydrogenase, glycolate oxidase, and narbonin, a protein without any known enzymatic activity. The backbone of the alpha/beta-barrel has a distinct electrostatic field pattern, which is dipolar along the barrel axis. When the side chains are included in the calculations the general effect is to modulate the electrostatic pattern so that the electrostatic field is generally enhanced and is focused into a specific area near the active site. We use the electrostatic flux through a square surface near the active site to gauge the functionally relevant magnitude of the electrostatic field. The calculations reveal that in six out of the seven cases the backbone itself contributes greater than 45% of the total flux. The substantial electrostatic contribution of the backbone correlates with the known preference of alpha/beta-barrel enzymes for negatively charged substrates.

1996

Li Z, Raychaudhuri, Wand. Insights into the local residual entropy of proteins provided by NMR relaxation. Protein Sci. 1996;5(12):2647–50.
A simple model is used to illustrate the relationship between the dynamics measured by NMR relaxation methods and the local residual entropy of proteins. The expected local dynamic behavior of well-packed extended amino acid side chains is described by employing a one-dimensional vibrator that encapsulates both the spatial and temporal character of the motion. This model is then related to entropy and to the generalized order parameter of the popular "model-free" treatment often used in the analysis of NMR relaxation data. Simulations indicate that order parameters observed for the methyl symmetry axes in, for example, human ubiquitin correspond to significant local entropies. These observations have obvious significance for the issue of the physical basis of protein structure, dynamics, and stability.