Document Type
- Doctoral Thesis
Language
- English
Keywords
- Microarray
Institute
- Theodor-Boveri-Institut für Biowissenschaften
Applying microarray‐based techniques to study gene expression patterns: a bio‐computational approach
(2010)
The regulation and maintenance of iron homeostasis is critical to human health. As a constituent of hemoglobin, iron is essential for oxygen transport, and significant iron deficiency leads to anemia. Eukaryotic cells require iron for survival and proliferation. Iron is part of hemoproteins, iron-sulfur (Fe-S) proteins, and other proteins with functional groups that require iron as a cofactor. At the cellular level, iron uptake, utilization, storage, and export are regulated at different molecular levels (transcriptional, mRNA stability, translational, and posttranslational). Iron regulatory proteins (IRPs) 1 and 2 post-transcriptionally control mammalian iron homeostasis by binding to iron-responsive elements (IREs), conserved RNA stem-loop structures located in the 5'- or 3'-untranslated regions of genes involved in iron metabolism (e.g. FTH1, FTL, and TFRC). To identify novel IRE-containing mRNAs, we integrated biochemical, biocomputational, and microarray-based experimental approaches. Gene expression studies greatly contribute to our understanding of complex relationships in gene regulatory networks. However, the complexity of array design, production, and manipulation is a limiting factor that affects data quality. The use of customized DNA microarrays improves overall data quality in many situations, but only if analysis tools are available for these specifically designed microarrays. Methods: In this project, the response to iron treatment was examined under different conditions using bioinformatical methods, with the aim of improving our understanding of the iron regulatory network. For this purpose we used microarray gene expression data. To identify novel IRE-containing mRNAs, biochemical, biocomputational, and microarray-based experimental approaches were integrated.
IRP/IRE messenger ribonucleoproteins were immunoselected and their mRNA composition was analysed using an IronChip microarray enriched for genes predicted computationally to contain IRE-like motifs. Analysis of IronChip microarray data requires a specialized tool that can exploit the advantages of a customized microarray platform. A novel decision-tree-based algorithm was implemented in Perl in the IronChip Evaluation Package (ICEP). Results: IRE-like motifs were identified from genomic nucleic acid databases by an algorithm combining primary nucleic acid sequence and RNA structural criteria. Depending on the choice of constraining criteria, such computational screens tend to generate a large number of false positives. To refine the search and reduce the number of false positive hits, additional constraints were introduced. The refined screen yielded 15 IRE-like motifs. A second approach made use of a reported list of 230 IRE-like sequences obtained from screening UTR databases. We selected 6 of these 230 entries based on the ability of the lower IRE stem to form at least 6 out of 7 bp. Corresponding ESTs were spotted onto the human or mouse versions of the IronChip and the results were analysed using ICEP. Our data show that the immunoselection/microarray strategy is a feasible approach for screening bioinformatically predicted IRE genes and for detecting novel IRE-containing mRNAs. In addition, we identified a novel IRE-containing gene, CDC14A (Sanchez M, et al. 2006). The IronChip Evaluation Package (ICEP) is a collection of Perl utilities and an easy-to-use data evaluation pipeline for the analysis of microarray data, with a focus on the data quality of custom-designed microarrays. The package was developed for the statistical and bioinformatical analysis of the custom cDNA microarray IronChip, but can easily be adapted to other custom-designed cDNA or oligonucleotide-based microarray platforms.
ICEP uses decision tree-based algorithms to assign quality flags and performs robust analysis based on chip design properties regarding multiple repetitions, ratio cut-off, background and negative controls (Vainshtein Y, et al., 2010).
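The decision-tree style of quality flagging described above can be illustrated with a minimal Python sketch. This is not the ICEP implementation (ICEP is written in Perl and its actual rules are not given in the abstract). The function name `flag_spot`, the flag labels, the order of the checks, and all thresholds are assumptions chosen only to show how replicate count, background, negative controls, and a ratio cut-off could be combined into a single flag.

```python
from statistics import median


def flag_spot(ratios, signal, background, negative_control,
              min_replicates=2, ratio_cutoff=1.5, snr_cutoff=2.0):
    """Assign a hypothetical quality flag to one gene's replicate spots.

    ratios: expression ratios (treated / control) from replicate spots
    signal: mean foreground intensity of the spots
    background: mean local background intensity
    negative_control: mean intensity of negative-control spots
    All thresholds are illustrative assumptions, not ICEP defaults.
    """
    # Node 1: are there enough replicate measurements to trust the value?
    if len(ratios) < min_replicates:
        return "too_few_replicates"
    # Node 2: is the signal clearly above the local background?
    if signal < snr_cutoff * background:
        return "low_signal"
    # Node 3: is the signal above the negative-control level?
    if signal <= negative_control:
        return "below_negative_control"
    # Node 4: does the median ratio pass the cut-off in either direction?
    m = median(ratios)
    if m >= ratio_cutoff or m <= 1.0 / ratio_cutoff:
        return "regulated"
    return "unchanged"


if __name__ == "__main__":
    print(flag_spot([2.0, 2.2, 1.9], signal=1000, background=100, negative_control=150))
    print(flag_spot([1.1, 0.9], signal=1000, background=100, negative_control=150))
```

Ordering the checks from data availability to signal quality to biological effect means that a gene is only called regulated once every quality criterion has been passed, which is one plausible reading of "robust analysis based on chip design properties".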
In initial experiments, the well-characterized VACV strain GLV-1h68 and three wild-type LIVP isolates were used to analyze gene expression in a pair of autologous human melanoma cell lines (888-MEL and 1936-MEL) after infection. Microarray analyses, followed by sequential statistical approaches, characterized human genes whose transcription is affected specifically by VACV infection. In accordance with the literature, those genes were involved in broad cellular functions, such as cell death, protein synthesis and folding, as well as DNA replication, recombination, and repair. In parallel to host gene expression, viral gene expression was evaluated with the help of customized VACV array platforms to gain better insight into the interplay between VACV and its host. Our main focus was to compare host and viral early events, since virus genome replication occurs early after infection. We observed that viral transcripts segregated in a characteristic time-specific pattern, consistent with the three temporal expression classes of VACV genes, including a group of genes which could be classified as early-stage genes. In this work, comparison of VACV early replication and the respective early gene transcription led to the identification of seven viral genes whose expression correlated strictly with replication. We considered the early expression of those seven genes to be representative of VACV replication and therefore refer to them as viral replication indicators (VRIs). To explore the relationship between host cell transcription and viral replication, we correlated viral (VRI) and human early gene expression. Correlation analysis revealed a subset of 114 human transcripts whose early expression tightly correlated with early VRI expression and thus with early viral replication. These 114 human transcripts are involved in a broad range of cellular functions. We found at least six of the 114 correlates to be involved in protein ubiquitination or proteasomal function.
Another molecule of interest was the serine-threonine protein kinase WNK lysine-deficient protein kinase 1 (WNK1). We found that WNK1 differs at several molecular biological levels in ways associated with permissiveness to VACV infection. In addition, a set of human genes with possible predictive value for viral replication was identified in an independent dataset. A further objective of this work was to explore baseline molecular biological variances associated with permissiveness, which could help identify cellular components that contribute to the formation of a permissive phenotype. Therefore, in a subsequent approach, we screened a set of 15 melanoma cell lines (15-MEL) for their permissiveness to GLV-1h68, as evaluated by GFP expression levels, and classified the top four and the lowest four cell lines into a highly permissive and a low permissive group, respectively. Baseline gene transcriptional data comparing the low and highly permissive groups suggest that differences between the two groups are at least in part due to variances in global cellular functions, such as cell cycle, cell growth and proliferation, as well as cell death and survival. We also observed differences in the ubiquitination pathway, which is consistent with our previous results and underlines the importance of this pathway in VACV replication and permissiveness. Moreover, baseline microRNA (miRNA) expression in the low and highly permissive groups was considered to provide valuable information regarding virus-host co-existence. In our data set, we identified six miRNAs whose baseline expression differed between the low and highly permissive groups. Finally, copy number variations (CNVs) between the low and highly permissive groups were evaluated.
In this study, when investigating differences in the chromosomal aberration patterns between the low and highly permissive groups, we observed frequent segmental amplifications within the low permissive group, whereas the same regions were mostly unchanged in the highly permissive group. Taken together, our results highlight a probable correlation between viral replication, early gene expression, and the respective host response, and thus a possible involvement of human host factors in viral early replication. Furthermore, we revealed the importance of cellular baseline composition for permissiveness to VACV infection on different molecular biological levels, including mRNA expression, miRNA expression, and copy number variations. The characterization of human target genes that influence viral replication could help answer the question of host cell response to oncolytic virotherapy and provide important information for the development of novel recombinant vaccinia viruses with improved features that enhance the replication rate and thereby improve therapeutic outcome.
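The correlation step described in this abstract (relating early host gene expression to early VRI expression) can be sketched as follows. The abstract does not state which correlation measure, threshold, or summary of the seven VRIs was used, so the Pearson coefficient, the cut-off of 0.9, the use of a single averaged VRI profile, and the names `pearson` and `correlated_host_genes` are all assumptions made for illustration. The example data are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def correlated_host_genes(vri_profile, host_profiles, threshold=0.9):
    """Return host genes whose profile correlates (positively or
    negatively) with the VRI profile at or above the threshold."""
    return sorted(
        gene for gene, profile in host_profiles.items()
        if abs(pearson(vri_profile, profile)) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical mean VRI expression across four infected samples.
    vri = [1.0, 2.0, 3.0, 4.0]
    # Hypothetical host transcripts measured in the same samples.
    host = {
        "GENE_A": [2.0, 4.1, 5.9, 8.0],   # rises with the VRIs
        "GENE_B": [3.0, 1.0, 4.0, 2.0],   # unrelated
        "GENE_C": [8.0, 6.0, 4.0, 2.0],   # falls as the VRIs rise
    }
    print(correlated_host_genes(vri, host))
```

In a real analysis the threshold would need to be accompanied by a multiple-testing correction, since thousands of host transcripts are screened against the same viral profile.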
In this thesis, the development of a phylogenetic DNA microarray, the analysis of several gene expression microarray datasets, and new approaches for improved data analysis and interpretation are described. In the first publication, the development and analysis of a phylogenetic microarray is presented. I was able to show that species detection with phylogenetic DNA microarrays can be significantly improved when the microarray data are analyzed with a linear regression modeling approach. Standard methods have so far relied solely on the signal intensities of the array spots, applying a simple cutoff criterion to call a species present or absent. This procedure is not applicable to very closely related species with high sequence similarity, because cross-hybridization of non-target DNA renders species detection impossible based on signal intensities alone. By modeling hybridization and cross-hybridization with linear regression, as presented in this thesis, even species with a sequence similarity of 97% in the marker gene can be detected and distinguished from related species. Another advantage of the modeling approach over existing methods is that the model also performs well on mixtures of different species. In principle, quantitative predictions can also be made. To make better use of the large amounts of microarray data stored in public databases, meta-analysis approaches need to be developed. In the second publication, an exploratory meta-analysis, exemplified on Arabidopsis thaliana gene expression datasets, is presented. By integrating datasets studying effects such as the influence of plant hormones, pathogens, and different mutations on gene expression levels, clusters of similarly treated datasets were found. From the clusters of pathogen-treated and indole-3-acetic acid (IAA) treated datasets, representative genes were selected which pointed to functions that had previously been associated with pathogen attack or IAA effects.
Additionally, hypotheses about the functions of so far uncharacterized genes could be set up. Thus, this kind of meta-analysis could be used to propose gene functions and their regulation under different conditions. Primary data analysis of Arabidopsis thaliana datasets is also presented in this work. The third publication describes an experiment conducted to find out whether microwave irradiation has an effect on the gene expression of a plant cell culture. During the first steps, the data analysis was carried out blinded, and exploratory analysis methods were applied to find out whether the irradiation had an effect on gene expression of plant cells. Small but statistically significant changes in a few genes were found and could be experimentally confirmed. From the functions of the regulated genes and a meta-analysis with publicly available microarray data, it could be suspected that the plant cell culture somehow perceived the irradiation as energy, similar to perceiving light rays. The fourth publication describes the functional analysis of another Arabidopsis thaliana gene expression dataset. The gene expression data of the plant tumor dataset pointed to a switch from a mainly aerobic, autotrophic to an anaerobic and heterotrophic metabolism in the plant tumor. Genes involved in photosynthesis were found to be repressed in tumors; genes of amino acid and lipid metabolism, cell wall, and solute transporters were regulated in a way that sustains tumor growth and development. Furthermore, in the fifth publication, GEPAT (Genome Expression Pathway Analysis Tool), a tool for the analysis and integration of microarray data with other data types, is described. It consists of a web application and database which allow convenient data upload and data analysis. In later chapters of this thesis (publication 6 and publication 7), GEPAT is used to analyze human microarray datasets and to integrate results from gene expression analysis with other data types.
Gene expression and comparative genomic hybridization (CGH) data from 71 Mantle Cell Lymphoma (MCL) patients were analyzed, which allowed a seven-gene predictor to be proposed that facilitates survival prediction for patients compared to existing predictors. In this study, it was shown that CGH data can be used for survival predictions. For the dataset of Diffuse Large B-cell lymphoma (DLBCL) patients, an improved survival predictor could be found based on the gene expression data. From the genes differentially expressed between long and short surviving MCL patients, as well as from the regulated genes of DLBCL patients, interaction networks could be set up. They point to differences in the regulation of cell cycle and proliferation genes between patients with good and bad prognosis.
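The linear regression approach to species detection described in this abstract can be illustrated with a small Python sketch for the two-species case. It assumes that each probe's signal is the sum of the contributions of the species present, weighted by a known per-species probe response that already includes cross-hybridization. The function name `fit_two_species`, the response values, and the restriction to two species are illustrative assumptions; the thesis model itself is not specified in the abstract.

```python
def fit_two_species(signals, response_a, response_b):
    """Least-squares fit of signals ~ a * response_a + b * response_b.

    signals: observed intensity of each probe on the array
    response_a, response_b: expected intensity of each probe per unit of
        species A or B DNA (target hybridization and cross-hybridization)
    Returns the estimated amounts (a, b).
    """
    # Entries of the 2x2 normal equations (X^T X) beta = X^T y.
    saa = sum(r * r for r in response_a)
    sbb = sum(r * r for r in response_b)
    sab = sum(ra * rb for ra, rb in zip(response_a, response_b))
    say = sum(ra * y for ra, y in zip(response_a, signals))
    sby = sum(rb * y for rb, y in zip(response_b, signals))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("probe responses are collinear; species cannot be separated")
    a = (say * sbb - sby * sab) / det
    b = (sby * saa - say * sab) / det
    return a, b


if __name__ == "__main__":
    # Hypothetical responses of three probes to two closely related species.
    resp_a = [1.0, 0.8, 0.1]
    resp_b = [0.7, 1.0, 0.9]
    # Simulated mixture: 2 units of species A plus 1 unit of species B.
    mixture = [2 * ra + 1 * rb for ra, rb in zip(resp_a, resp_b)]
    print(fit_two_species(mixture, resp_a, resp_b))
```

A simple intensity cutoff would call both species present on every probe of this mixture, whereas the fitted coefficients separate the two contributions and, in principle, also estimate their relative amounts.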
DNA microarrays have become a standard technique to assess the mRNA levels for complete genomes. To identify significantly regulated genes from these large amounts of data, a wealth of methods has been developed. Despite this, the functional interpretation (i.e. deducing biological hypotheses from the data) still remains a major bottleneck in microarray data analysis. Most available methods display the set of significant genes in long lists, from which common functional properties have to be extracted. This is not only a tedious and time-consuming task, which becomes less and less feasible with increasing numbers of experimental conditions, but is also prone to errors, since it is commonly done by eye. In the course of this work, methods have been developed and tested that allow for a computer-based analysis of the functional properties relevant in the given experimental setting. To this end, the Gene Ontology (GO) was chosen as an appropriate source of annotation data, because it combines human readability with computer accessibility of the annotation terms and thus allows for a statistical analysis of functional properties. Here, the gene annotations are integrated in a Correspondence Analysis, which allows genes, hybridizations, and functional categories to be visualized in a single plot. Due to the increasing amounts of available annotations and the fact that in most settings only a few functional processes are differentially regulated, several filter criteria have been developed to reduce the number of displayed annotations to a set that is relevant in the given experimental setting. The applicability of the presented visualization and filtering has been validated on datasets of varying complexity, ranging from the well-studied glucose pathway in S. cerevisiae to the comparison of different tumor types in human.
In both settings the method generated well interpretable plots, which allowed for an immediate identification of the major functional differences between the experimental conditions [90]. While the integration of annotation data like GO facilitates functional interpretation, it lacks the capability to identify key regulatory elements. To facilitate such an analysis, the occurrence of transcription factor (TF) binding sites in upstream regions of genes has been integrated into the analysis as well. Again, this methodology was biologically validated on S. cerevisiae as well as human cancer data sets. In both settings, TFs known to play central roles in the observed transcriptional changes were plotted in marked positions and thus could be immediately identified [206]. In essence, integration of supplementary information in Correspondence Analysis visualizes genes, hybridizations, and annotation data in a single, well interpretable plot. This allows for an intuitive identification of relevant annotations even in complex experimental settings. The presented approach is not limited to the shown types of data, but can be generalized to the majority of the available annotation data.
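As background to the Correspondence Analysis used in this abstract, the following Python sketch computes the matrix of standardized residuals that correspondence analysis decomposes, together with the total inertia (the chi-square statistic divided by the grand total). It stops before the singular value decomposition that would yield the plot coordinates, and it does not show how supplementary annotation points are projected. The function names and the tiny gene-by-hybridization tables are illustrative assumptions, and every row and column is assumed to have a nonzero total.

```python
from math import sqrt


def standardized_residuals(table):
    """Standardized residuals S_ij = (p_ij - r_i * c_j) / sqrt(r_i * c_j).

    table: non-negative matrix (list of rows), e.g. genes x hybridizations.
    Assumes every row and every column has a nonzero sum.
    """
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    return [
        [
            (table[i][j] / total - row_mass[i] * col_mass[j])
            / sqrt(row_mass[i] * col_mass[j])
            for j in range(n_cols)
        ]
        for i in range(len(table))
    ]


def total_inertia(table):
    """Sum of squared standardized residuals (chi-square / grand total)."""
    return sum(v * v for row in standardized_residuals(table) for v in row)


if __name__ == "__main__":
    # Rows proportional to each other: no association, inertia close to 0.
    print(total_inertia([[10, 20], [20, 40]]))
    # Each gene expressed in only one hybridization: strong association.
    print(total_inertia([[10, 0], [0, 10]]))
```

The inertia quantifies how much structure there is to display: a table with no association between genes and hybridizations would produce a plot with all points at the origin.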
Recent progress and developments in molecular biology provide a wealth of new but insufficiently characterised data. This wealth comprises, among others, genomic DNA sequences, protein sequences, three-dimensional protein structures, and gene expression profiles. In the present work, this information is used to develop new methods for the characterisation and classification of organisms and whole groups of organisms, as well as to improve the automated acquisition and transfer of information. The first two presented approaches (chapters 4 and 5) focus on the medically and scientifically important enterobacteria. Their importance in medicine and molecular biology rests on their versatile mechanisms of infection, their fundamental role as commensal inhabitants of the intestinal tract, and their use as model organisms, as they are easy to cultivate. Despite many studies on single pathogroups with clinically distinguishable pathologies, the genotypic factors that contribute to their diversity are still partially unknown. The comprehensive genome comparison described in chapter 4 was conducted with numerous enterobacterial strains, which cover nearly the whole range of clinically relevant diversity. The genome comparison constitutes the basis of a characterisation of the enterobacterial gene pool, of a reconstruction of evolutionary processes, and of a comprehensive analysis of specific protein families in enterobacterial subgroups. Correspondence analysis, which is applied for the first time in this context, yields qualitative statements about bacterial subgroups and the protein families present exclusively in each of them. By applying statistical tests, specific protein families were identified for the three major subgroups of enterobacteria, namely the genera Yersinia and Salmonella and the group comprising Shigella and E. coli.
In conclusion, the genome comparison-based methods provide new starting points for inferring specific genotypic traits of bacterial groups from the transfer of functional annotation. Due to the high medical importance of enterobacterial isolates, their classification according to pathogenicity has been the focus of many studies. Microarray technology offers a fast, reproducible, and standardisable means of bacterial typing and has proven itself in bacterial diagnostics, risk assessment, and surveillance. The design of the diagnostic microarray for enterobacteria described in chapter 5 is based on the availability of numerous enterobacterial genome sequences. A novel probe selection strategy based on a highly efficient string search algorithm, which considers both coding and non-coding regions of genomic DNA, enhances pathogroup detection. This principle reduces the risk of incorrect typing that results from restricting the array to virulence-associated capture probes. Additional capture probes extend the spectrum of applications of the microarray to the simultaneous diagnosis or surveillance of antimicrobial resistance. Comprehensive test hybridisations largely confirm the reliability of the selected capture probes and their ability to robustly classify enterobacterial strains according to pathogenicity. Moreover, the tests constitute the basis for training a regression model for the classification of pathogroups and hybridised amounts of DNA. The regression model features a continuous learning capacity, so that its prediction accuracy improves in the course of its application. A fraction of the capture probes represents intergenic DNA and hence confirms the relevance of the underlying strategy. Interestingly, a large part of the capture probes represents poorly annotated genes, suggesting the existence of as yet unconsidered factors that are important for the formation of the respective virulence phenotypes. Another major field of microarray applications is gene expression analysis.
The size of gene expression databases has increased rapidly in recent years. Although they provide a wealth of expression data, it remains challenging to integrate results from different studies. In chapter 6, the methodology of an unsupervised meta-analysis of genome-wide A. thaliana gene expression data sets is presented, which yields novel insights into the function and regulation of genes. The application of kernel-based principal component analysis in combination with hierarchical clustering identified three major groups of contrasts, each sharing overlapping expression profiles. Genes associated with two of the groups are known to play important roles in indole-3-acetic acid (IAA) mediated plant growth and development as well as in pathogen defence. As yet uncharacterised serine-threonine kinases could be assigned novel functions in pathogen defence by the meta-analysis. In general, hidden interrelations between genes regulated under different conditions could be unravelled by the described approach. Hidden Markov models (HMMs) are applied to the functional characterisation of proteins or the detection of genes in genome sequences. Although HMMs are technically mature and widely applied in computational biology, I demonstrate a methodological optimisation of their modelling accuracy on biological data with various distributions of sequence lengths. The subunits of these models, the states, are associated with a certain holding time, which links them to the length distributions of the represented sequences. An adaptation of simple HMM topologies to bell-shaped length distributions, described in chapter 7, was achieved by serial chain-linking of single states while remaining within the class of conventional HMMs. The impact of an optimisation of HMM topologies was underlined by performance evaluations with differently adjusted HMM topologies.
In summary, a general methodology was introduced to improve the modelling behaviour of HMMs by topological optimisation with maximum likelihood and a fast and easily implementable moment estimator. Chapter 8 describes the application of HMMs to the prediction of interaction sites in protein domains. As previously demonstrated, these sites are not trivial to predict because the conservation of their location and type varies within the domain family. The prediction of interaction sites in protein domains is achieved by a newly defined HMM topology, which incorporates both sequence and structure information. Posterior decoding is applied to the prediction of interaction sites, providing additional information on the probability of an interaction for all sequence positions. The implementation of interaction profile HMMs (ipHMMs) is based on the well-established profile HMMs and inherits their known efficiency and sensitivity. The large-scale prediction of interaction sites by ipHMMs explained protein dysfunctions caused by mutations that are associated with inheritable diseases, such as different types of cancer or muscular dystrophy. Like profile HMMs, the ipHMMs are suitable for large-scale applications. Overall, the HMM-based method enhances the prediction quality of interaction sites and improves the understanding of the molecular background of inheritable diseases. With respect to current and future requirements, I provide large-scale solutions for the characterisation of biological data in this work. All described methods are highly portable, which allows for their transfer to related topics or organisms. Special emphasis was put on the knowledge transfer facilitated by a steadily increasing wealth of biological information. The applied and developed statistical methods largely provide learning capacities and hence benefit from the gain of knowledge, resulting in increased prediction accuracy and reliability.
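The idea of chain-linking states to model bell-shaped length distributions, together with a moment estimator, can be sketched in Python under explicit assumptions. A single HMM state with self-transition probability p emits a geometrically distributed number of symbols (mean 1/(1-p), variance p/(1-p)^2). A serial chain of k identical states therefore has mean k/(1-p) and variance k*p/(1-p)^2, which is bell-shaped for k > 1. Matching these two moments to observed lengths gives k = m^2/(v+m) and p = v/(v+m). Whether the thesis uses this exact estimator is not stated in the abstract; the function names and the rounding rule below are assumptions.

```python
def chain_moments(k, p):
    """Mean and variance of the total length emitted by k serial states,
    each with self-transition probability p (identical states assumed)."""
    return k / (1.0 - p), k * p / (1.0 - p) ** 2


def fit_state_chain(lengths):
    """Moment estimate of (k, p) from observed sequence lengths.

    k is rounded to the nearest integer >= 1; p is then reset so that the
    chain reproduces the observed mean exactly (an illustrative choice).
    """
    n = len(lengths)
    mean = sum(lengths) / n
    var = sum((x - mean) ** 2 for x in lengths) / n
    k = max(1, round(mean * mean / (var + mean)))
    p = max(0.0, 1.0 - k / mean)
    return k, p


if __name__ == "__main__":
    # Hypothetical observed segment lengths (mean 20, variance 80).
    observed = [16, 24, 8, 32]
    k, p = fit_state_chain(observed)
    print(k, p, chain_moments(k, p))
```

A single state fitted to the same mean would have p = 0.95 and a variance of 380, far broader than the observed 80, which is why adding states in series gives a better fit to peaked length distributions.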