Recent progress in molecular biology provides a wealth of new but insufficiently characterised data. This pool comprises, amongst others, genomic DNA, protein sequences, three-dimensional protein structures and gene expression profiles. In the present work, this information is used to develop new methods for the characterisation and classification of organisms and whole groups of organisms, and to enhance the automated gain and transfer of information. The first two approaches presented (chapters 4 and 5) focus on the medically and scientifically important enterobacteria. Their impact in medicine and molecular biology rests on their versatile mechanisms of infection, their fundamental function as commensal inhabitants of the intestinal tract and their use as model organisms that are easy to cultivate. Despite many studies on single pathogroups with clinically distinguishable pathologies, the genotypic factors that contribute to their diversity are still partially unknown. The comprehensive genome comparison described in chapter 4 was conducted with numerous enterobacterial strains, which cover nearly the whole range of clinically relevant diversity. The genome comparison forms the basis of a characterisation of the enterobacterial gene pool, of a reconstruction of evolutionary processes and of a comprehensive analysis of specific protein families in enterobacterial subgroups. Correspondence analysis, which is applied for the first time in this context, yields qualitative statements about bacterial subgroups and the protein families present exclusively in each of them. By applying statistical tests, specific protein families were identified for the three major subgroups of enterobacteria, namely the genera Yersinia and Salmonella and the group of Shigella and E. coli.
In conclusion, the genome comparison-based methods provide new starting points to infer specific genotypic traits of bacterial groups from the transfer of functional annotation. Because of the high medical importance of enterobacterial isolates, their classification according to pathogenicity has been the focus of many studies. Microarray technology offers a fast, reproducible and standardisable means of bacterial typing and has proved its value in bacterial diagnostics, risk assessment and surveillance. The design of the diagnostic microarray of enterobacteria described in chapter 5 is based on the availability of numerous enterobacterial genome sequences. A novel probe selection strategy based on a highly efficient string search algorithm, which considers both coding and non-coding regions of genomic DNA, enhances pathogroup detection. This principle reduces the risk of incorrect typing that arises when capture probes are restricted to virulence-associated ones. Additional capture probes extend the spectrum of applications of the microarray to the simultaneous diagnosis or surveillance of antimicrobial resistance. Comprehensive test hybridisations largely confirm the reliability of the selected capture probes and their ability to robustly classify enterobacterial strains according to pathogenicity. Moreover, the tests form the basis for training a regression model for the classification of pathogroups and hybridised amounts of DNA. The regression model features a continuous learning capacity, so that its prediction accuracy improves as it is applied. A fraction of the capture probes represents intergenic DNA and hence confirms the relevance of the underlying strategy. Interestingly, a large part of the capture probes represents poorly annotated genes, suggesting the existence of as yet unconsidered factors that are important for the formation of the respective virulence phenotypes. Another major field of microarray applications is gene expression analysis.
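As a rough illustration of the probe selection idea described above, the following Python sketch collects every substring of length k that occurs in all genomes of a target pathogroup and in none of the non-target genomes, regardless of whether it lies in a coding region. The function names, the toy sequences and the plain set-based search are assumptions made for illustration only; the actual string search algorithm and probe criteria of the thesis are not given in this abstract.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_specific_probes(targets, non_targets, k):
    """Candidate probes: k-mers shared by every target genome
    and absent from every non-target genome (hypothetical criterion)."""
    shared = set.intersection(*(kmers(g, k) for g in targets))
    for genome in non_targets:
        shared -= kmers(genome, k)
    return sorted(shared)


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    targets = ["ACGTACGGA", "TTACGGATT"]
    non_targets = ["GGTACGGT"]
    print(group_specific_probes(targets, non_targets, k=5))
```

A real design would additionally have to consider probe length, melting temperature and near matches, none of which this sketch models.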
The size of gene expression databases has increased rapidly in recent years. Although they provide a wealth of expression data, it remains challenging to integrate results from different studies. Chapter 6 presents the methodology of an unsupervised meta-analysis of genome-wide A. thaliana gene expression data sets, which yields novel insights into the function and regulation of genes. The application of kernel-based principal component analysis in combination with hierarchical clustering identified three major groups of contrasts, each sharing overlapping expression profiles. Genes associated with two groups are known to play important roles in indole-3-acetic acid (IAA) mediated plant growth and development as well as in pathogen defence. As yet uncharacterised serine-threonine kinases could be assigned novel functions in pathogen defence by meta-analysis. In general, hidden interrelations between genes regulated under different conditions could be unravelled by the described approach. Hidden Markov models (HMMs) are applied to the functional characterisation of proteins or the detection of genes in genome sequences. Although HMMs are technically mature and widely applied in computational biology, I demonstrate that their modelling accuracy can still be optimised methodically on biological data with various distributions of sequence lengths. The subunits of these models, the states, are each associated with a certain holding time, which links them to the length distributions of the represented sequences. An adaptation of simple HMM topologies to bell-shaped length distributions, described in chapter 7, was achieved by serial chain-linking of single states while remaining in the class of conventional HMMs. The impact of optimising HMM topologies was underlined by performance evaluations with differently adjusted HMM topologies.
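The link between holding times and length distributions can be made concrete with a small sketch. A single state with self-transition probability p emits a geometrically distributed number of symbols. A chain of n identical states emits a negative-binomially distributed number with mean n/(1-p) and variance np/(1-p)^2, which is bell-shaped for n > 1. The Python code below fits n and p to observed lengths by matching the first two moments. It is a minimal illustration under these assumptions (identical states, hypothetical function names), not the estimator used in the thesis.

```python
import random
from statistics import mean, pvariance


def fit_chain(lengths):
    """Method-of-moments fit of a chain of identical HMM states.

    Returns (n_states, p_stay), where p_stay is the self-transition
    probability shared by all states in the chain.
    """
    m = float(mean(lengths))
    v = float(pvariance(lengths))
    q = m / (m + v)                      # per-state exit probability
    n = max(1, min(round(m * q), int(m)))
    p_stay = 1.0 - n / m                 # re-tuned so the mean matches exactly
    return n, p_stay


def sample_length(n_states, p_stay, rng):
    """Draw one sequence length from a chain of n_states states."""
    total = 0
    for _ in range(n_states):
        total += 1                       # every state emits at least once
        while rng.random() < p_stay:
            total += 1
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = [sample_length(5, 0.8, rng) for _ in range(20000)]
    print(fit_chain(lengths))
```

Rounding n to an integer and then re-tuning p keeps the fitted mean exact at the cost of a slightly mismatched variance, which is one possible design choice among several.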
In summary, a general methodology was introduced to improve the modelling behaviour of HMMs by topological optimisation with maximum likelihood and a fast and easily implementable moment estimator. Chapter 8 describes the application of HMMs to the prediction of interaction sites in protein domains. As previously demonstrated, these sites are not trivial to predict because the conservation of their location and type varies within the domain family. The prediction of interaction sites in protein domains is achieved by a newly defined HMM topology, which incorporates both sequence and structure information. Posterior decoding is applied to the prediction of interaction sites and additionally provides the probability of an interaction for every sequence position. The implementation of interaction profile HMMs (ipHMMs) is based on the well-established profile HMMs and inherits their known efficiency and sensitivity. The large-scale prediction of interaction sites by ipHMMs explained protein dysfunctions caused by mutations that are associated with inheritable diseases such as different types of cancer or muscular dystrophy. Like profile HMMs, ipHMMs are suitable for large-scale applications. Overall, the HMM-based method enhances the prediction quality of interaction sites and improves the understanding of the molecular background of inheritable diseases. With respect to current and future requirements, I provide large-scale solutions for the characterisation of biological data in this work. All described methods are highly portable, which allows their transfer to related topics or organisms. Special emphasis was put on the knowledge transfer facilitated by a steadily increasing wealth of biological information. The applied and developed statistical methods largely provide learning capacities and hence benefit from the gain of knowledge, resulting in increased prediction accuracy and reliability.
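Posterior decoding, as mentioned above for the interaction site prediction, assigns each sequence position the probability of being in a given state, computed with the forward-backward algorithm. The Python sketch below shows this for a toy two-state HMM. The states "site" and "other", the two-letter alphabet and all probabilities are invented for illustration and do not reflect the ipHMM topology.

```python
def posterior_decode(obs, states, start, trans, emit):
    """Return, for each position, a dict of posterior state probabilities."""
    n = len(obs)
    # Forward pass: probability of the prefix ending in state s.
    fwd = [{} for _ in range(n)]
    for s in states:
        fwd[0][s] = start[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in states)
    # Backward pass: probability of the suffix given state s.
    bwd = [{} for _ in range(n)]
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                for r in states)
    total = sum(fwd[n - 1][s] for s in states)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states}
            for t in range(n)]


# Hypothetical toy model: "K" is typical for interaction sites.
STATES = ("site", "other")
START = {"site": 0.2, "other": 0.8}
TRANS = {"site": {"site": 0.5, "other": 0.5},
         "other": {"site": 0.1, "other": 0.9}}
EMIT = {"site": {"K": 0.8, "A": 0.2},
        "other": {"K": 0.1, "A": 0.9}}

if __name__ == "__main__":
    for pos, p in enumerate(posterior_decode("AAKAA", STATES, START, TRANS, EMIT)):
        print(pos, round(p["site"], 3))
```

For long sequences the probabilities underflow, so a practical implementation would work in log space or rescale each column.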
In this thesis, the development of a phylogenetic DNA microarray, the analysis of several gene expression microarray datasets and new approaches for improved data analysis and interpretation are described. In the first publication, the development and analysis of a phylogenetic microarray is presented. I could show that species detection with phylogenetic DNA microarrays can be significantly improved when the microarray data is analyzed with a linear regression modeling approach. Standard methods have so far relied on pure signal intensities of the array spots and a simple cutoff criterion was applied to call a species present or absent. This procedure is not applicable to very closely related species with high sequence similarity because cross-hybridization of non-target DNA renders species detection impossible based on signal intensities alone. By modeling hybridization and cross-hybridization with linear regression, as I have presented in this thesis, even species with a sequence similarity of 97% in the marker gene can be detected and distinguished from related species. Another advantage of the modeling approach over existing methods is that the model also performs well on mixtures of different species. In principle, also quantitative predictions can be made. To make better use of the large amounts of microarray data stored in public databases, meta-analysis approaches need to be developed. In the second publication, an explorative meta-analysis exemplified on Arabidopsis thaliana gene expression datasets is presented. Integrating datasets studying effects such as the influence of plant hormones, pathogens and different mutations on gene expression levels, clusters of similarly treated datasets could be found. From the clusters of pathogen-treated and indole-3-acetic acid (IAA) treated datasets, representative genes were selected which pointed to functions which had been associated with pathogen attack or IAA effects previously. 
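The regression idea described above for the phylogenetic microarray can be sketched as follows. If each probe signal is modelled as a linear combination of the DNA amounts of all species, weighted by their hybridization and cross-hybridization affinities, then the amounts can be estimated by ordinary least squares instead of a per-spot intensity cutoff. The affinity matrix, the function names and the noise-free toy data below are assumptions for illustration, and the model in the thesis may differ in its details.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def estimate_amounts(affinity, signal):
    """Least-squares species amounts via the normal equations.

    affinity[p][s] is the assumed signal of probe p per unit DNA
    of species s; off-target entries model cross-hybridization.
    """
    k = len(affinity[0])
    ata = [[sum(row[i] * row[j] for row in affinity) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(affinity, signal))
           for i in range(k)]
    return solve(ata, aty)


if __name__ == "__main__":
    # Four probes, two closely related species (hypothetical values).
    affinity = [[1.0, 0.8], [0.9, 1.0], [0.2, 1.0], [1.0, 0.1]]
    truth = [2.0, 0.5]
    signal = [sum(a * x for a, x in zip(row, truth)) for row in affinity]
    print(estimate_amounts(affinity, signal))
```

Because the estimate uses all probes jointly, a mixture of species is handled in the same way as a single species, which a simple cutoff on individual spots cannot do.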
Additionally, hypotheses about the functions of so far uncharacterized genes could be set up. Thus, this kind of meta-analysis could be used to propose gene functions and their regulation under different conditions. In this work, also primary data analysis of Arabidopsis thaliana datasets is presented. In the third publication, an experiment which was conducted to find out if microwave irradiation has an effect on the gene expression of a plant cell culture is described. During the first steps, the data analysis was carried out blinded and exploratory analysis methods were applied to find out if the irradiation had an effect on gene expression of plant cells. Small but statistically significant changes in a few genes were found and could be experimentally confirmed. From the functions of the regulated genes and a meta-analysis with publicly available microarray data, it could be suspected that the plant cell culture somehow perceived the irradiation as energy, similar to perceiving light rays. The fourth publication describes the functional analysis of another Arabidopsis thaliana gene expression dataset. The gene expression data of the plant tumor dataset pointed to a switch from a mainly aerobic, auxotrophic to an anaerobic and heterotrophic metabolism in the plant tumor. Genes involved in photosynthesis were found to be repressed in tumors; genes of amino acid and lipid metabolism, cell wall and solute transporters were regulated in a way that sustains tumor growth and development. Furthermore, in the fifth publication, GEPAT (Genome Expression Pathway Analysis Tool), a tool for the analysis and integration of microarray data with other data types, is described. It consists of a web application and database which allows comfortable data upload and data analysis. In later chapters of this thesis (publication 6 and publication 7), GEPAT is used to analyze human microarray datasets and to integrate results from gene expression analysis with other datatypes. 
Gene expression and comparative genomic hybridization (CGH) data from 71 Mantle Cell Lymphoma (MCL) patients were analyzed, which led to the proposal of a seven-gene predictor that facilitates survival predictions for patients compared to existing predictors. In this study, it was shown that CGH data can be used for survival predictions. For the dataset of Diffuse Large B-cell lymphoma (DLBCL) patients, an improved survival predictor could be found based on the gene expression data. From the genes differentially expressed between long and short surviving MCL patients, as well as from the regulated genes of DLBCL patients, interaction networks could be set up. They point to differences in the regulation of cell cycle and proliferation genes between patients with good and bad prognosis.
Background: The incidence of Non-Hodgkin lymphoma (NHL), one of the most frequently observed cancers, continues to rise. Diffuse large B-cell lymphoma (DLBCL) is the most common of the NHLs. There are two subgroups of DLBCL with different gene expression patterns: ABC (“Activated B-like DLBCL”) and GCB (“Germinal Center B-like DLBCL”). Without therapy the patients often die within a few months; the ABC type exhibits the more aggressive behaviour. A further B-cell lymphoma is Mantle cell lymphoma (MCL). It is rare and shows a very poor prognosis. There is no cure yet. Methods: In this project these B-cell lymphomas were examined with methods from bioinformatics in order to find new characteristics or undiscovered events on the molecular level. This would improve the understanding and therapy of lymphomas. For this purpose we used survival, gene expression and comparative genomic hybridization (CGH) data. Some clinical studies yield large data sets from which as yet unknown trends can be revealed. Results (MCL): The published proliferation signature correlates directly with survival. Exploratory analyses of gene expression and CGH data of MCL samples (n=71) revealed a valid grouping according to the median of the proliferation signature values. The second axis of correspondence analysis distinguishes between good and bad prognosis. Statistical testing (moderated t-test, Wilcoxon rank-sum test) showed differences in the cell cycle and delivered a network of kinases which are responsible for the difference between good and bad prognosis. A set of seven genes (CENPE, CDC20, HPRT1, CDC2, BIRC5, ASPM, IGF2BP3) predicted survival patterns about as well as the proliferation signature with 20 genes. Furthermore, some chromosomal bands could be associated with prognosis in the explorative analysis (chromosome 9: 9p24, 9p23, 9p22, 9p21, 9q33 and 9q34).
Results (DLBCL): A new normalization of the gene expression data of DLBCL patients revealed a better separation of risk groups by the signature-based predictor published in 2002. We achieved a similarly good separation with six genes. Exploratory analysis of gene expression data could confirm the subgroups ABC and GCB. We recognized a clear difference between early and late cell cycle stages of cell cycle genes, which can separate ABC and GCB. Classical lymphoma genes and the best separating genes, together with gene sets which identify ABC and GCB, form a network which can classify and explain the ABC and GCB groups (ASB13, BCL2, BCL6, BCL7A, CCND2, COL3A1, CTGF, FN1, FOXP1, IGHM, IRF4, LMO2, LRMP, MAPK10, MME, MYBL1, NEIL1 and SH3BP5). Altogether these findings are useful for diagnosis, prognosis and therapy (cytostatic drugs).
DNA microarrays have become a standard technique to assess the mRNA levels for complete genomes. To identify significantly regulated genes from these large amounts of data, a wealth of methods has been developed. Despite this, the functional interpretation (i.e. deducing biological hypotheses from the data) still remains a major bottleneck in microarray data analysis. Most available methods display the set of significant genes in long lists, from which common functional properties have to be extracted. This is not only a tedious and time-consuming task, which becomes less and less feasible with increasing numbers of experimental conditions, but is also prone to errors, since it is commonly done by eye. In the course of this work, methods have been developed and tested that allow for a computer-based analysis of the functional properties relevant in the given experimental setting. To this end the Gene Ontology (GO) was chosen as an appropriate source of annotation data, because it combines human readability with computer accessibility of the annotation terms and thus allows for a statistical analysis of functional properties. Here the gene annotations are integrated in a Correspondence Analysis, which makes it possible to visualize genes, hybridizations and functional categories in a single plot. Due to the increasing amounts of available annotations and the fact that in most settings only few functional processes are differentially regulated, several filter criteria have been developed to reduce the number of displayed annotations to a set relevant in the given experimental setting. The applicability of the presented visualization and filtering has been validated on datasets of varying complexity, ranging from the well-studied glucose pathway in S. cerevisiae to the comparison of different tumor types in human.
In both settings the method generated well interpretable plots, which allowed for an immediate identification of the major functional differences between the experimental conditions [90]. While the integration of annotation data like GO facilitates functional interpretation, it lacks the capability to identify key regulatory elements. To facilitate such an analysis, the occurrence of transcription factor binding sites in upstream regions of genes has been integrated into the analysis as well. Again this methodology was biologically validated on S. cerevisiae as well as human cancer data sets. In both settings transcription factors (TFs) known to play central roles in the observed transcriptional changes were plotted in marked positions and thus could be immediately identified [206]. In essence, integration of supplementary information in Correspondence Analysis visualizes genes, hybridizations and annotation data in a single, well interpretable plot. This allows for an intuitive identification of relevant annotations even in complex experimental settings. The presented approach is not limited to the shown types of data, but can be generalized to the majority of the available annotation data.
Cardiovascular disease is the leading cause of mortality in both men and women in the Western world. Earlier observations have pointed out that pre-menopausal women have a lower risk of developing cardiovascular disease than age-matched men, with an increase in risk after the onset of menopause. This observation has directed attention to estrogen as a potential protective factor in the heart. So far the focus of research and clinical studies has been the vascular system, leaving the current knowledge on the role of estrogen in the myocardium itself rather scarce. Functional estrogen receptor-alpha as well as -beta have recently been identified in the myocardium, making the myocardium an estrogen target organ. The focus of this thesis was 1) to investigate the role of estrogen and estrogen receptors in modulating myocardial gene expression both in vivo in an animal model for cardiac hypertrophy (spontaneously hypertensive rats; SHR) and in vitro in isolated neonatal cardiomyocytes, 2) to investigate the mechanisms of the rapid induction of an estrogen target gene, the early growth response gene-1 (Egr-1), and 3) to initiate the search for novel estrogen target genes in the myocardium. 1) The effects of estrogen on the expression of one of the major myocardium-specific contractile proteins, the alpha-myosin heavy chain (alpha-MHC), have been investigated. In ovariectomised animals treated either with 17beta-estradiol alone or in combination with a specific estrogen receptor antagonist, ICI 182780, it was shown that both alpha-MHC mRNA and protein were upregulated by estrogen in an estrogen receptor-specific manner. The in vivo results were confirmed in vitro in isolated neonatal cardiomyocytes, which showed that estrogen has a direct action on the myocardium potent enough to upregulate the expression of alpha-MHC.
Furthermore, it was shown that the alpha-MHC promoter is induced by estrogen in an estrogen receptor-dependent manner, and first investigations into the mechanisms involved in this upregulation identified Egr-1 as a potential transcription factor which, upon induction by estrogen, drives the expression of the alpha-MHC promoter. 2) Previously it was shown that Egr-1 is rapidly induced by estrogen in an estrogen receptor-dependent manner, which was mediated via 5 serum response elements (SREs) in the promoter region and surprisingly not via the estrogen response elements (EREs). In this study it was shown that estrogen treatment of cardiomyocytes resulted in the recruitment of serum response factor (SRF), or an antigenically related protein, to the SREs in the Egr-1 promoter, which was specifically inhibited by the estrogen receptor antagonist ICI 182780. Transfection experiments showed that estrogen induced a heterologous promoter consisting only of 5 tandem repeats of the c-fos SRE in an estrogen receptor (ER)-dependent manner, which identified SREs as promoter elements able to confer an estrogen response to target genes. 3) Potentially new target genes regulated by estrogen in vivo were analysed using hearts of ovariectomised animals as well as ovariectomised animals treated with estrogen. Analyses of cDNA microarray filters containing 1250 known genes identified 24 genes whose expression was modified by estrogen in vivo. Among these genes, some might have potentially important functions in the heart, and further analyses of these genes will create a more global picture of the role and function of estrogen in the myocardium. Taken together, the results showed that estrogen does have a direct action on the myocardium, both by regulating the expression of myocardium-specific genes in vivo and by exerting rapid non-nuclear effects in cardiac myocytes.
It was shown that SREs in the promoter region of genes can confer an estrogen response to genes identifying SREs as important elements in regulation of genes by estrogen. Furthermore, 24 potentially new estrogen targets were identified in the myocardium, contributing to the general understanding of estrogen action in the myocardium.
The Gram-negative, spiral-shaped, microaerophilic bacterium Helicobacter pylori is the causative agent of various disorders of the upper gastrointestinal tract, such as chronic superficial gastritis, chronic active gastritis, peptic ulceration and adenocarcinoma. Although many of the bacterial factors associated with disease development have been analysed in some detail in the recent years, very few studies have focused so far on the mechanisms that regulate expression of these factors at the molecular level. In an attempt to obtain an overview of the basic mechanisms of virulence gene expression in H. pylori, three important virulence factors of this pathogen, representative of different pathogenic mechanisms and different phases of the infectious process, are investigated in detail in the present thesis regarding their transcriptional regulation. As an essential factor for the early phase of infection, including the colonisation of the gastric mucosa, the flagella are analysed; the chaperones including the putative adhesion factors GroEL and DnaK are investigated as representatives of the phase of adherence to the gastric epithelium and persistence in the mucus layer; and finally the cytotoxin associated antigen CagA is analysed as representative of the cag pathogenicity island, which is supposed to account for the phenomena of chronic inflammation and tissue damage observed in the later phases of infection. RNA analyses and in vitro transcription demonstrate that a single promoter regulates expression of cagA, while two promoters are responsible for expression of the upstream divergently transcribed cagB gene. All three promoters are shown to be recognised by RNA polymerase containing the vegetative sigma factor sigma 80. 
Promoter deletion analyses establish that full activation of the cagA promoter requires sequences up to -70 and binding of the C-terminal portion of the alpha subunit of RNA polymerase to an UP-like element located between -40 and -60, while full activation of the major cagB promoter requires sequences upstream of -96 which overlap with the cagA promoter. These data suggest that the promoters of the pathogenicity island represent a class of minimum promoters, that ensure a basic level of transcription, while full activation requires regulatory elements or structural DNA binding proteins that provide a suitable DNA context. Regarding flagellar biosynthesis, a master transcriptional factor is identified that regulates expression of a series of flagellar basal body and hook genes in concert with the alternative sigma factor sigma 54. Evidence is provided that this regulator, designated FlgR (for flagellar regulatory protein), is necessary for motility and transcription of five promoters for seven basal body and hook genes. In addition, FlgR is shown to act as a repressor of transcription of the sigma 28-regulated promoter of the flaA gene, while changes in DNA topology are shown to affect transcription of the sigma 54-regulated flaB promoter. These data indicate that the regulatory network that governs flagellar gene expression in H. pylori shows similarities to the systems of both Salmonella spp. and Caulobacter crescentus. In contrast to the flagellar genes which are regulated by three different sigma factors, the three operons encoding the major chaperones of H. pylori are shown to be transcribed by RNA polymerase containing the vegetative sigma factor sigma 80. Expression of these operons is shown to be regulated negatively by the transcriptional repressor HspR, a homologue of a repressor protein of Streptomyces spp., known to be involved in negative regulation of heat shock genes. 
In vitro studies with purified recombinant HspR establish that the protein represses transcription by binding to large DNA regions centered around the transcription initiation site in the case of one promoter, and around -85 and -120 in the case of the other two promoters. In contrast to the situation in Streptomyces, where transcription of HspR-regulated genes is induced in response to heat shock, transcription of the HspR-dependent genes in H. pylori is not inducible with thermal stimuli. Transcription of two of the three chaperone-encoding operons is induced by osmotic shock, while transcription of the third operon, although HspR-dependent, is not affected by salt treatment. Taken together, the analyses carried out indicate that H. pylori has reduced its repertoire of specific regulatory proteins to a basic level that may ensure coordinate regulation of those factors that are necessary during the initial phase of infection, including the passage through the gastric lumen and the colonisation of the gastric mucosa. The importance of DNA topology and/or context for transcription of many virulence gene promoters may, on the other hand, indicate that a sophisticated global regulatory network is present in H. pylori, which influences transcription of specific subsets of virulence genes in response to changes in the microenvironment.