TY  - JOUR
A1  - Caliskan, Aylin
A1  - Caliskan, Deniz
A1  - Rasbach, Lauritz
A1  - Yu, Weimeng
A1  - Dandekar, Thomas
A1  - Breitenbach, Tim
T1  - Optimized cell type signatures revealed from single-cell data by combining principal feature analysis, mutual information, and machine learning
JF  - Computational and Structural Biotechnology Journal
N2  - Machine learning techniques are well suited to analyzing expression data from single cells. These techniques impact all fields ranging from cell annotation and clustering to signature identification. The presented framework evaluates how well selected gene sets separate defined phenotypes or cell groups. This overcomes the current difficulty of objectively and correctly identifying a small gene set with high information content for separating phenotypes; corresponding code scripts are provided. The small but meaningful subset of the original genes (or feature space) facilitates human interpretability of the differences between the phenotypes, including those found by machine learning, and may even turn correlations between genes and phenotypes into a causal explanation. For the feature selection task, principal feature analysis (PFA) is used, which reduces redundant information while selecting genes that carry the information for separating the phenotypes. In this context, the presented framework shows explainability of unsupervised learning as it reveals cell-type specific signatures. Apart from a Seurat preprocessing tool and the PFA script, the pipeline uses mutual information to balance accuracy and size of the gene set if desired. A validation step that evaluates the information content of the selected genes for separating the phenotypes is provided as well; binary and multiclass classification of 3 or 4 groups are studied. Results from different single-cell data are presented.
In each, only about ten out of more than 30000 genes are identified as carrying the relevant information. The code is provided in a GitHub repository at https://github.com/AC-PHD/Seurat_PFA_pipeline.
KW  - single cell analysis
KW  - machine learning
KW  - explainability of machine learning
KW  - principal feature analysis
KW  - model reduction
KW  - feature selection
Y1  - 2023
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bvb:20-opus-349989
SN  - 2001-0370
VL  - 21
ER  - 

TY  - JOUR
A1  - Caliskan, Aylin
A1  - Dangwal, Seema
A1  - Dandekar, Thomas
T1  - Metadata integrity in bioinformatics: bridging the gap between data and knowledge
JF  - Computational and Structural Biotechnology Journal
N2  - In the fast-evolving landscape of biomedical research, the emergence of big data has presented researchers with extraordinary opportunities to explore biological complexities. In biomedical research, big data also imply a big responsibility. This is not only because genomics data are sensitive information but also because they are shared and re-analysed within the scientific community. Such reuse saves valuable resources and can even help to find new insights in silico. To fully use these opportunities, detailed and correct metadata are imperative. This includes not only the availability of metadata but also their correctness. Metadata integrity serves as a fundamental determinant of research credibility, supporting the reliability and reproducibility of data-driven findings. Ensuring metadata availability, curation, and accuracy is therefore essential for bioinformatic research. Not only must metadata be readily available, but they must also be meticulously curated and ideally error-free. Motivated by an accidental discovery of a critical metadata error in patient data published in two high-impact journals, we aim to raise awareness of the need for correct, complete, and curated metadata.
We describe how the metadata error was found and addressed, and present examples of metadata-related challenges in omics research, along with supporting measures, including tools for checking metadata and software to facilitate various steps from data analysis to published research. Highlights • Data awareness and data integrity underpin the trustworthiness of results and subsequent further analysis. • Big data and bioinformatics enable efficient resource use by repurposing publicly available RNA-Sequencing data. • Manual checks of data quality and integrity are insufficient due to the overwhelming and rapidly growing volume of data. • Automation and artificial intelligence provide cost-effective and efficient solutions for data integrity and quality checks. • FAIR data management, various software solutions and analysis tools assist metadata maintenance.
KW  - meta-data
KW  - error
KW  - annotation
KW  - error-transfer
KW  - wrong labelling
KW  - patient data
KW  - control group
KW  - tools overview
Y1  - 2023
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bvb:20-opus-349990
SN  - 2001-0370
VL  - 21
ER  - 

TY  - JOUR
A1  - Caliskan, Aylin
A1  - Crouch, Samantha A. W.
A1  - Giddins, Sara
A1  - Dandekar, Thomas
A1  - Dangwal, Seema
T1  - Progeria and aging — Omics based comparative analysis
JF  - Biomedicines
N2  - Since ancient times aging has also been regarded as a disease, and humankind has always strived to extend the natural lifespan. Analyzing the genes involved in aging and disease allows for finding important indicators and biological markers for pathologies and possible therapeutic targets. An example of the use of omics technologies is the research regarding aging and the rare and fatal premature aging syndrome progeria (Hutchinson-Gilford progeria syndrome, HGPS).
In our study, we focused on the in silico analysis of differentially expressed genes (DEGs) in progeria and aging, using a publicly available RNA-Seq dataset (GEO dataset GSE113957) and a variety of bioinformatics tools. Although the GSE113957 RNA-Seq dataset is well-known and frequently analyzed, the RNA-Seq data shared by Fleischer et al. is far from exhausted, and reusing and repurposing the data still reveals new insights. By analyzing the literature citing the use of the dataset and subsequently conducting a comparative analysis of the RNA-Seq data from different subsets of the dataset (healthy children, nonagenarians and progeria patients), we identified several genes involved in both natural aging and progeria (KRT8, KRT18, ACKR4, CCL2, UCP2, ADAMTS15, ACTN4P1, WNT16, IGFBP2). Further analysis of these genes and the pathways involved indicated their possible roles in aging, suggesting the need for further in vitro and in vivo research. In this paper, we (1) compare “normal aging” (nonagenarians vs. healthy children) and progeria (HGPS patients vs. healthy children), (2) list genes possibly involved in both the natural aging process and progeria, including the first mention of IGFBP2 in progeria, (3) predict miRNAs and interactomes for WNT16 (hsa-mir-181a-5p), UCP2 (hsa-mir-26a-5p and hsa-mir-124-3p), and IGFBP2 (hsa-mir-124-3p, hsa-mir-126-3p, and hsa-mir-27b-3p), (4) demonstrate the compatibility of well-established R packages for RNA-Seq analysis for researchers interested in but not yet familiar with this kind of analysis, and (5) present comparative proteomics analyses to show an association between our RNA-Seq data analyses and corresponding changes in protein expression.
KW  - progeria
KW  - aging
KW  - omics
KW  - RNA sequencing
KW  - bioinformatics
KW  - sun exposure
KW  - HGPS
KW  - IGFBP2
KW  - ACKR4
KW  - WNT
Y1  - 2022
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bvb:20-opus-289868
SN  - 2227-9059
VL  - 10
IS  - 10
ER  - 