
Context-specific Consistencies in Information Extraction: Rule-based and Probabilistic Approaches

Kontextspezifische Konsistenzen in der Informationsextraktion: Regelbasierte und Probabilistische Ansätze

Please always quote using this URN: urn:nbn:de:bvb:20-opus-108352
  • Large amounts of communication, documentation, knowledge, and information are stored in textual documents. Most often, these texts, such as webpages, books, tweets, or reports, are only available in an unstructured representation, since they are created and interpreted by humans. In order to take advantage of this huge amount of concealed information and to include it in analytic processes, it needs to be transformed into a structured representation. Information extraction addresses exactly this task: it tries to identify well-defined entities and relations in unstructured data, especially in textual documents. Interesting entities are often consistently structured within a certain context, especially in semi-structured texts. However, their actual composition varies and may be inconsistent across different contexts. Information extraction models fall short of their potential and return inferior results if they do not consider these consistencies during processing. This work presents a selection of practical and novel approaches for exploiting such context-specific consistencies in information extraction tasks. The approaches are not restricted to a single technique, but are based on handcrafted rules as well as probabilistic models. A new rule-based system called UIMA Ruta has been developed in order to provide optimal conditions for rule engineers.
This system consists of a compact rule language with high expressiveness and strong development support. Both elements facilitate the rapid development of information extraction applications and improve the general engineering experience, which reduces the effort and cost of specifying rules. The advantages and applicability of UIMA Ruta for exploiting context-specific consistencies are illustrated in three case studies. They apply different engineering approaches for including the consistencies in the information extraction task: either the recall is increased by finding additional entities of similar composition, or the precision is improved by filtering inconsistent entities. Furthermore, another case study highlights how transformation-based approaches can correct preliminary entities using the knowledge about the occurring consistencies. The machine learning approaches of this work rely on Conditional Random Fields, popular probabilistic graphical models for sequence labeling. They take advantage of a consistency model that is automatically induced while processing the document. The approach based on stacked graphical models uses the learnt descriptions as feature functions that have a static meaning for the model, but change their actual function for each document. The other two models extend the graph structure with additional factors that depend on the learnt model of consistency. They include feature functions for consistent and inconsistent entities as well as for additional positions that fulfill the consistencies. The presented approaches are evaluated in three real-world domains: segmentation of scientific references, template extraction in curricula vitae, and identification and categorization of sections in clinical discharge letters. They achieve remarkable results and provide an error reduction of up to 30% compared to commonly applied techniques.
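The core idea of exploiting a context-specific consistency can be sketched as a two-pass procedure: a first pass proposes candidate entities, a document-specific pattern is induced from them, and a second pass filters candidates that violate it, which mirrors the precision-oriented strategy described above. The following is a minimal, hedged illustration only, not code from the thesis; the heuristic, all function names, and the toy data are assumptions made for this sketch.

```python
import re
from collections import Counter
from typing import List, Optional


def delimiter_after_year(reference: str) -> Optional[str]:
    """Toy first-pass heuristic: the character directly following the
    first four-digit year, standing in for a learnt segmentation cue."""
    match = re.search(r"\d{4}(.)", reference)
    return match.group(1) if match else None


def induce_consistency(candidates: List[str]) -> Optional[str]:
    """Induce the delimiter that most candidates in *this* document use,
    i.e. a document-specific consistency model."""
    counts = Counter(
        d for r in candidates if (d := delimiter_after_year(r)) is not None
    )
    return counts.most_common(1)[0][0] if counts else None


def filter_inconsistent(candidates: List[str]) -> List[str]:
    """Second pass: keep only candidates that agree with the induced
    pattern, trading a little recall for higher precision."""
    expected = induce_consistency(candidates)
    return [r for r in candidates if delimiter_after_year(r) == expected]


# Illustrative candidates from one hypothetical document.
refs = [
    "Smith, J. 2001. On parsing.",
    "Doe, A. 2003. On tagging.",
    "Noise 1999, not a reference",
    "Lee, K. 2010. On chunking.",
]
print(filter_inconsistent(refs))
```

In the thesis, the induced consistency model is richer than a single delimiter and is also fed back into the probabilistic models as document-specific feature functions or additional factors; this sketch only shows the filtering variant in its simplest form.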
  • This thesis deals with rule-based and probabilistic approaches to information extraction that exploit context-specific consistencies and thereby improve extraction accuracy.


Metadata
Author: Peter Klügl
URN:urn:nbn:de:bvb:20-opus-108352
Document Type:Doctoral Thesis
Granting Institution:Universität Würzburg, Fakultät für Mathematik und Informatik
Faculties:Fakultät für Mathematik und Informatik / Institut für Informatik
Referee:Prof. Dr. Frank Puppe, Prof. Dr. Andreas Dengel, Prof. Dr. Ulrich Furbach
Date of final exam:2014/12/12
Language:English
Year of Completion:2015
Publisher:Würzburg University Press
Place of publication:Würzburg
ISBN:978-3-95826-018-4 (print)
ISBN:978-3-95826-019-1 (online)
DOI:https://doi.org/10.25972/WUP-978-3-95826-019-1
Dewey Decimal Classification:0 Computer science, information, general works / 00 Computer science, knowledge, systems / 000 Computer science, information, general works
GND Keyword:Information Extraction; Machine Learning
Tag:knowledge engineering
CCS-Classification:H. Information Systems / H.1 MODELS AND PRINCIPLES / H.1.0 General
Release Date:2015/01/16
Licence:CC BY-SA (Creative Commons Attribution-ShareAlike)