Refine
Has Fulltext
- yes (2)
Is part of the Bibliography
- yes (2)
Year of publication
- 2015 (2) (remove)
Document Type
- Doctoral Thesis (2)
Language
- English (2) (remove)
Keywords
Institute
Context-specific Consistencies in Information Extraction: Rule-based and Probabilistic Approaches
(2015)
Large amounts of communication, documentation as well as knowledge and information are stored in textual documents. Most often, these texts like webpages, books, tweets or reports are only available in an unstructured representation since they are created and interpreted by humans. In order to take advantage of this huge amount of concealed information and to include it in analytic processes, it needs to be transformed into a structured representation. Information extraction considers exactly this task. It tries to identify well-defined entities and relations in unstructured data and especially in textual documents.
Interesting entities are often consistently structured within a certain context, especially in semi-structured texts. However, their actual composition varies and is possibly inconsistent among different contexts. Information extraction models stay behind their potential and return inferior results if they do not consider these consistencies during processing. This work presents a selection of practical and novel approaches for exploiting these context-specific consistencies in information extraction tasks. The approaches direct their attention not only to one technique, but are based on handcrafted rules as well as probabilistic models.
A new rule-based system called UIMA Ruta has been developed in order to provide optimal conditions for rule engineers. This system consists of a compact rule language with a high expressiveness and strong development support. Both elements facilitate rapid development of information extraction applications and improve the general engineering experience, which reduces the necessary efforts and costs when specifying rules.
The advantages and applicability of UIMA Ruta for exploiting context-specific consistencies are illustrated in three case studies. They utilize different engineering approaches for including the consistencies in the information extraction task. Either the recall is increased by finding additional entities with similar composition, or the precision is improved by filtering inconsistent entities. Furthermore, another case study highlights how transformation-based approaches are able to correct preliminary entities using the knowledge about the occurring consistencies.
The approaches of this work based on machine learning rely on Conditional Random Fields, popular probabilistic graphical models for sequence labeling. They take advantage of a consistency model, which is automatically induced during processing the document. The approach based on stacked graphical models utilizes the learnt descriptions as feature functions that have a static meaning for the model, but change their actual function for each document. The other two models extend the graph structure with additional factors dependent on the learnt model of consistency. They include feature functions for consistent and inconsistent entities as well as for additional positions that fulfill the consistencies.
The presented approaches are evaluated in three real-world domains: segmentation of scientific references, template extraction in curricula vitae, and identification and categorization of sections in clinical discharge letters. They are able to achieve remarkable results and provide an error reduction of up to 30% compared to usually applied techniques.
Knowledge-based systems (KBS) face an ever-increasing interest in various disciplines and contexts. Yet, the former aim to construct the ’perfect intelligent software’ continuously shifts to user-centered, participative solutions. Such systems enable users to contribute their personal knowledge to the problem solving process for increased efficiency and an ameliorated user experience. More precisely, we define non-functional key requirements of participative KBS as: Transparency (encompassing KBS status mediation), configurability (user adaptability, degree of user control/exploration), quality of the KB and UI, and evolvability (enabling the KBS to grow mature with their users). Many of those requirements depend on the respective target users, thus calling for a more user-centered development. Often, also highly expertise domains are targeted — inducing highly complex KBs — which requires a more careful and considerate UI/interaction design. Still, current KBS engineering (KBSE) approaches mostly focus on knowledge acquisition (KA) This often leads to non-optimal, little reusable, and non/little evaluated KBS front-end solutions.
In this thesis we propose a more encompassing KBSE approach. Due to the strong mutual influences between KB and UI, we suggest a novel form of intertwined UI and KB development. We base the approach on three core components for encompassing KBSE:
(1) Extensible prototyping, a tailored form of evolutionary prototyping; this builds on mature UI prototypes and offers two extension steps for the anytime creation of core KBS prototypes (KB + core UI) and fully productive KBS (core KBS prototype + common framing functionality). (2) KBS UI patterns, that define reusable solutions for the core KBS UI/interaction; we provide a basic collection of such patterns in this work. (3) Suitable usability instruments for the assessment of the KBS artifacts. Therewith, we do not strive for ’yet another’ self-contained KBS engineering methodology. Rather, we motivate to extend existing approaches by the proposed key components. We demonstrate this based on an agile KBSE model.
For practical support, we introduce the tailored KBSE tool ProKEt. ProKEt offers a basic selection of KBS core UI patterns and corresponding configuration options out of the box; their further adaption/extension is possible on various levels of expertise. For practical usability support, ProKEt offers facilities for quantitative and qualitative data collection. ProKEt explicitly fosters the suggested, intertwined development of UI and KB. For seamlessly integrating KA activities, it provides extension points for two selected external KA tools: For KnowOF, a standard office based KA environment. And for KnowWE, a semantic wiki for collaborative KA. Therewith, ProKEt offers powerful support for encompassing, user-centered KBSE.
Finally, based on the approach and the tool, we also developed a novel KBS type: Clarification KBS as a mashup of consultation and justification KBS modules. Those denote a specifically suitable realization for participative KBS in highly expertise contexts and consequently require a specific design. In this thesis, apart from more common UI solutions, we particularly also introduce KBS UI patterns especially tailored towards Clarification KBS.