Novel Techniques for Efficient and Effective Subgroup Discovery

Lemmerich, Florian

Large volumes of data are collected today in many domains. Often, so much data is available that it is difficult to identify the relevant pieces of information. Knowledge discovery seeks to obtain novel, interesting and useful information from large datasets. One key technique for that purpose is subgroup discovery. It aims at identifying descriptions of subsets of the data that have an interesting distribution with respect to a predefined target concept. This work improves the efficiency and effectiveness of subgroup discovery in different directions. For efficient exhaustive subgroup discovery, algorithmic improvements are proposed for three important variations of the standard setting: First, novel optimistic estimate bounds are derived for subgroup discovery with numeric target concepts. These allow for skipping the evaluation of large parts of the search space without influencing the results. Additionally, necessary adaptations to data structures for this setting are discussed. Second, for exceptional model mining, that is, subgroup discovery with a model over multiple attributes as target concept, a generic extension of the well-known FP-tree data structure is introduced. The modified data structure stores intermediate condensed data representations, which depend on the chosen model class, in the nodes of the trees. This makes the approach applicable to many popular model classes. 
Third, subgroup discovery with generalization-aware measures is investigated. These interestingness measures compare the target share or mean value in the subgroup with the respective maximum value in all its generalizations. For this setting, a novel method for deriving optimistic estimates is proposed. In contrast to previous approaches, the new estimates are not exclusively based on the anti-monotonicity of instance coverage, but also take the difference in coverage between the subgroup and its generalizations into account. In all three areas, the advances lead to runtime improvements of more than an order of magnitude. The second part of the contributions focuses on the effectiveness of subgroup discovery. These improvements aim to identify more interesting subgroups in practical applications. For that purpose, the concept of expectation-driven subgroup discovery is introduced as a new family of interestingness measures. It computes the score of a subgroup based on the difference between the actual target share and the target share that could be expected given the statistics for the separate influence factors that are combined to describe the subgroup. In doing so, previously undetected interesting subgroups are discovered, while other, partially redundant findings are suppressed. Furthermore, this work also addresses practical issues of subgroup discovery: In that direction, the VIKAMINE II tool is presented, which extends its predecessor with a rebuilt user interface, novel algorithms for automatic discovery, new interactive mining techniques, as well as novel options for result presentation and introspection. Finally, some real-world applications are described that used the presented techniques. These include the identification of influence factors on the success and satisfaction of university students and the description of locations using tagging data of geo-referenced images.
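
To illustrate how an optimistic estimate lets an exhaustive search skip parts of the search space for a numeric target, the following is a minimal sketch and not the thesis's own algorithm or bounds. It assumes the simple quality function q(sg) = n * (mean_sg - mean_all) and the elementary bound that sums only the above-average target values in the subgroup; no specialization can score higher, because it covers a subset of those instances. The names `quality`, `optimistic_estimate` and `discover`, the selector format and the toy data are all hypothetical.

```python
def quality(values, mu):
    """Assumed quality: n * (subgroup mean - overall mean) = sum of deviations."""
    return sum(v - mu for v in values)


def optimistic_estimate(values, mu):
    """Upper bound on the quality of any specialization: a specialization
    covers a subset of these instances, so at best it keeps exactly the
    instances whose target value lies above the overall mean."""
    return sum(v - mu for v in values if v > mu)


def discover(rows, target, selectors, k=1):
    """Depth-first search over conjunctions of (attribute, value) selectors,
    returning the top-k subgroups and the number of evaluated candidates."""
    mu = sum(r[target] for r in rows) / len(rows)
    results = []   # (quality, description), best first
    evaluated = 0

    def threshold():
        # Quality a candidate must exceed to enter a full top-k list.
        return results[-1][0] if len(results) == k else float("-inf")

    def search(description, covered, start):
        nonlocal evaluated
        for i in range(start, len(selectors)):
            attr, val = selectors[i]
            sub = [r for r in covered if r[attr] == val]
            if not sub:
                continue  # empty subgroups are not evaluated
            evaluated += 1
            vals = [r[target] for r in sub]
            desc = description + [(attr, val)]
            q = quality(vals, mu)
            if q > threshold():
                results.append((q, desc))
                results.sort(key=lambda t: -t[0])
                del results[k:]
            # Pruning: descend only if some specialization could still
            # beat the current top-k threshold.
            if optimistic_estimate(vals, mu) > threshold():
                search(desc, sub, i + 1)

    search([], rows, 0)
    return results, evaluated


# Hypothetical toy data: binary attributes a, b and numeric target y.
rows = [
    {"a": 1, "b": 1, "y": 10},
    {"a": 1, "b": 0, "y": 8},
    {"a": 0, "b": 1, "y": 2},
    {"a": 0, "b": 0, "y": 0},
]
selectors = [("a", 1), ("a", 0), ("b", 1), ("b", 0)]
top, evaluated = discover(rows, "y", selectors, k=1)
print(top, evaluated)
```

On this toy data the bound already rules out every two-selector conjunction once the subgroup a = 1 has been found, so only the four single selectors are evaluated. The result is the same as without pruning, which is the property the abstract describes.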

Author(s):	Florian Lemmerich
URN:	urn:nbn:de:bvb:20-opus-97812
Document type:	Doctoral thesis
Degree-granting faculty:	Universität Würzburg, Fakultät für Mathematik und Informatik
University institutes:	Fakultät für Mathematik und Informatik / Institut für Informatik
Reviewers / advisors:	Prof. Dr. Frank Puppe, Prof. Dr. Nada Lavrac, Dr. Arno Knobbe
Date of final examination:	31.03.2014
Language of publication:	English
Year of publication:	2014
General subject classification (DDC):	0 Computer science, information and general works / 00 Computer science, knowledge and systems / 000 Computer science, information and general works
Standardized keywords (GND):	Data Mining; Wissensextraktion (knowledge extraction)
Free keywords:	Pattern Mining; Subgroup Discovery
Computer science classification (CCS):	D. Software
Release date:	31.03.2015
License (German):	CC BY: Creative Commons license: Attribution

Title (German):	Neue Techniken für effiziente und effektive Subgruppenentdeckung
