Social interactions as introduced by Web 2.0 applications during the last decade have changed the way the Internet is used. Today, it is part of our daily lives to maintain contacts through social networks, to comment on the latest developments in microblogging services or to save and share information snippets such as photos or bookmarks online.
Social bookmarking systems are part of this development. Users can share links to interesting web pages by publishing bookmarks and providing descriptive keywords for them. The structure which evolves from the collection of annotated bookmarks is called a folksonomy. The sharing of interesting and relevant posts enables new ways of retrieving information from the Web. Users can search or browse the folksonomy looking at resources related to specific tags or users. Ranking methods known from search engines have been adjusted to facilitate retrieval in social bookmarking systems. Hence, social bookmarking systems have become an alternative or addendum to search engines.
In order to better understand the commonalities and differences of social bookmarking systems and search engines, this thesis compares several aspects of the two systems' structure, usage behaviour and content. This includes the use of tags and query terms, the composition of the document collections and the rankings of bookmarks and search engine URLs. Searchers (recorded via session IDs), their search terms and the URLs they clicked on can be extracted from a search engine query logfile. Together they form links similar to those found in folksonomies, where a user annotates a resource with tags. We use this analogy to build a tripartite hypergraph from query logfiles (a logsonomy), and compare structural and semantic properties of logsonomies and folksonomies. Overall, we have found similar behavioural, structural and semantic characteristics in both systems. Driven by this insight, we investigate whether folksonomy data can be of use in web information retrieval in a similar way to query log data: we construct training data from query logs and a folksonomy to build models for a learning-to-rank algorithm. First experiments show a positive correlation between the ranking results generated from the ranking models of both systems. The research is based on various data collections from the social bookmarking systems BibSonomy and Delicious, Microsoft's search engine MSN (now Bing) and Google data.
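The analogy between query logs and folksonomies can be illustrated with a minimal sketch. It assumes a hypothetical log record of (session id, query string, clicked URL) and splits queries on whitespace to obtain "tags"; the record layout, the tokenisation and the names `LogEntry`, `build_logsonomy` and `tag_cooccurrence` are illustrative assumptions, not the data format or code used in the thesis.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical query-log record (assumed layout)."""
    session_id: str   # plays the role of the folksonomy "user"
    query: str        # its terms play the role of "tags"
    clicked_url: str  # plays the role of the "resource"


def build_logsonomy(entries: Iterable[LogEntry]) -> set:
    """Return the hyperedges (user, tag, resource) of the tripartite hypergraph."""
    triples = set()
    for entry in entries:
        # Assumption: lower-cased whitespace tokens stand in for tags.
        for term in entry.query.lower().split():
            triples.add((entry.session_id, term, entry.clicked_url))
    return triples


def tag_cooccurrence(triples: set) -> dict:
    """Count how often two tags share the same (user, resource) post."""
    posts = defaultdict(set)
    for user, tag, resource in triples:
        posts[(user, resource)].add(tag)
    counts = defaultdict(int)
    for tags in posts.values():
        ordered = sorted(tags)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                counts[(first, second)] += 1
    return dict(counts)


log = [
    LogEntry("s1", "python tutorial", "http://example.org/a"),
    LogEntry("s2", "Python Tutorial", "http://example.org/a"),
    LogEntry("s2", "python docs", "http://example.org/b"),
]
triples = build_logsonomy(log)
cooccurrence = tag_cooccurrence(triples)
```

Because the result has the same (user, tag, resource) shape as a folksonomy, the same structural measures, such as the tag co-occurrence counts above, can be computed on both kinds of data.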
To keep social bookmarking systems a good source for information retrieval, providers need to fight spam. This thesis introduces and analyses different features derived from the specific characteristics of social bookmarking systems for use in spam detection classification algorithms. The best results are obtained from a combination of profile, activity, semantic and location-based features. Based on the experiments, a spam detection framework which identifies and eliminates spam activities for the social bookmarking system BibSonomy has been developed.
The storing and publication of user-related bookmarks and profile information raises questions about user data privacy. What kinds of personal information are collected, and how do systems handle user-related items? In order to answer these questions, the thesis looks into the handling of data privacy in the social bookmarking system BibSonomy. Legal guidelines about how to deal with the private data collected and processed in social bookmarking systems are also presented. Experiments show that the consideration of user data privacy in the process of feature design can be a first step towards strengthening data privacy.
Large volumes of data are collected today in many domains. Often, there is so much data available that it is difficult to identify the relevant pieces of information. Knowledge discovery seeks to obtain novel, interesting and useful information from large datasets.
One key technique for that purpose is subgroup discovery. It aims at identifying descriptions for subsets of the data, which have an interesting distribution with respect to a predefined target concept. This work improves the efficiency and effectiveness of subgroup discovery in different directions.
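As a rough illustration of the task, the following sketch enumerates conjunctive descriptions over a toy dataset and ranks them for a binary target. The quality function used here, weighted relative accuracy, is a common choice in the subgroup discovery literature but is an assumption of this example; the abstract does not name a specific measure, and the data and function names are made up.

```python
from itertools import combinations


def wracc(rows, target, description):
    """Weighted relative accuracy: (n / N) * (p_subgroup - p_overall)."""
    covered = [r for r in rows if all(r[a] == v for a, v in description)]
    if not covered:
        return 0.0
    p_overall = sum(r[target] for r in rows) / len(rows)
    p_subgroup = sum(r[target] for r in covered) / len(covered)
    return len(covered) / len(rows) * (p_subgroup - p_overall)


def discover(rows, target, max_depth=2, top_k=3):
    """Exhaustively score all conjunctions of up to max_depth selectors."""
    attributes = [a for a in rows[0] if a != target]
    selectors = sorted({(a, r[a]) for r in rows for a in attributes})
    results = []
    for depth in range(1, max_depth + 1):
        for description in combinations(selectors, depth):
            # Skip descriptions that use the same attribute twice.
            if len({a for a, _ in description}) < depth:
                continue
            results.append((wracc(rows, target, description), description))
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top_k]


# Toy data (invented): the binary target is "ill".
data = [
    {"smoker": "yes", "age": "old", "ill": 1},
    {"smoker": "yes", "age": "old", "ill": 1},
    {"smoker": "yes", "age": "young", "ill": 1},
    {"smoker": "no", "age": "old", "ill": 0},
    {"smoker": "no", "age": "young", "ill": 0},
    {"smoker": "no", "age": "young", "ill": 0},
    {"smoker": "yes", "age": "young", "ill": 0},
    {"smoker": "no", "age": "old", "ill": 0},
]
top = discover(data, "ill")
best_quality, best_description = top[0]
```

On this toy data the description `smoker = yes` ranks first: it covers half of the instances and has a target share of 0.75 against an overall share of 0.375.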
For efficient exhaustive subgroup discovery, algorithmic improvements are proposed for three important variations of the standard setting: First, novel optimistic estimate bounds are derived for subgroup discovery with numeric target concepts. These allow for skipping the evaluation of large parts of the search space without influencing the results. Additionally, necessary adaptations to data structures for this setting are discussed. Second, for exceptional model mining, that is, subgroup discovery with a model over multiple attributes as target concept, a generic extension of the well-known FP-tree data structure is introduced. The modified data structure stores intermediate condensed data representations, which depend on the chosen model class, in the nodes of the trees. This makes the approach applicable to many popular model classes. Third, subgroup discovery with generalization-aware measures is investigated. These interestingness measures compare the target share or mean value in the subgroup with the respective maximum value in all its generalizations. For this setting, a novel method for deriving optimistic estimates is proposed. In contrast to previous approaches, the novel measures are not exclusively based on the anti-monotonicity of instance coverage, but also take the difference of coverage between the subgroup and its generalizations into account. In all three areas, the advances lead to runtime improvements of more than an order of magnitude.
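How an optimistic estimate prunes the search for a numeric target can be sketched as follows. The example assumes the simple quality function q(S) = n_S * (mean_S - mean_0), which equals the sum of (t - mean_0) over the covered instances. Since every refinement covers a subset of those instances, the sum of the positive terms alone is an upper bound for all refinements. This textbook-style bound is chosen for illustration and is not claimed to be one of the novel bounds derived in the thesis; data and names are invented.

```python
def quality(values, mu0):
    """q(S) = n_S * (mean_S - mu0), written as a sum over covered instances."""
    return sum(v - mu0 for v in values)


def optimistic_estimate(values, mu0):
    """Upper bound for the quality of any subset: keep only positive terms."""
    return sum(v - mu0 for v in values if v > mu0)


def best_subgroup(rows, target, use_pruning=True):
    mu0 = sum(r[target] for r in rows) / len(rows)
    attributes = [a for a in rows[0] if a != target]
    selectors = sorted({(a, r[a]) for r in rows for a in attributes})
    best = {"quality": float("-inf"), "description": None}
    evaluated = 0

    def expand(description, covered, start):
        nonlocal evaluated
        for i in range(start, len(selectors)):
            attribute, value = selectors[i]
            if any(a == attribute for a, _ in description):
                continue  # one selector per attribute
            subset = [r for r in covered if r[attribute] == value]
            if not subset:
                continue
            evaluated += 1
            values = [r[target] for r in subset]
            refined = description + ((attribute, value),)
            q = quality(values, mu0)
            if q > best["quality"]:
                best["quality"], best["description"] = q, refined
            # No refinement can beat the current best: skip the whole branch.
            if use_pruning and optimistic_estimate(values, mu0) <= best["quality"]:
                continue
            expand(refined, subset, i + 1)

    expand((), rows, 0)
    return best["description"], best["quality"], evaluated


# Toy data (invented): the numeric target is "sales".
data = [
    {"region": "north", "type": "a", "sales": 10.0},
    {"region": "north", "type": "b", "sales": 9.0},
    {"region": "south", "type": "a", "sales": 2.0},
    {"region": "south", "type": "b", "sales": 1.0},
    {"region": "north", "type": "a", "sales": 8.0},
    {"region": "south", "type": "b", "sales": 0.0},
]
pruned = best_subgroup(data, "sales", use_pruning=True)
full = best_subgroup(data, "sales", use_pruning=False)
```

Both runs return the same best subgroup, but the pruned run evaluates fewer candidates, which is the effect described above: parts of the search space are skipped without influencing the result.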
The second part of the contributions focuses on the effectiveness of subgroup discovery. These improvements aim to identify more interesting subgroups in practical applications. For that purpose, the concept of expectation-driven subgroup discovery is introduced as a new family of interestingness measures. It computes the score of a subgroup based on the difference between the actual target share and the target share that could be expected given the statistics for the separate influence factors that are combined to describe the subgroup. In doing so, previously undetected interesting subgroups are discovered, while other, partially redundant findings are suppressed.
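The idea of scoring against an expectation can be made concrete with a small sketch. How the expected target share is derived is not specified in this abstract; the example below assumes, purely for illustration, that the single-factor target shares are combined by multiplying likelihood ratios on the odds scale (a naive conditional-independence model). The function names and numbers are hypothetical.

```python
def odds(p):
    """Convert a probability in (0, 1) to odds."""
    return p / (1.0 - p)


def expected_share(p0, factor_shares):
    """Expected target share of a conjunction, given the overall share p0 and
    the target shares of its individual factors.

    Assumption: factors act independently, so their likelihood ratios
    multiply on the odds scale. All shares must lie strictly in (0, 1).
    """
    combined = odds(p0)
    for p in factor_shares:
        combined *= odds(p) / odds(p0)
    return combined / (1.0 + combined)


def expectation_driven_score(n, p_actual, p0, factor_shares):
    """Size-weighted difference between the actual and the expected share."""
    return n * (p_actual - expected_share(p0, factor_shares))


# Two factors that each raise the target share from 0.5 to 0.8.
p_expected = expected_share(0.5, [0.8, 0.8])
# A conjunction with share 0.85 beats the overall share of 0.5 by a wide
# margin, yet it falls short of what the two factors already suggest.
score = expectation_driven_score(100, 0.85, 0.5, [0.8, 0.8])
```

Under this assumed model the conjunction receives a negative score even though its target share is far above the overall share, which illustrates how partially redundant findings can be suppressed.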
Furthermore, this work addresses practical issues of subgroup discovery. To this end, the VIKAMINE II tool is presented, which extends its predecessor with a rebuilt user interface, novel algorithms for automatic discovery, new interactive mining techniques, as well as novel options for result presentation and introspection. Finally, some real-world applications are described that utilized the presented techniques. These include the identification of influence factors on the success and satisfaction of university students and the description of locations using tagging data of geo-referenced images.
Data mining has proved its significance in various domains and applications. As an important subfield of the general data mining task, subgroup mining can be used, e.g., for marketing purposes in business domains, or for quality profiling and analysis in medical domains. The goal is to efficiently discover novel, potentially useful and ultimately interesting knowledge. However, in real-world situations these requirements often cannot be fulfilled, e.g., if the applied methods do not scale for large data sets, if too many results are presented to the user, or if many of the discovered patterns are already known to the user. This thesis proposes a combination of several techniques in order to cope with the problems outlined above: We discuss automatic methods, including heuristic and exhaustive approaches, and in particular present the novel SD-Map algorithm for exhaustive subgroup discovery that is fast and effective. For an interactive approach we describe techniques for subgroup introspection and analysis, and we present advanced visualization methods, e.g., the zoomtable that directly shows the most important parameters of a subgroup and that can be used for optimization and exploration. We also describe various visualizations for subgroup comparison and evaluation in order to support the user during these essential steps. Furthermore, we propose to include in the mining process any available background knowledge that is easy to formalize. We can utilize the knowledge in many ways: to focus the search process, to restrict the search space, and ultimately to increase the efficiency of the discovery method. In particular, we present background knowledge to be applied for filtering the elements of the problem domain, for constructing abstractions, for aggregating values of attributes, and for the post-processing of the discovered set of patterns.
Finally, the techniques are combined into a knowledge-intensive process supporting both automatic and interactive methods for subgroup mining. The practical significance of the proposed approach strongly depends on the available tools. We introduce the VIKAMINE system as a highly integrated environment for knowledge-intensive active subgroup mining. We also present an evaluation consisting of two parts: With respect to objective evaluation criteria, i.e., comparing the efficiency and the effectiveness of the subgroup discovery methods, we provide an experimental evaluation using generated data. For that task we present a novel data generator that allows a simple and intuitive specification of the data characteristics. The results of the experimental evaluation indicate that, on data sets similar to the intended application, the novel SD-Map method outperforms the other described algorithms in efficiency, and also outperforms the heuristic methods with respect to precision and recall. Subjective evaluation criteria include the user acceptance, the benefit of the approach, and the interestingness of the results. We present five case studies utilizing the presented techniques: The approach has been successfully implemented in medical and technical applications using real-world data sets. The method was very well accepted by the users, who were able to discover novel, useful, and interesting knowledge.