54 Informatik
Filtern
Dokumenttyp
- Masterarbeit (3)
- Dissertation (1)
Schlagworte
- Analysis of social platform (1)
- Handsfree editing (1)
- Natural Language Processing (1)
- Reddit (1)
- Text classification (1)
Institut
- Institute for Web Science and Technologies (4) (entfernen)
The distributed setting of RDF stores in the cloud poses many challenges. One such challenge is how the data placement on the compute nodes can be optimized to improve the query performance. To address this challenge, several evaluations in the literature have investigated the effects of existing data placement strategies on the query performance. A common drawback in theses evaluations is that it is unclear whether the observed behaviors were caused by the data placement strategies (if different RDF stores were evaluated as a whole) or reflect the behavior in distributed RDF stores (if cloud processing frameworks like Hadoop MapReduce are used for the evaluation). To overcome these limitations, this thesis develops a novel benchmarking methodology for data placement strategies that uses a data-placement-strategy-independent distributed RDF store to analyze the effect of the data placement strategies on query performance.
With this evaluation methodology the frequently used data placement strategies have been evaluated. This evaluation challenged the commonly held belief that data placement strategies that emphasize local computation, such as minimal edge-cut cover, lead to faster query executions. The results indicate that queries with a high workload may be executed faster on hash-based data placement strategies than on, e.g., minimal edge-cut covers. The analysis of the additional measurements indicates that vertical parallelization (i.e., a well-distributed workload) may be more important than horizontal containment (i.e., minimal data transport) for efficient query processing.
Moreover, to find a data placement strategy with a high vertical parallelization, the thesis tests the hypothesis that collocating small connected triple sets on the same compute node while balancing the amount of triples stored on the different compute nodes leads to a high vertical parallelization. Specifically, the thesis proposes two such data placement strategies. The first strategy called overpartitioned minimal edge-cut cover was found in the literature and the second strategy is the newly developed molecule hash cover. The evaluation revealed a balanced query workload and a high horizontal containment, which lead to a high vertical parallelization. As a result these strategies showed a better query performance than the frequently used data placement strategies.
Commonsense reasoning can be seen as a process of identifying dependencies amongst events and actions. Understanding the circumstances surrounding these events requires background knowledge with sufficient breadth to cover a wide variety of domains. In the recent decades, there has been a lot of work in extracting commonsense knowledge, a number of these projects provide their collected data as semantic networks such as ConceptNet and CausalNet. In this thesis, we attempt to undertake the Choice Of Plausible Alternatives (COPA) challenge, a problem set with 1000 questions written in multiple-choice format with a premise and two alternative choices for each question. Our approach differs from previous work by using shortest paths between concepts in a causal graph with the edge weight as causality metric. We use CausalNet as primary network and implement a few design choices to explore the strengths and drawbacks of this approach, and propose an extension using ConceptNet by leveraging its commonsense knowledge base.
The content aggregator platform Reddit has established itself as one of the most popular websites in the world. However, scientific research on Reddit is hindered as Reddit allows (and even encourages) user anonymity, i.e., user profiles do not contain personal information such as the gender. Inferring the gender of users in large-scale could enable the analysis of gender-specific areas of interest, reactions to events, and behavioral patterns. In this direction, this thesis suggests a machine learning approach of estimating the gender of Reddit users. By exploiting specific conventions in parts of the website, we obtain a ground truth for more than 190 million comments of labeled users. This data is then used to train machine learning classifiers to use them to gain insights about the gender balance of particular subreddits and the platform in general. By comparing a variety of different approaches for classification algorithm, we find that character-level convolutional neural network achieves performance with an 82.3% F1 score on a task of predicting a gender of a user based on his/her comments. The score surpasses 85% mark for frequent users with more than 50 comments. Furthermore, we discover that female users are less active on Reddit platform, they write fewer comments and post in fewer subreddits on average, when compared to male users.