Exploratory Wrangling and Annotation of Tweets
Berkay Dinçer
Computer Science, MSc. Thesis Defense, 2015
Thesis Jury
Assoc. Prof. Dr. Yücel Saygın (Thesis Advisor), Assoc. Prof. Dr. Hüsnü Yenigün, Assoc. Prof. Dr. Ali Inan
Date & Time: 23rd of July, 2015 – 13.00
Place: G025
Keywords: Data Mining, Clustering, Twitter, Wrangling
Abstract
Twitter is an ever-growing social platform that is full of ideas and opinions. A huge amount of data is produced daily, usually too much to process and mine for the opinions of individuals. As of 2010, 55 million tweets were sent daily, and the number has doubled by now. Twitter data is also unstructured as a text-based information source. Given this lack of structure and the huge volume of the data, it is nearly impossible to produce a reliable summary of all the ideas and opinions in real time. Therefore, in this work we propose a set of algorithms that cluster related and similar tweets discussing the same concept in the Twitter domain.
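As a rough illustration of what clustering similar tweets can look like, the sketch below groups tweets greedily by the Jaccard similarity of their word sets. This is an assumption made for illustration only, not the algorithm proposed in the thesis; the function names, the 0.5 threshold, and the sample tweets are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a tweet, drop URLs and @mentions, and return its set of word tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return set(re.findall(r"[\w#']+", text))

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_tweets(tweets, threshold=0.5):
    """Greedy single-pass clustering: each tweet joins the first cluster whose
    representative (the cluster's first tweet) is similar enough, otherwise it
    starts a new cluster. Returns lists of tweet indices."""
    clusters = []  # each entry: {"rep": token set, "members": [tweet indices]}
    for i, tweet in enumerate(tweets):
        tokens = tokenize(tweet)
        for cluster in clusters:
            if jaccard(tokens, cluster["rep"]) >= threshold:
                cluster["members"].append(i)
                break
        else:
            clusters.append({"rep": tokens, "members": [i]})
    return [cluster["members"] for cluster in clusters]

# Hypothetical sample data: three near-duplicate spam tweets and one unrelated tweet.
tweets = [
    "Win a free phone now http://spam.example/a",
    "win a FREE phone now http://spam.example/b",
    "Great match tonight, what a goal!",
    "Win a free phone now @someone",
]
clusters = cluster_tweets(tweets)
print(clusters)
```

In this toy input the three near-duplicate tweets fall into one cluster, which is the kind of grouping that makes repeated bot or spam content stand out.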
We demonstrate and explain how this clustering can be used on tweets. As a side benefit, we also use these algorithms to detect bot or spammer accounts on Twitter, since their tweets are placed in the same clusters. We show that transforming Twitter data into a clustered structure lets us address problems such as detecting bots and providing a concise summary of the data. These problems become solvable once the unstructured Twitter data environment is transformed into a more structured one by forming clusters and buckets over the data feed. Another interesting observation is that the clusters we form follow the Pareto principle: by inspecting only 20% of the clusters, we can cover 80% of the whole data.
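The Pareto observation can be checked with simple arithmetic once cluster sizes are known. The sketch below computes the share of tweets covered by the largest 20% of clusters; the function name and the cluster sizes are hypothetical and chosen only to illustrate the calculation, not taken from the thesis results.

```python
import math

def coverage_of_top_clusters(cluster_sizes, fraction=0.2):
    """Return the share of all items that falls in the largest `fraction` of clusters."""
    if not cluster_sizes:
        return 0.0
    ordered = sorted(cluster_sizes, reverse=True)
    # Inspect at least one cluster, rounding the cluster count up.
    k = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical cluster sizes (number of tweets per cluster), 1000 tweets in total.
sizes = [500, 300, 40, 35, 30, 25, 25, 20, 15, 10]
share = coverage_of_top_clusters(sizes)
print(f"Top 20% of clusters cover {share:.0%} of the tweets")
```

With these made-up sizes, the two largest of ten clusters hold 800 of the 1000 tweets, so inspecting 20% of the clusters covers 80% of the data.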