The Bag of Communities: Identifying Abusive Behavior Online with Preexisting Internet Data
Item
Title
The Bag of Communities: Identifying Abusive Behavior Online with Preexisting Internet Data
CHI '17
Creator
Eshwar Chandrasekharan
Mattia Samory
Anirudh Srinivasan
Eric Gilbert
Abstract
Since its earliest days, harassment and abuse have plagued the Internet. Recent research has focused on in-domain methods to detect abusive content and faces several challenges, most notably the need to obtain large training corpora. In this paper, we introduce a novel computational approach to address this problem called Bag of Communities (BoC), a technique that leverages large-scale, preexisting data from other Internet communities. We then apply BoC toward identifying abusive behavior within a major Internet community. Specifically, we compute a post's similarity to 9 other communities from 4chan, Reddit, Voat and MetaFilter. We show that a BoC model can be used on communities "off the shelf" with roughly 75% accuracy; no training examples are needed from the target community. A dynamic BoC model achieves 91.18% accuracy after seeing 100,000 human-moderated posts, and uniformly outperforms in-domain methods. Using this conceptual and empirical work, we argue that the BoC approach may allow communities to deal with a range of common problems, like abusive behavior, faster and with fewer engineering resources.
Date
2017
Is Part Of
Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems
Publisher
New York, NY, USA
ACM
Pages
3175–3187
Language
EN English
DOI
10.1145/3025453.3026018
ISBN
978-1-4503-4655-9
Short Title
The Bag of Communities