TY - GEN
T1 - Detecting Undisclosed Paid Editing in Wikipedia
AU - Joshi, Nikesh
AU - Spezzano, Francesca
AU - Green, Mayson
AU - Hill, Elijah
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/4/20
Y1 - 2020/4/20
N2 - Wikipedia, the free and open-collaboration based online encyclopedia, has millions of pages that are maintained by thousands of volunteer editors. As per Wikipedia's fundamental principles, pages on Wikipedia are written with a neutral point of view and maintained by volunteer editors for free with well-defined guidelines in order to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violate such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing such. In this paper, we address for the first time the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework using a set of features based on both the content of the articles as well as the patterns of edit history of users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring system tool currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles. Finally, we show that our user-based features can also detect undisclosed paid editors with an AUROC of 0.94 and an average precision of 0.92, outperforming existing approaches.
AB - Wikipedia, the free and open-collaboration based online encyclopedia, has millions of pages that are maintained by thousands of volunteer editors. As per Wikipedia's fundamental principles, pages on Wikipedia are written with a neutral point of view and maintained by volunteer editors for free with well-defined guidelines in order to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violate such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing such. In this paper, we address for the first time the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework using a set of features based on both the content of the articles as well as the patterns of edit history of users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring system tool currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles. Finally, we show that our user-based features can also detect undisclosed paid editors with an AUROC of 0.94 and an average precision of 0.92, outperforming existing approaches.
KW - Detection of abusive content
KW - Malicious editors
KW - Sockpuppet accounts.
KW - Wikipedia
UR - https://www.scopus.com/pages/publications/85086577260
U2 - 10.1145/3366423.3380055
DO - 10.1145/3366423.3380055
M3 - Conference contribution
AN - SCOPUS:85086577260
T3 - The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020
SP - 2899
EP - 2905
BT - The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020
T2 - 29th International World Wide Web Conference, WWW 2020
Y2 - 20 April 2020 through 24 April 2020
ER -