Detecting Fraud in a Large Anonymized Voter Registration Dataset

Nahid Anwar, Amit Jain, Edoardo Serra, Chad Houck

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Voter registration in the United States relies on state-controlled lists that record all legally eligible voters. Election fraud often centers on the manipulation and misuse of voter registration data, potentially allowing ineligible votes or denying legitimate voters the right to participate. Voter registration data is therefore part of the backbone of the United States democratic system and central to election integrity. In today's dynamic political landscape, sharing this data promotes transparency and accountability. However, accessing real voter registration records can be challenging due to privacy concerns about personally identifiable information.

In this paper, we present the Idaho Voter Registration Election Dataset (IVRED), which contains anonymized records of real voter data from the Idaho Secretary of State and a curated set of synthetically generated fraudulent records. Although real instances of voter registration fraud are rare and difficult to identify, potential vulnerabilities have yet to be thoroughly explored. By consulting with domain experts, we have identified various scenarios in which voter registration data could be manipulated.

Additionally, we provide a similarity graph for each significant attribute, illustrating the inter-relationships between attribute values. Combining these similarity graphs with our anonymized dataset enables the construction of a comprehensive graph consisting of over 2.1 million nodes and 22 million edges. This allows for the application of advanced machine learning techniques, including spectral graph positioning (also known as positional embedding), which improves the classification of fraudulent voter records compared to a baseline classification model with no embedding features. This demonstrates the utility of our dataset and highlights its potential to detect election fraud, which will lead to increased confidence in our election process.

To the best of our knowledge, this is the first released voter registration dataset that includes all fields. This resource will hopefully stimulate research in election security, enabling researchers to develop new analytical tools using machine learning techniques.
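
The spectral positional embedding mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spectral_positional_embedding`, the seven-node toy graph, and the edge weights are all hypothetical, and the sketch assumes the embedding is taken from the low-frequency eigenvectors of the symmetric normalized graph Laplacian, which is one common definition.

```python
import numpy as np

def spectral_positional_embedding(edges, num_nodes, dim):
    """Return a (num_nodes, dim) array of Laplacian eigenvector coordinates.

    Hypothetical helper for illustration; assumes a connected, undirected graph.
    """
    # Build a symmetric weighted adjacency matrix from (u, v, weight) triples.
    A = np.zeros((num_nodes, num_nodes))
    for u, v, w in edges:
        A[u, v] += w
        A[v, u] += w
    deg = A.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = np.eye(num_nodes) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(L)
    # Skip the first (trivial) eigenvector and keep the next `dim` columns.
    return eigvecs[:, 1:dim + 1]

# Toy graph (made up): nodes 0-3 stand for voter records, nodes 4-6 for
# anonymized attribute values. Record-to-value edges have weight 1.0;
# value-to-value edges carry an assumed similarity score.
edges = [
    (0, 4, 1.0), (1, 4, 1.0), (2, 5, 1.0), (3, 6, 1.0),  # record -> value
    (4, 5, 0.8), (5, 6, 0.1),                            # value similarity
]
emb = spectral_positional_embedding(edges, num_nodes=7, dim=2)
record_features = emb[:4]  # rows for the four record nodes
```

In a setup like the one the abstract describes, rows such as `record_features` would be appended to each record's tabular features before training a classifier. The dense eigendecomposition used here is only practical for small graphs; a graph with millions of nodes would need a sparse eigensolver instead.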

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
Editors: Wei Ding, Chang-Tien Lu, Fusheng Wang, Liping Di, Kesheng Wu, Jun Huan, Raghu Nambiar, Jundong Li, Filip Ilievski, Ricardo Baeza-Yates, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2275-2282
Number of pages: 8
ISBN (Electronic): 9798350362480
State: Published - 2024
Event: 2024 IEEE International Conference on Big Data, BigData 2024 - Washington, United States
Duration: 15 Dec 2024 to 18 Dec 2024

Publication series

Name: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024

Conference

Conference: 2024 IEEE International Conference on Big Data, BigData 2024
Country/Territory: United States
City: Washington
Period: 15/12/24 to 18/12/24

Keywords

  • data anonymization
  • election fraud
  • extreme gradient boosting
  • machine learning
  • positional embedding
  • privacy
  • random forest
  • similarity graph
  • voter registration data
