TY - GEN
T1 - Detecting Fraud in a Large Anonymized Voter Registration Dataset
AU - Anwar, Nahid
AU - Jain, Amit
AU - Serra, Edoardo
AU - Houck, Chad
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Voter registration in the United States involves maintaining state-controlled lists that record all legally eligible voters. Election fraud often centers around the manipulation and misuse of voter registration data, potentially allowing ineligible votes or denying legitimate voters the right to participate. Voter registration data, therefore, is part of the backbone of the United States democratic system to ensure election integrity. In today's dynamic political landscape, sharing this data promotes transparency and accountability. However, accessing real voter registration records can be challenging due to privacy concerns about personally identifiable information. In this paper, we present the Idaho Voter Registration Election Dataset (IVRED), which contains anonymized records of real voter data from the Idaho Secretary of State and a curated set of synthetically generated fraudulent records. Although real instances of voter registration fraud are rare and difficult to identify, potential vulnerabilities are yet to be thoroughly explored. By consulting with domain experts, we have identified various scenarios in which voter registration data could be manipulated. Additionally, we provide a similarity graph for each significant attribute, illustrating the inter-relationships between attribute values. Combining these similarity graphs with our anonymized dataset enables the construction of a comprehensive graph, consisting of over 2.1 million nodes and 22 million edges. This allows for the application of advanced machine learning techniques, including spectral graph positioning - also known as positional embedding - which improves the classification of fraudulent voter records compared to baseline machine learning experiments (a classification model with no embedding features). This demonstrates the utility of our dataset and highlights its potential to detect election fraud, which will lead to increased confidence in our election process. To the best of our knowledge, this is the first released voter registration dataset that includes all fields. This resource will hopefully stimulate research in election security, enabling researchers to develop new analytical tools using machine learning techniques.
AB - Voter registration in the United States involves maintaining state-controlled lists that record all legally eligible voters. Election fraud often centers around the manipulation and misuse of voter registration data, potentially allowing ineligible votes or denying legitimate voters the right to participate. Voter registration data, therefore, is part of the backbone of the United States democratic system to ensure election integrity. In today's dynamic political landscape, sharing this data promotes transparency and accountability. However, accessing real voter registration records can be challenging due to privacy concerns about personally identifiable information. In this paper, we present the Idaho Voter Registration Election Dataset (IVRED), which contains anonymized records of real voter data from the Idaho Secretary of State and a curated set of synthetically generated fraudulent records. Although real instances of voter registration fraud are rare and difficult to identify, potential vulnerabilities are yet to be thoroughly explored. By consulting with domain experts, we have identified various scenarios in which voter registration data could be manipulated. Additionally, we provide a similarity graph for each significant attribute, illustrating the inter-relationships between attribute values. Combining these similarity graphs with our anonymized dataset enables the construction of a comprehensive graph, consisting of over 2.1 million nodes and 22 million edges. This allows for the application of advanced machine learning techniques, including spectral graph positioning - also known as positional embedding - which improves the classification of fraudulent voter records compared to baseline machine learning experiments (a classification model with no embedding features). This demonstrates the utility of our dataset and highlights its potential to detect election fraud, which will lead to increased confidence in our election process. To the best of our knowledge, this is the first released voter registration dataset that includes all fields. This resource will hopefully stimulate research in election security, enabling researchers to develop new analytical tools using machine learning techniques.
KW - data anonymization
KW - election fraud
KW - extreme gradient boosting
KW - machine learning
KW - positional embedding
KW - privacy
KW - random forest
KW - similarity graph
KW - voter registration data
UR - http://www.scopus.com/inward/record.url?scp=85218039873&partnerID=8YFLogxK
U2 - 10.1109/BigData62323.2024.10826115
DO - 10.1109/BigData62323.2024.10826115
M3 - Conference contribution
AN - SCOPUS:85218039873
T3 - Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
SP - 2275
EP - 2282
BT - Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
A2 - Ding, Wei
A2 - Lu, Chang-Tien
A2 - Wang, Fusheng
A2 - Di, Liping
A2 - Wu, Kesheng
A2 - Huan, Jun
A2 - Nambiar, Raghu
A2 - Li, Jundong
A2 - Ilievski, Filip
A2 - Baeza-Yates, Ricardo
A2 - Hu, Xiaohua
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Big Data, BigData 2024
Y2 - 15 December 2024 through 18 December 2024
ER -