TY - JOUR
T1 - Machine Learning Methods for Generating High Dimensional Discrete Datasets
AU - Manco, Giuseppe
AU - Ritacco, Ettore
AU - Rullo, Antonino
AU - Saccà, Domenico
AU - Serra, Edoardo
N1 - Publisher Copyright: © 2022 The Authors. WIREs Data Mining and Knowledge Discovery published by Wiley Periodicals LLC.
PY - 2022/3/1
Y1 - 2022/3/1
N2 - The development of platforms and techniques for emerging Big Data and Machine Learning applications requires the availability of real-life datasets. A possible solution is to synthesize datasets that reflect the patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z; then, these patterns are used to reconstruct a new dataset X' that preserves the main characteristics of X. This survey explores two possible approaches: (1) constraint-based generation and (2) probabilistic generative modeling. The former relies on inverse frequent itemset mining (IFM) techniques and consists of generating a dataset that satisfies given support constraints on the itemsets of an input set, which are typically the frequent ones. The latter draws on recent developments in probabilistic generative modeling (PGM), which model generation as a sampling process from a parametric distribution, typically encoded as a neural network. The two approaches are compared by providing an overview of their instantiations for the case of discrete data and discussing their pros and cons.
AB - The development of platforms and techniques for emerging Big Data and Machine Learning applications requires the availability of real-life datasets. A possible solution is to synthesize datasets that reflect the patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z; then, these patterns are used to reconstruct a new dataset X' that preserves the main characteristics of X. This survey explores two possible approaches: (1) constraint-based generation and (2) probabilistic generative modeling. The former relies on inverse frequent itemset mining (IFM) techniques and consists of generating a dataset that satisfies given support constraints on the itemsets of an input set, which are typically the frequent ones. The latter draws on recent developments in probabilistic generative modeling (PGM), which model generation as a sampling process from a parametric distribution, typically encoded as a neural network. The two approaches are compared by providing an overview of their instantiations for the case of discrete data and discussing their pros and cons.
KW - constraints-based models
KW - data generation
KW - generative adversarial networks
KW - generative models
KW - inverse frequent itemset mining
KW - synthetic dataset
KW - variational autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85122851101&partnerID=8YFLogxK
U2 - 10.1002/widm.1450
DO - 10.1002/widm.1450
M3 - Review article
SN - 1942-4787
VL - 12
JO - WIREs: Data Mining and Knowledge Discovery
JF - WIREs: Data Mining and Knowledge Discovery
IS - 2
M1 - e1450
ER -