TY - JOUR
T1 - Extending Inverse Frequent Itemsets Mining to Generate Realistic Datasets
T2 - Complexity, Accuracy and Emerging Applications
AU - Saccà, Domenico
AU - Serra, Edoardo
AU - Rullo, Antonio
N1 - Publisher Copyright: © 2019, The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature.
PY - 2019/11
Y1 - 2019/11
AB - The development of novel platforms and techniques for emerging “Big Data” applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution to ensure high quality experimental findings is to synthesize datasets that reflect patterns of real ones using a two-step approach: first a real dataset X is analyzed to derive relevant patterns Z (latent variables) and, then, such patterns are used to reconstruct a new dataset X' that is like X but not exactly the same. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (IFM), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set, that are typically the frequent ones. This paper introduces various extensions of IFM within a uniform framework with the aim to generate artificial datasets that reflect more elaborated patterns (in particular infrequency and duplicate constraints) of real ones. Furthermore, in order to further enlarge the application domain of IFM, an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
KW - Big data
KW - Classification
KW - Data mining
KW - Frequent itemset mining
KW - Inverse problems
KW - Linear programming
KW - Synthetic dataset
UR - http://www.scopus.com/inward/record.url?scp=85069523002&partnerID=8YFLogxK
UR - https://scholarworks.boisestate.edu/cs_facpubs/204
U2 - 10.1007/s10618-019-00643-1
DO - 10.1007/s10618-019-00643-1
M3 - Article
SN - 1384-5810
VL - 33
SP - 1736
EP - 1774
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 6
ER -