Project Details
Description
Modern computational methods, such as Machine Learning (ML) based approaches, have produced impressive gains in efficiency and performance but are increasingly dependent on massive amounts of data. These data-driven approaches shift from the classical technique of human-engineered source code to algorithms trained on a dataset to produce the desired solution, placing the data in the driver's seat. The proliferation of these data-driven technologies is being enabled and hastened by new hardware and software systems specifically designed to support the complex computation these algorithms require and the massive volumes of data that accompany them. Despite the impressive performance gains of these new systems, however, understanding of the data component of their design has languished in favor of performance-driven advances in software and hardware. This lack of data understanding has led to a number of undesirable outcomes: unwanted bias in the data-driven solution, an inability to determine ahead of time whether a dataset is actually suited to a given problem, an inability to determine whether a dataset has been manipulated or corrupted, and an inability to produce accurate synthetic data for training and testing these software and hardware systems. This project aims to provide a robust framework for characterizing large-scale tensor-based datasets, both to improve understanding of the data itself and to enable the production of synthetic data that more accurately replicates real-world data for use in testing and validating system designs.

Specifically, this project proposes to advance knowledge in the fields of multilinear algebra, large-scale data analytics, machine learning, and artificial intelligence by incorporating a variety of tensor methods for statistical, structural, and performative data analysis to achieve more robust data characterization. A more holistic set of data characterizations will enable better assessment of data for bias and better evaluation of a dataset's suitability for a particular task. It will also allow datasets to be compared to understand their differences and assessed for corruption or manipulation. A proof of concept will be established by using the data characterization methods developed in the project to generate synthetic data with a higher degree of realism than conventional methods provide. The approach will be validated by testing the ability of the synthetic data to characterize software/hardware system performance more accurately.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
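The abstract does not specify which tensor methods the project will use. As a purely illustrative sketch, the truncated higher-order SVD (HOSVD) below is one standard multilinear-algebra tool for summarizing the structure of a tensor-based dataset; the `unfold` and `hosvd` helpers, the ranks, and the synthetic data are hypothetical choices for this example, not taken from the award.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front, flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor, ranks):
    """Truncated higher-order SVD (illustrative helper, not the project's method).

    Returns a core tensor plus one orthonormal factor matrix per mode; the
    mode-n singular-value spectra give a coarse structural characterization
    of the data tensor.
    """
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding span the mode-n subspace.
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U[:, :rank])
    core = tensor
    for mode, U in enumerate(factors):
        # Project mode `mode` of the tensor onto its factor subspace.
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode
        )
    return core, factors

# Toy 3-way array standing in for a real tensor-based dataset.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 30, 40))
core, factors = hosvd(X, ranks=(5, 5, 5))
print(core.shape, [U.shape for U in factors])  # (5, 5, 5) [(20, 5), (30, 5), (40, 5)]
```

Summaries derived from the core tensor and the per-mode spectra could then be compared across datasets, which is the general flavor of dataset comparison the abstract describes.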
- Status: Active
- Effective start/end date: 1/09/22 → 31/08/25
Funding
- National Science Foundation: $156,488.00