Boosting Data Analytics Through High-Fidelity Synthetic Data
Xiaotong Shen
xshen@umn.edu
School of Statistics, University of Minnesota
The 5th NATIONAL BIG DATA HEALTH SCIENCE CONFERENCE, Columbia
Joint with Yifei Liu and Rex Shen (Shen et al., 2023)

Generative AI and Synthetic Data
Synthetic data generation, propelled by generative AI, promotes a paradigm shift in data analytics.
Synthetic data: artificially created to closely mirror the characteristics and distribution of real data.
MIT-Gartner report (Gartner, 2022; Eastwood, 2023): 60% of data utilized in AI and analytics will be synthetically generated by 2024, and synthetic data will surpass real data in AI models by 2030.
As synthetic data gains prominence, questions arise concerning our data analytics paradigm: (1) how to utilize synthetic data; (2) its connection with raw data.
Can we benefit from synthetic data for any analytic task?
UMN Statistics 1/21

Example
Figure 1: (Gao et al., 2023) Machine learning models trained on synthetic data achieve state-of-the-art performance compared with real-data-trained models for medical imaging.

Challenges for Health Care Data
Two important aspects of healthcare data and medical research:
Compliance: storage must be compliant with regulations; role-based access control.
Efficacy.
Data sharing becomes difficult due to concerns about security and privacy.
Focus on the potential impact of generative AI: can we effectively utilize synthetic data to enhance data privacy and efficacy?

Overview
Synthetic data: produced by a generative model to replicate raw data. The model is trained on raw data via pre-trained models, with knowledge transfer from similar studies.
Benefits:
(1) privacy: avoids the privacy leakage that comes with sharing real data.
(2) scarcity: addresses limited size; expensive trials; time-consuming collection; imbalance.
Generative models: GANs (Goodfellow et al., 2014), Kar