《UNIDIR:2024年探索人工智能与自主系统的合成数据入门指南(英文版)(32页).pdf》由会员分享,可在线阅读,更多相关《UNIDIR:2024年探索人工智能与自主系统的合成数据入门指南(英文版)(32页).pdf(32页珍藏版)》请在三个皮匠报告上搜索。
1、FULL REPORTExploring Synthetic Data for Artificial Intelligence and Autonomous SystemsA PrimerHARRY DENGEXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS2AcknowledgementsSupport from UNIDIRs core funders provides the foundation for all of the Institutes activities.This pub-
2、lication was funded by the European Union as part of UNIDIRs Security and Technology Programme,which is supported by the governments of Czech Republic,Germany,Italy,the Netherlands and Swit-zerland,and by Microsoft.The author wishes to thank Dr.Giacomo Persi Paoli and Ioana Puscas for their advice a
3、nd assistance on this paper as well as Prof.Tim Watson and Dr.Leslie Sikos for their in-valuable contributions to this research.About UNIDIRThe United Nations Institute for Disarmament Research(UNIDIR)is a voluntarily funded,autonomous institute within the United Nations.One of the few policy instit
4、utes worldwide focusing on disarmament,UNIDIR generates knowledge and promotes dialogue and action on disarmament and security.Based in Geneva,UNIDIR assists the international community to develop the practical,innovative ideas needed to find solutions to critical security problems.CitationH.Deng,Ex
5、ploring Synthetic Data for Artificial Intelligence and Autonomous Systems:A Primer,Geneva,Switzerland:UNIDIR,2023.NoteThe designations employed and the presentation of the material in this publication do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United
6、Nations concerning the legal status of any country,territory,city or area,or of its authorities,or concerning the delimitation of its frontiers or boundaries.The views expressed in the publication are the sole responsibility of the individual authors.They do not necessary reflect the views or opinio
7、ns of the United Nations,UNIDIR,its staff members or sponsors.About the Security and Technology ProgrammeContemporary developments in science and technology present new opportunities as well as chal-lenges to international security and disarmament.UNIDIRs Security and Technology Programme(SecTec)see
8、ks to build knowledge and awareness on the international security implications and risks of specific technological innovations.It convenes stakeholders to explore ideas and develop new thinking on ways to address them.AuthorHarry Deng(hwrdeng)is a Consultant for the Security and Technology Programme
9、 at UNIDIR,where his work focuses on the international security implications of emerging technologies.He holds a masters degree in global governance from the University of Waterloo,where he is currently a PhD candidate.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS3Acron
10、yms&Abbreviations3DAIAQIDODGANGGEICTIOTJSONOEWGUAVVAEDull,Dirty and DangerousArtificial IntelligenceAir Quality IndexDepartment of Defense(United States)Generative Adversarial NetworkGroup of Governmental ExpertsInformation and Communications TechnologyInternet of ThingsJavaScript Object NotationOpe
11、n-ended Working GroupUncrewed Aerial VehicleVariational AutoencoderEXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS4ContentsExecutive Summary 51.Introduction 62.Understanding Synthetic Data 82.1 What is Synthetic Data 82.2 Existing Data Challenges 10 2.2.1 Strand 1 Data Ma
12、nagement 10 2.2.2 Strand 2 Data Quality 122.3 Methods of Generating Synthetic Data 13 2.3.1 Rules-Based Methods 13 2.3.2 Agent-Based Methods 14 2.3.3 Deep-Learning Algorithms 153.Synthetic Data and International Security 193.1 Value Added by Synthetic Data 223.2 Risks 264.Conclusion 28Bibliography 2
13、9EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS5Executive SummaryAdvances in the field of artificial intelligence(AI)and machine learning in recent years have created unprecedented opportunities to augment human capabilities and to improve the functionality of various au
14、tonomous systems,including in the field of international security.Yet,there is a scarcity of the high-quality,highly diverse and relevant real-world data sets that are needed to train increasingly complex AI systems in the defence sector.As a consequence,synthetic data is gradually becoming an essen
15、tial tool in the data toolbox to develop and train AI systems.The characteristics and potential benefits of synthetic data,along with proven application of the technology in various sectors,make it a relevant topic for debates surrounding the use of AI within the context of international security.Th
16、is primer provides a brief overview of synthetic data,including its characteristics,how it is generated,the value that it adds,its risks,and its potential use cases for defence organizations and military oper-ations.In addition,the primer provides an outline of existing data challenges and limitatio
17、ns that have facilitated the emergence of synthetic data as an important tool for the development of increasingly complex AI systems.The use of synthetic data within the context of international security has so far mostly remained ex-perimental and exploratory.However,the features of synthetic data
18、could have a beneficial effect on training AI systems.In particular,synthetic data allows for the generation of highly diverse or even novel data sets,fine grain control of data attributes,automatic annotation or data labelling where necessary,and cost-effectiveness.This primer examines how the main
19、 characteristics of synthetic data could benefit militaries and defence organizations by allowing them to integrate more capable and reliable AI systems in both defensive and offensive autonomous systems.While synthetic data can be beneficial for training AI systems and could help alleviate some of
20、the data issues faced by militaries and defence organizations,it is not a silver bullet,and it comes with risks and challenges.The benefits accrued from using synthetic data will depend on the ability of organizations to navigate these risks in order for AI systems trained on synthetic data to be us
21、ed in a responsible and safe manner and in accordance with legal requirements and ethical values.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS61.Introduction1 The First Committee of the United Nations General Assembly defines“international security”as“global challenges
22、and threats to peace that affect the international community”.See United Nations General Assembly,“Disarmament and Interna-tional Security(First Committee)”,https:/www.un.org/en/ga/first/.2 A.Wilner,“AI and the Future of Deterrence:Promises and Pitfalls”,Centre for International Governance Innovatio
23、n,28 November 2022,https:/www.cigionline.org/articles/ai-and-the-future-of-deterrence-promises-and-pitfalls/;Defence In-novation Board,“AI Principles”,2019,https:/media.defense.gov/2019/Oct/31/2002204458/-1/-1/0/DIB_AI_PRINCIPLES_PRIMARY_DOCUMENT.PDF.3 J.Yan et al.,“Synthetic Dataset Generation and
24、Adaptation for Human Detection”,DEVCOM Army Research Laboratory,November 2020,https:/apps.dtic.mil/sti/pdfs/AD1115446.pdf;A.Holland,“Known Unknowns:Data Issues and Military Au-tonomous Systems”,UNIDIR,17 May 2021,https:/unidir.org/known-unknowns;Government Business Council,“Advancing ISR at the Edge
25、:A Survey on Networks and Processing Technologies in the Digital Battlespace”,July 2020,http:/ Yan et al.,“Synthetic Dataset Generation”,1;Holland,“Known Unknowns”,27.Advances in artificial intelligence(AI),along with the machine learning models that support its use,have made it ubiquitous when opti
26、miz-ing performance for increasingly complex tasks and complicated working environments.Yet the integration of AI introduces unprecedented legal,ethical,safety and security challenges and this is especially relevant in the interna-tional security context.1 Within this context,AI is being explored as
27、 a tool for decision support,operational planning and intelligence analysis.It could also be integrated into both offensive and defensive autonomous systems,such as tar-get-identification systems,swarm robotics and cyber operations.Moreover,it has been asserted that AI could perform certain tasks be
28、tter than traditional methods(e.g.,defensive cyber infra-structures or intelligence analysis),2 meaning that states could be better equipped to uphold their international legal obligations,specifically international humanitarian law,in addition to enhancing operational effectiveness.At the same time
29、,the downstream effect of the tasks envisioned for AI means that machine learning models require ever increasing diversity,volume and velocity of high-quality data,often high-quality labelled data.Without the necessary diversity,volume,and velocity of high-quality data to train complex AI systems,su
30、ch systems could see increases in failures,including unintended harms.Labelled data ex-plicitly informs the machine learning model what the data means rather than leaving the model to figure it out by itself and possibly get it wrong.However,the scarcity of high-quality real-world data along with th
31、e privacy,legal,regulatory and cost challenges associated with sensitive data make real-world data generally unsuitable for training increasingly complex AI systems,3 particularly in the international security context.Because of this scarcity,synthetic data is gradually becoming an essential tool to
32、 develop,improve and condition increasingly complex AI systems.In particular,it can provide data where there is none,can counterbalance various forms of bias in real-world data and can automatically label data where necessary,among other things.4 EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE
33、AND AUTONOMOUS SYSTEMS7However,the implications of using synthetic data in autonomous systems remain unex-plored in relevant United Nations security processes,such as the Group of Governmen-tal Experts(GGE)on emerging technologies in the area of lethal autonomous weapons systems(LAWS)or the Open-end
34、ed Working Group(OEWG)on security of and in the use of information and communications technologies(ICTs).However,the value-added and the risks associated with synthetic data are relevant to these discussions as well as other debates on the use of AI within the context of international security.For e
35、xample,some parties involved in debate in the GGE on LAWS are concerned about the possibility that increased autonomy in weapon systems could lead to increases in unintended harms due to a lack of training data to appropriately train such systems.5 Ad-ditionally,participants in the OEWG on ICTs have
36、 discussed the possibility that AI-pow-ered cyberattacks could autonomously adapt to defensive cyber measures,making them more difficult to detect and mitigate.6 AI-pow-ered cyberattacks can indeed be enabled 5 Group of Governmental Experts on Emerging Technologies in the Area of Lethal Autonomous W
37、eapons Systems,“Proposal for an International Instrument on Lethal Autonomous Weapons(LAWS)”,Submitted by Pakistan,CCW/GGE.1/2023/WP.3/Rev.1,8 March 2023,http:/undocs.org/en/CCW/GGE.1/2023/WP.3/Rev.1;“State of Palestines Proposal for the Normative and Operational Framework on Autonomous Weapons Syst
38、ems”,Submitted by Palestine,CCW/GGE.1/2023/WP.2/Rev.1,3 March 2023,http:/undocs.org/en/CCW/GGE.1/2023/WP.2/Rev.1.6 H.Alkhzaimi,“Contribution to the Fifth Substantive Session by Emerging Research and Security Center,NYU/NYUAD”,NGO Working Papers,28 July 2023,https:/docs-library.unoda.org/Open-Ended_W
39、orking_Group_on_Information_and_Communication_Technologies_-_(2021)/Stakeholder_Recommendation_for_Open-ended_working_group_on_security_APR.pdf.7 A.Anand and H.Deng,“Towards Responsible AI in Defence:A Mapping and Comparative Analysis of AI Principles Adopted by States”,UNIDIR,13 February 2023,https
40、:/unidir.org/publication/towards-responsible-ai-defence-mapping-and-com-parative-analysis-ai-principles-adopted.and augmented by the generation and use of synthetic data.It is therefore essential that the use of synthetic data in autonomous systems does not derogate any commitments to interna-tional
41、 law(e.g.,international humanitarian law)or Responsible AI that is,the broad approach to the development and use of AI to ensure that AI systems are lawful,ethical,safe,secure and responsible.7 As such,this primer aims to provide policy-makers and diplomats engaged in interna-tional security discuss
42、ions with an introduc-tory overview of synthetic data.It describes the main characteristics,value-added,risks and relevance of synthetic data within the context of international security,particularly as an enabler of autonomy.The primer further attempts to demonstrate the growing impor-tance of synt
43、hetic data as well as the evolving paradigm of data usage and governance in the international security context.It does this by il-luminating and mapping out the peculiarities of synthetic data,then re-connecting it to existing data challenges.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND
44、AUTONOMOUS SYSTEMS82.Understanding Synthetic DataHighlights Unlike real-world data,which refers to data and inputs derived from the real world,synthetic data is artificially created in the digital world.It often seeks to reproduce the characteristics and properties of an existing set of data or to p
45、roduce data based on existing knowledge.The purpose of synthetic data is to improve the quality and utility of training data sets.It is critical that the data on which autonomous systems are trained is of sufficient quality and diversity in order to avoid unintended harms,especially within the conte
46、xt of international security.Defence organizations currently face a myriad of data management challenges,thereby limiting the quality and utility of real-world data to train increasingly complex AI and autonomous systems.While synthetic data may not be a silver bullet that resolves all existing data
47、 challenges within defence organizations,it may provide a means to improve the quality and utility of training data sets.2.1 What is Synthetic Data?8 S.Kannan,“Synthetic Time Series Data Generation for Edge Analytics”,F1000 Research,20 January 2022,https:/doi.org/10.12688/f1000research.72984.1.Synth
48、etic data is data that is artificially generated in the digital world with properties that are often derived from an“original”set of data.This is in contrast to real-world data,which,as the name suggests,is data collected from real-world events and inputs.The“original”data set is typically real-worl
49、d data and informa-tion,but it can also be artificially generated data itself.While there are various methods of gener-ating a synthetic data set(see section 2.3),the objective is often to reproduce the characteris-tics and structures of the original data set,and most methods rely on extracting and
50、replicat-ing the properties of the original data(e.g.,see Figure 1).8 This means that the synthetic data and the original data should deliver very similar,EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS9if not identical,results when undergoing the same statistical analysi
51、s.9 In short,synthetic data is often information that is artificially generated to represent the original data it either seeks to replace(thereby providing an equivalent function)or complement(thus 9 R.Riemann,“Synthetic Data”,European Data Protection Supervisor,https:/edps.europa.eu/press-publicati
52、ons/publi-cations/techsonar/synthetic-data_en.10 F.Longford,“Experiments in Synthetic Data”,Forensic Architecture,6 November 2018,https:/forensic-architecture.org/investigation/experiments-in-synthetic-data.improving the value of the training data set).It is also possible to enhance a training data
53、set by generating synthetic data that does not reproduce the characteristics of the original data set,but instead exaggerates certain char-acteristics(see section 3.1).Figure 1.Real-World versus Synthetic Data10REAL-WORLD IMAGE OF A TANKSYNTHETICALLY GENERATED IMAGE OF A TANKHowever,in some circumst
54、ances,synthetic data can also be artificially generated data that does not rely on an original set of data.It is possible to generate novel data based on existing knowledge.For example,it is possible to synthesize data on how dice of different weights would behave based on existing knowledge of the
55、physics of the object.In these cases,rather than reproducing the char-acteristics of an original data set,synthetic data produces data that reflects the character-istics of the system that would hypothetically produce that data.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYST
56、EMS102.2 Existing Data Challenges 11 Government Business Council,“Advancing ISR at the Edge”.12 Ibid,4.13 Ibid.Defence organizations and militaries face chal-lenges in obtaining sufficient data of adequate quality.The data challenges are not only technical,but also organizational.11 This means that
57、defence organizations cannot simply“engineer their way out”of their shortcomings using technical solutions.Instead,data chal-lenges within defence organizations should also consider the impact of organizational culture,policies and procedures.Ultimately,any use of autonomous systems,particularly aut
58、onomous systems in combat environments or autonomous systems intended to engage human targets,hinges on the responsibility to anticipate and respond to data issues in order to avoid unintended harms.While synthetic data may not be a silver bullet that alleviates all existing data challenges,it could
59、 provide a means to improve the quality and utility of training data sets,especially in in-creasingly complicated and opaque machine learning models where data issues may not be easily revealed.This section looks in turn at two strands of data challenges faced by defence organizations that synthetic
60、 data can address:first data management,and then data quality.2.2.1 Strand 1 Data Management The first strand of issues that results in a lack of labelled data of sufficient quality is poor data management practices.The data management process includes the following stages:1.Collection 2.Processing(
61、e.g.,data labelling)3.Storage 4.Access 5.Sharing and dissemination6.Extraction and exploitation 7.Utilization 8.Disposal The scale and form of data management problems can be illustrated by the case of the United States Department of Defense(DoD),where processing,exploitation and dissem-ination of d
62、ata were noted to be particularly challenging.In a study,only 29 per cent of both active-duty and civilian personnel said that over 75 per cent of the data reaches the ap-propriate actors.12 These numbers are even bleaker for active-duty personnel,only 11 per cent of whom said that data reaches the
63、ap-propriate analysts at least 75 per cent of the time.13 Furthermore,65 per cent of active-duty personnel noted that warfighters spend more time looking for the right data than using the EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS11data.14 This indicates shortcomings
64、 in estab-lishing the appropriate processes to correctly label data,store it in the appropriate databases,and ensure appropriate access and availability.It may also indicate a struggle to balance the need to protect sensitive or classified data with the need to share that data with those who may ben
65、efit from exploiting it.In fact,siloed data,the compartmentalization of mutually exclusive departments,limited bandwidth,and limited tagging of data were noted by DoD personnel to be some of the most prevalent challenges for their organizations ability to effectively collect,disseminate and analyse
66、data.15 Regarding data labelling,only 32 per cent of defence civilians and 13 per cent of active-duty personnel said that their defence agency had the systems in place to effectively label data.16 The number of personnel required along with the necessary processes and infra-structure to monitor and
67、manage incoming data 14 Ibid,8.15 Ibid,15.16 Ibid.17 Australian Department of Defence,“Defence Data Strategy 20212023”,https:/www.defence.gov.au/sites/default/files/2021-08/Defence data strategy.pdf.18 British Ministry of Defence,“Data Strategy for Defence:Delivering the Defence Data Framework and E
68、xploiting the Power of Data”,27 September 2021,https:/www.gov.uk/government/publications/data-strategy-for-defence/data-strategy-for-de-fence.19 Canadian Department of National Defence,“The Department of National Defence and Canadian Armed Forces Data Strategy”,18 May 2021,https:/www.canada.ca/en/de
69、partment-national-defence/corporate/reports-publications/da-ta-strategy/data-strategy.html.20 P.A.Udayana et al.,“Strategy for Integrated Land Information System Network Arrangements for the Indonesian National Army”,2022,https:/doi.org/10.33172/jspd.v8i1.1054.has failed to keep pace with the ever-i
70、ncreas-ing volume of data.The paucity of data quality control either ex ante or ex poste means that analysts are drowning in data and by the time the correct set of data has been obtained it is often obsolete and unreliable.Similar data management challenges are also mentioned in Australias 2021 Def
71、ence Data Strategy,17 the United Kingdoms 2021 Data Strategy for Defence18 and Canadas 2021 De-partment of National Defence Data Strategy,19 as well as in research on information networks conducted by the Indonesian Defence Uni-versity.20 Common issues found across these documents include,for exampl
72、e,challenges with data visibility,siloed data,lack of common data standards within and across organiza-tions,and cultural issues of not considering data requirements in the initial phase of capa-bilities development.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS122.2.2 S
73、trand 2 Data Quality21 Holland,“Known Unknowns”,3.22 Interview with Prof.Tim Watson,11 April 2023.23 Holland,“Known Unknowns”,6.24 Interview with Prof.Tim Watson,11 April 2023.Poor data management contribute to the second strand of issues poor data quality.Common data quality issues include incom-pl
74、ete data,unlabelled data,poisoned or spoofed data,inaccurate data,data bias,and discrep-ant data.Poor data quality can be the result of external factors,such as harsh conditions(e.g.,dust,smoke,vibrations,contaminants,camouflage,wear and tear of sensors,etc.)and adversarial actions(e.g.,signal jammi
75、ng,data poisoning,attacks on sensors,unanticipated tactics,etc.).However,proper data manage-ment practices can help filter out data that has been corrupted in order to avoid creating dis-tortions or biases in the training data set and to ensure that the right data reaches the appropri-ate entities.D
76、ifferent autonomous systems will face different types of poor data quality issue.For example,an autonomous system used in a defensive cyber operation is less likely to face issues arising from harsh conditions(e.g.,dust,smoke,contaminants,etc.),but is likely to face adversarial actions such as spoof
77、ing or data poisoning.In contrast,an uncrewed vehicle placed in an“uncontrolled”multivariate combat environment may face both harsh con-ditions and adversarial actions.If autonomous systems rely on the data they are trained on in order to navigate,respond to and manipulate their environment,it is cr
78、itical that the data is of sufficient quality and diver-sity.21 It is,however,important to note that not all AI systems rely on being trained by data;reinforcement learning models which use a reward function to learn the consequences of actions taken can also be used.22 Yet,a certain amount of poor
79、quality data should be expected in any large real-world data set,especially in the international security context where adversarial environments,whether in the digital space or the physical space,pose a wide range of challenges to the collection of complete high-quality data.23 As such,it has been s
80、uggested that synthetic data could play an important role in alleviating some of the pressures of collecting quality real-world data,for example by filling in where data is missing due to sensor failures.24EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS132.3 Methods of Ge
81、nerating Synthetic Data25 In programming,an array is a data structure that consists of a collection of values or variables(e.g.,numbers,words,objects,etc.)formatted and sorted according to their type.The purpose of an array is to store multiple pieces of data of the same type together.Synthetic data
82、 can be generated by leveraging various techniques,such as decision trees or deep-learning algorithms.As a proxy,synthetic data can be classified according to the type of the original data:Real-world data Information or knowledge of the developer Combination of real-world data and the in-formation o
83、f the developer As noted,synthetic data generation relies on extracting and replicating the properties of an original data set.The method with which the properties of the original data set are extracted and replicated depend on the type and structure of the original data.There are three broad method
84、s of synthetic data generation:Rules-based methods,which have pre-defined data structures Agent-based models,which simulate envi-ronments the data may be needed for Deep-learning algorithms,which uses neural network-based methods The following subsections provide a brief ex-planation of each broad m
85、ethod.2.3.1 Rules-Based MethodsRules-based methods represent metadata and human-or computer-readable data that consists of predefined data structures(e.g.,arrays25)that provide ordered lists and objects that follows specific rules defined by humans.The complexity of these rules can vary from simple
86、rules that only take into account specified data types in a column to more sophis-ticated rules that define relationships between multiple parameters and variables.Common data formats include comma-separated values(CSV),JavaScript Object Notation(JSON)and document type definition(DTD).Rules-based me
87、thods are modular and cost-effective,and they can support different statistical distribu-tions,which may be particularly pertinent for training AI systems used in cyber operations where standardized protocols necessitate pre-defined data structures.Rules-based methods of synthetic data gen-eration h
88、ave already been applied in other contexts,outside international security.For example,Kannan(2021)used a JSON structure to generate a synthetic data set derived from an Air Quality Index(AQI)data set.In this particular study,Kannan was able to produce a synthetic AQI data set that outperformed the o
89、riginal AQI EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS14data set in training a machine learning model to predict the four AQI parameters specified in the data sets.26 Kannan concludes that the improved performance of the synthetic data set could be the result of the
90、synthetic data set“filling in”the incomplete data contained in the original data set.27 There are,however,limitations to using a rules-based method to generate synthetic data sets.Most notable are challenges with scalabil-ity,drift and bias.28 Regarding scalability,the more complex a synthetic data
91、set is(e.g.,if the synthetic data set requires thousands of in-terdependent and intertwined rules),the more 26 Kannan,“Synthetic Time Series Data Generation”,7.27 Kannan notes that the incomplete data in the original AQI data set is due to sensor failure at one of the stations that caused partial re
92、cording of the data.See Ibid.28 M.Pasieka,“A Comparison of Synthetic Data Generation Methods and Synthetic Data Types”,Mostly AI,1 September 2022,https:/mostly.ai/blog/comparison-of-synthetic-data-types/.29 K.Chandler,“Does Military AI Have Gender?Understanding Bias and Promoting Ethical Approaches
93、in Military Applica-tions of AI”,UNIDIR,7 December 2021,https:/doi.org/10.37559/GEN/2021/04.30 E.Bonabeau,“Agent-Based Modeling:Methods and Techniques for Simulating Human Systems”,PNAS,14 May 2002,https:/doi.org/10.1073/pnas.082080899.31 Iplicated and abstruse it is to generate.This thereby limits
94、the practicality of using a rules-based method for more complex or intricate re-lational networks.Relatedly,data drift that is,the shift in data distribution over time may limit the practicality of rules-based methods,particu-larly if there is no well-established change-man-agement system to govern
95、how the rules are changed to fit their application.Lastly,since the rules are defined by humans,the bias of the developer is reflected in the generated data,whether it is conscious(e.g.,business logic)or unconscious(e.g.,gender bias29).2.3.2 Agent-Based Models Agent-based modelling is a proven simul
96、ation technique that has seen various real-world ap-plications,from business problems to public policy evaluations.Agent-based modelling is essentially a system that describes agents and the relationships between them in order to derive an outcome.Based on their interac-tions,behavioural patterns an
97、d the inputted pa-rameters,the agents are capable of evolving.This allows for unanticipated behaviours to emerge,30 and makes agent-based modelling especially applicable for capturing emergent phenomena.Even simple agent-based models can exhibit complex patterns and can provide valuable information
98、about real-world dynamics.31 Deep-learning can also be incor-porated into an agent-based model to achieve more dynamic interactions between agents and more realistic,complex and adaptable outcomes.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS15Example32An autonomous sys
99、tem that could identify people who are escaping a flood by standing on their roofs would be useful for humanitarian aid and disaster recovery(HADR)operations.However,since such a situation may occur only infrequently in real life,there is very little data that can be used to train an autonomous syst
100、em to identify this scenario.An agent-based model may simulate such a scenario.By generating high-fidelity and highly diverse synthetic data to augment the training data set with rare data points,the synthetic data generated from an agent-based model could potentially be beneficial in training auton
101、omous systems for such scenarios.2.3.3 Deep-Learning Algorithms 32 Yan et al.,“Synthetic Dataset Generation”,3.33 Pasieka,“A Comparison of Synthetic Data Generation Methods”.Deep-learning algorithms are a class of methods based on representation learning,which refers to machine learning techniques t
102、hat automatically learn features and statistical distributions of training data and can generate new data based on those learned features and statistical distributions.Unlike rules-based methods or agent-based models of synthetic data generation,human guidance and super-vision can be minimal,or even
103、 non-existent,depending on the deep-learning model used.Additionally,deep-learning models are not limited by the complexity of the data that can be learned by such models and,in theory,the ap-plication of deep-learning to generate synthetic data sets is“limitless”.33 These algorithms can manage more
104、 intricate and complicated data distributions than rules-based methods and can synthesize unstructured data,such as visual data.There are three main families of deep-learning techniques:generative adversarial networks(GANs),variational autoencoders(VAEs)and diffusion models.EXPLORING SYNTHETIC DATA
105、FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS16Figure 2.Generative Adversarial Network3434 T.Silva,“A Short Introduction to Generative Adversarial Networks”,Thalles Blog,7 June 2017,https:/sthalles.github.io/intro-to-gans/.35 Riemann,“Synthetic Data”.36 A neural network,also known as artificial
106、 neural network,is a network of interconnected layers of nodes that transmit infor-mation from one layer to another and each layer performs a different function on its inputs.Neural networks rely on training data to learn,and their performance improves over time.See IBM,“What are Neural Networks?”,h
107、ttps:/ J.Hradec et al.,“Multipurpose Synthetic Population or Policy Application”,European Commission Joint Research Centre,13 April 2022,14,https:/doi.org/10.2760/50072.Generative adversarial networks are commonly used for image recognition and image genera-tion.35 They are usually comprised of two
108、neural networks,36 one a generator network and the other a discriminator network,that train each other on an iterative basis(see Figure 2).The generator network would produce a synthetic data point(e.g.,an image)as an input with the same characteristics as the training data.The discriminator network
109、,containing batches of both training data and synthetic data,would then try to classify the observations as real or generated.The generator network improves its performance over-time based on the feedback it receives from the discriminator network.The two networks then converge when the discrim-inat
110、or is no longer able to differentiate between the“real”data and the synthetically generated data.37 EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS17Figure 3.Variational Autoencoder3838 J.Jordan,“Variational Autoencoders”,Jeremy Jordan,19 March 2018,https:/www.jeremyjorda
111、n.me/variational-auto-encoders/.39 D.P.Kingma and M.Welling,“An Introduction to Variational Autoencoders”,Foundations and Trends in Machine Learning,2019,https:/arxiv.org/pdf/1906.02691.pdf.Variational autoencoders are a type of like-lihood-based generative model.VAEs are comprised of an encoder and
112、 a decoder(see Figure 3).The encoder ingests data and sim-plifies it(known as the“latent representation”)to represent the key features of the data.The decoder takes in the latent representation and returns a reconstruction of it.Like GANs,VAEs function on an iterative basis.At each iteration,the VAE
113、 ingests data which is then compared with the encoderdecoder output.The essential function for a VAE,then,is to learn the optimal encodingdecoding scheme to iteratively optimize the process.As such,more complex VAE architectures can support higher dimensionality reduction(i.e.,learning the key featu
114、res)while keeping reconstruction errors low.39 EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS18Figure 4.Diffusion Model40NOISINGDENOISING40 A.Vahdat and K.Kreis,“Improving Diffusion Models as an Alternative to GANs,Part 1”,NVIDIA,26 April 2022,https:/ The term“denoising”
115、refers to the process of removing imperfections and defects from audio-visual data in order to restore the actual features and characteristics.See L.Fan et al.,“Brief Review of Image Denoising Techniques”,Visual Computing for Industry,Biomedicine,and Art,vol.2(2019),https:/doi.org/10.1186/s42492-019
116、-0016-7.42 N.Carlini et al.,“Extracting Training Data from Diffusion Models”,arXiv:2301.13188,30 January 2023,1,https:/arxiv.org/abs/2301.13188.43 Ibid.Diffusion models are an emerging class of deep-learning models that produce data,such as images,from a training distribution via an iterative denois
117、ing41 process.In other words,diffusion models work by corrupting an image(e.g.,by adding noise),which the model then learns how to remove(or denoise)in order to generate a coherent image(see Figure 4).It can then generate variations of that image by introducing different noises to the otherwise cohe
118、rent image.While GAN was a break-through technology that enabled the produc-tion of high-fidelity images at scale,diffusion models have largely displaced GANs in recent years.42 It has been posited that,due to its ability to synthesize novel high-fidelity images that are ostensibly unlike its traini
119、ng data as well as its ease of use,diffusion models are the de facto method for generating large-scale images.43 Popular diffusion models include DALL-E and Stable Diffusion.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS193.Synthetic Data and International SecurityHighli
120、ghts Militaries and defence organizations could benefit from continued advances in AI and autonomous systems.Ensuring that AI and autonomous systems are properly trained prior to deployment and use is of critical importance in the international security context.Benefits of synthetic data include hig
121、hly diverse data sets,shortened training cycles,fine grain control and flexibility,ability to generate hypothetical data,and identifying and addressing skewed data sets,among others.Synthetic data may also eliminate legal challenges related to collecting,storing,disseminating and disposing of sensit
122、ive data,thus potentially allowing for more sharing of sensitive data among allies.The use of synthetic data comes with its own set of risks.These include difficulties in fully replicat-ing the complex physics of the real world,data poisoning,unintended biases,and lower levels of privacy associated
123、with some synthetic data-generation techniques.While these risks may also be applicable to real-world data sets,synthetic data may expand the potential for some of these risks.Thus,it is critical to establish processes to ensure the reliability and quality of synthetic data sets.44 P.Scharre,“Roboti
124、cs on the Battlefield Part II:The Coming Swarm”,Center for a New American Security,15 October 2014,https:/as.org/publications/reports/robotics-on-the-battlefield-part-ii-the-coming-swarm.Continued advances in AI and machine learning are generating great expectations that the functionality and reliab
125、ility of autonomous systems can be enhanced.Indeed,it has been posited that militaries and defence organiza-tions alike could achieve greater capabilities and efficiencies by conferring more autonomy on autonomous systems.44 At a minimum,fielded autonomous systems,regardless of the level of autonomy
126、,should be reliable,predict-able and safe,and should be able to operate in compliance with international humanitarian law.Ensuring that autonomous systems are properly trained prior to deployment and use is of critical importance in the international security context,where decisions,inferences and a
127、ctions are sometimes made“in-theatre”EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS20by the autonomous system in order to reduce the latency between analysis and any actions taken.In other words,the autonomous system that collects the data is the same system that perform
128、s the analysis and delivers an output a process called“edge analytics”.45 The ability for autonomous systems to be placed at the edge has become an increasingly important component of technical solutions for various military applications.Yet,the development of autonomous systems intended to be place
129、d at the edge(e.g.,uncrewed vehicles used in military operations)is unlike that of autono-mous systems in other contexts(e.g.,those used in cyber operations)due to hardware lim-itations.46 To be sure,the issue is not necessar-ily a lack of data,but rather a lack of high-quality labelled data as a re
130、sult of the lack of data-col-lection hardware.There is also the issue of a lack of diversity of data collected by autono-mous systems,which are fielded for specific operational functions,not data collection.For example,an uncrewed aerial vehicle(UAV)that operates at high altitudes will only be able
131、to collect images from high angles,thereby gener-ating a data set that may be largely irrelevant for another UAV operating at lower altitudes and at lower angles of vision.45 M.Hagstrm,“Military Applications of Machine Learning and Autonomous Systems”,The Impact of Artificial Intelligence on Strateg
132、ic Stability and Nuclear Risk,SIPRI,May 2019,https:/www.sipri.org/sites/default/files/2019-05/sipri1905-ai-strate-gic-stability-nuclear-risk.pdf.46 Kannan,“Synthetic Time Series Data Generation”,3.47 V.Boulanin,“Artificial Intelligence:A Primer”,The Impact of Artificial Intelligence on Strategic Sta
133、bility and Nuclear Risk,SIPRI,May 2019,https:/www.sipri.org/sites/default/files/2019-05/sipri1905-ai-strategic-stability-nuclear-risk.pdf.48 Defense Science Board,“Task Force Report:The Role of Autonomy in DoD Systems”,US Department of Defense,July 2012,20,https:/irp.fas.org/agency/dod/dsb/autonomy.
134、pdf.49 Yan et al.,“Synthetic Dataset Generation”,1.50 J.Kwik and T.Van Engers,“Algorithmic Fog of War:When Lack of Transparency Violates the Law of Armed Conflict”,Journal of Future Robot Life,2021,7,https:/doi.org/10.3233/FRL-200019.51 R.Crootof,“The Killer Robots are Here:Legal and Policy Implicat
135、ions”,Cardozo Law Review,2015,1869,https:/ has been postulated that autonomous systems possess tremendous value for military operations.The advantages range from executing tasks quicker than any human or hu-man-operated systems for time-critical mission(e.g.,air-defence or defensive cyber opera-tion
136、s)to carrying out so-called 3D(dull,dirty and dangerous)missions where human per-formance is prone to deterioration over time.47 However,data issues for autonomous systems continue to plague defence organizations.The task of designing autonomous systems for either on-board or off-board data processi
137、ng represents a trade-off,as diverse stakeholders each have unique requirements.48 Defence or-ganizations are grappling with this trade-off and face challenges in obtaining real-world data along with the associated annotations(i.e.,data labels)that can be used to train algorithms for on-board data p
138、rocessing.49 The current data management architecture only permits autono-mous systems to operate in controlled environ-ments and with limited degrees of autonomy.50 For example,the Israeli Guardium uncrewed ground vehicle is only used autonomously at the IsraelGaza border,a location that is well-ma
139、pped and relatively static.51 The use of synthetic data for training autonomous systems EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS21may therefore represent a means to alleviate the data challenges associated with the current data-collection and data-processing archit
140、ec-ture.It may thereby present defence organiza-tions with the opportunity to further exploit the use of autonomous systems by placing them in highly dynamic and multivariate environments while reducing the associated risks.In the cyber realm,however,the incorpora-tion of AI could be an essential el
141、ement of the 52 Interview with Prof.Tim Watson,11 April 2023.53 Wilner,“AI and the Future of Deterrence”.54 Interview with Prof.Tim Watson,11 April 2023.55 Hagstrm,“Military Applications of Machine Learning”,37.ability to conduct defensive cyber operations at scale and to identify threats before the
142、y arise.52 Put another way,AI could make the defence of cyber infrastructures more reliable,especially against AI-enabled offensive cyber opera-tions.53 AI could be critical in dealing with chal-lenges arising from both the increasing scale and the increasing sophistication of the cyber realm.SCALES
143、OPHISTICATIONAs societies and urban environments grow more digitally in-terconnected and heterogenous,it creates more pressure points and vulnerabilities for defensive cyber operations to supervise.Vulnerabilities in digital systems are not only the result of increased sophistication in the vector o
144、f attacks,but are also created by the size of the attack surface.In other words,while the types of attack may not necessarily be changing,the scale of the risks are.As such,AI could act as a force multiplier to provide enough“eyeballs”on enough segments within a digital space in order to be effectiv
145、e.54Offensive cyber operations augmented and amplified by AI(e.g.,synthetic images,adversarial data manipulation and other deceptive techniques)could pose threats to the regular operations of a government,private enterprise or individual.Using AI-enhanced systems and practices defensively may then b
146、e necessary to detect and respond to anomalies by using fine grain control.As such,unlike the issue of scale,leveraging AI against AI-enhanced attacks is not just a matter of correcting social or organizational deficits,but of amelio-rating technological shortcomings.Evidently,there is a wide range
147、of actual and potential use cases for AI within the interna-tional security context.Yet,cultural and social overhangs as well as technological barriers associated with the deployment of AI technol-ogies continue to raise concerns regarding the safety,predictability and reliability of AI,espe-cially
148、in the context of international security.These concerns are creating a gap between“experimental tools and fielded systems”.55 Synthetic data is one proposed solution that could contribute to ameliorating the trust deficit associated with the integration of AI technol-ogies in high-risk situations by
149、 enhancing the quality and usability of training data.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS223.1 Value Added by Synthetic Data 56 Riemann,“Synthetic Data”.57 Yan et al.,“Synthetic Dataset Generation”,1.The value added by synthetic data depends on where,how and f
150、or which AI systems it is being applied.In general,synthetic data allows for the generation of highly diverse data sets,fine grain control of data attributes,automatic anno-tation or data labelling,and cost-effectiveness.The aim is that the application of synthetic data would have a beneficial effec
151、t on training AI systems.The extent to which synthetic data is an appro-priate proxy for the original data is a measure of the utility of the method used to generate the synthetic data as well as the machine learning model and AI system using the synthetic data.56 In some circumstances,machine learn
152、ing algo-rithms may even be trained on synthetic data sets that have no real-world equivalent,partic-ularly in cases where real-world data cannot be properly collected(e.g.,objects placed in uncommon or rare environments).In these circumstances,the use of synthetic data may be essential.This is espe
153、cially relevant in the military context,where autonomous systems placed in complex combat environments are designed and built for operational efficiency and efficacy,rather than high detail data col-lection.As such,real-world data collected by a UAV,for example,may not be able to capture all possibl
154、e combinations of relevant attributes with high levels of detail,such as images of the relevant object in different environments,captured at different distances,viewing angles and orientations,and under different illumina-tions.57 The ability to synthetically generate highly varied scenarios with al
155、l possible com-binations of relevant attributes as well as the ability to properly identify rare occurrences will thus be essential for any autonomous system to be safe,predictable and reliable,especially in uncontrolled environments.Using synthetic data to train autonomous systems may also shorten
156、training cycles.Because AI systems are dependent on the experiences held within the data,rather than the data in and of itself,training AI systems on real-world data can be impractical.Col-lecting a sufficient amount of real-world data and ensuring that there is ample diversity within that data set
157、is a resource-intensive and time-consuming process.Even then,it is difficult to ensure that all possible variations and diversity in the training data set have been exhausted.In addition,real-world data may not provide the fine grain control and flexibility that synthetic data provides in training a
158、n AI system to fit different requirements.On the other hand,sometimes the characteristics of a real-world data set can be muddled or the data set may simply be too cumbersome to be used effec-tively.In some cases,mere seconds could be worth gigabytes of data(e.g.,packet capture).As such,simple synth
159、etic data sets that retain the characteristics and statistical distributions of the underlying original data set could be suf-ficient in certain cases.Moreover,in instances where the collection of parsed and properly index data may not be an EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND A
160、UTONOMOUS SYSTEMS23issue(e.g.,for defensive cyber operations),synthetic data may be used to produce and learn hypothetical situations.For example,de-velopers may leverage agent-based modelling a technique to simulate interactions between multiple variables(e.g.,people,internet of things(IoT)systems,
161、time,etc.)to create synthetic data sets that reflect people working on certain IoT or enterprise systems for a specified amount of time.58 The value-added here is that,even though organizations working on an IoT system may be able to capture a vast amount of complete data,organizations may not posse
162、ss the fine grain control of the data they collect to detect or predict all anomalies or distinguish anomalies from regular patterns.By using agent-based modelling to generate a synthetic reality,organizations may be able to train IoT systems to simulate,identify and classify malicious and non-malic
163、ious activity at various levels of sophistication.The application of such techniques is not exclusive to the international security context nor are they simply theoretical.Indeed,tech-niques such as agent-based modelling have been applied in other contexts.For example,agent-based modelling has been
164、used to 58 Interview with Prof.Tim Watson,11 April 2023.59 M.Prdhumeau and E.Manley,“A Synthetic Population for Agent-Based Modelling in Canada”,Scientific Data,vol.10,21 March 2023,https:/doi.org/10.1038/s41597-023-02030-4.60 A.Smith et al.,“An Open-Source Model for Projecting Small Area Demographi
165、c and Land-Use Change”,Geographical Analysis,7 February 2022,https:/doi.org/10.1111/gean.12320.61 W.Wheaton et al.“Synthesized Population Database:A US Geospatial Database for Agent-Based Models”,Methods Report,RTI Press,May 2009,https:/doi.org/10.3768/rtipress.2009.mr.0010.0905.62 S.Hrl and M.Bala,
166、“Synthetic Population and Travel Demand for Paris and le-de-France Based on Open and Publicly Available Data”,Transportation Research Part C:Emerging Technologies,vol.130,December 2021,https:/doi.org/10.1016/j.trc.2021.103291.63 L.Perez et al.,“A Geospatial Agent-Based Model of the Spatial Urban Dyn
167、amics of Immigrant Populations:A Study of the Island of Montreal,Canada”,PLOS ONE,vol.14,July 2019,https:/doi.org/10.1371/journal.pone.0219188.simulate and anticipate the impact of public policies(e.g.,urban planning and public trans-portation),for policy evaluation and for simu-lating disease outbr
168、eaks and interventions.59 In fact,open-source synthetic populations that reflect the characteristics of a local population have already been developed for the United Kingdom60 and the United States61 as well as for smaller geographic regions(e.g.,the le-de-France region,France,62 and the Island of M
169、ontreal,Canada63).It can,therefore,be implied that agent-based modelling can help militaries to prepare for unexpected situations or to plan operations.By generating synthetic data via agent-based modelling simulations,militaries can prepare for a range of potential situations and develop strategies
170、 to address them.This may help to improve the readiness and effectiveness of military operations,making them better prepared for unforeseen events and creating data points for rare events or uncommon envi-ronments.The fine grain control of synthetic data sets grants developers the ability to make mi
171、nor ad-justments to the traits and characteristics of the synthetic data set and to test the performance EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS24and limitations of machine learning algo-rithms.64 Indeed,it is possible to create several synthetic data sets with th
172、e same underlying data in order to serve different functions.65 The digital world can also test how variations of a synthetic data set derived from the same un-derlying data influence how an AI system ulti-mately responds to its environment.This can also be particularly useful for identifying and ad
173、dressing skewed data sets,where one trait or a class of traits within a data set is overrepre-sented(i.e.,data or algorithmic bias).In a data set where one class of traits is swamped by a larger class,techniques such as the Synthetic Minority Oversampling Technique(SMOTE)66 can be used to balance th
174、eir frequencies.67 Conditional GANs(CGANs)can also reduce skew in a data set by using adversarial training to address the ability of the discriminator network to predict underrepresented classes more accurately to eliminate class-wide bias.68 These“benchmarking”characteristics imply applicability in
175、 the military context.For example,in an experiment conducted by the US Army Research Laboratory,the research-ers found that the performance of computer vision systems(e.g.,those used in uncrewed vehicles)is correlated to the angle of the 64 Interview with Dr.Leslie Sikos,25 April 2023.65 A.Alfons et
176、 al.,“Synthetic Data Generation of SILC Data”,European Commission,2011,6,https:/www.uni-trier.de/fileadmin/fb4/projekte/SurveyStatisticsNet/Ameli_Delivrables/AMELI-WP6-D6.2-240611.pdf.66 SMOTE is a rules-based method of synthetic data generation.67 Focus Group on Artificial Intelligence for Health,“
177、Data and Artificial Intelligence Assessment Methods(DAISAM)Reference”,International Telecommunication Union and World Health Organization,May 2020,13.68 Ibid,12.69 Yan et al.,“Synthetic Dataset Generation”.70 A.Tucker et al.,“Generating High-Fidelity Synthetic Patient Data for Assessing Machine Lear
178、ning Healthcare Software”,NPJ Digital Medicine,vol.3,9 November 2020,https:/doi.org/10.1038/s41746-020-00353-9.71 Interview with Dr.Leslie Sikos,25 April 2023.images on which the system was trained.69 The researchers noted that the classifier model demonstrated bias against images collected directly
179、 above a subject(e.g.,human,building,tank etc.),and the performance improved as the camera moved further away decreasing the angle of vision.The researchers concluded that one possible reason is that,because the clas-sifier model was trained on ground imagery,the performance would improve as the exp
180、eriment inputs looked more like ground imagery.The system thus needs to be retrained using more aerial imagery with higher angles.As such,the researchers noted that synthetic data can be used to compare the different classifier models with different model complexities and architec-ture,which would t
181、hen allow the optimal classi-fier to be chosen for a specific task.Lastly,it has been argued that the creation of synthetic data that represents real-world data may also eliminate legal challenges associ-ated with collecting,storing,disseminating and disposing of sensitive data.70 Currently,organi-z
182、ations may be unwilling to share data related to their digital infrastructure if sensitive details of their environment(e.g.,IP address,network types,etc.)are exposed,which could pose a risk to the safety of their enterprise digital infrastruc-ture.71 This may be even more pertinent in the EXPLORING
183、 SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS25international security context,where sensitive real-world data is not easily shared even among allies.72 For example,the Australian Department of Defence has noted the challenge of not being aligned with the data standards of the ot
184、her members of the“Five Eyes”intelligence-shar-ing partnership.73 Privacy preservation also suggests the ability of synthetic data to act as a hedge against changes in data privacy regula-tions that could heighten risks by disrupting or-ganizational and inter-organizational routines of sharing sensi
185、tive data.72 W.hman,“Data Augmentation Using Military Sim-ulators in Deep Learning Object Detection Applications”,KTH Royal Institute of Technology,10 September 2019,2,https:/www.diva-portal.org/smash/get/diva2:1375838/FULLTEXT01.pdf.73 Australian Department of Defence,“Defence Data Strategy”,20.EXP
186、LORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS263.2 Risks 74 hman,“Data Augmentation”,6.75 Overfitting is a term used in machine learning that denotes that the algorithm or network does not generalize well enough to the unseen test data,although it performs well on training
187、 data.See,Ibid,5.76 Interview with Dr.Leslie Sikos,25 April 2023.While synthetic data can help alleviate some of the data challenges faced by defence orga-nizations,it is not a silver bullet.Synthetic data comes with its own set of risks and challenges.The ability to manage those risks and chal-leng
188、es is particularly important in order for AI systems to be used in a responsible and safe manner and in accordance with legal require-ments and ethical values.One of the most prominent risks with using synthetic data is called the“reality gap”.This refers to the subtle differences between the synthe
189、tic data and the real world.Sophisticated machine learning models often learn to exploit small discrepancies,making simulated environ-ments difficult to learn from.74 In other words,if synthetic data is not simulated properly,it can run into issues of not being able to fully replicate the complex an
190、d chaotic physics of the real world and may fail to properly capture the un-expected shifts or one-off cases that emerge in real-world data.While synthetic data can be used to benchmark data quality,data bias and algorithmic bias,synthetic data itself can also create(or even amplify)unintended biase
191、s.Intended biases can be useful in certain applications,for example,overrepresenting specific classes of rare malicious network traffic patterns so that AI systems used for either surveillance or incidence response have a higher chance of detecting those malicious patterns.Nonethe-less,it is critica
192、l that these intended biases do not have unintended consequences.In virtually all AI systems,there is an optimal number of synthetic data points,which is dependent on the composition of both the synthetic data and real-world data on which the AI systems is trained.Too much synthetic data could“overf
193、it”75 the AI system,thereby degrading the performance of the system.Therefore,ensuring that the specified scope is correct is critical to avoiding unintended harms or other unintended consequences.Not only could poor scoping lead to unintended consequences once an autonomous system is fielded,but it
194、 could also lead to low-quality data,sampling errors,gender or racial bias,labelling or aggre-gation bias,or incomplete synthetic data sets when it is being generated.76 The issue of data bias and algorithmic bias is also a social and cultural challenge in addition to a technical challenge.For examp
195、le,if a synthetic data set is generated based on the traits and characteristic of an original real-world data set that contains certain assumptions of gender or racial norms,that synthetic data set could further amplify those biases.Even if gender or race is not“explicit in the machine learning mode
196、l,patterns drawn from neutral characteristics,such as uniforms or evidence of weapons,could still implicitly incorporate EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS27gender or racial norms”.77 As such,gender-and racial-based approaches underline the im-portance of div
197、ersifying the people and range of expertise involved in each step of the AI system,including data generation.78Moreover,synthetic data can still be prone to data poisoning by sophisticated malicious actors.It is possible for adversaries to bury unwanted changes in the synthetic data or data set in o
198、rder to disrupt the learning procedure,such as by injecting a small fraction of malicious samples into a training data set or making small adjustments to a synthetically generated image.79 However,it is important to note that,while data poisoning is a risk with synthetic data,it is less prone to poi
199、soning than real-world data,which is often created in distant or uncontrolled environments.Lastly,while some synthetic data-genera-tion techniques are privacy-preserving,other techniques may not provide adequate levels of privacy protection.Specifically,diffusion models are the least private form of
200、 image gen-eration when compared to other techniques,77 Chandler,“Does Military AI Have Gender?”,17.78 Chandler,“Does Military AI Have Gender?”,9.79 Interview with Prof.Tim Watson,11 April 2023.80 Carlini et al.,“Extracting Training Data”,1.81 Ibid.82 Interview with Prof.Tim Watson,11 April 2023.suc
201、h as GANs.This is directly related to the utility of diffusion models to generate higher quality images compared to GANs and VAEs.80 In other words,in some contexts synthetic data may present a privacyutility trade-off,as in-creasingly powerful generative models raise questions about how diffusion m
202、odels work and how,and under what circumstances they should be responsibly deployed.81While these risks may also be applicable to re-al-world data sets,synthetic data may expand the potential surface area for most of these risks.Synthetic data itself does not provide new discrete risks,but these ris
203、ks could be more pervasive.Simply put,the types of risk may be similar,but the risk vectors are shifting and the scale of risk is increasing.However,it has been posited that the use of synthetic data may raise more questions than real-world data because people generally trust it less,which may provi
204、de more opportunities to establish processes to verify synthetic data more so than real-world data.82EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS284.Conclusion 83“Emerging technologies”refers to technologies that create new opportunities to address global challenges wh
205、ile also creating new regulatory challenges.See Organisation for Economic Co-operation and Development(OECD),“OECD Science,Technology,and Industry Outlook 2012”,13 September 2012,222,https:/doi.org/10.1787/sti_outlook-2012-en.Synthetic data has proven to be a useful tech-nology in a variety of secto
206、rs,from healthcare to fraud detection and public policy planning.While synthetic data can still be considered an“emerging technology”,83 it has reached an adequate degree of maturity and there is a suf-ficient amount of expertise for adoption across industries and public services,including those rel
207、ated to international security.Indeed,the value-added and risks associ-ated with synthetic data are relevant to United Nations multilateral processes and other dis-cussions surrounding the use of AI within the context of international security.The potential benefits offered by synthetic data,particu
208、larly the fine grain control,data diversity and cost-ef-fectiveness,should not be ignored.Synthetic data could present a solution to ameliorate some of the data challenges that continue to plague defence organizations,such as poor data quality and low-diversity data sets.By ameliorating some of thes
209、e challenges,mili-taries and defence organizations alike could improve operational capabilities while ensuring compliance with their international humanitar-ian law obligations,particularly during 3D op-erations where human performance is prone to deteriorate over time.At the same time,the risks ass
210、ociated with synthetic data should not be understated.While synthetic data does not necessarily create new risks that are distinct from the risks associated with real-world data,synthetic data may expand the risk surface.In other words,the risks may be similar,but there may be more ways to generate
211、those risks.For example,a lack of diversity in a real-world training data set may create unin-tended biases in the same way as overfitting an AI system with one synthetic data point may create unintended biases.While the features of synthetic data make it a promising technology for the development o
212、f autonomous systems in international security,it should not be viewed as a silver bullet or a cure-all to existing data challenges.It should,instead,be understood as a tool in the data-gov-ernance toolbox.There is ample research that demonstrates the utility of synthetic data,and generation models
213、have advanced significantly in the past few years.As such,next steps should include,but not be limited to,identi-fication of specific use cases;more targeted research on how to apply existing methods and knowledge of synthetic data in the field of inter-national security while considering the gender
214、 and racial aspects of data;and how synthetic data can be integrated into existing data gover-nance strategies.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS29Bibliography Alfons Andreas,Peter Filzmoser,Beat Hullinger,Jan-Philipp Kolb,Stefan Kraft,Ralf Mnnich,and Matthia
215、s Templ.“Synthetic Data Generation of SILC Data”,European Commission,2011,6,https:/www.uni-trier.de/fileadmin/fb4/projekte/SurveySta-tisticsNet/Ameli_Delivrables/AMELI-WP6-D6.2-240611.pdf.Alkhzaimi,Hoda.“Contribution to the Fifth Substantive Session by Emerging Research and Security Center,NYU/NYUAD
216、”,NGO Working Papers,28 July 2023,https:/docs-library.unoda.org/Open-Ended_Working_Group_on_Information_and_Communication_Technologies_-_(2021)/Stakeholder_Recommendation_for_Open-ended_working_group_on_security_APR.pdfAnand,Alisha and Harry Deng,“Towards Responsible AI in Defence:A Mapping and Comp
217、arative Analysis of AI Principles Adopted by States”,UNIDIR,13 February 2023,https:/unidir.org/publication/towards-responsible-ai-defence-map-ping-and-comparative-analysis-ai-principles-adoptedAryawan Udayana.Putu,Tri Legionosukumo,and Sri Sundari.,“Strategy for Integrated Land Information System Ne
218、twork Ar-rangements for the Indonesian National Army”,2022,https:/doi.org/10.33172/jspd.v8i1.1054.Australia,“Defence Data Strategy 2021-2023”,Australia Department of Defence,https:/www.defence.gov.au/about/strate-gic-planning/defence-data-strategy-2021-2023#:text=The%205%20pillars%20in%20the,capabil
219、ity%20within%20the%20Defence%20workforce.Bonabeau,Eric.“Agent-Based Modeling:Methods and Techniques for Simulating Human Systems”,PNAS,14 May 2002,https:/doi.org/10.1073/pnas.082080899.Boulanin,Vincent.“Artificial Intelligence:A Primer”,SIPRI,May 2019,https:/www.sipri.org/sites/default/files/2019-05
220、/si-pri1905-ai-strategic-stability-nuclear-risk.pdf.Canada,“The Department of National Defence and Canadian Armed Forces Data Strategy”,Department of National Defence,18 May 2021,https:/www.canada.ca/en/department-national-defence/corporate/reports-publications/data-strategy/da-ta-strategy.html.Carl
221、ini,Nicholas,Jamie Hayes,Milad Nasr,Matthew Jagielski,Vikash Sehwag,Florian Tramr,Borja Balle,Daphne Ippolito,and Eric Wallace.“Extracting Training Data from Diffusion Models”,30 January 2023,1,https:/arxiv.org/abs/2301.13188.Chandler,Katherine.“Does Military AI Have Gender?Understanding Bias and Pr
222、omoting Ethical Approaches in Military Applica-tions of AI”,UNIDIR,7 December 2021,https:/doi.org/10.37559/GEN/2021/04.Crootof,Rebecca.“The Killer Robots are Here:Legal and Policy Implications”,Cardozo Law Review,2015,1869,https:/ Innovation Board,“AI Principles”,Defense Innovation Board,2019,https:
223、/media.defense.gov/2019/Oct/31/2002204458/-1/-1/0/DIB_AI_PRINCIPLES_PRIMARY_DOCUMENT.PDF.Defense Science Board,“Task Force Report:The Role of Autonomy in DoD Systems”,Department of Defense,July 2012,20,https:/irp.fas.org/agency/dod/dsb/autonomy.pdf.Fan.Linwei,Fan Zhang,Hui Fan,and Caiming Zhang.“Bri
224、ef Review of Image Denoising Techniques”,Visual Computing for Industry,Biomedicine,and Art,https:/doi.org/10.1186/s42492-019-0016-7.Government Business Council,“Advancing ISR at the Edge:A Survey on Networks and Processing Technologies in the Digital Battlespace”,July 2020,4,http:/ Applications of M
225、achine Learning and Autonomous Systems”,SIPRI,May 2019,https:/www.sipri.org/sites/default/files/2019-05/sipri1905-ai-strategic-stability-nuclear-risk.pdf.Hrl,Sebastian and Milos Bala.“Synthetic Population and Travel Demand for Paris and le-de-France Based on Open and Publicly Available Data”,Transpo
226、rtation Research Part C:Emerging Technologies,December 2021,https:/doi.org/10.1016/j.trc.2021.103291.Hradec.,Jiri,Massimo Craglia,Margherita Di Leo,Sarah De Nigris,Nicole Ostlaender,and Nicholas Nicholson.“Multipurpose Synthetic Population or Policy Application”,European Commission Joint Research Ce
227、ntre,13 April 2022,14,https:/dx.doi.org/10.2760/50072.EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS30Holland,Arthur.“Known Unknowns:Data Issues and Military Autonomous Systems”,UNIDIR,17 May 2021,https:/unidir.org/known-unknowns.International Telecommunications Union.“D
228、ata and Artificial Intelligence Assessment Methods(DAISAM)Reference”,ITU-T Focus Group on Artificial Intelligence for Health,May 2020,13.Jordan,Jeremy.“Variational Autoencoders”,19 March 2018,https:/www.jeremyjordan.me/variational-autoencoders/.Kannan,Subarmaniam.“Synthetic Time Series Data Generati
229、on for Edge Analytics”,F1000 Research,20 January 2022,https:/doi.org/10.12688/f1000research.72984.1.Kingma Diederick P.,and Max Welling.“An Introduction to Variational Autoencoders”,Foundations and Trends in Machine Learning,2019,https:/arxiv.org/pdf/1906.02691.pdf.Kwik,Jonathan and Tom Van Engers.“
230、Algorithmic Fog of War:When Lack of Transparency Violates the Law of Armed Conflict”,Journal of Future Robot Life,2021,7,https:/doi.org/10.3233/FRL-200019.Longford,Frank.“Experiments in Synthetic Data”,Forensic Architecture,6 November 2018,https:/forensic-architecture.org/investigation/experiments-i
231、n-synthetic-data.Manon Prdhumeau and Ed Manley,“A Synthetic Population for Agent-Based Modelling in Canada”,Scientific Data,21 March 2023,https:/doi.org/10.1038/s41597-023-02030-4.hman,Wilhelm.“Data Augmentation Using Military Simulators in Deep Learning Object Detection Applications”,KTH Royal Inst
232、itute of Technology,10 September 2019,2,https:/www.diva-portal.org/smash/get/diva2:1375838/FULLTEXT01.pdf.Organization for Economic Cooperation and Development.“OECD Science,Technology,and Industry Outlook 2012”,13 September 2012,222,https:/doi.org/10.1787/sti_outlook-2012-en.Pakistan,“Proposal for
233、an International Instrument on Lethal Autonomous Weapons(LAWS)”,CCW/GGE.1/2023/WP.2/Rev.1,8 March 2023,https:/docs-library.unoda.org/Convention_on_Certain_Conventional_Weapons_-Group_of_Governmen-tal_Experts_on_Lethal_Autonomous_Weapons_Systems_(2023)/CCW_GGE1_2023_WP.3_REv.1_0.pdfPasieka,Manuel.“A
234、Comparison of Synthetic Data Generation Methods and Synthetic Data Types”,1 September 2022,https:/mostly.ai/blog/comparison-of-synthetic-data-types/.Perez,Liliana,Suzana Dragicevic,and Jonathan Gaudreau.“A Geospatial Agent-Based Model of the Spatial Urban Dynamics of Immigrant Populations:A Study of
235、 the Island of Montreal,Canada”,PLOS ONE,24 July 2019,https:/doi.org/10.1371/journal.pone.0219188.Scharre,Paul.“Robotics on the Battlefield Part II:The Coming Swarm”,Center for a New American Security,15 October 2014,https:/as.org/publications/reports/robotics-on-the-battlefield-part-ii-the-coming-s
236、warm.Silva,Thalles.“A Short Introduction to Generative Adversarial Networks”,Thalles Blog,7 June 2017,https:/sthalles.github.io/intro-to-gans/.Smith,Andrew,Luke Archer,Alistair Ford,and James Virgo.“An Open-Source Model for Projecting Small Area Demographic and Land-Use Change”,Geographical Analysis
237、,7 February 2022,https:/doi.org/10.1111/gean.12320.State of Palestine,“State of Palestines Proposal for the Normative and Operational Framework on Autonomous Weapons Systems”,CCW/GGE.1/2023/WP.2/Rev.1,3 March 2023,https:/docs-library.unoda.org/Convention_on_Certain_Con-ventional_Weapons_-Group_of_Go
238、vernmental_Experts_on_Lethal_Autonomous_Weapons_Systems_(2023)/CCW_GGE1_2023_WP.2_Rev.1.pdf.Tucker,Allan,Zhenchen Wang,Ylenia Rotalinti,Paja Myles.“Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software”,NPJ Digital Medicine,9 November 2020,https:/ Nations
239、 General Assembly,“Disarmament and International Security(First Committee)”,https:/www.un.org/en/ga/first/.Wheaton,William,James Cajka,Bernadette Chasteen,Diane Wagener,Phillip Cooley,Laxminarayana Ganapathi,Douglas Roberts,and Justine Allpress.“Synthesized Population Database:A US Geospatial Databa
240、se for Agent-Based Models”,Methods Report RTI Press,May 2009,https:/doi.org/10.3768%2Frtipress.2009.mr.0010.0905.Wilner,Alex.“AI and the Future of Deterrence:Promises and Pitfalls”,Centre for International Governance Innovation,28 November 2022,https:/www.cigionline.org/articles/ai-and-the-future-of
241、-deterrence-promises-and-pitfalls/.Riemann,Robert.“Synthetic Data”,European Data Protection Supervisor,https:/edps.europa.eu/press-publications/publi-cations/techsonar/synthetic-data_en EXPLORING SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND AUTONOMOUS SYSTEMS31United Kingdom,“Data Strategy for Def
242、ence:Delivering the Defence Data Framework and Exploiting the Power of Data”,27 September 2021,Ministry of Defence,https:/www.gov.uk/government/publications/data-strategy-for-defence/data-strat-egy-for-defence.Vahdat,Arash and Karsten Kreis.“Improving Diffusion Models as an Alternative to GANs,Part
243、1”,NVIDIA,26 April 2022,https:/ Joo Lee,Damon Conover,and Heesung Kwon.“Synthetic Dataset Generation and Adaptation for Human Detection”,DEVCOM Army Research Laboratory,November 2020,1,https:/apps.dtic.mil/sti/pdfs/AD1115446.pdf.unidir/unidir/un_disarmresearch/unidirgeneva/unidirPalais des Nations 1211 Geneva,Switzerland UNIDIR,2023WWW.UNIDIR.ORG