November 26, 2023
Offline Reinforcement Learning with Reward-Free Datasets
Chongjie Zhang, Hao Hu (胡昊), Yiqin Yang (杨以钦)
Machine Intelligence Group

Intelligent Decision-Making/Control
Game AI, recommender systems, intelligent robots.

Reinforcement Learning: Opportunities and Challenges
- Success in artificial domains.
- Real-world challenges: online interaction is expensive and dangerous (healthcare, robotics, recommendation); sample complexity; transfer.

Data-Driven Solution: Offline RL
- Interaction cost → no interaction: learn from big data collected in past interactions.
- Sample complexity → train the policy for many epochs on the fixed dataset.
- Transfer → finetune by sampling from the target task, with only occasional interaction for more data.

Offline RL Setting
- Dataset: D = {(s_i, a_i, r_i, s'_i)}, collected by an unknown behavior policy π_β.
- Objective: max_π E_π[Σ_t γ^t r(s_t, a_t)].
- Problem setting: the policy is learned from a static dataset; interactions with the environment are not allowed.

Challenges of Offline RL
- Significant overestimation: extrapolation error + bootstrapping. An out-of-distribution action can look attractive ("go right to get higher!") because the Bellman backup Q(s, a) ← r(s, a) + γ max_{a'} Q(s', a') maxes over actions the dataset never covers, and bootstrapping propagates the resulting error.
- Reward-free datasets: reward-free data can be cheap, while a dataset labeled for a specific task can be expensive.

Outline
1. Offline RL with EVL (ICLR 2022).
2. Provable Data Sharing (ICLR 2023): online RL with reward-free data.
3. Behavior Extraction via Random Intentions (NeurIPS 2023).

First Challenge: Significant Overestimation in Offline RL
Xiaoteng Ma*, Yiqin Yang*, Hao Hu*, Jun Yang, Chongjie Zhang+, Qianchuan Zhao, Bin Liang, and Qihan Liu. Offline Reinforcement Learning with Value-based Episodic Memory. ICLR 2022.

Reasons for Overestimation in Offline RL
- Extrapolation error.
[Figure 2: true Q-value vs. estimated Q-value — OOD actions are overestimated in value estimation.]
Theorem 1: Given a deter
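The extrapolation-error-plus-bootstrapping failure mode can be reproduced in a few lines. Below is a minimal sketch, not taken from the slides: a hypothetical single-state MDP with a self-loop and actions {0, 1, 2}, where the static dataset only ever contains actions 0 and 1, so action 2 is out-of-distribution. Fitted Q-iteration with a linear function approximator extrapolates a high value for the OOD action, and the max in the Bellman backup plus bootstrapping inflates that error to a fixed point well above the true optimal value. The "in-support" variant, which backs up only over dataset actions, is an illustrative contrast in the spirit of in-sample methods, not the actual EVL algorithm.

```python
# Toy demo of overestimation in offline RL: extrapolation error + bootstrapping.
# Hypothetical setup: one state with a self-loop, actions {0, 1, 2};
# the offline dataset covers only actions 0 and 1, so a=2 is OOD.

GAMMA = 0.9
REWARDS = {0: 0.0, 1: 1.0, 2: 0.0}  # true rewards; the OOD action is worthless
DATASET = [0, 1]                     # actions present in the static dataset

def fit_linear(points):
    """Exact least-squares fit of Q(a) = w*a + b through two (a, target) pairs."""
    (a0, y0), (a1, y1) = points
    w = (y1 - y0) / (a1 - a0)
    return w, y0 - w * a0

def fitted_q_iteration(ood_backup, iters=200):
    w, b = 0.0, 0.0
    for _ in range(iters):
        q = lambda a: w * a + b
        # Bellman backup: ood_backup=True maxes over ALL actions (including
        # the OOD a=2, whose value is pure linear extrapolation);
        # ood_backup=False restricts the max to in-dataset actions.
        acts = [0, 1, 2] if ood_backup else DATASET
        boot = max(q(a) for a in acts)
        targets = [(a, REWARDS[a] + GAMMA * boot) for a in DATASET]
        w, b = fit_linear(targets)
    return lambda a: w * a + b

naive = fitted_q_iteration(ood_backup=True)
insupport = fitted_q_iteration(ood_backup=False)

true_v = 1.0 / (1.0 - GAMMA)  # always take a=1: V* = 10
print(f"true optimal value:        {true_v:.2f}")   # 10.00
print(f"naive max Q (OOD backup):  {max(naive(a) for a in [0, 1, 2]):.2f}")
print(f"in-support max Q:          {max(insupport(a) for a in DATASET):.2f}")
```

The naive backup converges to a max Q of 20, double the true optimal value of 10, because each iteration bootstraps from the extrapolated Q(2); the in-support backup recovers the correct value. This is the same mechanism, in miniature, that causes the significant overestimation discussed above.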