Reinforcement Learning for Short Video Recommender Systems
Qingpeng Cai

Outline
1. Reinforcement Learning for Short Video Recommender Systems
2. RL for Multi-objectives (Two-Stage Constrained Actor-Critic for Short Video Recommendation, WWW 2023)
3. RL for Large Action Space (Exploration and Regularization of the Latent Action Space in Recommendation, WWW 2023)
4. RL for Delayed Feedback (Reinforcing User Retention in a Billion Scale Short Video Recommender System, WWW 2023)
5. Summary
Reinforcement Learning for Short Video RS

Difference between short video RS and other RS
- Users interact with the short video RS by scrolling up and down and watching multiple videos.
- Multi-objectives:
  - Watch time of multiple videos: the main objective, with dense responses.
  - Share, download, comment: sparse responses, treated as constraints.
- Delayed feedback: session depth and user retention.

Motivation of RL in Short Video RS
- Problems of supervised learning methods: they predict the value of an item or a list of items, lack exploration, and cannot optimize long-term value.
- Hyper-parameter tuning in the Kuaishou RS: many hyper-parameters exist, e.g., ranking scores of the form α_1·s_1 + α_2·s_2 + …, where the α_i are tunable weights on predicted responses s_i. How do we learn optimal parameters to maximize different objectives? (A minimal sketch of such a ranking function follows after this slide.)
- Objectives: watch time, interactions, session depth.
- Non-gradient methods (CEM / Bayesian optimization) are used in Kuaishou; they are unable to optimize long-term metrics and lack personalization.
- RL enables exploration and aims to maximize long-term performance.
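To make the tuning problem concrete, here is a minimal sketch of a linearly parameterized ranking function; the function and variable names, and the linear form itself, are illustrative assumptions rather than the exact Kuaishou formula.

```python
import numpy as np

def ranking_scores(predicted_responses: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Score candidates as a weighted sum of predicted responses.

    predicted_responses: shape (num_candidates, num_signals), e.g. columns for
        predicted watch time, like rate, comment rate, ...
    alpha: shape (num_signals,), the tunable ranking hyper-parameters.
    """
    return predicted_responses @ alpha

# Example: 3 candidate videos, 2 predicted signals (watch time, like rate).
preds = np.array([[12.0, 0.10],
                  [30.0, 0.02],
                  [ 8.0, 0.30]])
alpha = np.array([1.0, 50.0])                      # one point in hyper-parameter space
order = np.argsort(-ranking_scores(preds, alpha))  # candidate indices, best first
print(order)
```

Changing alpha changes the ranking, which is exactly the lever the tuning agent controls; the question raised above is how to set these weights to maximize different objectives rather than fixing them globally.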
RL for Hyper-parameter Tuning: MDP
- State: (user information, user history), where the user history consists of the states, actions, and rewards of previous steps.
- Action: the parameters of several ranking functions, represented as a continuous vector.
- Reward: the immediate responses of the request, e.g., watch time plus interaction signals.
- Episode: the sequence of requests from opening the app to leaving the app.
A minimal sketch of this MDP representation follows below.
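The sketch below shows one way these MDP elements could be represented in code; the field names and the simple flat containers are assumptions for illustration, not the production schema.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class Transition:
    """One request step: the agent picks ranking parameters and observes a reward."""
    state: np.ndarray   # user information plus an encoding of the user history
    action: np.ndarray  # continuous vector of ranking-function parameters
    reward: float       # immediate responses of the request (e.g. watch time + interactions)

@dataclass
class Episode:
    """All requests from opening the app to leaving the app."""
    transitions: List[Transition] = field(default_factory=list)

    def cumulative_reward(self) -> float:
        return sum(t.reward for t in self.transitions)
```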
RL for Hyper-parameter Tuning: Algorithms
- Objective: maximize the expected cumulative reward over an episode.
- Policy: a DNN that takes the state as input and outputs μ and σ; the action is sampled from the resulting Gaussian distribution. (A sketch of such a policy with a REINFORCE loss follows below.)
- Algorithm selection:
  - REINFORCE: slow convergence.
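A minimal sketch of the Gaussian policy described above together with a REINFORCE-style loss, written in PyTorch as an assumed framework; the layer sizes and the log-sigma head are illustrative choices, not the production model.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """DNN mapping a state to the mean and std of a Gaussian over continuous actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_sigma_head = nn.Linear(hidden, action_dim)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.backbone(state)
        mu = self.mu_head(h)
        sigma = self.log_sigma_head(h).exp()  # exponentiate so sigma stays positive
        return torch.distributions.Normal(mu, sigma)

def reinforce_loss(policy: GaussianPolicy,
                   states: torch.Tensor,
                   actions: torch.Tensor,
                   returns: torch.Tensor) -> torch.Tensor:
    """REINFORCE: increase log-probability of taken actions in proportion to returns."""
    dist = policy(states)
    log_probs = dist.log_prob(actions).sum(dim=-1)  # sum over action dimensions
    return -(log_probs * returns).mean()
```

Sampling the ranking parameters for a request is then `policy(state).sample()`; the slow convergence noted above is the usual consequence of the high variance of these return-weighted gradients.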