8.9 Principled RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
8.7 MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games
8.5 Self-Play Preference Optimization for Language Model Alignment
8.3 UCB-driven Utility Function Search for Multi-objective Reinforcement Learning
8.1 No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO