- 8.9 Principled RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
- Authors: Chanwoo Park, Mingyang Liu, Kaiqing Zhang, Asuman Ozdaglar
- Reason: Addresses the challenge of heterogeneous human feedback in RLHF with two principled frameworks, one based on personalization and one on preference aggregation; the authors are from reputable institutions, and the paper presents strong theoretical guarantees. A toy sketch of the aggregation idea follows below.
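A minimal sketch of the aggregation side of this idea, assuming per-group reward models have already been fit: score candidate responses under each group's reward model, then combine the scores with a social welfare function. The utilitarian and Nash-style aggregators here are standard choices from the social-choice literature, not necessarily the paper's exact formulation, and all names and numbers are illustrative.

```python
import numpy as np

def utilitarian_welfare(rewards):
    """Average each response's rewards across annotator groups."""
    return np.mean(rewards, axis=0)

def nash_welfare(rewards, eps=1e-8):
    """Sum of log rewards (Nash-style aggregation); assumes rewards are positive."""
    return np.sum(np.log(np.clip(rewards, eps, None)), axis=0)

# rewards[i, j]: score that group i's (hypothetical) reward model gives response j
rewards = np.array([
    [0.9, 0.2, 0.5],  # group 1
    [0.1, 0.8, 0.6],  # group 2
])

for name, agg in [("utilitarian", utilitarian_welfare), ("Nash", nash_welfare)]:
    scores = agg(rewards)
    print(f"{name}: scores={np.round(scores, 3)}, pick response {int(np.argmax(scores))}")
```

Note how the two welfare functions can disagree: utilitarian aggregation favors high average scores, while the Nash-style product penalizes responses that any one group strongly dislikes.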
- 8.7 MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games
- Authors: Anran Hu, Junzi Zhang
- Reason: Presents the first fully polynomial multi-agent RL algorithm for provably computing Nash equilibria of large-population games beyond variants of zero-sum and potential games; a significant theoretical contribution to multi-agent RL. A toy illustration of the solution concept follows below.
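MF-OML itself works with occupation measures and online RL; the script below only illustrates the underlying solution concept, a mean-field Nash equilibrium, on a toy static congestion game solved by damped smoothed best-response iteration. The cost model, damping factor, and softmax temperature are all made-up assumptions, not anything from the paper.

```python
import numpy as np

# Toy mean-field congestion game: a continuum of agents each picks one of K
# sites, and a site's cost grows with the fraction of agents choosing it.
K = 3
base_cost = np.array([1.0, 2.0, 3.0])   # intrinsic cost of each site

def cost(mu):
    return base_cost + 4.0 * mu          # congestion term: crowded sites cost more

mu = np.ones(K) / K                      # initial population distribution
for t in range(500):
    c = cost(mu)
    br = np.exp(-5.0 * c)                # smoothed (softmax) best response
    br /= br.sum()
    mu = 0.9 * mu + 0.1 * br             # damped mean-field update

print("equilibrium distribution:", np.round(mu, 3))
print("equilibrium costs:       ", np.round(cost(mu), 3))
```

At the fixed point, no agent can lower its (smoothed) cost by switching sites, which is the mean-field analogue of the Nash condition the paper's algorithm targets in the sequential setting.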
- 8.5 Self-Play Preference Optimization for Language Model Alignment
- Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
- Reason: Proposes a novel self-play method for language model alignment with theoretical convergence guarantees and state-of-the-art empirical performance. A hedged sketch of a self-play preference loss appears below.
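A minimal PyTorch-style sketch of a square-loss self-play preference update in the spirit of this method: regress the policy's log-ratio against the reference policy toward a centered estimate of the response's win probability under self-play. The function name, the constant `eta`, and the exact loss form are a paraphrase for illustration, not necessarily the paper's objective.

```python
import torch

def self_play_pref_loss(logp_theta, logp_ref, win_prob, eta=10.0):
    """
    Square-loss self-play preference update (illustrative paraphrase).

    logp_theta: log pi_theta(y|x) for sampled responses y            [batch]
    logp_ref:   log pi_ref(y|x) under the reference policy           [batch]
    win_prob:   estimated P(y beats a fresh sample from pi_t | x)    [batch]
    """
    log_ratio = logp_theta - logp_ref
    target = eta * (win_prob - 0.5)   # push log-ratio toward the centered win rate
    return ((log_ratio - target) ** 2).mean()

# Toy usage with made-up numbers.
logp_theta = torch.tensor([-12.3, -8.1], requires_grad=True)
logp_ref = torch.tensor([-12.0, -9.0])
win_prob = torch.tensor([0.7, 0.4])   # e.g., estimated from pairwise comparisons

loss = self_play_pref_loss(logp_theta, logp_ref, win_prob)
loss.backward()
print(loss.item(), logp_theta.grad)
```

The self-play aspect is that `win_prob` is estimated against samples from the current policy itself, so the target the policy chases moves as it improves.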
- 8.3 UCB-driven Utility Function Search for Multi-objective Reinforcement Learning
- Authors: Yucheng Shi, Alexandros Agapitos, David Lynch, Giorgio Cruciata, Hao Wang, Yayu Yao, Aleksandar Milenovic
- Reason: Introduces an innovative UCB-driven approach to optimizing utility-function weight vectors in MORL, demonstrating improved performance on benchmark problems. A minimal UCB sketch follows below.
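A minimal sketch of the UCB idea, assuming a fixed pool of candidate weight vectors and a black-box `evaluate` that trains a policy under the scalarized reward `w @ r` and returns a noisy utility estimate (simulated here with made-up numbers). UCB1 then trades off exploring under-tried weight vectors against exploiting the best one found so far; the paper's search space and scoring are likely richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate weight vectors over two objectives (each sums to 1).
candidates = np.array([[w, 1 - w] for w in np.linspace(0.0, 1.0, 5)])

# Hypothetical black box: trains a policy under the scalarized reward for
# candidate i and returns a noisy estimate of its true utility.
true_utility = np.array([0.2, 0.5, 0.9, 0.6, 0.3])
def evaluate(i):
    return true_utility[i] + 0.1 * rng.standard_normal()

counts = np.zeros(len(candidates))
means = np.zeros(len(candidates))
for t in range(1, 201):
    if 0 in counts:
        i = int(np.argmin(counts))                     # play each arm once first
    else:
        ucb = means + np.sqrt(2 * np.log(t) / counts)  # UCB1 index
        i = int(np.argmax(ucb))
    u = evaluate(i)
    counts[i] += 1
    means[i] += (u - means[i]) / counts[i]             # running mean update

best = int(np.argmax(means))
print("best weight vector:", candidates[best], "estimated utility:", round(means[best], 3))
```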
- 8.1 No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
- Authors: Skander Moalla, Andrea Miele, Razvan Pascanu, Caglar Gulcehre
- Reason: Offers critical insight, backed by empirical studies, into how representations degrade in PPO, and proposes a new auxiliary loss that improves PPO agent performance. One plausible form of such a loss is sketched below.
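One plausible form of a representation-regularizing auxiliary loss for PPO, sketched under the assumption that it penalizes how far the encoder's features drift from a pre-update snapshot during PPO's optimization epochs. The paper's actual formulation may differ in detail, and the coefficient, architecture, and data here are invented for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 64))

def aux_feature_loss(encoder, old_features, states):
    """Mean squared drift of current features from their pre-update values."""
    return ((encoder(states) - old_features) ** 2).mean()

states = torch.randn(32, 8)           # a minibatch of observations
with torch.no_grad():
    old_features = encoder(states)    # snapshot taken before the PPO epochs

# Inside each PPO minibatch step, add the auxiliary term to the usual objective:
ppo_loss = torch.tensor(0.0)          # stand-in for the clip + value + entropy terms
coef = 0.1                            # auxiliary loss weight (a made-up value)
total_loss = ppo_loss + coef * aux_feature_loss(encoder, old_features, states)
total_loss.backward()
```

The design intent is trust-region-like: just as PPO's clipping constrains policy change per update, the auxiliary term constrains how quickly the learned features can move, guarding against representation collapse.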