- 9.2 Reinforcement Learning from Human Feedback with Active Queries
- Authors: Kaixuan Ji, Jiafan He, Quanquan Gu
- Reason: High impact due to its alignment of LLMs with human preferences and its significant reduction in the amount of human feedback required, which is a major bottleneck; a sketch of the active-querying idea follows this entry.
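As background for the active-querying idea, here is a minimal sketch of uncertainty-based query selection under a Bradley-Terry preference model: the human labels a pair only when the current reward model's predicted preference is close to a coin flip. The linear reward model, the `margin` threshold, and the candidate pairs are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of uncertainty-based active querying for preference
# feedback under a Bradley-Terry model. All names and parameters here
# are illustrative assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

def reward(x: np.ndarray, theta: np.ndarray) -> float:
    """Linear reward-model stand-in: r(x) = <theta, x>."""
    return float(theta @ x)

def preference_prob(x_a, x_b, theta) -> float:
    """Bradley-Terry probability that response a is preferred to b."""
    return 1.0 / (1.0 + np.exp(-(reward(x_a, theta) - reward(x_b, theta))))

def should_query(x_a, x_b, theta, margin: float = 0.1) -> bool:
    """Query the human only when the model is uncertain, i.e. the
    predicted preference probability is close to 1/2."""
    return abs(preference_prob(x_a, x_b, theta) - 0.5) < margin

theta = rng.normal(size=8)
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(100)]
queried = [p for p in pairs if should_query(*p, theta)]
print(f"queried {len(queried)} of {len(pairs)} pairs")
```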
- 9.1 Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path
- Authors: Qiwei Di, Jiafan He, Dongruo Zhou, Quanquan Gu
- Reason: Presents a nearly minimax-optimal algorithm for a fundamental problem in RL, with rigorous theoretical bounds.
- 8.9 PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
- Authors: Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou
- Reason: Introduces a novel approach to stable reward finetuning that scales to large datasets, with theoretical foundations and practical utility demonstrated by superior generation quality in experiments.
- 8.8 Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic Methods
- Authors: Alberto Sinigaglia, Niccolò Turcato, Alberto Dalla Libera, Ruggero Carli, Gian Antonio Susto
- Reason: Introduces innovative methods for handling estimation bias in value estimates, which is crucial for the stability and performance of deep RL; see the sketch after this entry.
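For context, the sketch below shows the standard clipped double-Q target (popularized by TD3) that actor-critic methods use to counter overestimation bias; it illustrates the baseline this line of work builds on, not the paper's proposed method.

```python
# Minimal sketch of the standard clipped double-Q Bellman target used
# in actor-critic methods (as in TD3) to counter overestimation bias.
# Background illustration only, not the paper's contribution.
import numpy as np

def clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """Bellman target using the minimum of two target critics.

    Taking the min dampens the positive estimation bias that a single
    max-based bootstrap accumulates over training.
    """
    q_min = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - done) * q_min

# Toy usage with batch-shaped arrays.
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q1 = np.array([10.0, 5.0])
q2 = np.array([8.0, 7.0])
print(clipped_double_q_target(r, done, q1, q2))  # [1 + 0.99 * 8, 0.0]
```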
- 8.7 Hybrid Inverse Reinforcement Learning
- Authors: Juntao Ren, Gokul Swamy, Zhiwei Steven Wu, J. Andrew Bagnell, Sanjiban Choudhury
- Reason: Proposes a reduction from inverse RL to expert-competitive RL, yielding significant improvements in sample efficiency and showing potential for broad impact on continuous control tasks.
- 8.6 Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning
- Authors: Michael Lanier, Ying Xu, Nathan Jacobs, Chongjie Zhang, Yevgeniy Vorobeychik
- Reason: Offers a novel framework for interpretable policy learning in POMDPs, balancing performance and interpretability.
- 8.5 MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
- Authors: Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang
- Reason: Offers an original perspective on equitable alignment with diverse human preferences in RLHF, demonstrating significantly improved win rates for minority groups without sacrificing majority-group performance; see the sketch after this entry.
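To make the max-min criterion concrete, the toy sketch below contrasts selecting a policy by its average score across preference groups (as pooled single-reward RLHF effectively does) with selecting by its worst group score. The candidate policies and per-group scores are invented for illustration; this is not the paper's training procedure.

```python
# Minimal sketch of a max-min selection over group reward models,
# illustrating the equitable-alignment objective. Toy data only.
import numpy as np

# Rows: candidate policies; columns: preference groups.
group_scores = np.array([
    [1.0, 0.35],  # great for the majority, poor for the minority
    [0.7, 0.60],  # balanced
    [0.5, 0.50],
])

# A pooled (average) criterion favors the majority-pleasing policy...
avg_best = int(np.argmax(group_scores.mean(axis=1)))
# ...while the max-min criterion protects the worst-off group.
maxmin_best = int(np.argmax(group_scores.min(axis=1)))

print(f"average criterion picks policy {avg_best}")    # policy 0
print(f"max-min criterion picks policy {maxmin_best}") # policy 1
```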
- 8.4 Measuring Exploration in Reinforcement Learning via Optimal Transport in Policy Space
- Authors: Reabetswe M. Nkhumise, Debabrota Basu, Tony J. Prescott, Aditya Gilra
- Reason: Proposes a new metric for measuring exploration in RL, with the potential to influence how exploration strategies are evaluated and compared; see the sketch after this entry.
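One plausible reading of the idea, sketched below under simplifying assumptions, is an optimal-transport (Wasserstein) distance between state-visitation samples of successive policies: large early distances followed by small late ones would indicate exploration giving way to exploitation. The 1-D state representation and Gaussian toy data are assumptions for illustration, not the paper's exact metric.

```python
# Minimal sketch: optimal-transport distance between state-visitation
# samples of successive policies as an exploration signal. Toy data;
# the paper's actual metric may differ.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Toy stand-in: 1-D states visited under three successive policies.
visits = [
    rng.normal(loc=0.0, scale=1.0, size=500),  # initial policy
    rng.normal(loc=0.5, scale=1.2, size=500),  # after some updates
    rng.normal(loc=0.6, scale=1.2, size=500),  # converging
]

# Large steps early and small steps late suggest exploration giving
# way to exploitation.
for t in range(len(visits) - 1):
    d = wasserstein_distance(visits[t], visits[t + 1])
    print(f"OT distance from policy {t} to {t + 1}: {d:.3f}")
```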
- 8.3 Second Order Methods for Bandit Optimization and Control
- Authors: Arun Suggala, Y. Jennifer Sun, Praneeth Netrapalli, Elad Hazan
- Reason: Provides an optimal regret bound for a new class of convex functions, significantly improving computational efficiency in high-dimensional bandit problems and potentially resolving several open questions in the field; see the sketch after this entry.
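For flavor, the sketch below implements the classical Online Newton Step on a toy quadratic loss; it illustrates what "second-order" means in online optimization, not the paper's new algorithm. The loss, learning rate, and omitted feasible-set projection are toy choices.

```python
# Minimal sketch of the classical Online Newton Step (ONS) update on a
# toy quadratic loss, to illustrate second-order online optimization.
# Textbook ONS, not the paper's algorithm; parameters are toy choices.
import numpy as np

d, T, gamma, eps = 5, 200, 2.0, 1.0
rng = np.random.default_rng(0)
x = np.zeros(d)
A = eps * np.eye(d)           # running second-order statistic
x_star = rng.normal(size=d)   # hidden optimum of the toy loss

for t in range(T):
    g = x - x_star                  # gradient of 0.5 * ||x - x_star||^2
    A += np.outer(g, g)             # rank-one second-order update
    x = x - (1.0 / gamma) * np.linalg.solve(A, g)
    # The full method projects x back onto the feasible set here.

print(f"final distance to optimum: {np.linalg.norm(x - x_star):.4f}")
```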
- 8.1 Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption
- Authors: Chenlu Ye, Jiafan He, Quanquan Gu, Tong Zhang
- Reason: Addresses the critical problem of adversarial corruption in model-based RL with two novel algorithms, CR-OMLE and CR-PMLE, achieving regret and suboptimality bounds that nearly match the corresponding lower bounds.