  1. 9.2 Reinforcement Learning from Human Feedback with Active Queries
  2. 9.1 Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path
  3. 8.9 PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
  4. 8.8 Exploiting Estimation Bias in Deep Double Q-Learning for Actor-Critic Methods
  5. 8.7 Hybrid Inverse Reinforcement Learning
  6. 8.6 Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning
  7. 8.5 MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
  8. 8.4 Measuring Exploration in Reinforcement Learning via Optimal Transport in Policy Space
  9. 8.3 Second Order Methods for Bandit Optimization and Control
  10. 8.1 Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption