- 9.4 From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
- Authors: Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn
- Reason: This paper bridges the gap between RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization), arguing that a language model trained with DPO implicitly parameterizes a Q-function, with both theoretical and practical implications for generative AI models. Its potential influence is high given its relevance to the rapidly evolving field of AI alignment and the standing of authors such as Chelsea Finn, who is recognized for her work in robotics and machine learning.
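  For context on the RLHF-DPO bridge the reason refers to, here is a minimal sketch of the standard DPO objective (from the original DPO work), whose implicit reward $\beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$ this paper reinterprets through a token-level Q-function lens. The function and argument names below are illustrative, not from either paper:

  ```python
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      """Standard DPO loss on a batch of preference pairs.

      The implicit reward of a response is beta * log(pi/pi_ref);
      the paper's claim is that in the token-level MDP these
      log-ratios can be read off as values of an implicit Q-function.
      """
      chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
      rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
      # Bradley-Terry preference likelihood on the implicit rewards.
      return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
  ```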
- 9.2 Actor-Critic Reinforcement Learning with Phased Actor
- Authors: Ruofan Wu, Junmin Zhong, Jennie Si
- Reason: The Phased Actor in Actor-Critic (PAAC) method proposed in this paper could significantly improve the quality of control policies in continuous optimal control problems. Its technical novelty and experimental validation on the DeepMind Control Suite may draw considerable attention from the RL community. Moreover, Jennie Si is a reputable author with substantial contributions to the field of reinforcement learning.
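  As a point of reference for where a phased actor would plug in, below is a minimal vanilla deterministic actor-critic (DDPG-style) update of the kind PAAC modifies. This is an assumed baseline for illustration only; the phasing rule itself, and all names here, are not taken from the paper:

  ```python
  import torch

  def actor_critic_update(actor, critic, batch, actor_opt, critic_opt,
                          gamma=0.99):
      """One generic actor-critic step on (s, a, r, s', done) tuples.

      PAAC's contribution concerns how the actor objective is formed
      across training phases; this sketch shows only the vanilla
      Q-maximizing actor update that such a method would replace.
      """
      s, a, r, s2, done = batch
      # Critic: regress Q(s, a) toward a one-step TD target.
      with torch.no_grad():
          target = r + gamma * (1.0 - done) * critic(s2, actor(s2))
      critic_loss = (critic(s, a) - target).pow(2).mean()
      critic_opt.zero_grad()
      critic_loss.backward()
      critic_opt.step()
      # Actor: maximize the critic's value of the policy's actions.
      actor_loss = -critic(s, actor(s)).mean()
      actor_opt.zero_grad()
      actor_loss.backward()
      actor_opt.step()
  ```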