# Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control

@inproceedings{Mao2020ModelFreeNR,
  title={Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control},
  author={Weichao Mao and Kaiqing Zhang and Ruihao Zhu and David Simchi-Levi and Tamer Ba{\c{s}}ar},
  year={2020}
}

We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes. Both the reward functions and the state transition functions are allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL, and show that it outperforms existing solutions in terms of dynamic regret. Specifically…
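The abstract describes RestartQ-UCB only at a high level: optimistic Q-learning combined with scheduled restarts that discard stale estimates once the environment may have drifted. As a rough illustration only (not the paper's actual algorithm — the restart schedule, learning rate, and bonus form below are simplified assumptions, and the `env` interface is hypothetical), a tabular sketch might look like:

```python
import math

def restart_q_ucb(env, H, S, A, K, restart_every, c=1.0):
    """Simplified sketch of restarted optimistic Q-learning for an
    episodic MDP with horizon H, S states, A actions, over K episodes.
    Periodic restarts forget estimates so that data collected before
    the environment drifted does not bias learning."""
    Q = N = None
    for k in range(K):
        if k % restart_every == 0:
            # Restart: re-initialize optimistically and reset visit counts.
            Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
            N = [[[0] * A for _ in range(S)] for _ in range(H)]
        s = env.reset()
        for h in range(H):
            # Greedy action with respect to the optimistic Q estimates.
            a = max(range(A), key=lambda x: Q[h][s][x])
            s2, r = env.step(a)
            N[h][s][a] += 1
            n = N[h][s][a]
            alpha = (H + 1) / (H + n)          # decaying learning rate
            bonus = c * math.sqrt(H**3 / n)    # UCB-style exploration bonus
            v_next = min(H, max(Q[h + 1][s2])) if h + 1 < H else 0.0
            Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * (r + v_next + bonus)
            s = s2
    return Q
```

The restart interval would in practice be chosen as a function of the variation budgets (or tuned adaptively when they are unknown); the fixed `restart_every` here is purely illustrative.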

#### References

Showing 1–10 of 86 references

Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

- Computer Science, Mathematics
- ICML
- 2020

This work develops the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm and establishes its dynamic regret bound when the variation budgets are known. It further proposes the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune SWUCRL2-CW, achieving the same dynamic regret bound without knowledge of the budgets.

Nonstationary Reinforcement Learning with Linear Function Approximation

- Computer Science, Mathematics
- ArXiv
- 2020

This work develops the first dynamic regret analysis for non-stationary reinforcement learning with function approximation, studying episodic Markov decision processes with linear function approximation under a drifting environment. It also proposes a parameter-free algorithm that works without knowing the variation budgets, at the cost of a slightly worse dynamic regret bound.

Online Markov Decision Processes Under Bandit Feedback

- Computer Science
- IEEE Transactions on Automatic Control
- 2014

It is shown that after T time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is O(T^{1/2} ln T), giving the first rigorously proven, essentially tight regret bound for the problem.

Near-optimal Regret Bounds for Reinforcement Learning

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2008

This work presents a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. It also introduces a new parameter: an MDP has diameter D if, for any pair of states s, s', there is a policy that moves from s to s' in at most D steps.

Dynamic Regret of Policy Optimization in Non-stationary Environments

- Computer Science, Mathematics
- NeurIPS
- 2020

This work proposes two model-free policy optimization algorithms, POWER and POWER++, and establishes guarantees for their dynamic regret. It shows that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction.

Online algorithms for the multi-armed bandit problem with Markovian rewards

- Computer Science, Mathematics
- 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
- 2010

It is shown that under certain conditions on the state transition probabilities of the arms, a sample mean based index policy achieves logarithmic regret uniformly over the total number of trials.

Is Q-learning Provably Efficient?

- Computer Science, Mathematics
- NeurIPS
- 2018

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…

Decentralized Q-Learning in Zero-sum Markov Games

- Computer Science, Mathematics
- ArXiv
- 2021

This work develops a radically uncoupled Q-learning dynamics that is both rational and convergent: the dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy, and the value function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics.

Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

- Computer Science
- NIPS
- 2014

This paper fully characterizes the regret complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.

Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management

- Computer Science, Mathematics
- EC
- 2019

The main contribution is a learning algorithm with a regret bound of Õ(L√T + D) for the inventory control problem, significantly improving the previously best known regret bounds, in which the dependence on L was exponential and many further assumptions on the demand distribution were required.