Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Ronald Ortner

doi:10.1613/jair.1.11316

Published: Jan 23, 2020

Keywords: reinforcement learning, Markov decision processes

Ronald Ortner

Montanuniversitaet Leoben

Abstract

We give a simple optimistic algorithm for which it is easy to derive regret bounds of $\tilde{O}(\sqrt{t_{\mathrm{mix}} SAT})$ after $T$ steps in uniformly ergodic Markov decision processes with $S$ states, $A$ actions, and mixing time parameter $t_{\mathrm{mix}}$. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
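
To make the stated bound concrete, the following sketch writes out the usual notion of regret in the non-episodic, average-reward setting. The symbols $\rho^*$ (optimal average reward) and $r_t$ (reward collected at step $t$) do not appear in the abstract and are assumed here, following the standard convention; the bound is understood up to logarithmic factors.

```latex
% Regret after T steps (assumed standard definition, not quoted from the abstract):
R_T = T\rho^* - \sum_{t=1}^{T} r_t

% Bound stated in the abstract, up to logarithmic factors:
R_T = \tilde{O}\left(\sqrt{t_{\mathrm{mix}}\, S A T}\right)

% Dividing by T: the average per-step regret vanishes as T grows.
\frac{R_T}{T} = \tilde{O}\left(\sqrt{\frac{t_{\mathrm{mix}}\, S A}{T}}\right)
```

Read this way, quadrupling the horizon $T$ roughly doubles the bound on total regret while halving the bound on per-step regret.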

Issue

Vol. 67 (2020)

Section

Articles
