UCB proof uses an assumption that is never justified


I'm walking through DeepMind's course on reinforcement learning: https://www.youtube.com/watch?v=aQJP3Z2Ho8U&t=4940s

At timestamp 1:14:40, the lecturer states that we assume there is a time step $m \leq t$ (where $t$ is the current time step) for which:

$N_m(a)\Delta_a \leq x_a \log(m)$

where:

  1. $N_m(a)$ is the number of times we picked action $a$ up to time step $m$.
  2. $\Delta_a$ is the expected regret for picking action $a$.
  3. $x_a$ is some constant that we have not determined yet.
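
To make these quantities concrete, here is a minimal Python sketch of my own (not from the lecture) that runs a UCB-style rule on a small Bernoulli bandit, records $N_m(a)$ for every step, and lists the steps $m \leq t$ at which the inequality holds. Everything in it is an assumption for illustration: the names `run_ucb` and `steps_satisfying`, the arm means, the exploration constant `c`, the chosen value of `x_a`, and the convention that $N_m(a)$ counts pulls up to and including step $m$.

```python
import math
import random


def run_ucb(means, horizon, c=2.0, seed=0):
    """Run a UCB-style rule on a Bernoulli bandit.

    Returns history, where history[m][a] is N_m(a): the number of times
    action a was picked in steps 1..m (history[0] is all zeros).
    The bonus sqrt(c * log(t) / N) is an assumed form, not taken from the lecture.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    history = [counts.copy()]
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pick every action once before using the bonus
        else:
            a = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(c * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        history.append(counts.copy())
    return history


def steps_satisfying(history, gaps, a, x_a):
    """All m in 1..t with N_m(a) * Delta_a <= x_a * log(m)."""
    return [
        m
        for m in range(1, len(history))
        if history[m][a] * gaps[a] <= x_a * math.log(m)
    ]


if __name__ == "__main__":
    means = [0.9, 0.5]                     # hypothetical arm means
    gaps = [max(means) - mu for mu in means]  # Delta_a for each action
    history = run_ucb(means, horizon=1000)
    ok = steps_satisfying(history, gaps, a=1, x_a=1.0)
    print(len(ok), "of", len(history) - 1, "steps satisfy the inequality")
```

The result depends on the counting convention and on the value picked for `x_a`, so this is only a way to inspect the condition numerically, not a substitute for the argument in the lecture.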

The rest of the proof seems to rely on this assumption, but the lecturer never stops to justify it. What if no such time step exists? Doesn't the whole proof become useless in that case?