Although my question contains code, my question is regarding the mathematics of the algorithm. With that being said...
I'm meant to implement a nonstationary bandit algorithm using an action-value method with 1) sample averages and 2) a constant step size, alpha = 0.1. See the full assignment instructions below:
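For reference, my understanding of the two incremental update rules in question (a sketch in my own words; the function names and signatures here are mine, not from the assignment):

```python
import numpy as np

def sample_average_update(Q, N, a, reward):
    # Sample-average: step size 1/N(a), so every past reward for arm `a`
    # weighs equally in the estimate.
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]

def constant_step_update(Q, a, reward, alpha=0.1):
    # Constant step size: recent rewards weigh more (exponential recency
    # weighting), which should track a nonstationary problem better.
    Q[a] += alpha * (reward - Q[a])
```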
Here's where I'm confused. Our instructions dictate:
"At a particular step, the ratio of optimal actions is calculated as the number of times the bandit took an action that was actually optimal at that timestep divided by the total number of runs, which is 300. Therefore, if at timestep 7913, the agent selected the optimal action in 210 out of the 300 runs (70%), the value for timestep 7913 should be 0.7."
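If I'm reading that right, the bookkeeping amounts to the following (a sketch; `optimal_ratio`, `actions`, and `true_q_history` are my own names, and it assumes the true action values are recorded separately from the estimates):

```python
import numpy as np

def optimal_ratio(actions, true_q_history):
    # actions: (runs, steps) array of the arm the agent chose each timestep
    # true_q_history: (runs, steps, arms) array of the *true* (random-walking)
    # action values at each timestep of each run
    optimal_arm = true_q_history.argmax(axis=2)  # arm that was actually best
    hits = (actions == optimal_arm)              # did the agent pick it?
    return hits.mean(axis=0)                     # fraction of runs, per timestep
```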
And we've been told that a correctly implemented algorithm should demonstrate roughly the following behavior:
My code produces the following graph:

What I don't understand is how the ratio of optimal actions is meant to grow the way it does in the first graph. If the greedy action is taken 90% of the time throughout, wouldn't we generally expect the algorithm to select the optimal action at about the same rate at the beginning as at the end? How would it select the optimal action less often at the beginning and more often at the end?
Here is my code (you may need to install packages, but otherwise it can be run as is):
import numpy as np
import random
from scipy.stats import norm
import sys
import matplotlib.pyplot as plt
def sample_average(steps, runs):
    rewards = np.zeros(steps)
    optimal = np.zeros(steps)
    for r in range(runs):
        q = np.zeros(10)  # q starts out equally at zero
        for t in range(steps):
            flag = False  # optimal action selected flag
            eps = random.random()
            if eps > 0.1:
                flag = True  # optimal action selected
                a = np.argmax(q)  # greedy
            else:
                a = random.randint(0, 9)  # random
            # take independent random walk
            n = norm.rvs(loc=0, scale=0.01)
            q[a] += n  # add random walk value to selected arm
            # reward generated from normal distribution, mean=q[a], std_dev=1
            reward = norm.rvs(loc=q[a], scale=1)
            rewards[t] += reward
            if flag == True:
                optimal[t] += 1
    rewards = rewards / runs
    optimal = optimal / runs
    return rewards, optimal
def constant_step_size(steps, runs):
    rewards = np.zeros(steps)
    optimal = np.zeros(steps)
    alpha = 0.1
    for r in range(runs):
        q = np.zeros(10)  # q starts out equally at zero
        for t in range(steps):
            flag = False  # optimal action selected flag
            eps = random.random()
            if eps > 0.1:
                flag = True  # optimal action selected
                a = np.argmax(q)  # greedy
            else:
                a = random.randint(0, 9)  # random
            # take independent random walk
            n = norm.rvs(loc=0, scale=0.01)
            q[a] += n  # add random walk value to selected arm
            # reward generated from normal distribution, mean=q[a], std_dev=1
            reward = norm.rvs(loc=q[a], scale=1)
            rewards[t] += reward * alpha
            if flag == True:
                optimal[t] += 1
            q[a] += alpha * (reward - q[a])
    rewards = rewards / runs
    optimal = optimal / runs
    return rewards, optimal
if __name__ == "__main__":
    steps = 10000
    runs = 300
    rewards_sa, optimal_sa = sample_average(steps, runs)
    rewards_cs, optimal_cs = constant_step_size(steps, runs)
    # Plot the results
    fig, axes = plt.subplots(2, 1)
    axes[1].set_ylim([0., 1.])
    axes[0].plot(rewards_sa)
    axes[1].plot(optimal_sa)
    axes[0].plot(rewards_cs)
    axes[1].plot(optimal_cs)
    axes[0].set_title('Average Rewards')
    axes[1].set_title('Average Ratio of Optimal Actions')
    fig.show()
