Nonstationary Bandit - Optimal Action Accumulation

11 Views Asked by At

Although my question contains code, my question is regarding the mathematics of the algorithm. With that being said...

I'm meant to implement a Nonstationary Bandit algorithm using an action-value method using 1) sample averages and 2) constant step-size, with alpha=0.1.See full assignment instructions below:

enter image description here

Here's where I'm confused. Our instructions dictate: "At a particular step, the ratio of optimal actions is calculated as the number of times the bandit took an action that was actually optimal at that timestep divided by the total number of runs, which is 300. Therefore, if at timestep 7913, the agent selected the optimal action in 210 out of the 300 runs (70%), the value for timestep 7913 should be 0.7." And we've been told that a correctly implemented algorithm should demonstrate roughly the following behavior: enter image description here My code produces the following graph: enter image description here

What I don't understand is how the optimal actions are meant to grow as they do in the first graph? If actions are randomly selected, wouldn't we generally see that the algorithm selects the optimal action 90% of the time at the beginning and at the end? How would it select the optimal action less at the beginning and more at the end?

Here is my code (you may need to install packages, but otherwise it can be run as is):

import numpy as np
import random
from scipy.stats import norm
import sys
import matplotlib.pyplot as plt

def sample_average(steps, runs):
   rewards = np.zeros(steps)
   optimal = np.zeros(steps)

   for r in range(runs):
       q = np.zeros(10) # q starts out equally at zero
       for t in range(steps):
           flag = False # optimal action selected flag
           
           eps = random.random()
           if eps > 0.1:
               flag = True # optimal action selected
               a = np.argmax(q) # greedy
           else:
               a = random.randint(0,9) # random
           
           # take independent random walk
           n = norm.rvs(loc=0, scale=0.01)
           q[a] += n # add random walk value to selected arm
           
           # reward generated from normal distribution, mean=q[a], std_dev=1
           reward = norm.rvs(loc=q[a], scale=1)
           rewards[t] += reward
           
           if flag == True:
               optimal[t] += 1

   rewards = rewards / runs
   optimal = optimal / runs
   return rewards, optimal

def constant_step_size(steps, runs):
   rewards = np.zeros(steps)
   optimal = np.zeros(steps)
   alpha = 0.1

   for r in range(runs):
       q = np.zeros(10) # q starts out equally at zero
       for t in range(steps):
           flag = False # optimal action selected flag
           
           eps = random.random()
           if eps > 0.1:
               flag = True # optimal action selected
               a = np.argmax(q) # greedy
           else:
               a = random.randint(0,9) # random

           # take independent random walk
           n = norm.rvs(loc=0, scale=0.01)
           q[a] += n # add random walk value to selected arm
           
           # reward generated from normal distribution, mean=q[a], std_dev=1
           reward = norm.rvs(loc=q[a], scale=1)
           rewards[t] += reward*alpha
           
           if flag == True:
               optimal[t] += 1

           q[a] += alpha*(reward - q[a])

   rewards = rewards / runs
   optimal = optimal / runs
   return rewards, optimal

if __name__ == "__main__":
   steps = 10000
   runs = 300

   rewards_sa, optimal_sa = sample_average(steps, runs)
   rewards_cs, optimal_cs = constant_step_size(steps, runs)

   # Plot the results
   fig,axes = plt.subplots(2,1)
   axes[1].set_ylim([0.,1.])
   axes[0].plot(rewards_sa)
   axes[1].plot(optimal_sa)
   axes[0].plot(rewards_cs)
   axes[1].plot(optimal_cs)
   axes[0].set_title('Average Rewards')
   axes[1].set_title('Average Ratio of Optimal Actions')
   fig.show()