I'm interested in numerically finding the maximum likelihood estimator of a parameter $\theta$, as well as the confidence interval of this estimator. First I'll describe the method I've been trying, then I'll ask my specific questions about this method.
I perform $N$ binary trials at each possible value of $\theta$ (in my case, $\theta$ is integer-valued and known to fall in a certain range). Given these results, I can choose the maximum likelihood estimator $\hat \theta$ to be the value of $\theta$ that results in the most positive trials.
To find a confidence interval for $\hat \theta$, I can take subsets of my trials and use these to find estimators $\hat\theta_i$. Let the fraction of the trials that I take for each subset be $d$. Let $\hat\theta_\mu$ be the mean of all these estimators $\hat\theta_i$, and $\hat\theta_\sigma$ be the standard deviation.
My questions are: For my estimator, is it better to use the original estimator $\hat\theta$ based on all the data, or $\hat\theta_\mu$ based on my subset estimators? Is the confidence interval just equal to $\hat\theta \pm \hat\theta_\sigma$, or can I take into account the fact that the subset estimators will have more variance than the overall estimator, and use $\hat\theta \pm \sqrt{d} \hspace{3pt}\hat\theta_\sigma$? Is this overall a sound method, or are there other ways to improve it, or another method I should use entirely? I can always run more trials; for example, I could run many sets of $N$ trials independently and then get a confidence interval from that, but of course that gets costly.
I think you want the confidence interval on $\hat p_\theta$, the estimated probability of success for each $\theta$, rather than on $\hat \theta$; I am not sure what the latter would mean, given that there is a finite discrete set of $\theta$, each with its own independent $p_\theta$. In general, if you want to find the confidence interval for some statistic, you are better off taking bootstrap samples (i.e., draws of $N$ samples with replacement) rather than subsamples. With bootstrap samples you do not need to worry about choosing a $d$, and your variance estimate is just the variance across the samples. However, in your case sampling seems unnecessary: the $N$ trials for each $\theta$ constitute a draw from a binomial distribution with probability $p_\theta$, so you can compute a confidence interval for $p_\theta$ analytically. It is fairly easy to see that this is equivalent to the estimate you would get from a large number of bootstrap samples, since the bootstrap samples for each $\theta$ are just draws from the binomial distribution with $p = \hat p_\theta = S/N$, where $S$ is the number of successes for $\theta$.
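As a concrete sketch of the analytic interval, here is the Wilson score interval for a binomial proportion, one common choice (the simpler Wald interval $\hat p_\theta \pm z\sqrt{\hat p_\theta(1-\hat p_\theta)/N}$ behaves poorly when $\hat p_\theta$ is near 0 or 1). The counts in the example call are made up for illustration:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 gives an approximate 95% interval.
    """
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# e.g. S = 60 successes out of N = 100 trials at some theta
lo, hi = wilson_ci(60, 100)  # approximately (0.502, 0.691)
```

You would compute one such interval per $\theta$ from its own $S$ and $N$.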
EDIT
If $p_\theta$ looks like a "smooth" function of the discrete $\theta$, the way the Poisson distribution is a "smooth" function of $k$, then you could try to estimate a CI on that function, but you would need to know the form of that function. You would need to guess a function $f$ such that $p_\theta = f(\theta, \phi)$ for some unknown $\phi$, and then estimate $\hat \phi$ and its confidence interval. For example, when people estimate a Poisson distribution from data, the $k$ of the Poisson corresponds to your $\theta$, but what needs to be estimated is the $\lambda$, which comes from the assumption that the form of the distribution is the Poisson.
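To make the parametric idea concrete, here is a minimal sketch under an assumed (entirely hypothetical) form: $p_\theta$ is a Gaussian-shaped bump in $\theta$ with unknown location $\phi = \mu$, fitted by maximizing the binomial log-likelihood over a grid. The shape, the amplitude, and the simulated data are all assumptions for illustration:

```python
import math
import random

random.seed(0)

# Assumed form: p_theta = a * exp(-(theta - mu)^2 / (2 s^2)), phi = mu unknown
thetas = list(range(10))

def f(theta, mu, a=0.8, s=2.0):
    return a * math.exp(-(theta - mu) ** 2 / (2 * s ** 2))

# Simulate N binary trials at each theta with a true mu of 4
N, true_mu = 200, 4
successes = [sum(random.random() < f(t, true_mu) for _ in range(N))
             for t in thetas]

def log_lik(mu):
    """Binomial log-likelihood of the observed successes given mu."""
    ll = 0.0
    for t, s_count in zip(thetas, successes):
        p = min(max(f(t, mu), 1e-12), 1 - 1e-12)  # clamp away from 0 and 1
        ll += s_count * math.log(p) + (N - s_count) * math.log(1 - p)
    return ll

# Grid-search MLE for mu over [0, 9]
grid = [i / 100 for i in range(0, 901)]
mu_hat = max(grid, key=log_lik)
```

A CI on $\hat\mu$ could then come from the likelihood curvature or from bootstrapping the fit; the point is only that the inference target becomes $\phi$, not the individual $p_\theta$.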
EDIT 2
If you want to stay non-parametric and want to know how likely any given $\theta$ is to be optimal, then you could take a bootstrap sample from the $N$ trials for each $\theta$, or equivalently a size-$N$ binomial sample with $p = \hat p_\theta$, and record which $\theta$'s sample had the highest success rate. Repeat to get a distribution over $\theta$ for producing the highest-success-rate sample. That seems like a reasonable proxy for what you are looking for, if you do enough repetitions.
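The resampling loop above can be sketched as follows; the success counts here are invented for illustration, and the binomial draw stands in for a bootstrap resample of the binary trials:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical observed data: successes out of N trials at each theta
N = 100
successes = {0: 55, 1: 62, 2: 70, 3: 66, 4: 58}
p_hat = {t: s / N for t, s in successes.items()}

def binomial_draw(n, p):
    """Draw from Binomial(n, p); equivalent to bootstrapping n binary trials."""
    return sum(random.random() < p for _ in range(n))

# Resample every arm B times and record which theta wins each time
B = 2000
wins = Counter()
for _ in range(B):
    rates = {t: binomial_draw(N, p) / N for t, p in p_hat.items()}
    wins[max(rates, key=rates.get)] += 1

# Estimated probability that each theta is optimal
prob_optimal = {t: wins[t] / B for t in p_hat}
```

With these made-up counts, the $\theta$ with the highest observed rate wins most often, but nearby contenders pick up a nontrivial share, which is exactly the uncertainty you are trying to quantify.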
Alternatively, you could try to estimate the distribution of the regret from choosing $\hat \theta_{opt}$, the parameter value that was optimal in your initial trials. Do the same resampling as above, but each time record the difference between the success rate of the sample produced by $\hat \theta_{opt}$ and the highest sample success rate of that iteration.
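A sketch of the regret variant, reusing the same hypothetical success rates as before:

```python
import random

random.seed(2)

# Same hypothetical setup: estimated success probability per theta
N = 100
p_hat = {0: 0.55, 1: 0.62, 2: 0.70, 3: 0.66, 4: 0.58}
theta_opt = max(p_hat, key=p_hat.get)  # best theta in the original trials

def binomial_draw(n, p):
    """Draw from Binomial(n, p), standing in for a bootstrap resample."""
    return sum(random.random() < p for _ in range(n))

# Each iteration: regret = best resampled rate minus theta_opt's resampled rate
B = 2000
regrets = []
for _ in range(B):
    rates = {t: binomial_draw(N, p) / N for t, p in p_hat.items()}
    regrets.append(max(rates.values()) - rates[theta_opt])

mean_regret = sum(regrets) / B
```

The regret is zero whenever $\hat \theta_{opt}$ wins the resample, so its distribution has a point mass at zero plus a tail quantifying how much you could be giving up when it does not.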