Suppose I have an (expensive-to-run, mechanical) system that, as a decent approximation, involves two independent random sources of error. One of these is something I can tinker with in the hope of reducing the error. I want to determine whether a change I make can be claimed to be an improvement or not.
Typically, when I run the system, I take some samples (I want to take as few as will permit a sufficient confidence level; remember, the system is expensive to run) and then calculate a standard deviation for the aggregated error. I then tinker with the system, take some more samples, and compute a new SD value. But of course, for a small sample it's not safe to assume that an improved SD in this sample reflects an actual improvement rather than luck.
I'm familiar (though not particularly competent) with the idea of confidence as it is taught in high school, but that has always been presented in terms of testing whether a change in the mean is significant. That formula doesn't make sense (to me, at least!) for determining whether a change in SD is significant.
So, I'd like to understand two things:
1) How can I determine whether a change in SD between two samples of particular sizes is significant at a given level?
2) Can I estimate the necessary sample size to obtain a result that's significant at a particular level (and if so, how)?
There are several issues in your Question (and in the Comments). I will try to deal with some of them. Suppose $X_i$'s and $Y_i$'s are independent random samples of sizes $n$ and $m$ respectively.
Hypothesis and test. If both samples come from normal populations, then $(S_x^2/\sigma_x^2)\big/(S_y^2/\sigma_y^2) \sim \mathsf{F}(n-1, m-1).$ In particular, under $H_0: \sigma_x^2/\sigma_y^2 = 1$ (population variances $\sigma_x^2$ and $\sigma_y^2$ are equal), the statistic $F = S_x^2/S_y^2 \sim \mathsf{F}(n-1, m-1).$ This fact can be used to test $H_0$ against $H_a: \sigma_x^2/\sigma_y^2 > 1.$
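This distributional fact can be checked by simulation. A sketch in Python (sample sizes, seed, and variable names here are arbitrary illustrative choices; the answer's own computations used R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)   # illustrative seed
n, m, reps = 12, 8, 20_000

# Under H0 (equal population variances), the ratio of sample
# variances should follow an F(n-1, m-1) distribution.
ratios = np.array([
    rng.normal(0, 1, n).var(ddof=1) / rng.normal(0, 1, m).var(ddof=1)
    for _ in range(reps)
])

crit = stats.f.ppf(0.95, n - 1, m - 1)   # 95th percentile of F(n-1, m-1)
print((ratios > crit).mean())            # should be close to 0.05
```

About 5% of the simulated ratios should exceed the theoretical 95th percentile, consistent with the claimed null distribution.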
In R statistical software, the test can be performed as shown below. I begin by generating two normal samples of sizes $n = 12$ and $m = 8$ with a 4:1 ratio of population variances (a 2:1 ratio of population standard deviations), so we hope to reject $H_0.$
The boxplots clearly show that the $X_i$'s are more variable than the $Y_i$'s. The sample variances and their ratio $F$ are as follows:
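The original R session is not reproduced here; the same steps can be sketched in Python with NumPy. The seed and population means below are my own choices, so the resulting variances and ratio will differ from the $F = 6.0918$ quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(2024)      # illustrative seed
# n = 12 matches the F(11, 7) degrees of freedom used later in the
# answer; population means are irrelevant for a variance comparison.
x = rng.normal(100, 2, size=12)        # SD 2 -> population variance 4
y = rng.normal(100, 1, size=8)         # SD 1 -> population variance 1

vx, vy = x.var(ddof=1), y.var(ddof=1)  # sample variances
print(vx, vy, vx / vy)                 # variance ratio F
```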
The test below rejects $H_0$ at the 5% level because the P-value is $0.01225 < 0.05.$
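In R this is a call to `var.test`; an equivalent sketch in Python with scipy, using the F-statistic quoted in the text and the $\mathsf{F}(11,7)$ reference distribution, is:

```python
from scipy import stats

F_obs = 6.0918                       # variance ratio quoted in the text
p_value = stats.f.sf(F_obs, 11, 7)   # upper-tail area of F(11, 7)
print(round(p_value, 5))             # the text reports 0.01225
```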
The P-value of this 1-sided test is computed as the area under the density curve of $\mathsf{F}(11,7)$ to the right of $F = 6.0918.$
If you are doing this test without software, using printed tables of the F-distribution, the 5% critical value is found by cutting area 0.05 from the upper tail of $\mathsf{F}(11, 7),$ which the printed table will show as something like 3.60 (perhaps by interpolation).
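With software there is no need for tables or interpolation; the critical value is a one-line quantile computation (sketched here with scipy):

```python
from scipy import stats

# 5% upper-tail critical value of F(11, 7)
crit = stats.f.ppf(0.95, 11, 7)
print(round(crit, 3))   # about 3.60
```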
The figure below shows the density function of $\mathsf{F}(11, 7),$ with the critical value 3.603 (dashed vertical line) and the F-statistic 6.0918 (solid vertical line). The area beneath the curve to the right of 6.0918 is the P-value 0.01225.
Power of the F-test. There are two difficulties with the F-test described just above. First, it may not give reliable answers unless the data are from normal populations, as noted in a Comment by @awkward. Various tests for differences in variances ('heteroscedasticity') that are less sensitive to non-normal data are discussed in intermediate-level applied statistics books and implemented in software packages such as R. One of them is the 'Levene test'.
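For example, SciPy provides `scipy.stats.levene`; with `center='median'` it is the robust Brown-Forsythe variant of the Levene test. The data below are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)    # illustrative seed
x = rng.normal(100, 2, size=12)    # more variable sample
y = rng.normal(100, 1, size=8)

# Brown-Forsythe variant of Levene's test (robust to non-normality)
stat, p = stats.levene(x, y, center='median')
print(stat, p)
```

With samples this small, a non-significant result would not be surprising even though the population variances really differ, for the power reasons discussed next.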
Second, the F-test and its competitors (for non-normal data) have notoriously low power. That is, they may fail to detect real differences in population variances as reflected in sample variances.
The power of this F-test depends on the ratio $\sigma_x^2/\sigma_y^2$ and the sizes of the samples. For a reasonably complete discussion of the power of this F-test, see this Q & A.
Here is a simulation that approximates the power for a test at the 5% level, against an alternative with a 4:1 ratio of population variances (2:1 for SDs) and for sample sizes $n = m = 10$ (population means are irrelevant). The idea is to run a large number of tests on data simulated to these specifications and see how often the null hypothesis is rejected.
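A self-contained version of such a simulation can be sketched in Python (the original used R; the seed and loop structure here are my own choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)   # illustrative seed
n = m = 10
reps = 10_000
crit = stats.f.ppf(0.95, n - 1, m - 1)   # 5% one-sided critical value

rejections = 0
for _ in range(reps):
    x = rng.normal(0, 2, n)              # SD 2 -> variance 4
    y = rng.normal(0, 1, m)              # SD 1 -> variance 1
    if x.var(ddof=1) / y.var(ddof=1) > crit:
        rejections += 1

power = rejections / reps
print(power)   # proportion of tests that reject H0
```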
The power is about 63%. (About 37% of 4:1 differences in variances will go undetected. The sample sizes in the example above are similar, so it was not a 'sure thing' that we would reject there.) However, with larger sample sizes $n = m = 25,$ the power is slightly above 95%.
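These power values need not be simulated: under the alternative, $F\big/(\sigma_x^2/\sigma_y^2) \sim \mathsf{F}(n-1, m-1),$ so the power is an upper-tail area of that distribution. A sketch with scipy:

```python
from scipy import stats

def power_var_ratio(n, m, ratio, alpha=0.05):
    """Exact power of the one-sided F-test of equal variances.

    Under the alternative, F/ratio ~ F(n-1, m-1), so the rejection
    probability is the upper-tail area beyond crit/ratio.
    """
    crit = stats.f.ppf(1 - alpha, n - 1, m - 1)
    return stats.f.sf(crit / ratio, n - 1, m - 1)

print(power_var_ratio(10, 10, 4))   # near the simulated 63%
print(power_var_ratio(25, 25, 4))   # slightly above 95%
```

As a sanity check, at ratio 1 (i.e., under $H_0$) the "power" reduces to the significance level $\alpha.$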
Note: If you can determine a base-level variance $\sigma_0^2$ for the process in its current state, then it will be easier to detect whether a single sample (after 'tinkering' and perhaps improvement) has a smaller variance. Details of that would be for another discussion.