I am new to algebra and help would be more than welcome to tell me if the process I have built is OK, and if my attempt to apply formula on two sets to create a new one is also OK.
Context
I have a database containing species records (e.g. 10 different species, with 100 rows by species ; in columns are quantitative variables). I want to compute Euclidean distances (considering all variables) between randomly sampled 20 row by species, between each species and species h. I want to bootstrap this calculation an increasing number of time to assess the effect of iteration augmentation on results linearity (to say: OK, we have reach linearity, results should be OK). The aim is to show a figure like that (1 line color = 1 species Euclidean distance to species h):
To explain the process, I illustrate it with distance calculation between species and species h:
- We define sets and as original species records.
- Then we define and as proper subsets of and composed of 20 records randomly sampled in and , without replacement, so that probability P(r) for records to be selected is:
- Then we define the following function to compute the mean Euclidean distance between all records of and
With n = 20 (variables) and n' = 20 (randomly sampled records ; size of and ).
- Then we define set D, which contains Euclidean distances between records and :
- Finally, we define set B containing number of iterations of the whole process, from sampling event (with replacement between each iteration, giving a probability for records to be selected between iterations) to sed D computation. The following formula allow computing set M:
With B = rounded values.
Mainly, I am not pretty sure that I have the right to build this way...
Could you please tell me if it is OK to call functions this way in sets ? And if you spot mistakes in the process ?
Many thanks for your help !