Scaling data sets to match each other with least error


I have two data sets, A and B, with values $a_1,a_2,\dots,a_n$ and $b_1,b_2,\dots,b_n$ that represent measurements of the same elements $x_1,x_2,\dots,x_n$. For instance, element $x_1$ has value $a_1$ in the first data set and $b_1$ in the second.

These data sets have very different magnitudes, but their relative values should be the same ($\frac{a_i}{a_j}=\frac{b_i}{b_j}$). In practice they are not, because the data come from experiments. I would like to obtain a scaling constant to multiply data set B by so that it matches data set A with the least error.

What is the best method to do this?

Edit: Also, each value in B has a measurement uncertainty. How can I take this into account? I should give more weight to matching the values that have the least uncertainty.

BEST ANSWER

Here is an example using data similar to yours. At a hospital, blood tests are routinely performed on newborn babies to determine whether too many red cells are present in the blood. Two methods of assaying blood cells are in common use: hematocrit (which determines the percent by volume of red cells) and hemoglobin (which is found by making a chemical determination of the amount of hemoglobin in the blood, expressed as grams per deciliter).

We have laboratory measurements of both, called LabCrit and LabHgb for 43 newborn babies. A regression 'through the origin' ($0$ y-intercept), as suggested by @AdrianKeister (+1), gives the following result:

Regression Equation

LabHgb = 0.340060 LabCrit

R-sq = 99.97%

[Scatter plot of LabHgb versus LabCrit with the fitted line through the origin]

Notes:

(1) One reason for monitoring newborns in this way is that some babies are born with too many red cells, a potentially life-threatening condition that is easily remedied if detected immediately.

(2) It is well known that hemoglobin (in g/dl) is about $1/3$ of hematocrit (in %), so our findings match what has been observed before.

(3) The reason for this particular study was to determine the feasibility of using a new optical method to assay red blood cells.

(4) Data from Herzog and Felton, "Hemoglobin screening for normal newborns," J. Perinatology, XIV, 4, July 1994.
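The regression through the origin above can be sketched in a few lines of NumPy. The data below are synthetic stand-ins (the study's 43 actual measurement pairs are not reproduced here), simulated around the reported slope of about $0.34$; the closed-form slope for a zero-intercept fit is $\hat m = \sum x_i y_i / \sum x_i^2$.

```python
import numpy as np

# Synthetic stand-in for the 43 (LabCrit, LabHgb) pairs from the study.
# The true slope was reported as about 0.34, so we simulate around it.
rng = np.random.default_rng(0)
lab_crit = rng.uniform(40, 70, size=43)              # hematocrit, percent
lab_hgb = 0.34 * lab_crit + rng.normal(0, 0.1, 43)   # hemoglobin, g/dl

# Regression through the origin: slope = sum(x*y) / sum(x^2)
slope = np.sum(lab_crit * lab_hgb) / np.sum(lab_crit ** 2)
print(f"LabHgb = {slope:.4f} LabCrit")
```

With real laboratory data this recovers the reported coefficient of roughly $0.34$; here it recovers the slope used in the simulation.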

ANSWER

This looks like a linear least-squares fit problem. If you imagine $y=\{a\}$ and $x=\{b\}$ (in a shameless abuse of notation), you're interested in a relationship $y=mx+c$, most likely with $c=0$. Most solvers let you force the line through the origin and thus eliminate $c$. The $m$ found by the solver is the solution to your problem. Taking the measurement uncertainty into account is more difficult. Thinking...
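For the zero-intercept case the solver is not even necessary: minimizing $\sum_i (a_i - m b_i)^2$ gives the closed form $\hat m = \sum_i a_i b_i / \sum_i b_i^2$. A minimal sketch with made-up data (the arrays below are hypothetical, chosen so B is roughly A scaled by $1/10$):

```python
import numpy as np

# Hypothetical measurements of the same elements on two instruments.
a = np.array([10.2, 20.1, 29.8, 40.5, 50.1])   # data set A (target scale)
b = np.array([1.01, 2.03, 2.99, 4.02, 5.00])   # data set B (to be scaled)

# Least-squares slope through the origin:
# m minimizes sum((a - m*b)^2), giving m = sum(a*b) / sum(b*b).
m = np.dot(a, b) / np.dot(b, b)
scaled_b = m * b   # B rescaled to match A with least squared error
print(m)           # close to 10 for this data
```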

[EDIT]: I am imagining that you could use the uncertainty to weight the points: more uncertainty means less weight, and less uncertainty means more weight. This would have to be a reversible process, however. It sounds a bit like feature scaling, except that we're not normalizing every data point the same way. You could probably get this to work, I think.
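One standard way to realize this weighting idea is inverse-variance weighting: give each point weight $w_i = 1/\sigma_i^2$ and minimize $\sum_i w_i (a_i - m b_i)^2$, which yields $\hat m = \sum_i w_i a_i b_i / \sum_i w_i b_i^2$. A sketch with hypothetical data and uncertainties (note this treats the $\sigma_i$ as simple weights; a rigorous treatment of uncertainty in the regressor itself would need an errors-in-variables model):

```python
import numpy as np

# Hypothetical data: each value in B comes with a one-sigma uncertainty.
a = np.array([10.2, 20.1, 29.8, 40.5, 50.1])
b = np.array([1.01, 2.03, 2.99, 4.02, 5.00])
sigma = np.array([0.05, 0.02, 0.10, 0.01, 0.20])   # uncertainty of each b_i

# Inverse-variance weights: precise points count more.
w = 1.0 / sigma**2

# Weighted least squares through the origin:
# m minimizes sum(w * (a - m*b)^2)  =>  m = sum(w*a*b) / sum(w*b*b)
m = np.sum(w * a * b) / np.sum(w * b * b)
print(m)   # close to 10, dominated by the low-uncertainty points
```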