I'm trying to implement an areal interpolation technique described in https://www.tandfonline.com/doi/abs/10.1080/00045608.2011.627054 but I'm struggling to understand how the linear programming formulation in the paper translates to something like lpSolve using R.
Basically the task is to interpolate population from a set of source zones, each with an aggregate population count, to an arbitrary set of target zones, using ancillary land cover data that classifies different land uses. The area of intersection of each land cover class with each source zone is used as an independent variable in a quantile regression against the source zone populations. The coefficients of that regression, taken at the appropriate percentile for each source zone, give an estimate of the population density for each land class in that zone. Multiplying these densities by the corresponding areas of intersection in each target zone, and summing over the classes, then gives a population estimate for each target zone.
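To check my understanding of that last step, here's a tiny sketch in R (all of the numbers are made up):

```r
# Hypothetical final step: given estimated per-class densities beta_j,
# a target zone's population is the sum over classes of
# density * area of that class intersecting the zone.
beta <- c(9.5, 2.1)             # made-up densities for 2 land classes
A <- matrix(c(4, 1,             # A[t, j]: area of class j intersecting
              2, 6),            # target zone t (made-up values)
            nrow = 2, byrow = TRUE)
target_pop <- as.vector(A %*% beta)
target_pop                      # [1] 40.1 31.6
```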
The formulation is:
Minimise $\sum_{i}\left[p\lambda_{i}^{-}+(1-p)\lambda_{i}^{+}\right]$
Subject to:
$\sum_{j}\beta_{j}X_{ij}+\lambda_{i}^{-}-\lambda_{i}^{+} = P_{i}$ (for all $i$)
$\beta_{j},\lambda_{i}^{-},\lambda_{i}^{+} \geq 0$
Where:
$p$ is the quantile parameter ($0 < p < 1$)
$\beta_{j}$ is the estimated population density for land class $j$
$X_{ij}$ is the amount of the $j^{th}$ land cover class associated with the $i^{th}$ observation
$P_{i}$ is the population count for the $i^{th}$ observation
$\lambda_{i}^{-}$ and $\lambda_{i}^{+}$ are deviational variables representing the amount of under and over estimation for the $i^{th}$ observation by the regression model
Essentially it's quantile regression without an intercept term. I can see how the formulation is meant to minimise the weighted absolute deviation from the regression line for the chosen quantile; I'm just not sure how the quantile parameter fits into the standard formulation of a linear program, or how exactly the deviational variables need to be treated (are they actually any different from standard decision variables?).
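For what it's worth, here's my current reading of how $p$ enters, assuming the constraint is $\sum_{j}\beta_{j}X_{ij}+\lambda_{i}^{-}-\lambda_{i}^{+}=P_{i}$, i.e. with a minus sign on $\lambda_{i}^{+}$ (please correct me if this is wrong). Writing the residual as $r_{i}=P_{i}-\sum_{j}\beta_{j}X_{ij}$, the constraint forces $\lambda_{i}^{-}-\lambda_{i}^{+}=r_{i}$, and at an optimum the solver never makes both positive, since reducing both by the same amount keeps the constraint satisfied while lowering the objective. So:

$\lambda_{i}^{-}=\max(r_{i},0)$ and $\lambda_{i}^{+}=\max(-r_{i},0)$

meaning observation $i$ contributes $p\,r_{i}$ to the objective when it's underestimated ($r_{i}>0$) and $(1-p)(-r_{i})$ when it's overestimated ($r_{i}<0$), which looks to me like exactly the quantile regression "check function" $\rho_{p}(r_{i})$.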
I'm also slightly confused in that $\beta_{j}$, which is actually the value I'm interested in, is not part of the objective function, but only appears in the constraints - is this 'allowed' in linear programming? I guess it's just the same as it having a zero coefficient in the objective function, but I haven't found any other examples of linear programs that look like this, and as a geography student this is all fairly new to me!
Any help appreciated!
EDIT
OK, thanks to David M's response confirming what the decision variables are, I think I now know what vectors and matrices I need to pass to the lp function of the lpSolve package in R to carry out the optimisation!
Does the following look sensible?
direction: min
objective.in (vector of coefficients of the objective function, one per decision variable): a zero for each $\beta_{j}$ (since the betas don't appear in the objective), followed by alternating p, (1-p), p, (1-p) ... for as many i as I have
const.mat (matrix of constraint coefficients, one row per constraint, one column per decision variable):
$X_{i,j}$, $X_{i,j+1}$ ... for as many j as there are for i ..., then a 1 in the $\lambda_{i}^{-}$ column and a -1 in the $\lambda_{i}^{+}$ column, with zeros in every other observation's $\lambda$ columns
$X_{i+1,j}$, $X_{i+1,j+1}$ ... and so on, one such row for each i
A row of 1s of length: number of j (for $\beta_{j}$) + number of i (for $\lambda_{i}^{-}$) + number of i (for $\lambda_{i}^{+}$) - this last row is for the final constraint specifying that $\beta_{j}$, $\lambda_{i}^{-}$ and $\lambda_{i}^{+}$ must all be >= 0 - or should this be defined elsewhere? (A single row of 1s would only constrain the *sum* of the variables to be nonnegative, not each variable individually, so I suspect this isn't right anyway.)
const.dir (vector of directions for each constraint): == for all except the last element which is >= for the final constraint as above?
const.rhs (vector of numeric values for the right hand side of the constraints): $P_{i}$, $P_{i+1}$, ... etc. for all i then zero for the final constraint as above?
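In case it helps anyone check my working, here's a toy sketch of that setup in R. The data values are entirely made up, and note that I've given each $\lambda_{i}^{+}$ a coefficient of -1 in its constraint row, since with both deviational variables entering positively the model could never overestimate an observation:

```r
library(lpSolve)

# Toy example: 4 source zones (i), 2 land cover classes (j).
# X[i, j] = area of class j in zone i; P[i] = population of zone i.
X <- matrix(c(10,  2,
               8,  5,
               3,  9,
               1, 12), nrow = 4, byrow = TRUE)
P <- c(105, 110, 75, 70)
p <- 0.5          # quantile parameter (median here)
n <- nrow(X)      # number of observations i
k <- ncol(X)      # number of land classes j

# Decision variables, in column order:
# beta_1 .. beta_k, then (lambda_i_minus, lambda_i_plus) for each i.
# Betas get zero objective coefficients; each lambda pair gets p, (1 - p).
objective.in <- c(rep(0, k), rep(c(p, 1 - p), n))

# One equality constraint per observation:
# sum_j beta_j * X[i, j] + lambda_i_minus - lambda_i_plus == P[i]
const.mat <- cbind(X, matrix(0, n, 2 * n))
for (i in 1:n) {
  const.mat[i, k + 2 * i - 1] <-  1   # lambda_i_minus
  const.mat[i, k + 2 * i]     <- -1   # lambda_i_plus
}
const.dir <- rep("==", n)
const.rhs <- P

sol  <- lp("min", objective.in, const.mat, const.dir, const.rhs)
beta <- sol$solution[1:k]   # estimated density per land class
beta
```

(I've left the final >= 0 row out of this sketch; whether it's needed at all is part of my question above.)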
I hope that makes sense, apologies if it's not a standard way of describing this stuff!
I'm assuming also that the deviational terms in $\sum_{j}\beta_{j}X_{ij}+\lambda_{i}^{-}-\lambda_{i}^{+} = P_{i}$ (for all $i$) are meant to sit outside the summation, and that $\lambda_{i}^{+}$ enters with a minus sign: since both variables are nonnegative, if they were both simply added the model could never overestimate any observation. It's not very clear the way it's written in the paper, but I'm pretty sure it wouldn't make sense otherwise!
Thanks again for any help.
Dan
EDIT
Ah, I just spotted the following in the lpSolve docs:
"Note that every variable is assumed to be >= 0!"
Which answers my question about the final constraints above - I can leave those bits out, which makes things a bit simpler.