I need some suggestions on how to go above solving this problem:
Suppose I have $n$ vectors: $X_1, X_2, ..., X_n$, and a known vector $Y$. Each vector has $T$ rows.
I want to select only $3$ out of $n$ vectors, say $\: X_i, X_j, X_k\: $, along with finding some corresponding thresholds $\: \chi_i, \chi_j, \chi_k \: $ such that the following is maximized (globally).
$\frac{1}{T} \, \sum_{t=1}^T Y_t \, * \, \Big[ 1\{\: X_{t, i} > \chi_i \: \& \: X_{t, j} > \chi_j \: \& \: X_{t, k} > \chi_k \} - 1\{\: X_{t, i} < \chi_i \: | \: X_{t, j} < \chi_j \: | \: X_{t, k} < \chi_k \} \Big] $
where $\: 1\{.\}\: $ is the Indicator function, and I need to select $\: i, j, k\: $ such that $\; i \neq j \neq k$.
I tried solving this using brute force, running massive loops, over all combinations of $\: i, j, k \: $, i.e. looping over $\: X_i, X_j, X_k \: $, and thresholds $\: \chi_i, \chi_j, \chi_k \: $ .
Bruce force gives me good solution, but brute force solutions can take a while. For my problem $n$ is typically $2000$ and $T$ is typically $30,000$ and repeating the bruce force every time on a new data is time-consuming.
To speed up, I tried to restructure this problem approximately as a "Greedy" Decision Trees (GBM, Random Forest, etc.) to get some approximate solutions (such as picking up the most promising path among all possible paths), but "Greedy" Decision Tree solutions are typically vastly inferior to running a brute force solutions, e.g. generally, the best path from a Tree fitting gives a value that 20%-30% of the optimal solution.
As an alternative to a brute force approach, I was wondering if anyone had any idea (any pointers will do too) on how I could go about converting the above maximization into a optimization problem (convex, or non-convex) where I am hoping to use a solver to speed up my work?
Any papers that deal with such a problem will do too.
You can solve the problem via mixed integer linear programming as follows. For $i \in \{1,\dots,n\}$ and $s\in\{1,2,3\}$, let binary decision variable $u_{i,s}$ indicate whether vector $X_i$ is selected to be in "slot" $s$. For $s\in\{1,2,3\}$, let continuous decision variable $\chi_s$ be the threshold, let $\overline{\chi_s}$ be an upper bound on $\chi_s$, and let $M_{s,t}=\overline{\chi_s}-\min_i X_{t,i}$. For $t\in\{1,\dots,T\}$, let binary decision variable $v_t$ indicate whether all three selected vectors $X_i$ exceed their respective thresholds in component $t$. The problem is to maximize $\sum_t Y_t v_t$ subject to linear constraints: \begin{align} \sum_i u_{i,s} &= 1 &&\text{for all $s$} \tag1\\ \chi_s - \sum_i X_{t,i} \cdot u_{i,s} &\le M_{s,t}(1-v_t) &&\text{for all $s$ and $t$} \tag2\\ \sum_i i\cdot u_{i,s} + 1 &\le \sum_i i\cdot u_{i,s+1} &&\text{for $s\in\{1,2\}$} \tag3 \end{align} Constraint $(1)$ selects exactly one $X_i$ per slot. Constraint $(2)$ enforces the logical implication $(u_{i,s} = 1 \land v_t = 1) \implies X_{t,i} \ge \chi_s$. Constraint $(3)$ prohibits selecting the same vector twice.
You can recover your original objective function value as $\frac{1}{T}\sum_t Y_t(2 v_t-1)$.