Finding the smallest sub-family of subsets needed to form a new subset


TL;DR

I have a universe $U$ of items $u_i$ and a family $F$ of subsets of $U$ (call them $P_j \subseteq U$).

Given a new target subset $P_{new} \subseteq U$, I would like to find the sub-family $C \subseteq F$ whose members produce $P_{new}$ by adding together (or subtracting) as few subsets $P_j$ as possible.

That's the best I can do. Hopefully an example makes it clearer:


Example

For instance, if we start with the following family of subsets:

$$ \begin{align} F = \{&P_1 = \{a\},\ P_2 = \{b\},\ \dots,\ P_{16} = \{p\}, \\ &P_{17} = \{a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p\}, \\ &P_{18} = \{a, b, c, d, e\}, \\ &P_{19} = \{g, h, i\}\} \end{align} $$

When requested to compute $\{a, b, c, d, e, f, g, h, i\}$, the simplest thing to do is calculate:

$$\{a, b, c, d, e, f, g, h, i\} = \{a\} + \{b\} + \{c\} + \dots + \{h\} + \{i\}$$

This isn't optimal though (requires 8 additions). For instance, I know that I could more quickly produce the set if I instead took advantage of my previously computed sets (using 2 additions):

$$ \begin{align} P_{new} &= \{a, b, c, d, e, f, g, h, i\}\\ &= \{a, b, c, d, e\} + \{f\} + \{g, h, i\} \\ &= P_{18} + P_{6} + P_{19} \\ \therefore\ C &= \{ P_{6}, P_{18}, P_{19} \} \subseteq F \end{align} $$


Example 2

To complicate things further, I also want to know (if possible) when involving subtraction might be optimal:

$$\{e, f, g, h, i\} = \{e\} + \{f\} + \{g, h, i\}$$

This is the best solution using only addition (2 operations), but I could have gotten there even faster with 1 subtraction:

$$\{e, f, g, h, i\} = \{a, b, c, d, e, f, g, h, i\} - \{a, b, c, d\}$$
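For small families, the optimal signed combination can be found by brute force over coefficient assignments in $\{-1, 0, +1\}$. Below is a minimal Python sketch of that search; the universe is cut down to $a$–$i$ for brevity, and the particular sets in `family` are hand-picked for illustration rather than taken verbatim from $F$:

```python
from collections import Counter
from itertools import product

def best_combination(family, target):
    """Search every assignment of coefficients in {-1, 0, +1} to the sets
    in `family` and return the feasible assignment using fewest sets."""
    best = None
    names = list(family)
    for coeffs in product((-1, 0, 1), repeat=len(names)):
        total = Counter()
        for c, name in zip(coeffs, names):
            for item in family[name]:
                total[item] += c
        # Feasible: each target item counted exactly once, all others zero.
        if all(total[i] == 1 for i in target) and \
           all(v == 0 for i, v in total.items() if i not in target):
            used = [(c, n) for c, n in zip(coeffs, names) if c != 0]
            if best is None or len(used) < len(best):
                best = used
    return best

# Hypothetical pre-computed family (universe restricted to a..i).
family = {
    "P5": {"e"}, "P6": {"f"},
    "P17": set("abcdefghi"),
    "P18": set("abcd"),
    "P19": {"g", "h", "i"},
}
print(best_combination(family, set("efghi")))  # [(1, 'P17'), (-1, 'P18')]
```

The search is $O(3^{|F|})$, so it only illustrates the objective; doing this efficiently is exactly the hard part of the question.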


Why I need this

Each subset $P_j$ has a value $p_j = f(P_j)$ that can be computed. The function $f$ is additive over disjoint sets, so $f(P_1 \cup P_2) = f(P_1) + f(P_2)$ whenever $P_1$ and $P_2$ are disjoint.

When my application starts, I initially calculate only the value $p_j$ for each singleton set $P_j = \{u_i\}$. For example:

$$ \begin{align} P_1 = \{a\} ,&\ \ p_1 = f(P_1) = 5 \\ P_2 = \{b\} ,&\ \ p_2 = f(P_2) = 20 \\ P_3 = \{c\} ,&\ \ p_3 = f(P_3) = 8 \\ ...\ & \end{align} $$

I then have to start servicing requests for different subsets. For example:

$$ \begin{align} P_{18} &= \{a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p\}\ &f(P_{18}) &= 400 \\ P_{19} &= \{b, c, d\}\ &f(P_{19}) &= 43\\ P_{20} &= \{g, h, i\}\ &f(P_{20}) &= 30 \\ ...& \end{align} $$

My goal is to process these requests as fast as possible. For early requests, I unavoidably have to spend a lot of time adding up $p_j$ values. But since these values are additive, there should be faster ways to process later requests by taking advantage of sets for which I've already computed $p_j$.

If $P_{21} = \{b, c, d, g, h, i\}$ is requested, I don't want to waste precious resources retrieving the six singleton values $p_2, p_3, p_4, p_7, p_8, p_9$ and then adding them in 5 lengthy operations, when a single operation $p_{21} = p_{19}+p_{20}$ would have sufficed.
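A minimal caching sketch of this idea: before falling back to summing singleton values, check whether the request splits into two already-computed disjoint sets. Only $p_1, p_2, p_3$ and the values $43$ and $30$ come from the examples above; the other singleton values are invented for illustration:

```python
def serve(request, cache):
    """Return f(request), preferring one addition of two cached disjoint
    sets over summing per-item values.  `cache` maps frozenset -> value."""
    request = frozenset(request)
    if request in cache:
        return cache[request]          # 0 operations
    # Try a single addition of two already-computed sets.
    for a, va in cache.items():
        b = request - a
        if a <= request and b in cache:
            cache[request] = va + cache[b]
            return cache[request]      # 1 addition
    # Fall back: sum the singleton values.
    value = sum(cache[frozenset({i})] for i in request)
    cache[request] = value
    return value

cache = {frozenset({c}): v for c, v in
         zip("abcdefghi", [5, 20, 8, 15, 7, 6, 12, 9, 9])}
cache[frozenset("bcd")] = 43   # f(P_19) from the example
cache[frozenset("ghi")] = 30   # f(P_20) from the example
print(serve("bcdghi", cache))  # p_19 + p_20 = 73
```

A real implementation would want an index over the cache (e.g. by set size) rather than this linear scan, but the scan keeps the sketch short.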


Not set-theory?

This might actually be a better fit for linear algebra, if formulated as follows:

If I have the following known equations and values:

$$ \begin{align} P_{1} &= a,\ P_{2} = b,\ \dots,\ P_{7} = g &f(P_{1}) &= p_{1},\ \dots\\ P_{9} &= a + b + c + d &f(P_{9}) &= p_{9} \\ P_{10} &= d + e + f + g &f(P_{10}) &= p_{10} \\ \end{align} $$

And I wish to calculate

$$ \begin{align} P_{11} &= a + b + c + d + e + f + g &f(P_{11}) ? \\ \end{align} $$

I want to be able to discover that the fastest solution comes from

$$ \begin{align} P_{11} &= P_{9} + P_{10} - P_{4} \\ P_{11} &= (a + b + c + d) + (d + e + f + g) - (d) \\ &= (a + b + c + 2d + e + f + g) - (d) \\ &= a + b + c + d + e + f + g\ \checkmark\\ \mathord{\therefore}\ p_{11} &= p_{9} + p_{10} - p_{4} \\ \end{align} $$
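As a sanity check of this derivation, the coefficients can be recovered numerically: stack the indicator vectors of the known sets as columns of a matrix $A$ and solve $Aq = v(P_{11})$ by least squares. (This is only a sketch: here the three columns are linearly independent, so the solution is unique; in general least squares alone won't pick the cheapest of many solutions.)

```python
import numpy as np

# Universe a..g; columns of A are indicator vectors of the known sets.
universe = "abcdefg"
sets = {"P4": "d", "P9": "abcd", "P10": "defg"}
v = lambda s: np.array([1.0 if c in s else 0.0 for c in universe])

A = np.column_stack([v(s) for s in sets.values()])   # 7 x 3
target = v("abcdefg")                                # indicator of P_11

# A zero-residual least-squares solution means the combination is exact,
# and q holds the coefficients.
q, *_ = np.linalg.lstsq(A, target, rcond=None)
print(np.round(q, 6))   # coefficients for (P4, P9, P10): expect (-1, 1, 1)
```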

It's starting to look suspiciously like an NP-hard problem to me. If no one can come up with an elegant way of solving it, I would also accept a more elegant way of wording the problem (perhaps in terms of an existing well-known problem), or a bound on its complexity.

3 Answers

Accepted Answer

OK, first let's work on a formal representation of your problem.

  • Let $m$ be the number of computed sets.
  • Let $n$ be the number of elements in the universe
  • Let $U=\{u_i\}_{i=1}^{n}$ be the universe
  • Let $S$ be the set of all subsets of $U$
  • Let $\{P_j\}_{j=1}^{m}$ be the already computed sets
  • Let $A$ be an $n \times m$ matrix
  • Let $A_{ij} = \begin{cases} 1, & u_i \in P_j \\ 0, & u_i \notin P_j \end{cases}$
  • Let $c(x) = \begin{cases} 0, & x = 0 \\ 1, & x \ne 0 \end{cases}$
  • Let $v: S \rightarrow \mathbb{R}^n$ be the function that maps a subset of $U$ to its vector representation: $v(s) = \left[ v(s)_1, v(s)_2, \dots, v(s)_n \right]^T$ with $v(s)_i = \begin{cases} 1, & u_i \in s \\ 0, & u_i \notin s \end{cases}$
  • Let $f$ be the function as defined in the original question.

The first thing to note is that since $f$ is linear, if for a new input set $P^*$ we can find a vector $q \in \mathbb{R}^m$ such that $Aq = v(P^*)$, then $f(P^*) = \sum_{j=1}^m q_j \cdot f(P_j)$. Furthermore, note that the cost of computing $f(P^*)$ using the formula above is $C(q) = \sum_{j=1}^m c(q_j)$.

With all of that, we have our optimization problem: $$ \DeclareMathOperator*{\argmin}{argmin} q^*(P^*) = \argmin_q C(q) \text{ such that } Aq = v(P^*) $$
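To make the objective concrete, here is a brute-force sketch of this $\operatorname{argmin}$ for small $m$, with the simplifying restriction $q \in \{-1, 0, 1\}^m$ (larger integer weights are rarely useful here, but that restriction is an assumption of the sketch, not something the formulation guarantees):

```python
import numpy as np
from itertools import product

def argmin_C(A, v_target):
    """Enumerate q over {-1, 0, 1}^m and return the feasible q
    (i.e. A q = v_target) with the fewest nonzero entries C(q)."""
    m = A.shape[1]
    best = None
    for q in product((-1, 0, 1), repeat=m):
        q = np.array(q)
        if np.array_equal(A @ q, v_target):
            if best is None or np.count_nonzero(q) < np.count_nonzero(best):
                best = q
    return best

# Columns: singletons {e}, {f} and the sets {a..i}, {a..d}, {g,h,i}.
universe = list("abcdefghi")
cols = ["e", "f", "abcdefghi", "abcd", "ghi"]
A = np.array([[1 if u in s else 0 for s in cols] for u in universe])
v_target = np.array([1 if u in "efghi" else 0 for u in universe])
print(argmin_C(A, v_target))   # q = (0, 0, 1, -1, 0): {a..i} - {a..d}
```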

$C$ is a piecewise constant function, meaning it has discontinuities both in value and in derivative. Furthermore, if any of the existing subsets are not used in the optimal calculation, $C$ has a discontinuous derivative at the optimal point. The input vector $q$ can take any value in $\mathbb{R}^m$, and the constraint is a linear equality constraint.

So at first blush, this is a nonsmooth optimization problem with linear equality constraints.


There's also another way to think about the objective function. Suppose we have a guess about which variables are non-zero in the optimal solution, represented by a vector $b$ with $m$ entries, each either one or zero. Can we determine whether this choice of subsets is feasible, and if so, what the appropriate weights are? We can: simply remove the columns of $A$ and the entries of $q$ corresponding to the $0$ entries in $b$, and solve the resulting problem in a least squares sense. If the choice is feasible, the solution will have zero residual; otherwise it will have non-zero residuals.

If we use $b$ as our input variable instead of $q$, we end up with a nicer objective function, at the cost of a more restricted input space (each entry of $b$ can be either $1$ or $0$):

$$D(b) = \mathbf{1}^T b$$

Now we just need to specify a set of constraints that make that objective function work. First, we specify the least squares problem to solve inside the constraint function: $$\bar{A}\bar{q} = v(P^*)$$

The least squares solution is: $$ \begin{align} \hat{q} &= \argmin_{\bar{q}} (\bar{A}\bar{q} - v(P^*))^T (\bar{A}\bar{q} - v(P^*)) \\ &= \left(\bar{A}^T \bar{A} \right)^{-1} \bar{A}^T v(P^*) \end{align} $$

And our constraint is $ (\bar{A}\hat{q} - v(P^*))^T(\bar{A}\hat{q} - v(P^*)) = 0 $. In this, it's important to note that $\bar{A}$ and $\hat{q}$ are functions of the input variable $b$.
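This feasibility test is easy to sketch with NumPy's least-squares solver: given a support guess $b$, keep the corresponding columns of $A$, solve, and check whether the residual is (numerically) zero. The matrix below encodes the $P_9, P_{10}, P_4$ example from the question:

```python
import numpy as np

def feasible(A, v_target, b, tol=1e-9):
    """Check the constraint for a support guess b (0/1 vector): keep only
    the columns where b is 1, solve the least-squares problem, and declare
    b feasible iff the residual is (numerically) zero."""
    A_bar = A[:, b.astype(bool)]
    q_hat, *_ = np.linalg.lstsq(A_bar, v_target, rcond=None)
    residual = A_bar @ q_hat - v_target
    return residual @ residual < tol, q_hat

# Columns: {a,b,c,d}, {d,e,f,g}, {d}  over universe a..g.
universe = list("abcdefg")
A = np.array([[1.0 if u in s else 0.0 for s in ("abcd", "defg", "d")]
              for u in universe])
v_target = np.ones(7)                       # target is all of a..g

ok, q_hat = feasible(A, v_target, np.array([1, 1, 1]))
print(ok, np.round(q_hat, 6))               # feasible, weights (1, 1, -1)

ok, _ = feasible(A, v_target, np.array([1, 1, 0]))
print(ok)                                   # infeasible: needs the -{d} term
```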

So, under this formulation of the problem, we have a linear binary programming problem with a nonlinear equality constraint.


Both of these types of problems are NP-complete. There are, however, algorithms that can help solve these types of problems, or at least get somewhat close to an optimal solution, in a reasonable amount of computation time. For the first formulation, you can look at genetic/evolutionary global optimizers, or direct/pattern search algorithms. These are fairly common and included, for instance, in MATLAB's optimization toolboxes. The second formulation is a pretty exotic problem and probably shouldn't be the focus of your search for an optimizer, although I think there are some academic papers that might help solve it.

Answer 2

Let's reformulate the problem in terms of linear algebra. Let the universe $U$ be finite, and consider the vector space $V=\mathbb{R}^U$ of vectors $\mathbf{x}=(x_u)_{u\in U}$. We associate to every subset $S\subseteq U$ a vector $\mathbf{x}_S\in V$ by $$(\mathbf{x}_S)_u=\begin{cases}1&\text{if }u\in S\\0&\text{if }u\notin S\end{cases}$$ Then, given $F\subseteq\mathcal{P}(U)$, the problem of writing some subset $P\subseteq U$ as a sum (resp. sum and subtraction) of elements of $F$ is exactly the problem of finding a solution containing only $0$'s and $1$'s (resp. $0$'s, $1$'s and $-1$'s) of $$\mathbf{A}\mathbf{y}=\mathbf{x}_P$$ where $\mathbf{A}$ is the matrix whose columns are the $\mathbf{x}_f$ for $f\in F$. Thus your problem reduces to finding some particular lattice points of an affine subspace of $\mathbb{R}^F$. Unfortunately I know very little about the subject, and I am unable to help you any further, but I am sure somebody else will. Good luck.

Addendum: Notice that once you have found all those lattice points you can find the ones corresponding to the minimal number of needed operations by minimizing $\|\mathbf{y}\|_1=\sum_{f\in F}|y_f|$.

Answer 3

Not an answer, but I'm not allowed to comment, and I have a question, so here goes:

In parts of your problem you make it sound like the point of your program is to calculate $f(P)$ for a user input $P \subseteq U$. You also say that $f$ is linear, and that you calculate $f$ for every singleton set at startup. Suppose the universe has $u$ elements and the requested subset has $r$ elements. Then the lookup time for the function value of each element in the request is $O(1)$ if you use a hash table, or $O(\log_2 u)$ using a binary search tree, so the total time for retrieving function values is $O(r)$ or $O(r\log_2 u)$. The addition time would be $O(r)$. So overall you're looking at a runtime of $O(r)$ or $O(r\log_2 u)$, which seems fine. The only way you have a problem here that isn't completely unsolvable is if it's not the function call that's expensive, but the addition operation. If that's the case, then we need more information about the cost properties of the addition operation to answer the question.

For instance, assuming $f(P_1)$ is computed, is computing $nf(P_1)$ more expensive than simply fetching $f(P_1)$? Is that computation linear in $n$? Quadratic? How about the computation costs of $f(P_1)-f(P_2)$ vs $f(P_1)+f(P_3)$, where $P_2$ and $P_3$ have the same size? Bottom line, unless I completely misunderstand the question, we need more information to answer it.