Minimal number of objects, each containing $n$ features, s.t. in the set of all objects each feature is represented at least $k$ times.

39 Views Asked by At

Can anyone please help frame this problem theoretically and/or guide me to its solution?
I think it might be partly related to 'set cover' type of problems, but the requirements are a bit different.

Suppose you have a total of $M$ features ($M \ge 3$), and you have the set of all objects possessing (each) exactly $n$ of these features ($2 \le n \le M-1$).
E.g. the features could be the first $M=5$ uppercase letters of the Latin alphabet: $\{A,B,C,D,E\}$.
For $n=2$, you have a set of objects $\{(A,B); (A,C); (A,D); (A,E); (B,C); ...; (D,E)\}$.
The number of such objects is of course $M \choose n$; in this example, ${5 \choose 2} =10$.

I need to calculate the minimal number of objects from this set that I am required to select, such that in their union each of the $M$ features is represented at least $k$ times ($1 \le k \le M-1)$.

E.g. in the above example ($M=5,n=2$), if I want each feature to be represented at least twice ($k=2$), I need at least $5$ objects, which could be $\{(A,B);(C,D);(A,E);(C,E);(B,D)\}$.
Note that I don't particularly need to know the objects themselves (as there are probably several degenerate solutions): just how many they are.

Supposing that we want to have each feature repeated $k=3$ times, the required number of objects is of course larger ($8$ for $M=5,n=2,k=3$).
Conversely, if we increase the number of features each object can have, the required number of objects decreases (e.g. $4$ for $M=5,n=3,k=2$)

I have tried framing this in a matrix format, to figure out a formula, meaning a function of $M,n,k$ that gives me the desired number, but I can't see it.

Any suggestions?
Thanks

PS
Find below an R script that calculates the number using linear programming

# calculate the minimal number of objects, each containing n features out of a total of M features
# such that in the set of all objects each feature is represented at least k times

M = 5
n = 2
k = 2

# make the sparse matrix of all objects containing n features out of M
# objects are in columns, and are as many as choose(M,n)
# features are in rows, and are M

N.objects <- choose(M,n)
obj.mat <- sparseMatrix(j = rep(1:N.objects,each = n), i = unlist(combn(1:M, m = n, simplify = F)), x = 1)

# the constraint is that each row of the matrix sums to at least k

dir <- rep(">=",M)
rhs <- rep(k,M)

# the objective is all 1's (we want to count and minimise the number of objects used)

obj <- rep(1,N.objects)

# solve the task by linear programming

if (length(find.package(package="Rsymphony",quiet=TRUE))==0) install.packages("Rsymphony")
require(Rsymphony)
lp.out <- Rsymphony_solve_LP(obj,obj.mat,dir,rhs,types="B",max=FALSE)

cat("\nN objects needed =",lp.out$objval)
cat("\n\n")
print(obj.mat[,as.logical(lp.out$solution)])
1

There are 1 best solutions below

0
On

I am going to attempt a justification of the result we worked out with Paul's help.

Suppose the solution is a binary matrix with the $N$ objects in columns and the $M$ features in rows.

E.g. for $M=7, n=4, k=3$, where $6$ objects are necessary:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1    1    0    0
[2,]    1    1    0    0    1    0
[3,]    1    0    1    0    1    0
[4,]    0    1    0    1    0    1
[5,]    1    0    1    1    0    1
[6,]    0    0    0    1    1    1
[7,]    0    1    1    0    1    1

It is required to represent each of the $M$ features at least $k$ times, thus each row must contain at least $k$ 1's.
Each object has exactly $n$ distinct features, thus each column contains exactly $n$ 1's.

Therefore the total number of 1's ($N_1$) in the matrix, counting by rows, is $N_1 \ge M \cdot k$; counting by columns, it is $N_1=N \cdot n$.

Substituting $N_1$ from the equation into the inequality, we have:

$N \cdot n \ge M \cdot k$

i.e.:

$N \ge \frac {M \cdot k} {n}$

As all variables are positive integers, and we want the minimal $N$:

$N = \lceil \frac {M \cdot k} {n} \rceil$

I hope this makes sense and I haven't overlooked anything.