Combinatorial optimization - Bijections between duplicated numbers

513 Views Asked by At

English is not my native language: sorry for my mistakes. Thank you in advance for your answers.

Two Bijections and an Array...


The duplication problem

Here is a 2D array (in this particular example: rows: 1 to 4; columns: A to D) nonrandomly and nonarbitrary filled with integers.

$$ \left[ \begin{array}{cc|cc} 7 & 1 & 1 & 7 \\ 4 & 4 & 1 & 1 \\ 5 & 1 & 1 & 5 \\ 3 & 3 & 5 & 5 \end{array} \right]. $$

These numbers can be handled, but only one kind of move is allowed: You can move the numbers on the left or the right in each row, but you cannot shuffle them and you cannot move them vertically.

Example (first row):

  • Right: 7 1 1 7 → 7 7 1 1 (shift)

  • Wrong: 7 1 1 7 → 7 1 7 1 (shuffle)

By the nature of the data in this model (here is the nonrandom and nonarbitrary part), it is possible to have the same number of identical integers in columns A and B, and in column C and D, respectively. In other words, two bijections (or perfect one-to-one correspondences).

That is: if you have 3 '1' in column A, then you need to have 3 '1' in column B, and so on, in order to have a solution.

An example is worth a thousand words. Applying allowed moves on our array, we can produce this result:

$$ \left[ \begin{array}{cc|cc} 1 & 1 & 7 & 7 \\ 1 & 1 & 4 & 4 \\ 1 & 1 & 5 & 5 \\ 3 & 3 & 5 & 5 \end{array} \right]. $$

It is easy to see that columns A and B share the same number of identical integers (1, 3), and columns C and D, too (7, 4, 5).

Then, a solution is valid when there is a perfect one-to-one correspondence between the numbers of A and B (bijection 1), regardless their order, on the one side; as well as a perfect one-to-one correspondence between the numbers of C and D (bijection 2), regardless their order, on the other side.

The values can be sorted, but do not need to. The following example is then a valid one:

$$ \left[ \begin{array}{cc|cc} 6 & 23 & 5 & 0 \\ 4 & 3 & 7 & 5 \\ 23 & 4 & 2 & 7 \\ 3 & 6 & 0 & 2 \end{array} \right]; $$

but the next one is invalid because there is no bijection:

$$ \left[ \begin{array}{cc|cc} 6 & 23 & 5 & 0 \\ 4 & 3 & 7 & 5 \\ 4 & 2 & 7 & 23 \\ 3 & 6 & 0 & 2 \end{array} \right]. $$

My question is: Do you know a way to solve this problem without a brute-force approach (totally inefficient for large arrays)?

A Connection With Graph Theory?


Thanks to a link provided by EvilTeach, I am trying to make connections between this problem and graph theory. I am not an expert in this field, but I think I may have “discovered” a little something.

In order to transform my array into a graph, I take each different value and make directional links between them.

For example: if the first row is 6 23 5 0, I create a link (inside columns A and B) between 6 and 23 with an arrow to show the direction from 6 to 23. Then, I draw another link between 5 and 0 (inside columns C and D, then), with an arrow.

At the end, I count the number of times each value is designated by an arrow (“IN”) and also the number of times this same value designates another value (“OUT”).

The graph is then:

Valid array

With a valid array, we can see that the number of INs equals the number of OUTs for each value.

What happens with an invalid array?

Something like that:

Invalid array

Here, for some values, the number of IN is not equal to the number of OUT.

So let's see if there are any links between these values (here: 2, 4, 7, and 23).

  • Between 2 and 4: the link created thanks to row 3
  • Between 2 and 7: none
  • Between 2 and 23: none
  • Between 4 and 7: none
  • Between 4 and 23: none
  • Between 7 and 23: only one: row 3

Then, in this example, the graph seems to indicate that there is a problem with row 3. And this is indeed the case.

So... This approach can be seen as very naive. Indeed, some values will have self-recursive links, some will have more than 1 IN and 1 OUT, and it becomes more complex with a lot of values and links. Moreover, I do not prove anything here. Maybe it is just a lot of chance. But perhaps it is a clue of a deep connection between this problem and graph theory.

As I said earlier in this message, I am not familiar with graph theory, so you are welcome to give your remarks!

A More Complex Example


I have decided to see what happens with a more complex array:

$$ \left[ \begin{array}{cc|cc} 6 & 3 & 4 & 1 \\ 8 & 8 & 3 & 2 \\ 5 & 2 & 1 & 5 \\ 8 & 5 & 7 & 12 \\ 7 & 11 & 2 & 7 \\ 3 & 12 & 5 & 1 \\ 8 & 2 & 13 & 4 \\ 2 & 6 & 4 & 4 \\ 3 & 8 & 7 & 3 \\ 20 & 3 & 2 & 13 \\ 6 & 6 & 4 & 4 \\ 2 & 6 & 1 & 2 \\ 11 & 7 & 12 & 7 \\ 4 & 8 & 3 & 8 \\ 6 & 20 & 7 & 7 \\ 12 & 4 & 8 & 3 \end{array} \right]. $$

This array is valid: the bijections in columns A and B, and C and D, respectively, are perfect.

I have tried to create an intelligible graph which confirms the validity of the bijections (blue lines: links between A and B; orange lines: links between C and D):

Graph 2: valid array

Now, let's experiment two shifts negating the bijections, and affecting rows 7 and 12:

$$ \left[ \begin{array}{cc|cc} 6 & 3 & 4 & 1 \\ 8 & 8 & 3 & 2 \\ 5 & 2 & 1 & 5 \\ 8 & 5 & 7 & 12 \\ 7 & 11 & 2 & 7 \\ 3 & 12 & 5 & 1 \\ 13 & 4 & 2 & 8 \\ 2 & 6 & 4 & 4 \\ 3 & 8 & 7 & 3 \\ 20 & 3 & 2 & 13 \\ 6 & 6 & 4 & 4 \\ 6 & 1 & 2 & 2 \\ 11 & 7 & 12 & 7 \\ 4 & 8 & 3 & 8 \\ 6 & 20 & 7 & 7 \\ 12 & 4 & 8 & 3 \end{array} \right], $$

and its corresponding graph (purple for the new links between A and B; green for C and D):

Invalid array 2

At first glance, the errors revealed by the imbalance between the INs and the OUTs are not very efficient: the inequality concerns the integers 1, 2 and 6 linked between them such as:

  • Link between 1 and 2: none,
  • Link between 1 and 6: row 12 [great!],
  • Link between 2 and 6: row 8 [so close...].

The present “method” is half full. One the one hand, it seems to allow to detect which rows are likely to be displaced, “out of sync” with the ideal of bijection (and then, how correct them); on the other hand, it really needs to be improved in order to, perhaps, implement an algorithm allowing to solve this problem. That is why I need your help.

4

There are 4 best solutions below

0
On BEST ANSWER

A quick update.

I have solved this problem using a genetic algorithm I wrote in C++. It is able to find a solution in 1 or 2 hour(s). It is not even optimized (I am new to C++; my program is not even parallelized).

Initialization: DNA: encodage of the shift (for each line: 0 to 3, 0 for no shift, 3 for max shift to the right).

E.g. :

0: { 1 2 3 4 }
1: { 4 1 2 3 }
2: { 3 4 1 2 }
3: { 2 3 4 1 }

Evaluation: std::set_intersection for the number of bijections; fitness = 1/(# bijections max-#bijections in the DNA+1); if fitness = 1, break (we have a solution).

Selection: a simple 50|50 one-point crossover.

Mutation: 1%

GA gives amazing results for this NP-complete problem.

If you want more information, feel free to contact me.

7
On

In the context of the 4X4 example. Call this array A. I would create an array of 4 X 31, and initialize them to zero. Call this array C. The value 4 is the number of rows within the array.

For each row, i would iterate though its columns and increment the element in C which is subscripted by the value in A. This would yield C row 1 as [0][2][0][0][0][0][2][0]... There are two 1s and two 7s.

C looks like this

  • [0][2][0][0][0][0][0][2]...
  • [0][2][0][0][2][0][0][0]...
  • [0][2][0][0][0][2][0][0]...
  • [0][0][0][2][0][2][0][0]...

Scan the columns looking left to right. You need a column value for each row that is at least 2.

Some possible solutions include. 1,1,1,3 1,4,1,3 1,4,5,5

Now you probably have a smaller list of possible solutions.

do some O(N) type looping shifting stuff into place until you verify that the solution is correct or impossible.

4
On

Assuming that one is using the brute force algorithm, there may be a way to make the "test if this is valid solution" run faster.

For the discussion assume we have a 500 row by 4 array which holds the integer values to test. Those integers range from 0 .. 30. We have two groups of related columns (A, B) and (C, D).

Ignoring (C, D) we can compute two values.

  1. Sum(A)
  2. Sum(B)

If those two sums differ, then we do not have a valid solution.

We can compute two other values (adding 1, so that zero is not part of the product)

  1. Product(A + 1)
  2. Product(B + 1)

If those two products differ, then we do not have a valid solution.

If we have a valid solution, print it and stop otherwise, do the next brute force shift, and retest. Continue that process until a solution is found, or we run out of shifts.

Now to speed it up, consider this:

  • At any time, we know the two sums, and the two products.
  • When we go to do a shift, we can adjust the sums and the products to remove the values for that row.
  • After the shift we can adjust the sums and the products to add the "new" values for that row.

Other factors

  • N is still very large
  • the product can get large 31^500 probably won't fit in an int32. Hell maybe not an int64
11
On

Have you considered an integer programming approach? This is not my area of expertise, so I don't know whether an array of $500$ rows will be feasible, but here's the idea.

Let there be $R$ distinct rows in the array. For $1\le r\le R,$ let $n_r$ be the number of times row $r$ occurs in the array. Since rows get cyclically permuted anyway, we may first cyclically permute each row so that it is in lexicographically minimal. This may reduce $R,$ which will result in a system with fewer variables. On the other hand, I imagine that $n_r$ will equal $1$ most of the time, so this step is not essential.

For each $1\le r\le R,$ introduce four variables, $a_{r0},$ $a_{r1},$ $a_{r2},$ $a_{r3},$ satisfying $$a_{r0}+a_{r1}+a_{r2}+a_{r3}=n_r,\qquad a_{r0}\ge0,\quad a_{r1}\ge0,\quad a_{r2}\ge0,\quad a_{r3}\ge0.\quad$$ Each of the variables $a_{ri}$ is a nonnegative integer indicating how many times the $i$-rotated version of row $r$ occurs in the final array. If $n_r=1,$ then three of the $a_{ri}$ will be $0$ and one of them will equal $1$ in any solution.

For each distinct integer, $\ell,$ that occurs in the array, we get two equations, one of which says that the number of appearances of $\ell$ in column $0$ equals the number of appearances of $\ell$ in column $1,$ the other of which says that the number of appearances of $\ell$ in column $2$ equals the number of appearances of $\ell$ in column $3.$

We then apply an integer programming method to the solution of the resulting system.

Here's an example. Start with the array $$ \begin{bmatrix} 0 & 6 & 23 & 5\\ 3 & 7 & 5 & 4\\ 2 & 7 & 23 & 4\\ 0 & 2 & 3 & 6 \end{bmatrix}, $$ which has lexicographically minimal rows, obtained from one of your examples. Let $a_{10}$ be the number of times $(0,6,23,5)$ appears in the final array, $a_{11}$ the number of times $(5,0,6,23)$ appears in the final array, $a_{12}$ the number of times $(23,5,0,6)$ appears in the final array, $a_{13}$ the number of times $(6,23,5,0)$ appears in the final array, $a_{20}$ the number of times $(3,7,5,4)$ appears in the final array, $a_{21}$ the number of times $(4,3,7,5)$ appears in the final array, and so on.

The number of times $0$ appears in column $0$ of the final array is $a_{10}+a_{40};$ the number of times it appears in column $1$ of the final array is $a_{11}+a_{41}.$ This yields the equation $$ a_{10}+a_{40}=a_{11}+a_{41}. $$ Similarly, the number of times $0$ appears in column $2$ of the final array is $a_{12}+a_{42};$ the number of times it appears in column $3$ of the final array is $a_{13}+a_{43}.$ This yields the equation $$ a_{12}+a_{42}=a_{13}+a_{43}. $$

For the element $2,$ we get the equations $$ \begin{aligned} a_{30}+a_{43}&=a_{31}+a_{40}\\ a_{32}+a_{41}&=a_{33}+a_{42}. \end{aligned} $$ We get similar pairs of equations for the elements $3,$ $4,$ $5,$ $6,$ $7,$ and $23.$

Our final system consists of all of the above equations, combined with $$ \begin{aligned} a_{10}+a_{11}+a_{12}+a_{13}&=1\\ a_{20}+a_{21}+a_{22}+a_{23}&=1\\ a_{30}+a_{31}+a_{32}+a_{33}&=1\\ a_{40}+a_{41}+a_{42}+a_{43}&=1, \end{aligned} $$ and the inequalities $a_{ri}\ge0$ for $1\le r\le4,$ $0\le i\le3.$

The idea is to feed this system into an integer programming routine, such as branch-and-cut. I don't have experience with the standard packages for doing such things, but I did try solving some systems with Mathematica, using the command "Reduce[system,Integers]". Playing around a bit, I managed to get Mathematica to solve systems up to $35$ rows, and I'm sure it could go higher. I have no idea what's going on internally, and whether it's state-of-the-art or not. It definitely looks likes it's scaling far better than brute force does. It might be worth trying more specialized software.

I've added some tags to your post to see whether we can get some experts to weigh in.

Added: The array above has the property that every distinct array element occurs exactly twice and in exactly two different rows, which is atypical. Here's a better example, which shows that the constraints can look more complicated than they do in the previous example. Consider the array $$ \begin{bmatrix} 1 & 1 & 6 & 5\\ 1 & 3 & 5 & 6\\ 1 & 6 & 6 & 2\\ 1 & 1 & 2 & 3 \end{bmatrix}. $$ The constraint that the element $1$ occur the same number of times in columns $0$ and $1$ is $$ a_{10}+a_{13}+a_{20}+a_{30}+a_{40}+a_{43}=a_{11}+a_{10}+a_{21}+a_{31}+a_{41}+a_{40}. $$ The constraint that the element $1$ occur the same number of times in columns $2$ and $3$ is $$ a_{12}+a_{11}+a_{22}+a_{32}+a_{42}+a_{41}=a_{13}+a_{12}+a_{23}+a_{33}+a_{43}+a_{42}. $$ The constraint that the element $6$ occur the same number of times in columns $0$ and $1$ is $$ a_{12}+a_{21}+a_{33}+a_{32}=a_{13}+a_{22}+a_{30}+a_{33}. $$ The constraint that the element $6$ occur the same number of times in columns $2$ and $3$ is $$ a_{10}+a_{23}+a_{31}+a_{30}=a_{11}+a_{20}+a_{32}+a_{31}. $$

Also, about $40$ rows appears to be practical limit of what the Mathematica Reduce command can handle, at least on my laptop.