Clique enumeration for substring overlap graph surely has polynomial (not exponential) running time.

Question

86 Views Asked by Bumbble Comm At 16 May 2026 - 1:47

https://en.wikipedia.org/wiki/Clique_problem

The article says that all known algorithms for the general clique problem run in exponential time. That is okay for us, since our graphs are not arbitrary: they are constrained by the structure of the underlying string.

In $s = aaaaaa$ there are two considerable substrings, $a^2$ and $a^3$. Let $A_i$ denote the $i$th occurrence (counting from the left) of $a^2$, and similarly $B_i$ for $a^3$.

A1  A3
--- --- 
a a a a a a
  --- 
   A2

Then you can check by eye that $\{A_1, A_2, B_1, B_2\}$ is a maximal complete subgraph of the undirected graph $G$ whose vertices are substring occurrences and in which an edge is present if and only if its two endpoints overlap in the string $s$. Another clique would be $\{A_2, A_3, B_1, B_2, B_3\}$.
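The two cliques named above can be checked mechanically. The following is a minimal sketch, not part of the original post: the helper names (`occurrences`, `overlaps`, `is_clique`, `is_maximal_clique`) are made up for illustration, and occurrences are modelled as half-open `(start, end)` spans.

```python
from itertools import combinations

def occurrences(s, t):
    """Return the half-open (start, end) spans where t occurs in s, left to right."""
    return [(i, i + len(t)) for i in range(len(s) - len(t) + 1)
            if s[i:i + len(t)] == t]

def overlaps(u, v):
    """Two half-open spans overlap iff they share at least one position."""
    return u[0] < v[1] and v[0] < u[1]

def is_clique(spans):
    """Every pair of spans must overlap."""
    return all(overlaps(u, v) for u, v in combinations(spans, 2))

def is_maximal_clique(spans, all_spans):
    """A clique is maximal if no other vertex overlaps every member."""
    if not is_clique(spans):
        return False
    others = [v for v in all_spans if v not in spans]
    return not any(all(overlaps(v, u) for u in spans) for v in others)

s = "aaaaaa"
A = occurrences(s, "aa")    # A[0] is A_1, A[1] is A_2, ...
B = occurrences(s, "aaa")   # B[0] is B_1, B[1] is B_2, ...
vertices = A + B

first = [A[0], A[1], B[0], B[1]]
second = [A[1], A[2], B[0], B[1], B[2]]
print(is_maximal_clique(first, vertices))    # True
print(is_maximal_clique(second, vertices))   # True
```

Every member of `first` covers position 1 of $s$ and every member of `second` covers position 2, which is why both sets are pairwise overlapping.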

Regardless of alphabet size, what is the largest overlapping substring clique in this graph assuming a maximum considerable substring length of $m$ and a minimum length of $2$?

Isn't it obvious that we can deduce all of this from the $|\Sigma| = 1$ case, since adding more letters surely shouldn't increase this maximum? How can I prove that formally?

$m \leq |s|/2$, since a substring cannot be considerable (in particular, repeating) if its length does not fit twice disjointly into $|s|$ positions. So I'm looking for an algorithm that is polynomial in the input size $|s|$.

Proof attempt. Since we're maximizing, it suffices to consider the case in which the overlap shared by all members of a clique of occurrences in $s = aaaaa\ldots$ has length $1$: if you enforce a larger common overlap, there is less room outside the overlap for potential clique members to occupy. Fewer positions mean fewer possible distinct occurrences.

Here's an example with the length 4 substring:

  -------  
      -------
a a a a a a a a a a ...
-------
    -------

Well, clearly, you'd center the previous example on the same common overlap (that's length $1$). So it's now obvious to conjecture from this data alone:

The maximum clique size in this situation is $2 + 3 + \dots + m = m(m+1)/2 - 1$ or $\dfrac{m^2 + m - 2}{2}$ by the "sum the integers $1$ to $N$ trick".
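The conjectured formula can be sanity-checked by brute force. This is only a sketch under two assumptions that are not spelled out above: that every set of pairwise overlapping intervals shares a common position (the one-dimensional Helly property, so the largest clique is the largest set of occurrences covering one position), and that the run of $a$'s is long enough that the "considerable" filter can be ignored. The function names are hypothetical.

```python
def max_clique_size_unary(n, m):
    """Largest number of spans with length in 2..m that cover one position of a^n."""
    best = 0
    for p in range(n):                       # candidate common position
        count = 0
        for length in range(2, m + 1):
            for start in range(n - length + 1):
                if start <= p < start + length:
                    count += 1
        best = max(best, count)
    return best

def closed_form(m):
    """2 + 3 + ... + m, written as m(m+1)/2 - 1."""
    return m * (m + 1) // 2 - 1

for m in range(2, 7):
    n = 2 * m                                # shortest string allowed by m <= |s|/2
    print(m, max_clique_size_unary(n, m), closed_form(m))
```

For $m = 3$ both columns give $5$, matching the clique $\{A_2, A_3, B_1, B_2, B_3\}$ from the first example.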

Do you have a cleaner proof? I think you could say something like: the common overlap position (of length $1$) can sit at any one of the positions of a substring $t$, of which there are $|t|$, so each length $|t|$ contributes $|t|$ occurrences to the clique.

The above was just finding the maximum clique size w.r.t. $|s|$. Now I'm still not sure if clique enumeration would indeed be polynomial time.

The way around this is to use the fact that the cliques along a string of purely $a$'s have a regular pattern, and thus if you're well within a region of straight $a$'s, then you can just write down the clique. Then when you add letters to the alphabet, the maximum clique size can be shown to taper off drastically.

Original Q&A

There are 2 best solutions below

Bumbble Comm On 03 Aug 2019 - 8:03

An alternative, string-based approach, no graph algorithm required.

You build all complete templates consisting of all spans of lengths $2..m$, and slide these templates along the string in $O(|s|)$ time, dropping template parts that land on non-considerable strings. There will be $O(m)$ common overlap lengths to try and one (complete) template per overlap length. So this should be an $O(|s|^2)$ algorithm times the template construction time.
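A minimal sketch of the sliding idea, restricted to common overlap length $1$, might look like the following. It is an illustration rather than the answer's actual code: `cliques_by_position` is a hypothetical name, and the set of considerable substrings is assumed to be given. The lists it returns are candidate cliques only; they are not filtered for maximality.

```python
def cliques_by_position(s, considerable, mn=2, mx=None):
    """For each position p of s, collect the (offset, substring) pairs that cover p."""
    if mx is None:
        mx = len(s) // 2
    cliques = []
    for p in range(len(s)):
        members = []
        for length in range(mn, mx + 1):
            lo = max(0, p - length + 1)        # leftmost start that still covers p
            hi = min(p, len(s) - length)       # rightmost start that fits inside s
            for offset in range(lo, hi + 1):
                t = s[offset:offset + length]
                if t in considerable:          # drop non-considerable template parts
                    members.append((offset, t))
        if len(members) >= 2 and members not in cliques:
            cliques.append(members)
    return cliques

result = cliques_by_position("aaaaaa", {"aa", "aaa"})
print(result[1])   # [(0, 'aa'), (1, 'aa'), (0, 'aaa'), (1, 'aaa')]
print(max(len(c) for c in result))   # 5
```

The second candidate corresponds to $\{A_1, A_2, B_1, B_2\}$ from the question, and the largest candidates have size $5$.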

**Bumbble Comm** · Accepted Answer

"Anyone know how I could restate this without using templates..."

Here it is. It's not very robustly written, but seems like it will do the job for my initial tests.

def substring_positions(s, mn=None, mx=None):
    # TODO: proof of running time
    if mn is None:
        mn = 2
    if mx is None:
        mx = len(s) // 2

    res = {}
    for i in range(0, len(s) - mn + 1):
        for length in range(mn, min(len(s) - i, mx) + 1):
            t = s[i:i + length]
            if t not in res:
                res[t] = [i]
            else:
                res[t].append(i)
    return res


def maximal_substring_packing(length, pos):
    # Do a "leftmost" packing, and this does equal the maximal number you could
    # pack into s.  TODO: proof
    # TODO: proof of running time
    # `pos` must be sorted ascending; `length` is the common substring length.
    x = pos[0]
    count = 1
    for y in pos[1:]:
        if x + length <= y:  # y starts at or after the end of the last packed copy
            count += 1
            x = y
        # otherwise there's a non-empty overlap, so y is skipped
    return count


def considerable_substrings(s): 
    # TODO: proof of maximum number of considerables in big-O w.r.t. |s|
    positions = substring_positions(s)
    res = {}

    for t, pos in positions.items():
        if len(pos) <= 1:   # A substring with only one position (occurrence)
            # can never be a considerable.
            continue
        p = maximal_substring_packing(len(t), pos)

        # From definition of "considerable" substring.
        if len(t) == 2 and p >= 3:
            res[t] = p
        elif len(t) >= 3 and p >= 2:
            res[t] = p

    return res, positions     # Keep the position info for now


## Not sure if we need this yet:
#def subproblem_considerable_substring(s):
    #"""
    #Since we're doing a divide-and-conquer approach using proper subproblems.
    #A considerable needs to be defined to be t <= s such that |t| >= 2 and the 
    #maximal number of disjoint occurrences is >= 2.  This way, a subproblem, while
    #not necessarily compressible on its own, might be in the scope of the global
    #problem for a given input string s.
    #"""    
    ## TODO: proof of maximum number of considerables in big-O w.r.t. |s|
    #positions = substring_positions(s)
    #res = {}

    #for t, pos in positions.items():
        #if len(pos) <= 1:   # If there is only one position (occurence) of a substring_positions
            ## then it can never be a considerable.
            #continue
        #p = maximal_substring_packing(len(t), pos)

        ## From definition of "subproblem considerable" substring.
        #if len(t) == 2 and p >= 2:
            #res[t] = p

    #return res, positions     # Keep the position info for now    


# TODO: rewrite this to have less if-filters and be more robust.
# TODO: test / verify the algorithm.
def conflict_lists(s, cons):
    """
    E.g.
    isect = 1, mn = 2, mx = 3:
      [][]
        [][]
    [][][]
        [][][]
    isect = 2: mn = 2, mx = 4:
        [][]
      [][][]
        [][][]
    [][][][]
      [][][][]
        [][][][]
    The same conflict can be generated more than once (from different base offsets or
    intersect (isect) lengths), so duplicates are removed by the uniqueness filter at the
    end of this function.

    A conflict in the list returned is encoded as a dictionary keyed by the substring with value
    equal to a list of all of its offsets.  This is in particular efficient for strings over
    the singleton alphabet.
    """    
    if len(cons) == 0:
        return []
    lengths = [len(c) for c in cons.keys()]
    mn = min(lengths)
    mx = max(lengths)
    conflicts = []

    # isect = length of the common intersection region among all involved considerables
    # h = overhang = how far to the left the left-most substring is, for this isect, from the base offset k
    #      which is always the left-most spot of the isect region.
    # The right-most substring for a given length is then offset at base offset k

    for k in range(0, len(s) - mn + 1):  # Skip invalid offsets
        for isect in range(1, mx): # Yes, it should be mx - 1 maximally, since a single max length
            # substring on its own is not in conflict with any other.
            conflict = {}
            count = 0
            for length in range(mn, mx + 1):   # Inclusive of max
                h = isect - length
                for offset in range(max(0, k + h), k + 1):  # Skip invalid offsets using max
                    if offset + length > len(s):
                        continue  # The slice would be truncated at the end of s
                    t = s[offset:offset + length]
                    if t in cons:
                        offsets = conflict.setdefault(t, [])
                        if offset not in offsets:
                            offsets.append(offset)
                            count += 1  # Count each positioned substring once
            if count >= 2:    # You need at least two positioned substrings, for a conflict to occur
                conflicts.append(conflict)

    # A uniqueness filter is still needed, as [][]
    #                                         [][][]
    # is conflicted with respect to two s locations (0, 1).  So without the below filter,
    # there would be duplicates in conflicts. 
    filtered = []
    for conflict in conflicts:
        if conflict not in filtered:   # Surprisingly this works in python; the dicts are compared by value.
            filtered.append(conflict)
    return filtered


if __name__ == '__main__':
    while True:
        s = input('s = ')
        #pos = substring_positions(s)
        cons, pos = considerable_substrings(s)
        print("considerables = ", cons)
        C = conflict_lists(s, cons)
        print("conflict sets = ", C)
        print("|s| = ", len(s))
        print("|C| = ", len(C))

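conflict_lists(s, cons) is the main function of concern in this post, but I've included the whole sgp.py so that you can test the output if you're interested in this topic. The running time looks to be polynomial in the input size: there are 4 nested for-loops whose loop variables are each bounded by $|s|$, giving $O(|s|^4)$ loop iterations conservatively. Each iteration also slices and hashes a substring of length at most $m$, and the final uniqueness filter compares conflicts pairwise, so the total is a somewhat higher-degree polynomial. The number of conflicts seems to be $O(|s|^2)$, since at most one is produced per pair (k, isect), but it is a lot less in practice; perhaps it's $O(|s|\log |s|)$ or something...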
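The uniqueness filter at the end of conflict_lists tests `conflict not in filtered`, which rescans the list for every conflict. One possible alternative, sketched here as an assumption rather than as part of sgp.py, is to hash each conflict through a frozen key; `dedupe_conflicts` is a hypothetical helper name. It relies on each offset list being built in increasing order, as the loops in conflict_lists do.

```python
def dedupe_conflicts(conflicts):
    """Drop duplicate conflict dicts while preserving first-seen order."""
    seen = set()
    unique = []
    for conflict in conflicts:
        # Dicts and lists are unhashable, so build an order-insensitive key
        # over the (substring, offsets) pairs instead.
        key = frozenset((t, tuple(offsets)) for t, offsets in conflict.items())
        if key not in seen:
            seen.add(key)
            unique.append(conflict)
    return unique

sample = [{"aa": [0, 1]}, {"aa": [0, 1]}, {"aa": [1], "aaa": [0]}]
print(dedupe_conflicts(sample))   # [{'aa': [0, 1]}, {'aa': [1], 'aaa': [0]}]
```

This keeps the same notion of equality as comparing the dicts by value, but each membership test becomes a set lookup instead of a scan.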
