Let R = A (A | B) (A | B | C) (A | B | C | D) ... be an R-Form(A sort of repetition matching grammar).
Where A,B,C,D,... are symbols that allow equality comparison.
Any sequence of A,B,C,D,... can be expressed by R.
e.g., AABBBCCCAAA can be expressed by the R-Form.
The question is, how can one compute the simplest way to express such sequences using substitutions(a = ABC)?
e.g., the above example can be written abcaA where a = AA, b = BB, c = CC. It is obviously "simpler".
Now, this isn't a simple token reducing scheme because one must do this "recursively" and there are multiple possibilities. e.g., ABABCABCAB which can be abB, aBc, aCq, etc where a = ABA, b=BCA, c=CAB, q=ABCAB, etc.
Of course, we could just use the sequence be z and be done with it.
Ultimately we want a sequence that maximizes information and minimizes memorization. One substitution symbol z (= ABABCABCAB) is just as bad as the original sequence ABABCABCAB.
But there is a sort of optimal balance between the number of substitution symbols and the length of the sequence. The more repetition involved the more efficient the optimal substitution sequence is at representing the sequence.
Basically this deals with memorization, compression, structures involving repetition, etc. (It acts like it's some type of recursive RLE structure)
In fact, by allow not really substituting and allowing for more general operations such as power (A^n = A...A n times) and parenthesis for grouping(sort of substitutions) we can make expressions such as (ABA)(BCA)^2(C) which allow some measure of efficiency(since we can directly compare it to the original sequence to see if it is easier to "remember" or "compress").
We can add more operations such as reverse ABABCABCAB = (AB)^2-(BAC), etc.
The goal is ultimately to find the most efficient way to represent sequences in a given context. (Set of operations that we can easily carry out to simplify the sequence)
What we know is that the set of all possible sequences, which can be represented by the R-Form... from AAAAAA.... to ABCDEFG...., represent the range from the most compressible sequence to the least. All other sequences fit somewhere in between.
(AAAAA... represents sequences that completely repeat. ABCDEFG represents sequences that have no repetition(e.g., random with no repetition))
The goal is to find an algorithm to take any set of data that can be partitioned into symbols A, B, C, D, ... and find the best way(information wise) to represent them with the given set of operations(at least groupings() and powers^).
(The reason why the question is written well is that I just got up and I'd like general answers because part of this applies to general memorization(I need to memorize a sequence of stuff(numbers or dates, etc...) and to compression in general). Hence I'd like to understand it from a more theoretical basis than some specific application)
Thanks.