I'm not entirely sure whether this question fits better on Mathematics or Stats, but here goes. I have tried to be as clear as possible, which has led to quite a long post. Not everyone will read it until the end, so I plan to add a bounty for those who do take the time to get through it all. If you need any more details, do let me know. If this question is not suited for Mathematics, please tell me in the comments and I'll delete it - though it seems that a problem concerning averaging and normalisation is ultimately well-fit for Math.
I am working on a language problem where you have two sentences (let's say a source sentence and a target sentence). These sentences are aligned, meaning that each word in each sentence is "connected" to zero, one or more words in the other sentence. (This is the task of word alignment, typically done for sentences in different languages that are translations of each other.) Let's assume for simplicity's sake that a word must be connected to at least one word in the other sentence (unlike the example given below).
The example above has the following alignment index pairs, where the first item in each tuple (Python notation) is a source index and the second a target index. In my case, though, null alignment (here (1, )) is not possible, i.e. all words are aligned. Furthermore, my indices would start at 0.
[(1, ), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (6, 6), (6, 7)]
To measure the amount of syntactic shifts/reordering one has to do to translate the source sentence to the target sentence, I use a so-called cross value, which counts the number of times that these alignments cross one another. Here is a visualization, where the circles are the crosses that we try to count.
The cross value is calculated as follows (paper; not written in a mathematical style - I'm a linguist first and a programmer second, not a mathematician):
- sort the alignment pairs (`align_pairs`), first by source index, then by target index
- extract only the second item of each tuple and store those (`tgt_idxs`)
- count all occurrences where, for each combination of target indices `t1, t2`, `t2` is smaller than `t1`
In Python (online test):
from itertools import combinations
align_pairs = [(0,0), (1, 0), (1, 1), (2, 0)]
align_pairs.sort()
tgt_idxs = [pair[1] for pair in align_pairs]
cross = sum([1 for t1, t2 in combinations(tgt_idxs, 2) if t2 < t1])
print(cross)
# 1
The above is rather intuitive to understand, but a problem arises when trying to normalize this value into a value that lies between 0 and 1. To do so, we would need the maximally possible number of crosses for a given number of source indices and target indices. To get that, we can just create a list of all possible alignment pairs, i.e. a sequence where every source index is connected with every target index. The Python code is perhaps a bit too long to post here, but you can find it online. It comes down to this:
from itertools import product

# extract unique, sorted source and target indices
src_idxs = sorted(set(pair[0] for pair in align_pairs))
tgt_idxs = sorted(set(pair[1] for pair in align_pairs))
# take the Cartesian product of both lists, so that
# every source index is combined with every target index
all_combs = list(product(src_idxs, tgt_idxs))
# ... calculate cross on all_combs in the same way as above
This maximal value can then be used to normalize a cross value to a number between 0 and 1.
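Putting the two snippets together, the normalisation can be sketched as one self-contained piece of code (the helper names `cross`, `max_cross` and `normalized_cross` are my own, not from the original code):

```python
from itertools import combinations, product

def cross(align_pairs):
    # count crossing alignments: inversions in the target indices
    # after sorting by source index, then target index
    tgt_idxs = [pair[1] for pair in sorted(align_pairs)]
    return sum(1 for t1, t2 in combinations(tgt_idxs, 2) if t2 < t1)

def max_cross(align_pairs):
    # cross value of the fully connected alignment over the same indices
    src_idxs = sorted(set(pair[0] for pair in align_pairs))
    tgt_idxs = sorted(set(pair[1] for pair in align_pairs))
    return cross(product(src_idxs, tgt_idxs))

def normalized_cross(align_pairs):
    mc = max_cross(align_pairs)
    return cross(align_pairs) / mc if mc else 0.0

print(normalized_cross([(0, 1), (1, 0)]))  # 1.0
```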
The problem, though, is that the maximal cross is not at all linear: with $n$ source and $m$ target indices it grows on the order of $n^2 m^2$, far faster than the sentence lengths themselves. As a consequence, one single cross between two short sentences will weigh a lot heavier than a single cross between two long sentences.
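In fact, for the fully connected alignment the maximum has a closed form: every unordered pair of distinct source indices combined with every unordered pair of distinct target indices contributes exactly one crossing, so with $n$ unique source and $m$ unique target indices the maximum is $\binom{n}{2}\binom{m}{2}$ (which is where the 225 for the 6-by-6 example below comes from). A quick brute-force check (the function name is my own):

```python
from itertools import combinations, product
from math import comb

def brute_max_cross(n, m):
    # cross value of the fully connected n-by-m alignment:
    # count inversions in the target indices of the full product
    tgt = [t for _, t in sorted(product(range(n), range(m)))]
    return sum(1 for t1, t2 in combinations(tgt, 2) if t2 < t1)

for n, m in [(2, 2), (3, 4), (6, 6)]:
    # brute force matches the closed form: 1 1, 18 18, 225 225
    print(brute_max_cross(n, m), comb(n, 2) * comb(m, 2))
```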
As an example, both of the following examples only have one cross but because the max_value for the second is so high, that one cross is almost nullified:
Given alignments [(0, 1), (1, 0)]:
- cross 1
- max_cross 1
- avg_cross 1.0
Given alignments [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4), (5, 5)]:
- cross 1
- max_cross 225
- avg_cross 0.0044444444444444444
From my perspective, it seems that a mathematically sound operation should be possible to counter the influence of the non-linear max_cross, but I can't get my head around it. Perhaps that intuition is false, and there is no valid way to normalise this maximal value into something where the influence of sentence length on the final value is less important.
I am aware that other averaging options are possible, e.g. by number of alignments or average sentence length, but for other reasons I wish to get a value between 0 and 1.


This will not map each cross count to a number in $[0,1]$, but might make numbers more comparable.
Suppose the source sentence has $n$ words and the target sentence has $m$ words. For each word in the source and each word in the target, there may or may not be an edge connecting them. This would mean there are $2^{mn}$ possible edge configurations.
For each such configuration, count the number of crosses and obtain a discrete distribution $X_{n,m}$ of the number of crosses, meaning
$$\Bbb P(X_{n,m} = k) = \frac{\text{# of configurations with cross count $k$}}{2^{mn}}.$$
We could then standardize this distribution and consider $Z = (X-\mu)/\sigma$, where $\mu = \Bbb E(X)$ is the mean and $\sigma$ is the standard deviation of $X$. The number of crosses is $X$; the number you came up with is $\frac{X}{\max X}$. I would try using $Z$ and see how it goes.
The caveat is that brute forcing the calculation of $X_{n,m}$ may not be computationally feasible, as the number of operations increases very fast with $m,n$. There may be some smart combinatorics that can help here. I might try and tackle this later.
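For small $n, m$ the brute force is still feasible; here is a sketch (function names are my own) that enumerates all $2^{nm}$ edge configurations as subsets of the $n \times m$ grid and standardizes a given cross count against the resulting distribution:

```python
from itertools import combinations, product
from statistics import mean, pstdev

def cross_of(edges):
    # inversions in the target indices, sorted by (source, target)
    tgt = [t for _, t in sorted(edges)]
    return sum(1 for t1, t2 in combinations(tgt, 2) if t2 < t1)

def cross_distribution(n, m):
    # cross count of every one of the 2^(n*m) edge configurations
    cells = list(product(range(n), range(m)))
    return [cross_of([c for i, c in enumerate(cells) if mask >> i & 1])
            for mask in range(1 << len(cells))]

def z_score(align_pairs, n, m):
    counts = cross_distribution(n, m)
    mu, sigma = mean(counts), pstdev(counts)
    return (cross_of(align_pairs) - mu) / sigma

print(z_score([(0, 1), (1, 0)], 2, 2))
```

For realistic sentence lengths this explodes quickly, so the smarter combinatorics mentioned above would be needed there.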