I have a data file with x- and y-coordinates and have written code implementing the Blahut-Arimoto algorithm to evaluate the clustering membership probability. Now I want to run this code for 2 clusters ($Nc=2$) and construct its information curve; however, I do not know how to construct an information curve or which metrics to use (a more detailed problem description is given at the very end). All help is appreciated!
Description: I have a data file containing 200 data points to be clustered. The first and second columns of the file are the x- and y-coordinates of the data points, respectively. The first 100 and the last 100 points are independently sampled from the normal distributions
$$N\left(\mu=(0 \ 1), \Lambda = \begin{pmatrix} 0.09 & 0 \\ 0 & 0.09\end{pmatrix}\right) \text{ and } N\left(\mu=(1 \ 0), \Lambda = \begin{pmatrix} 0.09 & 0 \\ 0 & 0.09\end{pmatrix}\right),$$
respectively. I want to write code implementing the Blahut-Arimoto algorithm to evaluate the clustering membership probability $p(\tilde{x}|x)$ for a fixed number of clusters, $Nc$, and compression-distortion tradeoff parameter, $\beta$. The code should perform multiple runs, each starting from random initial conditions. The element-to-element distance matrix uses Euclidean distance, and each data point carries the same weight, i.e. $p(\mathbf{x}_i)=1/200$.
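For anyone who wants to reproduce the setup without the data file, here is a sketch that samples 200 points matching the description above (the seed, and reusing the filename from my code below, are my own choices):

```python
import numpy as np

# Hypothetical reconstruction of the described data set:
# two isotropic Gaussian clusters of 100 points each (seed is arbitrary)
rng = np.random.default_rng(0)
cov = 0.09 * np.eye(2)                      # Lambda = diag(0.09, 0.09)
cluster1 = rng.multivariate_normal([0, 1], cov, size=100)
cluster2 = rng.multivariate_normal([1, 0], cov, size=100)
points = np.vstack([cluster1, cluster2])    # shape (200, 2)
np.savetxt("Data-Exercise2-2024.txt", points)
```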
Blahut-Arimoto algorithm: According to Elements of Information Theory by Thomas M. Cover and Joy A. Thomas, the Blahut-Arimoto algorithm works as follows: We begin with a choice of $\beta$ and an initial output distribution $r(\hat{x})$, and calculate the conditional distribution $q(\hat{x}|x)$ that minimizes the mutual information subject to the distortion constraint via
$$q(\hat{x}|x)=\frac{r(\hat{x})e^{-\beta d(x,\hat{x})}}{\sum_{\hat{x}} r(\hat{x})e^{-\beta d(x,\hat{x})}}.$$
For this conditional distribution, $q(\hat{x}|x)$, we calculate the output distribution, $r(\hat{x})$, that minimizes the mutual information via
$$r(\hat{x})=\sum_x p(x)q(\hat{x}|x).$$
We use this output distribution as the starting point of the next iteration. Each step in the iteration, minimizing over $q(\hat{x}|x)$ and then over $r(\hat{x})$, reduces the value of the double minimization that defines the rate distortion function $R(D)$.
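The two alternating updates above can be sketched in a few lines of NumPy (a toy sketch with made-up sizes; it assumes a precomputed distortion matrix `d` of shape `(N, Nc)`, and all names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nc, beta = 6, 2, 5.0
d = rng.random((N, Nc))      # d(x, x_hat): toy distortion values
p = np.full(N, 1 / N)        # p(x): uniform source distribution
r = np.full(Nc, 1 / Nc)      # r(x_hat): initial output distribution

for _ in range(100):
    # q(x_hat|x) ∝ r(x_hat) exp(-beta d(x, x_hat)), normalised over x_hat
    q = r * np.exp(-beta * d)
    q /= q.sum(axis=1, keepdims=True)
    # r(x_hat) = sum_x p(x) q(x_hat|x)
    r = p @ q

# Each row of q is now a conditional distribution over the reproduction
# points, and r is a valid distribution over x_hat
```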
Code: Below is the code, which I wrote in Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
# Read the whitespace-separated data from the provided text file
data = pd.read_table("Data-Exercise2-2024.txt", sep=r"\s+", header=None)
# Plot the data points on the x-y plane
plt.scatter(data[0], data[1])
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Data Points")
plt.show()
# Calculate the element-to-element distance matrix using Euclidean distance
distance_matrix = squareform(pdist(data, metric='euclidean'))
# Calculate the distortion matrix using squared distortion
distortion_matrix = distance_matrix ** 2
# Weights for all data points (assuming equal weight)
p = np.full(len(distance_matrix), 1/len(distance_matrix))
# Conditional membership probabilities q(x_tilde|x): each row is a
# distribution over the Nc clusters
def calculate_q(d, r, beta):
    numerator = r * np.exp(-beta * d)              # r broadcasts over clusters
    return numerator / numerator.sum(axis=1, keepdims=True)

# Output distribution r(x_tilde) = sum_x p(x) q(x_tilde|x)
def calculate_r(p, q):
    return p @ q

# Distortion of each point to each cluster, built from the pairwise matrix:
# d(x_i, c) = sum_j p(x_j|c) ||x_i - x_j||^2, with p(x_j|c) = p_j q_jc / r_c,
# so the distortion matrix is re-derived from the current memberships
def cluster_distortion(p, q, r, distortion_matrix):
    p_x_given_c = (p[:, None] * q) / r
    return distortion_matrix @ p_x_given_c

# Blahut-Arimoto algorithm with squared Euclidean distortion and restarts
def blahut_arimoto_impl(Nc, beta, max_iter, tolerance, distortion_matrix, p):
    best = None
    for _ in range(10):  # 10 random initializations
        q = np.random.rand(len(p), Nc)
        q /= q.sum(axis=1, keepdims=True)          # random membership rows
        r = calculate_r(p, q)
        for _ in range(max_iter):
            r_old = r.copy()
            d = cluster_distortion(p, q, r, distortion_matrix)
            q = calculate_q(d, r, beta)
            r = calculate_r(p, q)
            # Check for convergence of the output distribution
            if np.sum(np.abs(r - r_old)) < tolerance:
                break
        # Keep the run with the lowest free energy F = I + beta * D
        D = np.sum(p[:, None] * q * d)
        I = np.sum(p[:, None] * q * np.log(q / r))
        F = I + beta * D
        if best is None or F < best[0]:
            best = (F, q, r, d)
    return best

# Parameters
Nc = 2           # number of clusters
beta = 5         # compression-distortion tradeoff
max_iter = 100
tolerance = 1e-6
F, q, r, d = blahut_arimoto_impl(Nc, beta, max_iter, tolerance, distortion_matrix, p)
Problem: I now want to run this code for $Nc=2$ (2 clusters) and construct its information curve for values of $\beta$ between 1 and 40. The problem I am having is that I do not know how to construct the information curve, or what metric to use on the y-axis of this plot.
I read somewhere that one can use either the distortion or the mutual information as a measure, but I cannot get my head around how to do this. Maybe the answer is simply that I should update the distortion matrix at each iteration (currently I only use the squared distance matrix, since initially I do not have any $\hat{x}$), but then I would need the reconstructed data, unless there is another approach I am not aware of.
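For whoever picks this up: the two quantities usually plotted against each other are the expected distortion $D$ and the mutual information $I(X;\hat{X})$, one $(D, I)$ pair per value of $\beta$. A hedged sketch of how both could be computed from a converged solution, assuming `q` has shape `(N, Nc)` with rows summing to 1, `r = p @ q`, and `d` is the `(N, Nc)` distortion of each point to each cluster (function name and toy values are mine):

```python
import numpy as np

def rate_and_distortion(p, q, r, d):
    """I(X;X_hat) in nats and expected distortion D for a converged solution.

    p: (N,) source weights, q: (N, Nc) memberships q(x_hat|x),
    r: (Nc,) output distribution, d: (N, Nc) distortions d(x, x_hat).
    """
    # Guard against log(0): terms with q == 0 contribute 0 to the sum
    ratio = np.where(q > 0, q / r, 1.0)
    I = np.sum(p[:, None] * q * np.log(ratio))
    D = np.sum(p[:, None] * q * d)
    return I, D

# Toy usage with made-up "converged" values
p = np.full(4, 0.25)
q = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
r = p @ q
d = np.arange(8).reshape(4, 2) / 10.0
I, D = rate_and_distortion(p, q, r, d)
```

Sweeping $\beta$ from 1 to 40, running the algorithm at each value, and plotting $I$ (nats, or divide by $\ln 2$ for bits) against $D$ would then trace the information curve: small $\beta$ gives low rate and high distortion, large $\beta$ the opposite.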
Data: Here is some of the data from my txt file, in case someone wants to try the code:
0.35018 1.01496
-0.491826 1.06325
-0.0424478 0.810602
-0.0485904 0.429519
-0.313149 1.27697
-0.0745224 1.25301
-0.24639 1.13452
0.48474 0.645292
-0.192411 0.722434
-0.482834 1.09381
0.138134 1.11462
-0.249814 1.47742
0.348229 1.01375
0.264055 1.24531
-0.0231769 1.18124
0.166838 0.89533
-0.239649 1.17173
-0.273667 0.532211
0.338615 0.835743
-0.411282 1.02389
-0.398042 1.16161
0.37481 1.51544
-0.00191126 1.23194
-0.273601 1.02413
0.0866693 1.49634
-0.319453 1.35711
-0.432486 1.38478
-0.139173 0.61804
0.222087 1.14062