Are you afraid of maths? I guess you'd say "no". Well, maths always scares me, and this time is no exception :(
I was going through this Research Paper. In this paper, I don't understand how they solve the following optimization problem P2.
And how do they generate the dataset for training the deep neural network from the solution vector of problem P2?

For the sake of exposition, we have the following linear program: \begin{align} & \min_{t_s} \sum_{s \in S_n} t_s \sum_{k\in\mathcal{K}_n^s} P_{nk} \\ & \;\text{s.t. } \sum_{s\in S_n}t_s \log_2(1+\text{SINR}^s_k) \geq \sum_{f\in\mathcal{F}_{nk}} L_f \,\;\forall\,\;\,k\in\mathcal{K}_n\\ & \;\;\;\;\;\;\, \sum_{s\in S_n} t_s \leq T \end{align} where $n$ is the cell, $T$ is the time limit, $\mathcal{K}_n$ is the set of users, $S_n$ is the set of all possible clusters for cell $n$ (i.e. $|S_n|=2^{|\mathcal{K}_n|}-1$), $s\in S_n$ is a cluster index, $\mathcal{K}_n^s\subseteq \mathcal{K}_n$ is the user set for cluster $s$, $P_{nk}$ is the transmit power for small base station $n$ to user $k$, SINR$^s_k$ is the signal-to-interference-plus-noise ratio for cluster $s$ and user $k\in\mathcal{K}_n^s$, $L_f$ is the size of file $f$, $\mathcal{F}_{nk}$ are the files of user $k$, and $t_s$ is the time duration of using cluster $s$ in content delivery.
Ok, so basically we are summing over all possible clusters, in order to minimize the sum of transmit powers associated with each cluster, weighted by the time durations. We essentially have a total time budget, and assign the time to spend on each cluster. But note that there is a LARGE number of clusters (exponential in the number of users in the cell). Thus the goal is to use a function approximator (here, a DNN) to help reduce the cost of solving the linear program.
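To make the structure concrete, here is a toy instance of this LP solved with SciPy's `linprog`. All the numbers (powers, rates, demands) and the "rate shrinks with cluster size" model are my own assumptions for illustration, not the paper's:

```python
# Toy instance of the cluster-scheduling LP. Everything numeric here is
# made up; the rate model (rate divided by cluster size) is an assumption,
# not the paper's actual SINR model.
from itertools import chain, combinations

import numpy as np
from scipy.optimize import linprog

K = 3                          # users in the cell
P = [1.0, 1.5, 2.0]            # transmit power P_nk per user (assumed)
base_rate = [4.0, 3.0, 2.0]    # log2(1 + SINR) when served alone (assumed)
L = [2.0, 2.0, 2.0]            # file demand sum_f L_f per user (assumed)
T = 10.0                       # total time budget

# Enumerate all 2^K - 1 non-empty clusters (this is the set S_n).
users = range(K)
clusters = list(chain.from_iterable(combinations(users, r) for r in range(1, K + 1)))

# Cost of running cluster s for one time unit: sum of member powers.
c = np.array([sum(P[k] for k in s) for s in clusters])

# Per-user rate in each cluster (toy model: rate shared across the cluster).
R = np.zeros((K, len(clusters)))
for j, s in enumerate(clusters):
    for k in s:
        R[k, j] = base_rate[k] / len(s)

# Constraints: R t >= L (written as -R t <= -L), and sum(t) <= T.
A_ub = np.vstack([-R, np.ones((1, len(clusters)))])
b_ub = np.concatenate([-np.array(L), [T]])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(clusters))
print(res.fun)                 # minimum power-time objective
```

Even in this toy case there are $2^3-1=7$ LP variables; with, say, 20 users per cell there would be over a million, which is exactly the blow-up the paper is fighting.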
How do they actually do this? Since we are solving for the vector of times $t_s$, one might expect to simply learn a mapping $g: \mathcal{P}\rightarrow \mathbb{R}_+^{|S_n|}$, i.e. given a set of parameters $p\in\mathcal{P}$ (values for $L_f$, $P_{nk}$, etc.), the network $g$ would give values for the cluster times $\mathcal{T}_S=(t_{s_1},\ldots,t_{s_{|S_n|}})$. However, they do not do this. Quoting from page 3:
In other words, the deep network is unable to directly generate solutions to the problem (at least in their case, for the networks they tried); the high output dimensionality in particular seemed to pose a problem. Instead, quoting again from page 3:
So instead of trying to predict the solution, they attempt to predict which users to ignore when computing the solution with an LP solver. In other words, the DNN outputs a binary vector $\mu$ of size $|\mathcal{K}_n|$, where $\mu_i$ decides whether user $i$ participates in the clustering. Quote (page 4):
So the deep network tells us which users can be ignored in the solution of the linear program. The parameter $\rho$ controls how many users will be ignored. This greatly reduces the cost of solving the linear program.
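A minimal sketch of why this helps: dropping a user removes every cluster containing it, so the number of LP variables falls from $2^K-1$ to $2^{K'}-1$, where $K'$ is the number of kept users. The $\mu$ vector below is hand-picked for illustration, not the output of a trained network:

```python
# Dropping users shrinks the cluster set, and hence the LP, exponentially.
from itertools import chain, combinations

def all_clusters(users):
    """All non-empty subsets of `users` (the LP has one variable per subset)."""
    return list(chain.from_iterable(combinations(users, r) for r in range(1, len(users) + 1)))

K = 10
mu = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # hypothetical DNN output: keep these 6 users
kept = [k for k in range(K) if mu[k] == 1]

full = all_clusters(list(range(K)))
reduced = all_clusters(kept)
print(len(full), len(reduced))        # 1023 vs 63 LP variables
```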
How do they solve the optimization problem P2?
They test two methods:
1. Solving the linear program directly via a solver in Matlab. Quote (page 4):

   You can read a little more about these algorithms here and here. You can also look at Matlab's linear program solver `linprog` here.

2. Using the DNN to compute $\mu$, which gives a list of users to remove, then solving the simplified LP as above.
How do they generate the dataset for training the deep neural network from the solution vector of the P2 problem?
Since they have an LP solver, they simply have to run it on instances of the problem to generate the training data. From page 4:
and page 3:
Recall that the deep network is supposed to predict which users will participate in the clustering, i.e. in the solution of P2. Since we have an optimal LP solver, given a bunch of problem instances, we can simply compute the optimal solution, and then check which users actually participate, i.e. which users belong to some cluster $s$ with $t_s > 0$.
In other words, given $m$ "problem sets", i.e. linear program parameter sets, $X=(x_1,\ldots,x_m)$, we can use the LP to generate $m$ binary vectors of which users participated in the clustering, $Y=(y_1,\ldots,y_m)$. Note that each $x_i$ is a list of parameter values, while $y_i$ is the binary $\mu$ vector for the problem described by $x_i$. The training set is then simply $(X,Y)$, where the neural network is a map $g:X\rightarrow Y$.
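Here is a sketch of that labeling step, under the assumption that "user $k$ participates" means user $k$ appears in some cluster that gets nonzero time in the optimal solution. The `clusters` and `t` values below stand in for the LP solver's output on one problem instance:

```python
# Turn one LP solution vector t into a binary participation label y (the mu
# vector). Assumption: a user participates iff some cluster containing it
# receives time above a small tolerance.

def participation_labels(t, clusters, num_users, tol=1e-9):
    """mu[k] = 1 iff user k appears in a cluster s with t_s > tol."""
    mu = [0] * num_users
    for t_s, s in zip(t, clusters):
        if t_s > tol:
            for k in s:
                mu[k] = 1
    return mu

# Example: 3 users, 7 clusters; only clusters (0,) and (2,) get time,
# so user 1 is labeled as not participating.
clusters = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
t = [0.7, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
print(participation_labels(t, clusters, 3))   # [1, 0, 1]
```

Running this over many randomly drawn parameter sets $x_i$ produces the $(x_i, y_i)$ pairs that the network is trained on.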
Hopefully that helps.