There are several papers I have read in which the authors deal with an optimization problem, and in most cases they add an extra term to improve their model. How can this be done and everything still work? Is it specific to the problem at hand, or are there general rules that need to be followed in order to apply it?
Here is an example: $$\min_{D,C}\|Y-DC\|_F^2 $$ This model describes sparse coding, where $D$ represents a dictionary to learn, $C$ the sparse codes and $Y$ the matrix of input signals; this is an unsupervised process. The term that was added is the following: $$\gamma J(C,L)$$ where $J$ denotes a penalty function and $L$ a matrix of labels. The equation becomes: $$\min_{D,C}\|Y-DC\|_F^2 + \gamma J(C,L)$$ in order to make the problem a supervised one.
Along the lines of Johan's response, and a bit more related to your dictionary-learning example: you can easily see that the objective without regularization is insensitive to scaling $C$ and $D$ inversely (replacing $D$ by $\alpha D$ and $C$ by $C/\alpha$ leaves $\|Y-DC\|_F^2$ unchanged). This can cause serious problems for the optimization algorithm, because one of the variables can explode in size while the other shrinks to zero. A natural way to prevent such scenarios is to regularize with the norm of one of the variables, which prevents the explosion and keeps the iterates of the optimization method bounded.
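A minimal sketch of this scale ambiguity (a hypothetical toy setup, not code from any paper): the reconstruction error is unchanged when $D$ is inflated and $C$ deflated by the same factor, but a Frobenius-norm penalty on one of the variables breaks the invariance.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 20))   # data matrix
D = rng.standard_normal((10, 5))    # dictionary
C = rng.standard_normal((5, 20))    # codes

def recon_error(D, C):
    """Unregularized objective ||Y - DC||_F^2."""
    return np.linalg.norm(Y - D @ C, "fro") ** 2

alpha = 1e6  # blow up D, shrink C by the same factor
# The objective cannot tell the two solutions apart:
assert np.isclose(recon_error(D, C), recon_error(alpha * D, C / alpha))

def regularized(D, C, gamma=1.0):
    """Same objective plus a penalty on the size of D."""
    return recon_error(D, C) + gamma * np.linalg.norm(D, "fro") ** 2

# The penalty breaks the invariance, so the exploded D is now worse:
print(regularized(D, C) < regularized(alpha * D, C / alpha))  # True
```

The same effect is obtained by penalizing $C$ instead, or by constraining the columns of $D$ to unit norm, as is common in dictionary learning.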
In general, regularization terms encode additional information that we have about the solution, information which may not be captured by the original objective. In the context of the example above, we might know that the final solution should be balanced in terms of the norms of $C$ and $D$; therefore, we limit the norms of those matrices.
In other cases you might know that your solution is going to be sparse. You can encourage your method to find sparser solutions by adding an $\ell_0$ "norm" regularizer that simply penalizes the objective for each non-zero entry. But such a regularizer makes the problem intractable in general. Therefore, people have replaced it with the better-behaved $\ell_1$ regularizer, for which there are some nice theoretical results in the statistics literature on recovering the exact support of the true model. (People use the $\ell_1$ regularizer to promote sparsity even in cases where no such statistical recovery results hold.)
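To see concretely how the $\ell_1$ penalty produces exact zeros, here is a small sketch (my own illustration, not from the question's model): for a single coefficient, $\min_c \tfrac12 (c-y)^2 + \lambda |c|$ has the closed-form "soft-thresholding" solution, which sets every coefficient with $|y| \le \lambda$ exactly to zero.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5*(c - y)^2 + lam*|c|, applied entrywise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.4, -2.0, 0.1, -0.3])
c = soft_threshold(y, lam=0.5)
# Entries with |y| <= 0.5 are set exactly to zero; the rest shrink by 0.5.
print(c)
```

This operator is the building block of proximal methods such as ISTA/FISTA, which are standard solvers for the $\ell_1$-regularized sparse-coding step.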
So, to answer your question in a nutshell: finding the right regularizer for a problem requires some intuition and is very problem-specific. But there are some general rules. When trying to add a regularizer to a problem, you need to make assumptions about the final solution that you want to recover, so you need to know the problem well. Then try to find a regularizer that promotes solutions with that behavior, and make sure the problem remains tractable with the chosen regularizer. One last step, which is generally not easy, is choosing the weight of the regularizer, i.e. $\gamma$; people usually use a cross-validation scheme to find $\gamma$.
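The last step can be sketched as a simple hold-out search over a grid of $\gamma$ values. Below is a hypothetical example using ridge regression (chosen because its regularized solution has a closed form); the split sizes and grid are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)

# Hold out part of the data for validation.
X_tr, X_val = X[:150], X[150:]
y_tr, y_val = y[:150], y[150:]

def ridge_fit(X, y, gamma):
    """Minimizer of ||Xw - y||^2 + gamma*||w||^2 (closed form)."""
    return np.linalg.solve(X.T @ X + gamma * np.eye(X.shape[1]), X.T @ y)

# Fit on the training split for each gamma, score on the validation split.
gammas = np.logspace(-3, 2, 20)
errors = [np.mean((X_val @ ridge_fit(X_tr, y_tr, g) - y_val) ** 2)
          for g in gammas]
best_gamma = gammas[int(np.argmin(errors))]
print(best_gamma)
```

Full $k$-fold cross-validation repeats this over several splits and averages the validation errors before picking $\gamma$.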