In order to find the local minima of a scalar function $p(x)$, $x\in \mathbb{R}^3$, I know we can use the gradient descent method: $$x_{k+1}=x_k-\alpha_k \nabla p(x_k)$$ where $\alpha_k$ is the step size and $\nabla p(x_k)$ is the gradient of $p$ evaluated at $x_k$.
My question is: what if $x$ is constrained to lie on the unit sphere, i.e., $\|x_k\|=1$? Then we actually want to find the local minima of $p(x)$ on a sphere. Can gradient descent be applied to constrained optimization? Can anyone give any suggestions? Thanks.

There's no need for penalty methods in this case. Compute the gradient, $g_k=\nabla p(x_k)$, project it onto the tangent plane, $h_k=g_k-(g_k\cdot x_k)x_k$, and normalize it, $n_k=h_k/\|h_k\|$. Now you can use $x_{k+1}=x_k\cos\phi_k + n_k\sin\phi_k$ and perform a one-dimensional search for $\phi_k$, just like in an unconstrained gradient search. The iterate stays on the sphere, and it locally follows the direction of maximal change in the standard metric on the sphere.
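For concreteness, here is a minimal numerical sketch of this scheme (my own illustration; the test objective $p(x)=x^\top Ax$, the crude grid search for $\phi_k$, and the function names are my choices, not part of the method itself):

```python
import numpy as np

def great_circle_step(p, x, n_phi=360):
    """One iteration: project the gradient onto the tangent plane at x,
    then search the resulting great circle for the best angle phi."""
    eps = 1e-6
    # central-difference approximation of g_k = grad p(x_k)
    g = np.array([(p(x + eps * e) - p(x - eps * e)) / (2 * eps)
                  for e in np.eye(len(x))])
    h = g - (g @ x) * x                  # tangential component h_k
    if np.linalg.norm(h) < 1e-12:        # stationary point on the sphere
        return x
    n = h / np.linalg.norm(h)            # unit tangent n_k
    # crude one-dimensional search over the whole great circle
    phis = np.linspace(0.0, 2 * np.pi, n_phi, endpoint=False)
    return min((np.cos(ph) * x + np.sin(ph) * n for ph in phis), key=p)

# Example: minimize x·Ax on the unit sphere; the constrained minimum is
# the smallest eigenvalue of A (here 1), attained at its eigenvector.
A = np.diag([3.0, 2.0, 1.0])
p = lambda x: x @ A @ x
x = np.ones(3) / np.sqrt(3)
for _ in range(50):
    x = great_circle_step(p, x)
```

Note that the update preserves $\|x_{k+1}\|=1$ exactly (up to rounding), since $x_k$ and $n_k$ are orthonormal.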
By the way, this can be generalized to the case where you're optimizing a set of $n$ vectors under the constraint that they're orthonormal. Compute all the gradients and project the resulting search direction onto the tangent surface by orthogonalizing all the gradients to all the vectors. Then diagonalize the matrix of scalar products between pairs of the projected gradients to find a coordinate system in which the gradients pair up with the vectors to form $n$ hyperplanes in which you can rotate while exactly satisfying the constraints and still travelling in the direction of maximal change to first order. The angles by which you rotate in the hyperplanes are multiples of a single parameter (with the multipliers determined by the eigenvalues), so this is again reduced to a one-dimensional search.
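Here is a sketch of that generalization under my reading of the construction (the helper names, the trace objective in the example, and the coarse line search are my own choices, not canonical):

```python
import numpy as np

def frame_step(X, G, t):
    """Rotate the orthonormal columns of X by angle parameter t in the
    paired hyperplanes described above. X is d-by-n with orthonormal
    columns; G holds the (negative, for descent) gradients as columns."""
    H = G - X @ (X.T @ G)              # orthogonalize gradients to all vectors
    lam, V = np.linalg.eigh(H.T @ H)   # diagonalize the scalar-product matrix
    Xp, Hp = X @ V, H @ V              # paired coordinate system
    for i, l in enumerate(lam):
        s = np.sqrt(max(l, 0.0))       # rotation-rate multiplier (eigenvalue)
        if s > 1e-12:
            Xp[:, i] = Xp[:, i] * np.cos(s * t) + (Hp[:, i] / s) * np.sin(s * t)
    return Xp @ V.T                    # undo the change of basis

# Example (my choice): minimize trace(X^T A X) over orthonormal 2-frames;
# the minimum is the sum of the two smallest eigenvalues of A.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
X, _ = np.linalg.qr(rng.standard_normal((5, 2)))
p = lambda X: np.trace(X.T @ A @ X)
for _ in range(100):
    G = -2.0 * A @ X                   # negative gradient of p
    # one-dimensional search over the single parameter t
    X = min((frame_step(X, G, t) for t in np.linspace(0.0, 1.0, 101)), key=p)
```

The key invariants are that $H^\top X = 0$ after the projection and that the rotated columns of $H$ are mutually orthogonal with norms $\sqrt{\lambda_i}$, so rotating each pair in its own plane keeps the frame exactly orthonormal.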
[Edit:] (In response to the suggestion in the comment to use $x_{k+1}=\frac{x_k-\alpha_kg_k}{\|x_k-\alpha_kg_k\|}$ instead)
This is slightly disadvantageous compared to the great-circle solution. You're effectively searching on the same great circle, except in this approach you can only generate one half of it. If the optimum lies in the other half, then the optimal value of your parameter $\alpha_k$ will be $\pm\infty$, whereas in my formulation the compact search space $\phi_k\in[0,2\pi]$ maps to the entire great circle. Also, this formulation doesn't generalize to the case of $n$ orthonormal vectors, since you'd have to perform an orthonormalization at each step.
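The half-circle claim is easy to check numerically. In this sketch (my own illustration, with an arbitrarily chosen $x$ and $g$), the angular position of the normalized update on the great circle is tracked as $\alpha_k$ sweeps a large range, and the reachable angles span strictly less than $\pi$ radians:

```python
import numpy as np

# Check that x' = (x - a*g)/||x - a*g|| reaches only one half of the
# great circle spanned by x and the unit tangent n.
x = np.array([1.0, 0.0, 0.0])
g = np.array([0.3, 1.0, 0.0])             # an arbitrary gradient
h = g - (g @ x) * x                       # tangential component
n = h / np.linalg.norm(h)

def angle(alpha):
    y = x - alpha * g
    y = y / np.linalg.norm(y)
    return np.arctan2(y @ n, y @ x)       # angular position on the circle

angles = [angle(a) for a in np.linspace(-1e6, 1e6, 20001)]
span = max(angles) - min(angles)          # stays strictly below pi
```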
How to determine $\alpha_k$ is not a (new) problem; in every gradient search you need to have some way to determine the optimum in a one-dimensional search.