I'm working through the introductory material of a Machine Learning course (Stanford's, to be precise), and I've noticed that in the lecture notes by Stanford's Andrew Ng (see page 4), the cost function $J(\theta)$ is defined as:
$$ J(\theta) = {1\over 2} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)})^2 $$
Comparing this with what I've found elsewhere, in particular this online posting of a batch gradient descent implementation by H Kong, the definition there is:
$$ J(\theta) = {1\over 2m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)})^2 $$
The key difference is that the latter definition multiplies the RHS of the former expression by $1\over m$.
So, given these two different definitions of the cost function, both of which claim to be the least-squares cost function, how do I determine which one to adopt in my own work? What formal criteria are there for choosing between these two, or any other potential variation of the cost function, for that matter?
Also, what is the significance of including that $1\over m$ factor in the latter definition?
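To make the comparison concrete, here is a small sketch I put together (the toy data and function names are my own, not from either source). It evaluates both versions of $J(\theta)$ over a grid for a one-parameter model $h_\theta(x) = \theta x$, to check whether the $1\over m$ factor changes which $\theta$ minimizes the cost:

```python
import numpy as np

# Toy data for a single-feature linear model h_theta(x) = theta * x
# (these values are made up purely for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m = len(x)

def cost_ng(theta):
    # Ng's notes: J(theta) = (1/2) * sum of squared errors
    return 0.5 * np.sum((theta * x - y) ** 2)

def cost_mean(theta):
    # The 1/m variant: J(theta) = (1/(2m)) * sum of squared errors
    return cost_ng(theta) / m

# Grid search over theta; both costs are minimized at the same point,
# since dividing by the constant m rescales J without moving its argmin.
thetas = np.linspace(0.0, 4.0, 401)
best_ng = thetas[np.argmin([cost_ng(t) for t in thetas])]
best_mean = thetas[np.argmin([cost_mean(t) for t in thetas])]
print(best_ng, best_mean)
```

As far as I can tell, both versions report the same minimizing $\theta$, which suggests the $1\over m$ only rescales the cost (and hence the gradient magnitude), rather than changing the solution; I'd still like to understand the formal reasons for preferring one form.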