You have left out one important part of your question! "A^TAx = A^Tb" is a "least squares solution" to what problem??
The "problem" is, of course, to solve Ax= b. If this were a basic algebra course, we would say "divide both sides by A". In matrix algebra, that would be "multiply both sides by the inverse of A". Here, we cannot do that because A does NOT have an inverse- its determinant is 0.
In terms of more general linear transformations from a vector space U to a vector space V, Ax = b has a unique solution for every b in V if and only if A is both "one to one" and "onto". One can show that a linear transformation from U to V always maps U onto a subspace of V, its "range". If A is not "onto", that range is a proper subspace of V, and Ax = b has a solution (a unique one if A is one to one, an infinite number of them if not) if and only if b happens to lie in that subspace.
If A is not invertible, so that it maps U onto some proper subspace of V, and b is not in that subspace, there is no solution to Ax = b. In that case, we have to ask "what is the best we can do?": can we find an x so that Ax is closest to b?
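To make that concrete, here is the same made-up singular matrix again: every Ax is a multiple of (1, 2), so whether Ax = b is solvable depends entirely on whether b lies on that line.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
# Every product Ax is a multiple of (1, 2): the range of A is a
# one-dimensional proper subspace of R^2.
b_in  = np.array([3.0, 6.0])   # in the range: Ax = b has solutions
b_out = np.array([1.0, 0.0])   # not in the range: no solution at all

print(A @ np.array([3.0, 0.0]))   # [3. 6.]  -> one solution for b_in
print(A @ np.array([1.0, 1.0]))   # [3. 6.]  -> another: infinitely many
```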
To talk about "closest", we have to have a notion of distance, of course, so we have to have a "metric", perhaps a metric that can be derived from a "norm", and perhaps a norm that can be derived from an "inner product". Now, let's go back from "linear transformation" to "n by n matrix", because an n by n matrix is a linear transformation from R^n to R^n, and there is a "natural" inner product on R^n, the "dot product".
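In numpy terms, just as a sanity check of those definitions, the norm comes from the dot product and the metric comes from the norm:

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([0.0, 1.0, 0.0])

# Norm from the inner product: |u| = sqrt(<u, u>)
print(np.sqrt(u @ u), np.linalg.norm(u))   # both 3.0

# Metric from the norm: distance from u to v is |u - v|
print(np.sqrt((u - v) @ (u - v)))
```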
Given inner products on two vector spaces, U and V, and a linear transformation, A, from U to V, the "adjoint" of A, A^T, is defined as the linear transformation from V back to U such that for any u in U and any v in V, <Au, v> = <u, A^Tv> (notice that since Au is in V and A^Tv is in U, the first of those inner products is taken in V and the second in U). When U and V are R^n with the dot product, the "adjoint" of matrix A is its transpose, which is why I use the notation "A^T".
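That defining identity is easy to verify numerically; here is a sketch with a random matrix (the numbers are my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
u = rng.standard_normal(3)
v = rng.standard_normal(3)

# Under the dot product the adjoint of A is its transpose:
# <Au, v> = <u, A^T v> for every u and v.
print(np.allclose((A @ u) @ v, u @ (A.T @ v)))   # True
```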
Now, using the norm defined by that inner product (|u| = \sqrt{<u, u>}) and the metric defined by that norm (distance from u to v is |u - v| = \sqrt{<u - v, u - v>}), we get the usual "geometric" fact that the shortest distance from a point p to a subspace is along the line through p perpendicular to the subspace. That is, if Ax is the vector in the subspace (the range of A) closest to b, we must have Ax - b perpendicular to the subspace which, in turn, means that if v is any vector in the subspace, that is, v = Au for some u, we must have <v, Ax - b> = <Au, Ax - b> = 0. The vectors in that inner product lie in the "range" of A, but we can use the "adjoint" (transpose) to cast it back to the domain: <Au, Ax - b> = <u, A^T(Ax - b)> = <u, A^TAx - A^Tb> = 0.
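You can watch that perpendicularity happen numerically. Here is a sketch; I use a random rectangular A with full column rank so that A^TA is invertible (those assumptions are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))   # tall: b will generally be outside range(A)
b = rng.standard_normal(5)

# Solve the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual Ax - b is perpendicular to the whole range of A:
# A^T (Ax - b) = 0, i.e. <Au, Ax - b> = 0 for every u.
print(A.T @ (A @ x - b))   # ~ [0. 0. 0.] up to rounding
```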
But u could be any member of the domain, and the only vector whose inner product with every u is 0 is the zero vector, so we must have A^TAx - A^Tb = 0, that is, A^TAx = A^Tb.
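As a check that this really is the least squares solution, the x from the normal equations agrees with what numpy's own least squares routine returns (same made-up A and b as above):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))   # True
```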
Of course, this is called the "least squares" solution because we are minimizing the distance |Ax - b| = \sqrt{<Ax - b, Ax - b>}, and minimizing that is the same as minimizing its square, which is exactly the sum of the squares of the components of Ax - b.
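And, as one last sketch, no nearby x does any better: perturbing the normal-equations solution can only increase the sum of squares (again with my made-up A and b):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = np.linalg.solve(A.T @ A, A.T @ b)

best = np.sum((A @ x - b) ** 2)   # sum of squared residuals at the solution
worse = min(np.sum((A @ (x + 0.1 * rng.standard_normal(3)) - b) ** 2)
            for _ in range(1000))
print(best < worse)               # True: every perturbation does worse
```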