I've found it helpful to think about the metric tensor in terms of vector dot products, and a cooresponding basis.
You can cut relativity completely out of the question, and ask the same question for Euclidian space, where the metric tensor it the identity matrix when you pick an orthonormal basis.
That diagonality is due to orthogonality conditions of the basis chosen. For, example, in 3D we can express vectors in terms of an
orthonormal frame, but if we choose not to, say picking e_1 + e_2, e_1-e_2, and e_1 + e_3 as our basis vectors then how do we calculate the coordinates?
The trick is to calculate, or assume calculated, an alternate set of basis vectors, called the reciprocal frame. Provided the initial set of vectors spans the space, one can always calculate (and that part is a linear algebra exercise) this second pair such that they meet the following relationships:
<br />
e^i \cdot e_j = {\delta^i}_j<br />
So, if a vector is specified in terms of the e_i
<br />
x = \sum e_j a_j<br />
Dotting with e^i one has:
<br />
x \cdot e^i = \sum (e_j a_j) \cdot e^i = \sum {\delta^j}_i a_j = a_i<br />
It is customary to write a_i = x^i, which allows for the entire vector to
be written in the mixed upper and lower index method where sums are assumed:
<br />
x = \sum e_j x^j = e_j x^j<br />
Now, if one calculates dot product here, say with x, and a second vector
<br />
y = \sum e_j y^j<br />
you have:
<br />
x \cdot y = \sum (e_j \cdot e_k) x^j y^k<br />
The coefficient of this x^j y^k term is symmetric, and if you choose, you
can write g_{jk} = e_j \cdot e_k, and you have the dot product in
tensor form:
<br />
x \cdot y = \sum g_{jk} x^j y^k = g_{jk} x^j y^k<br />
Now, for relativity, you have four instead of three basis vectors, so if you choose your spatial basis vectors orthonormally, and a timelike basis vector normal to all of those (ie: no mixing of space and time vectors in anything but a Lorentz fashion), then you get a diagonal metric tensor. You can choose not to work in an "orthonormal" spacetime basis, and a non-diagonal metric tensor will show up in all your dot products. That decision is perfectly valid, just makes everything harder. When it comes down to why, it all boils down to your choice of basis.
Now, just like you can think of a rotation as a linear transformation that preserves angles in Euclidian space, the Lorentz transformation preserves the spacetime relationships appropriately. So, if one transforms from a "orthonormal" spacetime frame to an alternate "orthonormal" spacetime frame (and a Lorentz transformation is just that) you still have the same "angles" (ie: dot products) between an event coordinates, and the metric will still be diagonal as described. This could be viewed as just a rather long winded way of saying exactly what jdstokes said, but its the explanation coming from somebody who is also just learning this (so I'd need such a longer explanation if I was explaining to myself).