Well, I can tell you how the Dirac function was defined by Dirac, and by mathematicians in Functional Analysis, 40 years ago.
--------------- DIRAC (from Principles of Quantum Mechanics)
First, it's not a function at all. Dirac called it an "improper function", and defined it as 0 everywhere but x=0, where it's undefined; you can call it infinity if you like. The defining characteristic is: the integral (from -inf to +inf, or generally any limits that include 0) is 1. I'll call it dirac(x).
Its most important property is that when integrated against a function f the integral is f(0).
Dirac made the concept (more or less) rigorous by imagining a function which is defined in a small area (bounded by epsilon) around 0, and whose integral is 1. It must have no "unnecessarily wild variations". Parametrize this function by epsilon to get a family of such functions, let epsilon go to 0. The limit is dirac(x). Previous posters have given examples of such families. To make it (more or less) rigorous we integrate a function f against this parametrized family and take the limit as epsilon goes to 0. However we almost never bother, since the answer will be simply f(0).
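Here is a rough numerical sketch of that limiting procedure. The triangular bump `hat`, the midpoint-rule integrator and the choice f = cos are all my own assumptions for illustration; Dirac doesn't prescribe any particular family.

```python
import math

def hat(x, eps):
    """Triangular bump supported on [-eps, eps] with unit integral (assumed family)."""
    return max(0.0, 1.0 - abs(x) / eps) / eps

def smeared_value(f, eps, n=20_000):
    """Midpoint-rule integral of hat(x, eps) * f(x) over [-eps, eps]."""
    h = 2.0 * eps / n
    xs = (-eps + (i + 0.5) * h for i in range(n))
    return h * sum(hat(x, eps) * f(x) for x in xs)

if __name__ == "__main__":
    # As eps shrinks the printed values should approach cos(0).
    for eps in (1.0, 0.1, 0.01):
        print(eps, smeared_value(math.cos, eps))
```

For this family the gap between the integral and f(0) shrinks roughly like eps squared, which is why nobody bothers to take the limit by hand.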
Dirac also noted it can be considered the differential coefficient of the Heaviside step function.
By the way the derivative of dirac(x) itself is a dipole or "unit doublet".
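Both statements can be checked by integrating by parts against a function f. This is a sketch under my own assumptions: f is smooth and vanishes at infinity, H is the Heaviside step function, and \delta stands for dirac(x).

```latex
% Derivative of the step function acts like dirac(x):
\int_{-\infty}^{\infty} H'(x)\, f(x)\, dx
  = \bigl[ H(x) f(x) \bigr]_{-\infty}^{\infty}
    - \int_{-\infty}^{\infty} H(x)\, f'(x)\, dx
  = -\int_{0}^{\infty} f'(x)\, dx
  = f(0)

% Derivative of dirac(x) (the "unit doublet") picks out minus the slope at 0:
\int_{-\infty}^{\infty} \delta'(x)\, f(x)\, dx
  = -\int_{-\infty}^{\infty} \delta(x)\, f'(x)\, dx
  = -f'(0)
```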
Dirac emphasized that this improper function only made sense as a kernel. dirac(x) itself is only a shorthand notation meaning: perform the appropriate integration. But, as long as we're careful, we can often treat it like a function:
"In quantum theory whenever an improper function appears, it will be something to be used ultimately in an integrand."
"The use of improper functions does not involve any lack of rigor in the theory, but is merely a convenient notation ..."
"We can often use an improper function as though it were an ordinary continuous function ..."
He gives various identities such as x dirac(x) = 0, always emphasizing that it means: if each expression is used in an integration it will give the same answer.
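A quick numerical sanity check of x dirac(x) = 0, read in exactly that "same answer under the integral" sense. The triangular family `hat`, the helper `pair` and the choice f = exp are assumptions of mine, picked only so the result isn't zero by symmetry alone.

```python
import math

def hat(x, eps):
    """Triangular bump on [-eps, eps] with unit integral (an assumed family)."""
    return max(0.0, 1.0 - abs(x) / eps) / eps

def pair(g, f, eps, n=20_000):
    """Midpoint-rule integral of g(x) * hat(x, eps) * f(x) over [-eps, eps]."""
    h = 2.0 * eps / n
    xs = (-eps + (i + 0.5) * h for i in range(n))
    return h * sum(g(x) * hat(x, eps) * f(x) for x in xs)

if __name__ == "__main__":
    for eps in (1.0, 0.1, 0.01):
        with_x = pair(lambda x: x, math.exp, eps)     # "x dirac(x)" against f
        without = pair(lambda x: 1.0, math.exp, eps)  # "dirac(x)" against f
        print(eps, with_x, without)
```

The first column of results should shrink toward 0 while the second settles near f(0) = 1, which is all the identity claims.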
He introduced dirac(x) to deal with continuous ranges of eigenvalues, like position. In the discrete case dirac(x) is simply the Kronecker delta, about which there's no problem. But we need to generalize to the continuous case. For discrete eigenvectors of course the norm is 1; for different eigenvectors, the inner product is 0. But for continuous the norm can't be made 1. However he wanted it as similar as possible, so using dirac(x) he's able to make the inner product of two different (continuous-range) eigenvectors 0; and the norm of one of them, a finite number "c" which depends on the specific eigenvector. Having introduced it for this purpose, it turns out to be useful in many other contexts as well.
By including dirac(x) as an eigenvector (with infinite length) he's no longer working in L2 space; today we call it a "rigged Hilbert space" (the construction usually starts from a nuclear space); but Dirac gave it no name.
By the way, somebody said the square-integrable functions are not a Hilbert Space?? Sure they are.
Dirac blew by some potential problems by noting that in reality physicists never deal with an exact value of x, it's always an imprecise number covering a small range. So in practice dirac(x) never actually arises. When you actually compute numbers based on continuous eigenvalues you must deal with a (small) range of them.
He also mentions the interesting point that although other vectors can be expressed as an integral over the continuous range (using expanded Identity) the basis vectors themselves can't, being atomic (nor a finite sum of them). So to express any possible vector he uses an integral over the basis vectors, plus a summation.
He uses dirac(x) to deal with generalized matrices, generalized diagonal matrices, relative probability amplitudes, and so forth. He uses d(log x)/dx = 1/x - i pi dirac(x) in scattering theory.
That summarizes Dirac's work with dirac(x). His attitude was, that's good enough for physicists.
--------------- MATHEMATICIANS
Unfortunately it's not good enough for mathematicians. General comment, there are uncountably many ways to formalize Dirac's brilliant ideas; but for physicists, as far as I know, Dirac's approach is good enough. If you insist on formality, just pick the simplest formalization and leave it at that. 99% of the elaboration of measure theory has absolutely no relevance to Quantum Mechanics.
Measure:
Dirac is very clear about what the (improper) function is: an atomic, or discrete, measure. On Wikipedia I see they actually have something called the "Dirac measure" which is precisely for this purpose. It's a type of Radon measure. There are many other formal definitions one could use.
A measure, of course, only makes sense as a kernel to integrate a function against (see "quadratic forms").
Distribution:
The linear map that takes a function f to the result of that integration is called the distribution associated with the generalized function, in our case dirac(x).
Now a big problem arises - abuse of notation. People use the same symbol for the distribution as for the kernel (or measure, or generalized function). Thus I see the statement "dirac(f) = f(0)". After all once you start talking about the distribution you don't need to refer to the real dirac(x) function often. But don't get confused! The distribution is not, in fact, the actual dirac function.
At this point I'm almost done because I see at least two of you are on top of this topic (although making some curious errors?) and everyone can go read Wikipedia. But I don't understand, why do physicists care? Since you're dealing only with physical functions, you don't have to prove existence, smoothness, integrability, etc - the proof is physically right in front of you. Dirac's approach seems perfect as it is. Obviously, I've got a lot to learn about modern QM!
Anyway a distribution, dirac(f), is defined on a space of Test Functions. These are C-infinity with compact support. Basically we make them very well-behaved so as to have no problems integrating against generalized functions.
dirac(f) is a linear functional that lives in the continuous dual space, NOT the regular algebraic dual space.
To define its value we use a family of Test Functions. They're like Dirac's families mentioned above, but only the smoothest are used. Test functions converge in a supremum-norm sense (uniformly, together with all their derivatives, with supports staying inside one fixed bounded set), so that dirac(f) is continuous (on the space of Test Functions of course).
dirac(f) can be defined on any L2 function but not directly. The Test functions are dense so you can converge to any L2 function; in fact you can always find one Test function which is close enough to the wave function (or ket). However you have to chop it; it can't extend out to infinity, even with a tail that dies off very quickly.
The reason not to define dirac(f) directly on L2 is, it wouldn't be continuous. One can't help, really, considering dirac(f) to be defined on L2 but be careful applying any distribution theorems - they'll usually depend on continuity, other nice properties.
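To see the lack of continuity concretely, here is a small sketch. The narrowing Gaussian bumps are my own choice of example (they aren't compactly supported, but a chopped version behaves the same way): their L2 norm shrinks to zero while their value at 0 stays at 1, so "evaluate at 0" can't be continuous in the L2 norm.

```python
import math

def bump(x, n):
    """Smooth bump exp(-(n x)^2); equals 1 at x = 0 for every n."""
    return math.exp(-(n * x) ** 2)

def l2_norm(n, half_width=10.0, steps=20_000):
    """Midpoint-rule estimate of the L2 norm of bump(., n) over the real line."""
    # The squared bump is negligible outside [-half_width/n, half_width/n].
    a = -half_width / n
    h = (2.0 * half_width / n) / steps
    total = sum(bump(a + (i + 0.5) * h, n) ** 2 for i in range(steps))
    return math.sqrt(h * total)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        # The norm falls off like 1/sqrt(n); the value at 0 does not move.
        print(n, l2_norm(n), bump(0.0, n))
```

So a sequence can converge to the zero function in L2 while dirac applied to it stays at 1 instead of going to 0.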
Note there can be a ket f with dirac(f) NOT f(0). That's because the L2 function could have a "bad" point there which will be smoothed out when you approximate it with Test functions. Another way to see that (as mentioned by posters) is that the L2 equivalence classes ignore any set of "bad" points of Lebesgue measure zero (a countable set, for example), since such sets don't affect the square-integrable norm. I can't imagine why a physicist would care about such non-physical functions, though.
I hope this helps clear up confusion.