The math helps a lot in understanding what's going on here (math is always good for understanding).
Let's first consider the simplest case of a two-plate capacitor with vacuum between its plates. You can think of charging it as transporting charges (say electrons) from one plate to the other. Due to charge conservation, the plates carry charges of equal magnitude and opposite sign. The charge transport needs energy because the more you charge the plates, the larger the electric field between them becomes, leading to a force against further charge transport. Due to energy conservation, valid for the motion of charges in static electric fields, it doesn't matter how you move the charge from one plate to the other: you always need the same energy to reach a certain charge state of the capacitor.
Suppose one plate (sitting at [itex]x=0[/itex]) is already charged by an amount [itex]+Q[/itex]. Then the other plate (sitting at [itex]x=d[/itex]), parallel to the first, necessarily carries the charge [itex]-Q[/itex]. In the stationary state both charges are located on the inner surfaces of the plates, facing the gap. Applying Gauß's law to a box parallel to the plates, with one side within the conducting plate (where the field vanishes) and the other side in the vacuum between the plates, together with the symmetry assumption (neglecting the edge effects of the finite plates, i.e., assuming that the distance [itex]d[/itex] between the plates is much smaller than their size), leads to the electric field
[tex]\vec{E}=\frac{Q}{\epsilon_0 A} \vec{e}_x,[/tex]
where [itex]A[/itex] is the area of the plates.
If you want to transport another infinitesimal amount of (negative!) charge [itex]-\mathrm{d} Q[/itex] from the left plate to the right plate, you have to do work against the force [itex]\vec{F}=-\mathrm{d} Q \vec{E}[/itex], which pulls this charge back towards the left plate. Since the path along which you carry the charge doesn't matter, you can take a straight line along [itex]\vec{E}[/itex] to get the work needed to do that:
[tex]\mathrm{d} W=\mathrm{d} Q d E_x =\mathrm{d} Q d \frac{Q}{\epsilon_0 A}.[/tex]
Since you start from an uncharged capacitor, to reach a total charge [itex]Q[/itex] you have to integrate this expression over the charge from [itex]0[/itex] to [itex]Q[/itex]:
[tex]W=\int_0^Q \mathrm{d} Q' \frac{Q'd}{\epsilon_0 A} = \frac{Q^2 d}{2 \epsilon_0 A}.[/tex]
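If you'd like to see the integral at work, here is a minimal Python sketch that adds up the work [itex]\mathrm{d} W[/itex] for many small charge steps and compares the sum with [itex]Q^2 d/(2 \epsilon_0 A)[/itex]. The plate area, the plate distance, the final charge, and the function names are made-up illustration values, not anything from a real setup.

```python
# Numerical check of W = Q^2 d / (2 eps0 A): add up the work for many small
# charge steps. All numbers below are assumed illustration values.

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def charging_work_numeric(q_final, area, d, steps=10_000):
    """Sum dW = dQ * d * Q' / (eps0 * A) using the midpoint rule."""
    dq = q_final / steps
    work = 0.0
    for i in range(steps):
        q_mid = (i + 0.5) * dq  # charge already on the plate during this step
        work += dq * d * q_mid / (EPS0 * area)
    return work


def charging_work_exact(q_final, area, d):
    """Closed-form result of the integral from 0 to Q."""
    return q_final**2 * d / (2.0 * EPS0 * area)


area = 1.0e-2   # plate area in m^2 (assumed)
d = 1.0e-3      # plate distance in m (assumed)
q_final = 1.0e-9  # final charge in C (assumed)

print("summed work:", charging_work_numeric(q_final, area, d), "J")
print("closed form:", charging_work_exact(q_final, area, d), "J")
```

Since the integrand is linear in the charge, the midpoint sum agrees with the closed form up to rounding errors, whatever number of steps you choose.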
We can easily rewrite this in terms of the electric field finally reached within the capacitor:
[tex]E_x=\frac{Q}{\epsilon_0 A}.[/tex]
Using this to eliminate [itex]Q[/itex] from our formula for the total work done in carrying the charges from one plate to the other, we find
[tex]W=\frac{\epsilon_0}{2} E_x^2 A d.[/tex]
Now [itex]A d[/itex] is the volume between the plates and [itex]\frac{\epsilon_0}{2} \vec{E}^2[/itex] is the energy density of the electric field! Thus the total work done to carry the charges from one plate to the other is now stored as field energy in the electric field between the plates.
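As a quick sanity check of this statement, the following Python sketch compares the charging work [itex]Q^2 d/(2 \epsilon_0 A)[/itex] with the energy density times the volume, using [itex]E_x=Q/(\epsilon_0 A)[/itex] as in the expression for [itex]\mathrm{d} W[/itex]. The numbers and the function names are again just assumed illustration values.

```python
# Compare the charging work with (energy density) x (volume between the plates).
# Plate area, distance, and charge are assumed illustration values.

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def field_from_charge(q, area):
    """Field between the plates, E_x = Q / (eps0 * A), edge effects neglected."""
    return q / (EPS0 * area)


def work_from_charge(q, area, d):
    """Charging work W = Q^2 d / (2 eps0 A)."""
    return q**2 * d / (2.0 * EPS0 * area)


def field_energy(e_x, area, d):
    """Energy density eps0/2 * E^2 times the volume A*d."""
    return 0.5 * EPS0 * e_x**2 * area * d


area = 1.0e-2  # m^2 (assumed)
d = 1.0e-3     # m (assumed)
q = 1.0e-9     # C (assumed)

e_x = field_from_charge(q, area)
print("work done while charging:", work_from_charge(q, area, d), "J")
print("energy stored in the field:", field_energy(e_x, area, d), "J")
```

Both numbers agree up to rounding, as they must, since one formula is just the other one rewritten.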
If you put a dielectric between the plates, then for not too high fields the response of the medium is a polarization: the bound charges inside the medium are slightly displaced from their equilibrium positions. This needs further work against the binding forces of these charges, which is then stored in addition to the field energy, i.e., in the polarization of the medium. For the same field [itex]E_x[/itex] this leads to an additional factor [itex]\epsilon_r[/itex] in the formula for the work:
[tex]W=\frac{\epsilon_r \epsilon_0}{2} E_x^2 A d.[/tex]
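To connect this with the formula you may know from circuit theory, here is a small Python sketch. It assumes the standard parallel-plate capacitance [itex]C=\epsilon_r \epsilon_0 A/d[/itex] and the voltage [itex]U=E_x d[/itex], neither of which was introduced above, and checks that [itex]C U^2/2[/itex] gives the same energy as the field formula. The value of [itex]\epsilon_r[/itex] and all other numbers are made up for illustration.

```python
# Field energy with a dielectric versus the circuit formula C U^2 / 2.
# Assumptions not taken from the text above: C = eps_r * eps0 * A / d and
# U = E_x * d. All numerical values are illustration values only.

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def field_energy_dielectric(e_x, area, d, eps_r):
    """W = eps_r * eps0 / 2 * E_x^2 * A * d."""
    return 0.5 * eps_r * EPS0 * e_x**2 * area * d


def capacitance(area, d, eps_r):
    """Parallel-plate capacitance, edge effects neglected."""
    return eps_r * EPS0 * area / d


def energy_from_voltage(voltage, area, d, eps_r):
    """Circuit formula W = C U^2 / 2."""
    return 0.5 * capacitance(area, d, eps_r) * voltage**2


area = 1.0e-2  # m^2 (assumed)
d = 1.0e-3     # m (assumed)
eps_r = 4.0    # relative permittivity (assumed)
e_x = 1.0e4    # field in V/m (assumed)

voltage = e_x * d  # U = E_x * d for the homogeneous field
print("field formula:", field_energy_dielectric(e_x, area, d, eps_r), "J")
print("C U^2 / 2:    ", energy_from_voltage(voltage, area, d, eps_r), "J")
```

At fixed field, i.e., fixed voltage, the stored energy simply scales with [itex]\epsilon_r[/itex], which is the additional factor in the formula.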
