In a very simplified sense, we define a set of functions that together can be used to reproduce any image. Strictly, this requires an infinite set, so we truncate the number of functions and settle for a good approximation. These functions are the basis functions, and they behave much like interpolating functions. Being orthogonal means that each one describes, more or less, an aspect of the image that the others do not. That is, there is no "overlap" between the information carried by one basis function and that carried by all the others.
So we have a set of interpolating functions that efficiently describe any image we may have to a good approximation. Then we only need to know the amplitudes of these functions to reconstruct an image.
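As a rough sketch of that idea, the snippet below uses one particular orthonormal set, the discrete cosine (DCT-II) vectors, on a single made-up row of pixel values. The choice of cosine functions, the sample `row`, and the helper names (`dct_basis`, `amplitudes`, `reconstruct`) are all my own assumptions for illustration; other orthogonal sets work the same way. Each amplitude is just the dot product of the signal with one basis function, and truncating the sum gives the approximation.

```python
import math

def dct_basis(n):
    """Return n orthonormal DCT-II basis vectors, each of length n."""
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                      for i in range(n)])
    return basis

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def amplitudes(signal, basis):
    """Amplitude of each basis function: the projection of the signal onto it."""
    return [dot(signal, b) for b in basis]

def reconstruct(amps, basis, keep):
    """Rebuild the signal from only the first `keep` basis functions."""
    n = len(basis[0])
    out = [0.0] * n
    for k in range(keep):
        for i in range(n):
            out[i] += amps[k] * basis[k][i]
    return out

# Hypothetical row of pixel intensities (illustrative data only).
row = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0, 37.0, 45.0]

basis = dct_basis(len(row))
amps = amplitudes(row, basis)

full = reconstruct(amps, basis, keep=len(row))   # all functions: exact up to rounding
approx = reconstruct(amps, basis, keep=3)        # truncated set: an approximation

max_err_full = max(abs(a - b) for a, b in zip(row, full))
max_err_approx = max(abs(a - b) for a, b in zip(row, approx))
print("max error, all 8 functions:", max_err_full)
print("max error, first 3 functions:", max_err_approx)
```

For a real image the same thing is done in two dimensions, but the principle is unchanged: store the amplitudes, and keep only as many as the accuracy you need requires.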
There is more to it than this, but that's the basic idea. You can also note that the Cartesian unit vectors form a basis for three-dimensional Cartesian space. There are three basis vectors: the x, y, and z unit vectors. They are orthogonal because the dot product between any two of them is zero. If we want to describe the location of any point, we simply state its coefficients along the three basis vectors.
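The Cartesian case is small enough to check by hand, or with a few lines of code. This is only an illustrative sketch; the `point` and the variable names are made up for the example.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# The three Cartesian basis vectors.
x_hat = (1.0, 0.0, 0.0)
y_hat = (0.0, 1.0, 0.0)
z_hat = (0.0, 0.0, 1.0)
cartesian = (x_hat, y_hat, z_hat)

# Orthogonality: every pair of distinct basis vectors has zero dot product.
pair_dots = [dot(x_hat, y_hat), dot(x_hat, z_hat), dot(y_hat, z_hat)]

# Hypothetical point, chosen only for illustration.
point = (2.0, -1.0, 3.5)

# The coefficient along each basis vector is the dot product with it.
coeffs = [dot(point, e) for e in cartesian]

# Summing coefficient * basis vector gives the point back.
rebuilt = tuple(sum(c * e[i] for c, e in zip(coeffs, cartesian))
                for i in range(3))

print(pair_dots, coeffs, rebuilt)
```

The coefficients here play the same role as the amplitudes of the basis functions in the image case.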