This is a problem with dimensional regularization. It's very convenient technically but not very intuitive. It's much simpler to look at renormalization along the lines of the BPHZ formalism, i.e., you define a renormalization scheme by fixing the divergent proper vertex functions at a certain point in momentum and mass space.
In ##\phi^4## theory (apart from the vacuum diagrams, which are irrelevant for vacuum QFT) the only divergent proper vertex functions are the two- and four-point functions, giving rise to wave-function renormalization and mass counterterms as well as a coupling-constant counterterm, because the two-point function (self-energy) is quadratically and the four-point function logarithmically divergent.
In momentum space we can write the self-energy as ##\Sigma(s,m^2)##, where ##s=p^2## is the four-momentum squared (by Lorentz invariance, and because the self-energy has only one momentum argument, the only available arguments of a scalar function are ##p^2## and the mass squared, which is itself a scalar). The divergent pieces are thus ##\Sigma(s)## itself as well as ##\partial_s \Sigma(s)##. Thus the renormalization scheme is defined as soon as I define how to subtract these pieces. The four-point function can be written as ##\Gamma^{(4)}(s,t,u)## with the usual Mandelstam variables ##(s,t,u)## for the ##2 \rightarrow 2## (elastic) scattering process.
To see how this works for different schemes, let's discuss some examples.
BPHZ scheme (works only for ##m^2>0##)
---------------------------------------------------
The BPHZ scheme subtracts the divergent pieces at the point with all external momenta set to 0, i.e.,
$$\Sigma_{\text{BPHZ}}(s)=\Sigma_{\text{reg}}(s)-\Sigma_{\text{reg}}(s=0)-s \partial_s \Sigma_{\text{reg}}(s=0).$$
$$\Gamma_{\text{BPHZ}}(s,t,u)=\Gamma_{\text{reg}}(s,t,u)-\Gamma_{\text{reg}}(s=t=u=0).$$
This scheme can even be used without any regularization by just subtracting the corresponding expressions from the integrands of the loop integrals and then doing the integral. Of course, for diagrams with more than one loop, you also have to take care of all subdivergences (but not overlapping ones) according to Zimmermann's forest formula (see my manuscript).
You cannot use this scheme at ##m^2=0##, because at vanishing external momenta the logarithmically divergent pieces are IR divergent, and subtracting at this point thus makes the renormalized pieces themselves IR divergent (here the four-point function and ##\partial_s \Sigma##, which is also logarithmically divergent).
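As a hedged one-loop illustration (a sketch, not taken from the manuscript mentioned above): in each channel the momentum dependence of the one-loop four-point function enters only through a Feynman-parameter integral over a logarithm. Up to a convention-dependent overall factor, the BPHZ-subtracted contribution of the ##s## channel (written here as ##\Gamma^{(s)}_{\text{BPHZ}}##, a label introduced only for this example) is
$$\Gamma^{(s)}_{\text{BPHZ}}(s) \propto \lambda^2 \int_0^1 \mathrm{d} x \, \ln \left[\frac{m^2-x(1-x) s}{m^2} \right].$$
For space-like ##s<0## and ##m^2>0## this is finite, but for ##-s \gg m^2## it behaves like
$$\int_0^1 \mathrm{d} x \, \ln \left[\frac{m^2-x(1-x) s}{m^2} \right] \simeq \ln \left(\frac{-s}{m^2} \right)-2,$$
which diverges logarithmically for ##m^2 \rightarrow 0## at fixed ##s##. That's the IR problem of the zero-momentum subtraction point in the massless case.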
MOM Scheme
----------------
One way out is to subtract at non-zero momenta, with all external momenta at some space-like scale ##\Lambda##. This introduces a renormalization scale ##\Lambda## into the game, which is mandatory in the massless case (leading to violation of scale invariance/conformal symmetry and "dimensional transmutation"). This is called the momentum-subtraction scheme. Of course, the original BPHZ scheme for the massive case is also a special MOM scheme, with the subtraction point at zero momentum.
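The difference between subtracting at zero momentum and at a space-like scale can be checked numerically. The following is only a minimal sketch: it assumes the standard one-loop result that the momentum dependence of the bubble diagram enters through ##\int_0^1 \mathrm{d}x \, \ln[m^2-x(1-x)s]## (overall factors dropped), restricts itself to space-like ##s \leq 0##, and uses a helper name, `bubble_subtracted`, that is made up for this illustration.

```python
import math


def bubble_subtracted(s, m2, s0, n=20000):
    """Midpoint-rule estimate of
        int_0^1 dx ln[(m2 - x(1-x) s) / (m2 - x(1-x) s0)],
    i.e. the one-loop bubble logarithm subtracted at s = s0.
    Intended for space-like s, s0 <= 0, where both arguments are positive
    (hypothetical helper, overall coupling factors dropped).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of the i-th cell, never 0 or 1
        w = x * (1.0 - x)
        total += math.log((m2 - w * s) / (m2 - w * s0))
    return total * h


# Subtraction at zero momentum (s0 = 0): grows like ln(-s/m2) as m2 -> 0.
zero_mom = [bubble_subtracted(-1.0, m2, 0.0) for m2 in (1e-2, 1e-4, 1e-6)]

# Subtraction at the space-like point s0 = -Lambda^2 with Lambda^2 = 1:
# stays finite even at m2 = 0, where it reduces to ln(-s/Lambda^2).
mom_massless = bubble_subtracted(-4.0, 0.0, -1.0)

print(zero_mom)
print(mom_massless, math.log(4.0))
```

The zero-momentum-subtracted values keep growing as the mass is lowered, while the version subtracted at ##s_0=-\Lambda^2## has a perfectly fine massless limit, at the price of an explicit dependence on ##\Lambda##.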
MIR Scheme
---------------
Mass-independent renormalization schemes introduce a mass scale ##M^2## and subtract (at least) the logarithmic divergence at this scale, i.e., you impose the conditions
$$\Sigma(s=0,m^2=0)=0 \; \Rightarrow \delta m^2=0,$$
$$\partial_{m^2} \Sigma(s=0,m^2=M^2)=0 \; \Rightarrow\; \delta Z_m,$$
$$\partial_s \Sigma(s=0,m^2=M^2)=0 \; \Rightarrow \; \delta Z_{\phi},$$
$$\Gamma^{(4)}(s=t=u=0,m^2=M^2)=-\lambda \; \Rightarrow \; \delta \lambda.$$
This makes ##Z_{\phi}##, ##Z_m##, and ##\delta \lambda## independent of ##M##, in the sense that these quantities depend on ##M## only via the renormalized coupling constant, ##\lambda##. All these quantities are entirely independent of ##m##. You can even evaluate the effective action for ##m^2<0##, leading to spontaneous symmetry breaking. In simple ##\phi^4## theory this is the breaking of the discrete field-reflection symmetry, which is not too interesting, but in the linear ##\sigma## model it leads to an effective description of pions as Goldstone bosons of chiral symmetry.
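To see how the mass condition works in the simplest case, here is a hedged one-loop sketch. It assumes dimensional regularization, where the one-loop self-energy (the tadpole) is independent of ##s## and has the generic form
$$\Sigma_{\text{reg}}(m^2) \propto \lambda \, m^2 \left[\ln \left(\frac{m^2}{\mu^2} \right) + c \right],$$
with ##\mu## the regularization scale and ##c## a regulator-dependent constant containing the pole term (both symbols, and the dropped overall factor, are assumptions of this example). The condition at ##s=0##, ##m^2=0## is then fulfilled automatically, and subtracting a term ##\propto m^2## such that ##\partial_{m^2} \Sigma(s=0,m^2=M^2)=0## gives
$$\Sigma(m^2) \propto \lambda \, m^2 \left[\ln \left(\frac{m^2}{M^2} \right) - 1 \right].$$
The subtracted coefficient, ##\ln(M^2/\mu^2)+1+c##, indeed doesn't depend on ##m##, and the renormalized expression has a smooth limit for ##m^2 \rightarrow 0##.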
The MS scheme(s) are not so directly expressible in terms of conditions on the divergent proper vertex functions and are thus a bit less transparent, but they are more convenient in practical calculations while keeping many symmetries intact, particularly (non-chiral) gauge symmetries. That's why they're used in modern treatments of QFT, particularly in QCD.