Exploring the Benefits of Linearizing Data - Improve Analysis and Visualization

  • #1
fog37
TL;DR Summary
Why linearize data that already fits a known relationship?
Hello,

There is a physical phenomenon in which the variable ##X## is related to the variable ##Y## by a cubic relationship, i.e. $$Y= k X^3$$
The data I collected, ##(X,Y)##, seems to fit this relationship well: I used Excel to fit a power-law function (3rd power) and there is good agreement.

What would I gain by linearizing the data? That would be achieved by plotting ##Y## versus ##X^3## and the data should follow a linear trend. The best fit line would then be a straight line with slope ##k## and intercept ##0##. I don't think there would be any benefit in linearizing the data since the power law best fit seems to do the job...
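A minimal Python sketch of this linearization, with invented data standing in for the actual measurements:

[CODE=python]
import numpy as np

# Invented data standing in for the measured (X, Y) pairs.
rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 25)
Y = 2.5 * X**3 + rng.normal(0.0, 5.0, X.size)

# Linearize: regress Y on X**3; the slope estimates k, the intercept should be ~0.
k_hat, intercept = np.polyfit(X**3, Y, 1)

# R^2 of the straight-line fit of Y against X^3.
Y_pred = k_hat * X**3 + intercept
ss_res = np.sum((Y - Y_pred) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"k = {k_hat:.3f}, intercept = {intercept:.3f}, R^2 = {r2:.4f}")
[/CODE]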

Thank you for any input.
 
  • #2
Data is sometimes linearized in order to apply stability theory developed for linear systems. The aerodynamics of an airplane are linearized at every flight condition (including control surface positions) to see what its stability properties are.
 
  • #3
I also think that the R-squared value that Excel generates for the power-law fit is meaningful, since ##Y= k X^3## is a linear model...

So maybe we can linearize a truly nonlinear model (nonlinear in the statistical sense) by transforming the data so that the best fit becomes a straight line, and then calculate ##R^2##, which would give us an idea of how well the nonlinear model fits the data...
 
  • #4
But if we were set on linearizing ##Y=k X^3##, we would take the log of both sides of the equation and get $$\log(Y) = 3 \log(X) + \log(k)$$
Couldn't we just linearize the power law ##Y=k X^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}= X^3## and plotting ##Y_{new}## versus ##X_{new}##?
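Both routes are easy to check numerically; a quick sketch with invented noiseless data:

[CODE=python]
import numpy as np

X = np.linspace(1.0, 10.0, 25)
Y = 2.5 * X**3                        # invented noiseless data with k = 2.5

# Route 1: log-log transform. Slope recovers the exponent, intercept recovers log(k).
slope, intercept = np.polyfit(np.log(X), np.log(Y), 1)
print(slope, np.exp(intercept))       # ~3.0 and ~2.5

# Route 2: change of variables X_new = X**3. Slope recovers k, intercept ~0.
k, b0 = np.polyfit(X**3, Y, 1)
print(k, b0)                          # ~2.5 and ~0.0
[/CODE]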

Thanks
 
  • #5
To me, the term "linearize" at a point ##x=x_0## means to approximate the function ##f(x)## around ##x_0## with its tangent line, ##f(x_0) + f'(x_0)(x - x_0)##.
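For example, linearizing ##f(x) = k x^3## about ##x_0## gives the tangent-line approximation $$f(x) \approx k x_0^3 + 3 k x_0^2 (x - x_0).$$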
 
  • #6
fog37 said:
But if we were set on linearizing ##Y=k X^3##, we would take the log of both sides of the equation and get $$\log(Y) = 3 \log(X) + \log(k)$$

This formulation assumes that [itex]X[/itex], [itex]Y[/itex] and [itex]k[/itex] are strictly positive.

Couldn't we just linearize the power law ##Y=k X^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}= X^3## and plotting ##Y_{new}## versus ##X_{new}##?

This formulation does not.

Also: if [itex]X[/itex] is not known exactly, is the uncertainty in [itex]X^3[/itex] (proportional to [itex]X^2[/itex]) or in [itex]\ln X[/itex] (proportional to [itex]X^{-1}[/itex]) going to be larger?
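A small sketch of that error propagation, assuming (hypothetically) a fixed measurement uncertainty ##\delta X## at every point:

[CODE=python]
import numpy as np

X = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
dX = 0.1                      # assumed, invented measurement uncertainty in X

# First-order propagation: delta(f(X)) ~ |f'(X)| * delta(X).
d_cube = 3 * X**2 * dX        # uncertainty in X**3, grows like X^2
d_log = dX / X                # uncertainty in ln(X), shrinks like 1/X

for x, dc, dl in zip(X, d_cube, d_log):
    print(f"X={x:5.1f}  d(X^3)={dc:8.3f}  d(ln X)={dl:6.3f}")
[/CODE]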
 
  • #7
@fog37, you have not said whether there is any random behavior in your problem. If there is, then one good reason to transform it to linear might be to get the random behavior in the form of an added normal random variable. Then the results of statistical linear analysis can be applied.

Suppose your original problem is of the form ##Y = r X^3##, where ##r## is a random multiplier with a mean of 1. That is, the random behavior is proportional to the size of ##X^3##. If you can transform the problem into the form ## \log(Y) = a_1 \log(X) +a_0 + \epsilon##, where ##\epsilon## is a random normal variable, then you can apply linear statistical analysis to obtain estimators of the parameters and their associated statistical properties. Those results can be applied to the original problem in the form ##Y = e^{\epsilon} e^{a_0} X^{a_1}##.
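A minimal simulation of that setup (invented parameters), fitting the log-log model by ordinary least squares:

[CODE=python]
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(1.0, 10.0, 200)
eps = rng.normal(0.0, 0.1, X.size)   # normal noise on the log scale
Y = 2.5 * X**3 * np.exp(eps)         # multiplicative noise, e^eps has mean ~1

# OLS on log(Y) = a1*log(X) + a0 + eps recovers the exponent and prefactor.
a1, a0 = np.polyfit(np.log(X), np.log(Y), 1)
print(f"a1 = {a1:.3f} (true 3), e^a0 = {np.exp(a0):.3f} (true 2.5)")
[/CODE]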
 
  • #8
pasmith said:
This formulation assumes that [itex]X[/itex], [itex]Y[/itex] and [itex]k[/itex] are strictly positive.
This formulation does not.

Also: if [itex]X[/itex] is not known exactly, is the uncertainty in [itex]X^3[/itex] (proportional to [itex]X^2[/itex]) or in [itex]\ln X[/itex] (proportional to [itex]X^{-1}[/itex]) going to be larger?
Ok, for simplicity, let's assume we collect some data from an experiment. For specific values of the variable ##X## we obtain certain values for the variable ##Y##. All ##X## and ##Y## values are positive, so the log transformation would not be a problem.

I guess our data points ##(X,Y)## are to be viewed as a sample from a general population. ##Y## values would slightly change from sample to sample (if we repeated the experiment and collected more than one sample).

Is the change in the collected values of ##Y##, from sample to sample, what @FactChecker refers to as random behavior?

If the data ##(X,Y)##, once plotted, seems to follow a curvilinear polynomial trend like ##Y= a X^3##, OLS can still be used because the model is linear. OLS can be used for polynomial regression, I believe. Confidence intervals, p-value, R-squared still apply for a polynomial regression and would be meaningful results.

Why would we then need to change the model ##Y= a X^3## and linearize it, either by using the log transformation or a change of variables like ##Y_{new}=Y## and ##X_{new}=X^3##? I still don't get that part...
 
  • #9
fog37 said:
TL;DR Summary: Why linearize data that already fits a known relationship?

What would I gain by linearizing the data?
This is already linear.

fog37 said:
Couldn't we just linearize the power law ##Y=kX^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}=X^3## and plotting ##Y_{new}## versus ##X_{new}##?
Yes. That is why it is already linear.

The only reason you might transform the data is if you found evidence of heteroskedasticity in the residuals.
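A sketch of that diagnostic on invented data: fit the straight line, then compare the residual spread at small and large ##X^3##:

[CODE=python]
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(1.0, 10.0, 200)
Y = 2.5 * X**3 * (1 + 0.05 * rng.standard_normal(X.size))  # multiplicative noise

k, b0 = np.polyfit(X**3, Y, 1)
resid = Y - (k * X**3 + b0)

# Crude heteroskedasticity check: residual spread in the low vs high half of X^3.
lo, hi = np.array_split(resid[np.argsort(X**3)], 2)
print(f"residual std, low half: {lo.std():.2f}, high half: {hi.std():.2f}")
# A much larger spread in the high half is evidence for a log transform.
[/CODE]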
 
  • #10
Dale said:
This is already linear.

Yes. That is why it is already linear.

The only reason you might transform the data is if you found evidence of heteroskedasticity in the residuals.
Thank you. Indeed, there are two types of linear:

1) linear in the independent variables ##X##
2) linear in the parameters ##b##

Linear regression is linear in both senses while polynomial regression is only linear in the parameters.

What does "linear in the parameters" guarantee? All GLM models are general "linear" models because satisfy linearity in the parameters (ex: the logit in logistic regression is both linear in the variable ##X## and in the parameters ##b##)

Does linearity in the parameters directly imply that OLS can be used to estimate the parameters, or not necessarily? I don't think so... What does it guarantee then?

Back to my data ##(X,Y)## following a cubic best-fit trend ##Y=k X^3##: does polynomial regression have the same assumptions as linear regression (homoscedasticity, Gaussian residuals, etc.)?

There are two possible data transformations: plotting ##Y## vs ##X^3##, or ##\log(Y)## vs ##\log(X)##; both produce transformed data that follow a straight best-fit line. But as Dale mentions, one transformation may be better than the other because it yields data that satisfy the required assumptions for linear regression, which the other transformation may not... Is that correct?
 
  • #11
fog37 said:
Linear regression is linear in both senses while polynomial regression is only linear in the parameters.
In statistics, what you are calling "polynomial regression" is still a linear regression. If I have a model ##y=b_0 + b_1 x + b_2 x^2 + b_3 x^3##, this is a linear model because it is linear in all of the regression coefficients ##b_i##. You will use the same underlying algorithm to find the ##b_i## as you would for the model ##y=b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3##; the residuals will be the same, and all of the diagnostic techniques would be the same. They are both linear in every way that counts in statistics. We are focused on what you call "linear in the parameters". From a statistics perspective that is "linear".
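A sketch of that point: both models reduce to the same least-squares problem, differing only in the columns of the design matrix (invented data):

[CODE=python]
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(0.0, 0.5, x.size)

# Polynomial model: the design-matrix columns are 1, x, x^2, x^3.
A_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])
b_poly, *_ = np.linalg.lstsq(A_poly, y, rcond=None)

# "Multivariate" model: same solver, the columns are just treated as x1, x2, x3.
x1, x2, x3 = x, x**2, x**3
A_multi = np.column_stack([np.ones_like(x), x1, x2, x3])
b_multi, *_ = np.linalg.lstsq(A_multi, y, rcond=None)

print(np.allclose(b_poly, b_multi))   # True: the solver only sees the columns
[/CODE]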

The model ##y=b_0 x^{b_1}## is a non-linear model that can be linearized.

fog37 said:
What does "linear in the parameters" guarantee? All GLM models are general "linear" models because satisfy linearity in the parameters (ex: the logit in logistic regression is both linear in the variable X and in the parameters b)
GLM models are not linear in the parameters; they are only linear in the link function of the parameters. (Unless by "parameters" you mean the link function of the parameters, which would be reasonable too.)

fog37 said:
Does linearity in the parameters directly imply that OLS can be used to estimate the parameters, or not necessarily? I don't think so... What does it guarantee then?
Not directly, no. But together with the usual assumptions about the noise, yes. The guarantee is that OLS is the best linear unbiased estimator. That is the Gauss–Markov theorem.

fog37 said:
Does polynomial regression have the same assumptions as linear regression (homoscedasticity, Gaussian residuals, etc.)?
Yes, it is the same thing. The same assumptions apply as well as the same diagnostic tools.

fog37 said:
There are two possible data transformations: plotting ##Y## vs ##X^3##, or ##\log(Y)## vs ##\log(X)##; both produce transformed data that follow a straight best-fit line. But as Dale mentions, one transformation may be better than the other because it yields data that satisfy the required assumptions for linear regression, which the other transformation may not... Is that correct?
Yes. What I usually do is I fit the first model and I look at my residuals vs ##X## or ##X^3##. If my residuals are fairly independent then I use that model. If my residuals are strongly increasing for larger ##X## or ##X^3## then I will do the log transform.
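A sketch of that workflow on invented data, with an invented rule of thumb for "strongly increasing":

[CODE=python]
import numpy as np

rng = np.random.default_rng(5)
X = np.linspace(1.0, 10.0, 300)
Y = 2.5 * X**3 * np.exp(rng.normal(0.0, 0.1, X.size))   # invented data

# Step 1: fit Y vs X^3 and inspect the residuals.
k, b0 = np.polyfit(X**3, Y, 1)
resid = Y - (k * X**3 + b0)
lo, hi = np.array_split(resid[np.argsort(X)], 2)

# Step 2: if the spread grows strongly with X, switch to the log-log model.
if hi.std() > 2 * lo.std():          # invented threshold
    a1, a0 = np.polyfit(np.log(X), np.log(Y), 1)
    print(f"log model: exponent {a1:.2f}, k = {np.exp(a0):.2f}")
else:
    print(f"linear model: k = {k:.2f}")
[/CODE]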
 
  • #12
Thank YOU!
 

1. What does it mean to linearize data?

Linearizing data involves transforming a dataset with a non-linear relationship between variables into one with a linear relationship. This is often done through mathematical transformations such as logarithmic, square-root, or reciprocal transformations. The purpose is to simplify the relationship between variables, making trends and patterns easier to analyze and visualize and making the data easier to use in predictive modeling.

2. Why is linearizing data important in data analysis?

Linearizing data is crucial because many statistical methods assume linear relationships between variables. By transforming data to approximate linearity, these methods, such as linear regression and correlation analysis, become more effective and accurate. This makes it more straightforward to identify key insights and produce reliable predictions.

3. How does linearizing data improve visualization?

When data is linearized, points are often spread more evenly across the plot, reducing skewness and the visual impact of outliers. This makes trends clearer and easier to identify at a glance. It enhances the interpretability of plots such as scatter plots and line graphs, where non-linear patterns can be harder to discern.

4. What are the common methods to linearize data?

The most common methods to linearize data include logarithmic transformations (useful for exponential data), power transformations like square or cube roots (helpful for skewed data), and polynomial or spline transformations (for complex, non-linear relationships). The choice of method depends on the specific characteristics of the data and the underlying relationships between variables.
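A compact sketch of these transformations on invented data, each one mapping a different nonlinearity onto a straight line:

[CODE=python]
import numpy as np

x = np.linspace(1.0, 10.0, 50)

# Exponential data y = a*exp(b*x): log(y) is linear in x.
y_exp = 2.0 * np.exp(0.5 * x)
print(np.polyfit(x, np.log(y_exp), 1))    # slope ~0.5, intercept ~log(2)

# Square-law data y = a*x^2: sqrt(y) is linear in x.
y_sq = 3.0 * x**2
print(np.polyfit(x, np.sqrt(y_sq), 1))    # slope ~sqrt(3), intercept ~0

# Reciprocal data y = a/x: 1/y is linear in x.
y_rec = 4.0 / x
print(np.polyfit(x, 1.0 / y_rec, 1))      # slope ~0.25, intercept ~0
[/CODE]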

5. What are the limitations of linearizing data?

While linearizing data can be very beneficial, it also has limitations. It may not be suitable for all types of data or relationships, and inappropriate transformations can lead to misleading results. Moreover, interpreting the results can become more complex because the scale of the data changes, potentially obscuring the true nature of the relationships among variables. It is crucial to understand the original data distribution and consider these factors when choosing to linearize.
