Exploring the Benefits of Linearizing Data - Improve Analysis and Visualization

  • #1
fog37
TL;DR Summary
Why linearize data that already fits a known relationship?
Hello,

There is a physical phenomenon in which the variable ##X## is related to the variable ##Y## by a cubic relationship, i.e. $$Y= k X^3$$
The data I collected, ##(X,Y)##, seems to fit this relationship well: I used Excel to fit a power-law function (3rd power) and there is good agreement.

What would I gain by linearizing the data? That would be achieved by plotting ##Y## versus ##X^3## and the data should follow a linear trend. The best fit line would then be a straight line with slope ##k## and intercept ##0##. I don't think there would be any benefit in linearizing the data since the power law best fit seems to do the job...
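A minimal Python sketch of this linearization, with invented data standing in for the actual measurements:

[CODE=python]
import numpy as np

# Invented data standing in for the measured (X, Y) pairs.
rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 25)
Y = 2.5 * X**3 + rng.normal(0.0, 5.0, X.size)

# Linearize: regress Y on X**3; the slope estimates k, the intercept should be ~0.
k_hat, intercept = np.polyfit(X**3, Y, 1)

# R^2 of the straight-line fit of Y against X^3.
Y_pred = k_hat * X**3 + intercept
ss_res = np.sum((Y - Y_pred) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"k = {k_hat:.3f}, intercept = {intercept:.3f}, R^2 = {r2:.4f}")
[/CODE]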

Thank you for any input.
 
  • #2
Data is sometimes linearized in order to apply stability theory developed for linear systems. The aerodynamics of an airplane are linearized at every flight condition (including control surface positions) to see what its stability properties are.
 
  • #3
I also think that the R-squared value that Excel generates for the power-law fit is meaningful, since ##Y= k X^3## is a linear model...

So maybe we can linearize a truly nonlinear model (nonlinear in the statistical sense) by transforming the data so that the best fit becomes a straight line, and then calculate ##R^2##, which would give us an idea of how well the nonlinear model fits the data...
 
  • #4
But if we were set on linearizing ##Y=k X^3##, we would take the log of both sides of the equation and get $$\log(Y) = 3 \log(X) + \log(k)$$
Couldn't we just linearize the power law ##Y=k X^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}= X^3## and plotting ##Y_{new}## versus ##X_{new}##?
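Both routes are easy to check numerically; a quick sketch with invented noiseless data:

[CODE=python]
import numpy as np

X = np.linspace(1.0, 10.0, 25)
Y = 2.5 * X**3                        # invented noiseless data with k = 2.5

# Route 1: log-log transform. Slope recovers the exponent, intercept recovers log(k).
slope, intercept = np.polyfit(np.log(X), np.log(Y), 1)
print(slope, np.exp(intercept))       # ~3.0 and ~2.5

# Route 2: change of variables X_new = X**3. Slope recovers k, intercept ~0.
k, b0 = np.polyfit(X**3, Y, 1)
print(k, b0)                          # ~2.5 and ~0.0
[/CODE]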

Thanks
 
  • #5
To me, the term "linearize" at a point ##x=x_0## means to approximate the function ##f(x)## around ##x_0## with its tangent line, ##f(x_0) + f'(x_0)(x - x_0)##.
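For example, linearizing ##f(x) = k x^3## about ##x_0## gives the tangent-line approximation $$f(x) \approx k x_0^3 + 3 k x_0^2 (x - x_0).$$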
 
  • #6
fog37 said:
But if we were set on linearizing ##Y=k X^3##, we would take the log of both sides of the equation and get $$\log(Y) = 3 \log(X) + \log(k)$$

This formulation assumes that [itex]X[/itex], [itex]Y[/itex] and [itex]k[/itex] are strictly positive.

Couldn't we just linearize the power law ##Y=k X^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}= X^3## and plotting ##Y_{new}## versus ##X_{new}##?

This formulation does not.

Also: if [itex]X[/itex] is not known exactly, is the uncertainty in [itex]X^3[/itex] (proportional to [itex]X^2[/itex]) or in [itex]\ln X[/itex] (proportional to [itex]X^{-1}[/itex]) going to be larger?
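A small sketch of that error propagation, assuming (hypothetically) a fixed measurement uncertainty ##\delta X## at every point:

[CODE=python]
import numpy as np

X = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
dX = 0.1                      # assumed, invented measurement uncertainty in X

# First-order propagation: delta(f(X)) ~ |f'(X)| * delta(X).
d_cube = 3 * X**2 * dX        # uncertainty in X**3, grows like X^2
d_log = dX / X                # uncertainty in ln(X), shrinks like 1/X

for x, dc, dl in zip(X, d_cube, d_log):
    print(f"X={x:5.1f}  d(X^3)={dc:8.3f}  d(ln X)={dl:6.3f}")
[/CODE]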
 
  • #7
@fog37, you have not said whether there is any random behavior in your problem. If there is, then one good reason to transform it to linear might be to get the random behavior in the form of an added normal random variable. Then the results of statistical linear analysis can be applied.

Suppose your original problem is of the form ##Y = r X^3##, where ##r## is a random multiplier with a mean of 1. That is, the random behavior is proportional to the size of ##X^3##. If you can transform the problem into the form ## \log(Y) = a_1 \log(X) +a_0 + \epsilon##, where ##\epsilon## is a random normal variable, then you can apply linear statistical analysis to obtain estimators of the parameters and their associated statistical properties. Those results can be applied to the original problem in the form ##Y = e^{\epsilon} e^{a_0} X^{a_1}##.
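A minimal simulation of that setup (invented parameters), fitting the log-log model by ordinary least squares:

[CODE=python]
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(1.0, 10.0, 200)
eps = rng.normal(0.0, 0.1, X.size)   # normal noise on the log scale
Y = 2.5 * X**3 * np.exp(eps)         # multiplicative noise, e^eps has mean ~1

# OLS on log(Y) = a1*log(X) + a0 + eps recovers the exponent and prefactor.
a1, a0 = np.polyfit(np.log(X), np.log(Y), 1)
print(f"a1 = {a1:.3f} (true 3), e^a0 = {np.exp(a0):.3f} (true 2.5)")
[/CODE]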
 
  • #8
pasmith said:
This formulation assumes that [itex]X[/itex], [itex]Y[/itex] and [itex]k[/itex] are strictly positive.
This formulation does not.

Also: if [itex]X[/itex] is not known exactly, is the uncertainty in [itex]X^3[/itex] (proportional to [itex]X^2[/itex]) or in [itex]\ln X[/itex] (proportional to [itex]X^{-1}[/itex]) going to be larger?
Ok, for simplicity, let's assume we collect some data from an experiment. For specific values of the variable ##X## we obtain certain values for the variable ##Y##. All ##X## and ##Y## values are positive, so the log transformation would not be a problem.

I guess our data points ##(X,Y)## are to be viewed as a sample from a general population. ##Y## values would slightly change from sample to sample (if we repeated the experiment and collected more than one sample).

Is the change in the collected values of ##Y##, from sample to sample, what @FactChecker refers to as random behavior?

If the data ##(X,Y)##, once plotted, seems to follow a curvilinear polynomial trend like ##Y= a X^3##, OLS can still be used because the model is linear. OLS can be used for polynomial regression, I believe. Confidence intervals, p-value, R-squared still apply for a polynomial regression and would be meaningful results.

Why would we then need to change the model ##Y= a X^3## and linearize it, either by using the log transformation or a change of variables like ##Y_{new}=Y## and ##X_{new}=X^3##? I still don't get that part...
 
  • #9
fog37 said:
TL;DR Summary: Why linearize data that already fits a known relationship?

What would I gain by linearizing the data?
This is already linear.

fog37 said:
Couldn't we just linearize the power law ##Y=kX^3## by simply creating two new variables, ##Y_{new}=Y## and ##X_{new}=X^3## and plotting ##Y_{new}## versus ##X_{new}##?
Yes. That is why it is already linear.

The only reason you might transform the data is if you found evidence of heteroskedasticity in the residuals.
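A sketch of that diagnostic on invented data: fit the straight line, then compare the residual spread at small and large ##X^3##:

[CODE=python]
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(1.0, 10.0, 200)
Y = 2.5 * X**3 * (1 + 0.05 * rng.standard_normal(X.size))  # multiplicative noise

k, b0 = np.polyfit(X**3, Y, 1)
resid = Y - (k * X**3 + b0)

# Crude heteroskedasticity check: residual spread in the low vs high half of X^3.
lo, hi = np.array_split(resid[np.argsort(X**3)], 2)
print(f"residual std, low half: {lo.std():.2f}, high half: {hi.std():.2f}")
# A much larger spread in the high half is evidence for a log transform.
[/CODE]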
 
  • #10
Dale said:
This is already linear.

Yes. That is why it is already linear.

The only reason you might transform the data is if you found evidence of heteroskedasticity in the residuals.
Thank you. Indeed, there are two types of linear:

1) linear in the independent variables ##X##
2) linear in the parameters ##b##

Linear regression is linear in both senses while polynomial regression is only linear in the parameters.

What does "linear in the parameters" guarantee? All GLM models are general "linear" models because satisfy linearity in the parameters (ex: the logit in logistic regression is both linear in the variable ##X## and in the parameters ##b##)

Does linearity in the parameters directly imply that OLS can be used to estimate the parameters, or not necessarily? I don't think so... What does it guarantee then?

Back to my data ##(X,Y)## following a cubic best-fit trend ##Y=k X^3##: does polynomial regression have the same assumptions as linear regression (homoscedasticity, Gaussian residuals, etc.)?

There are two possible data transformations: plotting ##Y## vs ##X^3##, or ##\log(Y)## vs ##\log(X)##; both produce transformed data that follow a straight best-fit line. But as Dale mentions, one transformation may be better than the other because it yields data that satisfy the required assumptions for linear regression, which the other transformation may not... Is that correct?
 
  • #11
fog37 said:
Linear regression is linear in both senses while polynomial regression is only linear in the parameters.
In statistics, what you are calling "polynomial regression" is still a linear regression. If I have a model ##y=b_0 + b_1 x + b_2 x^2 + b_3 x^3##, this is a linear model because it is linear in all of the regression coefficients ##b_i##. You will use the same underlying algorithm to find the ##b_i## as you would for the model ##y=b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3##; the residuals will be the same, and all of the diagnostic techniques would be the same. They are both linear in every way that counts in statistics. We are focused on what you call "linear in the parameters". From a statistics perspective that is "linear".
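A sketch of that point: both models reduce to the same least-squares problem, differing only in the columns of the design matrix (invented data):

[CODE=python]
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(0.0, 0.5, x.size)

# Polynomial model: the design-matrix columns are 1, x, x^2, x^3.
A_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])
b_poly, *_ = np.linalg.lstsq(A_poly, y, rcond=None)

# "Multivariate" model: same solver, the columns are just treated as x1, x2, x3.
x1, x2, x3 = x, x**2, x**3
A_multi = np.column_stack([np.ones_like(x), x1, x2, x3])
b_multi, *_ = np.linalg.lstsq(A_multi, y, rcond=None)

print(np.allclose(b_poly, b_multi))   # True: the solver only sees the columns
[/CODE]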

The model ##y=b_0 x^{b_1}## is a non-linear model that can be linearized.

fog37 said:
What does "linear in the parameters" guarantee? All GLM models are general "linear" models because satisfy linearity in the parameters (ex: the logit in logistic regression is both linear in the variable X and in the parameters b)
GLM models are not linear in the parameters; they are only linear in the link function of the parameters. (Unless by "parameters" you mean the link function of the parameters, which would be reasonable too.)

fog37 said:
Does linearity in the parameters directly imply that OLS can be used to estimate the parameters, or not necessarily? I don't think so... What does it guarantee then?
Not directly, no. But together with the usual assumptions about the noise, yes. The guarantee is that OLS is the best linear unbiased estimator. That is the Gauss–Markov theorem.

fog37 said:
Does polynomial regression have the same assumptions as linear regression (homoscedasticity, Gaussian residuals, etc.)?
Yes, it is the same thing. The same assumptions apply as well as the same diagnostic tools.

fog37 said:
There are two possible data transformations: plotting ##Y## vs ##X^3##, or ##\log(Y)## vs ##\log(X)##; both produce transformed data that follow a straight best-fit line. But as Dale mentions, one transformation may be better than the other because it yields data that satisfy the required assumptions for linear regression, which the other transformation may not... Is that correct?
Yes. What I usually do is I fit the first model and I look at my residuals vs ##X## or ##X^3##. If my residuals are fairly independent then I use that model. If my residuals are strongly increasing for larger ##X## or ##X^3## then I will do the log transform.
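A sketch of that workflow on invented data, with an invented rule of thumb for "strongly increasing":

[CODE=python]
import numpy as np

rng = np.random.default_rng(5)
X = np.linspace(1.0, 10.0, 300)
Y = 2.5 * X**3 * np.exp(rng.normal(0.0, 0.1, X.size))   # invented data

# Step 1: fit Y vs X^3 and inspect the residuals.
k, b0 = np.polyfit(X**3, Y, 1)
resid = Y - (k * X**3 + b0)
lo, hi = np.array_split(resid[np.argsort(X)], 2)

# Step 2: if the spread grows strongly with X, switch to the log-log model.
if hi.std() > 2 * lo.std():          # invented threshold
    a1, a0 = np.polyfit(np.log(X), np.log(Y), 1)
    print(f"log model: exponent {a1:.2f}, k = {np.exp(a0):.2f}")
else:
    print(f"linear model: k = {k:.2f}")
[/CODE]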
 
  • #12
Thank YOU!
 

1. What does it mean to linearize data?

Linearizing data involves transforming a dataset with a non-linear relationship between variables into one with a linear relationship. This is often done through mathematical transformations such as logarithmic, square-root, or reciprocal transformations. The purpose is to simplify the relationship between variables, making trends and patterns easier to analyze and visualize and making the data easier to use in predictive modeling.

2. Why is linearizing data important in data analysis?

Linearizing data is crucial because many statistical methods assume linear relationships between variables. By transforming data to approximate linearity, these methods, such as linear regression and correlation analysis, become more effective and accurate. This makes it more straightforward to identify key insights and produce reliable predictions.

3. How does linearizing data improve visualization?

When data is linearized, points are often spread more evenly across the plot, reducing skewness and the visual impact of outliers. This makes trends clearer and easier to identify at a glance. It enhances the interpretability of plots such as scatter plots and line graphs, where non-linear patterns can be harder to discern.

4. What are the common methods to linearize data?

The most common methods to linearize data include logarithmic transformations (useful for exponential data), power transformations like square or cube roots (helpful for skewed data), and polynomial or spline transformations (for complex, non-linear relationships). The choice of method depends on the specific characteristics of the data and the underlying relationships between variables.
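A compact sketch of these transformations on invented data, each one mapping a different nonlinearity onto a straight line:

[CODE=python]
import numpy as np

x = np.linspace(1.0, 10.0, 50)

# Exponential data y = a*exp(b*x): log(y) is linear in x.
y_exp = 2.0 * np.exp(0.5 * x)
print(np.polyfit(x, np.log(y_exp), 1))    # slope ~0.5, intercept ~log(2)

# Square-law data y = a*x^2: sqrt(y) is linear in x.
y_sq = 3.0 * x**2
print(np.polyfit(x, np.sqrt(y_sq), 1))    # slope ~sqrt(3), intercept ~0

# Reciprocal data y = a/x: 1/y is linear in x.
y_rec = 4.0 / x
print(np.polyfit(x, 1.0 / y_rec, 1))      # slope ~0.25, intercept ~0
[/CODE]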

5. What are the limitations of linearizing data?

While linearizing data can be very beneficial, it also has limitations. It may not be suitable for all types of data or relationships, and inappropriate transformations can lead to misleading results. Moreover, interpreting the results can become more complex because the scale of the data changes, potentially obscuring the true nature of the relationships among variables. It is crucial to understand the original data distribution and consider these factors when choosing to linearize.
