# Brute force regression software?

Hi all

I have a lot of data, and I was wondering whether there exists a program that applies a kind of brute-force regression: trying every thinkable combination of variables and mathematical expressions to minimize the error between Y and Y_predicted.

The data [(x1 vs Y), (x2 vs Y), ..., (xn vs Y)] is very scattered (random example attached), so I will need something significantly more complicated than linear terms to get a nice Y vs. Y_predicted plot.

Br,
Peter

#### Attachments

• regression.jpg

phyzguy
I don't think what you've asked for is really what you want. If you have N data points, a polynomial of degree (N-1) will fit the data exactly, with zero error. However, such a model is probably physically meaningless. Usually you provide some physical insight into what the model should look like. What is your data, and what relationship do you expect between X and Y?
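That overfitting point can be illustrated numerically (the data here is made up; NumPy assumed):

```python
import numpy as np

# Made-up example: N noisy data points whose underlying trend is linear
rng = np.random.default_rng(0)
N = 8
x = np.linspace(0.0, 1.0, N)
y = 2.0 * x + rng.normal(0.0, 0.3, N)

# A polynomial of degree N-1 has N free coefficients, so it can pass
# through every point: the residual is zero up to round-off
coeffs = np.polyfit(x, y, N - 1)
residual = np.max(np.abs(np.polyval(coeffs, x) - y))
print(residual)  # essentially zero, but the polynomial has memorized the noise
```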

So you are saying that the fit depends only on the number of data points, and not on the number of variables or on the dependencies between them? That sounds pretty strange.

Doesn't the N-1 rule apply only to a single variable with N values?

The purpose of the program would be to discover correlations between the variables: start with simple relationships and try to minimize the error using various combinations. One would need some tolerance on the error, and the program would scale up its complexity until an acceptable solution is found.
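A minimal sketch of that escalation idea, with polynomial degree standing in for "complexity" (the data, the hidden cubic relationship, and the tolerance are all invented for illustration):

```python
import numpy as np

# Invented data with a hidden cubic relationship plus noise
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 50)
y = x**3 - x + rng.normal(0.0, 0.2, 50)

tolerance = 0.3  # acceptable RMS error, chosen arbitrarily
for degree in range(1, 10):  # start simple, escalate complexity as needed
    coeffs = np.polyfit(x, y, degree)
    rms = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    if rms < tolerance:
        break
print(degree, rms)  # stops at degree 3, the simplest model within tolerance
```

The real difficulty, as the earlier replies note, is that nothing stops such a loop from escalating until it fits the noise; some penalty on complexity is needed.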

If there is enough data, I guess the correlation would eventually be meaningful, wouldn't it? One would obviously validate the model against some test data.
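That validation step can be sketched with a simple holdout split (all numbers invented; the degree-11 polynomial has as many coefficients as there are training points, so it interpolates them):

```python
import numpy as np

# Invented data: a sine plus noise, split into training and test halves
rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 24)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, 24)
x_tr, y_tr = x[::2], y[::2]    # 12 training points
x_te, y_te = x[1::2], y[1::2]  # 12 held-out test points

results = {}
for degree in (3, 11):  # a modest model vs. one that interpolates the training set
    c = np.polyfit(x_tr, y_tr, degree)
    rms_tr = np.sqrt(np.mean((np.polyval(c, x_tr) - y_tr) ** 2))
    rms_te = np.sqrt(np.mean((np.polyval(c, x_te) - y_te) ** 2))
    results[degree] = (rms_tr, rms_te)
print(results)  # degree 11 drives the training error to ~0; judge it by the test error
```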

Is this in reality a problem for a neural network? The problem with that is that I don't know whether I have enough data.

Br, Peter

phyzguy
The point is that if I have some number of data points, I can always find a model in which the number of free parameters equals the number of data points. Then the model can fit the data exactly. We usually call this "overfitting" (see the example below). To do what you are proposing, you would have to quantitatively define the following:
(1) What makes a model "simple"? How do you measure the "complexity" of a model? Which is more complex, a cubic polynomial or an error function with one free parameter? Is it the number of free parameters? Or is a linear polynomial model simpler than some highly non-linear function?

I still think that, in order to do what you are trying to do, you need to inject some physical insight, and not search randomly through the infinite number of possible mathematical relationships between the variables.

#### Attachments

• (overfitting example plot)
Baluncore
So, you have x1, x2 ... xn input variables.
For each value of y that you recorded, did you record what the values of all x1, x2 … xn were?

For how many values of y did you record all the xi inputs?
Where is that data table?

Might be a good idea to try Fourier analysis, see if the findings hint at plausible processes, build hypotheses from such, test the resulting models...
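For evenly sampled data, that first Fourier step might look like this (signal invented for illustration; NumPy's FFT assumed):

```python
import numpy as np

# Invented signal: two sinusoids (3 Hz and 7 Hz) plus noise, sampled at 100 Hz
rng = np.random.default_rng(2)
t = np.arange(0.0, 10.0, 0.01)
y = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t) \
    + rng.normal(0.0, 0.3, t.size)

# Power spectrum: dominant peaks hint at periodic processes worth modelling
freqs = np.fft.rfftfreq(t.size, d=0.01)
power = np.abs(np.fft.rfft(y)) ** 2
peak = freqs[np.argmax(power[1:]) + 1]  # skip the DC bin
print(peak)  # near 3 Hz, the strongest component
```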

I don't remember much from my Statistics courses, but one lecturer's dire warnings about 'Lies, Damned Lies and Inappropriate Correlations' still echo !!

Stephen Tashi
to basically try any thinkable combination of variables and mathematical expressions to minimize the error between Y and Y_predicted.

As others have pointed out, doing that literally would produce nonsensical results. However, there are more sophisticated approaches to model fitting that try to find a trade-off between the number of parameters in the model and the error of the fit. This prevents ending up with a model that fits the data well but has a zillion parameters. The specific software depends on the general form of model you want. For example, look up information on ANOVA (Analysis of Variance) software.
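One standard way to quantify that trade-off (an aside, not from the thread itself) is an information criterion such as AIC, which penalizes each extra parameter; a rough sketch with invented, truly linear data:

```python
import numpy as np

# Invented data that is truly linear, so extra polynomial terms only fit noise
rng = np.random.default_rng(3)
x = np.linspace(0.0, 4.0, 40)
y = 1.5 * x + 2.0 + rng.normal(0.0, 0.5, 40)

# AIC = n*log(RSS/n) + 2k: lower is better, and each parameter costs 2
n = x.size
best = None
for degree in range(1, 8):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 1  # number of free parameters
    aic = n * np.log(rss / n) + 2 * k
    if best is None or aic < best[1]:
        best = (degree, aic)
print(best)  # the parameter penalty steers the choice back toward low degree
```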

Baluncore
For how many values of y did you record all the xi inputs?
We still do not know if this is a Bayesian or a least-squares fitting problem.
There are software packages in the cloud that seek to find optimum equations and parameters for big data sets.

BvU
Homework Helper
I have a lot of data
Sure. Why not tell us how big your n is?

The pictorial example looks pretty unreal.

If n is fairly small, you might try the Tablecurve program.

Otherwise, try PCA.
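A PCA sketch using only NumPy's SVD (data invented: five recorded inputs driven by just two underlying factors):

```python
import numpy as np

# Invented data: 200 samples of 5 inputs that really depend on only 2 factors
rng = np.random.default_rng(4)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + rng.normal(0.0, 0.05, (200, 5))  # small measurement noise

# PCA via SVD of the centred data; s**2 is proportional to explained variance
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(np.round(explained, 3))  # the first two components carry nearly all variance
```

If most of the variance sits in a couple of components, the xi are largely redundant and a model can be built on those components instead.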