Find Function/Transform for signal that minimizes CV of data

Discussion Overview

The discussion revolves around finding a function or transformation that minimizes the coefficient of variation (CV) of a set of calculated values derived from experimental data. The data consists of time-series measurements recorded at a frequency of 7.5Hz, and the goal is to identify a transformation for the measured values that meets specific statistical constraints.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant describes the data structure and the formula needed to calculate the values, emphasizing the need for a transformation function "F" that minimizes the CV of the resulting values.
  • Another participant warns against overfitting the data with a function that only works for the specific dataset, suggesting that a generalized approach using nth powers might be more effective.
  • A participant acknowledges previous attempts with basic transformations, indicating a need for more complex, flexible nonlinear transformations that could generalize better to larger datasets.
  • Various transformation functions are proposed, including polynomial forms, logarithmic, exponential, and trigonometric functions, suggesting a wide range of potential approaches to explore.

Areas of Agreement / Disagreement

Participants express differing views on the complexity and nature of the transformation required. While there is agreement on the need for a transformation to minimize CV, there is no consensus on the specific form or approach that should be taken.

Contextual Notes

Participants note the importance of avoiding overfitting and the need for transformations that maintain physical meaning and generalizability across different datasets. The discussion highlights the potential for multiple competing models without resolving which is most appropriate.

johnpjust
Warning...this requires scripting and iteration, and is not theoretical -- it is a real problem I haven't been able to solve, but I'm sure someone here can... :-)

Data: each .csv file is a test recorded at a sampling rate of 7.5 Hz, and each file has 3 columns. The first column is time in seconds, the second column is a multiplier (see formula below), and the third column is the measured value (to be "transformed"). There is also a corresponding value for each log in the "W.csv" file.

Formula (to produce a value for each file): R = [W_log_Val] / [#.log_val] -->
  • W_log_Val is the corresponding value for that file located in the "W.csv" file.
  • #.log_val = ∑ (col2_val) * F(col3_val), summed over the rows in the file
  • F(col3_val) is the function/transformation of the measured value to be found
Goal: A function/transform "F" that can be applied on the variable recorded in the third column ("col3_val") such that the coefficient of variation of all the "R" values is minimized. The CV should definitely be less than 3%, but I expect a good solution could easily make it less than 1%.
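
For anyone who wants to reproduce the numbers, here is a minimal sketch of the R and CV calculation as defined above. The data in it are synthetic stand-ins, not the contents of data.zip, and the helper names (`log_val`, `r_values`, `cv`) are made up for illustration. The toy W values were chosen so that sqrt happens to be the exact answer, which is not the case for the real data.

```python
import math
import statistics

def log_val(rows, F):
    """#.log_val: sum of col2 * F(col3) over the rows of one log."""
    return sum(col2 * F(col3) for _time, col2, col3 in rows)

def r_values(logs, w_vals, F):
    """R = W_log_Val / #.log_val, one value per log."""
    return [w / log_val(rows, F) for rows, w in zip(logs, w_vals)]

def cv(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Synthetic stand-in data (NOT from data.zip); rows are (time, col2, col3).
DT = 1 / 7.5  # sample spacing in seconds for a 7.5 Hz recording
logs = [
    [(0.0, 1.0, 4.0), (DT, 2.0, 9.0)],
    [(0.0, 1.0, 16.0), (DT, 1.0, 1.0)],
    [(0.0, 3.0, 1.0), (DT, 1.0, 25.0)],
]
w_vals = [16.0, 10.0, 16.0]  # made-up W.csv values, chosen so sqrt is exact

cv_raw = cv(r_values(logs, w_vals, lambda x: x))   # no transform
cv_sqrt = cv(r_values(logs, w_vals, math.sqrt))    # sqrt transform
print(f"CV, no transform: {cv_raw:.1%}")
print(f"CV, sqrt:         {cv_sqrt:.1%}")
```

With the real files, `logs` would be built by reading each .csv into (time, col2, col3) rows and `w_vals` from the matching entries in "W.csv".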

Additional Constraint: A plot of the R values against VF values should show no trend/pattern, where -->
  • VF = [#.log_val]/[Ts]
  • "Ts" = the total time for each log (i.e., the value in the last row of the first column for each log)
See an example of the trend when no transform is applied in attached jpg.
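
A rough way to put a number on "no trend" is the correlation between R and VF. The sketch below assumes Ts is the last time stamp in each log, as stated above, and uses a Pearson coefficient, which is my own choice of check and only catches linear trends, so the plot is still worth looking at. The logs and W values are synthetic, and the function names are hypothetical.

```python
import math
import statistics

def log_val(rows, F):
    """#.log_val: sum of col2 * F(col3) over the rows of one log."""
    return sum(col2 * F(col3) for _time, col2, col3 in rows)

def vf_value(rows, F):
    """VF = #.log_val / Ts, with Ts taken as the last time stamp in the log."""
    ts = rows[-1][0]
    return log_val(rows, F) / ts

def pearson(xs, ys):
    """Pearson correlation coefficient; only sensitive to linear trends."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic stand-in data (NOT from data.zip); rows are (time, col2, col3).
logs = [
    [(0.0, 1.0, 4.0), (2.0, 2.0, 9.0)],
    [(0.0, 1.0, 16.0), (4.0, 1.0, 1.0)],
    [(0.0, 3.0, 1.0), (3.0, 1.0, 25.0)],
]
w_vals = [16.0, 10.0, 16.0]  # made-up W.csv values

identity = lambda x: x  # "no transform" case
R = [w / log_val(rows, identity) for rows, w in zip(logs, w_vals)]
VF = [vf_value(rows, identity) for rows in logs]
r_trend = pearson(VF, R)
print(f"correlation between VF and R: {r_trend:+.2f}")
```

A value near zero would be consistent with the constraint, while a value near +1 or -1 indicates the kind of trend shown in the attached plot.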

Example:
Applying a square-root (sqrt) transformation to the measured value "col3_val" helps significantly, but the CV is still around 8-9% and the R values still show a trend against VF, so the constraint is not satisfied.

See the trend after applying sqrt in the attached PDF.
 

Attachments

  • data.zip (71.7 KB)
  • Fit R by VF.jpg (7.6 KB)
  • sqrtPlot.pdf (34.3 KB)
Without additional constraints the problem will have a mathematical solution you clearly don't want: some weird jumping function that works exactly with the dataset used to produce it (and nothing else), based on tuning the function value for specific col3_val appearing in your dataset.

To generalize the sqrt attempt, you can take the nth power of the values and see which n works best (for real n).
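
A minimal sketch of such a sweep, assuming F(x) = x^n with all measured values positive (a negative value raised to a non-integer power would not give a real number). The data are synthetic and constructed so that n = 0.5 is exact; the grid range and step are arbitrary choices, not something derived from the real files.

```python
import statistics

def cv_for_power(logs, w_vals, n):
    """CV of R = W / sum(col2 * col3**n) across all logs."""
    r = [
        w / sum(col2 * col3 ** n for _time, col2, col3 in rows)
        for rows, w in zip(logs, w_vals)
    ]
    return statistics.stdev(r) / statistics.mean(r)

# Synthetic stand-in data (NOT from data.zip); rows are (time, col2, col3).
DT = 1 / 7.5  # sample spacing in seconds for a 7.5 Hz recording
logs = [
    [(0.0, 1.0, 4.0), (DT, 2.0, 9.0)],
    [(0.0, 1.0, 16.0), (DT, 1.0, 1.0)],
    [(0.0, 3.0, 1.0), (DT, 1.0, 25.0)],
]
w_vals = [16.0, 10.0, 16.0]  # made-up W.csv values, chosen so n = 0.5 is exact

# Candidate exponents 0.10, 0.15, ..., 2.00 (arbitrary grid).
exponents = [k / 20 for k in range(2, 41)]
best_n = min(exponents, key=lambda n: cv_for_power(logs, w_vals, n))
best_cv = cv_for_power(logs, w_vals, best_n)
print(f"best n = {best_n:.2f}, CV = {best_cv:.2%}")
```

On real data the minimum will not be zero, and the trend of R against VF still has to be checked separately at the best n.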

The unchanged function looks close to
$$R \approx \frac{1}{VF} = \frac{Ts}{[\#.log\_val]}$$
The dependence on #.log_val follows its definition:
$$R = \frac{[W\_log\_Val]}{[\#.log\_val]}$$
which is suspicious.
 
Thanks for your response.

I've tried the obvious stuff already (including a sweep over n powers). While I agree with your implicit point (that one does not want to over-fit the data) in trying to generalize the fit, I'm at the point where I need to "step up" to another level of complexity because I know this same issue exists in a larger data set. I'm not sure if that means a sigmoidal function or something similar, but my experience with the problem tells me if someone knows of a more "flexible" nonlinear transformation (more tune-able parameters that still produce monotonically increasing values post-transformation), then I think a solution is possible that will generalize well to the larger data set.

Also, I do realize the stipulation/constraint I have looks somewhat suspicious at first, but it does have physical meaning and the added "W_log_Val" term is what makes it an independent term (so it turns out OK).
 
- (x+c)^n
- log(x+c)
- e^(cx)
- atan(x/c + d)
- arbitrary, scaled sums of the options above (including n=0 which gives a constant)
- sine, cosine, if there is a motivation to expect oscillations
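
As an illustration of how one of these families could be tuned, here is a sketch for the first option, F(x) = (x+c)^n, which stays monotonically increasing for n > 0 as long as x + c > 0. The brute-force grid search is my own stand-in for a proper optimizer, the parameter grids are arbitrary, and the logs and W values are synthetic, generated from a hidden "true" transform so that the search has a known answer.

```python
import itertools
import statistics

def cv_of_r(logs, w_vals, F):
    """CV of R = W / sum(col2 * F(col3)) across all logs."""
    r = [
        w / sum(col2 * F(col3) for _time, col2, col3 in rows)
        for rows, w in zip(logs, w_vals)
    ]
    return statistics.stdev(r) / statistics.mean(r)

def shifted_power(c, n):
    """F(x) = (x + c)^n; increasing for n > 0 as long as x + c > 0."""
    return lambda x: (x + c) ** n

# Synthetic stand-in data (NOT from data.zip); rows are (time, col2, col3).
logs = [
    [(0.0, 1.0, 3.0), (1.0, 2.0, 8.0)],
    [(0.0, 1.0, 15.0), (1.0, 1.0, 0.5)],
    [(0.0, 3.0, 1.0), (1.0, 1.0, 24.0)],
    [(0.0, 2.0, 5.0), (1.0, 2.0, 2.0)],
]
# Made-up W values built from a hidden transform (c = 1, n = 0.5).
true_F = shifted_power(1.0, 0.5)
w_vals = [2.0 * sum(col2 * true_F(col3) for _t, col2, col3 in rows) for rows in logs]

c_grid = [k / 2 for k in range(0, 7)]     # 0.0, 0.5, ..., 3.0
n_grid = [k / 10 for k in range(1, 21)]   # 0.1, 0.2, ..., 2.0
best_c, best_n = min(
    itertools.product(c_grid, n_grid),
    key=lambda p: cv_of_r(logs, w_vals, shifted_power(*p)),
)
best_cv = cv_of_r(logs, w_vals, shifted_power(best_c, best_n))
print(f"best c = {best_c}, best n = {best_n}, CV = {best_cv:.2%}")
```

The same `cv_of_r` objective works for the other families or for scaled sums of them. Each extra tunable parameter raises the overfitting risk mentioned earlier, so fitting on some files and checking the CV on the remaining ones would be a sensible safeguard.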
 
