MHB How to calculate max/min scales on a scatter plot

  • Thread starter: expertalmost
  • Tags: Plot
AI Thread Summary
The discussion focuses on establishing smooth maximum and minimum lines for log scatter plots using time series data. The user seeks a robust mathematical method, noting that traditional mean and standard deviation approaches are ineffective due to data clumping. They currently use a rank-type smoothing method, averaging the largest and smallest values from a sample to create a 0-1 scaling system. The goal is to ensure that no more than 10-15% of values are clipped outside the established max/min lines. The conversation highlights the challenges of adapting methods for different datasets while maintaining a consistent scaling approach.
expertalmost
Good morning!

I have 3 log scatter plots for which I want to establish smooth maximum and minimum lines. What is the usual mathematical method for doing that? (Image and Excel file links below.)

The black lines on the scatter plot images are hand-drawn. The third scatter plot is especially tricky and is not amenable to a moving average plus standard deviation because of the data clumping. Note: this is time-series data, so new data constantly comes in; in other words, I cannot just use the whole data population in one shot.

Any ideas would be greatly appreciated.

Excel File: https://dl.dropboxusercontent.com/u/44057708/Three%20Scatters.xls
Image at: https://dl.dropboxusercontent.com/u/44057708/ThreeScatters.jpg
 
Can you give us a little more context? Here are some questions I have:

1. How is this data generated? What are you measuring?

2. Is it important that every single data point in one cluster lies between your smooth max and min lines? Or is it enough that the vast majority lie between the two lines?

3. What is the data rate of this data? That is, how fast is the data coming in?

4. Are there any other features you'd like to know about the data? Local peaks, for example?
 
Thank you for your time and questions! I appreciate your efforts. Here are some brief answers to your questions.

1) These come from financial market analysis; the values are the log-transformed values of market data. Whether the data is truly log-normal is not really a concern, since extremes are clipped and indicated as such. Using a mean/standard-deviation analysis on the third series does not work well due to the data clumping. I am looking for a solution elegant, robust, and general enough for all three data sets, and I have many groups of three data sets.

2) Not every point needs to lie between my max/min lines. I was targeting 80% on the minimum side, due to the paucity of points there and because zero is a less critical component, and 95% on the maximum side.

3) The data is coming in slowly. Only using daily analysis now.

4) In this case, I am not interested in local peaks other than how well they get smoothed in the final scaling.

Hope this helps define the problem more clearly :)

Thank you again for your interest.
 
You say that the mean/std dev approach doesn't work. What if you computed a moving average on the basis of a lot more data points? For example:

1. Fit a cubic polynomial to the data. Excel will do this quite readily. Suppose the result to be $f(t)$.
2. Compute the maximum deviation from the cubic, and construct an envelope around $f(t)$ thus: $f(t) \pm \text{max dev}$. That would guarantee all the data would be in the envelope.

However, the envelope might not be tight enough. To help you more, I think I still need to know your design requirements better. By what criteria would you judge the "goodness" of the envelope?
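
For what it's worth, here is a minimal sketch of that suggestion in Python rather than Excel: fit a cubic by least squares and build an envelope at plus/minus the largest deviation from the fit. The function name and the synthetic data are illustrative only.

```python
# Sketch of the cubic-fit-plus-envelope idea, assuming the data is available
# as numeric arrays t (time) and y (values). Not the poster's actual data.
import numpy as np

def cubic_envelope(t, y):
    """Fit f(t) as a cubic polynomial and return (fit, lower, upper) arrays."""
    coeffs = np.polyfit(t, y, deg=3)      # least-squares cubic fit
    f = np.polyval(coeffs, t)             # fitted values f(t)
    max_dev = np.max(np.abs(y - f))       # largest deviation from the fit
    return f, f - max_dev, f + max_dev    # envelope guaranteed to contain all points

# Illustrative usage with synthetic data:
t = np.linspace(0, 10, 200)
y = 0.1 * t**3 - t**2 + 3 * t + np.random.normal(0, 2, t.size)
fit, lower, upper = cubic_envelope(t, y)
```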
 
Thank you for your suggestions! You obviously know considerably more math than me and I appreciate your insights and experience. I will have to investigate cubic polynomials.

Right now I take the 25 largest/smallest of the last 100 elements and average them. I also add a standard-deviation amount to the max (ad hoc, yes!), then smooth by damping (multiplying each change by 0.1 before applying it). I was hoping there was a more elegant, robust, and general solution, as I have to tweak the standard-deviation and damping factors for different data sets.

The issue I have with using a larger data sample is the lag it introduces.

The use is quite simple: I use previous data to establish stable max/min levels so I can scale new values as they come in. That gives me a 0-1 range that is consistent and meaningful across data sets. As for goodness, again, it is ad hoc: no more than 10-15 percent of the values should be clipped above/below my max/min scale. So the upper black line is my 1 and the lower is my 0, and as new data comes in it is scaled to the most recent 0-1 range and then used to update the population sample. Standard time-series analysis (hopefully).
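
As a small illustration of that scaling step, here is a sketch in Python, assuming the running max/min levels are already maintained elsewhere; the function name and clipping behaviour follow the description above, but the details are illustrative.

```python
def scale_to_unit(value, running_min, running_max):
    """Map a new observation onto the current 0-1 scale, clipping outliers."""
    if running_max == running_min:
        return 0.0                          # degenerate range: avoid division by zero
    scaled = (value - running_min) / (running_max - running_min)
    return min(max(scaled, 0.0), 1.0)       # clip values outside the max/min lines
```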

My apologies for the lengthy replies. Not having much formal training in this, I end up using more words than probably necessary. Thank you for your patience! Hope it helps clarify what I'm trying to do. :)
 
I've generated a working solution and just wanted to post it for future generations ;)

Someone pointed out that because the data does not really fall under "standard error" assumptions, due to the gaps, medians plus/minus a standard deviation would not really work. I have confirmed this with many days of attempts.

Therefore, the other typical solution, as far as I can tell, is called rank-type smoothing or moving quantiles, with damping of the end result. Basically, using a sample size of 100, I take the average of the 10 largest items for the maximum and the average of the 25 smallest items for the minimum. I then dampen the changes to 10% for smoothing. This gives me maximum and minimum lines smooth enough to use as a 0-1 scale.
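
A minimal Python sketch of that procedure is below. The window size (100), rank counts (10 largest, 25 smallest), and 10% damping factor are the values stated above; the class and method names are illustrative only.

```python
from collections import deque

class RankSmoother:
    """Rank-type smoothing / moving quantiles with damping, as described above."""

    def __init__(self, window=100, n_top=10, n_bottom=25, damping=0.1):
        self.window = deque(maxlen=window)   # rolling sample of the most recent values
        self.n_top, self.n_bottom, self.damping = n_top, n_bottom, damping
        self.max_line = None
        self.min_line = None

    def update(self, value):
        """Add a new observation and return the damped (min_line, max_line)."""
        self.window.append(value)
        ordered = sorted(self.window)
        n = len(ordered)
        raw_max = sum(ordered[-self.n_top:]) / min(self.n_top, n)      # mean of largest items
        raw_min = sum(ordered[:self.n_bottom]) / min(self.n_bottom, n)  # mean of smallest items
        if self.max_line is None:            # first call: initialise without damping
            self.max_line, self.min_line = raw_max, raw_min
        else:                                # damp: apply only 10% of each change
            self.max_line += self.damping * (raw_max - self.max_line)
            self.min_line += self.damping * (raw_min - self.min_line)
        return self.min_line, self.max_line
```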

Hope this helps!
 
Glad you found something that worked! I can't say I understand it - this shows that you might not be so far below me in mathematical knowledge as you thought. ;)
 
