# Comparing curves using Gaussian process regression

Hi guys,

I have run multiple simulations on networks that are all slightly perturbed from each other. They produce slightly different curve outputs on an x-y graph, which I now need to analyse (it has been about 5 years since I did any statistical analysis, hence why I am here). A couple of the perturbed networks have quite outlying curves (judging by comparing them on a graph together), but the majority are all fairly similar. I was wondering what would be the best statistical method for defining an outlying curve from the rest of the curves? Someone mentioned to me a method called Gaussian process regression, but I am finding it hard to understand.

Cheers

Stephen Tashi
I have run multiple simulations on networks
You should explain whether you are running stochastic simulations or deterministic simulations.

that are all slightly perturbed from each other.

One guess about what that means is that the network is defined by a set of parameters and you vary these on each run of the simulation. But it isn't clear whether you vary them by picking numbers at random and then run a deterministic simulation or whether running the simulation has further simulation of random "errors".

I was wondering what would be the best statistical method for defining an outlying curve from the rest of the curves?

To give a mathematical answer to that, you would have to define "best" precisely. To answer that informally, you need to informally state what you are trying to accomplish. Are you trying to publish a paper? - reject defective components at a factory?

You should explain whether you are running stochastic simulations or deterministic simulations.

The simulations are stochastic.

One guess about what that means is that the network is defined by a set of parameters and you vary these on each run of the simulation. But it isn't clear whether you vary them by picking numbers at random and then run a deterministic simulation or whether running the simulation has further simulation of random "errors".

The network is a set of 200-300 nodes. Each simulation is missing one particular node. The aim is to identify the importance of any nodes in the network.

To give a mathematical answer to that, you would have to define "best" precisely. To answer that informally, you need to informally state what you are trying to accomplish. Are you trying to publish a paper? - reject defective components at a factory?

The end game of the results will be to publish a paper.

Hope this helps.

Stephen Tashi
The aim is to identify the importance of any nodes in the network.

Is there any notion of what is a "correct" x-y graph for a network? For example, if a network of nodes controls something, then the correct behavior might be rhythmic.

Is there any stochastic variation in the x-output? For example, if x = k means we executed the k-th step of an algorithm, then x is precise. However, if x[k] were "pressure in the tank" at k seconds and y[k] were the temperature of the liquid at k seconds, then both x and y data might have some stochastic variation.

It is essentially a random walker on a network. The x-y output is essentially time (iterations of random walker from node to node) against a value that I define as 'the likelihood of getting trapped in a subnetwork within the major network'. The network does not control anything, the simulation merely looks at exploring a random network which has minor networks within it. It looks at the strength of particular subnetworks essentially. I can go into more depth if it helps?

Thanks again for the replies.

Stephen Tashi
A common way to measure the "distance" between two curves given by discrete sets of data $(x_a, y_a)$ and $(x_b,y_b)$, sampled at the same $n$ x-values, is to use the mean square difference $d = (1/n) \sum_{i=1}^n (y_{a,i} - y_{b,i})^2$.

However, if you consider two curves that are offset by a small time displacement to be "about the same" as each other, this way of measuring doesn't agree with that idea. For example, (0,0),(1,10),(2,-10),(3,10),(4,-10).... and (0,10),(1,-10),(2,10),(3,-10),(4,10).... have a visual similarity but a large mean square difference.

One of the first questions to ask is whether the mean square difference expresses your opinion of how close or far apart two curves are. What properties do curves that you call outliers have that makes them different from the majority of curves?
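To make that concrete, here is a minimal sketch of the mean square difference applied to the two alternating curves above, truncated to five points. It assumes both curves are sampled at the same x-values; the function name `mean_square_difference` is just a label chosen for this illustration.

```python
import numpy as np

def mean_square_difference(y_a, y_b):
    """Mean of the squared pointwise differences between two curves.

    Assumes both curves are sampled at the same x-values.
    """
    y_a = np.asarray(y_a, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    if y_a.shape != y_b.shape:
        raise ValueError("curves must be sampled at the same x-values")
    return float(np.mean((y_a - y_b) ** 2))

# The two alternating curves from the example, truncated to five points.
curve_a = [0, 10, -10, 10, -10]
curve_b = [10, -10, 10, -10, 10]

d_shifted = mean_square_difference(curve_a, curve_b)  # visually similar, but far apart
d_same = mean_square_difference(curve_a, curve_a)     # identical curves
print(d_shifted, d_same)
```

The shifted pair comes out far from zero even though the two curves look alike by eye, which is why it matters whether this measure matches your own idea of "close".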

Thanks for the reply again. I'll try my best to explain how the outlying curves differ from the majority.

All the results are positive. The results all begin at (0,0) and all end up at about (10^3, 0) on a log-graph. In between these two x-values they tend to increase in a general hump formation with small peaks protruding every now and then. However, between x = 10^2 and 10^3 the outlying curves exhibit an extra peak, whereas the rest of the curves tend towards zero without exhibiting this extra peak. I was hoping to show that this extra peak, which the outliers exhibit, is statistically relevant and not just random.

Cheers, I do apologise, it has been so long since I used statistics....

Stephen Tashi
The two main methods in statistics are 1) estimation and 2) hypothesis testing.

In estimation, you assume that data is generated by a stochastic model with unknown parameters. (i.e. You assume the data is generated by a member of a specific "family" of stochastic models). Then you estimate the parameters of the model from the data.

In hypothesis testing, you assume the data is generated by a specific stochastic model. (This is the "null hypothesis") Then you compute the probability of some aspect of the observed data based on that assumption. If the probability of that aspect of the data is "small", then you "reject" the null hypothesis.

The phrasing of the null hypothesis often sounds vague, but it actually must be specific enough to allow computing the probability of the aspect of the data that is under consideration. There are "non-parametric" methods of statistics that make only very general assumptions about how the data is generated - but you always must have enough assumptions to be able to compute the probability of the aspect of the observed data that is of interest.

Both methods involve subjective decisions. So if you are writing this up to publish, it's a good idea to browse the journals you are submitting to and see what statistical methods the editors of those journals have accepted.

To fit your problem into a textbook application of statistical hypothesis testing, you need to select some scalar aspect of the data to analyze. Some possibilities: 1) Maximum value of the curve between 10^2 and 10^3. 2) Number of local maxima in the curve between 10^2 and 10^3. 3) Distance between the curve and some reference curve, where distance is defined by some algorithm that produces a scalar value.
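As a rough sketch of what those three scalar summaries could look like in practice, the Python below computes each one for a single curve stored as arrays `x` and `y`. The function names, the strict "higher than both neighbours" definition of a local maximum, and the use of a mean square difference as the distance are all assumptions made for illustration. Noisy curves would probably need smoothing before counting peaks.

```python
import numpy as np

def window_mask(x, lo=1e2, hi=1e3):
    """Boolean mask selecting the points with lo <= x <= hi."""
    x = np.asarray(x, dtype=float)
    return (x >= lo) & (x <= hi)

def max_in_window(x, y, lo=1e2, hi=1e3):
    """1) Maximum value of the curve inside the window."""
    y = np.asarray(y, dtype=float)
    return float(y[window_mask(x, lo, hi)].max())

def count_local_maxima(x, y, lo=1e2, hi=1e3):
    """2) Number of points in the window strictly higher than both neighbours."""
    y = np.asarray(y, dtype=float)
    is_peak = np.zeros(len(y), dtype=bool)
    is_peak[1:-1] = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    return int(np.sum(is_peak & window_mask(x, lo, hi)))

def distance_to_reference(x, y, y_ref, lo=1e2, hi=1e3):
    """3) Mean square difference to a reference curve inside the window."""
    y = np.asarray(y, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    m = window_mask(x, lo, hi)
    return float(np.mean((y[m] - y_ref[m]) ** 2))

# Made-up toy curve, only to show the calls.
x = np.array([10, 100, 200, 400, 800, 1000, 2000])
y = np.array([1, 2, 5, 3, 4, 1, 0])
y_ref = np.zeros_like(y)

print(max_in_window(x, y), count_local_maxima(x, y), distance_to_reference(x, y, y_ref))
```

Whichever summary is chosen, it reduces each curve to one number, which is what a textbook hypothesis test needs.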

How would a reader react if you wrote a statistical analysis that focused on whether the curve had an extra peak between 10^2 and 10^3? Would they say "That's important" or would they say "He just cherry-picked that aspect of the curve to justify the idea that some nodes are important"?

I think the distance between the curve and some reference curve would be a suitable method of analysis. So for example knowing all the y-values for the different networks at a particular x-value, I can generate a reference curve from the mean. Then using a cut-off such as 2 or 3 standard deviations I could show the statistical importance of a particular curve at that x-value? So my null hypothesis would be along the lines of "mutation N does not perturb the network".

Stephen Tashi
So for example knowing all the y-values for the different networks at a particular x-value, I can generate a reference curve from the mean.

Whether that works depends on your audience. Some may wonder whether you cherry-picked a particular y-value. Do you have a theoretical argument why a particular y-value is very important?

One null hypothesis is:

The perturbed network N_1 produces curves (or a certain aspect of a curve) by the same stochastic process as the network N_0.

If you want to compare the network N_1 to a collection of networks, you'll need a hypothesis like:

The perturbed network N_1 produces curves by the same stochastic process as randomly selecting a network from the collection C and producing a curve from the selected network.
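One possible way to test the first of these null hypotheses (N_1 versus N_0) is a permutation test on a scalar summary of each curve. This is only a sketch of one technique, not something prescribed above: it assumes you have one scalar per simulation run for each network, and the function name, the choice of "difference of means" as the test statistic, and the number of permutations are all illustrative.

```python
import numpy as np

def permutation_p_value(stat_0, stat_1, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    stat_0, stat_1: one scalar summary per run for networks N_0 and N_1.
    Under the null hypothesis the network labels are exchangeable, so we
    shuffle them and see how often the shuffled difference is at least
    as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    stat_0 = np.asarray(stat_0, dtype=float)
    stat_1 = np.asarray(stat_1, dtype=float)
    observed = abs(stat_1.mean() - stat_0.mean())
    pooled = np.concatenate([stat_0, stat_1])
    n_1 = len(stat_1)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:n_1].mean() - shuffled[n_1:].mean())
        if diff >= observed:
            count += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)

# Made-up scalars: the two groups are deliberately far apart.
stat_0 = np.arange(20)
stat_1 = np.arange(20) + 1000
p_far = permutation_p_value(stat_0, stat_1, n_permutations=2000)
print(p_far)
```

A small value here only says the observed difference would be unusual if the two networks generated the summary by the same process; it does not give the probability that the null hypothesis is true.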

For a given population of curves, you could estimate the curve that is the "mean trace" of that collection of curves by computing the sample mean y value at each x value. Then you could define the "distance" between a curve and the mean trace by some procedure that produces a scalar value. Whether this makes sense depends on your field of study. For example, the mean trace of a collection of curves might be a curve that is physically implausible.
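A minimal sketch of the mean trace idea follows, assuming every curve is sampled at the same x-values and stored as one row of a 2-D array. Using the mean square difference as the scalar distance is just one choice, and the data are invented for illustration.

```python
import numpy as np

def mean_trace(curves):
    """Sample mean y value at each x value; curves has shape (n_curves, n_points)."""
    return np.asarray(curves, dtype=float).mean(axis=0)

def distances_to_mean_trace(curves):
    """Mean square difference between each curve and the mean trace."""
    curves = np.asarray(curves, dtype=float)
    trace = curves.mean(axis=0)
    return np.mean((curves - trace) ** 2, axis=1)

# Made-up data: three identical curves and one with an extra peak.
curves = np.array([
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 5, 0],
])

trace = mean_trace(curves)
distances = distances_to_mean_trace(curves)
print(trace, distances, int(np.argmax(distances)))
```

Note that the unusual curve pulls the mean trace towards itself. Computing the trace with the curve under test left out is a common alternative, and that again is a choice you would have to justify.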

Can the same network produce different curves on different runs of the simulation? Are you getting the (x,y) data for a given network by averaging many different (x,y) curves for the same network?

----

Keep in mind that hypothesis testing is a procedure - it isn't a proof. It doesn't quantify the probability that the null hypothesis is true and it doesn't quantify the probability that the null hypothesis is false. It's simply a procedure. "Statistical significance" likewise does not quantify the probability that the null hypothesis is false or quantify the probability that it is true. Because of these limitations, whether an audience accepts a hypothesis test as evidence is a subjective matter.

Can the same network produce different curves on different runs of the simulation? Are you getting the (x,y) data for a given network by averaging many different (x,y) curves for the same network?

A single network technically does produce different curves on every run of the simulation; however, in general these curves are almost indistinguishable from each other. Technically I am running multiple simulations (about 100) for each particular network but not averaging. I will explain this further.

So the simulation essentially identifies a good partition of the network. However, to define the partitions of the network I am using a greedy Louvain algorithm that doesn't always give the same partitions of the network (unless the partitions are really obvious). So the y-value is essentially a measure of how similar the different partitions (as provided by the Louvain algorithm) are to each other. So whilst I am running multiple simulations of the same network, I am not plotting a mean value!

I imagine most of that didn't make sense... I am not good at explaining!

Thanks for the reply, and also for the further information on procedures and proofs.

Stephen Tashi
A single network technically does produce different curves on every run of the simulation, however, in general these curves are almost indistinguishable from each other. Technically I am running multiple simulations (about 100) for each particular network but not averaging,

There are various types of "outlier" situations that might exist.

1) An outlier might be a curve F1 from a network N1 that is drastically different from the other curves produced by the same network N1.

2) An outlier might be a curve F1 from a network N1 that is drastically different from other curves produced by N1 and other curves produced by other networks.

3) An outlier might be a network N2 such that some or most of the curves produced by N2 are drastically different than the curves produced by other networks.

4) An outlier might be a group of networks G2 such that some or most of the curves they produce are drastically different than the curves produced by the other networks.

Generally when people use the term "outlier" they do so with the intention of discarding outliers from an analysis, but the methods used to identify outliers often amount to hypothesis tests.