Bayesian Statistics - obtaining parameters for model from real data

1. Jan 5, 2013

trelek2

Hello,

I've got some data on an epidemic in various locations: the total number of agents and the number killed by the infection after 1 year. This gives me a distribution of the percentages of each population killed by the infection (all the percentage values are relatively small).

I wrote a mathematical ODE model for the disease spread within a population with 3 free parameters:

p1 - probability of getting infected externally from the environment
p2 - probability of infecting a new agent once at least one is already sick
p3 - probability that a replacement agent is already infected (once an agent dies, it is replaced with a new one)

Now I need to choose values for p1,p2 and p3 so that the model generates data distributed as closely to the original distribution as possible.
The trouble is that I have never done anything like this before and have very little experience with any sort of statistics.

How should I define the original data distribution - a list of percentages of killed agents? a continuous function somehow?

Then should I choose values for p1,p2,p3 by trial and error and run simulations multiple times to also generate distributions of data?

Lastly, is there a proper way of comparing the generated data with the original set? I've seen something somewhere about distance functions. What would be the best way of implementing this?

2. Jan 5, 2013

Stephen Tashi

That goal doesn't quite define a mathematical problem. To define a mathematical problem you must reveal how you intend to measure the "closeness" of the model and the data.

Also, your goal is not necessarily related to Bayesian Statistics. A typical goal of using Bayesian statistics would be to find the values p1,p2,p3 that have maximum posterior probability given the data. To do that, you must specify a prior probability distribution for p1,p2,p3.

3. Jan 5, 2013

trelek2

That's what I was mainly asking about - how should I go about measuring the closeness of the data? I don't have ideas for that.

Since I don't know what p1,p2,p3 might be I could set them to be uniform distributions between 0 and 0.1, since I expect them to be small.

4. Jan 5, 2013

Stephen Tashi

There is no universal law of statistics that tells you how to measure the closeness of a model to data.

In special situations people know what decisions will be made by using a model with particular parameters. Definite economic (or other) rewards and costs can be assigned for "errors" between the model's predictions and actual events.

However, it is more common that people fitting models to data can't specify how the model will be put to practical use. In that case there are certain methods of fitting that are traditional - and tradition isn't any objective justification for the methods being "best".

One traditional method is "maximum likelihood" fitting. You search for the values of p1,p2,p3 that make the data "most likely". If you are dealing with discrete probability distributions, you can say that you search for the values of p1,p2,p3 that make the observed data the most probable sample. If you are dealing with continuous distributions, the word "likelihood" is used instead of "probability" since the probability of picking an exact sample value v from a continuous distribution is usually zero, even if the value of the probability density f(x) of the distribution is not zero at x = v. We can call f(v) a "likelihood", but not a probability.
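As a small illustration of both cases, here is a sketch in Python. It assumes, purely for illustration, that every agent in a location dies independently with the same probability q, so each death count is binomial. The parameter q, the data values and the function names are made up and are not trelek2's model.

```python
import math

def binomial_log_likelihood(q, data):
    """Log-probability of the observed (n_agents, n_killed) pairs if each
    agent dies independently with probability q (illustrative assumption)."""
    total = 0.0
    for n, k in data:
        log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += log_choose + k * math.log(q) + (n - k) * math.log(1.0 - q)
    return total

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Made-up data: (total agents, number killed) for three locations.
data = [(1000, 12), (500, 9), (2000, 31)]

# Discrete case: crude grid search for the q that makes the data most probable.
grid = [i / 10000 for i in range(1, 1000)]
q_hat = max(grid, key=lambda q: binomial_log_likelihood(q, data))
print("maximum likelihood q on the grid:", q_hat)

# Continuous case: a density value can exceed 1, so it is a likelihood,
# not a probability.
density = normal_pdf(0.015, 0.015, 0.01)
print("density at the mean:", density)
```

For this toy model the grid search should land next to the pooled death rate (total killed divided by total agents), which is the exact maximum likelihood estimate for a binomial.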

You can do "maximum likelihood" fitting without using a Bayesian prior, but you can't claim the parameters that maximize the likelihood are "the most probable values of the parameters".

If you assume Bayesian prior distributions for the parameters then you can (in theory) compute a posterior distribution for the parameters and pick, as your estimate, the value that is "most likely" in the posterior distribution. (Often the "most likely" Bayesian posterior parameters are nearly the same as the parameters that maximize the likelihood of the data, but there is no theorem that this must always be true.)
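To make the role of the prior concrete, here is a sketch on a grid for a single parameter. It again assumes a toy binomial death model with one probability q, which is not trelek2's model. The uniform prior on (0, 0.1) is the one trelek2 suggested. The Beta(2, 50)-shaped prior and the data are hypothetical.

```python
import math

def log_likelihood(q, data):
    # Binomial log-likelihood, dropping a constant that does not depend on q.
    return sum(k * math.log(q) + (n - k) * math.log(1.0 - q) for n, k in data)

def log_uniform_prior(q):
    # Uniform on (0, 0.1): constant inside the interval, impossible outside.
    return 0.0 if 0.0 < q < 0.1 else -math.inf

def log_informative_prior(q):
    # Hypothetical Beta(2, 50) shape, up to an additive constant.
    return math.log(q) + 49.0 * math.log(1.0 - q)

# Made-up small data set: (total agents, number killed).
data = [(100, 1), (50, 0)]
grid = [i / 10000 for i in range(1, 1000)]

mle = max(grid, key=lambda q: log_likelihood(q, data))
map_uniform = max(grid, key=lambda q: log_likelihood(q, data) + log_uniform_prior(q))
map_informative = max(grid, key=lambda q: log_likelihood(q, data) + log_informative_prior(q))

print("maximum likelihood:", mle)
print("posterior mode, uniform prior:", map_uniform)
print("posterior mode, Beta-shaped prior:", map_informative)
```

With a flat prior the posterior mode coincides with the maximum likelihood value inside the prior's range. With little data and a non-flat prior, the two can differ noticeably.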

To give more specific advice, we need to know how your model is implemented. How hard would it be to compute the likelihood of the data for a given set of parameters p1,p2,p3? For example, if your model is implemented as a Monte-Carlo simulation, it might be rather hard. Is it possible to compute the likelihood of the data in a deterministic way?

5. Jan 5, 2013

Stephen Tashi

It occurs to me that since you used an ODE model, your predictions for the data might be deterministic - i.e. you might predict only one possibility for the data instead of a probability distribution of possibilities. If you are using a deterministic model then probability and statistics can't help you unless you put some probabilistic feature into the scenario. You can assume there is something that introduces error in measuring the data or you can assume there is something that introduces variation in the model.
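A sketch of the first option, measurement error added to a deterministic prediction, might look like this. The prediction function is a hypothetical stand-in for an ODE solution. The Gaussian error with standard deviation sigma is an assumption, and the observations are made up.

```python
import math

def predicted_fraction(p, t):
    # Hypothetical deterministic stand-in for an ODE solution:
    # fraction killed by time t under a constant hazard p.
    return 1.0 - math.exp(-p * t)

def gaussian_log_likelihood(p, observations, sigma):
    # Assumes observed = predicted + independent Normal(0, sigma^2) error.
    ll = 0.0
    for t, y in observations:
        r = y - predicted_fraction(p, t)
        ll += -0.5 * math.log(2.0 * math.pi * sigma ** 2) - r * r / (2.0 * sigma ** 2)
    return ll

def sum_squared_error(p, observations):
    return sum((y - predicted_fraction(p, t)) ** 2 for t, y in observations)

# Made-up observations: (time, fraction killed).
observations = [(1.0, 0.011), (2.0, 0.019), (3.0, 0.031)]
grid = [i / 10000 for i in range(1, 1000)]

p_ml = max(grid, key=lambda p: gaussian_log_likelihood(p, observations, sigma=0.002))
p_ls = min(grid, key=lambda p: sum_squared_error(p, observations))
print(p_ml, p_ls)
```

Under this particular error assumption with a fixed sigma, maximizing the likelihood is the same as minimizing the sum of squared errors, so the two searches should pick the same grid point. A different error model would give a different fitting criterion.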

6. Jan 5, 2013

trelek2

Sorry, it is not an ODE model. That was my initial idea, but I ended up with a Monte Carlo type simulation without a lattice.

That's why I am not sure how to proceed. All I can really do with my current knowledge is run sample simulations using guessed parameters.

7. Jan 5, 2013

Stephen Tashi

What is the exact format of the data you have? Is it a single value (number of people infected)? Or is it a sequence of values (number of people infected at time t = 1,2,...N)?

8. Jan 6, 2013

trelek2

I do have the time points for each of the values.

9. Jan 6, 2013

Stephen Tashi

So an output of the model is a vector of values. There are probably too many possible output vectors for you to run the model enough times to estimate the probability of each possible output vector occurring.

I think we must get into the details of your model to find a way to fit it to data. At a given time step, what are the variables that define the state of the process? What algorithm produces the next state from the current state? Which of the state variables are recorded in the real data and which are not known from the real data? (For example, you mention "agents". I'd guess the number of agents is a state variable in the model but not something that was measured in the real data.)
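Without those details only a generic sketch is possible. One common workaround when the likelihood of an output vector cannot be computed is to draw parameters from the prior, run the simulation, and keep the draws whose output lies within some distance of the data. This is usually called approximate Bayesian computation (ABC) by rejection. It has not been proposed elsewhere in this thread and is shown only as one option. The toy simulator below is hypothetical, ignores p3, and is not trelek2's model. The Euclidean distance, the threshold epsilon and the "observed" series are also assumptions.

```python
import math
import random

def simulate(p1, p2, n_agents=200, n_steps=10, rng=random):
    """Hypothetical toy simulator: cumulative fraction infected per time step."""
    infected = 0
    series = []
    for _ in range(n_steps):
        # Per-agent infection probability this step: external source p1,
        # plus p2 once at least one agent is already infected.
        p = p1 + (p2 if infected > 0 else 0.0)
        new = sum(1 for _ in range(n_agents - infected) if rng.random() < p)
        infected += new
        series.append(infected / n_agents)
    return series

def distance(a, b):
    # Euclidean distance between two time series of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def abc_rejection(observed, n_draws, epsilon, rng):
    """Keep prior draws whose simulated series is within epsilon of the data."""
    accepted = []
    for _ in range(n_draws):
        p1 = rng.uniform(0.0, 0.1)  # uniform priors on [0, 0.1]
        p2 = rng.uniform(0.0, 0.1)
        if distance(simulate(p1, p2, rng=rng), observed) < epsilon:
            accepted.append((p1, p2))
    return accepted

rng = random.Random(0)
observed = simulate(0.01, 0.03, rng=rng)  # stand-in for real data
accepted = abc_rejection(observed, n_draws=2000, epsilon=0.15, rng=rng)
print("accepted draws:", len(accepted))
```

The accepted (p1, p2) pairs are an approximate sample from the posterior. How good the approximation is depends on the choice of distance and epsilon, which takes us back to the earlier point that there is no universal way to measure closeness.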

10. Jan 7, 2013

Stephen Tashi

11. Jan 10, 2013

Stephen Tashi

trelek2,

What is your definition of "agent"? In a quick look at Google hits for "epidemic agent", I only see "agent" mentioned as an object in computer software that simulates epidemics. Does "agent" have a medical definition?