# HMM with feature vector

1. Aug 31, 2015

### Srecko

Hi all,

Not sure if this is the right place for this question, but it has been bothering me for some time already and I would really appreciate any kind of help. I am trying to fit an HMM, but for every observation in the sequence I have a feature vector: a probability distribution describing how strongly that observation belongs to each of several groups.

I have seen this approach http://www.cs.jhu.edu/~mpaul/files/emnlp2012-m4.pdf that basically uses a log-linear function to translate a vector of continuous values into a discrete value. However, I'm not sure whether I understood it correctly or how to apply it. My idea was to apply this (or a similar) algorithm to the data I have and then use any existing HMM implementation (probably in R) to fit a model.
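A much cruder stand-in for that "vector to discrete value" translation (simpler than the paper's log-linear model, and only a sketch on made-up data) would be to assign each observation to its highest-probability cluster, producing a symbol sequence any discrete-emission HMM implementation can consume:

```python
# Crude discretization: map each observation's membership vector to the
# index of its largest entry. This throws away the "soft" information,
# so it is only a baseline, not a substitute for the log-linear model.

def discretize(membership_vectors):
    """Return, for each vector, the index of the most probable cluster."""
    return [max(range(len(v)), key=lambda i: v[i]) for v in membership_vectors]

# Hypothetical rows in the spirit of the data discussed later in the thread.
observations = [
    (0.032, 0.007, 0.056, 0.027, 0.056, 0.072, 0.0),
    (0.073, 0.114, 0.089, 0.004, 0.044, 0.165, 0.0),
]
symbols = discretize(observations)  # both rows peak at cluster index 5
```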

Thanks for any ideas, suggestions...

2. Sep 5, 2015

### Greg Bernhardt

Thanks for the post! This is an automated courtesy bump. Sorry you aren't generating responses at the moment. Do you have any further information, come to any new conclusions or is it possible to reword the post?

3. Sep 6, 2015

### mathman

My expertise is in probability theory (not statistics). I was unable to respond because your terminology was too obscure. (HMM?)

4. Sep 6, 2015

### Staff: Mentor

Hidden Markov model.
I don't even understand what Srecko wants to do or where the problem is.

5. Sep 7, 2015

### Srecko

Thanks mathman and mfb,

Really sorry for the confusion. Let me try to provide a better explanation. A sample from my dataset looks like this:

Code (Text):
sequence  parent_i  week  student  cluster1  cluster2  cluster3  cluster4  cluster5  cluster6  cluster7
0         0         w2    1111     0.032     0.007     0.056     0.027     0.056     0.072     0
1         -1        w1    2222     0.073     0.114     0.089     0.004     0.044     0.165     0
2         0         w6    3333     0.003     0.038     0         0.0004    0.0002    0.138     0.045
3         0         w2    4444     0         0         0         0.004     0         0         0
4         -1        w0    5555     0.212     0.016     0.011     0.011     0.004     0.071     0.0006
5         -1        w1    5555     0.013     0.0004    0.016     0         0         0         0
6         1329      w2    5555     0.008     0.011     0.099     0.04      0.096     0.158     0
7         1442      w3    5555     0.098     0         0.0006    0         0.08      0.021     0.136
8         1829      w4    5555     0.129     0.175     0.147     0.003     0.007     0.06      0.025
That is, instead of having a discrete value for each observation, I have a vector of continuous values that represents to what extent each student belongs to each cluster (community). What I am trying to do is fit a hidden Markov model in order to examine the most common transitions. An easier scenario would be if I had:

student, week, cluster - where cluster is a value between 1 and 10 (for example).

Please let me know if this is still unclear.

6. Sep 7, 2015

### Stephen Tashi

I suggest you begin by relying on the mathematical definition of a Markov process. A Markov process has "states". What defines a "state" in your problem?

The "states" in a Markov process are generally given by a vector of variables. A variable can simply denote a discrete category (like x = 1 for "cheerful", x = 2 for "sad"). A variable can denote a continuous quantity (like t = temperature). If the "state" of a Markov process is to be defined by a particular function then, in practice, you would define the function by a finite number of parameters (like A, B, C for the function Ax^2 + Bx + C).

If "students" is to be a state variable, you need to explain whether the state of the process is defined by information about one student or whether it indicates information about a set of several students.

A Markov process has "transition rules" that give the probability that one state transitions to another state as the process moves forward one time step (in the case of a discrete-time process). If you have a finite number of states, the transition rules can be specified by a matrix. Do you have a finite number of states? If a state is specified by a value of a continuous variable (like temperature) or by a function, then you need a way to specify the transition rules as functions.
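To make the transition-matrix idea concrete, a tiny simulation with made-up numbers (purely illustrative, not a model of the data in this thread):

```python
import random

# A 3-state discrete-time Markov chain: row i of T gives the probability
# of moving from state i to each state in one time step (rows sum to 1).
T = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]

def step(state, rng=random):
    """Sample the next state given the current one, by inverting the CDF
    of the current state's row of T."""
    r = rng.random()
    cumulative = 0.0
    for next_state, p in enumerate(T[state]):
        cumulative += p
        if r < cumulative:
            return next_state
    return len(T) - 1  # guard against floating-point round-off
```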

Data from one time series of a Markov process can be called data about a "trajectory" of the process. Does your data consist of one trajectory? Or do you have data from several different trajectories?

7. Sep 7, 2015

### Srecko

Thank you for the great explanation, Stephen.

I agree that the state is not clearly defined here, and I'm trying to apply HMMs in a situation that is not very well defined. Let me clarify several things. I have trajectories for several students. When I specified the model, I basically defined the length of each trajectory.

On the other hand, suppose for the moment that each observation were described with a single discrete value (from 1 to 6, indicating different clusters of students), and forget the vectors. We would basically have student 1 with his sequence 1, 3, 5, 4; student 2 with 3, 2, 1, 5; etc. After fitting the model, each state would be defined by one or more clusters. If there are 3 states, I might have state 1 described by clusters 2 and 3, state 2 by cluster 5, and state 3 by clusters 1 and 4. Given that, I should be able to say that during the course, students tend to transition from clusters 2 and 3 to cluster 5 (for example). Of course, I would have to identify and "describe" those clusters in order to interpret such results.

Now, let's get back to vectors. I don't have discrete values (1-6); instead, each emission is described by a vector of probabilities. The approach I shared earlier (http://www.cs.jhu.edu/~mpaul/files/emnlp2012-m4.pdf) does something like that: instead of students they deal with words, and instead of clusters with topics. Eventually, they are able to say that words transition from one topic to another.

Finally, regarding transitions: if I understood correctly, I don't have to define the transitions; I guess they should be inferred by the model? At least, when fitting a model using depmix (depmixS4), it's usually enough to define the number of states...

8. Sep 8, 2015

### Stephen Tashi

You're using the terms "cluster" and "emission" without explaining what they mean and without explaining their relation to the definition of "state".

I suppose would-be advisors must look at the documentation for depmix: http://raptor1.bizlab.mtsu.edu/S-Drive/TEFF/Rlib/library/depmix/doc/depmix-intro.pdf

9. Sep 8, 2015

### Srecko

You are right, Stephen.

Each state should be "described" by one or more "clusters". Those probabilities are actually obtained using Mixed Membership Stochastic Blockmodels (MMSB), applied to the graph of learners. I didn't want to go into all the details since I thought that what I said before would be enough; obviously I was wrong. So, everything begins with a graph. Once I had a graph, I applied MMSB to identify communities of learners (clusters/groups). MMSB is a "soft clustering" method, and as output it gives the probability that a node (i.e., a student) belongs to each of the identified communities. Those probabilities describe each of the observations.

Finally, each state in the HMM would (ideally) be described by one or more clusters (i.e., communities obtained using MMSB).

I did check the depmix documentation, but I wasn't able to find an implementation that would support fitting an HMM with a vector (instead of a single discrete/continuous value).

10. Sep 8, 2015

### Stephen Tashi

In mathematics, there are "graphs" that are studied in Graph Theory. There are also graphs of functions. In business presentations, there are graphs in the sense of bar graphs and pie charts. If you are talking about the kind of graph that is studied in Graph Theory then such a graph has "vertices" and "edges". The vertices can have "labels". Perhaps the "labels" define "states"?

It is ambiguous to mention "probabilities" without defining them. A probability is the probability of some event happening. If you just call it a "probability", it isn't clear what event is associated with it.

It's hard to understand the problem from your writing - especially for people who aren't familiar with the software you mention. If you don't yet have a clear idea of the problem as a mathematical problem, try describing the problem in practical terms. Don't describe it merely by saying you are applying certain software.

11. Sep 8, 2015

### Srecko

Hi Stephen,

I am familiar with graph theory (at least more than with HMMs), and the only reason I posted here was to ask whether there is an HMM implementation that allows describing each observation in a sequence/trajectory with a feature vector, instead of a single (discrete or continuous) value. From a practical perspective (at least as I see it), I am not sure that understanding what the states are (or any other concept) is necessary.

I really appreciate your help, and I would be happy to discuss all the implementation details and why I did what I did. Currently, I am only concerned with the HMM part. If I understood correctly, a standard HMM implementation assumes that each observation in a trajectory can be described with a single value (e.g., Rain - 0.8, Sun - 0.3, Snow - 0.4). However, each observation in my case is described with a feature vector (e.g., Rain - (0.2, 0.3, 0.5), Sun - (0.4, 0.6, 0.000)).
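For what it's worth, nothing in the HMM machinery forces scalar emissions: the forward algorithm only needs, for each state, an emission density evaluated at the observation, and that observation may be a vector. A sketch with an illustrative diagonal-Gaussian emission model (not necessarily the right model for membership vectors, which live on a simplex):

```python
import math

def gaussian_density(mean, var, obs):
    """Product of independent 1-D Gaussian densities over the vector
    (i.e., a diagonal-covariance multivariate Gaussian)."""
    p = 1.0
    for m, v, x in zip(mean, var, obs):
        p *= math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return p

def forward(pi, T, density, observations):
    """Likelihood of a sequence of (possibly vector-valued) observations
    under an HMM with initial distribution pi, transition matrix T, and
    per-state emission density(state, obs)."""
    n = len(pi)
    alpha = [pi[s] * density(s, observations[0]) for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            density(s, obs) * sum(alpha[r] * T[r][s] for r in range(n))
            for s in range(n)
        ]
    return sum(alpha)
```

(For a long sequence one would work in log space to avoid underflow; this sketch keeps the plain recursion for readability.)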

I thought it would be enough and didn't want to overwhelm everyone here with all the details. However, if you think that would be necessary, I'm happy to share all the details.

12. Sep 8, 2015

### Stephen Tashi

Let's straighten that out first! My idea of trajectory data for weather would be like this:
day 1 rain
day 2 sun
day 3 sun
day 4 rain
day 5 snow
...etc.

Your idea of trajectory data in that example has the numbers 0.8, 0.3, 0.4 in it. What do those numbers mean?

13. Sep 8, 2015

### Srecko

That example works as well. The difference between the problem you described and the one I'm trying to solve is the following: I don't have "rain", "sun", "snow". Also, I don't have "1", "2", "3". Instead of those discrete values, I have

day 1 (0.2, 0.3, 0.5, 0.0)
day 2 (0.8, 0.0, 0.0, 0.2)
...

14. Sep 8, 2015

### Stephen Tashi

OK, that makes things clearer. Since you mentioned students, is the model supposed to apply to each student? Or are there state variables that make the model behave differently for different students - for example, do things depend on a student's age, weight, etc.?

I gather you have a continuum of states rather than a finite number of states. Is the purpose of "clusters" to create a model that uses only a finite number of states?

For example, if the data has the format [ day, (A,B,C,D) ], then do we want to define states as certain ranges of values for A, B, C, D? For example: State 1 might be the condition "A^2 + B^2 < 3 and 0.02 < B < C < 2.1".
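That "states as regions of feature space" idea amounts to a small classifier; a sketch using the made-up conditions above (the regions are illustrative, not derived from the data):

```python
def classify(obs):
    """Map a feature vector (A, B, C, D) to a discrete state label
    using hand-made region conditions (illustrative only)."""
    A, B, C, D = obs
    if A ** 2 + B ** 2 < 3 and 0.02 < B < C < 2.1:
        return 1
    if D > 0.5:
        return 2
    return 3  # catch-all state for everything else
```

With such a function, each trajectory of vectors could be reduced to a trajectory of discrete state labels, and ordinary finite-state transition counting would apply.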

15. Sep 8, 2015

### Srecko

Each student has his/her own trajectory, but the model doesn't depend on a student's age or other "properties". Also, it should be a finite number of states. I didn't think about conditions, but each state should be defined by certain values of A, B, C, and D. For example, State 1 is described by C, State 2 by A and D, while State 3 is defined by B.

Thinking of conditions, that does seem to make sense.

16. Sep 9, 2015

### Stephen Tashi

Is State 1 actually a single state? Or is it a different state for each different value of C ?

17. Sep 9, 2015

### Srecko

It is a single state. It's just that we can say that cluster 1 represents this state. So, when we say (for example) that students most commonly transition from states 1 and 3 to state 2, we should be able to say that students usually transition from C and B to A and D (following the previous example).

18. Sep 9, 2015

### Stephen Tashi

Then what does the numerical value associated with C represent?

19. Sep 9, 2015

### Srecko

The numerical value simply indicates to what extent a student "belongs" to group C. That is, observing his interactions with other students, how much of his activity is associated with other students in group C.

Here we are talking about clustering again. There are (at least) two ways to cluster students (or detect communities) in the network of students. The first approach, "hard clustering", would tell us that student1 belongs to group/cluster/community C (depending on what we are trying to observe). "Soft" clustering, on the other hand, would tell us that student1 belongs to group C 22%, to group A 60%, etc.

Finally, when we say that state 1 is described by cluster C, we need to define cluster C. How are we going to do that? We will define a threshold and select all the students in that cluster whose "belonging" value is above the threshold. Those students (with their properties - e.g., grade, GPA, count of posts) "describe" the cluster.
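That thresholding step is simple enough to sketch directly (the names, vectors, and threshold value below are all hypothetical):

```python
def members_of_cluster(memberships, cluster, threshold=0.5):
    """Select students whose degree of belonging to `cluster` exceeds
    the threshold (the 0.5 default is an arbitrary illustration)."""
    return [student for student, vector in memberships.items()
            if vector[cluster] > threshold]

# Hypothetical soft-clustering output: one membership vector per student,
# entries giving degree of belonging to groups (A, B, C).
memberships = {
    "student1": (0.60, 0.18, 0.22),
    "student2": (0.05, 0.15, 0.80),
}
```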

20. Sep 9, 2015

### Stephen Tashi

I understand. So, as time passes, a student's "degree of belonging" to cluster C can change - correct?

I would not call C a "state" of the particular Markov process that produces your example data. I'd call C a "state variable". A set of definite values of the state variables (like (0.5, 0.0, 0.2, 0.3) ) is a "state".

You mention an "emission", and the link you gave deals with the "emission of tokens". Are "emissions" relevant to your data? For example, if the data for student number 6 begins [day 1, (0.5, 0.0, 0.2, 0.3)], [day 2, (0.3, 0.1, 0.5, 0.5)], ..., is there some sort of "emission" produced by student 6 as he goes from day 1 to day 2?