HMM training with variable length data


Discussion Overview

The discussion revolves around training Hidden Markov Models (HMMs) using datasets that contain sequences of variable lengths, ranging from 5 to 500 symbols. Participants explore the implications of using variable-length data on model performance, particularly focusing on the challenges faced when the model is trained predominantly on shorter sequences.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant questions whether training HMMs with variable-length data is advisable and if it violates the stochastic assumptions of the EM/Viterbi algorithms.
  • Another participant suggests that using many short sequences may lead to difficulties in reaching certain hidden states, which could explain why the model performs well on short sequences but poorly on longer ones.
  • A participant proposes that short sequences should ideally teach the model about relationships that appear in longer sequences, but notes a lack of sufficient long sequences for training.
  • One participant offers a strategy to address the issue of missing long sequences by combining short sequences with segments from long sequences to create a balanced dataset for training.

Areas of Agreement / Disagreement

Participants express varying opinions on the effectiveness of training HMMs with variable-length data, with some suggesting it may lead to performance issues while others propose strategies to mitigate these challenges. There is no consensus on the best approach to take.

Contextual Notes

Participants acknowledge limitations related to the availability of long sequences for training and the potential for overfitting when the model is primarily trained on short sequences. The discussion highlights the complexity of modeling with incomplete data.

Who May Find This Useful

Researchers and practitioners working with Hidden Markov Models, particularly those dealing with variable-length sequence data in fields such as machine learning, data science, and computational biology.

malina
Hi All,

I need to train an HMM using data with sequences of variable length (5 - 500 symbols per input sequence).

From what I've seen thus far, all (or most) training is performed on datasets with fixed-length sequences, although there is no explicit requirement for this in the HMM structure.

So, first of all - what am I missing, and is it indeed not advised to train an HMM with variable-length data? Does this violate the stochastic assumptions of the EM/Viterbi algorithms?

Next, for the model I obtain, I have "good" performance for "short" sequences, but as the sequences get longer, the performance decreases (and sometimes recovers). I can relate this to two possible causes:
1) Longer sequences have dynamics uncaptured by the HMM, since they are not the majority of the training set, hence the "random" prediction behavior.
2) The HMM gets stuck on a short-length model (which is another way to rephrase (1), but not exactly).

Can someone please advise on the matter?
Thanks!
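For what it's worth, the algorithmic side of the question can be checked directly: nothing in EM (Baum-Welch) requires equal lengths, because the total data log-likelihood it maximizes is simply the sum of per-sequence forward log-likelihoods. A minimal pure-Python sketch of the scaled forward pass over a toy 2-state, 2-symbol HMM (all parameter values here are made up for illustration):

```python
import math

def forward_loglik(seq, start, trans, emit):
    """Log-likelihood of one observation sequence under a discrete HMM,
    computed with the scaled forward algorithm."""
    n_states = len(start)
    # Initialize with the first symbol.
    alpha = [start[i] * emit[i][seq[0]] for i in range(n_states)]
    loglik = 0.0
    for t in range(1, len(seq)):
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        alpha = [
            sum(alpha[j] * trans[j][i] for j in range(n_states)) * emit[i][seq[t]]
            for i in range(n_states)
        ]
    return loglik + math.log(sum(alpha))

# A toy 2-state HMM over symbols {0, 1}.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

# Sequences of different lengths pose no structural problem: the quantity
# EM maximizes is just the sum of the per-sequence log-likelihoods.
sequences = [[0, 1], [0, 0, 1, 1, 0], [1, 1, 1, 0, 0, 1, 0, 1]]
total = sum(forward_loglik(s, start, trans, emit) for s in sequences)
```

The imbalance issue discussed below is statistical, not structural: short sequences contribute fewer terms per sequence to this sum, so they dominate only when there are many of them.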
 

Hi Malina,

Different sequence lengths may have an impact, since some hidden states might be difficult to reach with short samples. If that is the scenario, using many short sequences will bias the model in the initial steps and downweight hidden states whose influence appears only in long sequences.

This is likely the reason why your model works so well for short sequences but increasingly fails for long ones.

Now, if you didn't have these issues you could use all the data you have to train the model, but in this case you might be better off ignoring the short sequences altogether and just working with the medium/long ones to see how it works.
 
Thanks Viraltux,

The assumption is that the model is reflected similarly in long/short sequences, i.e., short sequences teach the model about relations later seen in longer sequences. Think of the short sequences as partially available sequences. Hence, supposedly you should not feel the difference between the sequences (unless the state transitions are captured incorrectly, which can happen with partial training data :-().
Unfortunately, I don't have enough long sequences for training :-(
Malina.
 

OK then, you can treat this as a problem of missing data. One little trick you can try is the following: imagine you have 100 short sequences and only 10 long ones of length 500. Then for every short sequence, randomly cut one of your 10 long ones and paste the cut piece onto the short sequence, so that the result is 100 equal-sized long sequences mixing short and long data.

For example:
if you have short sequences
A,B,B,C
D,A,B,A
and one long sequence B,C,D,D,E,A,B,C,C,E,A,A,B

then instead of that data, to train your model you use

A,B,B,C,E,A,B,C,C,E,A,A,B
D,A,B,A,E,A,B,C,C,E,A,A,B
B,C,D,D,E,A,B,C,C,E,A,A,B

By doing this, the short-sequence performance will remain the same, but the long-sequence parameters will not be underestimated due to the lack of data. Now, this is not ideal, and there is a whole literature out there on how to treat missing data, but I think the best you can do is to think about the best strategy to complete the sequences in the problem you are dealing with, to avoid overfitting the model parameters to short sequences.
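The splicing trick above can be sketched in a few lines of pure Python (the `splice` helper and its names are illustrative, not from any library; it assumes each long sequence has at least `target_len` symbols):

```python
import random

def splice(short_seqs, long_seqs, target_len, rng=random):
    """Pad each short sequence with the tail of a randomly chosen long
    sequence, so that every training sequence ends up the same length.
    Assumes each long sequence has at least target_len symbols."""
    # Keep the long sequences themselves (truncated to target_len).
    out = [list(s)[:target_len] for s in long_seqs]
    for s in short_seqs:
        donor = rng.choice(long_seqs)
        pad = target_len - len(s)
        # Paste the donor's last `pad` symbols onto the short sequence.
        out.append(list(s) + list(donor[len(donor) - pad:]))
    return out

# The example from the post: two short sequences and one long one.
shorts = [["A", "B", "B", "C"], ["D", "A", "B", "A"]]
longs = [["B", "C", "D", "D", "E", "A", "B", "C", "C", "E", "A", "A", "B"]]
balanced = splice(shorts, longs, target_len=13)
```

With a single long donor sequence this reproduces the three equal-length sequences shown above; with several donors, the random choice spreads their tails across the short sequences.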

Good Luck Malina!
 
Thanks!
I'll keep you updated if something extremely cool works out of this.
M.
 
Sure! Please do! :smile:
 
