I need to train an HMM using data with sequences of variable length (5 - 500 symbols per input sequence).

From what I've seen thus far, all (or most) training is performed on datasets with fixed-length sequences, although nothing in the HMM structure explicitly requires this.

So, first of all: what am I missing, and is it indeed inadvisable to train an HMM on variable-length data? Does this violate the stochastic assumptions of the EM/Viterbi algorithms?

Next, the model that I get shows "good" performance on "short" sequences, but as the sequences get longer, the performance decreases (and sometimes increases again). I can attribute this to two possible causes:

1) Longer sequences have dynamics that the HMM does not capture, since they are a minority of the training set; hence the "random" prediction behavior.

2) The HMM gets stuck on a model fitted to short sequences (which is another way of phrasing (1), but not exactly).
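
To make the comparison across lengths concrete, here is a minimal sketch of one way to score sequences of different lengths on a comparable scale: the log-likelihood from the scaled forward algorithm, divided by the sequence length. Raw log-likelihood always drops as sequences get longer, so it cannot be compared directly between a 5-symbol and a 500-symbol sequence. The 2-state, 2-symbol parameters (`start`, `trans`, `emit`) and the helper names are hypothetical, chosen only for illustration.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for one discrete sequence."""
    n_states = len(start)
    # Forward variable at t = 0: start probability times first emission.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        # Rescale at every step to avoid underflow on long sequences;
        # the log of the scale factors sums to the log-likelihood.
        scale = sum(alpha)
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def per_symbol_log_likelihood(obs, start, trans, emit):
    """Length-normalised score, comparable across sequence lengths."""
    return log_likelihood(obs, start, trans, emit) / len(obs)


# Hypothetical toy model (assumption, for illustration only).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

# One short and one long sequence of symbol indices.
sequences = [[0, 1, 0, 0, 1], [0, 1] * 50]
for seq in sequences:
    print(len(seq), per_symbol_log_likelihood(seq, start, trans, emit))
```

Grouping this per-symbol score by sequence length would at least show whether the drop I see on long sequences is real or just an effect of how the score scales with length.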

Can someone please advise on the matter?

Thanks!

# HMM training with variable length data

