HMM training with variable length data


Discussion Overview

The discussion revolves around training Hidden Markov Models (HMMs) using datasets that contain sequences of variable lengths, ranging from 5 to 500 symbols. Participants explore the implications of using variable-length data on model performance, particularly focusing on the challenges faced when the model is trained predominantly on shorter sequences.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant questions whether training HMMs with variable-length data is advisable and if it violates the stochastic assumptions of the EM/Viterbi algorithms.
  • Another participant suggests that using many short sequences may lead to difficulties in reaching certain hidden states, which could explain why the model performs well on short sequences but poorly on longer ones.
  • A participant proposes that short sequences should ideally teach the model about relationships that appear in longer sequences, but notes a lack of sufficient long sequences for training.
  • One participant offers a strategy to address the issue of missing long sequences by combining short sequences with segments from long sequences to create a balanced dataset for training.

Areas of Agreement / Disagreement

Participants express varying opinions on the effectiveness of training HMMs with variable-length data, with some suggesting it may lead to performance issues while others propose strategies to mitigate these challenges. There is no consensus on the best approach to take.

Contextual Notes

Participants acknowledge limitations related to the availability of long sequences for training and the potential for overfitting when the model is primarily trained on short sequences. The discussion highlights the complexity of modeling with incomplete data.

Who May Find This Useful

Researchers and practitioners working with Hidden Markov Models, particularly those dealing with variable-length sequence data in fields such as machine learning, data science, and computational biology.

malina
Hi All,

I need to train an HMM using data with sequences of variable length (5 - 500 symbols per input sequence).

From what I've seen thus far, all (or most) training is performed on datasets with fixed-length sequences, although there is no explicit requirement for this in the HMM structure.

So, first of all - what am I missing, and is it indeed not advised to train an HMM with variable-length data? Does this violate the stochastic assumptions of the EM/Viterbi algorithms?

Next, for the model I obtain, I have "good" performance for "short" sequences, but as the sequences get longer, the performance decreases (and sometimes recovers). I can relate this to two possible causes:
1) Longer sequences have dynamics uncaptured by the HMM, since they are not the majority of the training set, hence the "random" prediction behavior.
2) The HMM gets stuck on a short-length model (which is another way to rephrase (1), but not exactly).

Can someone please advise on the matter?
Thanks!
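For what it's worth, the algorithmic side of the question can be checked directly: nothing in EM (Baum-Welch) requires equal lengths, because the total data log-likelihood it maximizes is simply the sum of per-sequence forward log-likelihoods. A minimal pure-Python sketch of the scaled forward pass over a toy 2-state, 2-symbol HMM (all parameter values here are made up for illustration):

```python
import math

def forward_loglik(seq, start, trans, emit):
    """Log-likelihood of one observation sequence under a discrete HMM,
    computed with the scaled forward algorithm."""
    n_states = len(start)
    # Initialize with the first symbol.
    alpha = [start[i] * emit[i][seq[0]] for i in range(n_states)]
    loglik = 0.0
    for t in range(1, len(seq)):
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        alpha = [
            sum(alpha[j] * trans[j][i] for j in range(n_states)) * emit[i][seq[t]]
            for i in range(n_states)
        ]
    return loglik + math.log(sum(alpha))

# A toy 2-state HMM over symbols {0, 1}.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

# Sequences of different lengths pose no structural problem: the quantity
# EM maximizes is just the sum of the per-sequence log-likelihoods.
sequences = [[0, 1], [0, 0, 1, 1, 0], [1, 1, 1, 0, 0, 1, 0, 1]]
total = sum(forward_loglik(s, start, trans, emit) for s in sequences)
```

The imbalance issue discussed below is statistical, not structural: short sequences contribute fewer terms per sequence to this sum, so they dominate only when there are many of them.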
 

Hi Malina,

Different sequence lengths may have an impact, since some hidden states might be difficult to reach with short samples. If that is the scenario, using many short sequences will bias the model in the initial steps and downweight hidden states whose influence appears only in long sequences.

This is likely the reason why your model works so well for short sequences but increasingly fails for long ones.

Now, if you didn't have these issues you could use all the data you have to train the model, but in this case you might be better off ignoring the short sequences altogether and just working with the medium/long ones to see how it works.
 
Thanks Viraltux,

The assumption is that the model is reflected similarly in long/short sequences, i.e., short sequences teach the model about relations later seen in longer sequences. Think of the short sequences as partially available sequences. Hence, supposedly you should not feel the difference between the sequences (unless the state transitions are captured incorrectly, which can happen with partial training data :-().
Unfortunately, I don't have enough long sequences for training :-(
Malina.
 

OK then, you can treat this as a problem of missing data. One little trick you can try is the following: imagine you have 100 short sequences and only 10 long ones of length 500. Then for every short sequence, randomly cut one of your 10 long ones and paste the cut piece onto the short sequence, so that the result is 100 equal-sized long sequences mixing short and long data.

For example:
if you have short sequences
A,B,B,C
D,A,B,A
and one long sequence B,C,D,D,E,A,B,C,C,E,A,A,B

then instead of that data, to train your model you use

A,B,B,C,E,A,B,C,C,E,A,A,B
D,A,B,A,E,A,B,C,C,E,A,A,B
B,C,D,D,E,A,B,C,C,E,A,A,B

By doing this, the short-sequence performance will remain the same, but the long-sequence parameters will not be underestimated due to the lack of data. Now, this is not ideal, and there is a whole literature out there on how to treat missing data, but I think the best you can do is to think about the best strategy to complete the sequences in the problem you are dealing with, to avoid overfitting the model parameters to short sequences.
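The splicing trick above can be sketched in a few lines of pure Python (the `splice` helper and its names are illustrative, not from any library; it assumes each long sequence has at least `target_len` symbols):

```python
import random

def splice(short_seqs, long_seqs, target_len, rng=random):
    """Pad each short sequence with the tail of a randomly chosen long
    sequence, so that every training sequence ends up the same length.
    Assumes each long sequence has at least target_len symbols."""
    # Keep the long sequences themselves (truncated to target_len).
    out = [list(s)[:target_len] for s in long_seqs]
    for s in short_seqs:
        donor = rng.choice(long_seqs)
        pad = target_len - len(s)
        # Paste the donor's last `pad` symbols onto the short sequence.
        out.append(list(s) + list(donor[len(donor) - pad:]))
    return out

# The example from the post: two short sequences and one long one.
shorts = [["A", "B", "B", "C"], ["D", "A", "B", "A"]]
longs = [["B", "C", "D", "D", "E", "A", "B", "C", "C", "E", "A", "A", "B"]]
balanced = splice(shorts, longs, target_len=13)
```

With a single long donor sequence this reproduces the three equal-length sequences shown above; with several donors, the random choice spreads their tails across the short sequences.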

Good Luck Malina!
 
Thanks!
I'll keep you updated if something extremely cool works out of this.
M.
 
Sure! Please do! :smile:
 
