
HMM training with variable length data

  1. Jun 12, 2012 #1
    Hi All,

    I need to train an HMM using data with sequences of variable length (5 - 500 symbols per input sequence).

    From what I've seen so far, all (or most) training is performed on data sets of fixed length, although there is no explicit requirement for this in the HMM structure.

    So, first of all: what am I missing, and is it indeed inadvisable to train an HMM with variable-length data? Does this violate the stochastic assumptions of the EM/Viterbi algorithms?

    Next, for the model that I obtain, I get "good" performance on "short" sequences, but as the sequences get longer, the performance decreases (and sometimes recovers). I can relate this to two possible causes:
    1) Longer sequences have dynamics uncaptured by the HMM, since they are not the majority of the training set, hence the "random" prediction behavior.
    2) The HMM gets stuck on a short-length model (which is another way to phrase (1), but not exactly).
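    For concreteness, a minimal NumPy sketch (toy two-state model, all numbers invented for illustration) of the scaled forward pass: each sequence gets its own pass, so nothing in the algorithm requires a fixed length, and dividing the log-likelihood by the sequence length makes scores comparable across lengths when diagnosing this kind of degradation.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm. Works for any sequence length.

    pi : (n_states,) initial state distribution
    A  : (n_states, n_states) transitions, A[i, j] = P(state j | state i)
    B  : (n_states, n_symbols) emissions, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()                     # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]        # one forward step
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Toy model (made-up numbers)
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])

for seq in [[0, 1], [0, 1, 2, 2, 1, 0, 0, 2]]:
    ll = forward_loglik(np.array(seq), pi, A, B)
    # Per-symbol log-likelihood is comparable across lengths
    print(len(seq), ll / len(seq))
```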

    Can someone please advise on the matter?
    Thanks!
     
  3. Jun 13, 2012 #2
    Hi Malina,

    Different sequence lengths may have an impact, since some hidden states might be difficult to reach with short samples. If this is the scenario, using many short sequences will stress the model in the initial steps and reduce the relevance of hidden states whose influence appears only in long sequences.

    This is likely the reason why your model works so well for short sequences but increasingly fails for long ones.

    Now, if you didn't have these issues you could use all the data you have to train the model, but in this case you might be better off ignoring the short sequences altogether and working only with the medium/long ones to see how it goes.
     
  4. Jun 18, 2012 #3
    Thanks Viraltux,

    The assumption is that the model manifests similarly in long and short sequences, i.e., short sequences teach the model about relations later seen in longer sequences. Think of it as having partial sequences available. Hence, in principle, you should not see a difference between the sequences (unless the state transitions are captured incorrectly, which can happen with partial training data :-().
    Unfortunately, I don't have enough long sequences for training :-(

    Best,
    Malina.
     
  5. Jun 18, 2012 #4
    OK then, you can treat this as a problem of missing data. One little trick you can try is the following: imagine you have 100 short sequences and only 10 long ones of length 500. Then, for every short sequence, you cut a piece from one of your 10 long ones at random and paste it onto the short sequence, so that the result is 100 equal-sized long sequences mixing short and long data.

    For example:
    if you have short sequences
    A,B,B,C
    D,A,B,A
    and one long sequence B,C,D,D,E,A,B,C,C,E,A,A,B

    Then, instead of that data, you use the following to train your model:

    A,B,B,C,E,A,B,C,C,E,A,A,B
    D,A,B,A,E,A,B,C,C,E,A,A,B
    B,C,D,D,E,A,B,C,C,E,A,A,B

    By doing this, the short-sequence performance will remain the same, but the parameters for long sequences will not be underestimated for lack of data. Now, this is not ideal, and there is a whole literature out there on how to treat missing data, but I think the best you can do is think about the best strategy for completing the sequences in the problem you are dealing with, to avoid overfitting the model parameters to short sequences.
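    The cut-and-paste trick above can be sketched in Python (pad_with_long_tail is a hypothetical helper name; this version pastes the tail of a long sequence, matching the example):

```python
import random

def pad_with_long_tail(short_seqs, long_seqs, target_len):
    """Extend each short sequence with the tail of a randomly chosen
    long sequence until it reaches target_len symbols.
    (Hypothetical helper illustrating the trick described above.)"""
    padded = []
    for s in short_seqs:
        need = target_len - len(s)
        donor = random.choice(long_seqs)
        tail = list(donor[-need:]) if need > 0 else []
        padded.append(list(s) + tail)
    return padded

short = [list("ABBC"), list("DABA")]
long_ = [list("BCDDEABCCEAAB")]  # the 13-symbol sequence from the example

for seq in pad_with_long_tail(short, long_, target_len=13) + long_:
    print(",".join(seq))
# A,B,B,C,E,A,B,C,C,E,A,A,B
# D,A,B,A,E,A,B,C,C,E,A,A,B
# B,C,D,D,E,A,B,C,C,E,A,A,B
```

    One caveat: the symbol pair at the paste joint is a transition that never occurred in the real data, so with very many short sequences those artificial joints could themselves bias the transition estimates a little.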

    Good Luck Malina!
     
    Last edited: Jun 18, 2012
  6. Jun 18, 2012 #5
    Thanks!
    Will keep you updated if something extremely cool comes out of this.
    M.
     
  7. Jun 18, 2012 #6
    Sure! please do! :smile:
     