written 4.5 years ago by |

**Hidden Markov Model (HMM)** is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. *hidden*) states.

The hidden Markov model can be represented as the simplest dynamic Bayesian network. In simpler Markov models (like a Markov chain), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters, while in the hidden Markov model, the state is not directly visible, but the output (in the form of data or "token" in the following), dependent on the state, is visible. Each state has a probability distribution over the possible output tokens. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states; this is also known as pattern theory, a topic of grammar induction.

Hidden Markov models are especially known for their application in reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

A hidden Markov model can be considered a generalization of a mixture model where the hidden variables (or latent variables), which control the mixture component to be selected for each observation, are related through a Markov process rather than independent of each other.

Hidden Markov Model (HMM) is a parameterized distribution for sequences of observations. An intuitive way to explain HMM is to go through an example. Suppose that Taylor hears (a.k.a. observes) a sequence of T sounds o1, o2, ..., oT and he wants to reason something about this sequence. He makes the assumption that the sequence of sounds that he heard depends on a sequence of T words s1, s2, ..., sT , which he never gets to see and which is why they are called the hidden states. HMM gives Taylor a method which, under certain assumptions, allows him to assign appropriate probabilities to sound sequences O’s and word sequences S’s and to make reasonable deductions about them.

HMM makes several assumptions about the data it models:

- The observations o1, o2, ..., oT come from a known, finite sets V, called the observation space.
- The hidden states s1, s2, ..., sT come from a known, finite sets Q, called the hidden state space.
- The HMM assigns a probability to any given sequence s1, s2, ..., sT .

The chain rule says that

$P(s_1,s_2,s_3,....,s_T)=\prod_{t=1}^{T} P(s_t|s_0,s_1,s_2,...s_{t-1})=\prod_{t=1}^{T} P(s_t|s_{\lt t})..................(1.1)$

Here, we are being a little sloppy in terms of notation by assuming that there is a starting state s0, just so that the indexing is valid. In practice, such state s0 is sometimes called the starting state, and one can assign a probability distribution to it too. HMM goes further than Equation 1.1 by assuming that

$P(s_t|s_{\lt t})=P(s_t|s_{t-1})...............................(1.2)$

This is called the Markov assumption. With this assumption, Equation 1.1 can be rewritten as

$P(s_1,s_2,s_3,....,s_T)=\prod_{t=1}^{T} P(s_t|s_{ t-1})....................................(1.3)$

Note that the Markov assumption generally does not hold. In our example, sequences of words usually have dependencies that go back longer than one immediate step.

- Each observation $o_t$ only depends on the immediate hidden state $s_t$. Formally

$P(o_t|o_{\lt t},s_{\lt t})=P(o_t|s_t).....................................(1.4)$

This is called the Independence assumption.

**Terminology in HMM**

The term hidden refers to the first order Markov process behind the observation. Observation refers to the data we know and can observe. Markov process is shown by the interaction between “Rainy” and “Sunny” in the below diagram and each of these are **HIDDEN STATES**.

OBSERVATIONS are known data and refers to “Walk”, “Shop”, and “Clean” in the above diagram. In machine learning sense, observation is our training data, and the number of hidden states is our hyper parameter for our model.

T=length of the observations sequence

N=number of states in the model

M=number of observations symbols

Q={$q_0,q_1,...,q_{N-1}$=distinct states of the Markov process

V={0,1,............,M-1}=set os possible observations

A=state transitions probablities

B=observation probability matrix

$ \pi$=initial state distribution

O=($O_0,O_1,...,O_{T-1}$)=observation sequence

T = don’t have any observation yet, N = 2, M = 3, Q = {“Rainy”, “Sunny”}, V = {“Walk”, “Shop”, “Clean”}

**State transition probabilities** are the arrows pointing to each hidden state. **Observation probability matrix** are the blue and red arrows pointing to each observation from each hidden state. The matrix is row stochastic meaning the rows add up to 1.

$a_{ij}=P(\text{ state }q_j \text{ at }t+1|\text{state }q_i \text{ at }t)$

$b_{j(k)}=P(\text{observation k at t|state }q_j \text{ at t})$

The matrix explains what the probability is from going to one state to another or going from one state to an observation.

**Initial state distribution** gets the model going by starting at a hidden state.

Full model with known state transition probabilities, observation probability matrix, and initial state distribution is marked as,

$\lambda =(A,B,\pi)$