## Kullback–Leibler divergence

**Moderators:** Leonid, amiradm, BioTeam

### Kullback–Leibler divergence

Can someone give me a sample calculation using the Kullback–Leibler divergence?I'm a bit miffed by all the symbols...

Living one day at a time;

Enjoying one moment at a time;

Accepting hardships as the pathway to peace;

~Niebuhr

Enjoying one moment at a time;

Accepting hardships as the pathway to peace;

~Niebuhr

Hi mith,

The context in which I learned KL-distance (or "relative entropy") is where you have a probabilistic model - a probability distribution - and you want to contrast it with the distribution suggested by your data. So for example, you could have a single six-sided loaded die, set up so that the following distribution describes how it rolls:

1: 30%

2: 15%

3: 5%

4: 5%

5: 10%

6: 35%

And if you knew nothing about this die, you might pick it up and assume it was a fair die, in which case you'd expect its distribution to be even:

1: 16.7%

2: 16.7%

3: 16.7%

4: 16.7%

5: 16.7%

6: 16.7%

The first distribution, which we shall denote P, is real - it's what you'd approximate if you began to sample by rolling the die a bunch of times. The second distribution, which we shall denote Q, is a model - a possibly incorrect estimate of the distribution made using all the information we have to date.

This is a toy example, but bear with me!

To determine "how good" the model is, you can look at its relative entropy (K-L divergence, or distance) to the real distribution. To do this, you would use the formula:

Distance = SUM[across all outcomes i, of the quantity P(i) * ln(P(i)/Q(i))]

In other words, the SUM decomposes into six terms, since there are six outcomes (each side of the die). The first term would be 0.3*ln(0.3/0.167), since that is P(1)*ln(P(1)/Q(1)). The whole equation can be written out as:

D = 0.3*ln(0.3/0.167)+0.15*ln(0.15/0.167)+0.05*ln(0.05/0.167)+

0.05*ln(0.05/0.167)+0.10*ln(0.10/0.167)+0.35*ln(0.35/0.167)

Within each ln(), you have something from the actual distribution P divided by something from the model distribution Q (and since every probability in Q is 16.7%, the divisor is always 0.167). And this sums up to 0.247, thereabouts.

Clear?

The context in which I learned KL-distance (or "relative entropy") is where you have a probabilistic model - a probability distribution - and you want to contrast it with the distribution suggested by your data. So for example, you could have a single six-sided loaded die, set up so that the following distribution describes how it rolls:

1: 30%

2: 15%

3: 5%

4: 5%

5: 10%

6: 35%

And if you knew nothing about this die, you might pick it up and assume it was a fair die, in which case you'd expect its distribution to be even:

1: 16.7%

2: 16.7%

3: 16.7%

4: 16.7%

5: 16.7%

6: 16.7%

The first distribution, which we shall denote P, is real - it's what you'd approximate if you began to sample by rolling the die a bunch of times. The second distribution, which we shall denote Q, is a model - a possibly incorrect estimate of the distribution made using all the information we have to date.

This is a toy example, but bear with me!

To determine "how good" the model is, you can look at its relative entropy (K-L divergence, or distance) to the real distribution. To do this, you would use the formula:

Distance = SUM[across all outcomes i, of the quantity P(i) * ln(P(i)/Q(i))]

In other words, the SUM decomposes into six terms, since there are six outcomes (each side of the die). The first term would be 0.3*ln(0.3/0.167), since that is P(1)*ln(P(1)/Q(1)). The whole equation can be written out as:

D = 0.3*ln(0.3/0.167)+0.15*ln(0.15/0.167)+0.05*ln(0.05/0.167)+

0.05*ln(0.05/0.167)+0.10*ln(0.10/0.167)+0.35*ln(0.35/0.167)

Within each ln(), you have something from the actual distribution P divided by something from the model distribution Q (and since every probability in Q is 16.7%, the divisor is always 0.167). And this sums up to 0.247, thereabouts.

Clear?

Vi veri veniversum vivus vici

Thank you very much! I figured out how to calculate it(it was due monday lol) but your explanation really helped me understand why I was doing it like that..

btw, we used log base 2, is there any significant difference if ln is used?

btw, we used log base 2, is there any significant difference if ln is used?

Living one day at a time;

Enjoying one moment at a time;

Accepting hardships as the pathway to peace;

~Niebuhr

Enjoying one moment at a time;

Accepting hardships as the pathway to peace;

~Niebuhr

Hi mith,

The KL divergence is a measure of how much information your model gives you about the data you are modeling. In its original conception, it was designed with log-base-2, because information (in information theory) is measured in bits. When the statistical biologists got their hands on it, they changed it to the natural log because it makes the math for certain statistics (which I can't remember now) easier.

The KL divergence is a measure of how much information your model gives you about the data you are modeling. In its original conception, it was designed with log-base-2, because information (in information theory) is measured in bits. When the statistical biologists got their hands on it, they changed it to the natural log because it makes the math for certain statistics (which I can't remember now) easier.

Vi veri veniversum vivus vici

### Re: Kullback–Leibler divergence

G-Do,

Thanks for the explanation. That is the easiest-to-understand explanation of KL div and thus IMHO the best.

I have related question. Suppose the die outcomes are continous, i.e. it can be any number between 1 to 6 how do I calculate the probability in this case?

Thanks

Thanks for the explanation. That is the easiest-to-understand explanation of KL div and thus IMHO the best.

I have related question. Suppose the die outcomes are continous, i.e. it can be any number between 1 to 6 how do I calculate the probability in this case?

Thanks

### Who is online

Users browsing this forum: No registered users and 1 guest