Steven R. Dunbar
Department of Mathematics
203 Avery Hall
University of Nebraska-Lincoln
Lincoln, NE 68588-0130
http://www.math.unl.edu
Voice: 402-472-3731
Fax: 402-472-8466

Topics in
Probability Theory and Stochastic Processes
Steven R. Dunbar

__________________________________________________________________________

Large Deviations

_______________________________________________________________________

Note: These pages are prepared with MathJax. MathJax is an open source JavaScript display engine for mathematics that works in all browsers. See http://mathjax.org for details on supported browsers, accessibility, copy-and-paste, and other features.

_______________________________________________________________________________________________

Rating

Mathematicians Only: prolonged scenes of intense rigor.

_______________________________________________________________________________________________

Section Starter Question

From the proof of the Weak Law of Large Numbers, we know

\[ \mathbb{P}_n\left[ \left| \frac{S_n}{n} - p \right| > \epsilon \right] < \frac{p(1-p)}{n\epsilon^2} \]

or more roughly, the deviation of the sample mean of Bernoulli trials from \(p\) is \(O(1/\sqrt{n})\). Is it possible to provide a more refined estimate of the probability of such a deviation? Why do you think so? What would be a possible statement of a better result?

_______________________________________________________________________________________________

Key Concepts

  1. Large Deviations Theory is about the remote tail of a probability distribution. Roughly speaking, large deviation theory estimates the exponential decay of the probability of extreme events, as the number of observations grows arbitrarily large.
  2. For \(0 < \epsilon < 1 - p\) and \(n \ge 1\),
    \[ \mathbb{P}_n\left[ \frac{S_n}{n} > p + \epsilon \right] \le e^{-n h_+(\epsilon)} \]

    where \( h_+(\epsilon) = (p+\epsilon)\ln\frac{p+\epsilon}{p} + (1-p-\epsilon)\ln\frac{1-p-\epsilon}{1-p} \).

__________________________________________________________________________

Vocabulary

  1. Large Deviations Theory is about the remote tail of a probability distribution. Roughly speaking, large deviation theory estimates the exponential decay of the probability of extreme events, as the number of observations grows arbitrarily large.

__________________________________________________________________________

Mathematical Ideas

The Large Deviations Estimate

Large Deviations Theory is about the remote tail of a probability distribution. Roughly speaking, large deviation theory estimates the exponential decay of the probability of extreme events, as the number of observations grows arbitrarily large. Large deviations theory was perhaps first developed by the Swedish mathematician Harald Cramér, who applied it to model the insurance business. More general results were later obtained by H. Chernoff, among others. An incomplete list of mathematicians who have made important advances would include S. R. S. Varadhan, D. Ruelle, and O. E. Lanford.

Given parameter value \(p\), define \(h_+(\epsilon)\) for \(0 < \epsilon < 1-p\) as a function of \(\epsilon\):

\[ h_+(\epsilon) = (p+\epsilon)\ln\frac{p+\epsilon}{p} + (1-p-\epsilon)\ln\frac{1-p-\epsilon}{1-p}. \]

See Figure 1.

Note that \(-p\ln(p) - (1-p)\ln(1-p)\) is the entropy of the Bernoulli distribution with probability \(p\) and complementary probability \(1-p\). In the same spirit, \(h_+(\epsilon)\) is the relative entropy (Kullback-Leibler divergence) of the Bernoulli distribution with probability \(p+\epsilon\) with respect to the Bernoulli distribution with probability \(p\).


Figure 1: Graph of \(h_+(\epsilon)\) for \(p = 0.6\).
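
As a numerical illustration of the definition, the following minimal Python sketch computes \(h_+(\epsilon)\) directly and checks it against the Kullback-Leibler divergence of the two Bernoulli distributions; the function names here are illustrative only.

    from math import log

    def kl_bernoulli(x, y):
        # Kullback-Leibler divergence of Bernoulli(x) with respect to Bernoulli(y)
        return x * log(x / y) + (1 - x) * log((1 - x) / (1 - y))

    def h_plus(eps, p):
        # h_+(eps) from the definition, valid for 0 < eps < 1 - p
        return ((p + eps) * log((p + eps) / p)
                + (1 - p - eps) * log((1 - p - eps) / (1 - p)))

    p, eps = 0.6, 0.1
    print(h_plus(eps, p))            # approximately 0.0216
    print(kl_bernoulli(p + eps, p))  # the same value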

The first two lemmas are Taylor expansions of \(h_+(\epsilon)\) and \(e^{-n h_+(\epsilon)}\) in \(\epsilon\) to give a sense of the asymptotics of these functions.

Lemma 1.

\[ h_+(\epsilon) = \frac{1}{2p(1-p)}\epsilon^2 + \frac{2p-1}{6p^2(1-p)^2}\epsilon^3 + \frac{1-3p+3p^2}{12p^3(1-p)^3}\epsilon^4 + O(\epsilon^5) \]

See also Figure 1.
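
As a quick numerical check of Lemma 1 (a sketch added for illustration, not part of the cited source), compare \(h_+(\epsilon)\) with the partial sum of the expansion for a small \(\epsilon\).

    from math import log

    def h_plus(eps, p):
        return ((p + eps) * log((p + eps) / p)
                + (1 - p - eps) * log((1 - p - eps) / (1 - p)))

    p, eps = 0.6, 0.01
    q = 1 - p
    series = (eps**2 / (2 * p * q)
              + (2 * p - 1) * eps**3 / (6 * p**2 * q**2)
              + (1 - 3 * p + 3 * p**2) * eps**4 / (12 * p**3 * q**3))
    print(h_plus(eps, p), series)  # nearly identical for small eps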

Lemma 2.

\[ e^{-n h_+(\epsilon)} = 1 - \frac{n}{2p(1-p)}\epsilon^2 - \frac{n(2p-1)}{6p^2(1-p)^2}\epsilon^3 + \frac{-2n + 6np - 6np^2 + 3n^2p - 3n^2p^2}{24p^3(1-p)^3}\epsilon^4 + O(\epsilon^5) \]

See also Figure 2, which displays \(e^{-n h_+(\epsilon)}\) as a function of \(\epsilon\) for \(p = 0.6\) and \(n = 10, 20, 30\).


Figure 2: Graph of \(e^{-n h_+(\epsilon)}\) with \(p = 0.6\) and \(n = 10\) (red), \(n = 20\) (green), and \(n = 30\) (blue).

Theorem 3 (Large Deviations Estimate). For \(0 < \epsilon < 1-p\) and \(n \ge 1\),

\[ \mathbb{P}_n\left[ \frac{S_n}{n} \ge p + \epsilon \right] \le e^{-n h_+(\epsilon)}. \]

Proof. Fix \(t > 0\) and then

\[ \mathbb{P}_n\left[ \frac{S_n}{n} \ge p + \epsilon \right] = \mathbb{P}_n\left[ e^{t(S_n - np - n\epsilon)} \ge 1 \right]. \]

Now apply Markov’s Inequality to this positive random variable:

\[ \begin{aligned} \mathbb{P}_n\left[ e^{t(S_n - np - n\epsilon)} \ge 1 \right] &\le \mathbb{E}\left[ e^{t(S_n - np - n\epsilon)} \right] = e^{-nt(p+\epsilon)}\,\mathbb{E}\left[ e^{tS_n} \right] \\ &= e^{-nt(p+\epsilon)}\sum_{k=0}^{n} e^{tk}\binom{n}{k}p^k(1-p)^{n-k} \\ &= e^{-nt(p+\epsilon)}\left( 1-p+pe^t \right)^n \\ &= e^{-n\left( t(p+\epsilon) - \ln(1-p+pe^t) \right)}. \end{aligned} \]

Now to complete the estimation, we only need to find

\[ \sup_{t > 0}\left( t(p+\epsilon) - \ln(1-p+pe^t) \right). \]

For notation, let \(g(t) = t(p+\epsilon) - \ln(1-p+pe^t)\) for parameters \(0 < p < 1\) and \(0 < \epsilon < 1-p\). Then \(g(0) = 0\) and \(g'(t) = p + \epsilon - pe^t/(1-p+pe^t)\), so \(g'(0) = \epsilon > 0\). Furthermore, \(\lim_{t\to\infty} g'(t) = p + \epsilon - 1 < 0\). Thus the supremum is attained at some strictly positive value of \(t\). The derivative is zero only at

\[ t^* = \ln\frac{-p + p^2 - \epsilon + \epsilon p}{p(p + \epsilon - 1)} = \ln\frac{(p+\epsilon)(1-p)}{p(1-p-\epsilon)}. \]

Therefore, \(g(t)\) has a maximum value of \(h_+(\epsilon)\), since by plugging in the critical point we obtain

\[ \begin{aligned} g(t^*) &= (p+\epsilon)\ln\frac{p+\epsilon}{p} + (p+\epsilon)\ln\frac{1-p}{1-p-\epsilon} - \ln\left( 1-p+p\,\frac{(p+\epsilon)(1-p)}{p(1-p-\epsilon)} \right) \\ &= (p+\epsilon)\ln\frac{p+\epsilon}{p} + (p+\epsilon)\ln\frac{1-p}{1-p-\epsilon} - \ln\left( (1-p)\left( 1+\frac{p+\epsilon}{1-p-\epsilon} \right) \right) \\ &= (p+\epsilon)\ln\frac{p+\epsilon}{p} + (p+\epsilon)\ln\frac{1-p}{1-p-\epsilon} - \ln\frac{1-p}{1-p-\epsilon} \\ &= (p+\epsilon)\ln\frac{p+\epsilon}{p} + (1-p-\epsilon)\ln\frac{1-p-\epsilon}{1-p} = h_+(\epsilon). \end{aligned} \]

□
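
To see the estimate in action, the following sketch compares the exact binomial tail probability with the bound \(e^{-n h_+(\epsilon)}\); the parameter values are chosen only for illustration (and so that \(n(p+\epsilon)\) is an exact integer).

    from math import log, exp, comb

    def h_plus(eps, p):
        return ((p + eps) * log((p + eps) / p)
                + (1 - p - eps) * log((1 - p - eps) / (1 - p)))

    def upper_tail(n, p, k):
        # exact P[S_n >= k] for S_n binomial(n, p)
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    p, eps = 0.5, 0.25
    for n in (8, 40, 80):
        tail = upper_tail(n, p, int(n * (p + eps)))
        bound = exp(-n * h_plus(eps, p))
        print(n, tail, bound, tail <= bound)  # the bound holds for every n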

Define \(h_-(\epsilon) = h_+(-\epsilon)\) for \(0 < \epsilon < p\).

Corollary 1. For \(0 < \epsilon < p\) and \(n \ge 1\),

\[ \mathbb{P}_n\left[ \frac{S_n}{n} \le p - \epsilon \right] \le e^{-n h_-(\epsilon)}. \]

Proof. Interchange the designation of success and failure, so now “success” occurs with probability \(1-p\) and “failure” occurs with probability \(p\). The complementary process has \(S_n^c = n - S_n\) successes. Consider the probability that the proportion of “successes” in this complementary process, \(S_n^c/n\), exceeds the mean \(1-p\) by \(\epsilon\):

\[ \mathbb{P}_n\left[ \frac{S_n^c}{n} \ge 1-p+\epsilon \right] \le e^{-n h_+^c(\epsilon)} \]

where \( h_+^c(\epsilon) = (1-p+\epsilon)\ln\frac{1-p+\epsilon}{1-p} + (p-\epsilon)\ln\frac{p-\epsilon}{p} = h_+(-\epsilon) = h_-(\epsilon) \). The inequality is then

\[ \mathbb{P}_n\left[ \frac{S_n}{n} \le p-\epsilon \right] = \mathbb{P}_n\left[ 1 - \frac{S_n^c}{n} \le p-\epsilon \right] = \mathbb{P}_n\left[ \frac{S_n^c}{n} \ge 1-p+\epsilon \right] \le e^{-n h_-(\epsilon)}. \]

□

Corollary 2. For \(0 < \epsilon < \min(p, 1-p)\) and \(n \ge 1\),

\[ \mathbb{P}_n\left[ \left| \frac{S_n}{n} - p \right| \ge \epsilon \right] \le e^{-n h_+(\epsilon)} + e^{-n h_-(\epsilon)}. \]

See Figure 3. Note that this estimate is correct but empty for \(\epsilon = 0\) and for a neighborhood of \(\epsilon = 0\), where the right side is greater than 1.


Figure 3: Graph of \(e^{-n h_+(\epsilon)} + e^{-n h_-(\epsilon)}\) for \(p = 0.6\).
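
Returning to the Section Starter Question, this sketch compares the Chebyshev bound \(p(1-p)/(n\epsilon^2)\) from the Weak Law of Large Numbers with the exponential bound of Corollary 2; for large \(n\) the exponential bound is far smaller.

    from math import log, exp

    def h_plus(eps, p):
        return ((p + eps) * log((p + eps) / p)
                + (1 - p - eps) * log((1 - p - eps) / (1 - p)))

    def h_minus(eps, p):
        return h_plus(-eps, p)   # valid for 0 < eps < p

    p, eps = 0.6, 0.1
    for n in (10, 100, 1000):
        chebyshev = p * (1 - p) / (n * eps**2)
        large_dev = exp(-n * h_plus(eps, p)) + exp(-n * h_minus(eps, p))
        print(n, chebyshev, large_dev)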

The Large Deviations Estimate Cannot Be Improved

We will frequently use the following lemma to estimate the value of a binomial probability for large values of n and proportionally large values of k.

Lemma 4 (Binomial Asymptotics). If \(k_n \to \infty\) and \(n - k_n \to \infty\) as \(n \to \infty\), then

\[ \mathbb{P}_n\left[ S_n = k_n \right] \sim \frac{1}{\sqrt{2\pi}}\sqrt{\frac{n}{k_n(n-k_n)}}\left( \frac{np}{k_n} \right)^{k_n}\left( \frac{n(1-p)}{n-k_n} \right)^{n-k_n} \]

as \(n \to \infty\).

Proof. We know that

\[ \mathbb{P}_n\left[ S_n = k_n \right] = \frac{n!}{k_n!\,(n-k_n)!}\,p^{k_n}(1-p)^{n-k_n}, \]

so

\[ 1 = \lim_{n\to\infty} \frac{\dfrac{n!}{k_n!\,(n-k_n)!}\,p^{k_n}(1-p)^{n-k_n}}{\mathbb{P}_n\left[ S_n = k_n \right]}. \qquad (1) \]

Use Stirling’s approximation in the form \(m! \sim \sqrt{2\pi}\, m^{m+1/2}e^{-m}\). Find the asymptotics of the binomial probability by replacing each factorial with the Stirling approximation. That is,

\[ 1 = \lim_{n\to\infty} \frac{\sqrt{2\pi}\, n^{n+1/2}e^{-n}}{n!} = \lim_{n\to\infty} \frac{\sqrt{2\pi}\, n^{k_n}\, n^{n-k_n}\sqrt{n}\, e^{-k_n}\, e^{-n+k_n}}{n!} \qquad (2) \]

and

\[ 1 = \lim_{n\to\infty} \frac{(n-k_n)!}{\sqrt{2\pi}\,(n-k_n)^{(n-k_n)+1/2}\, e^{-(n-k_n)}} = \lim_{n\to\infty} \frac{(n-k_n)!}{\sqrt{2\pi}\,\sqrt{n-k_n}\,(n-k_n)^{n-k_n}\, e^{-n+k_n}} \qquad (3) \]

and

\[ 1 = \lim_{n\to\infty} \frac{k_n!}{\sqrt{2\pi}\, k_n^{k_n+1/2}\, e^{-k_n}} = \lim_{n\to\infty} \frac{k_n!}{\sqrt{2\pi}\,\sqrt{k_n}\, k_n^{k_n}\, e^{-k_n}}. \qquad (4) \]

Then multiply the terms on the left and right sides of equations (1), (2), (3), and (4), cancel the factorials, two of the \(\sqrt{2\pi}\) terms, and the exponentials, and combine the powers to obtain

\[ \mathbb{P}_n\left[ S_n = k_n \right] \sim \frac{1}{\sqrt{2\pi}}\sqrt{\frac{n}{k_n(n-k_n)}}\left( \frac{np}{k_n} \right)^{k_n}\left( \frac{n(1-p)}{n-k_n} \right)^{n-k_n}. \]

□
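
The following sketch checks Lemma 4 numerically: the ratio of the exact binomial probability to the Stirling-based expression should approach 1 as \(n\) grows with \(k_n\) proportional to \(n\). The particular proportion is illustrative only.

    from math import sqrt, pi, comb

    def exact_pmf(n, p, k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def stirling_form(n, p, k):
        # right side of Lemma 4
        return (1 / sqrt(2 * pi)) * sqrt(n / (k * (n - k))) \
            * (n * p / k)**k * (n * (1 - p) / (n - k))**(n - k)

    p = 0.6
    for n in (10, 100, 1000):
        k = (7 * n) // 10        # k_n grows proportionally with n
        print(n, exact_pmf(n, p, k) / stirling_form(n, p, k))  # ratio tends to 1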

The estimate given by the Large Deviations result is as good as it can be in the following sense:

Theorem 5. For all \(0 < \epsilon < 1-p\),

\[ \lim_{n\to\infty} \frac{1}{n}\ln \mathbb{P}_n\left[ \frac{S_n}{n} \ge p+\epsilon \right] = -h_+(\epsilon). \]

Proof.

First, the Large Deviations Estimate can be written as

\[ \frac{1}{n}\ln \mathbb{P}_n\left[ \frac{S_n}{n} \ge p+\epsilon \right] \le -h_+(\epsilon), \]

so immediately

\[ \limsup_{n\to\infty} \frac{1}{n}\ln \mathbb{P}_n\left[ \frac{S_n}{n} \ge p+\epsilon \right] \le -h_+(\epsilon). \]

Let \(k_n = \lceil n(p+\epsilon) \rceil\) be the least integer greater than or equal to \(n(p+\epsilon)\). Then

\[ k_n - 1 < n(p+\epsilon) \le k_n. \]

Therefore, \(k_n \to \infty\) as \(n \to \infty\). Also \(k_n \sim n(p+\epsilon)\).

Note that

\[ \mathbb{P}_n\left[ S_n \ge n(p+\epsilon) \right] \ge \mathbb{P}_n\left[ S_n = k_n \right]. \]

We will show that

\[ \lim_{n\to\infty} \frac{1}{n}\ln \mathbb{P}_n\left[ S_n = k_n \right] = -h_+(\epsilon). \]

Then

\[ \liminf_{n\to\infty} \frac{1}{n}\ln \mathbb{P}_n\left[ S_n \ge n(p+\epsilon) \right] \ge -h_+(\epsilon) \]

and we will have established the theorem.

We already know that \(k_n \to \infty\) and \(n - k_n \to \infty\) as \(n \to \infty\), so Lemma 4 gives the binomial probability asymptotics

\[ \mathbb{P}_n\left[ S_n = k_n \right] \sim \frac{1}{\sqrt{2\pi}}\sqrt{\frac{n}{k_n(n-k_n)}}\left( \frac{np}{k_n} \right)^{k_n}\left( \frac{n(1-p)}{n-k_n} \right)^{n-k_n}. \qquad (5) \]

Note that the probability on the left side goes to 0 as \(n \to \infty\) by the Weak Law of Large Numbers. By the Zero Asymptotic Limits Lemma in Asymptotic Limits, the right side must also go to 0 as \(n \to \infty\). Since the two sequences are asymptotic, by the Logarithm Lemma in Asymptotic Limits, their logarithms must be asymptotic.

Since \(k_n \sim n(p+\epsilon)\) and \(n-k_n \sim n(1-p-\epsilon)\), then

\[ \frac{n}{k_n(n-k_n)} \sim \frac{1}{(p+\epsilon)(1-p-\epsilon)n} \]

so

\[ \lim_{n\to\infty} \frac{1}{n}\ln\left( \frac{1}{\sqrt{2\pi}}\sqrt{\frac{n}{k_n(n-k_n)}} \right) = 0. \]

Next,

\[ k_n \ln\frac{np}{k_n} = n(p+\epsilon)\ln\frac{p}{p+\epsilon} + n(p+\epsilon)\ln\frac{n(p+\epsilon)}{k_n} + \left( k_n - n(p+\epsilon) \right)\ln\frac{np}{k_n}. \]

Now recall that \(\left| k_n - n(p+\epsilon) \right|\) is bounded by 1, so

\[ \lim_{n\to\infty} \frac{1}{n}\ln\left( \frac{np}{k_n} \right)^{k_n} = (p+\epsilon)\ln\frac{p}{p+\epsilon}. \qquad (6) \]

Using the same basic calculations, we can show

\[ \lim_{n\to\infty} \frac{1}{n}\ln\left( \frac{n(1-p)}{n-k_n} \right)^{n-k_n} = (1-p-\epsilon)\ln\frac{1-p}{1-p-\epsilon}. \qquad (7) \]

Then in the binomial probability asymptotics (5), replace the powers with the asymptotic approximations (6) and (7); since \((p+\epsilon)\ln\frac{p}{p+\epsilon} + (1-p-\epsilon)\ln\frac{1-p}{1-p-\epsilon} = -h_+(\epsilon)\), the theorem is established. □
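
Theorem 5 can also be observed numerically: \(\frac{1}{n}\ln \mathbb{P}_n[S_n/n \ge p+\epsilon]\) approaches \(-h_+(\epsilon)\), although slowly, since the polynomial prefactor contributes a term of order \(\ln(n)/n\). A minimal sketch, using exact integer arithmetic for \(p = 1/2\) to avoid floating-point underflow:

    from math import log, comb

    def h_plus(eps, p):
        return ((p + eps) * log((p + eps) / p)
                + (1 - p - eps) * log((1 - p - eps) / (1 - p)))

    def log_upper_tail_fair(n, k):
        # log of exact P[S_n >= k] when p = 1/2
        total = sum(comb(n, j) for j in range(k, n + 1))
        return log(total) - n * log(2)

    p, eps = 0.5, 0.25
    for n in (100, 400, 1600):
        rate = log_upper_tail_fair(n, int(n * (p + eps))) / n
        print(n, rate, -h_plus(eps, p))  # rate slowly approaches -h_plus(eps)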

An Alternative Large Deviation Inequality

Say that a sequence \(a_n = \Theta(b_n)\) if there are positive constants \(C_l\) and \(C_u\) such that \(C_l b_n \le a_n \le C_u b_n\).

Theorem 6. For \(p = 1/2\) and \(0 < \alpha < 1/2\),

\[ \mathbb{P}\left[ S_n \le \alpha n \right] = \Theta\left( \frac{1}{\sqrt{n}}\, 2^{-(1-h_2(\alpha))n} \right) \]

where \(h_b(x) = -x\log_b x - (1-x)\log_b(1-x)\).

Remark. Note the alternative definition of the entropy-style function \(h_b(x) = -x\log_b x - (1-x)\log_b(1-x)\), including the base-\(b\) logarithm.

Remark. By symmetry, \(\mathbb{P}\left[ S_n \ge (1-\alpha)n \right] = \mathbb{P}\left[ S_n \le \alpha n \right]\), so this provides an inequality similar to the Large Deviations result above.

Proof.

  1. For \(1 \le k \le \alpha n\),
    \[ \frac{\binom{n}{k-1}}{\binom{n}{k}} = \frac{k}{n-k+1} \le \frac{k}{n-k} \le \frac{\alpha}{1-\alpha} < 1. \]

  2. So
    \[ \binom{n}{\lfloor \alpha n \rfloor} \le \sum_{k=0}^{\lfloor \alpha n \rfloor} \binom{n}{k} \le \binom{n}{\lfloor \alpha n \rfloor}\sum_{i \ge 0} \left( \frac{\alpha}{1-\alpha} \right)^i = \frac{1-\alpha}{1-2\alpha}\binom{n}{\lfloor \alpha n \rfloor}. \]

    The first inequality is very crude, essentially

    \[ \binom{n}{\lfloor \alpha n \rfloor} \le \binom{n}{\lfloor \alpha n \rfloor} + \sum_{k=0}^{\lfloor \alpha n \rfloor - 1} \binom{n}{k}. \]

    The proof principle here is that the terms \(\binom{n}{k}\) are relatively dominated by the greatest term \(\binom{n}{\lfloor \alpha n \rfloor}\).

  3. In particular,
    \[ \mathbb{P}\left[ S_n \le \alpha n \right] = \sum_{k=0}^{\lfloor \alpha n \rfloor}\binom{n}{k}2^{-n} \le \frac{1-\alpha}{1-2\alpha}\binom{n}{\lfloor \alpha n \rfloor}2^{-n} = \Theta\left( \binom{n}{\lfloor \alpha n \rfloor}2^{-n} \right) = \Theta\left( \mathbb{P}\left[ S_n = \lfloor \alpha n \rfloor \right] \right). \]

    Again the proof principle is that the probability \(\mathbb{P}\left[ S_n \le \alpha n \right]\) is essentially dominated by \(\mathbb{P}\left[ S_n = \lfloor \alpha n \rfloor \right]\).

  4. Note that \(1 \le (1 + O(1/n))^n \le (1 + C/n)^n \le e^C\), so \((1 + O(1/n))^n = \Theta(1)\).
  5. By Stirling’s Formula,
    \[ \binom{n}{\lfloor \alpha n \rfloor} = \Theta\left( \frac{1}{\sqrt{n}}\left( \frac{1}{\alpha^\alpha(1-\alpha)^{1-\alpha}} + O(1/n) \right)^n \right) = \Theta\left( \frac{1}{\sqrt{n}}\, 2^{h_2(\alpha)n} \right). \]

  6. Dividing by \(2^n\) finishes the proof. □
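
A numerical check of the \(\Theta\) growth rate in Theorem 6: the ratio of the exact probability to \(\frac{1}{\sqrt{n}}2^{-(1-h_2(\alpha))n}\) should remain bounded between positive constants. The values of \(\alpha\) and \(n\) below are illustrative only.

    from math import sqrt, log2, comb

    def h2(x):
        # base-2 entropy function
        return -x * log2(x) - (1 - x) * log2(1 - x)

    def lower_tail_fair(n, k):
        # exact P[S_n <= k] when p = 1/2, by integer arithmetic
        return sum(comb(n, j) for j in range(k + 1)) / 2**n

    alpha = 0.4
    for n in (50, 200, 800):
        target = (1 / sqrt(n)) * 2**(-(1 - h2(alpha)) * n)
        print(n, lower_tail_fair(n, (2 * n) // 5) / target)  # ratio stays bounded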

Remark. Let \(X\) be an arbitrary random variable. Then for every \(t > 0\),

\[ \mathbb{P}\left[ X \ge a \right] = \mathbb{P}\left[ e^{tX} \ge e^{ta} \right] \le \frac{\mathbb{E}\left[ e^{tX} \right]}{e^{ta}} = e^{-ta}M_X(t) \]

where \(M_X(t)\) is the moment-generating function of \(X\). The inequality follows from Markov’s Inequality. Likewise, for every \(t < 0\),

\[ \mathbb{P}\left[ X \le a \right] = \mathbb{P}\left[ e^{tX} \ge e^{ta} \right] \le e^{-ta}M_X(t). \]

Remark. If moreover \(X = \sum_{i=1}^n X_i\) with \(X_1,\ldots,X_n\) independent, then for \(t > 0\),

\[ \mathbb{P}\left[ X \ge a \right] \le e^{-ta}\prod_{i=1}^n \mathbb{E}\left[ e^{tX_i} \right] = e^{-ta}\prod_{i=1}^n M_{X_i}(t). \]
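
The remark suggests a procedure: minimize \(e^{-ta}M_X(t)\) over \(t > 0\) numerically. The following sketch does this by a crude grid search (a stand-in for the calculus of Theorem 3) for a binomial \(S_n\), where \(M_{S_n}(t) = (1-p+pe^t)^n\), and recovers approximately the bound \(e^{-n h_+(\epsilon)}\).

    from math import log, exp

    def h_plus(eps, p):
        return ((p + eps) * log((p + eps) / p)
                + (1 - p - eps) * log((1 - p - eps) / (1 - p)))

    def chernoff_bound(n, p, a):
        # min over a grid of t > 0 of exp(-t*a) * M(t), with M(t) = (1 - p + p*e^t)^n
        return min(exp(-t * a) * (1 - p + p * exp(t))**n
                   for t in (0.001 * i for i in range(1, 5001)))

    n, p, eps = 100, 0.6, 0.1
    print(chernoff_bound(n, p, n * (p + eps)))  # approximately...
    print(exp(-n * h_plus(eps, p)))             # ...equal to this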

Theorem 7 (Chernoff-Hoeffding). Let \(S_n = \sum_{i=1}^n X_i\) where the \(X_i\) are independent Bernoulli random variables, each with success probability \(p_i\), and set \(p = \frac{1}{n}\sum_{i=1}^n p_i\), so that \(\mathbb{E}\left[ S_n \right] = np\). For any \(\alpha\) with \(p < \alpha < 1\),

\[ \mathbb{P}\left[ S_n \ge \alpha n \right] \le e^{-D_{KL}(\alpha \| p)\, n} \]

where \(D_{KL}(x \| y) = x\ln\frac{x}{y} + (1-x)\ln\frac{1-x}{1-y}\). Similarly, for \(0 < \alpha < p\),

\[ \mathbb{P}\left[ S_n \le \alpha n \right] \le e^{-D_{KL}(\alpha \| p)\, n}. \]

Proof.

  1. Let \(q_i = 1 - p_i\) and \(q = \frac{1}{n}\sum_{i=1}^n q_i = 1 - p\).
  2. Let \(f(x) = \ln(xe^t + 1 - x)\). Then \(f(x)\) is concave down since \(f''(x) = -(e^t - 1)^2/(xe^t + 1 - x)^2\).
  3. Therefore
    \[ \frac{1}{n}\sum_{i=1}^n f(p_i) \le f(p) \]

    by Jensen’s Inequality.

  4. Then
    \[ \mathbb{P}\left[ S_n \ge \alpha n \right] \le e^{-t\alpha n}\prod_{i=1}^n \mathbb{E}\left[ e^{tX_i} \right] = e^{-t\alpha n}\prod_{i=1}^n \left( p_i e^t + q_i \right) \]

    since \(e^{f(p_i)} = p_i e^t + q_i\). Then

    \[ e^{-t\alpha n}\prod_{i=1}^n \left( p_i e^t + q_i \right) \le e^{-t\alpha n}\left( p e^t + q \right)^n \]

    since \(\prod_{i=1}^n (p_i e^t + q_i) = e^{\sum_i f(p_i)} \le e^{nf(p)} = (pe^t + q)^n\) by step 3.

  5. Now let \(\beta = 1 - \alpha\) and take \(e^t = \frac{\alpha q}{\beta p}\). Then
    \[ \mathbb{P}\left[ S_n \ge \alpha n \right] \le \left( \frac{\beta p}{\alpha q} \right)^{\alpha n}\left( \frac{\alpha q}{\beta} + q \right)^n = \left( \frac{p}{\alpha} \right)^{\alpha n}\left( \frac{q}{\beta} \right)^{\beta n} = e^{-D_{KL}(\alpha \| p)\, n}. \]

□
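
Since Theorem 7 allows unequal \(p_i\), the following sketch computes the exact upper tail of the resulting Poisson binomial distribution by dynamic programming and checks it against \(e^{-D_{KL}(\alpha \| p)n}\); the particular \(p_i\) values are illustrative only.

    from math import log, exp

    def kl(x, y):
        return x * log(x / y) + (1 - x) * log((1 - x) / (1 - y))

    def poisson_binomial_pmf(ps):
        # distribution of a sum of independent Bernoulli(p_i), by dynamic programming
        dist = [1.0]
        for p in ps:
            new = [0.0] * (len(dist) + 1)
            for k, mass in enumerate(dist):
                new[k] += mass * (1 - p)   # this trial fails
                new[k + 1] += mass * p     # this trial succeeds
            dist = new
        return dist

    ps = [0.5, 0.6, 0.7] * 20              # n = 60 trials with average p = 0.6
    n, p, alpha = len(ps), sum(ps) / len(ps), 0.75
    tail = sum(poisson_binomial_pmf(ps)[int(alpha * n):])
    print(tail, exp(-kl(alpha, p) * n), tail <= exp(-kl(alpha, p) * n))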

Corollary 3 (Multiplicative Chernoff Bound). Let \(S_n = \sum_{i=1}^n X_i\) where the \(X_i\) are independent Bernoulli random variables, each with success probability \(p_i\), and set \(p = \frac{1}{n}\sum_{i=1}^n p_i\), so that \(\mu = \mathbb{E}\left[ S_n \right] = np\). For any \(\epsilon > 0\),

\[ \mathbb{P}\left[ S_n \ge (1+\epsilon)\mu \right] \le \left( \frac{e^\epsilon}{(1+\epsilon)^{1+\epsilon}} \right)^\mu \le e^{-\epsilon^2\mu/(2+\epsilon)} \]

and for \(0 < \epsilon < 1\),

\[ \mathbb{P}\left[ S_n \le (1-\epsilon)\mu \right] \le \left( \frac{e^{-\epsilon}}{(1-\epsilon)^{1-\epsilon}} \right)^\mu \le e^{-\epsilon^2\mu/2}. \]

Proof.

  1. Set \(\alpha = (1+\epsilon)p\) and \(\beta = 1 - \alpha = q - \epsilon p\).
  2. Then
    \[ \mathbb{P}\left[ S_n \ge \alpha n \right] \le \left( \frac{p}{\alpha} \right)^{\alpha n}\left( \frac{q}{\beta} \right)^{\beta n} = \left( \frac{1}{1+\epsilon} \right)^{\alpha n}\left( 1 + \frac{\epsilon p}{\beta} \right)^{\beta n} \le \left( \frac{1}{1+\epsilon} \right)^{\alpha n} e^{\epsilon p n} = \left( \frac{e^\epsilon}{(1+\epsilon)^{1+\epsilon}} \right)^{pn} \le e^{-\epsilon^2 pn/(2+\epsilon)}. \]

    The last inequality follows from the elementary inequality \(\ln(1+\epsilon) \ge \frac{2\epsilon}{2+\epsilon}\) for \(\epsilon \ge 0\).

  3. The argument for \(\mathbb{P}\left[ S_n \le \alpha n \right]\) is similar, setting \(\alpha = (1-\epsilon)p\) and \(\beta = 1 - \alpha = q + \epsilon p\).

□

Corollary 4. For \(\mu = \mathbb{E}\left[ S_n \right]\) and \(0 < \epsilon < 1\),

\[ \mathbb{P}\left[ \left| S_n - \mu \right| \ge \epsilon\mu \right] \le 2e^{-\epsilon^2\mu/3}. \]
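
The following sketch evaluates both multiplicative bounds of Corollary 3 and the two-sided bound of Corollary 4 against exact binomial probabilities; the parameter values are illustrative only and chosen so the tail cutoffs are exact integers.

    from math import exp, comb

    def binom_pmf(n, p, k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p, eps = 200, 0.3, 0.2
    mu = n * p                                                  # mu = 60
    upper = sum(binom_pmf(n, p, k) for k in range(72, n + 1))   # 72 = (1 + eps) * mu
    lower = sum(binom_pmf(n, p, k) for k in range(0, 49))       # 48 = (1 - eps) * mu
    print(upper, exp(-eps**2 * mu / (2 + eps)))      # Corollary 3, upper tail
    print(lower, exp(-eps**2 * mu / 2))              # Corollary 3, lower tail
    print(upper + lower, 2 * exp(-eps**2 * mu / 3))  # Corollary 4, two-sided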

Remark. The Chernoff bounds are not asymptotic. In particular, they also hold when \(\epsilon = \epsilon(n)\) and \(p = p(n)\) are functions of \(n\).

Sources

This section is adapted from: Heads or Tails, by Emmanuel Lesigne, Student Mathematical Library Volume 28, American Mathematical Society, Providence, 2005, Chapter 6, [1].

The alternative theorems and proofs are adapted from class notes by Xavier Perez.

Some history and definitions are also adapted from the Wikipedia article on Large Deviations Theory.

_______________________________________________________________________________________________

Problems to Work for Understanding

__________________________________________________________________________

Books

Reading Suggestion:

References

[1]   Emmanuel Lesigne. Heads or Tails: An Introduction to Limit Theorems in Probability, volume 28 of Student Mathematical Library. American Mathematical Society, 2005.

__________________________________________________________________________

Links

Outside Readings and Links:

  1. Virtual Laboratories in Probability and Statistics > Binomial.
  2. Wikipedia, Large deviations theory.

__________________________________________________________________________

I check all the information on each page for correctness and typographical errors. Nevertheless, some errors may occur and I would be grateful if you would alert me to such errors. I make every reasonable effort to present current and accurate information for public use, however I do not guarantee the accuracy or timeliness of information on this website. Your use of the information from this website is strictly voluntary and at your risk.

I have checked the links to external sites for usefulness. Links to external websites are provided as a convenience. I do not endorse, control, monitor, or guarantee the information contained in any external website. I don’t guarantee that the links are active at all times. Use the links here with the same caution as you would all information on the Internet. This website reflects the thoughts, interests and opinions of its author. They do not explicitly represent official positions or policies of my employer.

Information on this website is subject to change without notice.

Steve Dunbar’s Home Page, http://www.math.unl.edu/~sdunbar1

Email to Steve Dunbar, sdunbar1 at unl dot edu

Last modified: Processed from LaTeX source on November 9, 2018