Steven R. Dunbar
Department of Mathematics
203 Avery Hall
Lincoln, NE 68588-0130
http://www.math.unl.edu
Voice: 402-472-3731
Fax: 402-472-8466

Stochastic Processes and

__________________________________________________________________________

Testing Financial Data for Fat Tails

_______________________________________________________________________


### Rating

Mathematically Mature: may contain mathematics beyond calculus with proofs.

_______________________________________________________________________________________________

### Section Starter Question

How can you test data for having a normal distribution?

_______________________________________________________________________________________________

### Key Concepts

1. Analysis of actual financial data for normality using standard statistical methods.
2. Making and using a quantile-quantile plot to judge normality of data.

__________________________________________________________________________

### Vocabulary

1. The Wilshire 5000 is an index of the market value of all stocks actively traded in the United States. The index is intended to measure the performance of publicly traded companies.
2. A quantile-quantile plot is a graphical method for determining if two data sets come from populations with a common distribution.

__________________________________________________________________________

### Mathematical Ideas

#### Testing Data

The purpose of this section is to demonstrate testing real financial data for the presence of fat tails. More precisely, in statistical terms, we take as a hypothesis that the data comes from a normal distribution, and then consider the consequences. We use a stock market index known as the Wilshire 5000 to test this hypothesis.

The Wilshire 5000 Total Market Index, or more simply the Wilshire 5000, is an index of the market value of all stocks actively traded in the United States. The index is intended to measure the performance of publicly traded companies headquartered in the United States. Stocks of extremely small companies are excluded.

In spite of the name, the Wilshire 5000 does not have exactly 5000 stocks. When the index was developed in the summer of 1974, it had just shy of 5,000 issues. The membership count has ranged from 3,069 to 7,562. The member count was 3,818 as of September 30, 2014.

The index is computed as

$W=\alpha \sum _{i=1}^{M}{N}_{i}{P}_{i}$

where ${P}_{i}$ is the price of one share of issue $i$ included in the index, ${N}_{i}$ is the number of shares of issue $i$, $M$ is the number of member companies included in the index, and $\alpha$ is a fixed scaling factor. The base value for the index was $1404.60$ points on the base date December 31, 1980, when it had a total market capitalization of \$1,404.596 billion. On that date, each one-index-point change in the index was equal to \$1 billion. However, index divisor adjustments due to index composition changes have changed the relationship over time, so that by 2005 each index point reflected a change of about \$1.2 billion in the total market capitalization of the index.

The index was renamed the “Dow Jones Wilshire 5000” in April 2004, after Dow Jones & Company assumed responsibility for its calculation and maintenance. On March 31, 2009 the partnership with Dow Jones ended and the index returned to Wilshire Associates.

The Wilshire 5000 is the weighted sum of a large number of variables, each of which we may reasonably assume is a random variable with finite variance. If the random variables are independent, then the Central Limit Theorem would suggest that the index should be normally distributed. Therefore, a reasonable hypothesis is that the Wilshire 5000 is a normal random variable, although we do not know the mean or variance in advance.

Data for the Wilshire 5000 is easy to obtain. For example, the Yahoo Finance page for W5000 provides a download with the Date, Open, Close, High, Low, Volume and Adjusted Close values of the index in reverse chronological order from today back to April 1, 2009, the day Wilshire Associates resumed calculation of the index. (The Adjusted Close is a price adjusted for dividends and splits; the adjustment does not affect this analysis.) The data comes in the form of a comma-separated-value text file. This file format is well-suited as input for many programs, especially spreadsheets and data analysis programs such as R. This analysis uses R.
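The Central Limit Theorem heuristic above can be illustrated with a small simulation. The following is only a sketch on a toy index: the number of issues `M`, the share counts `N`, the exponential price distribution, and the choice of `alpha` are all assumptions made for illustration and are not properties of the Wilshire 5000. It forms $W=\alpha \sum_{i=1}^{M} N_i P_i$ from independent, skewed prices with finite variance and compares the fraction of simulated values more than 2 standard deviations from the mean with the normal value.

```r
## Hypothetical toy index: all sizes and distributions below are
## assumptions chosen for illustration, not Wilshire 5000 data.
set.seed(5000)
M <- 1000                            # number of member issues
nSim <- 2000                         # number of simulated index values
N <- runif(M, min = 1, max = 10)     # fixed share counts N_i (the weights)
alpha <- 1 / sum(N)                  # scaling factor, chosen so E[W] = 1

## Each row holds one realization of M independent, skewed prices P_i
## with finite variance (exponential with mean 1).
P <- matrix(rexp(nSim * M, rate = 1), nrow = nSim, ncol = M)

## W = alpha * sum_i N_i P_i for each realization.
W <- alpha * as.vector(P %*% N)

## Normalize and compare the two-sided tail beyond 2 standard
## deviations with the normal probability 2 * pnorm(-2).
zW <- (W - mean(W)) / sd(W)
tailCompare <- c(simulated = mean(abs(zW) > 2), normal = 2 * pnorm(-2))
print(tailCompare)
```

In this toy setting the two fractions should be close, because the summands are independent by construction. The independence assumption is what the heuristic rests on, and it is the assumption most open to question for real stock prices.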
The data from December 31, 2014 back to April 1, 2009 provides 1449 records with seven fields each. Focusing on the Close prices, reversing them, and then taking the differences gives 1448 daily changes. The changes are then normalized by subtracting the mean and dividing by the standard deviation of the 1448 changes. The maximum of the 1448 normalized changes is $4.12$ and the minimum is $-6.22$. Already we have a hint that the distribution of the data has fat tails, since normally distributed data would almost never fall $4$ to $6$ standard deviations from the mean (the probability that a normal observation lies more than $4$ standard deviations from the mean is about $6\times 10^{-5}$).

In R, the `hist` command on the normalized data gives an empirical density histogram. For simplicity, here the histogram is taken over the 14 one-standard-deviation intervals from $-7$ to $7$. For this data, the density histogram of the normalized data has values

| Interval | Empirical density |
|----------|-------------------|
| $(-7,-6]$ | 0.0006906077 |
| $(-6,-5]$ | 0.0000000000 |
| $(-5,-4]$ | 0.0013812155 |
| $(-4,-3]$ | 0.0048342541 |
| $(-3,-2]$ | 0.0290055249 |
| $(-2,-1]$ | 0.1001381215 |
| $(-1,0]$ | 0.3390883978 |
| $(0,1]$ | 0.4005524862 |
| $(1,2]$ | 0.1056629834 |
| $(2,3]$ | 0.0138121547 |
| $(3,4]$ | 0.0041436464 |
| $(4,5]$ | 0.0006906077 |
| $(5,6]$ | 0.0000000000 |
| $(6,7]$ | 0.0000000000 |

This means, for example, that a fraction $0.00069$ of the 1448 points, that is 1 point, fell in the interval $\left(-7,-6\right]$ and a fraction $0.40055$, or 580 points, fell in the interval $\left(0,1\right]$.

The normal distribution gives the expected density on the same intervals. The ratio between the empirical density and the normal density gives an indication of the deviation from normality. For this data, the ratio on the interval $\left(-7,-6\right]$ is approximately $700{,}000$, and the ratio on the interval $\left(4,5\right]$ is approximately $22$. Each of these is much greater than we expect, that is, the tails of the empirical density are fatter than expected.

#### Quantile-Quantile Plots

A quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution. A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set.
By a quantile, we mean the value below which a given fraction (or percent) of the points fall. That is, the $0.3$ (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value. As another example, the median is the $0.5$ quantile.

The q-q plot is formed by plotting estimated quantiles from data set 2 on the vertical axis against estimated quantiles from data set 1 on the horizontal axis. Both axes are in units of their respective data sets. For a given point on the q-q plot, we know the quantile level is the same for both coordinates, but not what the quantile level actually is.

If the data sets have the same size, the q-q plot is essentially a plot of sorted data set 1 against sorted data set 2. If the data sets are not of equal size, the quantiles are usually picked to correspond to the sorted values from the smaller data set and then the quantiles for the larger data set are interpolated from the data.

The quantiles from the normal distribution are the values from the inverse cumulative distribution function. These quantiles are tabulated, or they can be obtained from statistical software. For example, in R the function `qnorm` gives the quantiles, e.g. `qnorm(0.25) = -0.6744898` and `qnorm(0.90) = 1.281552`.

Estimating the quantiles for the normalized Wilshire 5000 data is more laborious but is not difficult in principle. Sorting the data and then finding the value for which $100k$% of the data is less and $100\left(1-k\right)$% of the data is greater gives the $k$-quantile for $0\le k\le 1$. For example, the $0.25$-quantile for the normalized daily changes is $-0.4882337$ and the $0.90$-quantile for the normalized daily changes is $1.1371821$. Then for the q-q plot, plot the points $\left(-0.6744898,-0.4882337\right)$ and $\left(1.281552,1.1371821\right)$. Using many quantiles gives the full q-q plot.
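The construction of individual q-q points can be sketched in a few lines of R. Since the Wilshire 5000 data file is not assumed to be available here, the sketch uses a hypothetical stand-in for the normalized changes, namely 1448 draws from a Student $t$ distribution with 3 degrees of freedom; the variable names and the chosen quantile levels are also illustrative assumptions. The empirical quantiles come from R's `quantile` function, which interpolates between sorted values.

```r
## Hypothetical stand-in for the normalized daily changes
## (NOT the Wilshire 5000 data): fat-tailed Student t draws.
set.seed(1)
x <- rt(1448, df = 3)
z <- (x - mean(x)) / sd(x)

## Quantile levels at which to compare the two distributions.
levels <- c(0.01, 0.05, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99)

## Theoretical quantiles: the inverse normal cdf.
normalQ <- qnorm(levels)

## Empirical quantiles, interpolated from the sorted data.
sampleQ <- quantile(z, probs = levels, names = FALSE)

## Each row is one point (normal quantile, sample quantile) of a q-q plot.
qqPoints <- data.frame(level = levels, normal = normalQ, sample = sampleQ)
print(qqPoints)

## Using one quantile per data point gives the full plot: the sorted data
## against normal quantiles at the plotting positions ppoints(n).
n <- length(z)
fullNormalQ <- qnorm(ppoints(n))
fullSampleQ <- sort(z)
```

Calling `plot(fullNormalQ, fullSampleQ)` would then draw the full q-q plot from these two vectors.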
In R, to create a q-q plot of the normalized Wilshire 5000 data with a reference line against the normal distribution, use the two commands `qqnorm(zscoreChanges)` and `qqline(zscoreChanges)`.

In the q-q plot in Figure 1, the “twist” of the plot above and below the reference line indicates that the tails of the normalized Wilshire 5000 data are more dispersed than the standard normal data. The low quantiles of the normalized Wilshire 5000 data occur at more negative values than those of the standard normal distribution. The high quantiles occur at values greater than those of the standard normal distribution. However, for quantiles near the median, the data does seem to follow the normal distribution. The plot is a graphical representation of the fact that extreme events are more likely to occur than would be predicted by a normal distribution.

#### Sources

This section is inspired by the example of Lions [1]. Information about the Wilshire 5000 comes from [3]. The explanation of q-q plots is adapted from the NIST Engineering Statistics Handbook [2].

__________________________________________________________________________

### Algorithms, Scripts, Simulations

#### Algorithm

Take the closing values of the financial data from a comma-separated-value file and put the values in chronological order. Find the daily changes by taking differences. Normalize the changes to Z-scores by subtracting the mean and dividing by the standard deviation. Summarize the normalized scores with the R histogram command `hist`. Compare the density summary from the histogram to probabilities computed from the normal cdf.

#### Scripts

```r
## Read the downloaded data and extract the closing values.
## These two lines are particular to the data file
## and its format from finance.yahoo.com
mydata <- read.csv("table.csv")
closingValue <- mydata$Close

## The file is in reverse chronological order, so reverse it,
## then take daily changes and normalize them to Z-scores.
closingValue <- rev(closingValue)
changes <- diff(closingValue)
zscoreChanges <- (changes - mean(changes)) / sd(changes)

## Smallest integer m with all Z-scores in [-m, m].
m <- ceiling(max(abs(zscoreChanges)))

## Empirical density on the unit-width intervals from -m to m.
test <- hist(zscoreChanges, breaks = -m:m, plot = FALSE)

## Normal probability of each interval; the two outermost
## intervals are extended to -Inf and Inf to capture the full tails.
expectz <- diff(c(0, pnorm((-m + 1):(m - 1)), 1))

## Ratio of empirical to normal density on each interval.
ratios <- test$density / expectz
print(ratios)
```
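The script above needs the downloaded data file. To experiment without it, the same comparison can be run on simulated data. The following is a minimal sketch under stated assumptions: the helper name `tailRatios`, the seed, and the use of a Student $t$ distribution with 3 degrees of freedom as a fat-tailed stand-in are all illustrative choices, not part of the original analysis.

```r
## tailRatios is a hypothetical helper name; it repeats the steps of
## the script above for an arbitrary vector of changes.
tailRatios <- function(changes) {
  z <- (changes - mean(changes)) / sd(changes)
  m <- ceiling(max(abs(z)))
  h <- hist(z, breaks = -m:m, plot = FALSE)
  ## Normal probability of each unit interval; the two end
  ## intervals absorb the remaining tail probability.
  expectz <- diff(c(0, pnorm((-m + 1):(m - 1)), 1))
  data.frame(left = head(h$breaks, -1), right = h$breaks[-1],
             empirical = h$density, normal = expectz,
             ratio = h$density / expectz)
}

set.seed(2014)
normalSample <- rnorm(1448)      # thin tails, for comparison
fatSample <- rt(1448, df = 3)    # fat tails (assumed stand-in for real data)

normalRatios <- tailRatios(normalSample)
fatRatios <- tailRatios(fatSample)
print(normalRatios)
print(fatRatios)
```

For the normal sample the ratios should stay of moderate size wherever an interval contains data, while a fat-tailed sample typically produces very large ratios in the outermost intervals, the same pattern reported for the Wilshire 5000 changes.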

### Problems to Work for Understanding

1. Explain the difference between the famous index known as the Dow Jones Industrial Average (DJIA) and the Wilshire 5000. Discuss the merits of testing the DJIA for normality.
2. Repeat the tests of normality on other broad stock market indices, such as:
   - the S&P 500;
   - the CRSP U.S. Total Market Index (ticker CRSPTM1);
   - the Wilshire 4500;
   - the NASDAQ;
   - the S&P 1500; and
   - European indices such as the DAX or CDAX, the FTSE or FTSE 250.

   Provide a description of each index and the modeling reasons for believing that a normal distribution is reasonable or not.

__________________________________________________________________________

### References

[1]   Gaetan Lions. Black swan(s) – the fat tail issue. http://www.slideshare.net/gaetanlion/black-swan-the-fat-tail-issue, December 2009.

[2]   National Institute of Standards and Technology. Engineering statistics handbook. http://www.itl.nist.gov/div898/handbook/index.htm, October 2013.

[3]   Robert Waid. Wilshire 5000: Myths and misconceptions. http://wilshire.com/media/34276/wilshire5000myths.pdf, November 2014.

__________________________________________________________________________
