Chapter 1. Excel and Statistics
Several chapters of this book solve common business problems by creating complete applications using Excel features like VBA, forms, and array formulas . Before we start building applications, we need to cover some key basics. In this first chapter we look at array formulas and indexed addresses, two key Excel features used throughout the book. Next we will look at Excel’s ability to handle common statistical calculations. These features will be the building blocks for the later applications. This is not a statistics textbook and we will not examine all of Excel’s 50+ statistical functions; instead, we will look at the most commonly used functions with particular attention to those used in other parts of the book.
Array Formulas
Excel array formulas give you the ability to work on ranges of cells all at once. Suppose we have a list of numbers, and we want to know the average amount of change from one number to the next. This situation is illustrated in Figure 1-1.
The normal way to approach a problem like this is to add a new column with an intermediate calculation. In cell B2
we calculate the difference between A2
and A1
. Then we fill this formula down to cell B10
. In cell B12
we simply take the average.
With an array formula we can get the same answer using only one cell. The formula in B15
is:
=AVERAGE(A2:A10-A1:A9)
This makes sense. We want the average of the differences, and that is what the formula is asking for. However, this returns a value error! The error appears because we need to enter the formula in a special way that tells Excel the formula applies to the ranges and not to individual cells.
This is done by pressing Ctrl-Shift-Enter, all at the same time while entering the formula. Excel then displays the formula in brackets, as shown in cell B17
:
{=AVERAGE(A2:A10-A1:A9)}
The values returned in cells B12
and B17
are the same. Excel does all of the intermediate calculations and returns the result.
Array formulas save space on the worksheet; often whole columns of formulas can be eliminated. By reducing visual complexity, they make things easier to understand, and can speed up sheet recalculation.
Tip
Array formulas can also contain truth values . Excel considers true to be a binary 1 and false to be a binary 0, and sometimes this can be helpful.
In Figure 1-2 we need to find the average of the odd numbers in the list.
In the non-array approach, we would start by identifying the odd values. If a number divided by two is not equal to the integer value of itself divided by two, it is odd. The formula in cell B1
is:
=A1/2<>INT(A1/2)
This fills down to B10
and returns a value of true (binary 1) if the number is odd and false (binary 0) if it is even.
In cell C1, this is the formula:
=B1*A1
If B1
is false, it equals zero and the returned value will be zero. But if the value is true, as it will be if A1
is odd, it is a one and the returned value will be equal to A1
.
In column D we do the same thing, but this time we multiply by one using this formula:
=B1*1
This is useful because many Excel functions do not recognize binary TRUE
and FALSE
values. This simple formula converts the binary value into a number that can be used in other calculations.
Tip
We could avoid this issue, and the need for the formulas in the D column in Figure 1-2, by using the COUNTIF
function. But it is still important to know how to get truth
values
into other calculations.
The formulas in C12
and D12
take the sum of the cells above, and in cell E12
we divide C12
by D12
. Since C12
has the sum of the odd values and D12
has the count, this gives the average.
We can get the same result with this array formula:
{=(SUM((A1:A10/2<>INT(A1:A10/2))*A1:A10))/(SUM((A1:A10/2<>INT(A1:A10/2))*1))}
It does the work of two columns, two sums, and a division function. And, it does it exactly the same way. The difference is that the formula is built on ranges, not individual cells. All of the same logical steps are in the array formula.
Addressing Cells Indirectly
The INDEX
, INDIRECT
, ADDRESS
, and OFFSET
functions are used extensively in this book, providing formulas with much greater flexibility than cell references can provide.
All cells on a worksheet have a unique address (e.g, A1) and Excel lets you refer to any cell by its address. But what if you don’t know what cell you are going to want? This can happen if the address of the required piece of information changes based on other values on the worksheet.
If there is a list of items you need to be able to select from, the INDEX
function will do the job. The list can be in a column or a row. An example is shown in Figure 1-3.
We have a list of seven colors. In cell C4, the formula is:
=INDEX(A1:A7,C2)
The first entry in the formula is the range where the names of the colors are. The number in cell C2
tells the formula which color is required. In this case it is color number five. The fifth color, the one in cell A5
, is Blue. The formula in cell C4
returns a value of Blue as text.
The INDEX
function works in most cases but sometimes you may not know the row or column of the required value. It could be anywhere on the worksheet, or even on another worksheet. Figure 1-4 contains just such an example.
This time the data extends over several columns. The value in cell D3
is 373. D3
is the fourth column and the third row. The row and column numbers are in cells D12
and D11
. The ADDRESS
function in cell D14
uses them to build the address.
If you need an address to include the sheet name, and you will if you are referencing data on a different sheet, use the version in cell D17
. It lets you include the sheet name and builds it into the returned value.
The address alone does not help much, but the INDIRECT
function returns the value corresponding to an address. In cell D20
the INDIRECT
function uses the address built in cell D14
to retrieve the value in cell D3
.
The formula in cell D20
could thus also be written as:
=INDIRECT(ADDRESS(D12,D11))
Although this may seem a little complex, these functions are very useful when referencing data using both row and column values. It takes a little practice to become proficient with ADDRESS
and INDIRECT
, but it’s worth it because they provide great flexibility in referencing data in Excel.
Excel, as usual, offers more than one way to do things. The OFFSET
function does the same thing that the INDIRECT
and ADDRESS
functions do, using a different approach. The example in Figure 1-5 shows how the OFFSET
function handles the same situation.
INDIRECT
uses a text string containing the address of the required cell. OFFSET
uses row and column numbers from a given starting point to locate the cell. In this case the starting point is cell A1
and the offset is two rows down and three columns to the right. This is the same cell we referenced in Figure 1-4.
In most cases the choice between OFFSET
and INDIRECT
is a matter of personal preference. INDIRECT
lets you reference a different sheet, making it better for applications that have several worksheets. The advantage of OFFSET
is that it works with numbers rather than a text string. This means it does not need the ADDRESS
function and can be easier to use.
Statistical Functions
Statistical functions are used to describe data. When working with a list of numbers, we need to be able to find the average, to understand any trends, and to describe the distribution. In this section we look at Excel’s most commonly used statistical functions .
The Average
The average is the most common measure of the center of a group of numbers, and Excel makes taking an average easy. The basic approach is shown in Figure 1-6.
Five is the average of the numbers from one to nine (in column A, rows 1 through 9). Usually numbers will be in a column, as in item 1, but it really doesn’t make any difference as long as you know where the numbers are. In item 2 the average formula is looking at a three by three range, and in item 3 the numbers are mixed with non-numeric information.
Excel ignores cells that do not contain valid numbers, and item 4 uses this behavior to simplify taking the average of scattered numbers. The AVERAGE
function finds the numbers in the range. I don’t have to tell Excel where they are.
AVERAGEA
The AVERAGE
function is usually convenient, but fails in Figure 1-7.
On Wednesday we did not get any orders. The data contains the word “None,” but the value is really zero. In item 1, the AVERAGE
function gets the wrong answer because it ignores the cell with “None” in it. The AVERAGEA
function is made for this situation. AVERAGEA
works just like AVERAGE
, except it considers non-numeric cells to have a value of zero. In item 1 formula 2, the AVERAGEA
function calculates the average correctly.
AVERAGEA
ignores cells that are not used. In item 2, the Orders cell for Wednesday is empty and the AVERAGEA
function does not consider the cell to have a value of zero.
DAVERAGE
Excel can also calculate the average of selected items in a list using the DAVERAGE
function. Figure 1-8 shows how this function is used.
The range A1:B18
contains sales amounts and regions. This worksheet allows the user to get the average of a selected region.
First we give the user a way to choose a region. We enter the region names in D7:D10
. Then we add a combo box using the forms tool bar shown in Figure 1-9.
If the forms tool bar is not visible, go to the View → Toolbars menu and check Forms.
We click on the combo box icon and drag it to the sheet. Then we right click on it and select Format Control. This displays the dialog in Figure 1-10.
The Input range tells the combo box what values to display. The region names are in the range D7:D10
, so that is the Input range.
The Cell link is D4
. When the user selects one of the four regions, the combo box puts a number in the Cell link telling which region was selected. The Cell link does not get the region name, only a number. In Figure 1-8, East is selected. East is the third item in the list of regions, therefore the Cell link (D4) contains the number three.
We need the name of the selected region, and to get it we use the formula in cell D5
in Figure 1-8:
=INDEX(D7:D10,D4)
The INDEX
function returns an item from a range. The range D7:D10
contains our region names. Cell D4
has the number three. The INDEX
function returns the third item in the range, the name “East”.
The DAVERAGE
function is in cell D15
and has this formula:
=DAVERAGE(A1:B18,B1,D1:D2)
There are three parameters. The first is the range A1:B18
, containing all of the data including column headings. Next is B1
. This is the heading of the column we want to average. The formula could have been entered as:
=DAVERAGE(A1:B18,"Sales Amount",D1:D2)
The third parameter is the criteria range. It consists of a heading and a logical test. The heading tells the function what column to apply the test to. DAVERAGE
takes the average of the rows where the test is true.
We are only averaging sales amounts for East. The criteria range is D1:D2
. The formula in D2
is:
="=" & D5
This uses the result of the combo box selection in cell D5
and builds the text string "=East"
.
This makes the sheet interactive without using a macro. The DAVERAGE
function in cell D15
calculates the average for the selected region as soon as the user changes the selection.
DAVERAGE
is a database
function. Database functions allow you to perform common calculations on worksheet data. But the data has to be set up correctly for them to work. They require headings and a criteria range. This is fine if you are designing an Excel solution from scratch, but can lead to complications if you are adding calculations to an existing sheet.
Excel has a wide range of functions, and in most situations there is more than one way to get the desired result. In this case, I could replace the DAVERAGE
function in cell D15
with this formula:
=SUMIF(A2:A18,D2,B2:B18)/COUNTIF(A2:A18,D2)
It uses the SUMIF
and COUNTIF
functions to get the same answer, avoiding the formatting requirements of database functions
. There is no best way to do this. It depends on how the sheet is designed and what you are trying to accomplish.
Trimmed average
All numbers are not created equal. In examples, like the ones in this chapter, the numbers are made up. They can’t be wrong because they don’t mean anything. In the real world things are different.
Numbers can be recorded incorrectly, unusual events can produce odd values that confuse our view of the past and make the average inaccurate. These troublesome values are called outliers or anomalies .
When we use data that might have outliers we can increase the accuracy of calculations by ignoring the highest and lowest values. This is also sometimes called a filtered
average. Excel does this with the TRIMMEAN
function.
In Figure 1-11 you can see how this works.
We want the average number of calls per day handled by a call center. Figure 1-11 has the call count for the 25 most recent days. But in cell A21
something is wrong. A network problem on that day incorrectly routed tens of thousands of calls to our call center.
If we just take the average we get 16,265. But that is not good estimate of the real average, since it includes the problem data. We can get a better estimate by using this formula:
=TRIMMEAN(A2:A26,2/COUNT(A2:A26))
The first entry in the TRIMMEAN
function is the range of cells to be averaged. The second entry is the percentage of cells to ignore. In this case I want to eliminate the maximum and the minimum values. Therefore, I need to set the value to trim only one item at the top and bottom of the distribution. That is equal to two divided by the number of items.
This is equivalent to this formula:
=(SUM(A2:A100)-(MAX(A2:A100)+MIN(A2:A100)))/(COUNT(A2:A100)-2)
This takes the sum of all the numbers, subtracts the maximum and minimum, and then divides by two less than the item count.
It would be nice to have a function that just removes the maximum and minimum, and a custom function can do just that.
First we go to the Tools → Macros menu and select Visual Basic Editor. The editor can also be started by pressing Alt-F11. This code is then entered:
Public Function FAverage(MyRange As Excel.Range) As Double 'Find the sum of the range FAverage = Application.WorksheetFunction.Sum(MyRange) 'Subtract the maximum value FAverage = FAverage - Application.WorksheetFunction.Max(MyRange) 'Subtract the minimum value FAverage = FAverage - Application.WorksheetFunction.Min(MyRange) 'Divide by two less than the count of values in the range FAverage = FAverage / (Application.WorksheetFunction.Count(MyRange) - 2) End Function
We do not have to actually code the logic to calculate the sum, maximum, or minimum, because the application object already knows how to do these things. In effect, we are using normal Excel functions inside our code. With this code in place, our sheet has a new function called FAverage
. Its only parameter is the range of cells we are working with, and it returns the filtered average.
A custom function is fine for a one-off problem, but you may want to reuse your solution later, or there may be other Excel users doing a similar job who could use the same function. If this is the case, an Excel Add-In is an easy way to publish one or more custom functions.
We start with a new blank workbook. We bring up the Visual Basic Editor as before and add the same code. We save the workbook with the name StatHelper as an Excel Add-In. It will be saved with an extension of .xla in your Application Data/Microsoft/AddIns folder.
Then exit and restart Excel, and bring up another blank workbook. Go to the Tools → Add-Ins menu and you will see StatHelper on the list of available Add-Ins. Make sure it is checked, and your new workbook will have the FAverage
function.
You can distribute the StatHelp.xla file to others. It needs to be placed in their Application Data/Microsoft/AddIns folder. You can add additional functions to your XLA file, or even make it a group project with functions added by other users.
Moving average
Tracking changes in the average is an easy way to find trends in data. A moving average is shown in Figure 1-12.
In this example we have 65 days of closing prices for a stock. The formula in cell B21
takes the average of the preceding 20 days. The formula fills down, creating a moving average.
Moving averages are often used in charts to make trends easy to understand.
Changes in the Average
The simplest change is a linear
trend. In business this means the value is going up or down by the same amount each time period (e.g., each month sales are going up by $10,000). The TREND
function can identify the points on the trend line or forecast future values. Figure 1-13 shows how the TREND
function can build a trend line.
In Step 1 are the closing stock prices for 25 days. In column B is a heading for the trend.
In Step 2, we select the range B2:B26
and click on the formula bar. In the formula bar we enter the following:
=TREND(A2:A26,,)
The TREND
function has three parameters. First is the range of the Y values. In this case these are the stock prices. The next parameter is the range of the X values. When we are working with a value that is changing over time, like a stock price, these are just the numbers 1,2,3...to the number of values, 25 in this case. This is the default, so we don’t have to enter anything except a comma as a place holder. The third parameter is the X values we want the function to return. Here we also take the default. This is an array formula and must be entered with Ctrl-Shift-Enter. The result is shown in Step 3.
The result is a line. Each value in column B goes up by the same amount, in this case by 0.202538. If we build a chart using the data in columns A and B, it will look like Figure 1-14.
The TREND
function can also forecast future
values
. The next value can be predicted with this formula:
=TREND(A2:A26,,COUNT(A2:A26)+1)
The COUNT
function returns the number of cells in the range. Adding one tells the TREND
function that we want the next value.
Forecasting the next value is something that we might want to do from time to time, and it would be convenient if we had a function to do this based only on the range.
We can create a custom function with this code:
Public Function TrendNext(MyRange As Excel.Range) As Variant 'Find the next value from a range using the TREND function TrendNext = Application.WorksheetFunction.Trend(MyRange, , _ Application.WorksheetFunction.Count(MyRange) + 1) End Function
Once again, we use the application object to get to Excel’s built-in functions. If this code is added to the StatHelper.xla, the TrendNext
function will be ready to use. Figure 1-15 shows an example.
Both formulas return the same value, but TrendNext
is specialized and easier to use.
Growth
Trends are not always linear. Values can go up or down in ways that cannot be described with a line. Exponential growth is based on multiplication. In a linear trend each item changes by the same amount. In growth each item changes by the same ratio. Excel has a GROWTH
function that models exponential growth
. It works just like the TREND
function but returns different results.
Figure 1-16 demonstrates the difference between the GROWTH
and TREND
functions.
Here the values in column A are changing exponentially. We use both the GROWTH
and TREND
functions to try to model the values. In the chart it is clear that the GROWTH
function does a better job. The trend line doesn’t really explain what the values are doing, and even starts with a negative value.
If we predict the next few values in this series, we will get better results if we use GROWTH
.
Column E shows how the TREND
function works. It starts with -162.08 and adds 112.45 each time. This gives the best fitting line possible for the values. In Column F the GROWTH
function starts with 23.06 and multiplies by 1.557 each time. This produces the best fitting exponential curve.
In most business situations the TREND
function works fine. But GROWTH
makes more sense when working with returns on investments, inflation rates, or other values that change by a percentage over time.
It is easy to see how the data is behaving in these examples, but in the real world trends are likely to be subtle and the data less consistent. It is important to understand the data before you choose which function to use. It is helpful to look at a chart before deciding between GROWTH
and TREND
.
Distributions
The average number of calls coming into a call center is 12,534 per day. But a day with exactly 12,534 is rare. If we are going to manage this process, we need to know how spread out the data is and how it is distributed.
Excel offers functions to handle the kinds of distribution most frequently encountered in business.
Normal distributions
A normal distribution has a bell shaped curve. Most of the values are near the middle. The tails of the distribution contain uncommonly high or low values. This is the most common distribution in business situations.
The simplest measure of a normal distribution is the variance
. To calculate the variance, each value is subtracted from the average and the difference is squared. Then the squared differences are summed and the sum is divided by one less than the number of values. In Excel all this is done by the VAR
function. Figure 1-17 shows how this works.
We will calculate the variance of the 20 numbers in column N. The formula in cell O1
is:
=(N1-AVERAGE(N$1:N$20))^2
It takes the difference between the value in cell N1 and the average of all 20 cells. This fills down to row 20.
Cell Q6
contains this formula:
=SUM(O1:O20)/19
It sums the squared differences and divides by 19 (one less than the number of values).
In cell Q8
we let Excel do the work by using the VAR
function and get the same result.
The most commonly used measure of spread is the standard deviation. Standard deviation is just the square root of the variance, and Excel has the STDEV
function to calculate it.
The standard deviation is used in many statistical calculations. Once you have the average and the standard deviation, Excel has functions that can tell you how a value fits into the distribution. This is illustrated in Figure 1-18.
We want to know where 46.75 fits in the distribution, and we know the average and standard deviation. The NORMDIST
function gives us the answer. In Figure 1-17 cell Q1
has the average and Q2
has the standard deviation. The formula is:
=NORMDIST(46.75,Q1,Q2,TRUE)
The first entry is the value we want to test (46.75). Next are the average and standard deviation. Finally, we use the option TRUE
to get the cumulative probability for 46.75. If we entered FALSE
for this option, the function would return the height of the distribution curve at 46.75.
The NORMDIST
function returns a value of 0.9751. This means that 97.51% of the distribution is below 46.75.
The NORMINV
function works in the opposite direction. Suppose we need to know what values are in the top 25% of the distribution. The value we need is the one at the 75% point. For the data in Figure 1-17 this formula will find the answer:
=NORMINV(0.75,Q1,Q2)
The first entry is the percentage we are looking for. Next are the average and standard deviation. This function has no TRUE
/FALSE
option. The formula returns a value of 46. The values 46 and above are the top 25% of the distribution.
Exponential distributions
Our call center gets three calls every ten minutes. The time between calls has a distribution illustrated in Figure 1-19.
The curve can start at any point in time. At the instant it begins there is no call, so zero percent of the calls happen in zero time. Half the time the next call comes in within 3 1/3 minutes. The next call arrives with 15 minutes nearly all the time. A wait of 16 minutes is possible but almost never occurs.
In this case we are looking at the time interval between events, and this produces an exponential distribution
. Excel lets you calculate probabilities
in this distribution with the EXPONDIST
function.
Before we can use this function we have to calculate a parameter called Lambda
. Lambda is the inverse of the average time. So, if our call center gets 3 calls every 10 minutes, the average time between calls is 3.33... minutes. Lambda is 1 divided by 3.33... which equals 0.3. We used minutes to determine Lambda. Therefore, the results we get from the EXPONDIST
function will be in minutes.
Suppose we need to know what percentage of the time the next call will arrive within 4 minutes. The formula is:
=EXPONDIST(4,0.3,TRUE)
The first entry is 4 because we are interested in the 4 minute interval. Next is Lambda (0.3). The option TRUE
tells the function that we want the cumulative probability. This returns a value of 0.698. This means that about 70% of the time the next call arrives within 4 minutes.
This distribution looks forward only. It has no memory. This means that even if there has been no call for the last five minutes, there is still a 70% probability that the next call will arrive in the next 4 minutes. The past does not matter, only the future.
Gamma distribution
The gamma distribution is similar to the exponential, but more general. It allows you to calculate the probability for multiple events.
We get the same result as with EXPONDIST
if we use this formula:
=GAMMADIST(4,1,3.3333,TRUE)
There are two differences from the EXPONDIST
entries. In this case Lambda is 0.3, but the GAMMADIST
function requires 1/Lambda. The entry is 3.3333. The 1 is the number of occurrences we want. Here we are interested in the next call, so just one call is involved. If we need to know the probability of getting three calls in the next four minutes, we would enter the formula like this:
=GAMMADIST(4,3,3.3333,TRUE)
This formula returns a value of 0.1205. There is a 12% chance of getting three calls in the next four minutes.
The GAMMAINV
function lets you go the other way. If we want to be 90% sure that the next call will arrive in a certain number of minutes, the formula is:
=GAMMAINV(0.9,1,3.333)
The 0.9 is the probability we want, and the number of occurrences is one, and 1 divided by Lambda is 3.333. The result is 7.67. That means that there is a 90% chance that the next call will come in during the next 7.67 minutes.
Binomial distribution
Some things have only two possible outcomes. A coin toss is heads or tails, a collections call is successful or not, a new account winds up being uncollectible or it doesn’t. For this kind of situation, probability is calculated using a binomial distribution .
If the probability that a collections call will result in a payment is 0.1 and a collector makes 75 calls in a day, what is the probability that the calls will result in exactly 10 payments?
Excel has a function that answers this question. The formula is:
=BINOMDIST(10,75,0.1,TRUE)
The ten is the number of calls we are looking for. There are 75 calls made, and the probability of success on each call is 0.1. The option TRUE
tells the function that we want the probability.
The formula returns a value of 0.873. There is an 8.73% chance that exactly 10 payments will come from the 75 calls.
Correlation
Sometimes we need to know if two groups of numbers are related. Correlation is a measure of how much similarity there is between two groups of numbers. If both groups go up or down by the same amount at the same time there is a positive correlation . If they change at the same time but in the opposite direction there is a negative correlation .
Correlation is always a number between one and negative one, and the groups of numbers being tested must have the same numbers of members. In Excel you can calculate correlation using the CORREL
function. An example is shown in Figure 1-20.
Here we have two groups of numbers. The formula in cell E1 is:
=CORREL(A2:A12,B2:B12)
As the first chart illustrates, the numbers, while not equal, tend to move together. The formula returns a value of 0.68. This is a fairly strong correlation and confirms what we see in the chart; the numbers are related. In the bottom chart the numbers are moving at the same time but in the opposite direction. This suggests a negative correlation, and the formula in cell E21
indeed returns a value of -0.805.
A correlation near zero indicates that the number groups are not related.
Get Analyzing Business Data with Excel now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.