Bowman’s Website

May 4, 2009

Statistics Notes — Prediction Intervals

Filed under: Statistics — bowman @ 7:36 pm

Prediction intervals

A prediction interval is an estimate of an interval in which future observations will fall, with certain probability, based upon present or past background samples taken.

Prediction intervals are useful for predicting, for a given x, the y value of the next experiment.

For instance, a 95% prediction interval for a given x is the range of y values that has a 95% probability of containing the y value of the next experiment.

Given a linear regression equation ŷ = mx + b and x0, a specific value of x, a c-prediction interval for y is

ŷ – E < y < ŷ + E

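where

E = tc · se · √( 1 + 1/n + n(x0 – x̄)² / (n∑x² – (∑x)²) )

The point estimate is ŷ and the maximum error of estimate is E. Here tc is the critical value from the t-distribution with n – 2 degrees of freedom, and se is the standard error of estimate. The probability that the prediction interval contains y is c.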

.

Example. Construct a 95% prediction interval for the company sales when the advertising expenditures are $2100. What can you conclude?

Here’s the data.

| Advertising expenses (1000s of $), x | Company sales (1000s of $), y |
|---|---|
| 2.4 | 225 |
| 1.6 | 184 |
| 2.0 | 220 |
| 2.6 | 240 |
| 1.4 | 180 |
| 1.6 | 184 |
| 2.0 | 186 |
| 2.2 | 215 |

Here’s what we have figured out thus far…

.

r = 0.9129

m = 50.7287

b = 104.0608

x̄ = 1.975

ȳ = 204.25

The regression line is ŷ = mx + b = 50.7287x + 104.0608

The total variation is 3813.5.

The explained variation is 3178.1502.

The unexplained variation is 635.3441.

r² = 0.8334

This means that 83.3395% of the variation of y can be explained by the relationship between x and y.

The remaining 16.6605% of the variation is unexplained and is due to other factors or to sampling error.

se = 10.2903

That means the standard deviation of the company sales for a specific advertising expenditure is about $10,290.30.

.

All of the above is very impressive.

.

Let’s continue and find a 95% prediction interval when the advertising expenditures are $2100.

.

We need some of the above information and these next few items.

c = 0.95

x0 = $2100; because x is measured in thousands of dollars, we use x0 = 2.1

n = 8

d.f. = n – 2 = 6

tc = 2.447. This comes from our t-distribution table, using the same critical values as for a confidence interval. Remember this is a t-distribution, so the degrees of freedom are very important.

.

Find the point estimate.

ŷ = 50.7287x + 104.0608

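For x0 = 2.1, the point estimate is ŷ = 50.7287(2.1) + 104.0608 ≈ 210.5911

Find E.

E = tc · se · √( 1 + 1/n + n(x0 – x̄)² / (n∑x² – (∑x)²) )

With ∑x = 15.8 and ∑x² = 32.44 from the data,

E = 2.447(10.2903) · √( 1 + 1/8 + 8(2.1 – 1.975)² / (8(32.44) – (15.8)²) ) ≈ 26.8576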

Find the interval.

Using ŷ = 210.5911 and E = 26.8576

ŷ – E < y < ŷ + E

Left endpoint: ŷ – E = 210.5911 – 26.8576 = 183.7335

Right endpoint: ŷ + E = 210.5911 + 26.8576 = 237.4487

Prediction Interval: (183.7335, 237.4487)

So, you can conclude with 95% confidence that when the advertising expenditures are $2100, the company sales will be between $183,733.50 and $237,448.70.
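The whole calculation can also be checked with a few lines of Python. This is only a sketch: the function name `prediction_interval` and its arguments are made up for illustration, and the critical value tc = 2.447 is typed in from the t-table rather than computed.

```python
import math

# Data from the example: advertising expenses (x) and company sales (y),
# both in thousands of dollars.
x = [2.4, 1.6, 2.0, 2.6, 1.4, 1.6, 2.0, 2.2]
y = [225, 184, 220, 240, 180, 184, 186, 215]


def prediction_interval(x, y, x0, t_c):
    """Return (y_hat, E, left, right) for a prediction interval at x0.

    t_c is the critical t value for n - 2 degrees of freedom,
    looked up in a table (2.447 for c = 0.95 and d.f. = 6).
    """
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    # Slope and intercept of the least-squares regression line.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n

    # Standard error of estimate from the unexplained variation.
    unexplained = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(unexplained / (n - 2))

    # Point estimate and maximum error of estimate.
    y_hat = m * x0 + b
    x_bar = sum_x / n
    E = t_c * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat, E, y_hat - E, y_hat + E


y_hat, E, left, right = prediction_interval(x, y, x0=2.1, t_c=2.447)
print(f"y_hat = {y_hat:.4f}, E = {E:.4f}")
print(f"Prediction interval: ({left:.4f}, {right:.4f})")
```

Because the code keeps m and b at full precision instead of rounding them to four places, the last digit of the results may differ slightly from the hand calculation.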

Statistics Notes — Measures of Regression — Another Formula for finding Standard Error

Filed under: Statistics — bowman @ 7:36 pm

There is another formula for finding the standard error of estimate for a regression line.
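One widely used shortcut computes the standard error straight from the column sums, without finding each ŷi first: se = √( (∑y² – b∑y – m∑xy) / (n – 2) ). Whether this is the exact formula this note had in mind is an assumption. The sketch below applies it to the sums from the advertising example.

```python
import math

# Sums from the advertising example (n = 8 ordered pairs).
n = 8
sum_x, sum_y = 15.8, 1634
sum_xy, sum_x2, sum_y2 = 3289.8, 32.44, 337558

# Slope and intercept at full precision. Rounding m and b first shifts
# the result a little, because large terms nearly cancel below.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

# Assumed shortcut: unexplained variation = sum(y^2) - b*sum(y) - m*sum(xy).
unexplained = sum_y2 - b * sum_y - m * sum_xy
s_e = math.sqrt(unexplained / (n - 2))
print(f"unexplained variation = {unexplained:.4f}")
print(f"s_e = {s_e:.4f}")
```

The appeal of this form is that it needs only the sums already collected for r, m, and b.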

Statistics Notes — Measures of Regression

Filed under: Statistics — bowman @ 7:35 pm

Three types of variation about a regression line

1. The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y.

Total variation = ∑(yi – ȳ)²

2. The explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y.

Explained variation = ∑(ŷi – ȳ)²

3. The unexplained variation is the sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value.

Unexplained variation = ∑(yi – ŷi)²

The sum of the explained and unexplained variation is equal to the total variation.

Total variation = Explained variation + Unexplained variation

As its name implies, the explained variation can be explained by the relationship between x and y.

The unexplained variation cannot be explained by the relationship between x and y and is due to chance or other variables.
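The three definitions translate almost word for word into Python. This is a minimal sketch using the advertising data and the rounded regression line ŷ = 50.7287x + 104.0608; the variable names are just for illustration. Because m and b are rounded, explained plus unexplained matches the total only approximately.

```python
# Advertising expenses (x) and company sales (y), in thousands of dollars.
x = [2.4, 1.6, 2.0, 2.6, 1.4, 1.6, 2.0, 2.2]
y = [225, 184, 220, 240, 180, 184, 186, 215]

# Rounded slope and intercept of the regression line.
m, b = 50.7287, 104.0608

y_bar = sum(y) / len(y)            # mean of y
y_hat = [m * xi + b for xi in x]   # predicted y-value for each x

# 1. Total variation: observed y against the mean of y.
total = sum((yi - y_bar) ** 2 for yi in y)
# 2. Explained variation: predicted y against the mean of y.
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
# 3. Unexplained variation: observed y against predicted y.
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(f"total       = {total:.4f}")
print(f"explained   = {explained:.4f}")
print(f"unexplained = {unexplained:.4f}")
print(f"explained + unexplained = {explained + unexplained:.4f}")
```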

.

Example. Find the values of the three types of variation about a regression line determined from this data.

| Advertising expenses (1000s of $), x | Company sales (1000s of $), y |
|---|---|
| 2.4 | 225 |
| 1.6 | 184 |
| 2.0 | 220 |
| 2.6 | 240 |
| 1.4 | 180 |
| 1.6 | 184 |
| 2.0 | 186 |
| 2.2 | 215 |

This example is from a previous day’s work. On that day we determined the correlation coefficient and the regression line.

| Advertising expenses (1000s of $), x | Company sales (1000s of $), y | xy | x² | y² |
|---|---|---|---|---|
| 2.4 | 225 | 540 | 5.76 | 50625 |
| 1.6 | 184 | 294.4 | 2.56 | 33856 |
| 2.0 | 220 | 440 | 4 | 48400 |
| 2.6 | 240 | 624 | 6.76 | 57600 |
| 1.4 | 180 | 252 | 1.96 | 32400 |
| 1.6 | 184 | 294.4 | 2.56 | 33856 |
| 2.0 | 186 | 372 | 4 | 34596 |
| 2.2 | 215 | 473 | 4.84 | 46225 |
| ∑x = 15.8 | ∑y = 1634 | ∑xy = 3289.8 | ∑x² = 32.44 | ∑y² = 337,558 |

.

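Using these sums and n = 8, the correlation coefficient is

r = (n∑xy – (∑x)(∑y)) / ( √(n∑x² – (∑x)²) · √(n∑y² – (∑y)²) )

r = (8(3289.8) – (15.8)(1634)) / ( √(8(32.44) – (15.8)²) · √(8(337,558) – (1634)²) ) ≈ 0.9129

Since r is close to 1, there is a strong positive linear correlation. As the amount of spending on advertising increases, the company sales also increase.

Using these sums and n = 8, the slope is

m = (n∑xy – (∑x)(∑y)) / (n∑x² – (∑x)²) = 501.2 / 9.88 ≈ 50.7287

and the y-intercept is

b = ȳ – m·x̄ = 204.25 – 50.7287(1.975) ≈ 104.0608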

Here’s what we have…

r = 0.9129

m = 50.7287

b = 104.0608

x̄ = 1.975

ȳ = 204.25

The regression line is ŷ = mx + b = 50.7287x + 104.0608
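As a quick check, r, m, and b can be recomputed from the five sums in a few lines of Python. This is only a sketch; the variable names are chosen for illustration.

```python
import math

# Sums from the table (n = 8 ordered pairs).
n = 8
sum_x, sum_y = 15.8, 1634
sum_xy, sum_x2, sum_y2 = 3289.8, 32.44, 337558

# The same numerator appears in both r and m.
numerator = n * sum_xy - sum_x * sum_y

# Correlation coefficient.
r = numerator / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)

# Slope and y-intercept of the regression line.
m = numerator / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * (sum_x / n)

print(f"r = {r:.4f}")
print(f"m = {m:.4f}")
print(f"b = {b:.4f}")
```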

.

| Advertising expenses (1000s of $), x | Company sales (1000s of $), y | ŷi | (yi – ȳ)² | (ŷi – ȳ)² | (yi – ŷi)² |
|---|---|---|---|---|---|
| 2.4 | 225 | 225.80968 | 430.5625 | 464.81980 | 0.65558 |
| 1.6 | 184 | 185.22672 | 410.0625 | 361.88518 | 1.50484 |
| 2.0 | 220 | 205.5182 | 248.0625 | 1.60833 | 209.72253 |
| 2.6 | 240 | 235.95542 | 1278.0625 | 1005.23365 | 16.35863 |
| 1.4 | 180 | 175.08098 | 588.0625 | 850.83173 | 24.19676 |
| 1.6 | 184 | 185.22672 | 410.0625 | 361.88518 | 1.50484 |
| 2.0 | 186 | 205.5182 | 333.0625 | 1.60833 | 380.96013 |
| 2.2 | 215 | 215.66394 | 115.5625 | 130.27803 | 0.44082 |
| | ∑y = 1634 | | ∑ = 3813.5 | ∑ = 3178.15024 | ∑ = 635.34413 |

ȳ = 204.25

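To find ŷ, substitute each x into the equation of the regression line. On the calculator, with the x-values in list L1 and the y-values in list L2, the remaining columns are computed as follows.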

L3 = 50.7287*L1+104.0608

L4=(L2-204.25)^2

L5=(L3-204.25)^2

L6=(L2-L3)^2

.

The total variation is 3813.5.

The explained variation is 3178.1502.

The unexplained variation is 635.3441.

.

The coefficient of determination r² is the ratio of the explained variation to the total variation.

r² = Explained variation / Total variation

.

Example. Using the data found in the previous example, determine the coefficient of determination.

The total variation is 3813.5.

The explained variation is 3178.1502.

So…

r² = 3178.1502 / 3813.5 ≈ 0.8334

This means that 83.3395% of the variation of y can be explained by the relationship between x and y.

The remaining 16.6605% of the variation is unexplained and is due to other factors or to sampling error.
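The same arithmetic in Python, as a small sketch using the two variations found above. Taking the square root also recovers the correlation coefficient, which is a handy cross-check.

```python
import math

# Variations from the previous example.
total_variation = 3813.5
explained_variation = 3178.1502

# Coefficient of determination: explained share of the total variation.
r_squared = explained_variation / total_variation

print(f"r^2 = {r_squared:.4f}")
print(f"explained:   {r_squared:.4%}")
print(f"unexplained: {1 - r_squared:.4%}")

# r is positive here because the slope of the regression line is positive.
print(f"r = {math.sqrt(r_squared):.4f}")
```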

.

Using the Super TI.

Nice…

.

The standard error of estimate, se, is the standard deviation of the observed yi-values about the predicted ŷ-value for a given xi-value.

.

It is given by

se = √( ∑(yi – ŷi)² / (n – 2) )

where n is the number of ordered pairs in the data set.

.

The sum ∑(yi – ŷi)² is the total of the sixth column in our table earlier. It is the unexplained variation.

.

Example. Using the data found in the previous example, determine the standard error of estimate.

From earlier… n = 8 and ∑(yi – ŷi)² = 635.34413

se = √( 635.34413 / (8 – 2) ) ≈ 10.2903

The standard error of estimate is about 10.2903. That means the standard deviation of the company sales for a specific advertising expenditure is about $10,290.30.
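Here is the same calculation as a short Python sketch, including the conversion from thousands of dollars back to dollars. The variable names are illustrative only.

```python
import math

n = 8                              # number of ordered pairs
unexplained_variation = 635.34413  # sum of (yi - y_hat_i)^2 from the table

# Standard error of estimate, in thousands of dollars.
s_e = math.sqrt(unexplained_variation / (n - 2))

# Sales are recorded in thousands of dollars, so scale up for dollars.
s_e_dollars = s_e * 1000

print(f"s_e = {s_e:.4f}")
print(f"about ${s_e_dollars:,.2f}")
```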
