Logistic Regression
Using SPSS
See
www.stattutorials.com/SPSSDATA
for files mentioned in this tutorial © TexaSoft, 2008
These SPSS statistics tutorials briefly explain the use and
interpretation of standard statistical analysis techniques for Medical,
Pharmaceutical, Clinical Trials, Marketing or Scientific Research. The examples
include how-to instructions for SPSS Software.
Logistic Regression in SPSS
This example is adapted from information in Statistical Analysis Quick Reference
Guidebook (Elliott and Woodward, 2007).
A sales director for a chain of appliance stores wants to find out what
circumstances encourage customers to purchase extended warranties after a major
appliance purchase. The response variable is an indicator of whether or not a
warranty is purchased. The predictor variables they want to consider are
Gender, Gift, Age, Price, and Race.
There are several strategies you can take to develop the “best” model for the
data. It is recommended that you examine several models before determining
which one is best for your analysis. (In this example we allow the computer
to help specify important variables, but it is inadvisable to accept a
computer-designated model without examining alternatives.) Begin by examining
the significance of each variable in a fully populated model.
1. Open the data set named WARRANTY.SAV (downloadable from the data
section) and choose Analyze/Regression/Binary Logistic.
2. Select Bought as the dependent variable and Gender, Gift, Age,
Price, and Race as the covariates (i.e., the independent or predictor
variables).
3. Click on the Categorical checkbox (it is a button in SPSS version 16)
and specify Race as a categorical variable. Click Continue and then
OK. This produces the following SPSS output table.
Variables in the Equation

| | | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|---|
| Step 1 | Gender | -3.772 | 2.568 | 2.158 | 1 | .142 | .023 |
| | Price | .001 | .000 | 3.363 | 1 | .067 | 1.001 |
| | Age | .091 | .056 | 2.638 | 1 | .104 | 1.096 |
| | Gift | 2.715 | 1.567 | 3.003 | 1 | .083 | 15.112 |
| | Race | | | 2.827 | 3 | .419 | |
| | Race(1) | 3.773 | 13.863 | .074 | 1 | .785 | 43.518 |
| | Race(2) | 1.163 | 13.739 | .007 | 1 | .933 | 3.199 |
| | Race(3) | 6.347 | 14.070 | .203 | 1 | .652 | 570.898 |
| | Constant | -12.018 | 14.921 | .649 | 1 | .421 | .000 |
The “Variables in the Equation” table shows the output resulting from
including all of the candidate predictor variables in the equation. Notice
that the Race variable, which was originally coded as 1=White,
2=African American, 3=Hispanic and 4=Other, has been changed (by the SPSS
logistic procedure) into three (4 - 1) indicator variables called Race(1),
Race(2), and Race(3). These three variables each enter the
equation with their own coefficient and p-value, and there is an overall
p-value given for Race.
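The recoding of Race into three indicator variables can be illustrated with a short Python sketch. This is only an illustration, not SPSS's own code: the function name `indicator_code` is hypothetical, and it assumes that the last category (4=Other) serves as the reference level, which is the usual SPSS default for indicator contrasts but should be checked against the "Categorical Variables Codings" table in your own output.

```python
# Race codes as described in the tutorial.
RACE_LABELS = {1: "White", 2: "African American", 3: "Hispanic", 4: "Other"}

def indicator_code(race, reference=4):
    """Return the k-1 indicator values for a race code.

    The reference level (assumed here to be 4=Other) gets all zeros.
    """
    levels = [code for code in sorted(RACE_LABELS) if code != reference]
    return [1 if race == code else 0 for code in levels]

# Show the values of Race(1), Race(2), Race(3) for each original code.
for code, label in RACE_LABELS.items():
    print(f"{code}={label}: {indicator_code(code)}")
```

Under this assumption each coefficient for Race(1), Race(2), and Race(3) compares one group with the reference group, which is why four categories need only three indicators.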
The significance of each variable is measured using a Wald statistic. Using
p=0.10 as a cutoff criterion for not including variables in the equation, it
can be seen that Gender (p=0.142) and Race (p=0.419) do not
appear to be important predictor variables. Age is marginal (p=0.104),
but we’ll leave it in for the time being. Rerun the analysis after
taking out Gender and Race as predictor variables. Rerunning the analysis
without these “unimportant” variables yields the following output:
Variables in the Equation

| | | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|---|
| Step 1 | Price | .000 | .000 | 6.165 | 1 | .013 | 1.000 |
| | Age | .064 | .032 | 4.132 | 1 | .042 | 1.066 |
| | Gift | 2.339 | 1.131 | 4.273 | 1 | .039 | 10.368 |
| | Constant | -6.096 | 2.142 | 8.096 | 1 | .004 | .002 |
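The Wald and Sig. columns in these tables can be approximately reproduced from the B and S.E. columns. The sketch below assumes the usual one-degree-of-freedom form of the Wald statistic, (B / S.E.) squared, referred to a chi-square distribution; the function name `wald_test` is hypothetical. Because the printed coefficients are rounded to three decimals, the results agree with the SPSS output only approximately.

```python
import math

def wald_test(b, se):
    """Wald chi-square (1 df) and its p-value for a single coefficient."""
    wald = (b / se) ** 2
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Gender in the full model: B = -3.772, S.E. = 2.568
wald, p = wald_test(-3.772, 2.568)
print(f"Wald = {wald:.3f}, Sig. = {p:.3f}")
```

For Gender this comes out close to the tabled Wald of 2.158 and Sig. of .142. The overall test for Race has 3 degrees of freedom and is not covered by this one-coefficient shortcut.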
The reduced model indicates that there is significant predictive power for the
variables Gift (p=0.039), Age (p=0.042), and Price
(p=0.013). Although the p-value for Price is small, notice that the odds
ratio (OR) is 1.000 and the coefficient for Price is zero to three decimal
places. These seemingly contradictory bits of information (i.e., a small
p-value but OR = 1.0) suggest that the scale of the values for Price is
hiding the actual odds ratio relationship. If the same model is run with the
variable Price100, which is Price divided by 100, the odds ratio
for Price100 is 1.041 and the estimated coefficient for Price100 is
0.040, as shown below.
Variables in the Equation

| | | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|---|
| Step 1 | Age | .064 | .032 | 4.132 | 1 | .042 | 1.066 |
| | Gift | 2.339 | 1.131 | 4.273 | 1 | .039 | 10.368 |
| | Price100 | .040 | .016 | 6.165 | 1 | .013 | 1.041 |
| | Constant | -6.096 | 2.142 | 8.096 | 1 | .004 | .002 |
All of the other values in the table remain the same. All we have done is to
recode Price into a more usable number. Another tactic often used is to
standardize values such as Price by subtracting the mean and dividing by the
standard deviation. Using standardized scores eliminates the problem observed
with the Price variable, and also simplifies the comparison of odds ratios for
different variables.
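The effect of rescaling can be checked with a few lines of Python. This is a minimal sketch: the per-dollar coefficient 0.0004 is an assumption obtained by dividing the Price100 coefficient (0.040) by 100, the list of prices is made up purely for illustration, and the helper names are hypothetical.

```python
import math

def rescale_coefficient(b_per_unit, factor):
    """Coefficient after the predictor is divided by `factor`."""
    return b_per_unit * factor

def standardize(values):
    """Z-scores: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

b_price = 0.0004                                 # assumed per-dollar coefficient
b_price100 = rescale_coefficient(b_price, 100)   # coefficient per $100

print(f"Exp(B) per $1:   {math.exp(b_price):.3f}")     # too small to show in 3 decimals
print(f"Exp(B) per $100: {math.exp(b_price100):.3f}")

# Hypothetical prices, only to show what standardized scores look like.
prices = [500, 1200, 2500, 3850]
print([round(z, 2) for z in standardize(prices)])
```

Rescaling changes only the units of the coefficient; the Wald statistic and p-value for the variable are unaffected, which matches what the two SPSS tables show.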
The result is that we can now see that the odds that a customer who is offered
a gift will purchase a warranty are about 10 times (see Exp(B) for Gift) the
corresponding odds for a customer not offered a gift. We also observe
that for each additional $100 in Price, the odds that a customer will
purchase a warranty increase by about 4%. This tells us that people tend to
be more likely to purchase warranties for more expensive appliances. Finally,
the OR for Age, 1.066, tells us that each additional year of age increases the
odds of purchasing a warranty by about 6.6%, so older buyers are more likely
to purchase a warranty.
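These interpretations follow from the relationship Exp(B) = e^B. The short Python sketch below recomputes the odds ratios from the rounded coefficients in the final table (so the results can differ slightly from the SPSS Exp(B) column) and shows how to express the effect of a change of more than one unit; the function name `odds_ratio` is hypothetical.

```python
import math

# Rounded coefficients from the final "Variables in the Equation" table.
coefficients = {"Age": 0.064, "Gift": 2.339, "Price100": 0.040}

def odds_ratio(b, units=1.0):
    """Multiplicative change in the odds for a change of `units` in the predictor."""
    return math.exp(b * units)

for name, b in coefficients.items():
    ratio = odds_ratio(b)
    print(f"{name}: OR = {ratio:.3f} ({100 * (ratio - 1):.1f}% change in odds per unit)")

# Odds ratio for a 10-year difference in age rather than a 1-year difference.
print(f"Age, 10 years: OR = {odds_ratio(0.064, units=10):.3f}")
```

Note that odds ratios multiply rather than add: a 10-year age difference corresponds to 1.066 raised to the 10th power, not to 10 times 6.6%.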
One way to assess the model is by the Hosmer-Lemeshow criterion. To produce
this information:
4. Rerun the analysis, click on Options, and select Hosmer-Lemeshow
goodness-of-fit. Click Continue and then OK.
Hosmer and Lemeshow Test

| Step | Chi-square | df | Sig. |
|---|---|---|---|
| 1 | 1.792 | 8 | .987 |
This test divides the data into several groups based on predicted probability
values, then computes a chi-square from observed and
expected frequencies of subjects falling in the two categories of the binary
response variable within these groups. Large chi-square values (and
correspondingly small p-values) indicate a lack of fit for the model. In the
table above we see that the Hosmer-Lemeshow chi-square test for the final
warranty model yields a p-value of 0.987, which gives no evidence of lack of
fit and thus suggests that the model fits the data well. Note that the Hosmer
and Lemeshow chi-square test is not a test of importance of specific model
parameters (which may also appear in your computer printout). It is a separate
post-hoc test performed to evaluate a specific model.
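The computation behind this test can be sketched in Python. This is a simplified illustration under stated assumptions, not SPSS's exact procedure: it splits the cases into equal-sized groups after sorting by predicted probability (SPSS handles ties and group boundaries in its own way), uses groups minus 2 degrees of freedom, and computes the p-value with a closed form that is valid only for an even number of degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (chi-square, df) from observed 0/1 outcomes and predicted probabilities."""
    pairs = sorted(zip(p_hat, y), key=lambda pair: pair[0])
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        obs_yes = sum(yi for _, yi in chunk)   # observed purchases in the group
        exp_yes = sum(pi for pi, _ in chunk)   # expected purchases in the group
        obs_no = len(chunk) - obs_yes
        exp_no = len(chunk) - exp_yes
        chi2 += (obs_yes - exp_yes) ** 2 / exp_yes + (obs_no - exp_no) ** 2 / exp_no
    return chi2, groups - 2

# P-value for the chi-square and df reported in the SPSS table above.
print(f"Sig. = {chi2_sf_even_df(1.792, 8):.3f}")
```

Applied to the tabled chi-square of 1.792 with 8 degrees of freedom, the p-value calculation comes out close to the reported .987. The `hosmer_lemeshow` function would need the case-level outcomes and predicted probabilities, which can be saved from SPSS.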
Interpretation of the multiple logistic regression model
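Once we are satisfied with the model, it can be used for prediction just as
with a simple logistic regression model. For this model, using the
coefficients from the final "Variables in the Equation" table (with Price100
equal to Price divided by 100), the prediction would be

$$\hat{p} = \frac{e^{-6.096 + 0.064\,\text{Age} + 2.339\,\text{Gift} + 0.040\,\text{Price100}}}{1 + e^{-6.096 + 0.064\,\text{Age} + 2.339\,\text{Gift} + 0.040\,\text{Price100}}}$$

(For more details on predicting, see Statistical Analysis Quick
Reference Guidebook (Elliott and Woodward, 2007).)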
Using this equation it would be reasonable to predict that a person with the
characteristics (Age = 54, Price = $3,850, and Gift = 1)
would purchase a warranty because the predicted probability, computed from the
rounded coefficients in the table with Price100 = 38.5, is about 0.78, and
that the same person where no gift is offered would not be
predicted to purchase a warranty because the predicted probability is about
0.25. The typical cutoff for the decision would be 0.5 (or 50%).
Thus, using this cutoff, anyone whose score was higher than 0.5 would be
predicted to buy the warranty and anyone with a lower score would be predicted
to not buy the warranty. However, there may be times when you want to adjust
this cutoff value. Neter et al. (1996) suggest three ways to select a
cutoff value for predicting:
- Use the standard 0.5 cutoff value.
- Determine a cutoff value that will give you the best predictive fit for your
sample data. This is usually determined through trial and error.
- Select a cutoff value that will separate your sample data into a specific
proportion of your two states based on a prior known proportion split in your
population.
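The prediction and cutoff rule described above can be sketched in Python. The coefficients are the rounded values from the final "Variables in the Equation" table, so the probabilities are approximate; the function names are hypothetical, and the sketch assumes the model is expressed in terms of Price100 (Price divided by 100).

```python
import math

# Rounded coefficients from the final SPSS table.
B0, B_AGE, B_GIFT, B_PRICE100 = -6.096, 0.064, 2.339, 0.040

def predicted_probability(age, price, gift):
    """Predicted probability of buying a warranty (price given in dollars)."""
    logit = B0 + B_AGE * age + B_GIFT * gift + B_PRICE100 * (price / 100.0)
    return 1.0 / (1.0 + math.exp(-logit))

def predict_purchase(age, price, gift, cutoff=0.5):
    """True if the predicted probability exceeds the cutoff."""
    return predicted_probability(age, price, gift) > cutoff

# The example customer from the text, with and without a gift offer.
for gift in (1, 0):
    p = predicted_probability(54, 3850, gift)
    print(f"Gift = {gift}: p = {p:.2f}, predicted to buy: {predict_purchase(54, 3850, gift)}")
```

Passing a different `cutoff` value to `predict_purchase` corresponds to the second and third options in the list above.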
For example, to use the second option for deciding on a cutoff value, examine
the model classification table that is part of the SPSS logistic output:
Classification Table (a)

| | Observed | Predicted Bought: No | Predicted Bought: Yes | Percentage Correct |
|---|---|---|---|---|
| Step 1 | Bought: No | 12 | 2 | 85.7 |
| | Bought: Yes | 1 | 35 | 97.2 |
| | Overall Percentage | | | 94.0 |

a. The cut value is .500
This table indicates that the final model correctly classifies 94% of the
cases. The model used the default 0.5 cutoff value to classify each
subject’s outcome. (Notice the footnote on the table, “The cut value is
.500.”) You can rerun the analysis with a series of cutoff values such as
0.4, 0.45, 0.55 and 0.65 to see if the cutoff value could be adjusted for a
better fit. For this particular model, these alternate cutoff values do not
lead to better predictions. In this case, the default 0.5 cutoff value is
deemed sufficient. (For more information about classification see
Statistical Analysis Quick Reference Guidebook, 2007.)
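The percentages in the classification table are simple ratios of the cell counts, as the following Python sketch shows. The counts are taken from the SPSS table above; the function names are hypothetical, and `classify` is included only to show how a table for a different cutoff could be built from case-level outcomes and predicted probabilities saved from SPSS.

```python
def classify(y_true, p_hat, cutoff=0.5):
    """Build {(observed, predicted): count} for a given cutoff."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(y_true, p_hat):
        table[(y, 1 if p > cutoff else 0)] += 1
    return table

def percent_correct(table):
    """Return percent correct for observed No, observed Yes, and overall."""
    no = 100.0 * table[(0, 0)] / (table[(0, 0)] + table[(0, 1)])
    yes = 100.0 * table[(1, 1)] / (table[(1, 0)] + table[(1, 1)])
    overall = 100.0 * (table[(0, 0)] + table[(1, 1)]) / sum(table.values())
    return no, yes, overall

# Cell counts from the SPSS classification table (cut value .500).
spss_table = {(0, 0): 12, (0, 1): 2, (1, 0): 1, (1, 1): 35}
print([round(v, 1) for v in percent_correct(spss_table)])  # [85.7, 97.2, 94.0]
```

To compare cutoffs by trial and error, call `classify` with each candidate value and compare the resulting percentages.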
References
- Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2002). Applied Multiple
Regression/Correlation Analysis for the Behavioral Sciences, 3rd edition.
Lawrence Erlbaum Associates, Publishers.
- Elliott, A., and Woodward, W. (2007). Statistical Analysis Quick Reference
Guidebook. Thousand Oaks: Sage.
- Hosmer, D. W., and Lemeshow, S. (2000). Applied Logistic Regression, 2nd
edition. New York: John Wiley and Sons, Inc.
- Neter, J., Wasserman, W., Nachtsheim, C. J., and Kutner, M. H. (1996). Applied
Linear Regression Models, 3rd edition. Chicago: Irwin.