Two-Way Frequency Table Analysis
PROC FREQ, Part 2
See www.stattutorials.com/SASDATA for files mentioned in this tutorial
© TexaSoft, 2006
Analyzing Two-Way Tables
To create a table in PROC FREQ comparing two variables, use the TABLES
statement with both variables listed and separated by an asterisk (for
example, A*B). PROC FREQ will then produce a crosstabulation table (also
called a two-way table).
When you create a two-way crosstabulation, you may want to know the
statistics associated with this table. The /CHISQ option in the TABLES
statement requests the Chi-square test and related statistics. For example:
PROC FREQ; TABLES GENDER*GP/CHISQ;
will create a two-way crosstabulation table and will also cause SAS to
report a battery of statistics associated with the table.
Test Assumptions: For the Chi-square statistic, the observed data are
assumed to be counts of qualitative/categorical data such as hair color or
presence of a condition (e.g., disease or no disease).
A crosstabulation table (also sometimes called a contingency table) is
formed by counting the number of occurrences in a sample across two grouping
variables (specified in TABLES). The number of rows in a table is usually
denoted by r and the number of columns by c. Thus, a table is
said to have r x c "cells." For example, a dominant-hand (left-right) by
hair color table with 5 hair colors would be referred to as a 2 x 5 table.
Two types of tests are commonly associated with an r x c table: the test of
independence and the test of homogeneity. The hypotheses for the test of
independence are:
Ho: The variables are independent (no association between the two variables)
Ha: The variables are not independent
Thus, in the "hair" example, the null hypothesis would mean that there is no
association between dominant hand and hair color (each hand dominance
category has the same distribution of hair color). The alternative
hypothesis would mean that left-handed and right-handed people have
different distributions of hair color; perhaps left-handed people are more
likely to be brunette.
Another test that can be performed for a contingency table is a test of
homogeneity. Here the table is built from data sampled from different
populations, and the test asks whether the populations share the same
distribution. The hypotheses are:
Ho: The populations are homogeneous.
Ha: The populations are not homogeneous.
Rows (or columns) represent data from different populations, and the other
variable represents the characteristic observed in each population. The
χ² (Chi-square) test of homogeneity or independence is reported (the tests
are mathematically equivalent). Also included in the output is a likelihood
ratio chi-square, Mantel-Haenszel chi-square, phi, contingency coefficient,
and Cramer's V. For a 2x2 table, a Fisher's exact test is also performed.
For example, you could create a two-by-three table of GENDER by GP from the
SOMEDATA data set by using the following statements (PROCFREQ4.SAS):
* ASSUMES YOU HAVE A SAS LIBRARY NAMED MYDATA;
ODS RTF;
PROC FREQ DATA=MYDATA.SOMEDATA;
   TABLES GENDER*GP / CHISQ;
   TITLE 'Chi Square Analysis of a Contingency Table';
RUN;
* RUN IT AGAIN, REQUESTING EXPECTED VALUES;
PROC FREQ DATA=MYDATA.SOMEDATA;
   TABLES GENDER*GP / CHISQ EXPECTED NOROW NOCOL NOPERCENT;
RUN;
ODS RTF CLOSE;
The output for the first two-way table in this job (in part) follows:
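Table of GENDER by GP

GENDER      GP (Intervention Group)

Frequency |
Percent   |
Row Pct   |
Col Pct   |      A |      B |      C |  Total
----------+--------+--------+--------+-------
Female    |      6 |     16 |      8 |     30
          |  12.00 |  32.00 |  16.00 |  60.00
          |  20.00 |  53.33 |  26.67 |
          |  54.55 |  55.17 |  80.00 |
----------+--------+--------+--------+-------
Male      |      5 |     13 |      2 |     20
          |  10.00 |  26.00 |   4.00 |  40.00
          |  25.00 |  65.00 |  10.00 |
          |  45.45 |  44.83 |  20.00 |
----------+--------+--------+--------+-------
Total     |     11 |     29 |     10 |     50
          |  22.00 |  58.00 |  20.00 | 100.00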
The four numbers in each cell are the frequency, the total percent, the
percent by row and the percent by column. The statistics for this table are
given in the next table:
Statistics for Table of GENDER by GP

Statistic                       DF     Value     Prob
-----------------------------------------------------
Chi-Square                       2    2.0846   0.3526
Likelihood Ratio Chi-Square      2    2.2433   0.3257
Mantel-Haenszel Chi-Square       1    1.3157   0.2514
Phi Coefficient                       0.2042
Contingency Coefficient               0.2001
Cramer's V                            0.2042

WARNING: 33% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.

Sample Size = 50
The Chi-Square value is 2.08 with p=.3526. This does not provide sufficient
evidence to reject the null hypothesis, so you cannot conclude that there is
a relationship between gender and group. However, notice the warning at the
bottom of the table. It tells you that 33% of the cells have expected counts
less than 5, which may make the Chi-Square test invalid. To check this, look
at the version of the table requested in the second PROC FREQ, which asked
for the expected values to be included in the table using
TABLES GENDER*GP / CHISQ EXPECTED NOROW NOCOL NOPERCENT;
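Table of GENDER by GP

GENDER      GP (Intervention Group)

Frequency |
Expected  |      A |      B |      C |  Total
----------+--------+--------+--------+-------
Female    |      6 |     16 |      8 |     30
          |    6.6 |   17.4 |      6 |
----------+--------+--------+--------+-------
Male      |      5 |     13 |      2 |     20
          |    4.4 |   11.6 |      4 |
----------+--------+--------+--------+-------
Total     |     11 |     29 |     10 |     50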
The TABLES statement also requested that the row, column and total percents
be excluded from the table. From the resulting table you can see that two of
the six cells have expected values less than 5 (4.4 and 4), which is the 33%
mentioned in the warning. Viewing the expected values can also help you
understand why a Chi-Square statistic is significant, by showing which
observed counts depart most from their expected values.
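One way to see which cells depart most from their expected counts is to ask
PROC FREQ to print each cell's deviation and its contribution to the
Chi-Square statistic. The following is a minimal sketch, assuming the same
MYDATA.SOMEDATA data set used above; DEVIATION and CELLCHI2 are TABLES
options, and the TITLE text is only illustrative.

```sas
* SKETCH: ASSUMES THE SAS LIBRARY MYDATA AND DATA SET SOMEDATA USED ABOVE;
PROC FREQ DATA=MYDATA.SOMEDATA;
   * DEVIATION PRINTS OBSERVED MINUS EXPECTED FOR EACH CELL;
   * CELLCHI2 PRINTS EACH CELL CONTRIBUTION TO THE CHI-SQUARE;
   TABLES GENDER*GP / CHISQ EXPECTED DEVIATION CELLCHI2
                      NOROW NOCOL NOPERCENT;
   TITLE 'Expected Counts and Cell Contributions';
RUN;
```

Each cell's contribution is (observed - expected)**2 / expected. For the
Male, group C cell this is (2 - 4)**2 / 4 = 1.0, the largest contribution in
the table, and the six contributions sum to the Chi-Square value of 2.0846.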
EXERCISE: Add the FISHER option to the TABLES statement to get Fisher's
Exact test.
TABLES GENDER*GP / CHISQ FISHER EXPECTED NOROW NOCOL NOPERCENT;
Fisher's Exact test is often preferred over the Chi-Square when the numbers
in the table are small or when the table contains expected values less than
5 (as is true in this example).
Creating a Contingency Table from Summarized Data
If your data are already summarized into counts, you can use the programming
features of SAS to create a data set appropriate for the analysis
(PROCFREQ5.SAS). The 2x2 table in this example contains the counts 12 and 15
in the first row and 18 and 3 in the second row.
In the following SAS code, DO loops are used to enter these counts into a
data set in the proper format for PROC FREQ. The WEIGHT statement tells PROC
FREQ that each observation represents WT subjects.
DATA;
   DO A = 1 TO 2;
      DO B = 1 TO 2;
         INPUT WT @@;
         OUTPUT;
      END;
   END;
DATALINES;
12 15
18 3
;
ODS RTF;
PROC FREQ;
   WEIGHT WT;
   TABLES A*B / CHISQ;
   TITLE 'CHI-SQUARE ANALYSIS FOR A 2X2 TABLE';
RUN;
ODS RTF CLOSE;
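The same summarized counts can also be entered without DO loops by typing
the row value, column value and count on each data line. This is a sketch of
an equivalent layout, not the code in PROCFREQ5.SAS; the data set name COUNTS
is a hypothetical name chosen for illustration.

```sas
* SKETCH: ONE DATA LINE PER CELL, GIVING ROW (A), COLUMN (B) AND COUNT (WT);
* THE DATA SET NAME COUNTS IS HYPOTHETICAL;
DATA COUNTS;
   INPUT A B WT;
DATALINES;
1 1 12
1 2 15
2 1 18
2 2 3
;
PROC FREQ DATA=COUNTS;
   WEIGHT WT;              * EACH LINE REPRESENTS WT SUBJECTS;
   TABLES A*B / CHISQ;
   TITLE 'CHI-SQUARE ANALYSIS FOR A 2X2 TABLE';
RUN;
```

This form is longer but makes the cell each count belongs to explicit, which
can be easier to check when the table is larger than 2x2.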
The output for this program follows. The basic table has the same layout as
in the previous example. The Chi-Square statistic is 8.58 (1 df) and
p=0.0034. From this evidence you would reject the null hypothesis and
conclude that A and B are associated. For example, looking at the row
percentages, when A=1, 44% of the observations have B=1 and 56% have B=2,
whereas when A=2, 86% have B=1 and only 14% have B=2. The pattern of B is
different across categories of A.
In the 2x2 case, SAS automatically also includes Fisher's Exact Test. Most
commonly, the two-sided Fisher's p-value (p=.006) would be reported. Fisher's
is often preferred over the Chi-Square when the numbers in the table are
small or when the table contains expected values less than 5.
The output for these tests is given below:
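Table of A by B

A           B

Frequency |
Percent   |
Row Pct   |
Col Pct   |      1 |      2 |  Total
----------+--------+--------+-------
1         |     12 |     15 |     27
          |  25.00 |  31.25 |  56.25
          |  44.44 |  55.56 |
          |  40.00 |  83.33 |
----------+--------+--------+-------
2         |     18 |      3 |     21
          |  37.50 |   6.25 |  43.75
          |  85.71 |  14.29 |
          |  60.00 |  16.67 |
----------+--------+--------+-------
Total     |     30 |     18 |     48
          |  62.50 |  37.50 | 100.00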
Statistics for Table of A by B

Statistic                       DF     Value     Prob
-----------------------------------------------------
Chi-Square                       1    8.5841   0.0034
Likelihood Ratio Chi-Square      1    9.1893   0.0024
Continuity Adj. Chi-Square       1    6.9136   0.0086
Mantel-Haenszel Chi-Square       1    8.4053   0.0037
Phi Coefficient                      -0.4229
Contingency Coefficient               0.3895
Cramer's V                           -0.4229

Fisher's Exact Test
-----------------------------------
Cell (1,1) Frequency (F)         12
Left-sided Pr <= F           0.0036
Right-sided Pr >= F          0.9996

Table Probability (P)        0.0032
Two-sided Pr <= P            0.0061
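The Chi-Square value for a 2x2 table can be checked by hand with the
shortcut formula n(ad - bc)**2 / (r1 * r2 * c1 * c2), where a, b, c, d are
the four cell counts and r1, r2, c1, c2 are the row and column totals. The
DATA step below is a sketch of that check for the counts 12, 15, 18 and 3;
the variable names are hypothetical, and PROBCHI is the SAS chi-square
distribution function.

```sas
* SKETCH: HAND CHECK OF THE PEARSON CHI-SQUARE FOR THE 2X2 TABLE;
* VARIABLE NAMES ARE HYPOTHETICAL AND CHOSEN FOR ILLUSTRATION;
DATA _NULL_;
   A11 = 12; A12 = 15; A21 = 18; A22 = 3;   * CELL COUNTS;
   N = A11 + A12 + A21 + A22;               * TOTAL SAMPLE SIZE;
   CHISQ = N * (A11*A22 - A12*A21)**2 /
           ((A11+A12) * (A21+A22) * (A11+A21) * (A12+A22));
   P = 1 - PROBCHI(CHISQ, 1);               * UPPER-TAIL P-VALUE WITH 1 DF;
   PUT CHISQ= 8.4 P= 6.4;                   * WRITE BOTH VALUES TO THE LOG;
RUN;
```

Worked through, the numerator is 48 * (36 - 270)**2 = 2,628,288 and the
denominator is 27 * 21 * 30 * 18 = 306,180, giving 8.5841, which agrees with
the Chi-Square line reported by PROC FREQ.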
EXERCISE: Include RELRISK as an option in the TABLES statement:
TABLES A*B / CHISQ RELRISK;
This yields these additional statistics:
Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value    95% Confidence Limits
-------------------------------------------------------------
Case-Control (Odds Ratio)     0.1333      0.0316      0.5621
Cohort (Col1 Risk)            0.5185      0.3285      0.8184
Cohort (Col2 Risk)            3.8889      1.2937     11.6902
It is important to note that the Odds Ratio is based on Row1/Row2. If you
switch the rows, the Chi-Square statistics are all the same, but the Odds
Ratio is the inverse (1/.1333 = 7.5).
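One way to switch the rows is to list the A=2 counts first and ask PROC FREQ
to keep the levels in the order they appear in the data. The sketch below
assumes the counts are entered one cell per line; the data set name COUNTS2
is hypothetical, and ORDER=DATA is the PROC FREQ option that orders levels by
their first appearance in the input data.

```sas
* SKETCH: SAME COUNTS, WITH THE A=2 ROW ENTERED FIRST;
* THE DATA SET NAME COUNTS2 IS HYPOTHETICAL;
DATA COUNTS2;
   INPUT A B WT;
DATALINES;
2 1 18
2 2 3
1 1 12
1 2 15
;
* ORDER=DATA KEEPS LEVELS IN ORDER OF APPEARANCE, SO A=2 BECOMES ROW 1;
PROC FREQ DATA=COUNTS2 ORDER=DATA;
   WEIGHT WT;
   TABLES A*B / CHISQ RELRISK;
   TITLE 'RELATIVE RISK WITH ROWS SWITCHED';
RUN;
```

With the rows in this order the odds ratio is (18*3)... computed as
(18*15)/(3*12) = 270/36 = 7.5, the inverse of (12*3)/(15*18) = 0.1333 from
the original row order.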
End of tutorial
See http://www.stattutorials.com/SAS