Descriptive Statistics using SAS
PROC UNIVARIATE
See www.stattutorials.com/SASDATA for files mentioned in this tutorial.
© TexaSoft, 2006
These SAS statistics tutorials briefly explain the use and
interpretation of standard statistical analysis techniques for Medical,
Pharmaceutical, Clinical Trials, Marketing or Scientific Research. The examples
include how-to instructions for SAS Software.
If PROC MEANS does not produce the statistic you need for a data set, then PROC UNIVARIATE may be your choice. Although it is similar to PROC MEANS, its strength is in calculating a wider variety of statistics, especially those useful in examining the distribution of a variable.
Use PROC UNIVARIATE to examine the distribution of your data, including an assessment of normality and discovery of outliers.
The syntax of the PROC UNIVARIATE statement is:
PROC UNIVARIATE <options>;
<statements>;
Commonly used options for PROC UNIVARIATE include:
DATA= - Specifies the data set to use
NORMAL - Produces tests of normality
FREQ - Produces a frequency table
PLOT - Produces a stem-and-leaf plot, a box plot, and a normal probability plot
Commonly used statements with PROC UNIVARIATE include:
BY variable list;
VAR variable list;
OUTPUT OUT=datasetname;
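As a rough sketch of how these three statements fit together, the program below computes statistics separately for each treatment group and saves them to a new data set. The data set TRIAL, the variables TREATMENT and LOSS, the data values, and the output names LOSSSTATS, N_LOSS, MEAN_LOSS and SD_LOSS are all invented for illustration and are not part of the tutorial files.

```sas
/* Hypothetical data: TREATMENT is a group label, LOSS is numeric. */
/* The values are made up for illustration only.                   */
DATA TRIAL;
  INPUT TREATMENT $ LOSS @@;
  DATALINES;
A 4.2 A 5.1 A 3.8 B 6.0 B 5.5 B 6.4
;

/* BY-group processing expects the data to be sorted by the BY variable. */
PROC SORT DATA=TRIAL;
  BY TREATMENT;
RUN;

/* NOPRINT suppresses the printed tables; one row per treatment */
/* is written to the (hypothetical) data set LOSSSTATS.         */
PROC UNIVARIATE DATA=TRIAL NOPRINT;
  BY TREATMENT;
  VAR LOSS;
  OUTPUT OUT=LOSSSTATS N=N_LOSS MEAN=MEAN_LOSS STD=SD_LOSS;
RUN;

PROC PRINT DATA=LOSSSTATS;
RUN;
```

The keywords on the OUTPUT statement (N=, MEAN=, STD=) choose which statistics are saved and what the new variables are called.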
The BY-group specification causes UNIVARIATE to calculate statistics separately for groups of observations (e.g., a separate mean for each treatment). The OUTPUT OUT= statement allows you to write the calculated statistics to a new data set. The following SAS program (PROCUNI1.SAS) produces a large number of statistics on the variable AGE:
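DATA EXAMPLE;
  * The variable read here must match the VAR statement below (AGE);
  * Data values are omitted here. See PROCUNI1.SAS;
  INPUT AGE @@;
  DATALINES;
;
PROC UNIVARIATE NORMAL PLOT DATA=EXAMPLE;
  VAR AGE;
  * Histogram with a fitted normal curve drawn in red, line width 5;
  HISTOGRAM AGE / NORMAL(COLOR=RED W=5);
  TITLE 'PROC UNIVARIATE Example';
  FOOTNOTE 'Evaluate distribution of variables';
RUN;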
The output from this program follows. The first table gives standard descriptive statistics (Moments). These statistics give you an idea of how the values of the variable AGE are distributed.
Moments
N                 50            Sum Weights         50
Mean              10.46         Sum Observations    523
Std Deviation     2.42613323    Variance            5.88612245
Skewness          -0.5119219    Kurtosis            -0.2610615
Uncorrected SS    5759          Corrected SS        288.42
Coeff Variation   23.1943903    Std Error Mean      0.34310705
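Several entries in the Moments table follow from N, Sum Observations and Uncorrected SS. The DATA step below is a minimal sketch of those relationships using the three values from the table; the variable names are chosen here for illustration. The results written to the log should agree with the table up to rounding.

```sas
/* Recompute Moments entries from N, the sum, and the uncorrected SS. */
DATA _NULL_;
  N_OBS   = 50;      /* N                */
  SUM_AGE = 523;     /* Sum Observations */
  USS_AGE = 5759;    /* Uncorrected SS   */

  MEAN_AGE   = SUM_AGE / N_OBS;               /* Mean            */
  CSS_AGE    = USS_AGE - SUM_AGE**2 / N_OBS;  /* Corrected SS    */
  VAR_AGE    = CSS_AGE / (N_OBS - 1);         /* Variance        */
  STD_AGE    = SQRT(VAR_AGE);                 /* Std Deviation   */
  CV_AGE     = 100 * STD_AGE / MEAN_AGE;      /* Coeff Variation */
  STDERR_AGE = STD_AGE / SQRT(N_OBS);         /* Std Error Mean  */

  /* Write the results to the SAS log. */
  PUT MEAN_AGE= CSS_AGE= VAR_AGE= STD_AGE= CV_AGE= STDERR_AGE=;
RUN;
```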
The next table provides basic
measures of central tendency and spread.
Basic Statistical Measures
Location               Variability
Mean      10.46000     Std Deviation          2.42613
Median    11.00000     Variance               5.88612
Mode      12.00000     Range                  11.00000
                       Interquartile Range    3.00000
The table “Tests for Location” provides tests of the null hypothesis that the mean is zero. This is most useful when the variable holds paired differences, in which case Student’s t gives the paired t-test.
Ho: μ = 0 (The mean is 0)
Ha: μ ≠ 0 (The mean differs from 0)
The Sign test and the Signed Rank test are nonparametric tests of location.
Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    30.48611    Pr > |t|     <.0001
Sign           M    25          Pr >= |M|    <.0001
Signed Rank    S    637.5       Pr >= |S|    <.0001
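A test against zero is rarely interesting for a raw measurement such as AGE, but it is the natural test for paired differences. The sketch below assumes a hypothetical data set PAIRED with invented variables BEFORE and AFTER and made-up values. It computes the difference in the DATA step and lets PROC UNIVARIATE test whether its mean is zero.

```sas
/* Hypothetical paired data; names and values are illustrative only. */
DATA PAIRED;
  INPUT BEFORE AFTER @@;
  DIFF = AFTER - BEFORE;   /* paired difference for each subject */
  DATALINES;
10 12 11 11 9 13 12 14 8 9 10 13
;

/* "Tests for Location" on DIFF: Student's t is the paired t-test. */
PROC UNIVARIATE DATA=PAIRED;
  VAR DIFF;
RUN;
```

To test a mean against a value other than zero, the MU0= option can be added to the PROC UNIVARIATE statement, for example MU0=10.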
The tests for normality are one way of assessing whether the data appear to be normally distributed. Four tests for normality are provided:
Tests for Normality
Test                  Statistic           p Value
Shapiro-Wilk          W       0.958283    Pr < W       0.0753
Kolmogorov-Smirnov    D       0.148067    Pr > D       <0.0100
Cramer-von Mises      W-Sq    0.145762    Pr > W-Sq    0.0259
Anderson-Darling      A-Sq    0.834989    Pr > A-Sq    0.0301
Notice that in this case the tests differ in outcome (assuming a criterion of 0.05 is strictly followed): the Shapiro-Wilk test does not reject the hypothesis that the data are normally distributed (p=0.075), while the other three tests reject it.
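When only the normality tests are of interest, the rest of the output can be suppressed with an ODS SELECT statement. This is a small sketch that assumes the EXAMPLE data set from this tutorial and ODS output in use; TestsForNormality is the ODS name of the table shown above.

```sas
/* Print only the "Tests for Normality" table for AGE.   */
/* ODS SELECT applies to the procedure step that follows. */
ODS SELECT TestsForNormality;
PROC UNIVARIATE DATA=EXAMPLE NORMAL;
  VAR AGE;
RUN;
```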
The inclusion of the NORMAL and PLOT options in
PROC UNIVARIATE NORMAL PLOT DATA=EXAMPLE;
  VAR AGE;
provides the tests for normality plus a box-and-whisker plot and a stem-and-leaf diagram.
Additional output that is useful in visually assessing normality may be created by including the HISTOGRAM statement as shown below:
PROC UNIVARIATE NORMAL PLOT DATA=EXAMPLE;
  VAR AGE;
  HISTOGRAM AGE / NORMAL(COLOR=RED W=5);
The normal curve superimposed on the histogram allows you not only to see whether the data are approximately normally distributed but also to see where they depart from normality. In this case, it appears that there are more values than expected at the upper end of the range.
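A normal quantile-quantile plot is another visual check that is sometimes used alongside the histogram. The sketch below is an addition to the tutorial's program, not part of PROCUNI1.SAS. It assumes the same EXAMPLE data set and uses the QQPLOT statement of PROC UNIVARIATE with a reference line based on the estimated mean and standard deviation.

```sas
/* Normal Q-Q plot for AGE. Points close to the reference line */
/* suggest that the data are approximately normal.             */
PROC UNIVARIATE DATA=EXAMPLE NOPRINT;
  VAR AGE;
  QQPLOT AGE / NORMAL(MU=EST SIGMA=EST);
RUN;
```

NOPRINT suppresses the statistical tables here so that only the plot is produced.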