Descriptive Statistics using SAS
PROC MEANS
See www.stattutorials.com/SASDATA for files mentioned in this tutorial
© TexaSoft, 2007
These SAS statistics tutorials
briefly explain the use and interpretation of standard statistical analysis
techniques for Medical, Pharmaceutical, Clinical Trials, Marketing or Scientific
Research. The examples include how-to instructions for SAS Software.
Preliminary information about PROC MEANS
PROC MEANS produces descriptive statistics (means, standard deviations, minimum, maximum, etc.) for numeric variables in a set of data. PROC MEANS can be used for:
- Describing continuous data where the average has meaning
- Describing the means across groups
- Searching for possible outliers or incorrectly coded values
- Performing a single sample t-test
The syntax of the PROC MEANS statement is:
PROC MEANS <options>;
<statements>;
Statistical options that may be requested are listed below. (The default statistics are N, MEAN, STD, MIN and MAX.)
- N - Number of nonmissing observations
- NMISS - Number of missing observations
- MEAN - Arithmetic average
- STD - Standard deviation
- MIN - Minimum (smallest)
- MAX - Maximum (largest)
- RANGE - Range
- SUM - Sum of observations
- VAR - Variance
- USS - Uncorrected sum of squares
- CSS - Corrected sum of squares
- STDERR - Standard error
- T - Student's t value for testing Ho: μ = 0
- PRT - P-value associated with the t-test above
- SUMWGT - Sum of the WEIGHT variable values
(New to version 8.0)
- MEDIAN - 50th percentile
- P1 - 1st percentile
- P5 - 5th percentile
- P10 - 10th percentile
- P90 - 90th percentile
- P95 - 95th percentile
- P99 - 99th percentile
- Q1 - 1st quartile
- Q3 - 3rd quartile
- QRANGE - Quartile range (Q3 - Q1)
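As a rough sketch of how these keywords are combined, the step below requests a few of the listed statistics, including percentiles, for a single variable. The data set DEMO and its SCORE values are hypothetical and exist only for illustration; they are not among the tutorial files.

```sas
* Hypothetical data, for illustration only (one value is missing) *;
DATA DEMO;
INPUT SCORE @@;
DATALINES;
12 15 9 . 22 18 14 30
;
* Statistic keywords are listed on the PROC MEANS statement itself *;
PROC MEANS DATA=DEMO N NMISS MEAN MEDIAN Q1 Q3 QRANGE MAXDEC=2;
VAR SCORE;
TITLE 'Requesting specific statistics (illustrative data)';
RUN;
```

Once any statistic keyword is listed, only the listed statistics are printed; the defaults are no longer added automatically.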
Other commonly used options available in PROC MEANS include:
- DATA= Specify data set to use
- NOPRINT Do not print output
- MAXDEC=n Use n decimal places to print output
Commonly used statements with PROC MEANS include:
- BY variable list - Statistics are reported for groups in separate tables (the data must first be sorted by the BY variables)
- CLASS variable list - Statistics are reported by groups in a single table
- VAR variable list - Specifies which numeric variables to use
- OUTPUT OUT=datasetname - Statistics are written to a SAS data file
- FREQ variable - Specifies a variable that represents a count of observations
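The FREQ statement is the least familiar of these, so here is a minimal sketch of its use with pre-tallied data. The data set TALLY and the variables SCORE and COUNT are hypothetical names chosen for this illustration.

```sas
* Hypothetical tallied data, each row holds a score and how often it occurred *;
DATA TALLY;
INPUT SCORE COUNT;
DATALINES;
1 4
2 10
3 6
;
PROC MEANS DATA=TALLY N MEAN MAXDEC=2;
VAR SCORE;
FREQ COUNT;   * each row is treated as COUNT observations *;
TITLE 'FREQ statement with pre-tallied data (illustrative)';
RUN;
```

With these values the step should report N = 20 (the sum of COUNT) and a mean of 2.10, the same result as if each score had been entered on its own row.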
A few quick examples of PROC MEANS
* Simplest invocation - on all numeric variables *;
PROC MEANS;
RUN;
* Specified statistics and variables *;
PROC MEANS N MEAN STD; VAR SODIUM CARBO;
RUN;
* Subgroup descriptive statistics using BY statement *;
PROC SORT; BY SEX;
RUN;
PROC MEANS; BY SEX;
VAR FAT PROTEIN SODIUM;
RUN;
* Subgroup descriptive statistics using CLASS statement *;
PROC MEANS;
CLASS SEX;
VAR FAT PROTEIN SODIUM;
RUN;
Example 1: A simple use of PROC MEANS
This example calculates the means of several specified variables,
limiting the output to
two decimal places. (PROCMEANS1.SAS)
***************************************************************
* Data on weight, height, and age of a random sample of 12    *
* nutritionally deficient children                            *
***************************************************************;
DATA CHILDREN;
INPUT WEIGHT HEIGHT AGE;
DATALINES;
64 57 8
71 59 10
53 49 6
67 62 11
55 51 8
58 50 8
77 55 10
57 48 9
56 42 10
51 42 6
76 61 12
68 57 9
;
ODS RTF;
proc means;
Title 'Example 1a - PROC MEANS, simplest use';
run;
proc means maxdec=2; var WEIGHT HEIGHT;
Title 'Example 1b - PROC MEANS, limit decimals, specify variables';
run;
proc means maxdec=2 n mean stderr median; var WEIGHT HEIGHT;
Title 'Example 1c - PROC MEANS, specify statistics to report';
run;
ODS RTF CLOSE;
Output for Example 1:
Example 1a - PROC MEANS, simplest use
| Variable | N | Mean | Std Dev | Minimum | Maximum |
| WEIGHT | 12 | 62.7500000 | 8.9861004 | 51.0000000 | 77.0000000 |
| HEIGHT | 12 | 52.7500000 | 6.8240884 | 42.0000000 | 62.0000000 |
| AGE | 12 | 8.9166667 | 1.8319554 | 6.0000000 | 12.0000000 |
Example 1b - PROC MEANS, limit decimals, specify variables
| Variable | N | Mean | Std Dev | Minimum | Maximum |
| WEIGHT | 12 | 62.75 | 8.99 | 51.00 | 77.00 |
| HEIGHT | 12 | 52.75 | 6.82 | 42.00 | 62.00 |
Example 1c - PROC MEANS, specify statistics to report
| Variable | N | Mean | Std Error | Median |
| WEIGHT | 12 | 62.75 | 2.59 | 61.00 |
| HEIGHT | 12 | 52.75 | 1.97 | 53.00 |
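The Std Error column in Example 1c can be checked by hand against the Std Dev column in Example 1a, since the standard error of the mean is the standard deviation divided by the square root of N. A short worked check using the printed values (the symbols s and n are notation introduced here):

```latex
\[
SE_{\text{WEIGHT}} = \frac{s}{\sqrt{n}} = \frac{8.9861}{\sqrt{12}} \approx 2.59,
\qquad
SE_{\text{HEIGHT}} = \frac{6.8241}{\sqrt{12}} \approx 1.97
\]
```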
Example 2: Using PROC MEANS with “By Group” and CLASS statements
This example uses PROC MEANS to calculate means for an entire data set or by a grouping variable. (PROCMEANS2.SAS):
***************************************************
* Example 2 for PROC MEANS                        *
***************************************************;
DATA FERTILIZER;
INPUT FEEDTYPE WEIGHTGAIN;
DATALINES;
1 46.20
1 55.60
1 53.30
1 44.80
1 55.40
1 56.00
1 48.90
2 51.30
2 52.40
2 54.60
2 52.20
2 64.30
2 55.00
;
ODS RTF;
PROC SORT DATA=FERTILIZER; BY FEEDTYPE;
PROC MEANS; VAR WEIGHTGAIN; BY FEEDTYPE;
TITLE 'Summary statistics by group';
RUN;
PROC MEANS; VAR WEIGHTGAIN; CLASS FEEDTYPE;
TITLE 'Summary statistics USING CLASS';
RUN;
ODS RTF CLOSE;
Output for this SAS code is:
Summary Statistics by Group
FEEDTYPE=1
Analysis Variable : WEIGHTGAIN
| N | Mean | Std Dev | Minimum | Maximum |
| 7 | 51.4571429 | 4.7475808 | 44.8000000 | 56.0000000 |
FEEDTYPE=2
Analysis Variable : WEIGHTGAIN
| N | Mean | Std Dev | Minimum | Maximum |
| 6 | 54.9666667 | 4.7944412 | 51.3000000 | 64.3000000 |
In this first version of the output, the BY statement (along with the PROC SORT) creates two tables, one for each value of the BY variable. In the next version, the CLASS statement produces a single table broken down by group (FEEDTYPE).
Summary statistics USING CLASS
Analysis Variable : WEIGHTGAIN
| FEEDTYPE | N Obs | N | Mean | Std Dev | Minimum | Maximum |
| 1 | 7 | 7 | 51.4571429 | 4.7475808 | 44.8000000 | 56.0000000 |
| 2 | 6 | 6 | 54.9666667 | 4.7944412 | 51.3000000 | 64.3000000 |
Hands on Exercise:
1. Modify the above program to output the following statistics: N MEAN MEDIAN MIN MAX
2. Use MAXDEC=2 to limit the number of decimals in the output.
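One possible way to approach the exercise is sketched below. It repeats the FERTILIZER data from Example 2 so that it runs on its own, and it uses the CLASS form; the BY form (after PROC SORT) would work as well. The title text is an arbitrary choice.

```sas
DATA FERTILIZER;
INPUT FEEDTYPE WEIGHTGAIN;
DATALINES;
1 46.20
1 55.60
1 53.30
1 44.80
1 55.40
1 56.00
1 48.90
2 51.30
2 52.40
2 54.60
2 52.20
2 64.30
2 55.00
;
* Request only the five statistics named in the exercise, two decimals *;
PROC MEANS DATA=FERTILIZER N MEAN MEDIAN MIN MAX MAXDEC=2;
CLASS FEEDTYPE;
VAR WEIGHTGAIN;
TITLE 'Exercise - selected statistics by FEEDTYPE';
RUN;
```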
EXAMPLE 3: Using PROC MEANS to find OUTLIERS
PROC MEANS is a quick way to find large or small values in your data set that may be considered outliers (see also PROC UNIVARIATE). This example shows the results of using PROC MEANS, where the MINIMUM and MAXIMUM identify unusual values in the data set. (PROCMEANS3.SAS)
DATA WEIGHT;
INPUT TREATMENT LOSS @@;
DATALINES;
2 1.0 1 3.0 1 -1.0 1 1.5 1 0.5 1 3.5 1 -99
2 4.5 3 6.0 2 3.5 2 7.5 2 7.0 2 6.0 2 5.5
1 1.5 3 -2.5 3 -0.5 3 1.0 3 .5 3 78 1 .6 2 3 2 4 3 9 1 7 2 2
;
ODS RTF;
PROC MEANS; VAR LOSS;
TITLE 'Find largest and smallest values';
RUN;
ODS RTF CLOSE;
Notice that in this output, PROC MEANS shows a minimum of -99 (possibly a missing value code) and a maximum of 78 (possibly a miscoded number). This is a quick way to find outliers in your data set.
Analysis Variable : LOSS
| N | Mean | Std Dev | Minimum | Maximum |
| 26 | 2.0423077 | 25.4650062 | -99.0000000 | 78.0000000 |
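A possible follow-up, sketched below, is to list the suspicious records and then rerun PROC MEANS after recoding. Both the cutoffs of -50 and 50 and the decision to treat -99 as a missing value code are assumptions made for this illustration; the data set name WEIGHT2 is also hypothetical. The value 78 is left unchanged because the tutorial does not say what the correct value would be.

```sas
DATA WEIGHT;
INPUT TREATMENT LOSS @@;
DATALINES;
2 1.0 1 3.0 1 -1.0 1 1.5 1 0.5 1 3.5 1 -99
2 4.5 3 6.0 2 3.5 2 7.5 2 7.0 2 6.0 2 5.5
1 1.5 3 -2.5 3 -0.5 3 1.0 3 .5 3 78 1 .6 2 3 2 4 3 9 1 7 2 2
;
* List the suspicious records (the cutoffs are arbitrary assumptions) *;
PROC PRINT DATA=WEIGHT;
WHERE LOSS < -50 OR LOSS > 50;
RUN;
* Assumption - treat -99 as a missing value code and recode it *;
DATA WEIGHT2;
SET WEIGHT;
IF LOSS = -99 THEN LOSS = .;
RUN;
PROC MEANS DATA=WEIGHT2 N NMISS MEAN STD MIN MAX MAXDEC=2;
VAR LOSS;
TITLE 'LOSS after recoding -99 to missing';
RUN;
```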
EXAMPLE 4: Using PROC MEANS to perform a single sample t-test (or Paired t-test)
To compare two paired groups (such as in a before-after situation) where both observations are taken from the same or matched subjects, you can perform a paired t-test using PROC MEANS. To do this, convert the paired data into a difference variable and perform a single sample t-test. For example, suppose your data contained the variables WBEFORE and WAFTER (before and after weight on a diet) for 8 subjects. To perform a paired t-test using PROC MEANS, follow these steps:
- Read in your data.
- Calculate the difference between the two observations (WLOSS = WAFTER - WBEFORE, so a negative value means weight was lost), and
- Report the mean loss, t-statistic and p-value using PROC MEANS.
The hypotheses for this test are:
Ho: μLoss = 0 (The average weight loss was 0)
Ha: μLoss ≠ 0 (The average weight loss was different from 0)
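Assuming the usual one-sample t-test on the differences, which is what the T and PRT keywords report, the test statistic for the hypotheses above can be written as follows. The symbols are notation introduced here for illustration: \bar{d} is the mean difference, s_d the standard deviation of the differences, and n the number of pairs.

```latex
\[
t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}, \qquad \text{df} = n - 1
\]
```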
For example, the following code performs a paired t-test for
weight loss data:
(PROCMEANS4.SAS)
DATA WEIGHT;
INPUT WBEFORE WAFTER;
* Calculate WLOSS in the DATA step *;
WLOSS=WAFTER-WBEFORE;
DATALINES;
200 190
175 154
188 176
198 193
197 198
310 240
245 204
202 178
;
ODS RTF;
PROC MEANS N MEAN T PRT; VAR WLOSS;
TITLE 'Paired t-test example using PROC MEANS';
RUN;
ODS RTF CLOSE;
Notice that the actual test is performed on the new variable called WLOSS, which is why it is the only variable named in the VAR statement. This is essentially a one-sample t-test. The statistics of interest are the mean of WLOSS, the t-statistic associated with the null hypothesis for WLOSS, and the p-value. The SAS output is as follows:
Paired t-test example using PROC MEANS
Analysis Variable : WLOSS
| N | Mean | t Value | Pr > |t| |
| 8 | -22.7500000 | -2.79 | 0.0270 |
The mean of the variable WLOSS is -22.75 (weight decreased by 22.75 on average, since WLOSS is after minus before). The t-statistic associated with the null hypothesis is -2.79, and the p-value for this paired t-test is p = 0.027, which provides evidence to reject the null hypothesis at the 0.05 level.
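As a cross-check, the same paired comparison can also be requested from PROC TTEST with its PAIRED statement, which forms the difference internally. This is an alternative sketch, not the method used in PROCMEANS4.SAS; it repeats the data so that it runs on its own.

```sas
DATA WEIGHT;
INPUT WBEFORE WAFTER;
DATALINES;
200 190
175 154
188 176
198 193
197 198
310 240
245 204
202 178
;
* PAIRED WAFTER*WBEFORE analyzes the difference WAFTER - WBEFORE *;
PROC TTEST DATA=WEIGHT;
PAIRED WAFTER*WBEFORE;
TITLE 'Paired t-test using PROC TTEST';
RUN;
```

The t value and p-value should agree with the PROC MEANS results, and PROC TTEST additionally reports confidence limits for the mean difference.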
EXAMPLE 5: Using PROC MEANS to output statistics (advanced)
Suppose you have a data set and you want to add a column containing a z-score based on the mean and standard deviation of a variable. Here is one way to do that.
The following data set contains weights of 12 children. You want to add a column containing the difference of each weight from the mean of the WEIGHT variable. For good measure, also calculate the z-score.
DATA WT;
INPUT WEIGHT;
DATALINES;
64
71
53
67
55
58
77
57
56
51
76
68
;
PROC MEANS NOPRINT DATA=WT; VAR WEIGHT; OUTPUT OUT=WTMEANS MEAN=WTMEAN STDDEV=WTSD;
RUN;
DATA WTDIFF; SET WT;
IF _N_=1 THEN SET WTMEANS;
DIFF=WEIGHT-WTMEAN;
Z=DIFF/WTSD; * CREATES STANDARDIZED SCORE (Z-SCORE);
RUN;
ODS RTF;
PROC PRINT DATA=WTDIFF; VAR WEIGHT DIFF Z;
RUN;
ODS RTF CLOSE;
The statement
OUTPUT OUT=WTMEANS MEAN=WTMEAN STDDEV=WTSD;
creates a SAS data file containing a single record with the variables WTMEAN and WTSD (and the automatic variables _TYPE_ and _FREQ_). You can then use that information to calculate the desired values, as is done in the code:
DATA WTDIFF; SET WT;
IF _N_=1 THEN SET WTMEANS;
DIFF=WEIGHT-WTMEAN;
Z=DIFF/WTSD; * CREATES STANDARDIZED SCORE (Z-SCORE);
RUN;
The first SET statement (SET WT) reads in the observations of the WT data set, one per iteration of the DATA step. The statement
IF _N_=1 THEN SET WTMEANS;
reads in the first (and only) record from the WTMEANS data set and adds WTMEAN and WTSD (and the two automatic variables) to the new WTDIFF data set. Because variables read with SET are retained across iterations, these values stay available for every observation, allowing you to do the calculations to come up with the DIFF and Z values.
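If only the z-score is needed, PROC STANDARD offers a shorter route. The sketch below is an alternative to the approach described above, not part of the tutorial files; the data set names WTCOPY and WTZ are hypothetical. It repeats the WT data so that it runs on its own.

```sas
DATA WT;
INPUT WEIGHT;
DATALINES;
64
71
53
67
55
58
77
57
56
51
76
68
;
* Copy WEIGHT into Z because PROC STANDARD overwrites the variable it standardizes *;
DATA WTCOPY;
SET WT;
Z = WEIGHT;
RUN;
* MEAN=0 STD=1 rescales Z to mean 0 and standard deviation 1 *;
PROC STANDARD DATA=WTCOPY MEAN=0 STD=1 OUT=WTZ;
VAR Z;
RUN;
PROC PRINT DATA=WTZ;
VAR WEIGHT Z;
RUN;
```

The Z column should match the one produced by the DATA step approach. The PROC MEANS route remains the more general one when other derived columns, such as DIFF, are needed as well.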
The resulting data set contains the following information:
| Obs | WEIGHT | DIFF | Z |
| 1 | 64 | 1.25 | 0.13910 |
| 2 | 71 | 8.25 | 0.91808 |
| 3 | 53 | -9.75 | -1.08501 |
| 4 | 67 | 4.25 | 0.47295 |
| 5 | 55 | -7.75 | -0.86244 |
| 6 | 58 | -4.75 | |