EPISTAT
Statistical Package for the IBM Personal Computer
Version 2.1, 1983
Written by: Tracy L. Gustafson, M.D.

INTRODUCTION

EPISTAT is a collection of programs written in BASICA for statistical analysis of small to medium-sized data samples (< 1000 observations per sample and < 28 data samples per file). It includes programs to ENTER, APPEND, and EDIT data, as well as perform several kinds of data TRANSFORMATIONS. The datafiles can be PRINTED, GRAPHED, or SAVED to disk. The 21 programs in EPISTAT can also perform 34 common statistical tests or functions.

The programs are intended to be as self-explanatory and user-friendly as possible. All questions can be answered with a number, a "Y" for yes, or an "N" for no. A thorough study of this guide is not necessary before using the programs. On the other hand, neither the programs nor this manual purport to TEACH the proper use or interpretation of statistics. Rather, some familiarity with the kinds of data required and the underlying assumptions appropriate to each statistical test is assumed.

Some of the programs emphasize epidemiologic and medical applications. Despite the wording of various program questions or statements, these tests also apply to many other types of data. For further explanations of tests, refer to:

1. Colton, Theodore. Statistics in Medicine. Little, Brown and Co. Boston, 1974.

2. Fleiss, Joseph. Statistical Methods for Rates and Proportions. John Wiley and Sons. New York, 1973.

CAVEAT: These programs have been tested extensively, but I cannot guarantee that they will work correctly with every possible data set or in every possible situation. Incorrect results are usually due to errors in the format or type of data entered. If you believe you have discovered a problem in the programs, please write me. I intend to fix any bugs that are brought to my attention.
It is good practice to regularly compare the results obtained by programs in EPISTAT with results obtained by your previous method of calculation until you are familiar with each program. ANY unexpected result should be questioned and double-checked by reference to tables or another method of calculation.

INDEX TO EPISTAT

The following statistical tests and functions are available:

TEST or FUNCTION                                  PROGRAM NAME
----------------                                  ------------
Analysis of variance (1-way)......................ANOVA
Analysis of variance (2-way)......................ANOVA
Bayes' theorem:
    False positive and false negative tests.......BAYES
    Probability of event given positive test......BAYES
Binomial distribution.............................BINOMIAL
Chi-square distribution...........................CHISQR
Chi-square test...................................CHISQR
Correlation coefficient (Pearson's)...............CORRELAT
F distribution....................................ANOVA
Fisher's exact test...............................FISHERS
Linear regression analysis........................LNREGRES
Mantel-Haenszel Chi-square test...................MHCHISQR
Mantel-Haenszel for multiple controls.............MHCHIMLT
McNemar's test....................................MCNEMAR
Mean..............................................DATA-ONE
Median............................................DATA-ONE
Normal distribution...............................NORMAL
Percent of values in given range..................NORMAL
Poisson distribution..............................POISSON
Random sample generator:
    Select sample from a population................RANDOMIZ
    Assign unpaired cases and controls.............RANDOMIZ
    Assign paired cases and controls...............RANDOMIZ
Rank correlation (Spearman's).....................RANKTEST
Rank sum test.....................................RANKTEST
Rates adjusted, direct method.....................RATEADJ
Rates adjusted, indirect method...................RATEADJ
Sample size calculations:
    For estimating population rate.................SAMPLSIZ
    For unpaired case-control study................SAMPLSIZ
    For paired case-control study..................SAMPLSIZ
Signed rank test..................................RANKTEST
Standard deviation................................DATA-ONE
Student's T-test (independent samples)............T-TEST
Student's T-test (paired samples).................T-TEST
T distribution....................................T-TEST

In addition, the following data-handling capabilities are available:

DATA MANIPULATION                                 PROGRAM NAME
-----------------                                 ------------
Determine best test and program names.............EPISTAT
Enter, append and edit data.......................DATA-ONE
Graph data in histogram...........................HISTOGRM
Print data (sorted or as entered).................DATA-ONE
Perform data transformations......................LNREGRES
Save data to disk file............................DATA-ONE
Transfer data samples from one file to another....FILETRAN

SYSTEM REQUIREMENTS FOR EPISTAT

MINIMUM                     OPTIMAL
IBM PC with 64K RAM         IBM PC with 96K RAM
One 160K disk drive         Two disk drives
Monochrome monitor          Color graphics adapter
BASICA                      Hi-res color monitor
                            BASICA
                            IBM or Epson printer with graphics

EPISTAT - OVERALL PROGRAM DESCRIPTION

All calculations in EPISTAT are performed using single precision. Although it may first appear that double precision would be more appropriate for statistical tests, "double" precision makes little or no real improvement in precision in these programs. Many of the algorithms used to evaluate p values use trigonometric functions which are calculated in single precision anyway. Specifying double precision only serves to slow the calculations considerably.

For best results, data entries should be numbers with magnitudes between 1E-7 and 1E+7. Larger or smaller numbers should be multiplied by an appropriate power of 10 before entry and analysis in EPISTAT.
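
The practical limit of single precision can be illustrated with a short sketch. It is written in Python rather than BASICA, and it uses the IEEE 754 single-precision format as a stand-in for BASICA's single-precision numbers; that substitution is an assumption made only for illustration, and the helper name to_single is hypothetical. Both formats carry roughly seven significant digits, so values that differ only beyond the seventh digit cannot be told apart.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest IEEE 754 single.

    Assumption: IEEE single precision stands in for BASICA single precision.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Seven-digit values survive intact.
print(to_single(1234567.0) == 1234567.0)                   # True

# Nine-digit values that differ in the last digit collapse to the same number.
print(to_single(123456789.0) == to_single(123456790.0))    # True

# Adding 1 to a value near 1E+8 is lost entirely.
print(to_single(1.0e8 + 1.0) == to_single(1.0e8))          # True
```
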
All EPISTAT programs are written so that as much pertinent information about the test as possible can fit on the final screen. This feature allows a summary printed copy to be produced simply by pressing . This will work any time there is a pause in the program display. Three programs, "DATA-ONE", "HISTOGRM", and "RANDOMIZ", produce printed reports without using . In these, simply follow program instructions to route output to your printer.

EPISTAT is the introductory program in the EPISTAT package. DATA-ONE is the major data entry, editing, and printing program. Most of the programs in EPISTAT can evaluate data entered and saved using DATA-ONE. Many of the programs can, in addition, evaluate summary data entered without first using DATA-ONE. The programs marked with a star (*) in the individual descriptions that follow can evaluate raw data SAVED to disk with DATA-ONE. Non-starred programs provide their own data entry routines.

INDIVIDUAL PROGRAM DESCRIPTIONS

(1) "EPISTAT"

This introductory program lists the available programs and aids the user in selecting the best statistical test for his or her data.

(2) "DATA-ONE"

DATA ENTRY: This is the central data entry program for the EPISTAT package. Initial data entry is accomplished by selecting option 1 and following the instructions to name each sample. Type in your numbers and press twice after each entry. The maximum number of samples (S) in a datafile is 28 with a color and 7 with a monochrome adapter. The maximum number of records in each sample is 2000/S. A blank record can be entered if no data is available for a given cell (or if 2 samples with different numbers of observations are being entered) by pressing , then Key F2. To exit the data entry mode, simply press then key F10 following the last record. The mean, median and standard deviation are then calculated and displayed automatically.
When you return to the main menu, choose option 5 (see below) to save your datafile to disk for future modification or use by other programs in the EPISTAT package.

Although all entries in a datafile are treated as numbers by DATA-ONE, it is possible to enter character strings in a record. Such strings will be treated as zeros in all calculations. Nevertheless, when entering several samples, it often improves data readability to use the "Sample #1" column for names or identifying information about each ROW of data. Thus, DATA-ONE allows one to specify a name for each column and row in the datafile.

DATA MODIFICATION: Option 2, APPEND, allows one to add more observations to a sample after initial data entry has been terminated. Option 3, EDIT, allows one to delete or replace incorrect data entries. Both of these options can be used to modify a datafile that has been loaded from disk. Of course, if you modify a datafile in any way, you will want to SAVE the modified datafile to disk again using Option 5.

PRINTING DATA: To view or review a datafile, use Option 4 to obtain a printout to screen or printer. To print a datafile exactly as it was keyed in, request the printout in INPUT order. DATA-ONE can also present the data SORTED in the order of any selected sample. Remember, only numeric data is sorted by DATA-ONE, so it will not alphabetize a character field. Further, a sorted printout includes only the records that are NON-BLANK in the selected sort sample.

SAVING DATAFILES and LOADING DATAFILES: Option 5, SAVE datafile, writes your data to disk in a sequential file for later editing, review, or use by another program. DATA MUST BE SAVED TO DISK before it can be used by other programs in EPISTAT. The name chosen for each DATAFILE must conform to the rules for IBM disk file names (see p. 3-36 in BASIC manual). If you have a 2-drive system, you will probably want to put the EPISTAT disk in drive A: and SAVE datafiles on drive B.
To do so, simply precede each datafile name with B: (e.g. B:TESTDATA). Note that file names entered in DATA-ONE do not need to be enclosed in quotation marks.

(3) "ANOVA" *

Provides ONE-WAY and TWO-WAY analysis of variance. ONE-WAY ANOVA compares the means of 3 or more samples. TWO-WAY ANOVA compares the combined effects of 2 variables on a third (ROW and COLUMN effects). All samples in TWO-WAY ANOVA must have the same number of elements. The program also provides for evaluation of a known F value.

(4) "BAYES"

Using Bayes' theorem, this program calculates the rates of false positive and false negative tests given different sensitivities, specificities, and disease incidences. Using the formula in a different way, it can also calculate the probability of each of several diseases given a positive test.

(5) "BINOMIAL"

The binomial distribution allows calculation of the probability of an observed number compared to the expected. It assumes the variable is dichotomous and has an equal probability of occurring in each trial. This program calculates the ONE-TAILED probability of the entered number and all more extreme situations. For example, in the case of 2 heads in 10 tosses of a coin, the ONE-TAILED probability is the sum of the probabilities for 0, 1, and 2 heads out of 10 tosses.

(6) "CHISQR"

The Chi-square test evaluates either a table of data or a known chi-square value. 2 by 2 tables are automatically evaluated using Yates' correction. Tables larger than 15 by 10 cells will not fit on a single screen.

(7) "CORRELAT" *

Pearson's correlation coefficient assesses the correlation between paired samples. The probability of a given R value is evaluated using the T distribution.

(8) "FISHERS"

Fisher's exact test evaluates 2 by 2 tables of discrete variables. It is particularly valuable when the Chi-square test cannot be used because the expected value for a cell is < 5. However, this program can evaluate some tables where A+B+C+D > 200.
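
The one-tailed calculation described under "BINOMIAL" above can be checked by hand with a few lines of code. The following is a minimal sketch in Python, not the BASICA code used by EPISTAT; the function name binomial_lower_tail is hypothetical, and it covers only the case where the observed count is below the expected count (as in the coin example).

```python
from math import comb


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """Probability of k or fewer successes in n independent trials.

    Each trial succeeds with probability p. This sums the exact
    binomial probabilities for 0, 1, ..., k successes.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# 2 heads in 10 tosses of a fair coin: P(0) + P(1) + P(2) = 56/1024
print(binomial_lower_tail(2, 10, 0.5))  # 0.0546875
```

For an observed count above the expected count, the corresponding upper tail would be summed instead.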
(9) "HISTOGRM" *

The histogram program graphs a data sample according to user specifications on the high resolution graphics screen. This screen image can be printed on an IBM or Epson printer with graphics features. To obtain a printed copy, simply press key F10 after the graph is displayed on screen. (Printing takes several minutes.) If you do not want a printed copy, press key F1 to return to the program.

(10) "LNREGRES" *

Linear regression analysis calculates the least-squares regression line for paired samples. It then uses the T distribution to determine if the calculated slope is significantly different from zero. The program also allows the user to specify several types of data transformations prior to regression analysis. Transformed data samples can be saved to disk for future use (or printout).

(11) "MHCHISQR"

The Mantel-Haenszel Chi-square test evaluates the relationship between two discrete variables while controlling for the effect of a third variable.

(12) "MHCHIMLT" *

The Mantel-Haenszel Chi-square test for multiple controls compares one sample (the case sample) to 2 or more matched samples (control samples). The program can evaluate raw data input using DATA-ONE, if the data is entered as "1" for factor present, and "0" for factor absent in each case and control sample record. The program will also evaluate summary data entered per program instructions.

(13) "MCNEMAR"

McNemar's test, or the paired Chi-square test, evaluates 2 by 2 tables of paired discrete variables. It compares discordant pairs (using Yates' correction) and calculates a probability that compares very well to the results of the binomial distribution.

(14) "NORMAL" *

The normal distribution has innumerable uses in statistics. This program specifically addresses three situations: First, it compares a sample mean to a population mean. Second, it calculates the proportion of samples that would be expected to fall in any given range under the normal curve. Third, it calculates the probability associated with any given value of z.

(15) "POISSON"

The Poisson distribution applies to dichotomous variables when the number of successes can be counted, but the number of failures cannot. It can also be used to approximate the binomial distribution when the number of trials is large (>100) and the expected rate is small (<5%). This program, like the Binomial program, calculates a ONE-TAILED probability.

(16) "RANDOMIZ"

This random sample generator aids in the selection of random samples for several purposes. It can provide a random subset of a larger population, or it can assign cases randomly to independent or paired groups for case-control studies.

(17) "RANKTEST" *

Three non-parametric tests of significance are performed by this program. They are appropriate for any sample which is clearly NOT normally distributed. They also specifically apply when quantitative variables are not available but qualitative ranks are. The RANK SUM TEST compares 2 independent samples. The SIGNED RANK TEST compares the medians of paired samples. Spearman's RANK CORRELATION calculates a correlation coefficient for paired samples. For the first two tests, the program calculates a TWO-TAILED exact probability associated with the various rank sums. Note that for samples larger than 20 observations, the latter calculation can take several minutes.

(18) "RATEADJ" *

The rates adjustment program will adjust sample rates by either the direct or indirect methods. For DIRECT method adjustment, the datafile entered in DATA-ONE must include the study sample rates and the standard population figures. For INDIRECT method adjustment, the datafile used must include the study population figures and the standard population rates. After INDIRECT rate adjustment, the program will evaluate the probability of the observed number of cases using the Poisson distribution for small numbers, or the Chi-square distribution for large observed numbers.
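
The INDIRECT method described under "RATEADJ" above can be sketched as follows. This is an illustrative outline in Python, not the BASICA code used by EPISTAT. The function names, the stratum figures, and the choice of an upper-tail Poisson probability (observed number or more) are all assumptions made for the example; the manual does not state which tail the program reports.

```python
from math import exp, factorial


def expected_cases(study_population, standard_rates):
    """Expected cases if each stratum of the study population
    experienced the standard population's rate for that stratum."""
    return sum(n * r for n, r in zip(study_population, standard_rates))


def poisson_upper_tail(observed, expected):
    """Probability of `observed` or more events when `expected` are expected.

    Computed as 1 minus the sum of Poisson probabilities for 0..observed-1.
    """
    below = sum(exp(-expected) * expected**k / factorial(k)
                for k in range(observed))
    return 1.0 - below


# Hypothetical figures for three age strata (not taken from the manual).
study_population = [1000, 2000, 500]     # persons in each stratum
standard_rates = [0.001, 0.002, 0.004]   # standard rate per person

expected = expected_cases(study_population, standard_rates)  # about 7 cases
observed = 12                                                # hypothetical
print(expected, poisson_upper_tail(observed, expected))
```
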
(19) "SAMPLSIZ" *

The sample size program calculates the approximate sample sizes required to achieve statistical significance given certain specified levels of certainty. The following formulas are used:

For a survey:

    N = [ z(a)*SQR(pi*(1-pi)) / d ] squared

    If N > 10% of entire population, then N' = N / (1+N/TP).

For a paired case-control study:

    N = [ (z(a)*SQR(pi*(1-pi)) - z(b)*SQR(pi*(1-pi))) / (PT-pi) ] squared

For an unpaired case-control study:

        [ z(a)*SQR(2*pi*(1-pi)) - z(b)*SQR(PT*(1-PT) + PC*(1-PC)) ]
    N = [ --------------------------------------------------------- ] squared
        [                         (PT - PC)                         ]

(20) "T-TEST" *

The Student's T-Test compares the means of two samples. The program provides both the paired and unpaired T-Test calculations. The program will also evaluate a known T value.

(21) "FILETRAN" *

On occasion you may find that you want to compare 2 samples that are already entered in separate DATAFILES. Or you may have standard population figures in one datafile and sample rates to be adjusted in a different datafile. EPISTAT programs, however, only allow analysis of samples that are in a SINGLE DATAFILE. Rather than reenter one or both samples from the keyboard, this file transfer program allows you to add a sample from DATAFILE #1 to any other DATAFILE #2. You may also create an entirely new datafile by selecting one sample from DATAFILE #1 and another from DATAFILE #2. Yet another option in FILETRAN is the ability to combine 2 samples into a single one by APPENDING one to the other. This utility program should make reentry of data unnecessary, regardless of the number of tests applied to it.

NOTICE
---------------------------------------------------------------------
Users may copy EPISTAT and distribute it to others on the following conditions:

1. The programs are not modified in any way.
2. Individual programs are not distributed separately.
3. No fee is charged for copying or distribution.
---------------------------------------------------------------------

The concept of user-supported software is based on three principles:

1. The value and utility of software (programs) are best assessed by each user on his or her own system with his or her own data. Only after using a program can one determine whether it serves one's personal applications, needs, and tastes.

2. The creation of independent personal computer software requires a substantial commitment of time and effort. Rather than duplicate this effort time after time, the computing community can and should support individual creative efforts.

3. The copying and networking of programs should be encouraged, not restricted. The entire computing community benefits when the burden of copy-protection is removed.

If after using EPISTAT, you find it of value, your contribution in any amount will be appreciated ($20 suggested). Send contributions to:

    Tracy L. Gustafson, M.D.
    1705 Gattis School Road
    Round Rock, Texas 78664

Thank you, and good luck.