Davis & Pecar: Business Statistics Using Excel 2e
The glossary terms are arranged alphabetically.
The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm
Addition law for mutually exclusive events
Addition law for mutually exclusive events is a result used to determine the probability that event A or event B occurs, but both events cannot occur at the same time.
The additive time series model is a model whereby the separate components of the time series are added together to identify the actual time series value.
Adjusted R squared measures the proportion of the variation in the dependent variable accounted for by the explanatory variables and adjusted for the number of degrees of freedom.
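As a minimal sketch, the adjustment is commonly written as 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of explanatory variables. The function and argument names below are illustrative and do not come from the glossary.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjust R squared for n observations and k explanatory variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than k + 1")
    # Penalise R squared by the ratio of total to residual degrees of freedom.
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical model: R squared of 0.80 from 21 observations and 2 predictors.
print(adjusted_r_squared(0.80, n=21, k=2))
```

The adjusted value is never larger than the unadjusted R squared, so adding a weak explanatory variable can lower it.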
Aggregate price index
A measure of the value of money based on a collection (a basket) of items and compared to the same collection of items at some base date or a period of time.
Alpha refers to the probability that the true population parameter lies outside the confidence interval. Not to be confused with the symbol alpha in a time series context, e.g. exponential smoothing, where alpha is the smoothing constant.
Alternative hypothesis (H1)
The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish.
The sum of a list of numbers divided by the number of numbers.
Assumptions
An assumption is a proposition that is taken for granted.
Autocorrelation is the correlation between members of a time series of observations and the same values shifted at a fixed time interval.
A bar chart is a way of summarizing a set of categorical data.
Base index period
A value of a variable relative to its previous value at some fixed base.
Beta refers to the probability that a false population parameter lies inside the confidence interval.
A Binomial distribution can be used to model a range of discrete random data variables.
A binomial experiment is an experiment with a fixed number of independent trials. Each trial has exactly two outcomes and the probability of each outcome in a binomial experiment remains the same for each trial.
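Under the conditions just described (n independent trials, constant success probability p), the probability of exactly k successes follows the standard binomial formula. The sketch below uses illustrative names and a fair-coin example that are not part of the glossary.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    # comb(n, k) counts the ways to place k successes among n trials.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))
```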
A box plot is a way of summarizing a set of data measured on an interval scale.
A box-and-whisker plot is a way of summarizing a set of data measured on an interval scale.
Brown’s single exponential smoothing method
Brown’s single exponential smoothing method is the basis for a forecasting method called Simple Exponential Smoothing.
A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.
Central Limit Theorem
The Central Limit Theorem states that whenever a random sample of size n is taken from any distribution with mean μ and variance σ², the sample mean will be approximately normally distributed with mean μ and variance σ²/n, provided n is sufficiently large.
Measures the location of the middle or the centre of a distribution.
Chance is the unknown and unpredictable element in happenings that seems to have no assignable cause.
Chi square distribution
The chi square distribution is a mathematical distribution that is used directly or indirectly in many tests of significance.
Chi square test
Apply the chi square distribution to test for homogeneity, independence, or goodness-of-fit.
Chi square test for goodness-of-fit
The chi-square goodness-of-fit test of a statistical model describes how well the statistical model fits a set of observations.
Chi square test of association
The chi-square test of association provides a method for testing the association between the row and column variables in a two-way table where the null hypothesis H0 assumes that there is no association between the variables.
Chi square test of independent samples
Pearson chi-square test is a non-parametric test for a difference in proportions between two or more independent samples.
Class boundaries separate one class in a grouped frequency distribution from another.
Class limits separate one class in a grouped frequency distribution from another.
The class mid-point is the midpoint of each class interval.
Classes provide several convenient intervals into which the values of the variable of a frequency distribution may be grouped.
Classical time series analysis
Approach to forecasting that decomposes a time series into certain constituent components (trend, cyclical, seasonal, and random component), makes estimates of each component, and then re-composes the time series and extrapolates into the future.
Classical time series decomposition
Classical time series decomposition is a statistical method that deconstructs a time series into notional components.
Coefficient of determination (COD)
The proportion of the variance in the dependent variable that is predicted from the independent variable.
Coefficient of variation
The coefficient of variation measures the spread of a set of data as a proportion of its mean.
Conditional probability is the probability of an event occurring given that another event has already occurred.
Confidence interval (1 − α)
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter.
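As a hedged illustration, the sketch below computes a (1 − α) interval for a population mean, assuming the population standard deviation is known so that a z critical value applies. The function name and the example figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval for the mean, assuming a known population sigma."""
    alpha = 1 - confidence
    # Two-sided critical value leaving alpha / 2 in each tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: mean 50, sigma 10, n = 100.
print(z_confidence_interval(50, 10, 100))
```

When sigma is unknown and estimated from the sample, a t critical value would normally replace z.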
A contingency table is a table of frequencies classified according to the values of the variables in question.
Continuous probability distribution
If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.
Continuous random variable
A continuous random variable is one which takes an infinite number of possible values.
A set of data is said to be continuous if the values belong to a continuous interval of real values.
Covariance is a measure of how much two variables change together.
Critical test statistic
The critical value for a hypothesis test is a limit at which the value of the sample test statistic is judged to be such that the null hypothesis may be rejected.
The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected.
Cross tabulation is the process of tabulating the results of one variable against another, using two or more data sources (variables).
Cumulative distribution function
The cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
Cumulative frequency distribution
The cumulative frequency for a value x is the total number of scores that are less than or equal to x.
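A short sketch of this running total, using made-up scores and an illustrative function name:

```python
from collections import Counter
from itertools import accumulate


def cumulative_frequencies(scores):
    """Map each distinct value x to the count of scores <= x."""
    counts = Counter(scores)
    values = sorted(counts)
    # Running total of the frequencies, taken in ascending order of value.
    running = accumulate(counts[v] for v in values)
    return dict(zip(values, running))


print(cumulative_frequencies([1, 2, 2, 3, 3, 3]))
```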
Cyclical variations (C)
The cyclical variations are the component of the time series model that results in periodic above-trend and below-trend behaviour of the time series lasting more than one year.
Degrees of freedom
Refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data.
A dependent variable is what you measure in the experiment and what is affected during the experiment.
Discrete data are a set of data where the values/observations belonging to it are distinct and separate, i.e. they can be counted (1,2,3. . .).
Discrete probability distribution
If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.
Discrete random variable
A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4 . . .
A set of data is said to be discrete if the values belonging to it can be counted as 1, 2, 3 . . .
The variation between data values is called dispersion.
The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
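The statistic is usually computed as the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below assumes that form for lag 1; the function name and residual values are illustrative.

```python
def durbin_watson(residuals):
    """Lag-1 Durbin-Watson statistic for a sequence of residuals."""
    # Squared changes between each residual and the one before it.
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e**2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that alternate in sign (negative autocorrelation).
print(durbin_watson([1, -1, 1, -1]))
```

Values near 2 suggest little lag-1 autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation.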
Empirical probability, also known as relative frequency, or experimental probability, is the ratio of the number of outcomes in which a specified event occurs to the total number of trials.
Equal variance (homoscedasticity)
Homogeneity of variance (homoscedasticity) assumptions state that the error variance should be constant.
A method of validating the quality of forecasts. Involves calculating the mean error, the mean squared error, and the percentage error, etc.
An estimate is an indication of the value of an unknown quantity based on observed data.
Event
An event is any collection of outcomes of an experiment.
In a contingency table the expected frequencies are the frequencies that you would predict in each cell of the table, if you knew only the row and column totals, and if you assumed that the variables under comparison were independent.
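Under the independence assumption, each expected frequency is (row total × column total) / grand total. A minimal sketch with a made-up 2×2 table and an illustrative function name:

```python
def expected_frequencies(observed):
    """Expected cell counts from row and column totals under independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts.
print(expected_frequencies([[10, 20], [30, 40]]))
```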
Experimental probability approach
Experimental probability approach (see Empirical approach).
One of the methods of forecasting that uses a constant (or several constants) to predict future values by ‘smoothing’ the past values in the series. The effect of this constant decreases exponentially as the older observations are taken into calculation.
An underlying time series trend that follows the movements of an exponential curve.
An extreme value is an unusually large or an unusually small value compared with the others in the data set.
The F distribution (also known as the Fisher–Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.
Tests whether two population variances are the same based upon sample values.
F test for two population variances (variance ratio test)
F test for two population variances (variance ratio test) is used to test if the variances of two populations are equal.
A five-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set.
A method of predicting the future values of a variable, usually represented as the time series values.
A difference between the actual and the forecasted value in the time series.
A number of the future time units until which the forecasts will be extended.
Frequency definition of probability
Frequency definition of probability defines an event’s probability as the limit of its relative frequency in a large number of trials.
Systematic method of showing the number of occurrences of observational data in order from least to greatest.
A graph made by joining the middle-top points of the columns of a frequency histogram.
General addition probability law
General addition probability law is a result used to determine the probability that event A or event B occurs or both occur.
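The general law can be written as P(A or B) = P(A) + P(B) − P(A and B); for mutually exclusive events the last term is zero. The sketch below checks this on a single roll of a fair die, an example chosen here for illustration.

```python
from fractions import Fraction

# Equally likely outcomes of one roll of a fair die (illustrative example).
SAMPLE_SPACE = frozenset({1, 2, 3, 4, 5, 6})


def prob(event):
    """Probability of an event as favourable outcomes over all outcomes."""
    return Fraction(len(set(event) & SAMPLE_SPACE), len(SAMPLE_SPACE))


def prob_a_or_b(a, b):
    """General addition law: subtract the overlap counted twice."""
    return prob(a) + prob(b) - prob(set(a) & set(b))


even = {2, 4, 6}
greater_than_four = {5, 6}
print(prob_a_or_b(even, greater_than_four))
```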
A graph is a picture designed to express words, particularly the connection between two or more quantities.
Grouped frequency distributions
Data arranged in intervals to show the frequency with which the possible values of a variable occur.
A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous).
Histogram with unequal class intervals
A histogram with unequal class intervals is a graphical representation showing a visual impression of the distribution of data where class widths are of different sizes.
Hypothesis test procedure
A series of steps to determine whether to accept or reject a null hypothesis, based on sample data.
Independence of errors
Independence of errors means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. The opposite of independence is called autocorrelation.
Two events are independent if the occurrence of one of the events has no influence on the occurrence of the other event.
An independent variable is the variable you have control over, what you can choose and manipulate.
A value of a variable relative to its previous value at some base.
Value of the regression equation (y) when the x value = 0.
The interquartile range is a measure of the spread of or dispersion within a data set.
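The interquartile range is the difference between the upper and lower quartiles, Q3 − Q1. Quartile conventions differ between textbooks and software, so the sketch below simply adopts the default method of Python's `statistics.quantiles`; results from Excel or another package may differ slightly for the same data.

```python
from statistics import quantiles


def interquartile_range(data):
    """Q3 minus Q1, using the default quartile method of statistics.quantiles."""
    q1, _median, q3 = quantiles(data, n=4)
    return q3 - q1


# Hypothetical data set.
print(interquartile_range([1, 2, 3, 4, 5, 6, 7]))
```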
An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or ‘intervals’) is the same but the zero point is arbitrary.
The irregular variations are the component of the time series model that reflects the random variation of the time series values beyond what can be explained by the trend, cyclical, and seasonal components.
Kurtosis is a measure of the ‘peakedness’ of the distribution.
The method of least squares is a criterion for fitting a specified model to observed data. It refers to finding the smallest (least) sum of squared differences between fitted and actual values.
Left-skewed (or negative skew) indicates that the tail on the left side of the probability density function is longer than the right side and the bulk of the values (possibly including the median) lie to the right of the mean.
Level of confidence
The confidence level is the probability value (1 − α) associated with a confidence interval.
Level of significance
The level of significance is the criterion used for rejecting the null hypothesis.
A linear relationship exists between variables if, when you plot their values, you get a straight line.
Linear regression analysis
Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.
Linear trend is a straight line fit to a data set.
A model that uses the logarithmic equation to approximate the time series.
Lower one tail test
A lower one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the left tail of the probability distribution.
Mann–Whitney U test
The Mann–Whitney U test is used to test the null hypothesis that two populations have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all.
McNemar’s test is a non-parametric method used on nominal data to determine whether the row and column marginal frequencies are equal.
The mean is a measure of the average data value for a data set.
Mean absolute deviation (MAD)
The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute values, i.e. the effects of the sign are ignored.
Mean absolute percentage error (MAPE)
The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute percentage values, i.e. the effects of the sign are ignored.
Mean error (ME)
The mean value of all the differences between the actual and forecasted values in the time series.
Mean percentage error (MPE)
The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as percentage values.
Mean square error (MSE)
The mean value of all the squared differences between the actual and forecasted values in the time series. The differences between these values are squared to avoid positive and negative differences cancelling each other.
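The five error measures above (ME, MAD, MSE, MPE, MAPE) can be sketched together. The code assumes the error is defined as actual minus forecast and that percentage errors are taken relative to the actual value; both are common conventions rather than statements from the glossary, and the data are made up.

```python
def forecast_error_measures(actual, forecast):
    """Return ME, MAD, MSE, MPE and MAPE for paired actual/forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    # Percentage errors relative to the actual values (assumed non-zero).
    pct_errors = [100 * e / a for e, a in zip(errors, actual)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
    }


# Hypothetical series: errors of -10, +10 and 0 cancel in ME but not in MAD.
print(forecast_error_measures([100, 200, 400], [110, 190, 400]))
```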
The median is the value halfway through the ordered data set.
The mixed time series blends both additive and multiplicative components together to identify the actual time series value.
The mode is the most frequently occurring value in a set of discrete data.
Averages calculated for a limited number of periods in a time series. Every subsequent period excludes the first observation from the previous period and includes the one following the previous period. This becomes a series of moving averages.
Moving average trend
The moving average trend is a method of forecasting or smoothing a time series by averaging each successive group of data points.
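A minimal sketch of a simple (unweighted, non-centred) moving average; the function name, window length, and series are illustrative.

```python
def moving_averages(series, window):
    """Average each successive group of `window` consecutive values."""
    if not 1 <= window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    # Each step drops the oldest value and takes in the next one.
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]


print(moving_averages([2, 4, 6, 8, 10], window=3))
```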
Multiple regression model
Multiple linear regression aims to find a linear relationship between a dependent variable and several possible independent variables.
Multiplication law is a result used to determine the probability that two events, A and B, both occur.
Multiplication law for independent events
The multiplication law for independent events states that the chance that both events happen simultaneously is the product of the chances that each occurs individually, e.g. P(A and B) = P(A)*P(B).
Multiplication law for joint events
see Multiplication law.
The multiplicative time series model is a model whereby the separate components of the time series are multiplied together to identify the actual time series value.
Methods that use more than one variable and try to predict the future values of one of the variables by using the values of other variables.
Mutually exclusive events are ones that cannot occur at the same time.
A set of data is said to be nominal if the values belonging to it can be assigned a label rather than a number.
Non-parametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable.
Non-seasonal is the component of variation in a time series which is not dependent on the time of year.
Non-stationary time series
A time series that does not have a constant mean and oscillates around this moving mean.
Normal approximation to the binomial
If the number of trials, n, is large, the binomial distribution is approximately equal to the normal distribution.
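The approximating normal distribution has mean np and variance np(1 − p). The sketch below compares an exact binomial cumulative probability with the normal approximation; the continuity correction of 0.5 is a standard refinement assumed here, not something stated in the glossary.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial random variable."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k) using N(np, np(1 - p)) with continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


# Hypothetical case: at most 55 successes in 100 trials with p = 0.5.
print(binomial_cdf(55, 100, 0.5), normal_approx_cdf(55, 100, 0.5))
```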
The normal distribution is a symmetrical, bell-shaped curve, centred at its expected value.
Normal probability plot
Graphical technique to assess whether the data is normally distributed.
Normality of errors
The normality of errors assumption states that the errors should be normally distributed. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be identically and independently distributed.
Null hypothesis (H0)
The null hypothesis, H0, represents a theory that has been put forward but has not been proved.
In a contingency table the observed frequencies are the frequencies actually obtained in each cell of the table, from our random sample.
One sample test
A one sample test is a hypothesis test for answering questions about the mean (or median) where the data are a random sample of independent observations from an underlying distribution.
One sample t-test for the population mean
A one sample t-test is a hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution where population variance is unknown.
One sample z-test for the population mean
A one-sample z-test is used to test whether a population parameter is significantly different from some hypothesized value.
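As a sketch for the case of a population mean with known standard deviation, the test statistic is z = (sample mean − hypothesized mean) / (σ/√n), and a two tail p-value follows from the standard normal distribution. The function name and figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and two tail p-value, assuming known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a value at least as extreme as z in either tail.
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail


# Hypothetical data: sample mean 52 against H0 mean 50, sigma 10, n = 100.
print(one_sample_z_test(52, 50, 10, 100))
```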
One tail test
A one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.
Ordinal scale is a scale where the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
A set of data is said to be ordinal if the values belonging to it can be ranked.
An outcome is the result of an experiment or other situation involving uncertainty.
An outlier is an observation in a data set which is far removed in value from the others in the data set.
Any statistic computed by procedures that assume the data were drawn from a particular distribution.
Pearson’s coefficient of correlation
Pearson’s correlation coefficient measures the linear association between two variables that have been measured on interval or ratio scales.
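A minimal sketch of the usual product-moment formula, r = Sxy / √(Sxx · Syy), where each S term is a sum of products of deviations from the mean. The names and data are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data lying exactly on a rising straight line.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```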
A pie chart is a way of summarizing a set of categorical data.
A point estimate (or estimator) is any quantity calculated from the sample data which is used to provide information about the population.
Point estimate of the population mean
Point estimate for the mean involves the use of the sample mean to provide a ‘best estimate’ of the unknown population mean.
Point estimate of the population proportion
Point estimate for the proportion involves the use of the sample proportion to provide a ‘best estimate’ of the unknown population proportion.
Point estimate of the population variance
Point estimate for the variance involves the use of the sample variance to provide a ‘best estimate’ of the unknown population variance.
Poisson distributions model a range of discrete random data variables.
Poisson probability distribution
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
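With average rate λ per interval, the probability of exactly k events is λ^k e^(−λ) / k!. A short sketch with illustrative names and a made-up rate:

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return lam**k * exp(-lam) / factorial(k)


# Hypothetical rate: on average 2 events per interval.
for k in range(4):
    print(k, poisson_pmf(k, 2.0))
```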
A polynomial line is a curved line whose curvature depends on the degree of the polynomial variable.
A model that uses an equation of any polynomial curve (parabola, cubic curve, etc.) to approximate the time series.
The population mean is the mean value of all possible values.
Population standard deviation
The population standard deviation is the standard deviation of all possible values.
The population variance is the variance of all possible values.
A model that uses an equation of a power curve (a parabola) to approximate the time series.
Probability provides a quantitative description of the likely occurrence of a particular event.
Probability of event A given that event B has occurred
See Conditional probability.
Probable represents that an event or events is likely to happen or to be true.
The p-value is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true.
Q1 is the lower quartile and is the data value a quarter way up through the ordered data set.
Q3 is the upper quartile and is the data value a quarter way down through the ordered data set.
Variables can be classified as descriptive or categorical.
Variables can be classified using numbers.
Quartiles are values that divide a sample of data into four groups containing an equal number of observations.
A random experiment is an experiment, trial, or observation that can be repeated numerous times under the same conditions.
A random sample is a sampling technique where we select a sample from a population of values.
A random variable is a function that associates a unique numerical value with every outcome of an experiment.
List data in order of size.
The range of a data set is a measure of the dispersion of the observations.
Ratio scale consists not only of equidistant points but also has a meaningful zero point.
Raw data is data collected in original form.
Region of rejection
The range of values that leads to rejection of the null hypothesis.
Regression analysis is used to model the relationship between a dependent variable and one or more independent variables.
A regression coefficient is a measure of the relationship between a dependent variable and an independent variable.
Relative frequency is another term for proportion; it is the value calculated by dividing the number of times an event occurs by the total number of times an experiment is carried out.
The residual represents the unexplained variation (or error) after fitting a regression model.
The differences between the actual and predicted values. Sometimes called forecasting errors. Their behaviour and pattern have to be random.
Right-skewed (or positive skew) indicates that the tail on the right side is longer than the left side and the bulk of the values lie to the left of the mean.
If a test is robust, the validity of the test result will not be affected by poorly structured data. In other words, it is resistant against violations of parametric assumptions.
The sample space is an exhaustive list of all the possible outcomes of an experiment.
Sample standard deviation
A sample standard deviation is an estimate, based on a sample, of a population standard deviation.
The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.
Sampling error refers to the error that results from taking one sample rather than taking a census of the entire population.
A sampling frame is the source material or device from which a sample is drawn.
A scatter plot is a plot of one variable against another variable.
Seasonal is the component of variation in a time series which is dependent on the time of year.
A component in the classical time series analysis approach to forecasting that covers seasonal movements of the time series, usually taking place inside one year’s horizon.
Seasonal time series
A time series, represented in units of time smaller than a year, that shows a regular pattern repeating itself over a number of these units of time.
Seasonal variations (S)
The seasonal variations are the component of the time series model that shows a periodic pattern over one year or less.
The shape of the distribution refers to the shape of a probability distribution and involves the calculation of skewness and kurtosis.
Significance level, α
The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis, H0, if it is in fact true.
The sign test is designed to test a hypothesis about the location of a population distribution.
Simple exponential smoothing
Simple exponential smoothing is a forecasting technique that uses a weighted average of past time series values to arrive at smoothed time series values that can be used as forecasts.
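One common way to write the recursion is S_t = α·y_t + (1 − α)·S_(t−1), where α is the smoothing constant. The sketch below assumes the first smoothed value is set equal to the first observation, which is only one of several initialisation choices; the names and data are illustrative.

```python
def simple_exponential_smoothing(series, alpha):
    """Return smoothed values; the last one can serve as the next forecast."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie between 0 and 1")
    smoothed = [series[0]]  # assumed initialisation: first observation
    for y in series[1:]:
        # Weight alpha on the newest value, (1 - alpha) on the previous level.
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


print(simple_exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger alpha makes the smoothed series react faster to recent values; a smaller alpha gives more weight to the older history.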
A simple index is designed to measure changes in some measure over time.
Skewness is defined as asymmetry in the distribution of the data values.
Gradient of the fitted regression line.
Smoothing constant
Smoothing constant is a parameter of the exponential smoothing model that provides the weight given to the most recent time series value in the calculation of the forecast value.
Spearman’s rank coefficient of correlation
Spearman’s rank correlation coefficient is applied to data sets when it is not convenient to give actual values to variables but one can assign a rank order to instances of each variable.
Measure of the dispersion of the observations (the square root of the variance).
Standard error of forecast
The square root of the variance of all forecasting errors adjusted for the sample size.
Standard error of the estimate (SEE)
The standard error of the estimate (SEE) is an estimate of the typical error in prediction, calculated as the square root of the average squared error.
Standard error of the mean
The standard error of the mean (SEM) is the standard deviation of the sample mean’s estimate of a population mean.
Standard error of the proportion
The standard error of the proportion is the standard deviation of the sample proportion’s estimate of a population proportion.
Standard normal distribution
A standard normal distribution is a normal distribution with zero mean (μ = 0) and unit variance (σ² = 1).
The lower and upper limits of a class interval.
A statistic is a quantity that is calculated from a sample of data.
Two events are independent if the occurrence of one of the events gives us no information about whether or not the other event will occur.
The power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis.
Stationary time series
A time series that does have a constant mean and oscillates around this mean.
Student’s t distribution
The t distribution is the sampling distribution of the t statistic.
Sum of squares for error (SSE)
The SSE measures the variation in the modelling errors.
Sum of squares for regression (SSR)
The SSR measures how much variation there is in the modelled values.
A data set is symmetrical when the data values are distributed in the same way above and below the middle value.
A table shows the number of times that items occur.
A tally chart is a method of counting frequencies, according to some classification, in a set of data.
A test statistic is a quantity calculated from our sample of data.
Two or more data values share a rank value.
A unit of time by which the variable is defined (an hour, a day, a month, a year, etc.).
A variable measured and represented per units of time.
Time series plot
A chart of a change in variable against time.
Total sum of squares (SST)
The SST measures how much variation there is in the observed data (SST = SSR + SSE).
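The identity SST = SSR + SSE holds for a least squares line fitted with an intercept. The sketch below fits such a line to made-up data and computes the three sums of squares; the function names are illustrative.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the fitted least squares line."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * a for a in x]
    mean_y = sum(y) / len(y)
    sst = sum((b - mean_y) ** 2 for b in y)               # total variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)          # explained by the model
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))    # left in the residuals
    return sst, ssr, sse


# Hypothetical data.
sst, ssr, sse = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(sst, ssr, sse, ssr / sst)
```

The final ratio SSR/SST is the coefficient of determination for this fit.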
The trend is the long-run shift or movement in the time series observable over several periods of time.
A component in the classical time series analysis approach to forecasting that covers underlying directional movements of the time series.
True or mathematical limits
True or mathematical limits separate one class in a grouped frequency distribution from another.
Two sample tests
A two sample test is a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying distribution.
Two sample t-test for population mean (dependent or paired samples)
A two sample t-test for population mean (dependent or paired samples) is used to compare two dependent population means inferred from two samples (dependent indicates that the values from both samples are numerically dependent upon each other—there is a correlation between corresponding values).
Two sample t-test for the population mean (independent samples, equal variance)
A two sample t-test for the population mean (independent samples, equal variance) is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.
Two sample t-test for population mean (independent samples, unequal variances)
A two sample t-test for population mean (independent samples, unequal variances) is used when two separate sets of independent but differently distributed samples are obtained, one from each of the two populations being compared.
Two sample z-test for the population mean
A two sample z-test for the population mean is used to evaluate the difference between two group means.
Two sample z-test for the population proportion
A two sample z-test for the population proportion is used to evaluate the difference between two group proportions.
Two tail test
A two tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution.
Type I error, α
A type I error occurs when the null hypothesis is rejected when it is in fact true.
Type II error, β
A type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false.
Types of trends
The type of trend can include line and curve fits to the data set.
When the mean of the sampling distribution of a statistic is equal to a population parameter, that statistic is said to be an unbiased estimator of the parameter.
Uncertainty is a state of having limited knowledge where it is impossible to describe exactly the existing state or future outcome of a particular event occurring.
Methods that use only one variable and try to predict its future value on the basis of the past values of the same variable.
Upper one tail test
An upper one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the right tail of the probability distribution.
A variable is a symbol that can take on any of a specified set of values.
Measure of the dispersion of the observations.
Variation is a measure that describes how spread out or scattered a set of data is.
Wilcoxon signed rank sum test
The Wilcoxon signed ranks test is designed to test a hypothesis about the location of the population median (one or two matched pairs).