When working with data, it is often important to consider whether it follows a normal distribution. The normal (Gaussian) distribution is characterized by a bell-shaped curve: values cluster most frequently around the mean, with fewer values occurring farther away from it. It is widely used as a statistical model because of its convenient mathematical properties and because many real-world phenomena are approximately normally distributed.
Normality tests are conducted to determine whether a given variable follows a normal distribution. This is a key step in many statistical analyses that assume normality, such as t-tests, ANOVA, and linear regression (where the assumption applies to the residuals).
Luckily, there are many statistical tests available to determine if your data follows a normal distribution. In this article, we will explore several tests for normality, including graphical methods and statistical tests, along with their corresponding code examples using R, a popular programming language for statistical analysis.
Graphical Methods for Testing Normality
- Histograms and Density Plots
Histograms are one of the most common ways to visualize a variable's distribution. They show the frequency of data points in each bin or interval, and can give you an idea of whether the data is skewed, symmetric, or bimodal. A density plot is similar to a histogram, except that it is smooth and continuous, with a curve that represents an estimate of the probability density function of the variable.
Here's an example of how to create a histogram and density plot for the variable "mpg" (miles per gallon) from the built-in dataset "mtcars" in R:
# Load the dataset
data("mtcars")
# Plot a histogram
hist(mtcars$mpg, col="skyblue", main="Histogram of MPG")
# Plot a density plot
plot(density(mtcars$mpg), col="skyblue", main="Density Plot of MPG")
Histograms and density plots can indicate whether the data is roughly normal or not. In general, a normal distribution will show a bell-shaped curve in the center of the plot, with tails extending symmetrically in both directions. However, the issue with this method is that it is often difficult to determine graphically whether the distribution is exactly normal or not.
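One way to make the visual comparison easier is to draw a fitted normal curve on top of the histogram. The following is a minimal sketch, assuming the "mpg" column of mtcars as the example variable; the normal curve uses the sample mean and standard deviation, and the dashed line is the kernel density estimate.

```r
# Histogram on the density scale with a fitted normal curve on top.
# Assumption: mpg (miles per gallon) is the example column of mtcars.
data("mtcars")
x <- mtcars$mpg

# freq = FALSE puts bar heights on the density scale,
# so they are comparable with a density curve
h <- hist(x, freq = FALSE, col = "skyblue",
          main = "Histogram of MPG with normal curve", xlab = "mpg")

# Normal density using the sample mean and standard deviation.
# Inside curve(), x is the plotting grid, so the data are referenced
# through mtcars$mpg instead.
curve(dnorm(x, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)),
      add = TRUE, lwd = 2)

# Kernel density estimate of the data, for comparison
lines(density(x), lty = 2)
```

If the bars and the dashed line follow the solid curve closely, the data look roughly normal; large gaps in the center or in the tails suggest otherwise.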
- QQ Plots
A more reliable graphical method for testing normality is a Quantile-Quantile (QQ) plot. QQ plots can visualize how well the observed data points match up with what would be expected in a normally distributed population. They plot the quantiles of the data versus those of an expected normal distribution on a scatter plot.
If the distribution is normal, the points on the plot will lie along a straight line. However, if the distribution is not normal, the points will deviate from the straight line in some noticeable way. Skewed data will cause points to deviate on one side of the line more than the other.
Here's an example of how to create a QQ plot for the variable "mpg" using the 'car' package in R:
#install.packages("car") #install the package if needed
library(car)
# Create QQ plot
qqPlot(mtcars$mpg)
A QQ plot is a straightforward way to visualize normality, especially when dealing with smaller datasets.
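How straight a QQ plot is can also be summarized with a single number: the correlation between the theoretical and the sample quantiles. The sketch below uses base R only and simulated data; the helper name qq_correlation and the seed are assumptions made for illustration, and the number is a rough summary rather than a formal test.

```r
# Numeric summary of QQ-plot straightness (illustrative sketch).
set.seed(1)                    # arbitrary seed, only for reproducibility
normal_sample <- rnorm(200)    # simulated normal data
skewed_sample <- rexp(200)     # simulated right-skewed data

# Hypothetical helper: correlation between the theoretical and the
# sample quantiles that qqnorm() would plot
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # returns the coordinates without plotting
  cor(qq$x, qq$y)
}

qq_correlation(normal_sample)  # close to 1 for normal data
qq_correlation(skewed_sample)  # lower, because the points bend away from the line
```

A value near 1 corresponds to points lying close to a straight line; skewed or heavy-tailed data pull the value down.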
Statistical Tests for Normality
While graphical methods are a helpful initial step in determining normality, statistical tests are often necessary for statistical inference purposes. There are several statistical tests available to check for normality, and we will discuss the most widely used tests in this section.
- Shapiro-Wilk Test
The Shapiro-Wilk test is a popular test of normality. Its null hypothesis is that the data come from a normal distribution, and its p-value measures how surprising the observed data would be if that hypothesis were true; it is not the probability that the data are normal. The test is often used with small samples, where it has good power to detect departures from normality.
In R, the 'shapiro.test()' function can be used to conduct the Shapiro-Wilk test.
# Load the dataset
data("mtcars")
# Shapiro-Wilk Test
shapiro.test(mtcars$mpg)
The output provides the test statistic W and a p-value, which is compared to a significance level, denoted as alpha (α). If the p-value is greater than alpha, we do not reject the null hypothesis that the data is normally distributed. In this example, the p-value is about 0.12, which is greater than the typical significance level of 0.05, so we do not reject normality. This means the data is consistent with a normal distribution, not that normality has been proven.
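The comparison against alpha can also be done in code, since 'shapiro.test()' returns an object whose components can be read directly. This is a small sketch, assuming the "mpg" column of mtcars and alpha = 0.05; the variable names are illustrative.

```r
# Reading the Shapiro-Wilk result programmatically (illustrative sketch).
data("mtcars")
alpha <- 0.05                       # assumed significance level

sw <- shapiro.test(mtcars$mpg)      # returns an object of class "htest"
w_stat <- unname(sw$statistic)      # the W statistic
p_val  <- sw$p.value                # the p-value

if (p_val < alpha) {
  message("Reject the null hypothesis of normality")
} else {
  message("Do not reject the null hypothesis of normality")
}
```

Storing the result this way is convenient when the same check has to be applied to several variables.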
- Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is another method to test for normality, which checks whether the data distribution conforms to a normal distribution. The null hypothesis assumes that the data is normally distributed, and this test rejects the null hypothesis when there is significant evidence to show that the data does not follow a normal distribution.
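In R, the 'ks.test()' function can be used to conduct the Kolmogorov-Smirnov test.
# Load the dataset
data("mtcars")
# Kolmogorov-Smirnov Test
ks.test(mtcars$mpg, "pnorm", mean(mtcars$mpg), sd(mtcars$mpg))
Similar to the Shapiro-Wilk test, the output provides a p-value, which we compare against an alpha level to decide whether to reject the null hypothesis. If the p-value is greater than alpha (e.g., 0.05), we do not reject the hypothesis that the data is normal. Two caveats apply here. Because the mean and standard deviation are estimated from the same data, the reported p-value is only approximate and tends to be too large; the Lilliefors variant of the test corrects for this. In addition, 'ks.test()' may warn about ties, since the test assumes continuous data.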
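When the mean and standard deviation passed to 'ks.test()' are estimated from the data being tested, the standard p-value is no longer exact. One way to see this, sketched below in base R, is to simulate the distribution of the test statistic under normality while re-estimating the parameters for every simulated sample. This is a Monte Carlo illustration in the spirit of the Lilliefors correction, not a replacement for a dedicated implementation; the helper name ks_stat, the seed, and the number of replicates are assumptions.

```r
# Monte Carlo p-value for the KS statistic with estimated parameters
# (illustrative sketch).
data("mtcars")
set.seed(42)                  # arbitrary seed, only for reproducibility
x <- mtcars$mpg
n <- length(x)

# Hypothetical helper: KS distance between a sample and a normal
# distribution whose mean and sd are estimated from that same sample.
# suppressWarnings() hides the warning ks.test() gives for tied values.
ks_stat <- function(v) {
  unname(suppressWarnings(ks.test(v, "pnorm", mean(v), sd(v)))$statistic)
}

observed <- ks_stat(x)

# Distribution of the statistic under normality, with the parameters
# re-estimated for every simulated sample
simulated <- replicate(2000, ks_stat(rnorm(n)))

# Share of simulated statistics at least as large as the observed one
p_mc <- mean(simulated >= observed)
p_mc
```

Simulating from a standard normal is enough here because the statistic does not change when the data are shifted or rescaled before the parameters are estimated.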
- Anderson-Darling Test
The Anderson-Darling test is another test for normality, based on a weighted distance between the empirical cumulative distribution function of the data and the normal cumulative distribution function, with extra weight given to the tails. The test output provides a test statistic and a p-value, and the p-value is compared against the chosen significance level to decide whether to reject the null hypothesis.
In R, the 'ad.test()' function from the 'nortest' package can be used to conduct the Anderson-Darling test.
#install.packages("nortest") #install the package if needed
library(nortest)
# Anderson-Darling Test
ad.test(mtcars$mpg)
The output provides the test statistic A and a p-value. The null hypothesis is rejected if the p-value is less than the chosen significance level.
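To make the statistic less of a black box, it can be computed by hand from the sorted, standardized data. The sketch below follows the usual computational formula, A² = -n - (1/n) Σ (2i - 1) [ln Φ(z_i) + ln(1 - Φ(z_(n+1-i)))], using base R only. The function name ad_statistic is an assumption, and no p-value is computed.

```r
# Anderson-Darling statistic for normality, computed by hand
# (illustrative sketch; mean and sd are estimated from the data).
ad_statistic <- function(x) {
  x <- sort(x)
  n <- length(x)
  z <- (x - mean(x)) / sd(x)                                 # standardize
  log_lower <- pnorm(z, log.p = TRUE)                        # ln Phi(z_i)
  log_upper <- pnorm(z, lower.tail = FALSE, log.p = TRUE)    # ln(1 - Phi(z_i))
  i <- seq_len(n)
  # rev(log_upper)[i] is ln(1 - Phi(z_(n+1-i)))
  -n - mean((2 * i - 1) * (log_lower + rev(log_upper)))
}

data("mtcars")
ad_statistic(mtcars$mpg)

set.seed(3)                          # arbitrary seed for the comparison
a_normal <- ad_statistic(rnorm(500)) # small for normal data
a_skewed <- ad_statistic(rexp(500))  # much larger for skewed data
```

Larger values indicate a bigger discrepancy from the normal distribution, and the logarithms give extra weight to observations in the tails.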
Conclusion
In conclusion, checking normality is an important step before applying methods that assume it. Graphical methods give a visual indication of whether a dataset is approximately normal, while statistical tests provide a formal check against a chosen significance level. Researchers have various tests to choose from, depending on the sample size, the kind of departure they care about, and other factors. We have outlined some of the most commonly used tests, including the Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests, alongside their R code examples. It is important to choose a test suited to your data so that the analysis that follows rests on sound assumptions.
- Histograms and Density Plots
Histograms and density plots are great tools for understanding the distribution of a variable and whether the data is normally distributed or not. A histogram is a graphical representation of the distribution of numerical data, and shows the distribution by organizing data into bins and calculating the count of observations that fall into each bin. On the other hand, a density plot is a non-parametric estimate of the probability density function of a random variable. A density plot displays a smoothed version of a histogram, giving a sense of the shape of continuous distributions of data. In R, both histograms and density plots can be created using the hist() and density() functions, respectively.
#histogram
hist(mtcars$mpg, col="skyblue", main="Histogram of MPG")
#density plot
plot(density(mtcars$mpg), col="skyblue", main="Density Plot of MPG")
- QQ Plots
A QQ (Quantile-Quantile) plot is a graphical tool that can help you determine if a sample of data comes from a normally distributed population. It plots the quantiles of the sample against the quantiles of a standard normal distribution. If the data follows a normal distribution, the points on the plot will form a straight line. If the data does not follow a normal distribution, the points will depart from a straight line in some systematic way. In R, QQ plots can be created using the qqnorm() function, with qqline() adding a reference line.
#QQ Plot
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)
- Shapiro-Wilk Test
The Shapiro-Wilk test is a test of normality that assesses whether a set of data is drawn from a normal distribution. Its test statistic, W, measures how closely the ordered sample values match the values expected for a sample from a normal distribution; values of W well below 1 indicate a departure from normality. The p-value is the probability of seeing a W this small or smaller if the data really were normal. The test is widely used for small to medium sample sizes, where it has been shown to be generally more powerful than other tests of normality. In R, the Shapiro-Wilk test can be performed using the shapiro.test() function, which accepts between 3 and 5000 observations.
#Shapiro-Wilk Test
shapiro.test(mtcars$mpg)
- Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a non-parametric test of the equality of continuous one-dimensional probability distributions. In its two-sample form, it is used to determine whether two sets of data differ significantly from each other. In its one-sample form, which is the one used for normality testing, it compares the empirical distribution function of the data to the theoretical cumulative distribution function of the distribution being tested. In R, both forms can be performed using the ks.test() function.
#Kolmogorov-Smirnov Test
ks.test(mtcars$mpg, "pnorm", mean(mtcars$mpg), sd(mtcars$mpg))
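For comparison, the two-sample form of the test takes two numeric vectors instead of a distribution name. The sketch below uses simulated data, so the sample sizes, means, seed, and variable names are all assumptions chosen for illustration.

```r
# Two-sample Kolmogorov-Smirnov test on simulated data (illustrative sketch).
set.seed(7)                        # arbitrary seed, only for reproducibility
group_a <- rnorm(100, mean = 0)    # first sample
group_b <- rnorm(100, mean = 1)    # second sample, shifted by one unit

ks_two <- ks.test(group_a, group_b)
ks_two$statistic   # largest vertical gap between the two empirical CDFs
ks_two$p.value     # a small value suggests the two distributions differ
```

Here the null hypothesis is that both samples come from the same distribution, which is a different question from whether either sample is normal.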
- Anderson-Darling Test
The Anderson-Darling test is another test of normality that is based on the difference between the observed sample distribution function and the expected distribution function of a normal population. Compared with the Kolmogorov-Smirnov test, it gives more weight to the tails of the distribution, which makes it more sensitive to departures there. In R, the Anderson-Darling test can be performed using the ad.test() function in the nortest package.
#Anderson-Darling Test
ad.test(mtcars$mpg)
In conclusion, normality tests are important in statistical analysis to check whether our data is consistent with a normal distribution and to decide whether other types of analysis are needed. Histograms, density plots, QQ plots, and statistical tests like the Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests are all useful tools for assessing the normality of our data. In R, there are many built-in functions and packages that can be used to perform these tests and visualize your data.
Popular questions
- What are some graphical methods for testing normality in R?
- Graphical methods for testing normality in R include histograms, density plots, and QQ (Quantile-Quantile) plots.
- What is the Shapiro-Wilk test in R?
- The Shapiro-Wilk test is a popular test of normality, available in R as shapiro.test(). Its null hypothesis is that the data come from a normal distribution, and a small p-value is evidence against that hypothesis. It is often used to check the normality of small samples.
- What is the Kolmogorov-Smirnov test in R?
- The Kolmogorov-Smirnov test is a non-parametric test of the equality of continuous one-dimensional probability distributions. In R, ks.test() can compare two sets of data with each other, or compare one set of data with a theoretical distribution such as the normal.
- How can you create a density plot in R?
- To create a density plot in R, you can use the density() function. For example, to create a density plot for the variable mpg using the "mtcars" dataset, you can use the following code:
plot(density(mtcars$mpg), col="skyblue", main="Density Plot of MPG")
- What is an advantage of the Anderson-Darling test over other tests of normality?
- An advantage of the Anderson-Darling test is that it gives more weight to the tails of the distribution than the Kolmogorov-Smirnov test, which makes it more sensitive to departures from normality there. It is based on the difference between the observed sample distribution function and the expected distribution function of a normal population.