In data analysis and statistical modeling, distribution plots are an essential tool for visualizing how the values in a dataset are distributed. The seaborn library in Python provides the distplot() function for creating them. In this article, we will explore distplot() and how to use it to create several kinds of distribution plots.
What is distplot()?
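The distplot() function is part of the seaborn library in Python. By default it draws a histogram of a single variable and overlays a kernel density estimate (KDE) curve on top of it. It can also add a rug plot (rug=True), and it fits a parametric curve such as a normal distribution only when one is passed through the fit parameter. Note that distplot() has been deprecated since seaborn 0.11 in favor of displot() and histplot(); in releases where it is still available it emits a deprecation warning.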
Syntax:
To create a distribution plot using distplot(), the following syntax should be used:
seaborn.distplot(a=None, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None, x=None)
Here, a is the data to plot: a one-dimensional array, a list, or a pandas Series. distplot() works on a single variable, so it has no y, hue, or kind parameters; those belong to the newer displot() function. The hist, kde, and rug flags switch the histogram, the density curve, and the rug marks on or off, and bins sets the number of histogram bins. The fit parameter accepts a distribution object, for example scipy.stats.norm, to overlay a fitted parametric curve. The hist_kws, kde_kws, rug_kws, and fit_kws dictionaries pass styling options to each layer, color sets the color of all layers, norm_hist forces the histogram to show density instead of counts, and ax selects the matplotlib Axes to draw on.
Creating a Basic Histogram using distplot()
A histogram is a graphical representation of a data distribution in which the values are grouped into contiguous intervals called bins. In Python, we can create a histogram using the distplot() function.
Let's say we have a dataset of students' scores on a math test and want to visualize their distribution with distplot(). Here is how we can do it:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Creating a data frame
df = pd.DataFrame({'score': [75, 88, 90, 70, 56, 54, 68, 89, 75, 76, 89, 78, 80, 49, 91, 92, 70, 77, 62, 74]})
# Creating a distribution plot for the score column
sns.distplot(df['score'])
# Displaying the plot
plt.show()
In the above code, we have created a pandas dataframe with a single column named 'score' to store the students' math scores. We have then passed this column to the distplot() function, which draws a histogram with a density curve on top.
The output of this code will be a histogram of the distribution of the students' scores as shown below:
Histogram using distplot()
As we can see in the output image, the distplot() function has plotted the scores on the x-axis and their density (a normalized frequency, scaled so that the bars enclose a total area of 1) on the y-axis. It has also plotted a density line that estimates the probability density function of the scores.
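To make the idea of a normalized frequency concrete, the sketch below computes a density-scaled histogram with NumPy alone. It is not what distplot() runs internally; it only illustrates the scaling, and the choice of 5 bins is an arbitrary assumption.

```python
import numpy as np

scores = np.array([75, 88, 90, 70, 56, 54, 68, 89, 75, 76,
                   89, 78, 80, 49, 91, 92, 70, 77, 62, 74])

# Raw counts per bin (bins=5 is an arbitrary choice for this sketch).
counts, edges = np.histogram(scores, bins=5)

# density=True rescales the bar heights so the bars enclose a total area of 1.
heights, _ = np.histogram(scores, bins=5, density=True)
widths = np.diff(edges)
total_area = float(np.sum(heights * widths))

print("bin edges:", edges)
print("counts:   ", counts)
print("density:  ", heights)
print("area:     ", total_area)
```

Each density value is the bin's count divided by the number of observations times the bin width, which is why the y-axis values are small fractions rather than whole numbers.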
Creating a Density Plot using distplot()
A density plot displays the probability density of a continuous variable as a smooth curve, which can be thought of as a smoothed version of the histogram. In distplot(), the density curve is controlled by the kde parameter. It is True by default, so passing it explicitly only makes the intent clear; adding hist=False would draw the curve without the bars.
Let's use the same dataset of students' math scores and create a density plot using the distplot() function:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Creating a data frame
df = pd.DataFrame({'score': [75, 88, 90, 70, 56, 54, 68, 89, 75, 76, 89, 78, 80, 49, 91, 92, 70, 77, 62, 74]})
# Creating a density plot for the score column (kde=True is the default)
sns.distplot(df['score'], kde=True)
# Displaying the plot
plt.show()
In the above code, we have set the kde parameter of the distplot() function to True. Since True is also the default, the density curve is drawn over the histogram rather than in place of it.
The output of the above code will be a density plot of the distribution of the students' scores as shown below:
Density plot using distplot()
The density curve in the above plot estimates the probability density function of the scores. Because the curve is drawn, the histogram bars are scaled to density as well, not to raw counts, so that both share the same y-axis.
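The smoothing behind a density curve can be sketched by hand: place a Gaussian bump on every score and average the bumps. The function name gaussian_kde_sketch and the bandwidth of 6 points are assumptions made for illustration; seaborn chooses its own bandwidth and uses its own implementation.

```python
import numpy as np

scores = np.array([75, 88, 90, 70, 56, 54, 68, 89, 75, 76,
                   89, 78, 80, 49, 91, 92, 70, 77, 62, 74], dtype=float)

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# Evaluate the curve on a grid that extends well past the data on both sides.
grid = np.linspace(scores.min() - 30, scores.max() + 30, 500)
density = gaussian_kde_sketch(scores, grid, bandwidth=6.0)  # bandwidth is an assumed value

# Riemann-sum check: a density curve should enclose an area close to 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print("approximate area under the curve:", round(area, 3))
```

A larger bandwidth widens each bump and gives a smoother curve; a smaller one follows the individual scores more closely.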
Creating a Rug plot using distplot()
A rug plot is a graphical representation of a data distribution. It is a one-dimensional scatter plot that displays the data points along a number line, drawn as a series of short tick marks that indicate the location of each data point. This plot is a useful tool for visualizing the distribution of small datasets.
To create a rug plot using the distplot() function, we set the rug parameter to True.
For our students' math score dataset, let's create a rug plot using the distplot() function:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Creating a data frame
df = pd.DataFrame({'score': [75, 88, 90, 70, 56, 54, 68, 89, 75, 76, 89, 78, 80, 49, 91, 92, 70, 77, 62, 74]})
# Creating a rug plot for the score column (histogram of counts plus rug marks)
sns.distplot(df['score'], kde=False, rug=True)
# Displaying the plot
plt.show()
In the above code, we have set the kde and rug parameters of the distplot() function to False and True, respectively. The histogram is still drawn because hist defaults to True, so the rug marks appear beneath the bars.
The output of the above code will be a rug plot of the distribution of the students' scores as shown below:
Rug plot using distplot()
As we can see in the output image, the distplot() function has plotted a rug plot of the scores on the x-axis, and the y-axis shows the count of the scores.
Conclusion:
In conclusion, the distplot() function of the seaborn library is a convenient tool for creating distribution plots in Python. So far, this article has discussed how to create basic histograms, density plots, and rug plots with it. Other distribution plots, such as standalone kernel density estimates and empirical cumulative distribution plots, are not produced by distplot() itself; seaborn provides the separate kdeplot() and ecdfplot() functions for those. The remaining sections look more closely at the bins, bandwidth, and rug options.
Choosing the Number of Bins in distplot()
The histogram created using the distplot() function is an excellent way to visualize the frequency distribution of a dataset. However, when creating histograms using the distplot() function, it is essential to choose the right number of bins. The number of bins determines the granularity of the histogram and can impact the interpretation of the data distribution.
To set the number of bins in the histogram, we can pass the bins parameter to the distplot() function. By default, distplot() does not use a fixed number of bins: when bins is None, it chooses the count with the Freedman-Diaconis rule, capped at 50. We can experiment with different values to see how the histogram changes.
For example, let's say we have a dataset of 100 randomly generated integers between 1 and 100. We want to plot a histogram of the distribution of these values using the distplot() function.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generating an array of 100 random integers from 1 to 100 (the upper bound 101 is exclusive)
x = np.random.randint(1, 101, 100)
# Creating a histogram with 20 bins
sns.distplot(x, bins=20)
# Displaying the plot
plt.show()
In the above code, we have set the bins parameter of the distplot() function to 20 to create a histogram with 20 bins.
The output of the above code will be a histogram of the distribution of the 100 randomly generated values as shown below:
Histogram with 20 bins using distplot()
As we can see in the output image, the number of bins determines the granularity of the histogram. A smaller number of bins produces a coarser histogram, while a larger number of bins produces a more detailed one.
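The effect of the bin count can also be checked numerically, without drawing anything. The sketch below bins the same random sample twice with NumPy; the seed and the choice of 5 versus 20 bins are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is repeatable
x = rng.integers(1, 101, size=100)   # 100 integers from 1 to 100 inclusive

coarse_counts, coarse_edges = np.histogram(x, bins=5)
fine_counts, fine_edges = np.histogram(x, bins=20)

coarse_width = coarse_edges[1] - coarse_edges[0]
fine_width = fine_edges[1] - fine_edges[0]

print("5 bins, width", coarse_width, "->", coarse_counts)
print("20 bins, width", fine_width, "->", fine_counts)
```

Both histograms account for all 100 values; only the width of each interval changes, and with it how much local variation the bars reveal.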
Creating a Kernel Density Estimation Plot using distplot()
Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function of a continuous variable: a small kernel function, typically a Gaussian, is centered on every data point and the kernels are averaged. The KDE plot created using the distplot() function is a smooth counterpart to the histogram.
To create a KDE plot using the distplot() function, we leave the kde parameter at True, which is its default. distplot() has no bw parameter of its own; bandwidth options have to be passed to the underlying KDE through the kde_kws dictionary.
For example, let's say we have a dataset of 1000 randomly generated values from a normal distribution. We want to plot a KDE plot of the distribution of these values using the distplot() function.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generating an array of 1000 random numbers from a normal distribution
x = np.random.normal(size=1000)
# Creating a KDE plot with a bandwidth factor of 0.5
# (seaborn 0.11 or later; older releases expected kde_kws={"bw": 0.5} instead)
sns.distplot(x, kde=True, kde_kws={"bw_method": 0.5})
# Displaying the plot
plt.show()
In the above code, we have set kde to True and passed bw_method=0.5 through kde_kws. The value is a factor that multiplies the standard deviation of the data, so for standard normal data it gives a bandwidth of roughly 0.5.
The output of the above code will be a KDE plot of the distribution of the 1000 randomly generated values as shown below:
KDE plot with a bandwidth of 0.5 using distplot()
As we can see in the output image, the KDE plot created using the distplot() function is a smoother version of the histogram. The curve is an estimate of the probability density function of the data.
Adding a Rug Plot to a Histogram and KDE using distplot()
A rug plot is a one-dimensional scatter plot that represents the data points along the number line. It can be used to supplement the histogram and KDE plots created using the distplot() function.
To create a rug plot using the distplot() function, we set the rug parameter to True.
For example, let's say we have a dataset of 100 randomly generated integers between 1 and 100. We want to add a rug plot to the distribution plot of these values using the distplot() function.
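import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generating an array of 100 random integers from 1 to 100 (the upper bound 101 is exclusive)
x = np.random.randint(1, 101, 100)
# Creating a rug plot beneath the default histogram and KDE curve
sns.distplot(x, rug=True)
# Displaying the plot
plt.show()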
In the above code, we have set the rug parameter of the distplot() function to True to add a rug plot to the default histogram and KDE curve.
The output of the above code will be a rug plot of the distribution of the 100 randomly generated values as shown below:
Rug plot using distplot()
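As we can see in the output image, the distplot() function has drawn a small tick on the x-axis for every value, underneath the histogram and the KDE curve. Because the KDE curve is enabled by default, the y-axis shows density rather than the count of the values.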
Conclusion
The distplot() function in the seaborn library is a powerful tool for visualizing and analyzing the distribution of a dataset. The distplot() function can create histograms, KDE plots, and rug plots. By using the distplot() function in combination with other seaborn plots, we can create powerful visualizations of our data to gain insights and understanding.
However, it is essential to choose the right parameters and set the correct number of bins to prevent misleading interpretations of the data. The distplot() function allows us to customize the plot in many ways, and we can experiment with different settings to find the best way to visualize the distribution of our data.
Popular questions
What is distplot() in Python and how is it used to visualize data?
Answer: distplot() is a function in the seaborn library in Python that displays a histogram of the given dataset. By default it visualizes the distribution by plotting a kernel density estimate on top of the histogram; a normal distribution curve is fitted only when a distribution is passed through the fit parameter.
How can we create a histogram using the distplot() function?
Answer: To create a histogram without the density curve using the distplot() function in Python, we pass the dataset to the function and set the kde parameter to False. For example, we can use the following code to create a histogram of student scores: sns.distplot(student_scores, kde=False)
What is the difference between a histogram and a density plot?
Answer: A histogram represents a data distribution by grouping the values into contiguous intervals called bins and drawing one bar per bin. A density plot instead displays the probability density of a continuous variable as a smooth curve, which can be thought of as a smoothed version of the histogram.
How can we set the number of bins in a histogram created using the distplot() function?
Answer: To set the number of bins in a histogram created using the distplot() function, we can pass the bins parameter to the function. For example, we can use the following code to create a histogram with 20 bins: sns.distplot(data, bins=20)
What is a rug plot and how can we create it using the distplot() function?
Answer: A rug plot is a one-dimensional scatter plot that represents the data points along the number line. It can be used to supplement the histogram and KDE plots created using the distplot() function. To create a rug plot using the distplot() function in Python, we set the rug parameter to True. For example, we can use the following code to create a rug plot: sns.distplot(data, rug=True)