Calculating Mean, Mode, Median, Quantiles, Maximum, Minimum, and Outliers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde, mode
# Given grades data
grades = pd.Series([50.0, 50.0, 47.0, 97.0, 49.0, 3.0, 53.0, 42.0, 26.0, 74.0,
82.0, 62.0, 37.0, 15.0, 70.0, 27.0, 36.0, 35.0, 48.0, 52.0,
63.0, 64.0])
# Calculating mean, mode, and median
mean_grade = grades.mean()
mode_grade = mode(grades).mode[0]
median_grade = grades.median()
# Calculating quantiles
quantiles = grades.quantile([0.25, 0.5, 0.75])
q1 = quantiles[0.25]
q3 = quantiles[0.75]
# Calculating maximum and minimum
max_grade = grades.max()
min_grade = grades.min()
# Calculating interquartile range (IQR) and identifying outliers
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = grades[(grades < lower_bound) | (grades > upper_bound)]
print(f"Mean: {mean_grade}")
print(f"Mode: {mode_grade}")
print(f"Median: {median_grade}")
print(f"Quantiles: \n{quantiles}")
print(f"Max: {max_grade}")
print(f"Min: {min_grade}")
print(f"Lower Bound for Outliers: {lower_bound}")
print(f"Upper Bound for Outliers: {upper_bound}")
print(f"Outliers: \n{outliers}")
Plotting Histogram, Density Plot, and Box Plot
# Plotting Histogram
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.hist(grades, bins=10, edgecolor='black')
plt.title('Histogram of Grades')
plt.xlabel('Grade')
plt.ylabel('Frequency')
# Plotting Density Plot
plt.subplot(1, 3, 2)
sns.kdeplot(grades, shade=True)
plt.title('Density Plot of Grades')
plt.xlabel('Grade')
plt.ylabel('Density')
# Plotting Box Plot
plt.subplot(1, 3, 3)
sns.boxplot(grades)
plt.title('Box Plot of Grades')
plt.xlabel('Grade')
plt.tight_layout()
plt.show()
Explanation of Plots
-
Histogram:
- x-axis: Represents the grade values.
- y-axis: Represents the frequency (number of occurrences) of each grade range.
- Meaning: The histogram shows how frequently each grade range appears in the dataset.
-
Density Plot:
- x-axis: Represents the grade values.
- y-axis: Represents the probability density.
- Meaning: The density plot is a smoothed version of the histogram. It shows the distribution of grades and helps identify the areas where grades are more concentrated.
-
Box Plot:
- x-axis: Represents the grade values.
- y-axis: Not applicable here as it’s a vertical plot.
- Meaning: The box plot shows the five-number summary of the data: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It also highlights any potential outliers.
Introducing the Probability Density Function (PDF)
The Probability Density Function (PDF) describes the likelihood of a continuous random variable to take on a particular value. For a given interval, the area under the curve of the PDF over that interval represents the probability of the variable falling within the interval.
Calculating Probability Density and Interval Probabilities
# Calculate the probability density function
kde = gaussian_kde(grades)
density_at_37 = kde(37)[0]
print(f"Density at 37: {density_at_37}")
# Calculate probability between 36 and 38
prob_36_to_38 = kde.integrate_box_1d(36, 38)
print(f"Probability between 36 and 38: {prob_36_to_38:.4f}")
Explanation of PDF and Interval Probability
-
Probability Density at a Point (e.g., 37):
- Meaning: The density value at x = 37 indicates how dense the data is around the score of 37. It is not a probability but a density value.
- Example: Density at 37: 0.014779821322454553 means the data is relatively dense around 37.
-
Probability over an Interval (e.g., 36 to 38):
- Meaning: The probability of grades falling between 36 and 38 is the area under the PDF curve between these two points.
- Example: Probability between 36 and 38: 0.0296 means there is a 2.96% chance that a randomly selected grade from this dataset will fall within this range.