Descriptive Statistics in R
Descriptive statistics is a basic data-analysis technique that summarizes the important characteristics of sample data, turning raw data into useful information. This entails characterizing the shape, center, and spread of individual variables as well as the relationships between them. These fundamental calculations are easy to perform in R using built-in functions.
Elementary statistics frequently starts with describing the raw data, which includes determining whether the variables are categorical or numeric and whether the data is univariate (one variable recorded per observation) or multivariate (several variables recorded per observation). Characteristics estimated from a sample are known as statistics, and they are used to estimate the corresponding parameters of the larger population.
Calculating Measures of Centrality
Measures of centrality are popular summary statistics for describing the center of numerical observations. Centrality is typically measured using the mean, median, and mode.
Mean (mean())
The arithmetic mean is the conceptual “balance point” of a set of observations. To get the sample mean of numerical measurements, divide the sum of the observations by the number of observations. In R, the arithmetic mean is calculated using mean(). For instance, the mean of the integers 1 through 6 is 3.5. mean() is a generic function: it dispatches to a method such as mean.default depending on the class of its argument.
By default, mean() returns NA if the vector contains missing values. To calculate the mean from the observed values only, set the optional argument na.rm (NA remove) to TRUE. The mean can change drastically when extreme observations are added or removed, which makes it sensitive to outliers.
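The behavior described above can be sketched with a few lines of R. The vectors here are hypothetical values chosen only for illustration; they are not taken from any real dataset.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(1, 2, 3, 4, 5, 6)

# Arithmetic mean: the sum of the values divided by how many there are
mean_x <- mean(x)
mean_manual <- sum(x) / length(x)

# With a missing value, mean() returns NA unless na.rm = TRUE is set
x_missing <- c(x, NA)
mean_with_na <- mean(x_missing)
mean_without_na <- mean(x_missing, na.rm = TRUE)

# One extreme observation pulls the mean away from the bulk of the data
x_outlier <- c(x, 100)
mean_outlier <- mean(x_outlier)

print(mean_x)           # 3.5
print(mean_with_na)     # NA
print(mean_without_na)  # 3.5
print(mean_outlier)
```

Spelling out na.rm = TRUE in full, rather than the abbreviation T, is the safer habit because T is an ordinary variable that can be reassigned.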
colMeans() is designed to calculate the mean of each column of a matrix or data frame quickly. This specialized function is frequently far faster than the more general approach of calling apply() with mean across the columns (MARGIN=2).
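As a minimal sketch, the two approaches can be compared on a small matrix. The matrix m and its column names are made up for illustration, and no timing is shown; the example only checks that both calls give the same column means.

```r
# Hypothetical 3 x 2 numeric matrix with named columns
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
            dimnames = list(NULL, c("a", "b")))

# Specialized function for column means
col_means_fast <- colMeans(m)

# General approach: apply mean() over the columns (MARGIN = 2)
col_means_apply <- apply(m, 2, mean)

print(col_means_fast)
print(col_means_apply)

# Both approaches agree on the result
print(all.equal(col_means_fast, col_means_apply))
```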
Median (median())
The median represents the “middle magnitude” of a dataset. After the data are sorted from smallest to largest, the median is the single middle value when the number of observations is odd, or the mean of the two central values when it is even.
The median is the 0.5 quantile, or 50th percentile: half of the data falls below it. In R, median() calculates the median. Because it depends only on the middle of the ordered data, the median is robust and far less affected by extreme values than the mean.
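A short sketch of the odd and even cases, and of the median's robustness, follows. All vectors are hypothetical illustration values.

```r
# Odd number of values: the median is the single middle value
odd_values <- c(7, 1, 5)          # sorted: 1 5 7
median_odd <- median(odd_values)

# Even number of values: the median is the mean of the two central values
even_values <- c(7, 1, 5, 3)      # sorted: 1 3 5 7
median_even <- median(even_values)

# The median is also the 0.5 quantile
q50 <- quantile(even_values, 0.5)

# An extreme value shifts the mean strongly but barely moves the median
with_outlier <- c(even_values, 1000)
median_outlier <- median(with_outlier)
mean_outlier <- mean(with_outlier)

print(median_odd)      # 5
print(median_even)     # 4
print(median_outlier)  # 5
print(mean_outlier)
```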
Calculating Measures of Spread
Spread measures quantify the degree of variability or dispersion of numerical observations around a centrality metric. Standard deviation, variance, and the interquartile range (IQR) are important metrics.
Variance (var())
The sample variance measures how widely the observations are dispersed around the arithmetic mean. It is approximately the average squared distance of each observation from the mean.
R calculates the variance with var(x). Because the data usually represent a sample drawn from a larger, assumed population, R uses the sample estimator by default: the sum of squared deviations is divided by n - 1 rather than n, which yields an unbiased estimate of the true population variance. Outliers can strongly affect the variance because it is built from squared deviations from the mean.
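To make the divisor concrete, the following sketch compares var() with a manual calculation. The data vector is hypothetical, and the name population_var is only an illustrative label for the divide-by-n version, which base R does not provide as a separate function.

```r
# Hypothetical data with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

# Built-in sample variance: divides by n - 1
sample_var <- var(x)

# Manual versions for comparison
manual_sample_var <- ss / (n - 1)
population_var <- ss / n   # divides by n instead

print(sample_var)
print(manual_sample_var)
print(population_var)      # 4
```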
Standard Deviation (sd())
The standard deviation is the square root of the variance. It is popular because it converts the measure of spread back to the original units of measurement, offering a rough interpretation as the typical deviation of an observation from the mean.
In R, sd(x) calculates the standard deviation. The standard deviation of each data frame column can be computed with apply(df, 2, sd), and the same pattern with MARGIN = 1 works across the rows of a matrix. Custom functions can be written in the same way for other row-wise or column-wise computations, for example ones that use the standard deviation or its square, the variance.
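The column-wise and row-wise patterns can be sketched as follows. The data frame df and its column names height and weight are hypothetical; sapply(df, sd) is shown as a common alternative for data frames and is not mentioned in the text above.

```r
# Standard deviation is the square root of the variance
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd_x <- sd(x)

# Hypothetical data frame with two numeric columns
df <- data.frame(height = c(150, 160, 170),
                 weight = c(50, 60, 80))

# Standard deviation of each column (MARGIN = 2)
col_sds <- apply(df, 2, sd)

# Alternative for data frames: iterate over the columns directly
col_sds_sapply <- sapply(df, sd)

# Standard deviation and variance across the rows of a matrix (MARGIN = 1)
m <- as.matrix(df)
row_sds <- apply(m, 1, sd)
row_vars <- row_sds^2

print(sd_x)
print(col_sds)
print(row_sds)
print(row_vars)
```

Row-wise results only make sense when the columns share a unit of measurement; here the rows are used purely to show the mechanics.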
For data assumed to follow a normal distribution, the standard deviation can also be estimated as the interquartile range (IQR) divided by roughly 1.349.
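The constant comes from the normal distribution itself: the quartiles of a standard normal lie at about plus and minus 0.6745, so its IQR is 2 * qnorm(0.75). The sketch below checks this on simulated data; the seed, sample size, mean, and standard deviation are arbitrary choices made for illustration.

```r
# IQR of a standard normal distribution: distance between its quartiles
iqr_to_sd <- 2 * qnorm(0.75)   # about 1.349

# Simulated normal data with a known standard deviation of 10
set.seed(42)
simulated <- rnorm(10000, mean = 50, sd = 10)

# Classical estimate and IQR-based estimate of the standard deviation
sd_classic <- sd(simulated)
sd_from_iqr <- IQR(simulated) / iqr_to_sd

print(iqr_to_sd)
print(sd_classic)
print(sd_from_iqr)
```

Both estimates should land close to 10 for data like these; on data that are not approximately normal, the IQR-based figure should not be read as a standard deviation.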
Interquartile Range (IQR())
The interquartile range measures the width of the “middle 50%” of the data. It is calculated as the difference between the 75th and 25th percentiles.
Rather than measuring spread around the mean, the IQR measures spread around the median. Because it relies on quantiles, the IQR is comparatively robust to outliers, which makes it an effective tool for examining skewed distributions or data with extreme values.
In R, it is calculated with IQR(x). Internally, IQR() applies diff() to the 25th and 75th percentiles returned by quantile(). A related idea is the five-number summary, which comprises the minimum, the first quartile, the median, the third quartile, and the maximum; it can be retrieved with quantile() or with summary(), which also reports the mean.
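A minimal sketch of these calls on a hypothetical vector follows. It relies on R's default quantile algorithm (type = 7); other type values can give slightly different quartiles.

```r
# Hypothetical data: the integers 1 through 9
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# First and third quartiles (25th and 75th percentiles)
quartiles <- quantile(x, c(0.25, 0.75))

# Built-in IQR and the equivalent manual difference
iqr_x <- IQR(x)
iqr_manual <- unname(diff(quartiles))

# Five-number summary: min, Q1, median, Q3, max
five_num <- quantile(x)

print(quartiles)
print(iqr_x)        # 4
print(five_num)

# summary() reports the same five values plus the mean
print(summary(x))
```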
Measuring Association with cor()
Analyzing multivariate data frequently requires looking into the connection or link between two numerical variables.
Covariance (cov())
Covariance measures the degree to which two numerical variables “change together”. Similar to the variance, the sample covariance is calculated from the products of the paired deviations of each variable from its mean.
A positive covariance indicates a positive linear relationship: y tends to rise as x rises. A negative covariance indicates a negative linear relationship: y tends to decrease as x grows. A covariance near zero suggests that x and y are not linearly related. R calculates the covariance between x and y with cov(x, y). When given a single matrix or data frame, cov() returns the variance-covariance matrix of its columns. Since covariance depends on the means, outliers affect it.
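The sign of the covariance and the matrix form can be sketched as follows. The vectors x, y_up, and y_down are hypothetical and constructed to be perfectly linear so the results are easy to check by hand.

```r
# Hypothetical paired data
x <- c(1, 2, 3, 4, 5)
y_up <- c(2, 4, 6, 8, 10)     # rises with x
y_down <- c(10, 8, 6, 4, 2)   # falls as x rises

# Sample covariance: positive when the variables rise together
cov_up <- cov(x, y_up)
cov_down <- cov(x, y_down)

# Manual calculation: products of paired deviations, divided by n - 1
cov_manual <- sum((x - mean(x)) * (y_up - mean(y_up))) / (length(x) - 1)

# A single data frame argument returns the variance-covariance matrix
cov_matrix <- cov(data.frame(x = x, y_up = y_up))

print(cov_up)      # 5
print(cov_down)    # -5
print(cov_matrix)
```

The diagonal of cov_matrix holds the variances of x and y_up, and the off-diagonal entries hold their covariance.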
Correlation (cor())
Pearson’s product-moment correlation coefficient describes the strength and direction of a linear relationship between two variables. It is calculated by dividing the covariance by the product of the two variables’ standard deviations. The correlation coefficient ranges from -1 to 1. Stronger linear relationships are nearer to -1 or 1, while weak or absent linear relationships are nearer to 0.
In R, cor(x, y) computes the correlation between x and y. When given a single matrix or data frame, cor(x) produces the correlation matrix of its columns. If the vectors have missing data, set use="complete.obs" to compute the correlation from complete pairs of observations only. Pearson’s correlation coefficient only measures straight-line relationships, and a high correlation indicates association, not causation. Setting the optional method argument of cor() to "spearman" returns Spearman’s rank correlation, an alternative to Pearson’s coefficient.
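The options discussed above can be sketched on a small hypothetical example in which y is the square of x, so the relationship is monotonic but not a straight line, and x contains one missing value.

```r
# Hypothetical paired data; the last x value is missing
x <- c(1, 2, 3, 4, 5, NA)
y <- c(1, 4, 9, 16, 25, 36)

# With the default settings, a missing value makes the result NA
r_default <- cor(x, y)

# Use only complete pairs of observations
r_pearson <- cor(x, y, use = "complete.obs")

# Spearman's rank correlation via the method argument
r_spearman <- cor(x, y, use = "complete.obs", method = "spearman")

# Pearson's r by hand: covariance over the product of standard deviations
ok <- complete.cases(x, y)
r_manual <- cov(x[ok], y[ok]) / (sd(x[ok]) * sd(y[ok]))

# A single matrix argument returns the correlation matrix of its columns
cor_matrix <- cor(cbind(x = x, y = y), use = "complete.obs")

print(r_default)    # NA
print(r_pearson)
print(r_spearman)   # 1
print(cor_matrix)
```

Spearman's coefficient is exactly 1 here because the ranks of x and y agree perfectly, whereas Pearson's coefficient is slightly below 1 because the relationship is curved.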
Summary of R Functions for Descriptive Statistics:
Measure Type | Statistic | R Function(s) | Notes |
Centrality | Mean | mean() | Returns NA for missing values unless na.rm=TRUE; sensitive to outliers. |
Centrality | Median | median() | Robust against outliers. |
Spread | Variance | var() | Sample estimator uses divisor n - 1; sensitive to outliers. |
Spread | Standard Deviation | sd() | Square root of variance. |
Spread | Interquartile Range | IQR() | Robust measure of middle 50% spread. |
Association | Covariance | cov(x, y) | Measures direction of joint linear change. |
Association | Correlation | cor(x, y) | Measures direction and strength of linear relationship (range -1 to 1). |
General Summary | Five-Number Summary | summary() or quantile() | Provides Min, Q1, Median, Q3, Max. |