Simple Linear Regression In R Programming With Example

Simple Linear Regression In R

Simple linear regression is one of the most basic statistical modeling methods for describing the relationship between two variables. It represents a particular class of parametric models, which makes it an ideal starting point for learning about statistical regression. The primary goal of a linear regression model is to build a function that, given a specific value of an explanatory variable (also referred to as the predictor variable), estimates the mean value of a continuous outcome variable, or response variable.

In the Simple Linear Regression (SLR) model, the value of a continuous response variable is influenced by a single explanatory variable, which can be continuous, discrete, or categorical. The model states that the value of the response variable is made up of three primary parts: a constant baseline value (the intercept), a scaled contribution of the predictor (the slope multiplied by the predictor value), and a random error term.
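In the usual notation, the model for a single observation can be written as

Y = β0 + β1·X + ε

where β0 is the intercept, β1 is the slope, and ε is the random error term.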

The error term, also referred to as residual variation, must satisfy certain assumptions for the conclusions drawn from the model to be valid. The random error is assumed to be centered at zero, normally distributed, and of constant variance. It accounts for any part of a raw response value that arises from random noise beyond the predictor’s linear effect. The two fundamental quantities estimated from the data, the intercept and the slope, are known as the regression coefficients.

The Intercept: The intercept is a key component of the simple linear regression model. It represents the expected value of the response when the predictor equals zero, and it anchors the estimated mean response equation obtained when the model is fitted to observed data. In R, the lm() function fits a linear model and reports the estimated intercept coefficient under the label (Intercept).

The Slope: This is usually the coefficient of primary interest. It is interpreted as the change in the mean response for every one-unit increase in the predictor. A positive slope makes the regression line rise, a negative slope makes it fall, and a slope of 0 means the predictor has no effect on the response.

Fitting the linear model means using the observed data pairs to estimate the intercept and slope. Least-squares regression is the conventional approach: the fitted line is chosen so that it minimizes the sum of squared vertical differences between the line and the observed data points. Because the resulting line is, in this sense, “closest to all observations,” the estimated regression equation is also referred to as the line of best fit.
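As a quick illustration (a minimal sketch that uses the same sample data as the lm() example further below), the least-squares slope equals the covariance of x and y divided by the variance of x, and the intercept then follows from the sample means:

# Sample data (same values as in the lm() example below)
x <- c(1, 2, 3, 4, 5)
y <- c(2.3, 2.9, 4.1, 4.8, 5.3)

# Least-squares estimates computed directly
slope <- cov(x, y) / var(x)             # 0.79
intercept <- mean(y) - slope * mean(x)  # 1.51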

Fitting a Linear Model with lm()

Simple Linear Regression and other linear models are fitted in R with the lm() function. The function requires the model structure to be specified through a model formula. lm() fits linear models by least squares and returns an object containing the estimated coefficients, fitted values, residuals, and standard errors. This returned object is an instance of the special class “lm”. Because R is polymorphic and relies on generic functions, how you work with an “lm” object depends on which generic function you apply to it.

Printing the object: If you simply enter the name of the fitted model, the output is minimal, usually a repeat of the call used to fit the model followed by the point estimates of the intercept and slope coefficients. R achieves this by dispatching the call to a method function written for the “lm” class, print.lm(), which deliberately restricts the display to the most essential quantities.

Summarizing the object: A more thorough printout is obtained by calling the generic function summary(). Behind the scenes this dispatches to summary.lm(), which produces a regression-specific summary geared toward statistical inference. The output includes significance tests for the regression coefficients, the residual standard error, and the R-squared coefficient. A short demonstration follows the worked example below.

Extracting components: The “lm” object stores model details such as “coefficients,” “fitted.values,” and “residuals.” These can be pulled out with the usual list component extraction notation, but the “direct-access” functions, which are often generic functions themselves, are preferred for programming purposes. For instance, coef() returns the estimated coefficients, fitted() returns the fitted values, and resid() returns the residuals.

Example:

# Sample data
x <- c(1,2,3,4,5)
y <- c(2.3,2.9,4.1,4.8,5.3)

# Fit linear model
model <- lm(y ~ x)

# Print coefficients, fitted values, and residuals
cat("Coefficients:\n"); print(coef(model))
cat("Fitted Values:\n"); print(fitted(model))
cat("Residuals:\n"); print(resid(model))

Output:

Coefficients:
(Intercept)           x 
       1.51        0.79 
Fitted Values:
   1    2    3    4    5 
2.30 3.09 3.88 4.67 5.46 
Residuals:
            1             2             3             4             5 
 8.326673e-16 -1.900000e-01  2.200000e-01  1.300000e-01 -1.600000e-01 
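Continuing with the model object fitted above, printing the object and calling summary() illustrate the two levels of detail described earlier:

# Minimal printout: the call plus the coefficient estimates
print(model)

# Detailed printout: coefficient tests, residual standard error, R-squared
summary(model)

For this small data set the summary reports an R-squared of about 0.98, indicating that the fitted line explains almost all of the variation in y.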

Understanding Model Formulas Using the Tilde ~

Almost all model-fitting and hypothesis-testing functions, such as lm() and t.test(), rely on model formulas, which are central to R’s approach to statistical analysis. Formulas are special objects of class “formula” that encode relationships between variables.

The key component in defining a model formula is the tilde operator (~). This symbol separates the response variable term on the left from the predictor variable terms on the right. Conceptually, the tilde reads as the response “is modeled by” or “is described by” the predictors. In Simple Linear Regression the formula takes its simplest possible structure: a single response variable described by a single explanatory variable, as in y ~ x.

Structure and Components:

A model formula generally follows this structure: the response is described by one or more terms connected by operators.

Response Variable: Always positioned to the left of the tilde, the response variable is a vector or an expression that evaluates to a vector and defines the outcome variable under analysis.

Predictor Terms: These appear to the right of the tilde. They may be vectors, factors, matrices, or formula expressions built from combinations of these. In SLR there is usually just one such term.

Operators: Inside a formula, arithmetic operators take on a special meaning rather than their usual mathematical interpretation. The plus sign, for example, is the operator used to add another term to the model. The overall intercept term is handled implicitly by the formula structure and is included in the model by default.

Handling Coefficients and Intercepts: For simple regression the intercept term is included automatically. Special notation can be used to state the intercept explicitly or to remove it altogether. When continuous variables are used, each variable named on the right of the tilde contributes one column to the model matrix, alongside an implicit column of ones for the intercept term.
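The following short sketch illustrates the standard notation; the -1 and +0 forms for suppressing the intercept are part of R’s formula syntax:

y ~ x        # response y described by predictor x; intercept included by default
y ~ x + 1    # the same model, with the intercept written out explicitly
y ~ x - 1    # intercept removed; the line is forced through the origin
y ~ x + 0    # equivalent way of removing the intercept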

Special Interpretations of Operators: Operators in a formula must be read in their formula-specific sense. In a model written as a response modeled by predictor one plus predictor two, for instance, the + sign indicates that the two predictors enter the model additively; it does not mean their mathematical sum.

To make R treat an operation as an ordinary mathematical calculation rather than a model-term operator, the expression must be wrapped in the function I(). This allows arithmetic operations, such as squaring a predictor variable to fit a quadratic regression, to appear inside a formula.
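For instance, a quadratic term can be added by wrapping the squaring operation in I(). This is a minimal sketch reusing the x and y vectors from the earlier example; the object name quad_model is just illustrative:

# Quadratic regression: x enters both linearly and squared
quad_model <- lm(y ~ x + I(x^2))
coef(quad_model)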

Statistical Inference and Diagnostics

After the model has been fitted with lm(), statistical inference is needed to assess how precise and reliable the estimated relationship is.

Significance Testing:

The model summary includes significance tests for the intercept and slope parameters. The test on the slope is usually the most important one, because it addresses whether there is statistical support for a relationship between the predictor and the response. The standard test contrasts the null hypothesis that the true slope is zero against the alternative hypothesis that it is not zero.

If the null hypothesis is rejected, there is statistically significant evidence that changes in the explanatory variable affect the mean outcome. The tests on the coefficients are based on a t-distribution whose degrees of freedom depend on the sample size. The summary() output also reports the coefficient of determination, or R-squared, which measures the proportion of the variation in the response explained by the predictor.
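The individual quantities reported by summary() can also be extracted programmatically; a minimal sketch using the model fitted earlier (the object name s is just illustrative):

s <- summary(model)
s$coefficients   # estimates, standard errors, t-values, and p-values
s$r.squared      # coefficient of determination
s$sigma          # residual standard error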

Residual Diagnostics:

Checking the linear model’s validity relies on the residuals, the differences between the observed responses and the fitted regression line. Diagnostics are used to verify the theoretical assumptions about the error term, such as normality, independence, linearity, and constant variance. A popular graphical tool is a scatterplot of the raw residuals against their corresponding fitted values; if the model’s assumptions hold, the residuals should be scattered randomly around zero.

A trend in the residuals, or a “fanning out” pattern indicating nonconstant variance (heteroscedasticity), suggests the residual assumptions may be violated, for example because of dependent observations or an unmodeled nonlinear relationship. To evaluate the normality assumption, the standardized residuals are plotted in a Normal Quantile-Quantile (QQ) plot; if the errors are normally distributed, the points should fall close to the diagonal line representing the theoretical normal quantiles.
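Both diagnostic plots can be produced directly from the fitted object; a minimal sketch, again using the model from the example above:

# Residuals versus fitted values: look for random scatter around zero
plot(fitted(model), resid(model)); abline(h = 0, lty = 2)

# Normal QQ plot of the standardized residuals
qqnorm(rstandard(model)); qqline(rstandard(model))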

Prediction

Beyond quantifying relationships, linear models are useful for predicting values of the response variable, including for new values of the explanatory variable that were not in the original data set. These predictions are produced from a fitted “lm” object with the predict() function, which requires the new predictor values to be supplied in a specific format, usually a data frame whose column name matches the predictor used in the original lm() call.

Predictions should always be accompanied by a measure of spread. The two primary interval types offered are:

Confidence Interval (CI): A Confidence Interval is a computed range that expresses the uncertainty of the estimated mean response for a given set of predictor values in a fitted model. The interval has a lower and an upper limit, together with a stated level of confidence that the true expected (population) response lies within it. A 95% CI, for example, indicates that the true mean response lies between its boundaries with 95% confidence.

Prediction Interval (PI): A Prediction Interval gives the plausible range of values for a single new realization of the response at a given set of predictor values, and it accompanies the point estimate as a measure of spread. Whereas the Confidence Interval reflects uncertainty about the estimated mean response, the PI describes the variability of an individual observation. Because it must account for the extra variability of a single raw observation, the Prediction Interval is always wider than the Confidence Interval produced at the same location.

Prediction intervals are inherently wider than confidence intervals because they must combine the uncertainty in the estimated line with the intrinsic random error of individual observations. Predictions at predictor values inside the originally observed range are interpolations, while predictions outside that range are extrapolations; researchers should be especially cautious when extrapolating.
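A minimal sketch of how predict() is called with both interval types, continuing with the model from the earlier example (x = 6 is simply an illustrative new value, and since it lies outside the observed range this is an extrapolation):

new_data <- data.frame(x = 6)

# Interval for the mean response at x = 6
predict(model, newdata = new_data, interval = "confidence", level = 0.95)

# Interval for a single new observation at x = 6 (wider than the CI)
predict(model, newdata = new_data, interval = "prediction", level = 0.95)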

Modeling Categorical Predictors

Although SLR is most often discussed with continuous explanatory variables, the predictor can also be categorical, made up of two or more distinct groups or levels. For a binary variable, a categorical variable with only two levels, R internally codes the groups, typically as 0 and 1. The interpretation changes slightly: the intercept is the mean response for the group coded as zero (the baseline or reference level), while the slope is the additive effect on the mean response of moving from the baseline level to the level coded as one.

R automatically handles multilevel variables, or categorical predictors with more than two levels, by introducing dummy coding. With one level selected as the reference baseline, this procedure transforms the single multilevel variable into a group of binary variables.

The fitted regression model then contains the overall intercept, which is the mean of the reference level, plus a separate estimated slope term for each remaining level, representing the additive change relative to the baseline mean. In terms of the number of parameters the fit is technically a multiple regression, but the single multilevel predictor is still treated as one entity even though it gives rise to several estimated parameters.

An important observation is that fitting a simple linear model with a single categorical predictor by least squares is, technically, identical to performing a one-way analysis of variance (ANOVA), regardless of the number of levels. ANOVA can therefore be viewed as a special case of least-squares regression. In this setting the p-value from the corresponding one-way ANOVA test is exactly the same as the p-value of the global significance test in the summary() output of the fitted “lm” object, both assessing the evidence against the null hypothesis that all group means are equal.
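A minimal sketch of a categorical predictor (the group labels and response values here are purely illustrative): R dummy-codes the factor, the intercept is the mean of the reference level “A”, the remaining coefficients are differences from that baseline, and the overall F-test in summary(fit) matches the one reported by anova(fit).

# Illustrative data: a response measured in three groups
group <- factor(c("A", "A", "B", "B", "C", "C"))
score <- c(5.1, 4.9, 6.2, 6.0, 7.3, 7.1)

# Fit the linear model with the factor as the single predictor
fit <- lm(score ~ group)
coef(fit)    # (Intercept) = mean of group A; groupB, groupC = differences from A
anova(fit)   # one-way ANOVA table; its F-test equals the global test in summary(fit)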

Regression models fitted with lm() remain linear regression models even when polynomial terms or logarithms are applied to individual variables, because the function defining the mean response is still linear in the regression coefficients. This versatility lets the linear model framework capture complex, curved relationships while still using least-squares estimation.
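For example, a model with a log-transformed predictor is still linear in its coefficients and is fitted the same way (a minimal sketch reusing x and y from the earlier example; log_model is just an illustrative name):

# The mean response is modeled as a linear function of log(x)
log_model <- lm(y ~ log(x))
coef(log_model)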

Kowsalya
Hi, I'm Kowsalya, a B.Com graduate currently working as an Author at Govindhtech Solutions. I'm deeply passionate about publishing the latest tech news and tutorials, bringing insightful updates to readers. I enjoy creating step-by-step guides and making complex topics easier to understand for everyone.