Interpretation of R2
• The following measures are used to validate simple linear regression models:
1. Coefficient of determination (R-square).
2. Hypothesis test for the regression coefficient β1.
3. Analysis of variance (ANOVA) for overall model validity (relevant more for multiple linear regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis.
• The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R-square) measures the percentage of variation in Y explained by the model (β0 + β1X).
Characteristics of R-square:
• Here are some basic characteristics of the measure:
1. Since R2 is a proportion, it is always a number between 0 and 1.
2. If R2 = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y.
3. If R2 = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y.
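The definition can be sketched numerically. The data below is hypothetical, purely for illustration; R2 is computed from its definition, 1 − SS_res/SS_tot, after a least-squares fit:

```python
import numpy as np

# Hypothetical (x, y) sample -- assumed purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the simple linear regression y = b0 + b1*x by least squares.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R-square = 1 - SS_res / SS_tot: proportion of variation in y
# explained by the fitted line.
ss_res = np.sum((y - y_hat) ** 2)   # unexplained variation
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation in y
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

Because the points here lie almost exactly on a line, R2 comes out close to 1; replacing y with pure noise would push it toward 0.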
• The coefficient of determination, R2, is a measure that assesses the ability of a model to predict or explain an outcome in the linear regression setting. More specifically, R2 indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by the linear regression and the predictor variable (X, also known as the independent variable).
• In general, a high R2 value indicates that the model is a good fit for the data, although interpretations of fit depend on the context of the analysis. An R2 of 0.35, for example, indicates that 35 percent of the variation in the outcome has been explained just by predicting the outcome using the covariates included in the model.
• That percentage might be a very high portion of variation to predict in a field such as the social sciences; in other fields, such as the physical sciences, one would expect R2 to be much closer to 100 percent.
• The theoretical minimum R2 is 0. However, since linear regression is based on the best possible fit, the sample R2 will in practice always be greater than zero, even when the predictor and outcome variables bear no relationship to one another.
• R2 increases when a new predictor variable is added to the model, even if the new predictor is not associated with the outcome. To account for that effect, the adjusted R2 incorporates the same information as the usual R2 but then also penalizes for the number of predictor variables included in the model.
• As a result, R2 increases as new predictors are added to a multiple linear regression model, but the adjusted R2 increases only if the increase in R2 is greater than one would expect from chance alone. In such a model, the adjusted R2 is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
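The penalty can be sketched with the standard formula, adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1); the R2, n, and p values below are illustrative assumptions:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square: n = number of observations,
    p = number of predictor variables in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R2 = 0.80 on n = 30 observations, but with more predictors
# the adjustment subtracts a larger penalty.
print(adjusted_r2(0.80, 30, 2))   # 2 predictors
print(adjusted_r2(0.80, 30, 10))  # 10 predictors: heavier penalty
```

Note that adding predictors always raises (or at least never lowers) the plain R2, while the adjusted value drops unless the new predictor pulls its weight.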
• A regression is spurious when we regress one random walk onto another, independent random walk. It is spurious because the regression will most likely indicate a non-existing relationship:
1. The coefficient estimate will not converge toward zero (the true value). Instead, in the limit the coefficient estimate follows a non-degenerate distribution.
2. The t-value is most often significant.
3. R2 is typically very high.
• Spurious regression is linked to serially correlated errors.
• Granger and Newbold (1974) pointed out that, along with the large t-values, strong evidence of serially correlated errors will appear in the regression analysis, stating that when a low value of the Durbin-Watson statistic is combined with a high value of the t-statistic, the estimated relationship is likely spurious.
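The effect is easy to simulate. A minimal sketch, assuming NumPy and SciPy are available: two independent random walks are generated, one is regressed on the other, and the Durbin-Watson statistic is computed directly from its definition on the residuals:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 500

# Two independent random walks: cumulative sums of i.i.d. noise.
x = np.cumsum(rng.standard_normal(n))
y = np.cumsum(rng.standard_normal(n))

# Regress one walk on the other; the true relationship is none.
res = linregress(x, y)
r2 = res.rvalue ** 2
print(f"R2 = {r2:.3f}, slope p-value = {res.pvalue:.2e}")

# Durbin-Watson statistic on the residuals: values near 0 signal
# strong positive serial correlation (the Granger-Newbold warning sign;
# values near 2 would indicate no serial correlation).
resid = y - (res.intercept + res.slope * x)
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"Durbin-Watson = {dw:.3f}")
```

Across repeated runs the slope p-value is usually tiny and the Durbin-Watson statistic sits far below 2, exactly the combination the bullet above warns about.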
Hypothesis Test for Regression Coefficient (t-Test)
• The regression coefficient (β1) captures the existence of a linear relationship between the response variable and the explanatory variable. If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
• Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically significant. However, for a simple linear regression, the null and alternative hypotheses in the ANOVA and the t-test are exactly the same, and thus there will be no difference in the p-value.
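A minimal sketch of the slope t-test, using scipy.stats.linregress on hypothetical data; the reported p-value is the two-sided test of H0: β1 = 0:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data, assumed purely for illustration.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7])

res = linregress(x, y)
print(f"b1 = {res.slope:.3f}, t-test p-value = {res.pvalue:.3g}")

# The t statistic is the slope estimate divided by its standard error.
t = res.slope / res.stderr
print(f"t = {t:.3f}")
# For simple linear regression the ANOVA F-test is equivalent:
# F = t^2, so both tests give the same p-value.
```

A small p-value leads us to reject H0: β1 = 0 and conclude there is a statistically significant linear relationship.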
Residual analysis
• Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following:
1. The residuals are normally distributed.
2. The variance of the residuals is constant (homoscedasticity).
3. The functional form of the regression is correctly specified.
4. Whether there are any outliers.
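The checks above can be sketched on hypothetical data: the Shapiro-Wilk test covers normality, and the remaining checks are simple numerical summaries (a real analysis would also plot residuals against fitted values):

```python
import numpy as np
from scipy import stats

# Hypothetical data, assumed purely for illustration.
x = np.arange(1, 11, dtype=float)
y = np.array([1.2, 2.1, 2.8, 4.3, 5.1, 5.8, 7.2, 7.9, 9.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# 1. Normality of residuals: Shapiro-Wilk test
#    (large p => no evidence against normality).
_, p_normal = stats.shapiro(resid)
print(f"Shapiro-Wilk p = {p_normal:.3f}")

# 2. Constant variance: compare residual spread in the two halves of
#    the data (a crude stand-in for a residual-vs-fitted plot).
print(f"spread: {resid[:5].std():.3f} vs {resid[5:].std():.3f}")

# 3. Functional form: a curved pattern in residuals vs x would suggest
#    misspecification -- best judged from a residual plot.

# 4. Outliers: standardized residuals beyond +/-3 are flagged.
std_resid = resid / resid.std(ddof=2)
print("outlier indices:", np.where(np.abs(std_resid) > 3)[0])
```

With an intercept in the model, least-squares residuals always sum to zero, which is a quick sanity check on the fit itself.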
Foundation of Data Science (CS3352), Unit III: Describing Relationships, 3rd Semester CSE, 2021 Regulation