Foundation of Data Science: Unit III: Describing Relationships

Interpretation of R²



• The following measures are used to validate the simple linear regression models:

1. Coefficient of determination (R-square).

2. Hypothesis test for the regression coefficient β₁.

3. Analysis of variance for overall model validity (relevant more for multiple linear regression).

4. Residual analysis to validate the regression model assumptions.

5. Outlier analysis.

• The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R-square) measures the proportion of the variation in Y that is explained by the model (β₀ + β₁X).
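For reference, the usual textbook definition behind this statement can be written as follows. The symbols SST, SSR and SSE (total, regression and error sums of squares) and the hats on the estimates are notation assumed for this sketch; they are not used elsewhere in these notes.

```latex
\[
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i ,
\]
\[
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad
SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
```

For a least-squares fit with an intercept, SST = SSR + SSE, which is why the two forms of the ratio agree.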

Characteristics of R-square:

• Here are some basic characteristics of the measure:

1. Since R² is a proportion, it is always a number between 0 and 1.

2. If R² = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y.

3. If R² = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y.
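The two boundary cases in the list above can be checked numerically. The following is a minimal sketch in plain Python, assuming a simple least-squares fit with an intercept; the function name r_squared and all data values are made up for illustration.

```python
def r_squared(x, y):
    """R-square of the least-squares line of y on x (simple linear regression)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # estimated slope
    b0 = y_bar - b1 * x_bar       # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]
perfect = [3 + 2 * xi for xi in x]     # every point lies on y = 3 + 2x
flat = [4, 6, 5, 6, 4]                 # chosen so the fitted slope is exactly zero
noisy = [2.1, 3.9, 6.2, 7.8, 10.1]     # close to a line, but not exactly on it

r2_perfect = r_squared(x, perfect)     # → 1.0
r2_flat = r_squared(x, flat)           # → 0.0
r2_noisy = r_squared(x, noisy)         # strictly between 0 and 1
print(r2_perfect, r2_flat, round(r2_noisy, 3))
```

The third data set shows the typical situation: the points are scattered around the line, so R-square is high but below 1.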

• The coefficient of determination, R², is a measure that assesses the ability of a model to predict or explain an outcome in the linear regression setting. More specifically, R² indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by linear regression and the predictor variable (X, also known as the independent variable).

• In general, a high R² value indicates that the model is a good fit for the data, although interpretations of fit depend on the context of analysis. An R² of 0.35, for example, indicates that 35 percent of the variation in the outcome is explained by the covariates included in the model.

• That percentage might be a very high portion of variation to predict in a field such as the social sciences; in other fields, such as the physical sciences, one would expect R² to be much closer to 100 percent.

• The theoretical minimum of R² is 0. However, because least squares chooses the best-fitting line for the sample at hand, the sample R² will almost always be slightly greater than zero, even when the predictor and outcome variables bear no relationship to one another.

• R² never decreases when a new predictor variable is added to the model, and it usually increases, even if the new predictor is not associated with the outcome. To account for that effect, the adjusted R² incorporates the same information as the usual R² but also penalizes for the number of predictor variables included in the model.

• As a result, R² increases as new predictors are added to a multiple linear regression model, but the adjusted R² increases only if the increase in R² is greater than one would expect from chance alone. In such a model, the adjusted R² is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
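The penalty can be made concrete with the standard adjusted R-square formula, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The sketch below assumes that formula; the function name and the numbers (n = 50, k = 3 and k = 20) are hypothetical and chosen only to show the effect.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-square for a model with k predictors fitted to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more than k + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical case: the same R-square of 0.35 obtained from 50 observations,
# once with 3 predictors and once with 20 predictors.
adj_few = adjusted_r_squared(0.35, n=50, k=3)
adj_many = adjusted_r_squared(0.35, n=50, k=20)
print(round(adj_few, 4), round(adj_many, 4))
```

With few predictors the adjustment is mild (roughly 0.31 instead of 0.35). With many predictors relative to the sample size, the same raw R-square is penalized heavily, and the adjusted value can even fall below zero.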

Spurious Regression

• A regression is spurious when we regress one random walk on another, independent random walk. It is spurious because the regression will most likely indicate a relationship that does not exist:

1. The coefficient estimate will not converge toward zero (the true value). Instead, in the limit the coefficient estimate will follow a non-degenerate distribution.

2. The t-statistic is most often significant.

3. R² is typically very high.

• Spurious regression is linked to serially correlated errors.

• Granger and Newbold (1974) pointed out that the large t-values in such regressions are accompanied by strong evidence of serially correlated errors. When a low value of the Durbin-Watson statistic appears together with a high value of the t-statistic, the estimated relationship should be treated as spurious rather than real.
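The symptoms described above can be reproduced with a small simulation. The following is a sketch under stated assumptions, not a reproduction of Granger and Newbold's experiment: the sample size (100), the number of replications (500), the seed, and the rough cutoff |t| > 2 for "significant" are all choices made here for illustration.

```python
import random

def random_walk(n, rng):
    """Cumulative sum of n independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def regress(x, y):
    """Simple OLS of y on x; returns (slope t-statistic, R-square, Durbin-Watson)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    se_b1 = (sse / (n - 2) / sxx) ** 0.5          # standard error of the slope
    # Durbin-Watson: values near 2 suggest no serial correlation, near 0 strong
    # positive serial correlation in the residuals.
    dw = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n)) / sse
    return b1 / se_b1, 1 - sse / sst, dw

rng = random.Random(0)        # fixed seed so the run is repeatable
n_obs, n_sims = 100, 500
results = [regress(random_walk(n_obs, rng), random_walk(n_obs, rng))
           for _ in range(n_sims)]

share_significant = sum(abs(t) > 2 for t, _, _ in results) / n_sims
mean_r2 = sum(r2 for _, r2, _ in results) / n_sims
mean_dw = sum(dw for _, _, dw in results) / n_sims
print(f"share of regressions with |t| > 2: {share_significant:.2f}")
print(f"mean R-square: {mean_r2:.2f}, mean Durbin-Watson: {mean_dw:.2f}")
```

Although the two series are independent by construction, a large share of the regressions report a "significant" slope, and the Durbin-Watson statistic sits far below 2, which is the warning sign of serially correlated errors.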

Hypothesis Test for Regression Coefficient (t-Test)

• The regression coefficient (β₁) captures the existence of a linear relationship between the response variable and the explanatory variable.

• The null hypothesis is H₀: β₁ = 0, tested against H₁: β₁ ≠ 0. If the t-test fails to reject H₀, we conclude that there is no statistically significant linear relationship between the two variables.

• Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically significant. However, for a simple linear regression, the null and alternative hypotheses in ANOVA and the t-test are exactly the same (the F-statistic equals the square of the t-statistic), and thus there is no difference in the p-value.
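The equivalence of the two tests in simple linear regression can be seen numerically: the ANOVA F-statistic is the square of the slope's t-statistic. The sketch below uses plain Python and made-up data; the function name slope_tests is hypothetical. It computes only the test statistics, since p-values need the t and F distributions, which the standard library does not provide.

```python
def slope_tests(x, y):
    """Return (t, F): the t-statistic for the slope and the ANOVA F-statistic."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    mse = sse / (n - 2)                  # residual mean square, n - 2 degrees of freedom
    t_stat = b1 / (mse / sxx) ** 0.5     # t = b1 / SE(b1)
    f_stat = (ssr / 1) / mse             # F = MSR / MSE, with 1 model degree of freedom
    return t_stat, f_stat

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7]    # made-up data
t_stat, f_stat = slope_tests(x, y)
print(t_stat ** 2, f_stat)    # the two values agree up to rounding error
```

In multiple linear regression this identity no longer holds, because the F-test then concerns all slopes jointly while each t-test concerns a single coefficient.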

Residual analysis

• Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following:

1. The residuals are normally distributed.

2. The variance of the residuals is constant (homoscedasticity).

3. The functional form of regression is correctly specified.

4. Whether there are any outliers.
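Two of the checks listed above (constant variance and outliers) can be sketched numerically. In practice these checks are usually done with plots, such as residuals against fitted values and a normal Q-Q plot. The data, the cutoff of 3 for standardized residuals, and the split at the median of x are all assumptions made for this illustration.

```python
import statistics

def ols_residuals(x, y):
    """Residuals y_i - (b0 + b1 * x_i) from the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up data: a linear trend with small alternating noise and one gross error at x = 10.
x = list(range(1, 21))
y = [1.0 + 0.5 * xi + (0.2 if xi % 2 else -0.2) for xi in x]
y[9] += 5.0

resid = ols_residuals(x, y)
scale = statistics.stdev(resid)
standardized = [e / scale for e in resid]

# Outliers: flag points whose standardized residual exceeds 3 in absolute value.
outliers = [xi for xi, z in zip(x, standardized) if abs(z) > 3]

# Constant variance: after dropping flagged points, compare the residual spread
# for small and large x. A ratio far from 1 would hint at heteroscedasticity.
mid = statistics.median(x)
low = [e for xi, e, z in zip(x, resid, standardized) if abs(z) <= 3 and xi <= mid]
high = [e for xi, e, z in zip(x, resid, standardized) if abs(z) <= 3 and xi > mid]
var_ratio = statistics.pvariance(high) / statistics.pvariance(low)

print("flagged x values:", outliers)
print("variance ratio (high x / low x):", round(var_ratio, 2))
```

Here the single corrupted observation is flagged, and the remaining residuals show a similar spread on both halves of the x range, which is consistent with constant variance.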
