Look at the correlation matrix of all the explanatory variables. Which two explanatory variables are so highly correlated that they may give cause for concern?
          Temp     Manuf    Pop      Wind     Precip   Days
Temp      -        -0.190   -0.063   -0.350   0.386    -0.43
Manuf     -0.190   -        0.955    0.238    -0.032   0.132
Pop       -0.063   0.955    -        0.213    -0.026   0.042
Wind      -0.350   0.238    0.213    -        -0.013   0.164
Precip    0.386    -0.032   -0.026   -0.013   -        0.496
Days      -0.43    0.132    0.042    0.164    0.496    -
The number of manufacturing enterprises employing 20 or more workers (Manuf) and population size in thousands (Pop) are the pair that gives cause for concern: their correlation coefficient of 0.955 is by far the highest in the matrix.
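The same check can be done mechanically. The following is a minimal sketch, assuming the correlation matrix above is typed in by hand with the diagonal set to 1.0; the function name `most_correlated_pair` is only illustrative.

```python
from itertools import combinations

names = ["Temp", "Manuf", "Pop", "Wind", "Precip", "Days"]

# Correlation matrix copied from the table; diagonal entries set to 1.0.
corr = [
    [1.000, -0.190, -0.063, -0.350, 0.386, -0.430],
    [-0.190, 1.000, 0.955, 0.238, -0.032, 0.132],
    [-0.063, 0.955, 1.000, 0.213, -0.026, 0.042],
    [-0.350, 0.238, 0.213, 1.000, -0.013, 0.164],
    [0.386, -0.032, -0.026, -0.013, 1.000, 0.496],
    [-0.430, 0.132, 0.042, 0.164, 0.496, 1.000],
]


def most_correlated_pair(corr, names):
    """Return the pair of distinct variables with the largest |r|."""
    pairs = combinations(range(len(names)), 2)  # each pair counted once
    i, j = max(pairs, key=lambda p: abs(corr[p[0]][p[1]]))
    return names[i], names[j], corr[i][j]


pair = most_correlated_pair(corr, names)
print(pair)
```

Comparing absolute values matters here, because a strong negative correlation would be just as much a concern as a strong positive one.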
Now conduct a regression analysis using all the explanatory variables. If using StatCrunch store the standardized residuals and the predicted values. Interpret in detail your printout.
Y = 111.7285 - 1.2679 X1 + 0.06492 X2 - 0.03928 X3 - 3.1814 X4 + 0.5124 X5 - 0.05205 X6
SO2 = 111.7285 - 1.2679 Temp + 0.06492 Manuf - 0.03928 Pop - 3.1814 Wind + 0.5124 Precip - 0.05205 Days
i. Intercept
The intercept is 111.7285: the expected sulphur dioxide content of air in micrograms per cubic meter (SO2) when every explanatory variable equals zero.
ii. Average annual temperature in degrees F (Temp)
An increase in the average annual temperature by 1°F leads to a decrease in the expected SO2 of 1.2679 micrograms per cubic meter, ceteris paribus (all other factors held constant).
i.e. ΔSO2/ΔTemp = -1.2679
iii. Number of manufacturing enterprises employing 20 or more workers (Manuf)
An increase of one manufacturing enterprise leads to an increase in the expected SO2 of 0.06492, ceteris paribus.
i.e. ΔSO2/ΔManuf = 0.06492
iv. Population size in thousands (Pop)
An increase in the population size of one thousand leads to a decrease in the expected SO2 of 0.03928, ceteris paribus.
i.e. ΔSO2/ΔPop = -0.03928
v. Average annual wind speed in miles per hour (Wind)
An increase in the average annual wind speed of one mile per hour leads to a decrease in the expected SO2 of 3.1814, ceteris paribus.
i.e. ΔSO2/ΔWind = -3.1814
vi. Average annual precipitation in inches (Precip)
An increase in the average annual precipitation of one inch leads to an increase in the expected SO2 of 0.5124, ceteris paribus.
i.e. ΔSO2/ΔPrecip = 0.5124
vii. Average number of days with precipitation per year (Days)
An increase of one day in the average number of days with precipitation per year leads to a decrease in the expected SO2 of 0.05205, ceteris paribus.
i.e. ΔSO2/ΔDays = -0.05205
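The "ceteris paribus" reading of a coefficient can be illustrated numerically. The sketch below assumes the fitted equation with the signs given in the interpretations above; the city values are hypothetical and chosen only to show that raising one variable by one unit, with the others fixed, changes the prediction by exactly that variable's coefficient.

```python
# Fitted coefficients from the regression equation; signs follow the
# interpretations in the list above.
INTERCEPT = 111.7285
COEF = {
    "Temp": -1.2679,
    "Manuf": 0.06492,
    "Pop": -0.03928,
    "Wind": -3.1814,
    "Precip": 0.5124,
    "Days": -0.05205,
}


def predict_so2(city):
    """Predicted SO2 (micrograms per cubic meter) for one city."""
    return INTERCEPT + sum(COEF[name] * city[name] for name in COEF)


# Hypothetical city, not taken from the data set.
city = {"Temp": 55.0, "Manuf": 400.0, "Pop": 600.0,
        "Wind": 9.0, "Precip": 35.0, "Days": 110.0}

# Same city, one degree F warmer, everything else unchanged.
warmer = dict(city, Temp=city["Temp"] + 1.0)

effect = predict_so2(warmer) - predict_so2(city)
print(round(effect, 4))
```

The difference does not depend on the hypothetical values chosen, because the model is linear in each explanatory variable.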
This data has been collected to investigate the determinants of air pollution.
c. Examine the residual plots (or construct a residual plot of the standardized residuals and the predicted values if using StatCrunch). What do the plots indicate?
Plotting the standardized residuals against the predicted values shows which observations the model fits well (residuals scattered close to zero) and which lie far from their fitted values and therefore stand out as outliers.
d. Look at the standardized residuals for each city. You will notice from these and your plots that two cities stand out in the model fit as being outliers. Locate these cities and comment.
The cities that are outliers are Philadelphia and Pittsburgh. They have the highest standardized residuals, 2.23 and 3.61 respectively, which indicates that the model fits them much worse than it fits the other cities.
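A simple way to screen for such cities is to flag any standardized residual whose absolute value exceeds a cutoff. The sketch below assumes a cutoff of 2, a common rule of thumb rather than anything stated in the exercise; only the Philadelphia and Pittsburgh values come from the output above, and the other entries are hypothetical placeholders.

```python
# Standardized residuals by city. Only Philadelphia and Pittsburgh are
# taken from the regression output; "City A/B/C" are hypothetical.
std_resid = {
    "Philadelphia": 2.23,
    "Pittsburgh": 3.61,
    "City A": 0.41,
    "City B": -1.12,
    "City C": 0.87,
}


def flag_outliers(residuals, cutoff=2.0):
    """Return cities with |standardized residual| > cutoff, largest first."""
    flagged = [city for city, r in residuals.items() if abs(r) > cutoff]
    return sorted(flagged, key=lambda city: abs(residuals[city]), reverse=True)


outliers = flag_outliers(std_resid)
print(outliers)
```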
e. Using the information in the ANOVA table (part b) and the correlation matrix (part a) comment on if any variables should be eliminated from the regression model.
A variable can be considered for elimination when it is highly correlated with another explanatory variable. In the Air Pollution in U.S. Cities data, the number of manufacturing enterprises employing 20 or more workers (Manuf) and population size in thousands (Pop) have a correlation of 0.955, so one of the two should be eliminated. In this case, Manuf should be eliminated. A high correlation between explanatory variables (multicollinearity) inflates the standard errors of their coefficients and makes the individual estimates unreliable.
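One way to quantify how serious a correlation of 0.955 is uses the variance inflation factor, VIF = 1 / (1 - R²). The sketch below is a simplification: it uses only the pairwise correlation between Manuf and Pop, so it gives the VIF as if these were the only two explanatory variables. In the full six-variable model the VIF for either variable can only be at least this large.

```python
def pairwise_vif(r):
    """VIF implied by a single pairwise correlation r: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r ** 2)


r_manuf_pop = 0.955  # from the correlation matrix in part (a)
vif = pairwise_vif(r_manuf_pop)
print(round(vif, 2))
```

A frequently quoted rule of thumb treats a VIF above 10 as a sign of problematic multicollinearity; that threshold is a convention, not part of the exercise.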
The Air Pollution in U.S. Cities data also contain outliers: Philadelphia and Pittsburgh fall well outside the pattern of the remaining cities. In this case Pittsburgh, the city with the highest standardized residual (3.61), should be eliminated. Such a large standardized residual marks the city as an outlier, and an outlier can distort the regression equation and make the results unreliable.
IPS TEXTBOOK PROBLEMS 11.31-11.33 DATA SET ATTACHED IN BLACKBOARD
The following three exercises use the HAPPINESS data set. The World Database of Happiness is an online registry of scientific research on the subjective appreciation of life. It is available at worlddatabaseofhappiness.eur.nl and is directed by Dr. Ruut Veenhoven, Erasmus University, Rotterdam. One inventory presents the average happiness score for various nations between 2007 and 2008. This average is based on individual responses from numerous general population surveys to a general life satisfaction (well-being) question. Scores ranged between 0 (dissatisfied) to 10 (satisfied). The NationMaster Web site, www.nationmaster.com, contains a collection of statistics associated with various nations. For this data set, the factors considered are the GINI Index: measures the degree of inequality in the distribution of income (higher score = greater inequality); the degree of corruption in government (higher score = less corruption); average life expectancy; and the degree of democracy (higher score = more political liberties).
11.31 Predicting a nation's average happiness score. Consider the five statistics for each nation: LSI, the average life-satisfaction score; GINI, the GINI index; CORRUPT, the degree of corruption in government; LIFE, the average life expectancy; and DEMOCRACY, a measure of civil and political liberties.
(a) Using numerical and graphical summaries, describe the distribution of each variable (Working in the attached Excel Sheet).
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables. CORRELATIONS (Working in the attached Excel Sheet).
11.32 Building a multiple linear regression model. Let's now build a model to predict the life-satisfaction score, LSI.
Consider a simple linear regression using GINI as the explanatory variable. Run the regression and summarize the results. Be sure to check assumptions.
GINI Explanatory variable (X)
LSI Dependent variable (Y)
Y = 7.0238 - 0.02014 X
LSI = 7.0238 - 0.02014 GINI
Now consider a model using GINI and LIFE. Run the multiple regression and summarize the results. Again be sure to check assumptions.
LSI Dependent variable Y
GINI Explanatory variable X1
LIFE Explanatory variable X2
Linear Regression Equation
Y = -3.82567 + 0.028733 X1 +0.12503 X2
LSI = -3.82567 + 0.028733 Gini + 0.12503 Life
Question 1 (c)
Now consider a model using GINI, LIFE, and DEMOCRACY. Run the multiple regression and summarize the results. Again be sure to check assumptions.
LSI Dependent variable Y
GINI Explanatory variable X1
LIFE Explanatory variable X2
DEMOCRACY Explanatory variable X3
Multiple Regression Equation
Y = -3.2524 + 0.028X1 +0.1063 X2 +0.1857 X3
LSI = -3.2524 + 0.028 Gini +0.1063 Life +0.1857 Democracy
Now consider a model using all four explanatory variables. Again summarize the results and check assumptions.
Y = -2.7201 + 0.0368X1 +0.0905X2 +0.0392 X3 + 0.1855 X4
LSI = -2.7201 + 0.0368 Gini + 0.0905 Life + 0.0392 Democracy + 0.1855 Corruption
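To see what the four-variable equation implies for a single nation, it can be evaluated directly. This is a minimal sketch using the intercept of -2.7201 from the Y equation above; the input values are hypothetical and are not taken from the HAPPINESS data set.

```python
# Coefficients of the four-variable model, taken from the Y equation
# (X1 = Gini, X2 = Life, X3 = Democracy, X4 = Corruption).
INTERCEPT = -2.7201
COEF = {"Gini": 0.0368, "Life": 0.0905, "Democracy": 0.0392, "Corruption": 0.1855}


def predict_lsi(nation):
    """Predicted average life-satisfaction score for one nation."""
    return INTERCEPT + sum(COEF[name] * nation[name] for name in COEF)


# Hypothetical nation used only for illustration.
nation = {"Gini": 35.0, "Life": 75.0, "Democracy": 5.0, "Corruption": 6.0}

lsi_hat = predict_lsi(nation)
print(round(lsi_hat, 4))
```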
11.33 Selecting from among several models. Refer to the results from the previous exercise.
Make a table giving the estimated regression coefficients, standard errors, t statistics, and P-values.
             Coefficient   Standard error   t statistic   P-value
Intercept    -2.720        0.866            -3.141        0.003
Gini         0.037         0.009            3.916         0.0002
Life         0.091         0.011            8.080         1.73E-11
Democracy    0.039         0.066            0.5977        0.552
Corruption   0.186         0.050            3.680         0.0005
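The P-values in the table can be screened against a significance level to see which explanatory variables the full model supports. The sketch below assumes the conventional 0.05 level, which the exercise does not specify.

```python
# P-values for the explanatory variables in the four-variable model.
p_values = {
    "Gini": 0.0002,
    "Life": 1.73e-11,
    "Democracy": 0.552,
    "Corruption": 0.0005,
}

ALPHA = 0.05  # assumed significance level

significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("significant:", significant)
print("not significant:", not_significant)
```

A variable that fails this screen is a natural candidate to drop when suggesting an alternative model.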
Question 1 (b)
Describe how the coefficients and P-values change for the four models.
The intercept is -2.7201: the expected LSI (the average life-satisfaction score) when all four explanatory variables equal zero.
GINI, the GINI index
A one-unit increase in the GINI index leads to an increase in the expected LSI of 0.037, ceteris paribus.
i.e. ΔLSI/ΔGINI = 0.037
CORRUPT, the degree of corruption in government
A one-unit increase in the CORRUPT score (a higher score means less corruption) leads to an increase in the expected LSI of 0.186, ceteris paribus.
i.e. ΔLSI/ΔCORRUPT = 0.186
LIFE, the average life expectancy
A one-year increase in average life expectancy leads to an increase in the expected LSI of 0.091, ceteris paribus.
i.e. ΔLSI/ΔLIFE = 0.091
DEMOCRACY, a measure of civil and political liberties
A one-unit increase in the DEMOCRACY score leads to an increase in the expected LSI of 0.039, ceteris paribus.
i.e. ΔLSI/ΔDEMOCRACY = 0.039
Democracy has the highest P-value (0.552), well above the usual 0.05 level, so it is not a statistically significant predictor of LSI once the other variables are in the model. Life has the lowest P-value (1.73E-11), which makes it the most significant factor determining the average life-satisfaction score. The standard error of Life (0.011) is also small relative to its coefficient (0.091), which is consistent with its significance. From the above results, the standard errors increase with every addition of an extra explanatory variable.
Based on the table of coefficients, suggest another model. Run that model, summarize the results, and compare it with the other ones. Which model would you choose to explain LSI? Explain.
When LSI is regressed on GINI, Life, Good health and Economy, the model gives smaller standard errors and P-values, and the t-statistics are significant. Hence we can conclude that LSI (average life satisfaction score) is a function of Gini, Life, goo...