Multiple Regression — Octane Rating Prediction
--
The Octane data found in the attached spreadsheet (Octane.csv) show how three different materials in the feed stock and a composite variable describing processing conditions affect the octane rating of refined gasoline. Since higher octane is valuable to a refinery, we wish to build a multiple regression model to predict resulting octane depending on feed stock composition and processing conditions.
Generate an OLS model with all main effects included. Perform standard regression diagnostics on this model. What can you conclude?
The intercept and coefficients estimated by the linear regression model (OLS) are as follows.
Intercept 96.27422785026842
Coefficient
Material1 -0.096111
Material2 -0.126626
Material3 -0.026994
Condition 1.905263
Mean Absolute Error: 0.4355744480702062
Mean Squared Error: 0.25699821227022623
Root Mean Squared Error test: 0.506949911007218
R Squared test: 0.8969403057107406
Root Mean Squared Error train: 0.40881534712587425
R Squared train: 0.906540864451247
We can see that the test root mean squared error is 0.507, which is about 0.55% of the mean octane rating (91.84) and only slightly higher than the training RMSE (0.409).
The test R squared value is 0.897, which indicates a good fit.
We can also look at the plot of actual versus predicted values in the test set. All points lie close to the straight line with 45-degree slope. This means that the model makes reasonably accurate predictions.
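The relative error quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures reported in this write-up; the variable names are illustrative.

```python
# Figures reported above for the full model.
rmse_test = 0.506949911007218
rmse_train = 0.40881534712587425
mean_octane = 91.84  # mean octane rating quoted in the text

# Test RMSE expressed as a percentage of the mean response.
relative_rmse_pct = 100 * rmse_test / mean_octane

# Gap between test and training error; a large gap would suggest overfitting.
train_test_gap = rmse_test - rmse_train

print(f"Relative test RMSE: {relative_rmse_pct:.2f}% of mean octane")
print(f"Test minus train RMSE: {train_test_gap:.3f}")
```

A small gap between test and training RMSE, as here, is consistent with a model that generalizes about as well as it fits.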
Regression Diagnostics:
Linear relationship:
Scatter plots of the dependent variable against each independent variable show approximately linear relationships. So, there is no need to transform any variable.
Multicollinearity:
Check whether two or more independent variables are highly correlated.
There appears to be some collinearity between the independent variables Condition and Material3 (correlation > 0.7). So, there may be some redundancy between them.
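One way to follow up on a pairwise correlation like this is to convert it into a variance inflation factor (VIF). The sketch below is an illustration only: the `pearson` and `vif_lower_bound` helpers and the sample numbers are hypothetical and are not taken from Octane.csv. For two predictors with correlation r, 1 / (1 - r^2) is a lower bound on the VIF either predictor would have in the full model, because regressing one predictor on all of the others can only raise R squared.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_lower_bound(r):
    """Lower bound on the VIF implied by a pairwise correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values for two predictors (NOT from Octane.csv).
condition = [1.2, 1.5, 1.7, 2.0, 2.3]
material3 = [40.0, 52.0, 49.0, 61.0, 70.0]

r = pearson(condition, material3)
print(f"r = {r:.3f}, VIF >= {vif_lower_bound(r):.2f}")

# At the 0.7 threshold used above, the bound is about 1.96.
print(f"VIF bound at r = 0.7: {vif_lower_bound(0.7):.2f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a concern, so a correlation just above 0.7 points to moderate rather than severe collinearity, subject to checking the full-model VIF on the real data.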
Residual plot variability:
Residuals should be randomly scattered around the zero line. If there is structure in the plot, the model is not capturing something. Here we see no structure, so there is no evidence that the model is misspecified.
Normality of errors:
The residuals should be approximately normally distributed. Here the residuals appear to be normally distributed. A Q-Q plot can also be used to check this.
The diagnostic analysis shows no evidence of modelling errors, so linear regression is an appropriate model to use.
However, there is some multicollinearity in the model, which can be investigated further.
Next, generate a subset model with the least significant main effect excluded. Compare these two models using all the model comparison techniques applicable. What can you conclude?
From the OLS model fitted with statsmodels, looking at the p-values:
The Material3 coefficient has a p-value of 0.073, which is greater than the 0.05 threshold for statistical significance. So, this feature can be removed to generate a subset model.
Model 2 (subset model)
Intercept 94.22624576667697
Coefficient
Material1 -0.097790
Material2 -0.122025
Condition 2.301843
Mean Absolute Error: 0.42864602044174327
Mean Squared Error: 0.25391471478981115
Root Mean Squared Error test: 0.5038995086223156
R Squared test: 0.8981768291280285
Root Mean Squared Error train: 0.4209239432690823
R Squared train: 0.9009225915902364
Model 2 is not much of an improvement in accuracy over Model 1, which used all features (test RMSE 0.504 vs 0.507, test R squared 0.898 vs 0.897).
But we get essentially the same results with fewer features, and we have removed the redundancy.
If there is a very large amount of data, removing this feature could also improve computational performance.
We also conclude that Material3 does not contribute significantly to the octane rating at the 0.05 level. So Material3 need not be used to predict octane ratings.
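The comparison above can be laid out side by side. This sketch uses only the metrics reported for the two models. The `adjusted_r2` helper is an addition, not part of the original analysis: it penalizes R squared for the number of predictors, but it needs the number of training observations, which is not stated in this write-up, so it is defined here without being applied.

```python
# Metrics reported above for each model.
model1 = {"rmse_test": 0.506949911007218, "r2_test": 0.8969403057107406,
          "rmse_train": 0.40881534712587425, "r2_train": 0.906540864451247}
model2 = {"rmse_test": 0.5038995086223156, "r2_test": 0.8981768291280285,
          "rmse_train": 0.4209239432690823, "r2_train": 0.9009225915902364}

def adjusted_r2(r2, n, p):
    """Adjusted R squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Difference (Model 2 minus Model 1) for each reported metric.
diffs = {name: model2[name] - model1[name] for name in model1}
for name, delta in diffs.items():
    print(f"{name}: {model1[name]:.4f} -> {model2[name]:.4f} ({delta:+.4f})")
```

Once the training-set size n is known, `adjusted_r2(model1["r2_train"], n, p=4)` and `adjusted_r2(model2["r2_train"], n, p=3)` would give a comparison that accounts for the dropped predictor.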
If your goal was to produce gasoline at an octane rating of 95, pick one set of operating conditions that would do so. Make sure that this operating condition set is within the scope of the model (that is, within the ranges for each variable used to build the model).
We use Model 2, since we concluded that Material3 does not contribute significantly to the octane rating.
We can choose values of the materials and condition within the min-max range of each variable:
Material1 = 4.23 to 75.54
Material2 = 0 to 10.76
Condition = 1.19975 to 2.319090
Taking values from the above ranges and substituting them into the Model 2 equation, one set of operating conditions that gives 95 octane gasoline is:
Material1 = 15
Material2 = 6
Condition = 1.291469
Other sets of values are also possible; this is just one of them.
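The Condition value above comes from rearranging the Model 2 equation for a target octane of 95. A minimal sketch, using the Model 2 coefficients and variable ranges reported in this write-up; the function names are illustrative.

```python
# Model 2 intercept and coefficients as reported above.
INTERCEPT = 94.22624576667697
COEF = {"Material1": -0.097790, "Material2": -0.122025, "Condition": 2.301843}

# Observed min-max range of each variable (the scope of the model).
RANGES = {"Material1": (4.23, 75.54), "Material2": (0.0, 10.76),
          "Condition": (1.19975, 2.319090)}

def predict_octane(material1, material2, condition):
    """Octane rating predicted by Model 2."""
    return (INTERCEPT + COEF["Material1"] * material1
            + COEF["Material2"] * material2 + COEF["Condition"] * condition)

def condition_for_target(target, material1, material2):
    """Solve the Model 2 equation for Condition at a given target octane."""
    rest = INTERCEPT + COEF["Material1"] * material1 + COEF["Material2"] * material2
    return (target - rest) / COEF["Condition"]

def in_scope(settings):
    """True if every setting lies within the range used to build the model."""
    return all(RANGES[k][0] <= v <= RANGES[k][1] for k, v in settings.items())

condition = condition_for_target(95.0, material1=15, material2=6)
settings = {"Material1": 15, "Material2": 6, "Condition": condition}
print(f"Condition = {condition:.6f}, in scope: {in_scope(settings)}")
print(f"Predicted octane: {predict_octane(15, 6, condition):.2f}")
```

The computed Condition should agree closely with the value listed above; any difference in the last decimal places comes from the rounded coefficients. Changing `material1` and `material2` gives other valid settings, provided `in_scope` still returns True.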