Regression

Regression is one of the statistical tools used in the Multi-Vari approach and is probably the most powerful. It determines:

- The statistical significance of a relationship between a Continuous X and a Continuous Y in Y = f(X1, X2, ..., Xn).
- The nature of the relationship itself (i.e., the equation).

There are two basic forms of Regression:

- Simple Linear, which relates one Continuous Y with one Continuous X.
- Multiple Linear, which relates one Continuous Y with more than one Continuous X.
It is the statistical analysis technique used to investigate and model the relationship between the variables. For both the Simple and Multiple techniques, the model parameters are linear in nature, not quadratic or any other power. Given the sheer size of the subject and the way the tool is applied in Lean Sigma, the focus here is primarily on Simple Linear Regression.

As with all statistical tests, a sample of reality is required. Generally, 30 or more data points are required, each consisting of a value of X and the corresponding value of Y. Regression is a passive analysis tool, so the process is not actively manipulated during data capture. After the requisite number of data points has been collected, they are entered as two columns into a statistical software package and analyzed.

Analyzing the data graphically using a Fitted Line Plot shows a result similar to the example in the figure below. Here the X is "Age of Propellant" in a rocket motor and the Y is "Shear Strength" of the propellant at that age. The data points are plotted on a Scatter Plot and a straight line is fitted through them to give the best statistical fit. This is the Regression Line. There are many ways of doing this mathematically; in Regression the approach is Least Squares, which minimizes the sum of the squared vertical distances of the data points from the line.
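The Least Squares idea can be sketched in a few lines of Python. This is a minimal illustration, not the routine any particular statistical package uses: the function names `fit_line` and `sum_squared_error` are invented here, and the six data points are made up for the example (they are not the rocket propellant measurements). It fits the line with the standard closed-form formulas and then confirms that nudging the slope or the intercept away from the fitted values makes the sum of squares larger.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared X deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (hypothetical, not from the source).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [9.8, 8.1, 6.2, 3.9, 2.1, 0.2]

b0, b1 = fit_line(xs, ys)
best = sum_squared_error(xs, ys, b0, b1)

# Any other line through the same data has a larger sum of squares.
worse_slope = sum_squared_error(xs, ys, b0, b1 + 0.1)
worse_intercept = sum_squared_error(xs, ys, b0 + 0.5, b1)

print(f"fitted line: Y = {b0:.3f} + ({b1:.3f}) * X")
print(best < worse_slope and best < worse_intercept)
```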
The equation of the straight line (the Regression model) is given above the graph and is:

Shear Strength (psi) = 2628 - 37.15 × Age of Propellant (weeks)
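As a rough sketch, the model can be wrapped in a small calculator. The coefficients are the rounded values from the equation, with the slope taken as negative on the assumption that strength falls with age. The 0 to 25 week limits reflect the range of ages in the sample, so the functions refuse to answer outside it. The function names and the 2000 psi threshold in the usage lines are hypothetical, chosen only for illustration.

```python
INTERCEPT_PSI = 2628.0
SLOPE_PSI_PER_WEEK = -37.15  # assumed negative: strength falls as the propellant ages
MIN_AGE_WEEKS, MAX_AGE_WEEKS = 0.0, 25.0  # range covered by the sample data


def predict_shear_strength(age_weeks):
    """Predicted Shear Strength (psi) for an age inside the sampled range."""
    if not MIN_AGE_WEEKS <= age_weeks <= MAX_AGE_WEEKS:
        raise ValueError("age is outside the range of the data; do not extrapolate")
    return INTERCEPT_PSI + SLOPE_PSI_PER_WEEK * age_weeks


def shelf_life_weeks(min_strength_psi):
    """Age at which the predicted strength drops to the given minimum."""
    age = (min_strength_psi - INTERCEPT_PSI) / SLOPE_PSI_PER_WEEK
    if not MIN_AGE_WEEKS <= age <= MAX_AGE_WEEKS:
        raise ValueError("threshold is reached outside the range of the data")
    return age


print(predict_shear_strength(10))   # predicted strength at 10 weeks
print(shelf_life_weeks(2000.0))     # hypothetical 2000 psi minimum
```

Raising an error outside the sampled range is a design choice for this sketch; it simply makes the "no predictions beyond the data" rule hard to overlook.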
Thus, in the future, for any Age of Propellant from 0 to 25 weeks, it is possible to predict the physical property Shear Strength for that propellant. Also, if the Shear Strength had to be maintained above a certain level for the propellant to perform correctly, it is possible to calculate a would-be shelf life for the propellant based on the model. There is no data outside of this timeframe, so no predictions should be made beyond 25 weeks.

In the top right of the Fitted Line Plot are three statistics. These are in fact only three of many that are available from the full analysis results, which are shown in the figure below. The analysis shows the same equation (model) representing the relationship between Y and X.

Analysis results for the Rocket Propellant example.

Regression Analysis

The regression equation is
Shear Strength (psi) = 2628 - 37.2 Age of Propellant (weeks)

| Predictor | Coef | St Dev | T | P |
|---|---|---|---|---|
| Constant | 2627.82 | 44.18 | 59.47 | 0.000 |
| Age of P | -37.154 | 2.889 | -12.86 | 0.000 |

S = 96.11   R-Sq = 90.2%   R-Sq(adj) = 89.6%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 1527483 | 1527483 | 165.38 | 0.000 |
| Residual Error | 18 | 166255 | 9236 | | |
| Total | 19 | 1693738 | | | |
Source: SBTI's Lean Sigma Methodology training material.

For the constant term and for each X in the model there is a p-value indicating whether that term is significantly non-zero. Both have a p-value reported as 0.000, which indicates that there would be a small (almost zero) chance of getting coefficients this large (far from zero) purely by random chance. Specifically:

- The p-value for the constant indicates that the Y-intercept is not equal to zero.
- The p-value for the Age of Propellant indicates that the slope of the Regression line is not equal to zero.

The statistics on the center row are the same as those listed on the Fitted Line Plot in the figure above:
- S is the standard deviation of the variation not explained by the model, known as the Residuals. It is the spread of the data around the Regression line.
- R-Sq (R²) is the proportion of the variation in the data that is explained by the model. It is calculated from the ANOVA table at the bottom of the figure above by the equation SS(Regression) / SS(Total). Here 90.2% (calculated as 1527483 ÷ 1693738) of all the variability in the sample data is explained by the model.
- R-Sq(adj) is an indicator of whether any redundant (non-contributing) terms have been included in the model. If R-Sq(adj) falls well below the R-Sq value, then there are redundant terms. Here the two are close, so the conclusion should be that all the terms used in the model actually contribute something.
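The three statistics can be reproduced from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses only numbers quoted in the analysis results. The R-Sq(adj) formula is the usual textbook definition (one minus the ratio of residual to total mean squares); the text itself does not spell it out, so treat that line as an assumption about how the software arrived at 89.6%.

```python
import math

# Values copied from the Analysis of Variance table.
ss_regression = 1527483.0
ss_residual = 166255.0
ss_total = 1693738.0
df_residual = 18
df_total = 19

# S: standard deviation of the Residuals.
s = math.sqrt(ss_residual / df_residual)

# R-Sq: share of the total variation explained by the model.
r_sq = ss_regression / ss_total

# R-Sq(adj): assumed standard definition, penalizing extra model terms.
r_sq_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

print(f"S = {s:.2f}")
print(f"R-Sq = {r_sq:.1%}")
print(f"R-Sq(adj) = {r_sq_adj:.1%}")
```

Rounded to the precision of the printout, the results agree with S = 96.11, R-Sq = 90.2%, and R-Sq(adj) = 89.6%.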
The bottom table is an ANOVA (Analysis of Variance) table. The ANOVA table breaks the variation into two main pieces: the variation explained by the model (Regression) and the variation left unexplained (Residual Error). The calculation of these is shown graphically in the figure below:

- The Mean of the Y data is calculated and represented by the dashed horizontal straight line in the figure.
- The Total Variation (Source Total in the ANOVA table) is calculated by taking the square of the distance of each data point from the mean and then summing all the squares. SS(Total) = 1693738 is calculated this way.
- The Residual Error is calculated by taking the square of the distance of each data point from the Regression Line and then summing all the squares. It is the bit left over after the line has been fitted. SS(Residual Error) = 166255 was calculated this way.
- The variation explained by the model, known as the Regression, is calculated by taking, for every data point, the square of the distance between the value the line predicts and the mean, and then summing all the squares.
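The arithmetic inside the ANOVA table can be checked directly. This short sketch uses only the SS and DF values from the table; the dictionary layout and variable names are just a convenient choice for the example. It confirms that the Regression and Residual Error pieces add up to the Total, and that dividing each SS by its DF gives the MS column, whose ratio is the F value shown in the table.

```python
# SS and DF columns copied from the Analysis of Variance table.
ss = {"Regression": 1527483, "Residual Error": 166255, "Total": 1693738}
df = {"Regression": 1, "Residual Error": 18, "Total": 19}

# The two pieces account for all of the variation, and for all of the DF.
pieces_add_up = ss["Regression"] + ss["Residual Error"] == ss["Total"]
df_add_up = df["Regression"] + df["Residual Error"] == df["Total"]

# Mean Square = Sum of Squares / Degrees of Freedom.
ms_regression = ss["Regression"] / df["Regression"]
ms_residual = ss["Residual Error"] / df["Residual Error"]

# F is the ratio of the two mean squares.
f_ratio = ms_regression / ms_residual

print(pieces_add_up, df_add_up)
print(f"MS(Residual Error) = {ms_residual:.0f}")
print(f"F = {f_ratio:.2f}")
```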
From the preceding calculations it is possible to calculate a signal-to-noise ratio based on the size of the Regression (the signal) versus the background noise (Residual Error). This is the F-test in the table. Here the value of F is 165.38, which means the size of the signal due to the X is 165.38 times greater than the background noise.

The software then looks up the F value in a statistical table to discover the likelihood of seeing a difference of this magnitude. The likelihood is the p-value, in this case 0.000. The p-value indicates the likelihood of seeing a relationship this strong in the data sample purely by random chance; that is, when there is no relationship at the population level and the apparent relationship arose by coincidence in selecting the sample from the population. As in most statistical tests, the p-value is associated with a pair of hypotheses. For Regression:

- Ho: Y is independent of X.
- Ha: Y is dependent on X.

If the p-value is less than 0.05 (as in this example), then the null hypothesis Ho should be rejected and the conclusion is that the Y is dependent on the X. Belts are sometimes misled at this point into assuming that there is a direct causal relationship between the X and the Y. There might be, but a change in X does not necessarily directly cause Y to move. The statistically correct explanation here is that when X moves 1 unit, Y moves by some consistent associated amount.

The analysis is not complete until the model adequacy is validated, which is done by reviewing the quality of the fit and investigating the variation that has not been explained, the Residuals (the bit left over). Residual evaluation gives a warning sign that the generated model might not be adequate or appropriate. Looking at the figure above, the residual is the actual value minus the fitted value, and it can be positive or negative depending on whether the data point is above or below the line.
There are several measures of model adequacy with respect to the Residuals:

- The sum of the Residuals = 0
- The Residuals have a constant variance
- The Residuals are normally distributed
- The Residuals are in control
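The first of these measures, and the definition of a Residual as actual minus fitted, can be illustrated with a short sketch. The data below are made up for the example (they are not the rocket propellant measurements) and `fit_line` is a hypothetical helper using the standard least-squares formulas. For a line fitted with an intercept, the Residuals sum to zero apart from floating-point rounding, and the Total sum of squares splits into the Regression and Residual Error pieces.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


# Made-up illustrative data (hypothetical, not from the source).
xs = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
ys = [25.1, 23.8, 21.2, 20.9, 18.4, 17.7, 15.0]

b0, b1 = fit_line(xs, ys)
fits = [b0 + b1 * x for x in xs]

# Residual = actual value minus fitted value (positive when above the line).
residuals = [y - fit for y, fit in zip(ys, fits)]
residual_sum = sum(residuals)

# Sums of squares, computed the way the ANOVA table describes them.
y_mean = sum(ys) / len(ys)
ss_total = sum((y - y_mean) ** 2 for y in ys)
ss_residual = sum(r ** 2 for r in residuals)
ss_regression = sum((fit - y_mean) ** 2 for fit in fits)

print(abs(residual_sum) < 1e-9)
print(abs(ss_total - (ss_regression + ss_residual)) < 1e-9)
```

The other three measures (constant variance, normality, and control) are judged from plots or formal tests rather than from a single sum.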
To validate model adequacy it is useful to examine the residuals graphically. To determine if the Residuals are Normal, a few options are available:

- A Probability Plot can be applied as shown in the figure below. Residuals on the Normal Plot should form a straight line.
- A Histogram can be applied as shown in the figure. The Histogram should appear to form a normal curve. This can be hit-and-miss and should be used in conjunction with the Normal Probability Plot.
- A Normality Test can be applied to the Residuals to gain a p-value. This is by far the best approach.
To determine if the Residuals are in Control, an Individuals Chart can be applied to the Residuals as shown in Graph C. Residuals that appear out of control should be studied further. Possible out-of-control issues might include Measurement System error, incorrect data entry, or an unusual process event. In the case of the latter, the Team should consult any notes taken during the data collection to evaluate the impact of the process event.

To determine if the Residuals have constant variance and to show that they are random (just background noise), a Residuals versus Fits Plot can be applied, as shown in Graph D. The Residuals should be distributed randomly across the Plot; any obvious patterns could indicate model inadequacy, as described in the table below.

Interpretation of the Residuals versus Fits Plot

| Pattern | Interpretation |
|---|---|
| Residuals are contained in a straight band with no obvious pattern in the graph. | The model is adequate. |
| Residuals show a funnel pattern. The variance of the errors is not constant and increases as Y increases. | The model is inadequate. This might be resolved by transforming the Y. |
| Residuals show a parabolic or quadratic pattern. | The model is inadequate. This might be resolved with a higher order model (quadratic, for example). |
| Residuals show a bow pattern. The variance of the errors is not constant. | The model is inadequate. This might be resolved by transforming the Y. |
If there are patterns in the Residuals and the R² value is very high, it probably presents no problem; however, if, for example, R² is less than 80%, then there might be an opportunity to create a better model based on the paths recommended in the table.

After the model is deemed to be adequate, the Team should collectively draw practical conclusions from it and present them back to the Process Owner and the Champion.

Roadmap

The roadmap to conducting a Regression analysis is as follows:

| Step | Description |
|---|---|
| Step 1. | Plan the study. Identify the Ys and Xs to be considered. For each Y (preferably both the Xs and Ys) verify the Measurement System using a Gage R&R Study. Agree on the data collection approach and assign responsibilities to the Team members. |
| Step 2. | Pilot data collection. Validate the data collection approach as created in Step 1. Modify and retest if necessary. |
| Step 3. | Collect the data, carefully following the agreed data collection approach. Take copious notes of process conditions and record any unusual process events. Transfer the data promptly into electronic form and make backup copies. |
| Step 4. | Analyze the data: create the Fitted Line Plot; evaluate the significance of R² and the p-values; check the Residuals to validate model adequacy. |
| Step 5. | Formulate practical conclusions from the analysis, including potential follow-on studies. |
Interpreting the Output

Regression in its Simple Linear form is quite straightforward to apply. There are, however, as with all tools, several pitfalls that can cause Belts problems:

- The purpose of a model is to predict the behavior of the response Y based on the predictor X. However, if the X itself cannot be predicted, then the model is useless. An example of this might be a desire to predict the maximum daily load on an electric power generation system from a maximum daily temperature model. The accuracy and usefulness of the Regression model for electric load prediction is conditional on the forecast of the temperature, the accuracy of which is patchy at best.
- Regression is an interpolation technique, not an extrapolation technique. Predictions from Regression models are made with confidence only within the confines of the data. If no data has been taken in an operating region, the model is hit-and-miss at best. To remedy this, data points should be taken over the breadth of the region in which predictions are made.
- Single points can heavily affect Regression models. In Graph A of the figure below, the single outlier dramatically reduces the R² value of the model. If the outlier is a bad value, then the model estimates are wrong and the error is inflated. However, if the outlier is a real process value, it should not be removed. It is a useful piece of data for the process. Refer to notes taken during data collection to understand the point and, if possible, try to recreate it. In Graph B the single outlier increases the R² and the regression coefficient. In this case, evaluate the model with and without the point to determine its effect. If the R² value changes greatly during this analysis, then that value is too influential. Conduct other data runs near that point to lower its leverage and confirm its validity.
- Regression models should represent meaningful relationships. Take for example the relationship shown in the figure below. Data about a town showed that as the population density of storks increased, so did the town's population. As much as I'd like to believe this relationship, it could equally be the reverse, mundane scenario: as the town's population increases, there are more chimneys (nesting grounds for storks), so the stork population can increase accordingly.

Source: SBTI's Lean Sigma Methodology training material.