Multivariate statistical methods can be used to monitor process variables and predict final product quality at an early stage, while also providing deeper understanding of the process. This allows engineers and production managers to optimize their processes, ultimately leading to significant cost and time savings.
This white paper includes a background and explanation of some key multivariate methods, as well as examples of how to interpret typical multivariate plots. It uses a real-world example from a paper manufacturing company that was able to improve a key quality parameter, Print Through, by better understanding the variables impacting it.
About Camo Analytics
Founded in 1984, Camo Analytics is a recognized leader in multivariate data analysis, a powerful set of data mining techniques that help identify patterns and understand the relationships between variables in large and complex data sets.
Our software is used by many of the world’s leading companies, universities and research institutes in the life sciences, food & beverage, agriculture, energy, oil & gas, mining & metals, industrial manufacturing, pulp & paper, automotive, aerospace and technology sectors.
Unscrambler software is the preferred choice of engineers, scientists and data analysts because of its ease of use, world-leading analytical tools and data visualization. Our solutions are used by more than 25000 people in 3000 organizations to analyze data, monitor process or equipment performance and build better predictive models. This gives them valuable insights to make more informed decisions, improve market segmentation, research & development, manufacturing processes and product quality.
Introduction to Multivariate Process Monitoring and Control
Multivariate statistical process monitoring (MSPM) – also referred to as multivariate statistical process control or MSPC – is a valuable tool for ensuring reliable product quality in the process industry.
However, many organizations today are still not fully utilizing its potential to make significant improvements in their production environment. The MSPM approach to process monitoring involves the use of multivariate models to simultaneously capture the information from anywhere between two and several thousand process variables.
The methodology provides major benefits for process engineers and production managers, including:
- Increased process understanding
- Early fault detection
- On-line prediction of quality
- Process optimization
With MSPM approaches, it is possible to monitor the data at the final product quality stage, but also all of the available variables at different stages of the process, to identify underlying systematic variations in the process.
The variables measured in a process are often correlated to a certain extent, for example when several temperatures are measured in a distillation column. This means that the events or changes in a process can be visualized in a smaller subspace that may give a direct chemical or physical interpretation. If we want to keep such a process “in control”, traditional univariate control charts may not assure this efficiently because of the covariance or interaction between variables. Because univariate analysis examines each variable’s relationship to the response one at a time, it does not reveal the patterns that exist among the variables simultaneously, and these patterns are vital for both interpretation and prediction in industrial processes.
In many processes, the variables have important interactions affecting the outcome (e.g. final product quality) which cannot be detected by traditional univariate statistical process control charts. Figure 1 shows a typical situation: two process variables are both inside their univariate control limits (given as two standard deviations), so the univariate charts fail to detect that the usual correlation between the two variables is broken for the sample shown in red.
Figure 1. Comparing univariate and multivariate views of a simple process involving only two variables, temperature and pH. The process appears to be within specification limits when examining two separate univariate control charts (temperature control chart and pH control chart). When switching to a multivariate view, however, a fault in the process can be clearly observed outside the limits.
- Only with multivariate analysis can the fault be detected
- The univariate limits are too wide to detect a multivariate fault
- The two variables under consideration are not independent
- The “sweet spot” is defined by the ellipse
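The situation in Figure 1 can be reproduced numerically. The following Python sketch uses simulated temperature and pH values (hypothetical numbers, not data from this paper) and assumes that a chi-square based 95% ellipse is an acceptable approximation of the multivariate limit. It checks one suspect sample against ±2 standard deviation univariate limits and against the ellipse.

```python
import numpy as np

# Hypothetical in-control data: pH rises with temperature (strong correlation).
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(80.0, 2.0, n)
ph = 7.0 + 0.1 * (temperature - 80.0) + rng.normal(0.0, 0.05, n)
X = np.column_stack([temperature, ph])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# 95% limit for the squared Mahalanobis distance with two variables:
# the chi-square quantile with 2 degrees of freedom is -2*ln(1 - 0.95), about 5.99.
D2_LIMIT = -2.0 * np.log(1.0 - 0.95)

def check(sample):
    """Return (inside univariate limits?, inside 95% ellipse?, squared distance)."""
    z = (sample - mean) / std
    univariate_ok = bool(np.all(np.abs(z) <= 2.0))
    d = sample - mean
    d2 = float(d @ cov_inv @ d)
    return univariate_ok, bool(d2 <= D2_LIMIT), d2

# Suspect sample: high temperature together with low pH.
# Each value is only 1.5 standard deviations from its mean,
# but the combination contradicts the usual correlation.
suspect = mean + np.array([1.5, -1.5]) * std
# Typical sample: both values move together, as the correlation predicts.
typical = mean + np.array([1.0, 1.0]) * std

for label, sample in [("suspect", suspect), ("typical", typical)]:
    uni_ok, multi_ok, d2 = check(sample)
    print(f"{label}: univariate ok={uni_ok}, multivariate ok={multi_ok}, D2={d2:.1f}")
```

Both samples pass the two univariate charts, but only the typical sample lies inside the ellipse. With an estimated covariance matrix and few samples, an F-based limit would be slightly wider than the chi-square value used here.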
Multivariate data analysis (MVA) is the analysis of more than one variable at a time. Essentially, it is a tool to find patterns and relationships between several variables simultaneously. It lets us predict the effect a change in one or more variables will have on other variables. Multivariate analysis methods include exploratory data analysis (data mining), classification (e.g. cluster analysis), regression analysis and predictive modelling.
Univariate analysis is the simplest form of quantitative (statistical) analysis. The analysis is carried out with the description of a single variable and its attributes of the applicable unit of analysis. Univariate analysis is also used primarily for descriptive purposes, while multivariate analysis is geared more towards explanatory purposes. (Source: Wikipedia)
Common Multivariate Methods and Statistics
The most frequently applied multivariate methods are Principal Component Analysis (PCA) and Partial Least Squares (PLS) Regression.
PCA answers the question “Is the process under control?” but does not provide a quantitative model for the final product quality. Typical applications of PCA for this purpose are raw material identification and on-line testing of product quality.
In addition to the monitoring aspect, PLS Regression also provides quantitative prediction of the final product quality based on all or a subset of the process variables. One vital benefit in this context is a reduction in off-line laboratory work: predictions are available at an early stage, when product properties cannot yet be measured on-line, and labour-intensive analyses are reduced.
Critical statistical limits can be derived from the empirical data chosen to establish a model for when the process is under control. One limit is based on the space defined by the model, the so-called Hotelling’s T² statistic. This statistic indicates whether the concentration of the quality variable of interest is too high or too low. The other limit is based on the distance to the model; a sample that exceeds it contains something new, e.g. a change in the raw material.
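The two kinds of limit can be illustrated with a small Python sketch. It is not the procedure used by any particular software: the data are simulated, the function names are hypothetical, and the limits are taken as the empirical 95th percentile of the calibration values rather than from an F-test. The first statistic measures how extreme a sample is inside the model space, the second (here called Q) is the squared residual distance to the model.

```python
import numpy as np

def monitor(model, X):
    """Return Hotelling's T2 and the squared residual distance Q for each row of X."""
    Xc = np.atleast_2d(X) - model["mean"]
    T = Xc @ model["P"]                      # scores: position inside the model space
    t2 = np.sum(T**2 / model["score_var"], axis=1)
    residual = Xc - T @ model["P"].T         # part of the sample the model cannot describe
    q = np.sum(residual**2, axis=1)
    return t2, q

def fit_pca_monitor(X_cal, n_components, percentile=95.0):
    """Build a PCA model on in-control data and derive empirical limits (an assumption)."""
    mean = X_cal.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    model = {
        "mean": mean,
        "P": Vt[:n_components].T,
        "score_var": s[:n_components] ** 2 / (X_cal.shape[0] - 1),
    }
    t2, q = monitor(model, X_cal)
    model["t2_limit"] = np.percentile(t2, percentile)
    model["q_limit"] = np.percentile(q, percentile)
    return model

# Hypothetical calibration data: four variables driven by two underlying sources.
rng = np.random.default_rng(4)
mixing = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
X_cal = rng.normal(size=(200, 2)) @ mixing + rng.normal(scale=0.1, size=(200, 4))
model = fit_pca_monitor(X_cal, n_components=2)

new = np.array([[5.0, 5.0, 0.0, 0.0],     # extreme, but follows the known pattern
                [1.0, -1.0, 0.0, 0.0]])   # breaks the pattern seen in calibration
t2, q = monitor(model, new)
for i in range(len(new)):
    print(f"sample {i}: T2 high={t2[i] > model['t2_limit']}, Q high={q[i] > model['q_limit']}")
```

The first new sample exceeds only the limit inside the model space, while the second exceeds only the distance-to-model limit, which is the signature of "something new".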
Multivariate statistical methods are also excellent tools for developing processes further. With these methods we can look inside a process to gain the information needed to optimize it.
Principal Component Analysis (PCA)
PCA is a method for analyzing variability in data. It does this by separating the data into principal components (PCs). Each PC contributes to explaining the total variability, with the first PC describing the greatest source of variability. The goal is to describe as much of the information in the system as possible in the fewest number of PCs; whatever is left can be attributed to noise, i.e. no information. Maps of samples (scores) and variables (loadings) give valuable information about the underlying data structures.
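As a minimal sketch of this decomposition, the following Python code computes scores, loadings and the fraction of variability explained by each PC using a singular value decomposition of the mean-centred data. The data are simulated (five variables generated from two hidden sources plus noise) and the function name is hypothetical; scaling of the variables, which is often applied in practice, is left out.

```python
import numpy as np

def pca(X, n_components):
    """Mean-centre X and split it into scores, loadings and explained variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # map of samples
    loadings = Vt[:n_components].T                    # map of variables
    explained = s**2 / np.sum(s**2)                   # fraction of total variability per PC
    return scores, loadings, explained[:n_components]

# Hypothetical data: 100 samples, 5 variables, 2 underlying sources of variation.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2)) * np.array([3.0, 1.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(100, 5))

scores, loadings, explained = pca(X, n_components=2)
print("explained variability per PC:", np.round(explained, 3))
print("total explained by 2 PCs:", round(float(explained.sum()), 3))
```

Because the five simulated variables are driven by only two sources, two PCs capture almost all of the variability and the remainder is noise.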
Partial Least Squares Regression (PLSR)
A method for relating the variations in one or several response variables (Y-variables) to the variations of several predictors (X-variables), with explanatory or predictive purposes.
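For a single response variable, PLS regression can be sketched with the classical NIPALS algorithm. The Python code below is an illustrative implementation on simulated data, not the implementation used in any commercial package; the function name, the number of factors and the data are assumptions made for the example.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit a single-response PLS model (PLS1, NIPALS); return coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # centred X and y, deflated factor by factor
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)               # weight: X-direction with maximum covariance to y
        t = E @ w                            # scores for this factor
        tt = t @ t
        p = E.T @ t / tt                     # X-loadings
        qa = f @ t / tt                      # y-loading
        E = E - np.outer(t, p)               # remove the modelled part from X
        f = f - qa * t                       # and from y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)      # regression coefficients in original X units
    b0 = y_mean - x_mean @ b
    return b, b0

# Hypothetical data: six correlated X-variables and one response driven by two sources.
rng = np.random.default_rng(2)
latent = rng.normal(size=(60, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + rng.normal(scale=0.05, size=(60, 6))
y = 2.0 * latent[:, 0] - latent[:, 1] + rng.normal(scale=0.1, size=60)

# Calibrate on the first 45 samples, predict the remaining 15.
b, b0 = pls1_fit(X[:45], y[:45], n_factors=2)
y_pred = X[45:] @ b + b0
rmsep = np.sqrt(np.mean((y[45:] - y_pred) ** 2))
print(f"RMSEP on held-out samples: {rmsep:.3f}")
```

Holding back samples for prediction mirrors the calibration and test split described for the industry example; the number of factors would normally be chosen by validation rather than fixed in advance.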
Industry Example: Background and Data
A paper producer monitors the quality of newsprint by applying ink to one side of the paper. By measuring the reflectance of light on the reverse side of the paper, a reliable, practical measure of how visible the ink is on the opposite side is obtained. This property, Print Through, is an important quality parameter. The paper is also analyzed with regard to several other product variables and raw material variables.
The data used in this example is taken from a real-world paper manufacturing process. Samples were collected from the production line over a considerable period of time to ensure the measurements would capture the important variations in production.
A model is a mathematical equation summarizing variations in a data set. Models are built so that the structure of a data table can be understood better than by just looking at all raw values. Statistical models consist of a structure part and an error part. The structure part (information) is intended to be used for interpretation or prediction, and the error part (noise) should be as small as possible for the model to be reliable.
Calibration is the stage of data analysis where a model is established with the available data, so that it describes the data as well as possible. It is imperative that the choice of model is based on model validation and not on the best numerical fit.
After calibration, the variation in the data can be expressed as the sum of a modelled part (structure) and a residual part (noise). ‘Calibration samples’ are the samples on which the calibration is based. The variation observed in the variables measured on the calibration samples provides the information that is used to build the model. If the purpose of the calibration is to build a model that will later be applied on new samples for prediction, it is important to collect calibration samples that span the variations expected in the future prediction samples.
Predictions are performed by collecting new samples, obtaining the values for the variables with the appropriate sensors in the same way as in the calibration stage, and applying the model to give a prediction (estimate) of the product quality. The multivariate methods also have diagnostics for detecting outliers at the prediction stage.
The data consists of 66 samples with 15 process and product attribute variables and the response variable, Print Through. In this case, 16 of the samples were test samples used for prediction using the model based on the calibration data of 50 samples. The process variables are given in Table 1.
Table 1. Variables and product attributes in a paper manufacturing process which determine quality.
The purpose was to establish a model that could be used for quality control and production management. The objectives were:
- Predict quality from the process variables and other product variables
- Rationalize the quality control process by reducing the number of variables measured, i.e. build a model that includes a subset of variables without losing the underlying variability
Using Unscrambler multivariate analysis software, a PLS regression model was run with 50 calibration samples and the 15 process and product variables with Print Through as the response variable. As mentioned above, an important aspect of multivariate modelling is that the dimensionality of the process is typically lower than the number of process variables measured i.e. there is a redundancy among the observed variables.
This is exemplified in the scores plot in Figure 2 which summarises the model in the two underlying dimensions (“factors” or “latent variables”) for the 15 original process variables. Therefore, rather than plotting the individual variables in one, two or three dimensions, the process can be visualized as a map of the samples in the latent variable space, the scores plot. The corresponding loadings plot (not shown) visualizes the relationships between all variables.
Figure 2. Scores plot for factors 1 and 2 in a paper manufacturing process. The scores plot is used to visualize the samples based on all variables. Samples (dots) are evenly and widely scattered, indicating no clear groupings or outliers, a positive result in this instance. The direction of the process timeline can be seen from left to right as shown in Figure 3. The ellipse defines the 95% confidence limit.
The Scores plot
A scores plot represents each sample in the space defined by the selected principal components. Scores can be plotted as line plots for describing sample trends, or as 2D or 3D scatter plots for defining trends and visualizing clusters.
Alternatively one may visualise the change in the process over time as a one-dimensional scores plot if such a clear trend exists, as shown in Figure 3. A line plot of scores can be used to visualize trends and developments in a process over time.
Figure 3. Line plot of factor 1 scores for the resulting PLS regression model. The upward trend indicates a change over time towards higher scores. In this case this corresponds to higher quality.
The overall importance of the process variables is most easily depicted in terms of the model coefficients as shown in Figure 4. Weighted regression coefficients show which of the variables have a significant impact on the model in terms of the final product quality.
Figure 4. Weighted regression coefficients resulting from the PLS regression model of paper parameters to predict the quality parameter Print Through. The bars with the diagonal lines are those that have a significant relationship to Print Through. The model shows that Weight, Scatter, Opacity and Filler have an inverse relationship to Print Through (the lower the weight the higher the Print Through), while Print Through increases with increased Brightness. Where the bars are blue, there is no significant relationship, or there is high uncertainty, as indicated by the ‘I’ shaped confidence interval lines.
While validating the model robustness, a 95% confidence interval is estimated for the coefficient of each variable, thus indicating which variables are important. The practical benefit of this is that if many variables describe the model in the same way, it is not necessary to measure all of them. Of course, one may decide to continue monitoring all the variables but not to use them for prediction if the parsimonious (simplest adequate) model is better for that purpose.
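The idea of judging each variable by the uncertainty of its coefficient can be sketched in Python. This illustration deliberately swaps in ordinary least squares and a leave-one-out jack-knife in place of the PLS-based uncertainty estimate, whose exact procedure is not described in this paper. The data are simulated, the function name is hypothetical, and an interval of ±2 standard errors is used as a rough stand-in for a 95% confidence interval.

```python
import numpy as np

def jackknife_coefficients(X, y):
    """Return least-squares coefficients and leave-one-out jack-knife standard errors."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])     # intercept column followed by the predictors
    full = np.linalg.lstsq(A, y, rcond=None)[0][1:]
    # Refit the model n times, each time leaving one sample out.
    loo = np.array([
        np.linalg.lstsq(np.delete(A, i, axis=0), np.delete(y, i), rcond=None)[0][1:]
        for i in range(n)
    ])
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean(axis=0)) ** 2, axis=0))
    return full, se

# Hypothetical data: 50 samples, 4 variables, only x0 and x2 affect the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=50)

coef, se = jackknife_coefficients(X, y)
lower, upper = coef - 2.0 * se, coef + 2.0 * se
significant = (lower > 0) | (upper < 0)      # interval does not include zero
for j in range(X.shape[1]):
    print(f"x{j}: {coef[j]:+.2f} [{lower[j]:+.2f}, {upper[j]:+.2f}] significant={significant[j]}")
```

Variables whose interval includes zero are candidates for removal from a reduced model, although they may still be worth monitoring.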
From the results of the first model, a reduced model with only five variables was chosen for on-line prediction using the variables which were shown to be significant:
- Weight
- Brightness
- Scatter
- Opacity
- Filler
Using Unscrambler Process Pulse real-time process monitoring software, process operators or engineers can view interactive plots during the prediction stage which give insight into any changes in the process. The predicted Print Through is shown in real time against its upper and lower control limits (Figure 5), alongside the Hotelling’s T² statistic and its critical limit (Figure 6).
Hotelling’s T² statistic
A linear function of the leverage that can be compared to a critical limit according to an F-test. This statistic is useful for the detection of outliers at the modelling or prediction stage. The ‘Hotelling’s T² ellipse’ is a 95% confidence ellipse which can be included in scores plots and reveals potential outliers, lying outside the ellipse.
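The relationship between leverage and Hotelling’s T² can be written out explicitly. The following is a sketch under stated assumptions: a mean-centred model with A factors calibrated on N samples, scores t_ia for sample i on factor a, and the common definition of leverage h_i that includes the 1/N term. The critical limit shown is one commonly quoted F-based form for calibration samples; the exact expression used by the software is not given in this paper.

```latex
\begin{align*}
h_i &= \frac{1}{N} + \sum_{a=1}^{A} \frac{t_{ia}^{2}}{\mathbf{t}_a^{\top}\mathbf{t}_a},
\qquad
s_a^{2} = \frac{\mathbf{t}_a^{\top}\mathbf{t}_a}{N-1}, \\[4pt]
T_i^{2} &= \sum_{a=1}^{A} \frac{t_{ia}^{2}}{s_a^{2}}
        = (N-1)\left(h_i - \frac{1}{N}\right), \\[4pt]
T_{\mathrm{crit}}^{2} &= \frac{A\,(N-1)}{N-A}\; F_{\alpha}(A,\; N-A).
\end{align*}
```

A sample whose T² exceeds the critical value at α = 0.05 lies outside the 95% ellipse in the scores plot.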
Multivariate methods allow quality variable values, predicted using a model based on all five input variables listed above, to be represented relative to simple upper and lower limits.
Figure 5. Predicted values for Print Through. Sample 14 is out of specification (above the red critical limit line), indicating a problem with the process. See Figure 7 for further explanation.
Figure 6. Univariate representation of the confidence ellipse (Hotelling’s T² statistic critical limits) shown in Figure 2.
Using Unscrambler Process Pulse, when a new sample falls outside the critical limit, the process operator or engineer can simply click on the suspect data point in the plot to immediately see which variable is outside the limits as defined by the calibration (Figure 7).
The diagnostics in Process Pulse allow the process operator to ‘drill down’ and identify the specific variables which are out of limit for an individual sample. After drilling down into sample 14 (Figure 7), the process operator can see that Opacity and Scatter are outside the range defined by the minimum and maximum limits from the calibration, illustrated by the red dots below the blue min and max lines.
Figure 7. Real-time drill down for Sample 14. Moving from the multivariate view to a univariate view for an individual sample makes it easy to detect why it falls outside the critical multivariate limits. In this case, the sample is outside the minimum limit for scatter and opacity.
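The logic of such a drill-down can be sketched in a few lines of Python. This is only an illustration of comparing one sample with the calibration range of each variable, not the Process Pulse implementation; the function name and all numeric values are hypothetical.

```python
import numpy as np

VARIABLES = ["Weight", "Brightness", "Scatter", "Opacity", "Filler"]

def drill_down(calibration, sample, names=VARIABLES):
    """Report which variables of a sample fall outside the calibration min/max range."""
    low, high = calibration.min(axis=0), calibration.max(axis=0)
    report = {}
    for name, value, lo, hi in zip(names, sample, low, high):
        if value < lo:
            report[name] = "below minimum"
        elif value > hi:
            report[name] = "above maximum"
    return report

# Hypothetical calibration values (rows = samples, columns = VARIABLES).
calibration = np.array([
    [45.0, 60.0, 50.0, 90.0, 8.0],
    [47.0, 62.0, 55.0, 92.0, 9.0],
    [49.0, 61.0, 53.0, 94.0, 10.0],
])
# Hypothetical new sample flagged by the multivariate limits.
sample = np.array([46.0, 61.0, 48.0, 88.0, 9.0])

print(drill_down(calibration, sample))
# {'Scatter': 'below minimum', 'Opacity': 'below minimum'}
```

A sample can also violate the multivariate limits while every variable stays inside its own range, so an empty report does not by itself mean the sample is normal.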
In this case, the paper manufacturer was able to:
- Identify the critical process variables affecting Print Through
- Implement real-time process monitoring enabling them to fix possible failures at an early stage
- Optimize the process using their newfound understanding of its behaviour
- Improve end product quality and reduce scrap, re-work and energy costs
Better process understanding can result in major cost and efficiency improvements relatively quickly. When combined with the knowledge and experience of in-house production teams, multivariate data analysis tools can give manufacturers much deeper insights into process behavior than traditional univariate statistics. These tools can be applied from the initial data analysis stage through to optimization of critical process parameters, and the knowledge gained can be transferred to other processes (Figure 8).
Figure 8. Continuous process improvement with multivariate data analysis.
Multivariate methods are a powerful and efficient tool for monitoring process variables as well as for predicting final product quality at an early stage. For complex processes involving several interacting variables, multivariate statistical process monitoring (MSPM) methods are considerably more effective than univariate control charts. They enable the identification of the process “sweet spot”, while disturbances in the process can easily be detected and the variables causing the upset can be interactively spotted in the on-line monitoring phase.
Importantly, process operators do not need to understand the methodology behind the system, as the plot of the original process variables is shown on screen. The concept of MSPM can be extended for classification and prediction of raw material quality in a complete production process quality system.