Garbage In Garbage Out: Finding the Answer in Data Integrity

"For Data Integrity, Filtering Out Transient-State Data and Securing Model Reliability with Steady-State Data."

The calculators we commonly use operate on one key assumption: "The input is correct." If a user mistakenly enters '1+2' instead of '1+1', the calculator produces a result of '3'. Is this the calculator's fault? No. The calculator has merely performed its function perfectly according to the logic of basic arithmetic operations. 

However, when this simple story plays out in a multi-million-dollar chemical plant, the situation changes. Process simulation typically involves building a design-purpose model (Design Model) around clearly defined conditions and standards, on the presumption that the inputs are error-free.

When this “idealized calculator” is applied to actual operations, an unacceptable gap emerges between calculated results and reality. Feeding imperfect field data directly into a Design Model is likely to produce incorrect outputs, no matter how sophisticated the simulation is. In other words, to prevent the “Garbage In, Garbage Out” problem, ensuring Data Integrity must be the prerequisite.

This case study of Company H tells the story of a three-year technical impasse caused by forcing "highly variable real-world data" into an "ideal simulator," and how Simacro resolved it through a new perspective: Data Discrimination.

The Challenge: A Three-Year Impasse Caused by a Lack of Data Integrity

Company H, a major Southeast Asian polymer producer, operates a state-of-the-art process producing over 50 different product grades annually. To achieve operational excellence and digital transformation (DX), they partnered with a global leader in simulation software to initiate an advanced modeling project.

However, even after three long years, the simulation model failed to deliver consistently accurate results across all ~50 product grades. With no new ideas for improving model performance, the two companies could not narrow their technical disagreement over the model's accuracy.

The customer believed that “there is something wrong with the developed model,” while the vendor countered that “the reliability of the process data is questionable.” So what, exactly, was the real problem?


Screening of the noise characteristics of 27 key process variables (KPVs). Red horizontal line indicates CV = 0.05.

The core problem was that "a simulator typically doesn't validate the quality of its own input."

Company H’s process is characterized by very frequent Grade Changes. Because the product being produced shifts almost every week, the process data fluctuated constantly in a transient state, and periods of sustained steady-state were extremely rare. The project team simply fed averaged transient-state data into the model. It was like measuring the average water level in a stormy sea and then assuming the ocean is calm before setting sail.


Screening of the noise characteristics of 68 sub-datasets. Red horizontal line indicates CV = 0.05.

Solution Step 1: Filtering Transient-state Data to Ensure Data Integrity

As soon as Simacro stepped in, we changed the fundamental question. Instead of asking, "How do we tune the model parameters?" we asked, "Is this data valid for simulation input?"

With the ProcessMetaverse™ Stability Analysis Agent, Simacro carried out a comprehensive technical analysis of nearly a full year of highly variable plant operating data to identify steady-state windows that could truly represent each product grade. Using a statistical stability threshold of CV = 0.05 (a 5% coefficient of variation), we screened the data for reliable operating segments. The screening worked, and what it revealed exceeded everyone's expectations.
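To make the criterion concrete, here is a minimal sketch of how such a CV-based steady-state check can be implemented in Python. The window length, sampling rate, and synthetic data are illustrative assumptions, not the actual ProcessMetaverse™ implementation.

```python
import numpy as np
import pandas as pd

CV_LIMIT = 0.05   # the 5% stability threshold used in the case study
WINDOW = 120      # rolling window length in samples (e.g. 2 h of 1-min data)

def steady_state_flags(series: pd.Series, window: int = WINDOW,
                       cv_limit: float = CV_LIMIT) -> pd.Series:
    """Flag samples whose trailing window has a coefficient of variation
    (rolling std / rolling mean) below the threshold."""
    rolling = series.rolling(window, min_periods=window)
    cv = rolling.std() / rolling.mean().abs()
    return cv < cv_limit

# Synthetic demo: a noisy steady stretch followed by a ramp (grade change).
rng = np.random.default_rng(0)
t = np.arange(600)
kpv = np.where(t < 300, 100.0, 100.0 + 0.5 * (t - 300)) + rng.normal(0, 1, t.size)
flags = steady_state_flags(pd.Series(kpv))
print(f"{int(flags.sum())} of {len(flags)} samples pass the CV < {CV_LIMIT} screen")
```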

None of the many periods that the customer and the vendor had previously assumed to be "stable operation" passed this criterion; they were all identified as quasi-transient states. In other words, there was virtually no "clean data" available for modeling.

Using transient-state data that only looks steady is like building a castle on sand. Simacro applied the advanced data-analytics AI agent capabilities of ProcessMetaverse™. Rather than trusting the entire dataset selected by the former project team, we performed noise characterization to screen for data suitable for steady-state modeling, and then precisely extracted stable operating windows from highly fluctuating periods.

Among numerous pseudo-steady-state windows, we pinpointed intervals of true steady state across multiple variables and extracted only those data segments, as sketched below. This went beyond ordinary data cleansing; it was a process of data discrimination.
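The sketch below illustrates the intersection logic behind this step, under the assumption that the screened KPVs sit in the columns of one DataFrame: a window qualifies only when every variable is steady at the same time for a minimum duration. The parameters are hypothetical, not the actual agent's code.

```python
import pandas as pd

def extract_steady_windows(df: pd.DataFrame, window: int = 120,
                           cv_limit: float = 0.05, min_len: int = 60):
    """Return (start, end) index labels of stretches where *all* KPVs
    keep a rolling CV below cv_limit for at least min_len samples."""
    rolling = df.rolling(window, min_periods=window)
    cv = rolling.std() / rolling.mean().abs()
    all_steady = (cv < cv_limit).all(axis=1)    # NaN rows compare as False

    windows, start = [], None
    for pos, ok in enumerate(all_steady):
        if ok and start is None:
            start = pos                          # a candidate window opens
        elif not ok and start is not None:
            if pos - start >= min_len:           # long enough to keep
                windows.append((df.index[start], df.index[pos - 1]))
            start = None
    if start is not None and len(df) - start >= min_len:
        windows.append((df.index[start], df.index[-1]))
    return windows
```

Only the segments inside the returned windows would then be averaged into grade-representative inputs for the simulator.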


Figure 4. Data Discrimination: Extracting Stable Operating Windows by AI Agent.

Solution Step 2: Data Reconciliation – Generating Physically Valid Input Data

Finding valid time windows was just the start. Chemical plant sensors are never perfect. For instance, in a mass balance equation where A + B = C, real-world meters might read A=1 and B=2, but C=3.5.

For real-time digital twin applications or handling massive operational data, data reconciliation is a mandatory process. 

Reconciliation is a process that identifies statistical inconsistencies among data, then compares and aligns them to ensure data accuracy, consistency, and integrity. Through ProcessMetaverse™, Simacro analyzed the causal relationships in the data over time and corrected conflicting values, ultimately producing a physically valid set of input data.
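For readers who want to see the mechanics, here is a minimal sketch of linear data reconciliation applied to the A + B = C example above, using the classic weighted least-squares formulation. The sensor uncertainties (sigma) are assumed values for illustration, not real meter specifications.

```python
import numpy as np

m = np.array([1.0, 2.0, 3.5])         # raw readings: A, B, C
sigma = np.array([0.05, 0.05, 0.10])  # assumed standard deviation of each meter
A = np.array([[1.0, 1.0, -1.0]])      # constraint: A + B - C = 0

# Closed-form weighted least-squares solution via Lagrange multipliers:
# x = m - V A^T (A V A^T)^-1 (A m), with V = diag(sigma^2)
V = np.diag(sigma ** 2)
correction = V @ A.T @ np.linalg.solve(A @ V @ A.T, A @ m)
x = m - correction

print("reconciled A, B, C:", np.round(x, 3))  # e.g. [1.083, 2.083, 3.167]
print("balance residual:", A @ x)             # ~0 up to floating-point error
```

Because both the objective and the balance constraint are linear, the solution is closed-form; nonlinear balances (energy, component flows) require iterative solvers, but the principle of weighting adjustments by sensor trustworthiness is the same.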

Technology Deep Dive: Unlocking the Black Box with MWD Deconvolution

While data discrimination and reconciliation served as validation of the input values, validating the "Reaction Model Structure" required a much deeper review and analysis. Matching measurable variables like flow or temperature was not sufficient, because what the customer ultimately wanted was accurate prediction of polymer product properties.

To accurately reproduce Molecular Weight Distribution (MWD), a critical quality attribute, Simacro verified the existing kinetic model against lab data for all grades. The existing polymerization reaction model used too few catalyst active sites, producing a low-accuracy molecular weight distribution; Simacro increased the number of active sites and upgraded the model to deliver a far more accurate MWD.

We implemented the MWD Deconvolution technique within the PMv Canvas, utilizing Log-normal and Schulz-Zimm distribution functions to precisely reverse-engineer (back-track) the catalyst activity and polymer chain growth process inside the reactor. 
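As an illustration of the underlying technique (not the PMv Canvas implementation itself), the sketch below fits a GPC curve with a weighted sum of per-site Flory most-probable distributions, the z = 1 special case of the Schulz-Zimm form. The number of sites and the starting guesses are assumptions, and a log-normal component could be swapped in the same way.

```python
import numpy as np
from scipy.optimize import least_squares

def flory_dwdlogm(logM, Mn):
    """Weight distribution of one active site on a log10(M) axis
    (Flory most-probable distribution, Schulz-Zimm with z = 1)."""
    M = 10.0 ** logM
    return np.log(10.0) * (M / Mn) ** 2 * np.exp(-M / Mn)

def model(params, logM, n_sites):
    """Sum of per-site components; params = [w1..wn, log10(Mn1)..log10(Mnn)]."""
    w, logMn = params[:n_sites], params[n_sites:]
    return sum(wi * flory_dwdlogm(logM, 10.0 ** li) for wi, li in zip(w, logMn))

def deconvolve(logM, dwdlogm, n_sites=5):
    """Fit site mass fractions and per-site Mn to a measured GPC curve."""
    x0 = np.concatenate([np.full(n_sites, 1.0 / n_sites),
                         np.linspace(3.5, 6.0, n_sites)])  # spread initial Mn
    lb = np.concatenate([np.zeros(n_sites), np.full(n_sites, 2.0)])
    ub = np.concatenate([np.ones(n_sites), np.full(n_sites, 7.5)])
    fit = least_squares(lambda p: model(p, logM, n_sites) - dwdlogm,
                        x0, bounds=(lb, ub))
    w = fit.x[:n_sites]
    return w / w.sum(), 10.0 ** fit.x[n_sites:]  # normalized fractions, Mn per site
```

The fitted per-site fractions and molecular weights are what let the kinetic model be checked, and corrected, site by site rather than against the lumped curve alone.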

This approach moves beyond merely estimating the recipe from the finished dish’s taste; it is akin to transparently reviewing the 'Reaction History' inside the reactor to reproduce the exact timing and pathway of reactants and additives. 

By combining reconciled process data with this reverse-engineering analysis, we demonstrated that our simulation model faithfully mirrored the chemical operations inside the reactor.

The Result: Achieving Model Reliability through Verified Data Integrity

The results were clear. After running the simulation with a revised polymerization reaction model using data that Simacro had analyzed and reconciled through ProcessMetaverse™, the coefficient of determination (R²) between the experimental data and the simulation results showed a strong correlation, exceeding the project’s minimum acceptance criteria. 

This was completely different from the previous standard of being merely “acceptable.” Through ProcessMetaverse™, Simacro demonstrated that the model accurately represents the real process within a 95% confidence interval.
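For completeness, the headline metric is straightforward to compute. The sketch below uses made-up per-grade numbers and a hypothetical acceptance threshold purely to show the calculation, not the project's actual data or criteria.

```python
import numpy as np

lab = np.array([1.8, 2.4, 3.1, 4.0, 5.2])  # per-grade lab measurements (made up)
sim = np.array([1.7, 2.5, 3.0, 4.1, 5.1])  # matching simulation results (made up)

ss_res = np.sum((lab - sim) ** 2)           # residual sum of squares
ss_tot = np.sum((lab - lab.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot

ACCEPTANCE_R2 = 0.90                        # hypothetical acceptance threshold
print(f"R^2 = {r2:.3f}, meets criterion: {r2 >= ACCEPTANCE_R2}")
```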

After three years of going nowhere, the project finally began to move forward once we changed our perspective on the data. By shifting the focus from simulation-parameter tuning to validating the integrity of the input data, we were able to develop the model into a real-time Operational Digital Twin.


Screenshot of the Python Editor in ProcessMetaverse™.

The Path to an Operational Digital Twin

The lesson from the Company H project is clear. 

Where simulation was a 'Tool for Design,' the Digital Twin is a 'Tool for Operation.'

The focus is shifting from technology that calculates the future to technology that optimizes the present.

Real-world data is imperfect. If we ignore this reality and proceed with the assumption that "the data must be correct," no simulation software can fulfill its role. 

The capability to discriminate valid signals from imperfect reality, to reconcile contradictory values, and to achieve high reproducibility by reinterpreting polymerization reaction models: this is the core competitive edge that allowed Simacro to solve a challenge that even global technology leaders found insurmountable.

This approach not only restored trust in the model but provided the client with a scalable foundation for real-time optimization, future AI model deployment, and plant-wide operational excellence. 

Simacro is now moving beyond design-focused modeling to open the era of true Operational Digital Twins, powered by AI Agents for data analysis.


About SIMACRO

With headquarters in Boston and Seoul, SIMACRO has completed over 90 commercial modeling projects across 40 companies since 2018. Collaborating with global technology leaders such as AspenTech, Emerson, and OLI, SIMACRO is committed to advancing digital innovation in the process industry.
