Why Synthetic Data is Not Always Reliable
The primary objective of synthetic data is to allow the integration of various scientific and technical data sets within an organization. Synthetic data is defined here as “any non-tangible data that can be used as-is or modified to serve some purpose”. In other words, synthetic data provides an interface that lets scientists and other technicians perform statistical analysis without needing to know the underlying data structure.
Unfortunately, the definition of “as-is” is very narrow. Synthetic data can be modified at any time at little or no cost to the developer, which makes it potentially useful for scientific research and for implementing changes in business strategy. However, synthetic data often fails in practice. Some of the more common reasons are listed below:
Test Coverage: When synthetic data is used alongside real data or test cases, the range of changes that might need to be detected is virtually unlimited. For example, a user may alter all the data in a test case (adding or deleting rows, altering column widths, and so on) without running a test that justifies those changes. Synthetic data does not distinguish between changes made as part of a test and changes made outside one. In other words, even when a test is run to identify potential problems with data conversion, the synthetic data alone cannot show whether a change was made intentionally.
Unpredictable Outcomes: Synthetic data may well provide information that is more consistent than real data. However, it can equally generate results that are unrelated to the type of test being run or the type of data being used. For instance, if a test uses probability theory to derive statistical estimates from data, synthetic data can produce biased or invalid estimates when the real distribution of outcomes is non-normal and the synthetic data does not reproduce it. Because of this unreliability, synthetic data should be used only as a last resort for exploratory tests, not as the primary or sole data source.
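The bias described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the real data is right-skewed (lognormal) and that a naive generator fits a normal distribution to it. The function names, sample sizes, and choice of distributions are hypothetical and do not come from any particular tool.

```python
import random
import statistics


def sample_real(n, rng):
    # Hypothetical "real" data: right-skewed, strictly positive values.
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]


def sample_synthetic(real, n, rng):
    # Naive generator (an assumption for this sketch): it fits only the
    # mean and standard deviation and then draws from a normal distribution.
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
real = sample_real(10_000, rng)
synthetic = sample_synthetic(real, 10_000, rng)

real_median = statistics.median(real)
synthetic_median = statistics.median(synthetic)

print("real median:     ", round(real_median, 2))
print("synthetic median:", round(synthetic_median, 2))
print("negative synthetic values:", sum(x < 0 for x in synthetic))
```

Under these assumptions the synthetic sample matches the mean and spread of the real one, yet its median is shifted and it contains negative values that could never occur in the real data. Any estimate that depends on the shape of the distribution would be affected in the same way.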
Inconsistent Results: A test may require information from more than one data source, which can lead to inconsistencies in data conversion. The more data sources that are converted during a test, the greater the potential for error. In addition, some test reporting software reports only the sum of all metric values for a test instead of presenting each metric individually. Such reporting can be inaccurate or unsuitable for measuring specific metrics, and may lead to inaccurate measurements and invalid conclusions.
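The reporting problem above can be sketched in a few lines. The metric names and values are invented for illustration, and `total_only` and `per_metric_diff` are hypothetical helpers rather than functions from any reporting tool.

```python
def total_only(metrics):
    """Mimic a report that shows only the sum of all metric values."""
    return sum(metrics.values())


def per_metric_diff(a, b):
    """Return the metrics whose values differ between two runs."""
    return {
        name: (a[name], b.get(name))
        for name in a
        if a[name] != b.get(name)
    }


# Two hypothetical test runs with very different individual metrics.
run_a = {"rows_converted": 1000, "rows_rejected": 5, "warnings": 10}
run_b = {"rows_converted": 990, "rows_rejected": 20, "warnings": 5}

print(total_only(run_a), total_only(run_b))  # the totals are identical
print(per_metric_diff(run_a, run_b))         # every metric actually changed
```

A report based on the total alone would treat the two runs as equivalent, even though the number of rejected rows has quadrupled.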
Inconsistent Time Schedule: The time schedule set for a test can significantly delay the conversion of synthetic data, and it can also lead to undesired results. If the synthetic data requires processing beyond what is needed to convert it into ready-to-use data for the application, the end result may be delayed even further. Moreover, synthetic data does not guarantee an exact end result, because it relies on an accurate current date and time for calculating statistics and converting data. If the end date of the test is changed, the software may have no way to redo the calculations and obtain the right data.
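The dependence on the current date can be made concrete with a small sketch. The function `days_remaining` is hypothetical. Passing the reference date explicitly is one possible way to keep such a calculation repeatable, and it is an assumption of this sketch rather than something the original describes.

```python
from datetime import date


def days_remaining(end_date, today=None):
    """Days between a reference date and the end of a test.

    If no reference date is given, the system clock is used, so the
    result changes depending on when the code happens to run.
    """
    if today is None:
        today = date.today()
    return (end_date - today).days


# With an explicit reference date the result is repeatable.
start = date(2021, 1, 1)
original_end = date(2021, 1, 31)
changed_end = date(2021, 2, 15)

print(days_remaining(original_end, today=start))
print(days_remaining(changed_end, today=start))
```

When the reference date is stored with the test, the statistics can be recalculated after the end date moves. When the calculation reads the clock implicitly, the original conditions cannot be reproduced later.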
Poorly Designed Set of Metrics: Synthetic data does not provide any way to automatically assign appropriate metrics to various items. For example, date information in a data conversion must be converted into a meaningful date. Without an accurate date, a businessperson trying to measure the effectiveness of a sales training program can make incorrect assumptions about it. Likewise, a businessperson trying to measure an employee’s productivity can make inappropriate assumptions based on the type of work the employee performs. Such assumptions are likely to lead to inaccurate data conversion and therefore to inaccurate results. Thus synthetic data should only be used as a last resort for exploratory testing.
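The date conversion problem can be illustrated with a short sketch. It assumes a raw date string whose format is not recorded anywhere, which is a hypothetical scenario chosen for illustration. The same text then converts into two different dates depending on the format the converter assumes.

```python
from datetime import date, datetime

# Hypothetical raw value from a data set with no documented date format.
raw = "03/04/2021"

# Interpreted as month/day/year.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()

# Interpreted as day/month/year.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

gap_days = (as_day_first - as_month_first).days

print(as_month_first, as_day_first, gap_days)
```

A measurement of a training program that starts on one of these dates would be off by about a month if the other interpretation were used, which is the kind of incorrect assumption the paragraph above warns about.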
Overriding the Benefits of Testing: There are many benefits to using synthetic data. The primary benefit is that it allows fast measurements without time spent on data conversion and calculation. Another is that it removes the hassles of conversion and calculation that come with using real data from the application. However, although synthetic data offers these benefits, it is not completely free of hassles.