What to Do When Some Data Are Normal and Some Not
One of the most common questions I get while mentoring GB/BB projects, as well as while training LSS belts, is "What should I do when my data isn't normal?" Normally distributed data takes center stage in statistics. A large number of statistical tests are based on the assumption of normality, which instills a lot of fear in project leaders when their data is not normally distributed.
A few years ago, some statisticians held the belief that when a process was not normally distributed, there was something wrong with the process, or that the process was 'out of control'. In their view, the purpose of a control chart was to determine when a process was non-normal so it could be "corrected" and returned to normality. Fortunately, most statisticians and LSS practitioners today do not subscribe to this belief. We recognize today that there is nothing wrong with non-normal data, and that the preference for normally distributed data in statistics is only due to its simplicity and nothing more.
Many processes naturally follow a non-normal distribution, or a specific type of non-normal distribution. Cycle time, calls per hour, customer waiting time, shrinkage, etc. are a few examples of such processes.
Types of Non-Normal Distributions
There are many types of non-normal distributions that a data set can follow, based on the nature of the process, the data collection methodology used, the sample size, outliers in the data, etc. A few of the major non-normal distributions are listed below:
- Beta Distribution
- Exponential Distribution
- Gamma Distribution
- Inverse Gamma Distribution
- Log Normal Distribution
- Logistic Distribution
- Maxwell-Boltzmann Distribution
- Poisson Distribution
- Skewed Distribution
- Symmetric Distribution
- Uniform Distribution
- Unimodal Distribution
- Weibull Distribution
Reasons for Non-Normal Distribution
Many processes or data sets naturally fit a non-normal distribution. For example, the number of accidents tends to fit a Poisson distribution, and product lifetimes commonly fit a Weibull distribution. Even so, there may be times when your data is supposed to fit a normal distribution but does not. For example, the time taken to reach the office from home is commonly supposed to fit a normal distribution. If you see a non-normal distribution for such data sets, it is advisable to check for the reasons below in your data and correct them if needed (a short code sketch after this list illustrates two of these checks).
- Outliers / Extreme values: Outliers can skew your distribution. The central tendency of your data set (the mean) is especially sensitive to outliers and may result in a non-normal distribution. You should identify all the outliers, which may be extremely high or extremely low values in the data set or special causes in the process, and remove them. Once done, check for normality again. It is important that outliers are identified as truly special causes before they are eliminated. It is in the nature of normally distributed data that a small percentage of extreme values can be expected; not every outlier is caused by a special cause. Extreme values should be removed from the data only if there are more of them than expected under normal conditions.
- Subgroups / Overlap of two or more processes: A data set that is a combination of two or more data sets from two or more processes can also lead to a non-normal distribution. If you take two data sets that each follow a normal distribution and merge them into one, the result will follow a bimodal distribution. The remedial action for these situations is to determine the reasons causing the bimodal or multimodal distribution and then stratify the data. Ensure that your data set is coherent and is not a mixture of multiple subgroups.
- Insufficient data discrimination: Round-off errors or measurement devices with poor resolution/precision can make truly continuous and normally distributed data look discrete and non-normal. Using a more accurate measurement system or collecting more data points should overcome insufficient data discrimination, i.e. an insufficient number of different values.
- Small sample size: This can make normally distributed data look scattered. For example, if you look at the distribution of the heights of 50 students in a particular class, you will see that it follows a normal distribution. However, if you randomly choose just three students from the same class, the data may follow a uniform distribution or a skewed distribution, depending on which students are chosen. Increasing your sample size until you get a normal distribution usually resolves this issue.
- Values close to process boundaries: If a process has many values close to zero or close to a natural process boundary, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make the data normal. When comparing transformed data, everything under comparison must be transformed in the same way.
- Sorted data: Data collected from a normally distributed process can also fit a non-normal distribution if it represents merely a sample / subset of the total output of the process. This happens when the collected data is sorted and then analyzed. Suppose there is a ring manufacturing process where the target is to produce rings with a diameter of 10 cm, with a USL of 10.25 cm and an LSL of 9.75 cm. If the ring diameter data were collected from such a process and all values outside the specification limits were removed, it would show a non-normal distribution (a uniform distribution), even though the data as originally collected would be normally distributed.
- Data follows a different distribution: In addition to the above-mentioned reasons, where normally distributed process data can show up as non-normal, there are many data types that follow a non-normal distribution by nature. In such cases, the data should be analyzed using tests that do not assume normality.
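As a minimal illustration of the first two checks above (normality and outlier screening), here is a short Python sketch using NumPy and SciPy. The data, the random seed and the 1.5 × IQR threshold are all made up for the example and are not from the original article.

```python
# Minimal sketch: normality test plus a simple IQR outlier screen.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=10.0, scale=0.1, size=50)   # hypothetical ring diameters (cm)

# 1. Normality check: p > 0.05 means we cannot reject normality.
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.3f}")

# 2. Outlier screen: flag points outside 1.5 * IQR; investigate each flagged
#    point for a true special cause before removing anything.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(f"Candidate outliers: {outliers}")
```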
How to Deal with Non-Normal Distribution
Once you have ensured that your data is non-normal due to the nature of the data / process itself, and not due to any of the above-mentioned reasons, you can proceed with analyzing it. There are two ways to go about analyzing non-normal data: either use non-parametric tests, which do not assume normality, or transform the data using an appropriate function, forcing it to fit a normal distribution.
Several tests, such as the t-test, ANOVA, regression and DOE, assume normality and are ideally used on normally distributed data only. However, these tests are reasonably robust to departures from normality, so you may still be able to run them, with caution, if your sample size is large enough.
If you have a very small sample, a sample that is skewed, or one that naturally fits another distribution type, you should run a non-parametric test. A non-parametric test is one that does not assume that the data fits any specific distribution type. Non-parametric tests include the Wilcoxon test, the Mann-Whitney test, Mood's median test and the Kruskal-Wallis test. Each test that assumes normality has a non-parametric equivalent: roughly, the 1-sample t-test maps to the 1-sample Wilcoxon test, the 2-sample t-test to the Mann-Whitney test, and one-way ANOVA to the Kruskal-Wallis or Mood's median test.
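For instance, two of the tests named above are available in scipy.stats. The two samples below are invented exponential data standing in for cycle times, so treat this as a sketch rather than a recipe.

```python
# Sketch: non-parametric alternatives to the 2-sample t-test and one-way ANOVA.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
before = rng.exponential(scale=5.0, size=40)   # e.g. pre-improvement cycle times
after = rng.exponential(scale=4.0, size=40)    # e.g. post-improvement cycle times

# Mann-Whitney U test: non-parametric alternative to the 2-sample t-test.
u_stat, p_mw = stats.mannwhitneyu(before, after)
print(f"Mann-Whitney p-value: {p_mw:.3f}")

# Kruskal-Wallis test: non-parametric alternative to one-way ANOVA
# (shown here with just two groups; it accepts two or more).
h_stat, p_kw = stats.kruskal(before, after)
print(f"Kruskal-Wallis p-value: {p_kw:.3f}")
```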
Generally, data needs to be statistically analyzed for two reasons. The first involves various tests to see whether the data is stable and to calculate the process capability / sigma levels: in the Measure phase of the project using pre-improvement project Y data, and in the Control phase using post-improvement project Y data. The second involves hypothesis testing in the Measure phase and control charts in the Control phase. The equivalent non-parametric tests mentioned above are suitable for hypothesis testing.
Let us also look at how to calculate process capability and sigma level when the data is non-normal.
Process Capability for Non-Normal Data
When the data is normally distributed, we perform 'Capability Analysis > Normal' in Minitab to calculate process sigma. This capability analysis assumes that the data is normal and accordingly calculates the Process Sigma (short term) and Cpk values.
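As a rough illustration of what that analysis computes (not Minitab's exact algorithm), the standard formulas for Cp, Cpk and the corresponding sigma level on normal data look like this. The spec limits and data here are hypothetical, and the overall standard deviation stands in for Minitab's within-subgroup estimate.

```python
# Sketch: normal capability indices from mean, standard deviation and spec limits.
import numpy as np

rng = np.random.default_rng(seed=3)
data = rng.normal(loc=10.0, scale=0.07, size=100)  # hypothetical ring diameters (cm)
usl, lsl = 10.25, 9.75                              # spec limits from the ring example

mean, sigma = data.mean(), data.std(ddof=1)
cp = (usl - lsl) / (6 * sigma)                      # potential capability
cpk = min(usl - mean, mean - lsl) / (3 * sigma)     # capability vs nearest spec limit
z_level = 3 * cpk                                   # sigma level of the nearest spec
print(f"Cp={cp:.2f}  Cpk={cpk:.2f}  Z={z_level:.2f}")
```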
However, when the data is non-normal, the same test cannot be used. The alternative is 'Capability Analysis > Non-Normal' in Minitab.
One prerequisite for this test is to know the exact distribution that the data follows.
Hence, the first job is to identify the distribution of the data using the 'Individual Distribution Identification' tool in Minitab.
The output of this test is multiple probability plots with p-values for each distribution that it tests.
The distribution with the highest p-value is the best-fit distribution for the data. Select this distribution in the dialog box of the capability analysis test to calculate process sigma.
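Outside Minitab, a rough analog of this identification step is to fit several candidate distributions and compare goodness-of-fit p-values. The sketch below uses scipy.stats with a Kolmogorov-Smirnov test on each fitted distribution; Minitab uses Anderson-Darling, and KS p-values computed with fitted parameters are only approximate, so treat this as illustrative.

```python
# Sketch: fit candidate distributions and compare goodness-of-fit p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
data = rng.weibull(2.0, size=200) * 5.0      # hypothetical cycle-time data

candidates = {
    "lognormal": stats.lognorm,
    "exponential": stats.expon,
    "gamma": stats.gamma,
    "weibull": stats.weibull_min,
}
for name, dist in candidates.items():
    params = dist.fit(data)                  # maximum-likelihood fit
    _, p = stats.kstest(data, dist.cdf, args=params)
    print(f"{name:12s} p-value: {p:.3f}")
# Pick the distribution with the highest p-value as the best fit,
# then use it in the non-normal capability analysis.
```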
There are instances where, for some data sets, we do not get any distribution with a p-value greater than 0.05, meaning the data set does not follow any of the distributions that the test looks for. In such scenarios, one of the preferred remedial actions is to transform the non-normal data into normal data using a data transformation method. The Box-Cox power transformation and the Johnson transformation are the most preferred methods for such data transformations. More about data transformation using these methods in the next article.
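As a preview, the Box-Cox transformation is available as scipy.stats.boxcox. The skewed sample below is made up, and the Johnson transformation has no direct SciPy equivalent, so only Box-Cox is sketched here.

```python
# Sketch: Box-Cox power transformation of made-up, strictly positive data
# (Box-Cox requires positive values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
skewed = rng.lognormal(mean=1.0, sigma=0.6, size=200)

transformed, lam = stats.boxcox(skewed)      # lambda chosen by maximum likelihood
print(f"Optimal lambda: {lam:.2f}")
print(f"Normality p-value after transform: {stats.shapiro(transformed)[1]:.3f}")
# Remember: anything compared against this data (e.g. spec limits) must be
# transformed with the same lambda.
```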
Source: https://www.linkedin.com/pulse/non-normal-data-how-deal-sachin-naik