Outliers: Anomalies in the Data

Jagruti Pawashe
Oct 29, 2022


What are Outliers?

Essentially, outliers are data points floating way off from the trend, the pattern, or wherever else the other data points are hanging out.

In simple terms, an outlier is a data point that is extremely high or extremely low relative to the other values in the dataset or graph you’re working with.

Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.

But outliers are an important part of a dataset. They can hold useful information about your data.

Outliers can give helpful insights into the data you’re studying, and they can have an effect on statistical results. This can potentially help you discover inconsistencies and detect any errors in your statistical processes.

Some outliers represent true values from natural variation in the population. Other outliers may result from incorrect data entry, equipment malfunctions, or other measurement errors.

An outlier isn’t always a form of dirty or incorrect data, so you have to be careful with them in data cleansing. What you should do with an outlier depends on its most likely cause.

True Outliers:

True outliers should always be retained in your dataset because these just represent natural variations in your sample.

Example: You measure height in meters for a representative sample of 560 college students. Your data are normally distributed with a couple of outliers on either end.

Most values are centered around the middle, as expected. But these extreme values also represent natural variations because a variable like height is influenced by many other factors.

True outliers are also present in variables with skewed distributions where many data points are spread far from the mean in one direction. It’s important to select appropriate statistical tests or measures when you have a skewed distribution or many outliers.

Other outliers:

Outliers that don’t represent true values can come from many possible sources:

  • Measurement errors
  • Data entry or processing errors
  • Unrepresentative sampling

If data values are impossible or obviously incorrect, they should be removed. But if data don’t fit your model, it is your model that should be changed, not the data.

It is unusual for it to rain 15 cm in a day where I live, but it does happen. If that data point were excluded, it would give an incorrect impression of the total rainfall and the distribution of rainfall.

For some data sets, it can be difficult to find appropriate models, but that doesn’t justify discarding data just because it doesn’t fit a model you’re familiar with. Sometimes simplifying the analysis to be able to use a nonparametric test is a good solution.

How much outliers affect a statistical analysis depends on the analysis. Some methods are quite robust to outliers and some are quite sensitive. Consider the mean and the median as measures of location: the mean is sensitive to outlying values, but the median is far less so.
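As a rough illustration of that difference, here is a minimal Python sketch. The rainfall readings are made-up numbers (an assumption for this example, not real measurements), with one genuine but extreme 15 cm day at the end.

```python
from statistics import mean, median

# Hypothetical daily rainfall in cm; the last reading is an unusual but real 15 cm day.
rainfall = [0.2, 0.5, 0.3, 0.4, 0.6, 0.1, 15.0]
without_extreme = rainfall[:-1]

# How far each measure of location moves when the extreme day is included.
mean_shift = mean(rainfall) - mean(without_extreme)
median_shift = median(rainfall) - median(without_extreme)

print(f"mean moves by {mean_shift:.2f} cm")
print(f"median moves by {median_shift:.2f} cm")
```

With these numbers the mean moves by roughly 2 cm while the median barely changes, which is why the median is often preferred when extreme values are expected.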

Effects of Outliers on data:

Outliers can have a large impact on the results of data analysis and on various statistical measures.
Some of the most common effects are as follows:

  • If the outliers are non-randomly distributed, they can decrease normality.
  • They increase the error variance and reduce the power of statistical tests.
  • They can cause bias and/or influence final results.
  • They can also violate the basic assumptions of regression as well as other statistical models.
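The variance effect in the list above can be seen with a small Python sketch. The heights are invented for illustration, and the value 17.3 stands in for a hypothetical data entry error (1.73 typed with the decimal point in the wrong place).

```python
from statistics import stdev

# Hypothetical heights in meters for six students.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.73]

# The same sample plus one assumed data entry error (17.3 instead of 1.73).
heights_with_error = heights + [17.3]

spread_clean = stdev(heights)
spread_with_error = stdev(heights_with_error)

print(f"standard deviation without the error: {spread_clean:.3f} m")
print(f"standard deviation with the error:    {spread_with_error:.3f} m")
```

A single bad value inflates the sample standard deviation many times over, which in turn widens confidence intervals and makes real effects harder to detect.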

How to identify outliers using visualizations?

Visualizing data as a box plot makes it easy to spot outliers. The “box” spans the interquartile range, from the lower quartile to the upper quartile, with a line inside marking the median. The “whiskers” extend from the box to the smallest and largest values that are not treated as outliers (by a common convention, those within 1.5 times the interquartile range of the box), and any outliers are drawn as individual points beyond the whiskers. If the box sits closer to the upper whisker, the most prominent outliers tend to lie on the low side. Likewise, if the box sits closer to the lower whisker, they tend to lie on the high side.
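The rule that many box plots use to decide which points fall outside the whiskers can also be applied directly. The sketch below assumes the common 1.5 × IQR convention; the function name `iqr_outliers`, the multiplier `k`, and the sample values are illustrative choices, not something prescribed by this article.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical measurements with one value far above the rest.
sample = [12, 13, 14, 15, 15, 16, 17, 18, 19, 45]
print(iqr_outliers(sample))
```

Flagging a value this way only marks it for inspection. Whether it should be kept or removed still depends on its most likely cause.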

When should you remove outliers?

It may seem natural to want to remove outliers as part of the data-cleaning process. But in reality, sometimes it’s best — even absolutely necessary — to keep outliers in your dataset.

Removing outliers solely due to their place in the extremes of your dataset may create inconsistencies in your results, which would be counterproductive to your goals. These inconsistencies may lead to reduced statistical significance in an analysis.
