Remove outlier in dataframe in python

Before answering the actual question we should ask another one that's very relevant depending on the nature of your data:

What is an outlier?

Imagine the series of values [3, 2, 3, 4, 999] (where the 999 seemingly doesn't fit in) and let's analyse various approaches to outlier detection.

Z-Score

The problem here is that the value in question distorts our measures mean and std heavily, resulting in inconspicuous z-scores of roughly [-0.5, -0.5, -0.5, -0.5, 2.0], keeping every value within two standard deviations of the mean. One very large outlier can hence distort your whole assessment of outliers. I would discourage this approach.
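You can verify this distortion directly (a minimal sketch; the std here is the population standard deviation, ddof=0):

```python
import numpy as np

x = np.array([3, 2, 3, 4, 999], dtype=float)
z = (x - x.mean()) / x.std()  # population std (ddof=0)
print(np.round(z, 2))  # roughly [-0.5, -0.5, -0.5, -0.5, 2.0] -- nothing exceeds |z| = 3
```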

Quantile Filter

A way more robust approach is given in this answer, eliminating the bottom and top 1% of data. However, this eliminates a fixed fraction regardless of whether these data points are really outliers. You might lose a lot of valid data, and on the other hand still keep some outliers if more than 1% or 2% of your data consists of outliers.
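This trade-off shows on the toy series already (a sketch using Series.between with the 1%/99% quantiles; pandas interpolates quantiles linearly by default):

```python
import pandas as pd

s = pd.Series([3, 2, 3, 4, 999])
lo, hi = s.quantile(0.01), s.quantile(0.99)  # interpolated bounds, roughly 2.04 and 959.2
kept = s[s.between(lo, hi)].tolist()
print(kept)  # the outlier 999 is gone -- but so is the perfectly valid 2
```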

IQR-distance from Median

An even more robust version of the quantile principle: eliminate all data that lie more than f times the interquartile range away from the median of the data. That's also the transformation that sklearn's RobustScaler uses, for example. IQR and median are robust to outliers, so you sidestep the problems of the z-score approach.

In a normal distribution we have roughly iqr = 1.35*s, so z = 3 of a z-score filter translates to f = 2.22 of an IQR filter. This will drop the 999 in the above example.
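Applied to the toy series, the f = 2.22 filter keeps everything except the 999 (a minimal sketch):

```python
import pandas as pd

s = pd.Series([3, 2, 3, 4, 999])
iqr = s.quantile(0.75) - s.quantile(0.25)  # 4 - 3 = 1 here
kept = s[(s - s.median()).abs() / iqr < 2.22].tolist()
print(kept)  # only the 999 is dropped
```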

The basic assumption is that at least the "middle half" of your data is valid and represents the distribution well. If your distribution has wide tails and a narrow q_25% to q_75% interval, however, this approach will fail as well.

Advanced Statistical Methods

Of course there are more formal statistical methods like the Peirce criterion, Grubbs' test or Dixon's Q-test, just to mention a few, that are also suitable for non-normally distributed data. None of them are trivial to implement and hence they are not addressed further here.

Code

The following replaces all outliers in all numerical columns with np.nan on an example data frame. The method is robust against all dtypes that pandas provides and can easily be applied to data frames with mixed types:

import pandas as pd
import numpy as np                                     

# sample data of all dtypes in pandas (column 'a' has an outlier)         # dtype:
df = pd.DataFrame({'a': list(np.random.rand(8)) + [123456, np.nan],       # float64
                   'b': [0,1,2,3,np.nan,5,6,np.nan,8,9],                  # float64 (NaN forces float)
                   'c': [np.nan] + list("qwertzuio"),                     # object
                   'd': [pd.to_datetime(_) for _ in range(10)],           # datetime64[ns]
                   'e': [pd.Timedelta(_) for _ in range(10)],             # timedelta64[ns]
                   'f': [True] * 5 + [False] * 5,                         # bool
                   'g': pd.Series(list("abcbabbcaa"), dtype="category")}) # category
cols = df.select_dtypes('number').columns  # limits to a, b (float) and e (timedelta)
df_sub = df.loc[:, cols]


# OPTION 1: z-score filter: z-score < 3
lim = np.abs((df_sub - df_sub.mean()) / df_sub.std(ddof=0)) < 3

# OPTION 2: quantile filter: discard 1% upper / lower values
lim = np.logical_and(df_sub < df_sub.quantile(0.99, numeric_only=False),
                     df_sub > df_sub.quantile(0.01, numeric_only=False))

# OPTION 3: iqr filter: within 2.22 IQR (equiv. to z-score < 3)
iqr = df_sub.quantile(0.75, numeric_only=False) - df_sub.quantile(0.25, numeric_only=False)
lim = np.abs((df_sub - df_sub.median()) / iqr) < 2.22


# replace outliers with nan
df.loc[:, cols] = df_sub.where(lim, np.nan)

To drop all rows that contain at least one nan-value:

df.dropna(subset=cols, inplace=True) # drop rows with NaN in numerical columns
# or
df.dropna(inplace=True)  # drop rows with NaN in any column

Using pandas 1.3 functions:

  • pandas.DataFrame.select_dtypes()
  • pandas.DataFrame.quantile()
  • pandas.DataFrame.where()
  • pandas.DataFrame.dropna()


What are outliers in a dataset?

Outliers are data points in a dataset that are considered to be extreme, false, or not representative of what the data is describing. These outliers can be caused by either incorrect data collection or genuine outlying observations. Removing these outliers will often help your model to generalize better as these long tail observations could skew the learning.

Should you remove outliers from a dataset?

Outliers should be removed from your dataset if you believe that the data point is incorrect, or that the data point is so unrepresentative of the real-world situation that it would cause your machine learning model to fail to generalize.

Methods for handling outliers in a DataFrame

Removing outliers from your dataset is not necessarily the only approach to take. As a rule of thumb, there are three choices when dealing with outliers in your dataset:

  1. Remove - The observations are incorrect or not representative of what you are modelling
  2. Re-scale - You want to keep the observations but need to reduce their extreme nature
  3. Mark - Label the outliers to understand if they had an effect on the model afterwards
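For a numeric column, the three options might look like this in pandas (a minimal sketch; the 3-standard-deviation mask and the log re-scaling are illustrative choices, not the only ones):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': list(range(20)) + [999]})  # one obvious outlier
mask = (df['x'] - df['x'].mean()).abs() > 3 * df['x'].std()

df_removed = df[~mask]                        # 1. remove the outlier rows
df_rescaled = df.assign(x=np.log1p(df['x']))  # 2. re-scale, e.g. log-transform
df_marked = df.assign(is_outlier=mask)        # 3. mark with a boolean flag
```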

Methods to detect outliers in a Pandas DataFrame

Once you have decided to remove the outliers from your dataset, the next step is to choose a method to find them. Assuming that your dataset is too large to manually remove the outliers line by line, a statistical method will be required. There are a number of approaches that are common to use:

  1. Standard deviation - Remove the values which are a certain number of standard deviations away from the mean, if the data has a Gaussian distribution
  2. Automatic outlier detection - Train a machine learning model on a smaller normal set of observations which can then predict data points outside of this normal set
  3. Interquartile range - Remove the values which lie more than 1.5 times the interquartile range above the 75th percentile or below the 25th percentile; doesn't require the data to be Gaussian

There are trade-offs for each of these options, however the method most commonly used in industry is the standard deviation, or z-score, approach.
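As a sketch of the interquartile-range option on a toy series (1.5 is the conventional Tukey multiplier; other values are possible):

```python
import pandas as pd

s = pd.Series([3, 2, 3, 4, 999])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
kept = s[s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].tolist()
print(kept)  # the 999 falls outside the fences and is dropped
```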

How many standard deviations away from the mean should I use to detect outliers?

The standard deviation approach to removing outliers requires the user to choose a number of standard deviations at which to differentiate outlier from non-outlier.

This then begs the question, how many standard deviations should you choose?

The common industry practice is to use 3 standard deviations away from the mean to differentiate outlier from non-outlier. By using 3 standard deviations we remove roughly the 0.3% most extreme cases (assuming normally distributed data). Depending on your use case, you may want to consider using 4 standard deviations, which will remove only around 0.006%.
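These tail fractions follow from the normal distribution and can be checked with the standard library (erfc(k/sqrt(2)) is the two-sided tail area beyond k standard deviations):

```python
from math import erfc, sqrt

for k in (3, 4):
    frac = erfc(k / sqrt(2))  # fraction of a normal distribution beyond k std devs
    print(f"z = {k}: removes {frac:.4%} of the data")
```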

Remove outliers in Pandas DataFrame using standard deviations

The most common approach for removing data points from a dataset is the standard deviation, or z-score, approach. In this example I will show how to create a function to remove outliers that lie more than 3 standard deviations away from the mean:

import pandas as pd

def remove_outliers(df, columns, n_std):
    """Drop rows whose value in any of `columns` lies more than
    `n_std` standard deviations from that column's mean."""
    for col in columns:
        print('Working on column: {}'.format(col))

        mean = df[col].mean()
        sd = df[col].std()

        # keep rows within mean +/- n_std standard deviations (both tails)
        df = df[(df[col] >= mean - (n_std * sd)) & (df[col] <= mean + (n_std * sd))]

    return df
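A self-contained sketch of the same 3-standard-deviation filter on toy data (the column names are made up for illustration; both tails are clipped):

```python
import pandas as pd

df = pd.DataFrame({'x': list(range(20)) + [999],  # one obvious outlier
                   'y': list(range(21))})         # no outliers

for col in ['x', 'y']:
    mean, sd = df[col].mean(), df[col].std()
    df = df[df[col].between(mean - 3 * sd, mean + 3 * sd)]

print(df.shape)  # the row containing the 999 has been dropped
```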




I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.


How do you remove outliers from a DataFrame in Python?

Use scipy's stats.zscore (note that df must contain only numeric columns here):

from scipy import stats
import numpy as np

z_scores = stats.zscore(df)                        # calculate z-scores of `df`
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)  # rows where every |z| < 3
new_df = df[filtered_entries]
print(new_df)

How do you remove outliers from a data frame?

For each series in the dataframe, you could use between and quantile to remove outliers.

