Cleaning time series data in Python

A beginner’s guide to the world of time series forecasting!

No matter what kind of data science project one is assigned to, making sense of the dataset and cleaning it is always critical for success. The first step is to understand the data using exploratory data analysis (EDA), as it helps us create a logical approach for solving…

“Black and white photo of the street sign for Wall St in New York City” by Rick Tap on Unsplash

A hypothetical company, ABC Financial Services Corp, makes financial investment decisions on behalf of its clients based on the company’s economic research. A lot of these decisions involve speculating on future prices of financial instruments. ABC Corp utilizes several economic indicators, but one in particular is heavily weighted in its analysis: the University of Michigan’s Consumer Sentiment Survey (CSI).

The only problem is that they have to wait for the release of this data (it is published once a month), which erodes some of ABC’s edge in the market. To stay competitive, they would like a way to predict this number ahead of time. I propose using a form of Machine Learning (ML) to make time series predictions of the final Consumer Sentiment number before it is released.

To do this we are going to use other economic data (as features for the ML algorithm) that is released before the CSI. We’ll use this collection of data to construct a final dataset that is ready for a predictive algorithm.

Multiple datasets

The historical datasets that we’ll use are listed below and can be downloaded from the following links:

  • The Dow Jones Index: Source (Yahoo Finance)
  • US Unemployment (Jobless Claims) data from the US Department of Labor: Source (Federal Reserve)
  • Historical price of Crude Oil in the open market: Source (Federal Reserve)
  • New Housing Starts from the US Census Bureau: Source (Federal Reserve)
  • Total Vehicles Sold: Source (Federal Reserve)
  • Retail Sales data from the US Census Bureau: Source (Federal Reserve)
  • Federal Interest Rates: Source (Federal Reserve)
  • The University of Michigan’s Consumer Sentiment Survey (the data we want to predict): Source (University of Michigan)

Tools

We’ll use Python with the Pandas library to handle our data cleaning task. We are going to use Jupyter Notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. It is a really great tool for data scientists. You can head over to Anaconda.org to download the latest version, which comes pre-loaded with most data science libraries.

We will combine the above datasets into one table using pandas and then do the necessary cleaning. Some of the above datasets have been seasonally adjusted to remove the influence of predictable seasonal patterns. In the actual training and testing stage, we would experiment with both the adjusted and unadjusted datasets.

Data cleaning is highly dependent on the type of data and the task you’re trying to achieve. In our case we combine data from different sources and clean up the resulting dataframe. For an image classification task, we might instead reshape and resize the images and create labels, while a sentiment analysis dataset may need grammar checks and keyword extraction.

Visually inspect the dataframes

To do this, we’ll need to import a few Python libraries, as shown below.

# Import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from scipy import stats
from functools import reduce
import datetime

Import the data tables into pandas DataFrames.

# load all the datasets to pandas DataFrames
dow     = pd.read_csv('data/Dow Jones Industrial Average DJI.csv')
unemp   = pd.read_csv('data/Civilian Unemployment Rate UNRATE.csv')
oil     = pd.read_csv('data/Crude Oil Prices MCOILBRENTEU.csv')
hstarts = pd.read_csv('data/Housing Starts HOUST.csv')
cars    = pd.read_csv('data/Total Vehicle SalesTOTALSA .csv')
retail  = pd.read_csv('data/Advance Retail Sales_RSXFS.csv')
fedrate = pd.read_csv('data/Federal Interest Rates FEDFUNDS.csv')
umcsi   = pd.read_excel('data/consumer_sent_UMCH_tbmics.xls', header=3)

After loading the data, the first thing we do is visually inspect it to understand how it is structured, what it contains, and to note anything out of the ordinary. Most of the data you are going to come across is at least thousands of rows long, so I like to inspect random chunks of rows at a time.
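In addition to the head() and tail() calls shown next, pandas’ sample() function pulls a random chunk of rows; a minimal sketch using the dow table loaded above:

# pull a few random rows from the dow table as a quick sanity check
dow.sample(5)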

We use the head() and tail() functions to inspect the top and the bottom sections of the table respectively.

# view the top of the dow jones table
dow.head()

# view the top of the umcsi table
umcsi.head()

# view the bottom of the data table
umcsi.tail()

We can also use a loop to iterate through all the tables and print their sizes. From this step we can start to anticipate the kinds of joins we need to do, or decide whether we statistically have enough data to get started. Remember: bad data is worse than no data.

# get the shape of the different datasets
dflist = [dow, unemp, oil, hstarts, cars, retail, fedrate, umcsi]
for i, dfr in enumerate(dflist):
    print(dfr.shape)

Another useful pandas tool when we are inspecting data is the describe() function, which gives a snapshot of the general statistics of all the numerical columns in the data.

# we look at the statistical characteristics of the datasets
for i, dfr in enumerate(dflist):
    print(dfr.describe())

The table above contains more columns than shown, as indicated by the backslash at the top right.
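If you would rather see every column instead of the truncated view, pandas’ display options can be widened; a small sketch (the width value is just an example):

# show all columns and widen the output so describe() is not truncated
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)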

We also want to know if we are dealing with data containing null values, as these can lead to bad results if ignored. One way to find the null values is to use the isnull() function. We use a loop below to iterate over all the data tables.

# see which datasets have null values
for i, dfr in enumerate(dflist):
    print(dfr.isnull().sum().sum())

Making observations about the data

  • Some of the observations below are not apparent from the images above but are visible in the original notebook, available here in my GitHub repository.
  • In this case, the data we acquired is not overly complicated with hundreds of columns, but it is good to bear in mind that this is not always the case, and we must be comfortable handling really large, messy data.
  • From the inspections above, there are some things that need to be rectified before we can achieve a final clean dataset.
  • The date formats need to be converted to a uniform format across all datasets.
  • The date ranges are also very inconsistent. The start dates range from 1947 to 1992.
  • The dates in umcsi are in two columns (string month and float year), making it hard to merge with the other datasets on that column.
  • umcsi also has 3 null values, so we have to remove the entire rows where these occur.
  • For our purposes, the dow dataframe also has extra columns so we will need to get rid of some of them.
  • These are just some of the modifications I can observe, but there are likely going to be other complications.
  • It is also necessary to comment your code so your colleagues can understand what you are doing.
  • At some point we also have to change the date format from string to a format that supports plotting (a small pd.to_datetime sketch follows this list; later in the article we do the conversion with strptime).
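As a preview of that last point, here is a minimal sketch that converts a string date column on a tiny, made-up dataframe using pd.to_datetime:

# hypothetical example: convert a string date column to pandas' datetime64 type
sample = pd.DataFrame({'DATE': ['2017-01-01', '2017-02-01']})
sample['DATE'] = pd.to_datetime(sample['DATE'], format='%Y-%m-%d')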

Data Cleaning

The Dow Jones data comes with a lot of extra columns that we don’t need in our final dataframe, so we are going to use pandas’ drop function to lose the extra columns.

# drop the unnecessary columns
dow.drop(['Open','High','Low','Adj Close','Volume'], axis=1, inplace=True)

# view the table after dropping unnecessary columns
dow.head()

# rename the remaining columns to upper case to match the other dfs
dow.columns = ['DATE', 'CLOSE']

# view result after renaming columns
dow.head()

Most of these steps could be combined into fewer steps, but I break them down so we can follow along and confirm that we are achieving the intended results. Next we drop the rows that have null values from the umcsi table. There are times when we might instead keep such rows and fill in the missing values from other tables, gaining additional information.
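As an illustration of that last point, combine_first() can fill missing values from a second table that shares the same key; the tables and numbers below are made up purely for the sketch:

# hypothetical sketch: fill gaps from a backup source instead of dropping rows
primary = pd.DataFrame({'DATE': ['2017-01-01', '2017-02-01'], 'CSI': [98.5, None]})
backup  = pd.DataFrame({'DATE': ['2017-01-01', '2017-02-01'], 'CSI': [98.4, 96.9]})
filled  = primary.set_index('DATE').combine_first(backup.set_index('DATE'))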

The inplace flag below permanently removes the dropped rows.

# drop NaN values
umcsi.dropna(inplace=True)

The umcsi table stores year values as floats, which may be problematic when we start getting decimal numbers for years. I created a function that builds a new integer column from the float column; we can then drop the old float column in place. Sometimes date columns come in string format and we have to parse the dates using pandas’ built-in functions, or write our own parsing for unique cases. And there will be many of those cases when you’re cleaning data.

# create 'Year' column with int values instead of float

# casting function
def to_int(x):
    return int(x)

# use the function to convert floats to ints
umcsi['Year'] = umcsi['Unnamed: 1'].apply(to_int)
umcsi.head()
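As an aside, assuming the column has no missing values left at this point (we dropped them above), the same cast can be written more concisely with astype():

# equivalent, more concise cast from float to int
umcsi['Year'] = umcsi['Unnamed: 1'].astype(int)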

Observe that we have the month and year as separate columns, which need to be combined to match the date format of the other data tables. For that we use pandas’ datetime functionality, which handles most date and time manipulations. As it turns out, the other tables store dates as strings, so we will also have to convert the umcsi date column to a string. This matters because, for a time series, any table joins will be on the date column as the key.

# combine the year and month columns into a single DATE column
umcsi['DATE'] = umcsi.apply(lambda x: datetime.datetime.strptime(
    "{0} {1}".format(x['Year'], x['DATE OF SURVEY']), "%Y %B"), axis=1)

# turn the dates into strings to match the other DATE columns.
# We'll merge the data on this column, so this is a vital step.
def to_str(x):
    return str(x)[:10]

umcsi['DATE'] = umcsi['DATE'].apply(to_str)
umcsi.head()

Our umcsi table is looking good, with the exception of the old float year column and the month column, so we have to get rid of those. We should also move the final date column to the front for the sake of staying organized.

# drop the unneeded columns
umcsi.drop(['Unnamed: 1','Year','DATE OF SURVEY'], axis=1, inplace=True)

# move the 'DATE' column to the front
cols = list(umcsi)
cols.insert(0, cols.pop(cols.index('DATE')))
umcsi = umcsi.reindex(columns=cols)
umcsi.head()

A more useful table than the one we started out with.

With all the tables in a cohesive format, we can go ahead and join them and do some final cleanup steps. We shall merge the tables using the date column as the key. We’ll use reduce together with a lambda function to get it done on the fly, wrapping a few functions to demonstrate just how powerful pandas is for data manipulation.

# merge all dataframes into one final dataframe
dfs = [dow, unemp, oil, hstarts, cars, retail, fedrate, umcsi]

# perform the joins on the DATE column as the key and drop null values
df = reduce(lambda left, right: pd.merge(left, right, on='DATE', how='outer'), dfs).dropna()

df.head(5)

We now have a final pandas dataframe, even though it still needs a bit more cleanup. Next we remove outliers from the final table, since these outliers are likely to introduce a lot of noise into our machine learning task later on.

# remove all rows with an outlier in at least one column
df = df[(np.abs(stats.zscore(df.drop(['DATE'], axis=1))) < 3).all(axis=1)]

# show final size after removing outliers
df.shape

Final dataframe shape

Python has a specialized datetime format for time columns which is very efficient. We can convert the current string format to datetime.datetime using the strptime() function. Again we’ll use a lambda function to apply it to all rows on the fly.

# change the DATE column from string to Python's
# datetime.datetime format
df['DATE'] = df['DATE'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))

The final step is to rename the columns to more user-friendly names for those who go on to consume this data.

# rename columns to more user friendly names
df.columns = ['DATE', 'CLOSE', 'UNEMP %', 'OIL PRICE', 'NEW HOMES',
              'NEW CARS SOLD', 'RETAIL SALES', 'FED INTRST %', 'CSI']

# preview final table
df.head(20)

FINAL pandas dataframe
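Depending on how the ML team prefers to consume the table, one convenient hand-off is to write it out to disk; the file name below is purely illustrative:

# persist the cleaned table for the modeling team (file name is illustrative)
df.to_csv('data/clean_economic_indicators.csv', index=False)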

Data cleaning conclusions

  • Data cleaning comes in all shapes and sizes and there is no one template to handle all situations.
  • While we do not know how the data will perform for the task of predicting the CSI, we do know that the data supplied has been processed to facilitate fast adoption and testing in an ML environment.
  • Sure, we could have engineered more features and processed the current ones further, but that would be presumptuous on our part as to how the ML team would proceed. For example, we could have normalized the features to a uniform scale but did not (a small sketch of what that could look like follows this list).
  • In conclusion, like most tasks in the data science world, the best we can do is keep asking the right questions and experiment further based on those questions.
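For reference, a minimal sketch of what min-max normalization of the numeric columns could look like, had we chosen to do it:

# illustrative only: scale every numeric column to the [0, 1] range
numeric_cols = df.columns.drop('DATE')
df_scaled = df.copy()
df_scaled[numeric_cols] = (df[numeric_cols] - df[numeric_cols].min()) / \
                          (df[numeric_cols].max() - df[numeric_cols].min())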
