Cleaning Time Series Data in Python
No matter what kind of data science project you are assigned to, making sense of the dataset and cleaning it is always critical for success. The first step is to understand the data using exploratory data analysis (EDA), as it helps us create a logical approach for solving the problem.

"Black and white photo of the street sign for Wall St in New York City" by Rick Tap on Unsplash

A hypothetical company, ABC Financial Services Corp, makes financial investment decisions on behalf of its clients based on the company's economic research. Many of these decisions involve speculating on future prices of financial instruments. ABC Corp utilizes several economic indicators, but one in particular is heavily weighted in its analysis: the University of Michigan's Consumer Sentiment Index (CSI). The only problem is that the company has to wait for the release of this data (published once a month), which erodes some of ABC's edge in the market. To stay competitive, it would like a way to predict this number ahead of time.

I propose using a form of machine learning (ML) to make time series predictions of the final Consumer Sentiment number before it is released. To do this, we are going to use other economic data (as features for the ML algorithm) that is released before the CSI. We'll use this collection of data to construct a final dataset that is ready for a predictive algorithm.

Multiple datasets

The historical datasets that we'll use are listed below and can be downloaded from the following links:
Tools

We'll use Python with the pandas library to handle our data cleaning task. We are going to use Jupyter Notebook, an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It is a really great tool for data scientists. You can head over to Anaconda.org to download the latest version, which comes pre-loaded with most data science libraries.

We will combine the above datasets into one table using pandas and then do the necessary cleaning. Some of the above datasets have been seasonally adjusted to remove the influence of predictable seasonal patterns. In actual model training/testing, we would experiment with both types of datasets.

Data cleaning is highly dependent on the type of data and the task you're trying to achieve. In our case, we combine data from different sources and clean up the resulting dataframe. With image classification data, we may have to reshape and resize the images and create labels, while a sentiment analysis task may need to be checked for grammatical errors and have keywords extracted.

Visually inspect the dataframes

To do this, we'll need a few imports, as shown below.

# Import necessary modules
import numpy as np

Import the data tables into pandas DataFrames.

# load all the datasets to pandas DataFrames
dow = pd.read_csv('data/Dow Jones Industrial Average DJI.csv')

After loading up the data, the first thing we do is visually inspect it to understand how it is structured, what it contains, and to note anything out of the ordinary. Most of the data you are going to come across is at least thousands of rows long, so I like to inspect random chunks of rows at a time. We use the head() and tail() functions to inspect the top and bottom sections of the table respectively.
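As a hedged sketch of this loading step: in the article only the Dow Jones filename is shown, and the real code would point pd.read_csv at the downloaded files. Here io.StringIO stands in for a file on disk so the snippet is self-contained, and the column names are illustrative assumptions.

```python
import io
import pandas as pd

# stand-in for 'data/Dow Jones Industrial Average DJI.csv';
# the real notebook would pass the file path to pd.read_csv instead
dow_csv = io.StringIO(
    "DATE,Open,High,Low,Close\n"
    "2017-01-01,19872.86,19938.53,19775.93,19864.09\n"
    "2017-02-01,19923.81,20125.58,19831.09,20812.24\n"
)

dow = pd.read_csv(dow_csv)
print(dow.shape)  # (2, 5)
```

The same pattern (one read_csv call per file, one DataFrame per dataset) would be repeated for the unemployment, oil, housing, retail, and sentiment tables.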
# view the top of the dow jones table
dow.head()

# view the top of the umcsi table
umcsi.head()

# view the bottom of the data table
umcsi.tail()

We can also use a loop to iterate through all the tables and get back their sizes. From this step we can start to anticipate the kinds of joins we will need to do, or decide whether we statistically have enough data to get started. Remember: bad data is worse than no data.

# get the shape of the different datasets

Another useful pandas tool when we are inspecting data is the describe() function, which gives a snapshot of the general statistics of all the numerical columns in the data.

# we look at the statistical characteristics of the datasets

The table above contains more columns, as indicated by the backslash at the top right.

We also want to know whether we are dealing with data containing null values, as these can result in bad data if ignored. One way to find the null values is to use the isnull() function to extract this information. We use a loop below to iterate over all the data tables.

# see which datasets have null values
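The inspection loops described above might look like the following sketch. The DataFrames here are tiny in-memory stand-ins for the real tables (the names mirror those used in the article), since the original gists are not shown.

```python
import numpy as np
import pandas as pd

# tiny stand-ins for the loaded tables
dow = pd.DataFrame({'DATE': ['2017-01-01', '2017-02-01'],
                    'Close': [19864.09, 20812.24]})
umcsi = pd.DataFrame({'Month': [1, 2],
                      'Year': [2017.0, 2017.0],
                      'UMCSENT': [98.5, np.nan]})

dfs = {'dow': dow, 'umcsi': umcsi}

# get the shape of the different datasets
for name, df in dfs.items():
    print(name, df.shape)

# snapshot of the general statistics of the numerical columns
print(dow.describe())

# see which datasets have null values
for name, df in dfs.items():
    print(name, df.isnull().values.any())
```

Running this immediately shows that the umcsi stand-in contains a null value, which is exactly the kind of issue the cleaning steps below will deal with.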
Making observations about the data
Data Cleaning

The Dow Jones data comes with a lot of extra columns that we don't need in our final dataframe, so we are going to use pandas' drop function to lose the extra columns.

# drop the unnecessary columns

# rename columns to upper case to match other dfs

Most of these steps could be combined into fewer steps, but I break them down so we can follow along and confirm that we are achieving the intended results.

Next we drop the columns that have null values from the data table. There are times when we may need to keep these null columns and drop them later, after we have filled in the values from other tables or gained additional columns (information). The inplace flag below permanently removes the dropped rows.

# drop NaN values

The umcsi table stores year values as floats, which may be problematic if we start getting decimal numbers for years. I created a function that builds a new integer column from the float column. We can then drop the old float year column in place. Sometimes the date columns are in string format and we have to parse the dates using pandas' built-in functions, or write our own parsing for unique cases. And there will be many of those cases when you're cleaning data.

# create 'Year' column with int values instead of float
# casting function

Observe that we have month and year as separate columns, which need to be combined to match the date format of the other data tables. For that we use pandas' datetime functions, which can handle most date and time manipulations. As it turns out, the other tables store dates as strings, so we will also have to convert the umcsi date column to string. This matters because, as a time series, any table joins will use the date column as the key.

# combine year and month columns into one date column

Our umcsi table is looking good, with the exception of the old float year column and the month column, so we have to get rid of those.
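The umcsi cleaning steps described above could be sketched as follows. The column names ('Yr', 'Month', 'UMCSENT') are assumptions, since the original gists are not shown, and the float-to-int cast plus the pandas date construction are one common way to implement the steps described.

```python
import pandas as pd

# stand-in for the umcsi table with a float year column
umcsi = pd.DataFrame({
    'Month': [1, 2, 3],
    'Yr': [2017.0, 2017.0, 2017.0],   # hypothetical float year column name
    'UMCSENT': [98.5, 96.3, 96.9],
})

# create 'Year' column with int values instead of float,
# then drop the old float column in place
umcsi['Year'] = umcsi['Yr'].astype(int)
umcsi.drop('Yr', axis=1, inplace=True)

# combine 'Year' and 'Month' into a single DATE column, then cast it to a
# string to match the date format of the other tables
umcsi['DATE'] = pd.to_datetime(
    pd.DataFrame({'year': umcsi['Year'], 'month': umcsi['Month'], 'day': 1})
).dt.strftime('%Y-%m-%d')

# drop the now-redundant year and month columns
umcsi.drop(['Year', 'Month'], axis=1, inplace=True)
print(umcsi)
```

pd.to_datetime accepts a DataFrame with 'year', 'month', and 'day' columns, which is a convenient way to assemble dates from separate numeric columns.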
We should also move the final date column to the front for the sake of staying organized.

# drop unneeded columns

A more useful table than we started off with.

With all the tables in a cohesive format, we can go ahead and join them and do some final cleanup steps. We shall merge the tables with the date column as the key. We'll use the all-powerful lambda function to get it done on the fly. We will actually wrap a few more functions around it, to demonstrate just how powerful pandas is for data manipulation.

# concatenate all dataframes into one final dataframe
dfs = [dow,unemp,oil,hstarts,cars,retail,fedrate,umcsi]

# we perform the joins on DATE column as key and drop null values
df = reduce(lambda left,right: pd.merge(left,right,on='DATE', how='outer'), dfs).dropna()

df.head(5)

We now have a final pandas dataframe, even though it still needs a bit more cleanup. Next we have to remove outliers from our final table, since these outliers are likely to introduce a lot of noise to our machine learning task later on.

# remove all rows with outliers in at least one column

Final dataframe shape

Python has a specialized format for dealing with time columns which is very efficient. We can parse the current string format into datetime.datetime objects using the strptime() function. Again we'll use a lambda function to apply it to all rows on the fly.

# change the DATE column from String to python's datetime.datetime

The final step is to rename the columns to more user-friendly names for those who go on to consume this data.

# rename columns to more user friendly names

FINAL pandas dataframe

Data cleaning conclusions
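The merge-and-clean pipeline above can be sketched end to end. The reduce/merge line comes from the article; the z-score outlier filter (with an assumed threshold of 3 standard deviations) and the strptime conversion are one plausible way to implement the steps described, and the two small tables here stand in for the eight real datasets.

```python
import datetime
from functools import reduce

import numpy as np
import pandas as pd

# stand-ins for two of the eight tables, already keyed on a string DATE column
dow = pd.DataFrame({'DATE': ['2017-01-01', '2017-02-01', '2017-03-01'],
                    'CLOSE': [19864.09, 20812.24, 20663.22]})
umcsi = pd.DataFrame({'DATE': ['2017-01-01', '2017-02-01', '2017-03-01'],
                      'UMCSENT': [98.5, 96.3, 96.9]})

dfs = [dow, umcsi]

# perform the joins on the DATE column as key and drop null values
df = reduce(lambda left, right: pd.merge(left, right, on='DATE', how='outer'),
            dfs).dropna()

# remove all rows with an outlier in at least one numeric column:
# keep rows whose values are all within 3 standard deviations of the mean
num = df.select_dtypes(include=[np.number])
df = df[(np.abs((num - num.mean()) / num.std()) < 3).all(axis=1)].copy()

# change the DATE column from string to python's datetime.datetime
df['DATE'] = df['DATE'].apply(
    lambda d: datetime.datetime.strptime(d, '%Y-%m-%d'))

# rename columns to more user-friendly names
df = df.rename(columns={'UMCSENT': 'CONSUMER SENT'})
print(df.head())
```

On this toy data no rows are flagged as outliers, but on real data the same filter would silently drop any month whose values sit far from the rest, so it is worth printing the shape before and after.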