Overview
Teaching: ?? min
Exercises: 0 min
Questions
How do we explore and better understand the structure and format of our data?
Objectives
Learn about character and numeric data types.
Learn how to explore the structure of your data.
Understand NaN values and different ways to deal with them.
The format of individual columns and rows will impact analysis performed on a dataset read into Python. For example, you can’t perform mathematical calculations on a string (character-formatted data). This might seem obvious; however, numeric values are sometimes read into Python as strings. When you then try to perform calculations on the string-formatted numeric data, you get an error.
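As a minimal sketch of this problem, the example below uses a few made-up values (not taken from articles.csv) that were stored as text, shows the error raised by mixing a string with a number, and then converts the text with pd.to_numeric so that calculations work.

```python
import pandas as pd

# Hypothetical values: numbers that were read in as text.
counts_as_text = pd.Series(["3", "5", "12"])

# Arithmetic between a string and a number raises a TypeError.
try:
    "3" + 5
    error_raised = False
except TypeError as err:
    error_raised = True
    print("Cannot add a string and a number:", err)

# pd.to_numeric converts the text to numbers so calculations work.
counts = pd.to_numeric(counts_as_text)
total = counts.sum()
print(total)  # 20
```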
Types of Data
How information is stored in a DataFrame or a Python object affects what we can do with it, as well as the outputs of calculations. There are two main types of data that we’ll explore in this lesson: numeric and character types.
Numeric Data Types
Numeric data types include integers and floats. A floating point number (known as a float) has a decimal point even if the fractional part is 0. For example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and floating point numbers, Pandas will assign the entire column to the float data type so the decimal points are not lost.
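A small sketch of this behaviour, using two made-up Series rather than the lesson data:

```python
import pandas as pd

# Only whole numbers: pandas picks an integer dtype.
whole = pd.Series([1, 2, 3])

# One value has a fractional part: the whole Series becomes float.
mixed = pd.Series([1, 2.5, 3])

print(whole.dtype)  # int64
print(mixed.dtype)  # float64
```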
An integer never has a decimal point. If converted to integers, 1.13 would be stored as 1 and 1234.345 as 1234. You will often see the data type int64 in Pandas, which stands for 64-bit integer. The 64 refers to the memory allocated to store each value, which determines how large a number each “cell” can hold.
Allocating space ahead of time allows computers to optimize storage and processing efficiency.
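To get a feel for what 64 bits allow, NumPy (the library pandas builds on) can report the smallest and largest values an int64 can hold. This is only an illustration; the variable name is our own.

```python
import numpy as np

# Limits of a 64-bit signed integer.
int64_info = np.iinfo(np.int64)

print(int64_info.min)  # -9223372036854775808
print(int64_info.max)  # 9223372036854775807
```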
Character Data Types
Strings, known as Objects in Pandas, are values that contain numbers and/or characters. For example, a string might be a word, a sentence, or several sentences. A Pandas object might also be a plot name like ‘plot1’. A string can also contain or consist of numbers. For instance, ‘1234’ could be stored as a string, as could ‘10.23’. However, strings that contain numbers cannot be used for mathematical operations!
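Some operations on numeric strings do not fail but quietly give a result that is not arithmetic. The sketch below, in plain Python with illustrative variable names, contrasts string repetition with real multiplication after converting via int() and float().

```python
number_as_text = "1234"

# Multiplying a string repeats it; no arithmetic happens.
repeated = number_as_text * 2
print(repeated)  # 12341234

# Converting first gives a true numeric result.
doubled = int(number_as_text) * 2
print(doubled)  # 2468

# float() handles strings that contain a decimal point.
decimal_value = float("10.23")
print(decimal_value + 1)
```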
Pandas and base Python use slightly different names for data types. More on this is in the table below:
Pandas Type | Native Python Type | Description |
object | string | The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings). |
int64 | int | Whole numbers. 64 refers to the bits of memory allocated to hold each value. |
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |
datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. |
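The table can be checked against a small DataFrame. The columns and values below are invented for illustration; the exact datetime unit shown in the output may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, one per row of the table above.
example_df = pd.DataFrame({
    "mixed": [1, "two", 3.0],          # numbers and strings -> object
    "whole": [1, 2, 3],                # integers only -> int64
    "with_gap": [1, np.nan, 3],        # integers plus NaN -> float64
    "when": pd.to_datetime(["2020-01-01", "2020-06-15", "2021-03-30"]),
})

print(example_df.dtypes)
```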
Checking the format of our data
Now that we’re armed with a basic understanding of numeric and character data types, let’s explore the format of our articles data. We’ll be working with the same articles.csv dataset that we’ve used in previous lessons.
# note that pd.read_csv is used because we imported pandas as pd
import pandas as pd
articles_df = pd.read_csv("articles.csv")
Remember that we can check the type of an object like this:
type(articles_df)
pandas.core.frame.DataFrame
Next, let’s look at the structure of our articles data. In pandas, we can check the type of one column in a DataFrame using the syntax dataFrameName[column_name].dtype:
articles_df['Title'].dtype
A type ‘O’ just stands for “object” which in Pandas’ world is a string (characters).
articles_df['Author_Count'].dtype
The type int64 tells us that Pandas is storing each value within this column as a 64-bit integer. We can use the articles_df.dtypes attribute to view the data type for each column in a DataFrame (all at once):
articles_df.dtypes
which returns:
id int64
Title object
Authors object
DOI object
URL object
Subjects object
ISSNs object
Citation object
LanguageId int64
LicenceId int64
Author_Count int64
First_Author object
Citation_Count int64
Day int64
Month int64
Year int64
dtype: object
Working With Integers and Floats
So we’ve learned that computers store numbers in one of two ways: as integers or as floating-point numbers (or floats). Integers are the numbers we usually count with. Floats have fractional parts (decimal places). Let’s next consider how the data type can impact mathematical operations on our data. Addition, subtraction, division and multiplication work on floats and integers as we’d expect.
If we divide one integer by another, we get a float. The result in Python 3 is different from that in Python 2, where the result is an integer (integer division).
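A short sketch of division in Python 3, with variable names chosen for illustration. The // operator is shown as well because it gives the integer (floor) division that Python 2 performed by default.

```python
# The / operator always returns a float in Python 3.
true_quotient = 7 / 2
print(true_quotient)  # 3.5

# Even when the result is a whole number, it is still a float.
even_quotient = 6 / 3
print(type(even_quotient))  # <class 'float'>

# The // operator performs floor division and returns an integer here.
floor_quotient = 7 // 2
print(floor_quotient)  # 3
```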
We can also convert a floating point number to an integer, or an integer to a floating point number. Notice that Python does not round when it converts from floating point to integer: it drops the fractional part (truncating toward zero), so 7.83 becomes 7.
# convert a to an integer (the fractional part is dropped)
a = 7.83
int(a)
# convert b to a float
b = 7
float(b)
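How int() treats the fractional part matters most for negative numbers. The sketch below compares int() with math.floor() and round() from the standard library, using illustrative values.

```python
import math

positive = 7.83
negative = -7.83

# int() drops the fractional part (truncates toward zero).
print(int(positive))  # 7
print(int(negative))  # -7

# math.floor() always rounds down, which differs for negatives.
print(math.floor(negative))  # -8

# round() goes to the nearest integer instead.
print(round(positive))  # 8
```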
Working With Our Articles Data
Getting back to our data, we can modify the format of values within our data, if we want. For instance, we could convert the Author_Count field to floating point values.
# convert the Author_Count field from an integer to a float
articles_df['Author_Count'] = articles_df['Author_Count'].astype('float64')
articles_df['Author_Count'].dtype
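The conversion also works in the other direction, as long as the column has no missing values. The sketch below uses a small made-up DataFrame in place of articles.csv and shows that converting a float column containing NaN to int64 raises an error.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the articles data.
sample_df = pd.DataFrame({"Author_Count": [1, 4, 2]})

# Integer to float and back again.
sample_df["Author_Count"] = sample_df["Author_Count"].astype("float64")
sample_df["Author_Count"] = sample_df["Author_Count"].astype("int64")
print(sample_df["Author_Count"].dtype)  # int64

# A float column with a missing value cannot become int64.
with_gap = pd.Series([1.0, np.nan, 2.0])
try:
    with_gap.astype("int64")
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("Conversion failed:", err)
```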
Missing Data Values - NaN
Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? What value should we use? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.
For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, read those empty cells as NaN. However, it is good practice to get in the habit of intentionally marking cells that have no data with a no-data value! That way there are no questions in the future when you (or someone else) explore your data.
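If a dataset does use a sentinel such as -9999, pd.read_csv can be told to treat it as missing through its na_values parameter. The sketch below reads a tiny made-up CSV from a string (the column names and values are invented) and compares the mean with and without that setting.

```python
import io
import pandas as pd

# Hypothetical data: one real value, one -9999 sentinel, one empty cell.
csv_text = "site,reflectance\nA,0.42\nB,-9999\nC,\n"

# Default read: only the empty cell becomes NaN.
raw = pd.read_csv(io.StringIO(csv_text))

# Treat -9999 as missing as well.
clean = pd.read_csv(io.StringIO(csv_text), na_values=[-9999])

print(raw["reflectance"].mean())    # dragged far below zero by -9999
print(clean["reflectance"].mean())  # only the real value remains
```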
Where Are the NaN’s?
Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for DOI. We can also create a new subset from our data that only contains rows with a non-null DOI (i.e. select only the rows with a meaningful DOI value):
# how many rows have a missing DOI?
len(articles_df[articles_df['DOI'].isnull()])
# how many rows have a DOI set?
len(articles_df[~articles_df['DOI'].isnull()])
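The same counts can be obtained more compactly by summing the boolean result of isnull() or notnull(). The sketch below uses a small made-up DataFrame, with invented titles and DOI strings, standing in for the articles data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the articles data.
sample_df = pd.DataFrame({
    "Title": ["First", "Second", "Third", "Fourth"],
    "DOI": ["10.1000/a", np.nan, "10.1000/c", np.nan],
})

# True counts as 1, so summing gives the number of matching rows.
missing = sample_df["DOI"].isnull().sum()
present = sample_df["DOI"].notnull().sum()
print(missing, present)  # 2 2

# Applied to the whole DataFrame, this counts NaNs per column.
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```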
We can replace all null values with a given value (let’s use ‘UNKNOWN’) using the .fillna() method (after making a copy of the data so we don’t lose our work):
df1 = articles_df.copy()
# fill all NaN values in the DOI column with 'UNKNOWN'
df1['DOI'] = df1['DOI'].fillna('UNKNOWN')
Challenge
Verify that df1 indeed has a DOI value for all articles.
However, the placeholder ‘UNKNOWN’ tells us nothing about the real DOI of the articles in question. We could also choose to create a subset of our data, only keeping rows that do not contain null values.
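One possible way to build such a subset is the dropna() method. The sketch below again uses a small made-up DataFrame rather than articles.csv, and the subset argument limits the check to the DOI column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the articles data.
sample_df = pd.DataFrame({
    "Title": ["First", "Second", "Third"],
    "DOI": ["10.1000/a", np.nan, "10.1000/c"],
})

# Keep only the rows where DOI is not null; the original is unchanged.
with_doi = sample_df.dropna(subset=["DOI"])

print(len(sample_df))  # 3
print(len(with_doi))   # 2
```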
The point is to make conscious decisions about how to manage missing data. This is where we think about how our data will be used and how these values will impact the scientific conclusions made from the data.
Python gives us all of the tools that we need to account for these issues. We just need to be cautious about how the decisions that we make impact scientific results.
Recap
What we’ve learned:
- How to explore the data types of columns within a DataFrame
- How to change the data type
- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
Key Points
Types of data: numeric (integers and floats), character, datetime.
Missing data: how to detect and work with missing data.