This post walks you through the basics of the pandas library for Python.
With pandas, you can easily explore and manipulate a dataset without writing complex code.
So let's start.
# Data Structures
We will work with the following data structures (List and Dictionary come from Python itself; Series and DataFrame are provided by pandas):
- List
- Dictionary
- Series
- DataFrame
Whenever we import a dataset using the pandas library, pandas treats it as a DataFrame.
So, what is a DataFrame? Well, a DataFrame is essentially a combination of lists and dictionaries.
I will introduce you to how we can use pandas for data exploration; look at the following code.
# Introduction
To import pandas, use the following command:
```python
import pandas as pd
```
## Dictionary
A dictionary is a data structure that stores data as key: value pairs. You may be familiar with arrays, which store values at index positions; to retrieve data from an array, you access it by its index number.
In a dictionary, we use a key instead of an index to access an element, and you can choose whatever keys you want.
```python
d = {"Roll": 27, "Name":"Srujan", "branch":"Information Technology", "Age": 20}
```
Now, if I want to access the age, I just do:
```python
d["Age"]
```
20
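If a key might be missing, the dictionary's .get() method avoids a KeyError; a small sketch using the same record (the "City" key is a made-up example of a key that does not exist):

```python
d = {"Roll": 27, "Name": "Srujan", "branch": "Information Technology", "Age": 20}

# .get() returns None (or a supplied default) instead of raising KeyError
print(d.get("Age"))          # existing key -> 20
print(d.get("City", "N/A"))  # missing key falls back to the default -> N/A
```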
## DataFrame
A DataFrame is a table-like data structure with rows and columns; you can create one with the pandas function DataFrame().
A DataFrame is a combination of a dictionary and a list. For example, if you want to construct a column, say Roll, you create a key "Roll" and pass a list of its values.
```python
data = {"Roll": [1,2,3,4,5,6] , "Name": ["Srujan", "Rohan", "Ayush", "Vaibhav", "Priya", "Akansha"], "Branch" : ["IT","CS","Mech","IT","CS","ET"], "Age":[19,20,20,21,22,23]}
```
```python
data
```
{'Roll': [1, 2, 3, 4, 5, 6],
'Name': ['Srujan', 'Rohan', 'Ayush', 'Vaibhav', 'Priya', 'Akansha'],
'Branch': ['IT', 'CS', 'Mech', 'IT', 'CS', 'ET'],
'Age': [19, 20, 20, 21, 22, 23]}
To convert this dictionary to a DataFrame, we use the DataFrame() function from the pandas library.
```python
dataframe = pd.DataFrame(data)
```
```python
dataframe
```
| | Roll | Name | Branch | Age |
|---|---|---|---|---|
| 0 | 1 | Srujan | IT | 19 |
| 1 | 2 | Rohan | CS | 20 |
| 2 | 3 | Ayush | Mech | 20 |
| 3 | 4 | Vaibhav | IT | 21 |
| 4 | 5 | Priya | CS | 22 |
| 5 | 6 | Akansha | ET | 23 |
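Once the DataFrame exists, individual columns behave like dictionary lookups, and rows can be fetched by position; a minimal sketch using a shortened version of the same toy data:

```python
import pandas as pd

data = {"Roll": [1, 2, 3], "Name": ["Srujan", "Rohan", "Ayush"],
        "Branch": ["IT", "CS", "Mech"], "Age": [19, 20, 20]}
df = pd.DataFrame(data)

# selecting one column returns a Series
print(df["Name"].tolist())     # ['Srujan', 'Rohan', 'Ayush']

# selecting one row by position returns a Series indexed by column names
print(df.iloc[0]["Branch"])    # IT
```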
# How To Read a Dataset
Pandas provides several read methods, depending on the file extension of your dataset. In our example the dataset is a CSV file, so we will use the read_csv() method to read our data.
You can download the dataset from the link below:
https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv
```python
data = pd.read_csv(r"auto.csv")
```
Now, to find out how many rows and columns (instances and features) your dataset has, we type:
```python
data.shape
```
(204, 26)
This means our dataset contains 204 rows and 26 columns. (The first data row was consumed as a header here; once we pass explicit column names below, all 205 rows are kept.)
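shape is a plain (rows, columns) tuple, so it can be unpacked directly; a small sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a tuple, so both dimensions can be unpacked in one line
rows, cols = df.shape
print(rows, cols)  # 3 2
```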
Let's take a small look at our dataset. For this, we will use the head() function, which takes as input the number of rows you want to display.
```python
data.head(3)
```
| | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 2 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |

3 rows × 26 columns
Can you see any headings in this dataset? No. This happens because the dataset file does not contain a header row; it only has values.
Don't worry, we can add the headings ourselves.
```python
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
```
```python
data = pd.read_csv(r"auto.csv", names = headers)
```
```python
data.head(2)
```
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |

2 rows × 26 columns
This dataset is from an automobile company. Let's explore it; we can use the info() method to get details about each column of the dataset.
```python
data.info()
```
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling 205 non-null int64
normalized-losses 205 non-null object
make 205 non-null object
fuel-type 205 non-null object
aspiration 205 non-null object
num-of-doors 205 non-null object
body-style 205 non-null object
drive-wheels 205 non-null object
engine-location 205 non-null object
wheel-base 205 non-null float64
length 205 non-null float64
width 205 non-null float64
height 205 non-null float64
curb-weight 205 non-null int64
engine-type 205 non-null object
num-of-cylinders 205 non-null object
engine-size 205 non-null int64
fuel-system 205 non-null object
bore 205 non-null object
stroke 205 non-null object
compression-ratio 205 non-null float64
horsepower 205 non-null object
peak-rpm 205 non-null object
city-mpg 205 non-null int64
highway-mpg 205 non-null int64
price 205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
The info() method returns the number of values and the datatype of each column.
The first step is to understand your dataset: only when you understand it well can you process it for further analysis.
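A quicker way to see just the column types is the dtypes attribute, and select_dtypes() can filter columns by type; a small sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [19, 20], "score": [1.5, 2.5]})

# one dtype per column, without the null counts that info() prints
print(df.dtypes)

# keep only the numeric columns
print(df.select_dtypes(include="number").columns.tolist())  # ['age', 'score']
```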
Let's understand some columns
## fuel-type
From data.info() we can observe that fuel-type has 205 values, all non-null and of object type.
From data.head(2) you can see that it shows gas as the value, but do all 205 cells contain the same value?
We can check this using the value_counts() function.
```python
data['fuel-type'].value_counts()
```
gas 185
diesel 20
Name: fuel-type, dtype: int64
So, here we can see that fuel-type contains only 2 values, gas and diesel, where gas occurs 185 times and diesel occurs 20 times.
Let's explore more about that column
```python
data['fuel-type'].describe()
```
count 205
unique 2
top gas
freq 185
Name: fuel-type, dtype: object
Here, count is the total number of values in the column.
unique shows that there are only 2 unique values, i.e. gas and diesel.
top tells us that the most frequent value is "gas".
freq tells us that "gas" occurs 185 times in total.
## value_counts()
Remember that the value_counts() function is useful where a column has categorical or repeating values.
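value_counts() can also report proportions instead of raw counts; a small sketch on a hypothetical toy Series:

```python
import pandas as pd

s = pd.Series(["gas", "gas", "diesel", "gas"])

print(s.value_counts())                # raw counts: gas 3, diesel 1
print(s.value_counts(normalize=True))  # proportions: gas 0.75, diesel 0.25
```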
### Let's look at the column normalized-losses
```python
data["normalized-losses"].describe()
```
count 205
unique 52
top ?
freq 41
Name: normalized-losses, dtype: object
```python
data["normalized-losses"].head(5)
```
0 ?
1 ?
2 ?
3 164
4 164
Name: normalized-losses, dtype: object
Observe: when I used the describe() function, it shows that the most frequent value is "?".
# Dealing with missing values
In a dataset, missing values are represented by ?, NaN, or simply a blank cell.
To deal with missing values we can:
1. Remove the rows which contain the missing values.
2. Fill the missing values with the mean value of that column.
Removing rows is preferable when there are only a few missing values;
otherwise, we fill the missing values with the mean.
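Both strategies map directly onto pandas methods; a minimal sketch of each on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100.0, np.nan, 300.0]})

# Strategy 1: drop the rows that have a missing value in this column
dropped = df.dropna(subset=["price"])
print(len(dropped))  # 2

# Strategy 2: fill the missing values with the column mean
filled = df["price"].fillna(df["price"].mean())
print(filled.tolist())  # [100.0, 200.0, 300.0]
```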
Let us find out which columns have missing values, and how many.
# The isnull() Function
The isnull() function returns a DataFrame of True and False values, where True means a missing value and False means not missing.
```python
data.isnull()
```
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 201 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 202 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 203 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 204 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |

205 rows × 26 columns
Here you can observe that all the values are False, which would mean there are no missing values in the dataset.
But wait: while observing the normalized-losses column, we saw some "?" values.
This happens because pandas only considers NaN values as missing or null. So, to correct our dataset, we first have to replace all the "?" values with NaN.
```python
import numpy as np
```
# The replace() function
For more details visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
```python
data.replace("?", np.nan, inplace = True)
```
Here inplace=True means the changes are saved back to the same DataFrame.
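Without inplace=True, replace() returns a modified copy, and the equivalent pattern is to assign that copy back; a small sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["?", "1", "2"]})

# same effect as df.replace("?", np.nan, inplace=True)
df = df.replace("?", np.nan)
print(df["a"].isnull().sum())  # 1
```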
Now, let us use the isnull() function again
```python
data.isnull().head(5)
```
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |

5 rows × 26 columns
Observe the normalized-losses column: it now has some True values.
Now, let's get a short, easy-to-read summary of the missing values. Chaining sum() after isnull() gives the count of null values in each column.
```python
data.isnull().sum()
```
symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
dtype: int64
So, here we can see that these columns have missing values:
1. normalized-losses has 41 null values
2. num-of-doors has 2
3. bore has 4
4. stroke has 4
5. horsepower has 2
6. peak-rpm has 2
7. price has 4
Let's take care of these columns. For ease, let's begin with the columns that have fewer missing values.
## num-of-doors
```python
data['num-of-doors'].describe()
```
count 203
unique 2
top four
freq 114
Name: num-of-doors, dtype: object
```python
data['num-of-doors'].value_counts()
```
four 114
two 89
Name: num-of-doors, dtype: int64
Here you can observe that most of the values are "four", so we will use the replace function to replace the NaN values with "four".
```python
data['num-of-doors'].replace(np.nan, "four", inplace=True)
```
Let's check whether any null values remain:
```python
data['num-of-doors'].isnull().sum()
```
0
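Filling a categorical column with its most frequent value can also be written generically using mode(), so you don't have to hard-code "four"; a small sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

doors = pd.Series(["four", "two", "four", np.nan])

# mode()[0] is the most frequent value ("four" here); NaN is ignored by default
doors = doors.fillna(doors.mode()[0])
print(doors.tolist())  # ['four', 'two', 'four', 'four']
```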
## bore
```python
data['bore'].describe()
```
count 201
unique 38
top 3.62
freq 23
Name: bore, dtype: object
Here, bore is not a categorical column; it contains numbers. So we will compute the mean and replace the NaN values with it. But observe that the column's dtype is object, so first we need to convert it to float.
```python
data['bore'] = data['bore'].astype("float")
```
```python
data['bore'].replace(np.nan, data['bore'].mean(), inplace = True)
```
```python
data['bore'].isnull().sum()
```
0
You can also inspect several columns at once; look at the code below.
```python
data[["stroke", "horsepower","peak-rpm"]].describe()
```
| | stroke | horsepower | peak-rpm |
|---|---|---|---|
| count | 201 | 203 | 203 |
| unique | 36 | 59 | 23 |
| top | 3.40 | 68 | 5500 |
| freq | 20 | 19 | 37 |
Like bore, these columns are stored as strings (dtype object), so we first convert each one to float and then fill the missing values with the column mean. Note the parentheses on mean(): writing data["stroke"].mean without them would insert the method object itself instead of the mean value.
```python
for col in ["stroke", "horsepower", "peak-rpm"]:
    data[col] = data[col].astype("float")
    data[col].replace(np.nan, data[col].mean(), inplace=True)
```
```python
data["normalized-losses"] = data["normalized-losses"].astype("float")
```
```python
data["normalized-losses"].replace(np.nan, data["normalized-losses"].mean(), inplace=True)
```
# Price
We will delete the rows with missing values in the price column, because price is the target of this dataset: we cannot put estimated values in the output or target column, as this could distort the relationship between the inputs and the output.
```python
# simply drop whole row with NaN in "price" column
data.dropna(subset=["price"], axis=0, inplace=True)
# reset index, because we dropped four rows
data.reset_index(drop=True, inplace=True)
```
```python
data.isnull().sum()
```
symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0
dtype: int64
Now, as you can see, our dataset is free of all NaN and missing values.
# groupby() function
While analyzing our data, we may want to divide the dataset into categories and study the values of every column within each category.
For example, say I want to know the mean value of every column for each value of body-style.
So let's first look at the values of the body-style column.
```python
data["body-style"].describe()
```
count 201
unique 5
top sedan
freq 94
Name: body-style, dtype: object
```python
data["body-style"].value_counts()
```
sedan 94
hatchback 68
wagon 25
hardtop 8
convertible 6
Name: body-style, dtype: int64
Here you can observe that the body-style column has 5 categories, and I want to know the mean price of each category.
For that, I need to split my dataset into the corresponding categories and then find the mean.
```python
data.groupby('body-style').mean()
```
| body-style | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | compression-ratio | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| convertible | 2.833333 | 127.333333 | 92.700000 | 170.383333 | 65.583333 | 51.433333 | 2801.666667 | 157.166667 | 3.491667 | 8.933333 | 20.500000 | 26.000000 | 21890.500000 |
| hardtop | 1.875000 | 128.625000 | 98.500000 | 176.937500 | 66.612500 | 52.850000 | 2810.625000 | 176.250000 | 3.608750 | 10.725000 | 21.625000 | 27.250000 | 22208.500000 |
| hatchback | 1.617647 | 130.897059 | 95.435294 | 166.288235 | 65.104412 | 52.133824 | 2322.852941 | 112.852941 | 3.236015 | 9.042941 | 26.602941 | 32.382353 | 9957.441176 |
| sedan | 0.329787 | 120.893617 | 100.750000 | 178.046809 | 66.317021 | 54.387234 | 2625.893617 | 131.691489 | 3.345106 | 10.965957 | 25.053191 | 30.574468 | 14459.755319 |
| wagon | -0.160000 | 98.560000 | 102.156000 | 181.304000 | 66.256000 | 56.728000 | 2784.240000 | 123.840000 | 3.406400 | 10.316000 | 24.040000 | 28.720000 | 12371.960000 |
The groupby() function divided our dataset by the categories in the body-style column, and the mean() function calculated the mean of each column per body style.
Now, if I am interested only in the price, I can do:
```python
data.groupby('body-style').mean()['price']
```
body-style
convertible 21890.500000
hardtop 22208.500000
hatchback 9957.441176
sedan 14459.755319
wagon 12371.960000
Name: price, dtype: float64
When we write data.groupby('body-style').mean(), it returns a DataFrame, so we can access any column using ["column name"].
For further details on groupby() visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
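groupby() is not limited to a single statistic: agg() can compute several at once. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"body": ["sedan", "sedan", "wagon"],
                   "price": [100, 200, 300]})

# one row per category, one column per statistic
summary = df.groupby("body")["price"].agg(["mean", "min", "max"])
print(summary)
```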
# crosstab() function
While analyzing our dataset, we may often want to compare the values of two columns; say I want to see the relationship between fuel-type and engine-type.
I am interested in knowing which fuel type uses which engine.
For that I will use crosstab(), which takes two parameters: index and columns.
```python
data['engine-type'].value_counts()
```
ohc 145
ohcf 15
ohcv 13
l 12
dohc 12
rotor 4
Name: engine-type, dtype: int64
```python
pd.crosstab(data['fuel-type'],data['engine-type'])
```
| fuel-type | dohc | l | ohc | ohcf | ohcv | rotor |
|---|---|---|---|---|---|---|
| diesel | 0 | 5 | 15 | 0 | 0 | 0 |
| gas | 12 | 7 | 130 | 15 | 13 | 4 |
You can observe that the DataFrame we got has diesel and gas as index values and the engine types as columns.
It shows how many vehicles of each fuel type have each engine type.
For further details on crosstab() visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
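crosstab() can also report proportions instead of raw counts via the normalize parameter; a small sketch on hypothetical toy data:

```python
import pandas as pd

fuel = pd.Series(["gas", "gas", "diesel"], name="fuel")
engine = pd.Series(["ohc", "ohc", "l"], name="engine")

# normalize="index" turns each row into proportions that sum to 1
ct = pd.crosstab(fuel, engine, normalize="index")
print(ct)
```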
So now you know how to use the pandas library to analyze and tweak your data for further analysis.
In upcoming posts, we will discuss various machine learning algorithms and their uses.