This post walks you through the basics of the pandas library for Python.
With pandas, you can easily explore and manipulate a dataset without writing complex code.
So let's start.
# Data Structures
We will work with the following data structures (List and Dictionary come from Python itself; Series and DataFrame are provided by pandas):
- List
- Dictionary
- Series
- DataFrame
Whenever we import a dataset using the pandas library, pandas treats it as a DataFrame.
So, what is a DataFrame? Well, a DataFrame is essentially a combination of lists and dictionaries.
I will introduce you to how we can use pandas for data exploration; look at the following code.
# Introduction
To import pandas, use the following command:
```python
import pandas as pd
```
## Dictionary
A dictionary is a data structure that stores data as key: value pairs. You may be familiar with arrays, which store values at index positions; to retrieve data from an array, you access it by its index number.
In a dictionary, we use a key instead of an index to access an element, and you can choose whatever keys you want.
```python
d = {"Roll": 27, "Name":"Srujan", "branch":"Information Technology", "Age": 20}
```
Now, if I want to access the age, I just do:
```python
d["Age"]
```
20
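If a key might be missing, the dictionary's .get() method avoids a KeyError; a small sketch using the same record (the "City" key is a made-up example of a key that does not exist):

```python
d = {"Roll": 27, "Name": "Srujan", "branch": "Information Technology", "Age": 20}

# .get() returns None (or a supplied default) instead of raising KeyError
print(d.get("Age"))          # existing key -> 20
print(d.get("City", "N/A"))  # missing key falls back to the default -> N/A
```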
## DataFrame
A DataFrame is a table-like data structure with rows and columns; you can create one with the pandas function DataFrame().
A DataFrame is a combination of a dictionary and a list. For example, if you want to construct a column, say Roll, you create a key "Roll" and pass a list of its values.
```python
data = {"Roll": [1,2,3,4,5,6] , "Name": ["Srujan", "Rohan", "Ayush", "Vaibhav", "Priya", "Akansha"], "Branch" : ["IT","CS","Mech","IT","CS","ET"], "Age":[19,20,20,21,22,23]}
```
```python
data
```
{'Roll': [1, 2, 3, 4, 5, 6],
'Name': ['Srujan', 'Rohan', 'Ayush', 'Vaibhav', 'Priya', 'Akansha'],
'Branch': ['IT', 'CS', 'Mech', 'IT', 'CS', 'ET'],
'Age': [19, 20, 20, 21, 22, 23]}
To convert this dictionary to a DataFrame, we use the DataFrame() function from the pandas library.
```python
dataframe = pd.DataFrame(data)
```
```python
dataframe
```
| | Roll | Name | Branch | Age |
|---|---|---|---|---|
| 0 | 1 | Srujan | IT | 19 |
| 1 | 2 | Rohan | CS | 20 |
| 2 | 3 | Ayush | Mech | 20 |
| 3 | 4 | Vaibhav | IT | 21 |
| 4 | 5 | Priya | CS | 22 |
| 5 | 6 | Akansha | ET | 23 |
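Once the DataFrame exists, individual columns behave like dictionary lookups, and rows can be fetched by position; a minimal sketch using a shortened version of the same toy data:

```python
import pandas as pd

data = {"Roll": [1, 2, 3], "Name": ["Srujan", "Rohan", "Ayush"],
        "Branch": ["IT", "CS", "Mech"], "Age": [19, 20, 20]}
df = pd.DataFrame(data)

# selecting one column returns a Series
print(df["Name"].tolist())     # ['Srujan', 'Rohan', 'Ayush']

# selecting one row by position returns a Series indexed by column names
print(df.iloc[0]["Branch"])    # IT
```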
# How To Read a Dataset
Pandas provides several read methods, depending on the file extension of your dataset. In our example the dataset is a CSV file, so we will use the read_csv() method to read our data.
You can download the dataset from the link below:
https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv
```python
data = pd.read_csv(r"auto.csv")
```
Now, to find out how many rows and columns (instances and features) your dataset has, we type:
```python
data.shape
```
(204, 26)
This means our dataset contains 204 rows and 26 columns. (The first data row was consumed as a header here; once we pass explicit column names below, all 205 rows are kept.)
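shape is a plain (rows, columns) tuple, so it can be unpacked directly; a small sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a tuple, so both dimensions can be unpacked in one line
rows, cols = df.shape
print(rows, cols)  # 3 2
```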
Let's take a small look at our dataset. For this, we will use the head() function, which takes as input the number of rows you want to display.
```python
data.head(3)
```
| | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 2 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |

3 rows × 26 columns
Can you see any headings in this dataset? No. This happens because the dataset file does not contain a header row; it only has values.
Don't worry, we can add the headings ourselves.
```python
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
```
```python
data = pd.read_csv(r"auto.csv", names = headers)
```
```python
data.head(2)
```
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |

2 rows × 26 columns
This dataset is from an automobile company. Let's explore it; we can use the info() method to get details about each column of the dataset.
```python
data.info()
```
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling 205 non-null int64
normalized-losses 205 non-null object
make 205 non-null object
fuel-type 205 non-null object
aspiration 205 non-null object
num-of-doors 205 non-null object
body-style 205 non-null object
drive-wheels 205 non-null object
engine-location 205 non-null object
wheel-base 205 non-null float64
length 205 non-null float64
width 205 non-null float64
height 205 non-null float64
curb-weight 205 non-null int64
engine-type 205 non-null object
num-of-cylinders 205 non-null object
engine-size 205 non-null int64
fuel-system 205 non-null object
bore 205 non-null object
stroke 205 non-null object
compression-ratio 205 non-null float64
horsepower 205 non-null object
peak-rpm 205 non-null object
city-mpg 205 non-null int64
highway-mpg 205 non-null int64
price 205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
The info() method returns the number of values and the datatype of each column.
The first step is to understand your dataset: only when you understand it well can you process it for further analysis.
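A quicker way to see just the column types is the dtypes attribute, and select_dtypes() can filter columns by type; a small sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [19, 20], "score": [1.5, 2.5]})

# one dtype per column, without the null counts that info() prints
print(df.dtypes)

# keep only the numeric columns
print(df.select_dtypes(include="number").columns.tolist())  # ['age', 'score']
```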
Let's understand some columns
## fuel-type
From data.info() we can observe that fuel-type has 205 values, all non-null and of object type.
From data.head(2) you can see that it shows gas as the value, but do all 205 cells contain the same value?
We can check this using the value_counts() function.
```python
data['fuel-type'].value_counts()
```
gas 185
diesel 20
Name: fuel-type, dtype: int64
So, here we can see that fuel-type contains only 2 values, gas and diesel, where gas occurs 185 times and diesel occurs 20 times.
Let's explore more about that column
```python
data['fuel-type'].describe()
```
count 205
unique 2
top gas
freq 185
Name: fuel-type, dtype: object
Here, count is the total number of values in the column.
unique shows that there are only 2 unique values, i.e. gas and diesel.
top tells us that the most frequent value is "gas".
freq tells us that "gas" occurs 185 times in total.
## value_counts()
Remember that the value_counts() function is useful where a column has categorical or repeating values.
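value_counts() can also report proportions instead of raw counts; a small sketch on a hypothetical toy Series:

```python
import pandas as pd

s = pd.Series(["gas", "gas", "diesel", "gas"])

print(s.value_counts())                # raw counts: gas 3, diesel 1
print(s.value_counts(normalize=True))  # proportions: gas 0.75, diesel 0.25
```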
### Let's look at the column normalized-losses
```python
data["normalized-losses"].describe()
```
count 205
unique 52
top ?
freq 41
Name: normalized-losses, dtype: object
```python
data["normalized-losses"].head(5)
```
0 ?
1 ?
2 ?
3 164
4 164
Name: normalized-losses, dtype: object
Observe: when I used the describe() function, it shows that the most frequent value is "?".
# Dealing with missing values
In a dataset, missing values are represented by ?, NaN, or simply a blank cell.
To deal with missing values we can:
1. Remove the rows which contain the missing values.
2. Fill the missing values with the mean value of that column.
Removing rows is preferable when there are only a few missing values;
otherwise, we fill the missing values with the mean.
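Both strategies map directly onto pandas methods; a minimal sketch of each on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100.0, np.nan, 300.0]})

# Strategy 1: drop the rows that have a missing value in this column
dropped = df.dropna(subset=["price"])
print(len(dropped))  # 2

# Strategy 2: fill the missing values with the column mean
filled = df["price"].fillna(df["price"].mean())
print(filled.tolist())  # [100.0, 200.0, 300.0]
```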
Let us find out which columns have missing values, and how many.
# The isnull() Function
The isnull() function returns a DataFrame of True and False values, where True means a missing value and False means not missing.
```python
data.isnull()
```
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 201 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 202 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 203 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 204 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |

205 rows × 26 columns
Here you can observe that all the values are False, which would mean there are no missing values in the dataset.
But wait: while observing the normalized-losses column, we saw some "?" values.
This happens because pandas only considers NaN values as missing or null. So, to correct our dataset, we first have to replace all the "?" values with NaN.
```python
import numpy as np
```
# The replace() function
For more details visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
```python
data.replace("?", np.nan, inplace = True)
```
Here inplace=True means the changes are saved back to the same DataFrame.
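Without inplace=True, replace() returns a modified copy, and the equivalent pattern is to assign that copy back; a small sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["?", "1", "2"]})

# same effect as df.replace("?", np.nan, inplace=True)
df = df.replace("?", np.nan)
print(df["a"].isnull().sum())  # 1
```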
Now, let us use the isnull() function again
```python
data.isnull().head(5)
```
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |

5 rows × 26 columns
Observe the normalized-losses column: it now has some True values.
Now, let's get a short, easy-to-read summary of the missing values. Chaining sum() after isnull() gives the count of null values in each column.
```python
data.isnull().sum()
```
symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
dtype: int64
So, here we can see that these columns have missing values:
1. normalized-losses has 41 null values
2. num-of-doors has 2
3. bore has 4
4. stroke has 4
5. horsepower has 2
6. peak-rpm has 2
7. price has 4
Let's take care of these columns. For ease, let's begin with the columns that have fewer missing values.
## num-of-doors
```python
data['num-of-doors'].describe()
```
count 203
unique 2
top four
freq 114
Name: num-of-doors, dtype: object
```python
data['num-of-doors'].value_counts()
```
four 114
two 89
Name: num-of-doors, dtype: int64
Here you can observe that most of the values are "four", so we will use the replace function to replace the NaN values with "four".
```python
data['num-of-doors'].replace(np.nan, "four", inplace=True)
```
Let's check whether any null values remain:
```python
data['num-of-doors'].isnull().sum()
```
0
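Filling a categorical column with its most frequent value can also be written generically using mode(), so you don't have to hard-code "four"; a small sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

doors = pd.Series(["four", "two", "four", np.nan])

# mode()[0] is the most frequent value ("four" here); NaN is ignored by default
doors = doors.fillna(doors.mode()[0])
print(doors.tolist())  # ['four', 'two', 'four', 'four']
```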
## bore
```python
data['bore'].describe()
```
count 201
unique 38
top 3.62
freq 23
Name: bore, dtype: object
Here, bore is not a categorical column; it contains numbers. So we will compute the mean and replace the NaN values with it. But observe that the column's dtype is object, so first we need to convert it to float.
```python
data['bore'] = data['bore'].astype("float")
```
```python
data['bore'].replace(np.nan, data['bore'].mean(), inplace = True)
```
```python
data['bore'].isnull().sum()
```
0
You can also inspect several columns at once; look at the code below.
```python
data[["stroke", "horsepower","peak-rpm"]].describe()
```
| | stroke | horsepower | peak-rpm |
|---|---|---|---|
| count | 201 | 203 | 203 |
| unique | 36 | 59 | 23 |
| top | 3.40 | 68 | 5500 |
| freq | 20 | 19 | 37 |
Like bore, these columns are stored as strings (dtype object), so we first convert each one to float and then fill the missing values with the column mean. Note the parentheses on mean(): writing data["stroke"].mean without them would insert the method object itself instead of the mean value.
```python
for col in ["stroke", "horsepower", "peak-rpm"]:
    data[col] = data[col].astype("float")
    data[col].replace(np.nan, data[col].mean(), inplace=True)
```
```python
data["normalized-losses"] = data["normalized-losses"].astype("float")
```
```python
data["normalized-losses"].replace(np.nan, data["normalized-losses"].mean(), inplace=True)
```
# Price
We will delete the rows with missing values in the price column, because price is the target of this dataset: we cannot put estimated values in the output or target column, as this could distort the relationship between the inputs and the output.
```python
# simply drop whole row with NaN in "price" column
data.dropna(subset=["price"], axis=0, inplace=True)
# reset index, because we dropped four rows
data.reset_index(drop=True, inplace=True)
```
```python
data.isnull().sum()
```
symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0
dtype: int64
Now, as you can see, our dataset is free of all NaN and missing values.
# groupby() function
While analyzing our data, we may want to divide the dataset into categories and study the values of every column within each category.
For example, say I want to know the mean value of every column for each value of body-style.
So let's first look at the values of the body-style column.
```python
data["body-style"].describe()
```
count 201
unique 5
top sedan
freq 94
Name: body-style, dtype: object
```python
data["body-style"].value_counts()
```
sedan 94
hatchback 68
wagon 25
hardtop 8
convertible 6
Name: body-style, dtype: int64
Here you can observe that the body-style column has 5 categories, and I want to know the mean price of each category.
For that, I need to split my dataset into the corresponding categories and then find the mean.
```python
data.groupby('body-style').mean()
```
| body-style | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | compression-ratio | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| convertible | 2.833333 | 127.333333 | 92.700000 | 170.383333 | 65.583333 | 51.433333 | 2801.666667 | 157.166667 | 3.491667 | 8.933333 | 20.500000 | 26.000000 | 21890.500000 |
| hardtop | 1.875000 | 128.625000 | 98.500000 | 176.937500 | 66.612500 | 52.850000 | 2810.625000 | 176.250000 | 3.608750 | 10.725000 | 21.625000 | 27.250000 | 22208.500000 |
| hatchback | 1.617647 | 130.897059 | 95.435294 | 166.288235 | 65.104412 | 52.133824 | 2322.852941 | 112.852941 | 3.236015 | 9.042941 | 26.602941 | 32.382353 | 9957.441176 |
| sedan | 0.329787 | 120.893617 | 100.750000 | 178.046809 | 66.317021 | 54.387234 | 2625.893617 | 131.691489 | 3.345106 | 10.965957 | 25.053191 | 30.574468 | 14459.755319 |
| wagon | -0.160000 | 98.560000 | 102.156000 | 181.304000 | 66.256000 | 56.728000 | 2784.240000 | 123.840000 | 3.406400 | 10.316000 | 24.040000 | 28.720000 | 12371.960000 |
The groupby() function divided our dataset by the categories in the body-style column, and the mean() function calculated the mean of each column per body style.
Now, if I am interested only in the price, I can do:
```python
data.groupby('body-style').mean()['price']
```
body-style
convertible 21890.500000
hardtop 22208.500000
hatchback 9957.441176
sedan 14459.755319
wagon 12371.960000
Name: price, dtype: float64
When we write data.groupby('body-style').mean(), it returns a DataFrame, so we can access any column using ["column name"].
For further details on groupby() visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
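groupby() is not limited to a single statistic: agg() can compute several at once. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"body": ["sedan", "sedan", "wagon"],
                   "price": [100, 200, 300]})

# one row per category, one column per statistic
summary = df.groupby("body")["price"].agg(["mean", "min", "max"])
print(summary)
```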
# crosstab() function
While analyzing our dataset, we may often want to compare the values of two columns; say I want to see the relationship between fuel-type and engine-type.
I am interested in knowing which fuel type uses which engine.
For that I will use crosstab(), which takes two parameters: index and columns.
```python
data['engine-type'].value_counts()
```
ohc 145
ohcf 15
ohcv 13
l 12
dohc 12
rotor 4
Name: engine-type, dtype: int64
```python
pd.crosstab(data['fuel-type'],data['engine-type'])
```
| fuel-type | dohc | l | ohc | ohcf | ohcv | rotor |
|---|---|---|---|---|---|---|
| diesel | 0 | 5 | 15 | 0 | 0 | 0 |
| gas | 12 | 7 | 130 | 15 | 13 | 4 |
You can observe that the DataFrame we got has diesel and gas as index values and the engine types as columns.
It shows how many vehicles of each fuel type have each engine type.
For further details on crosstab() visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
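crosstab() can also report proportions instead of raw counts via the normalize parameter; a small sketch on hypothetical toy data:

```python
import pandas as pd

fuel = pd.Series(["gas", "gas", "diesel"], name="fuel")
engine = pd.Series(["ohc", "ohc", "l"], name="engine")

# normalize="index" turns each row into proportions that sum to 1
ct = pd.crosstab(fuel, engine, normalize="index")
print(ct)
```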
So now you know how to use the pandas library to analyze and tweak your data for further analysis.
In upcoming posts, we will discuss various machine learning algorithms and their uses.