Data Analysis Basics With Pandas For Beginners


This post walks you through the basics of how to use the pandas library for Python.

With the help of the pandas library, you can easily explore and manipulate your dataset without any complex coding.

So let's start.

DATA STRUCTURES WE WILL USE

Python and pandas together provide the following data structures:
  • List (Python)
  • Dictionary (Python)
  • Series (pandas)
  • DataFrame (pandas)
Whenever we import a dataset using the pandas library, pandas treats it as a DataFrame.
So, what is a DataFrame? Well, a DataFrame is a combination of lists and dictionaries.

I will introduce you to how we can use pandas for data exploration. Look at the following code.

# Introduction

To import pandas, use the following command:

```python
import pandas as pd
```

## Dictionary

A dictionary is a data structure that stores data as key: value pairs. If you are familiar with arrays, you know they store values at index positions, and to retrieve any data from an array you access it by its index number. In a dictionary, we use a key instead of an index to access an element, and you can choose any key you want.

```python
d = {"Roll": 27, "Name": "Srujan", "branch": "Information Technology", "Age": 20}
```

Now, if I want to access the age, I just do:

```python
d["Age"]
```

    20

## DataFrame

A DataFrame is a table-like data structure with rows and columns; you can create a DataFrame using the pandas DataFrame() function. A DataFrame is a combination of a dictionary and a list. For example, if you want to construct a column, say Roll, you create a key "Roll" and pass it a list of values.

```python
data = {"Roll": [1, 2, 3, 4, 5, 6],
        "Name": ["Srujan", "Rohan", "Ayush", "Vaibhav", "Priya", "Akansha"],
        "Branch": ["IT", "CS", "Mech", "IT", "CS", "ET"],
        "Age": [19, 20, 20, 21, 22, 23]}
```

```python
data
```

    {'Roll': [1, 2, 3, 4, 5, 6], 'Name': ['Srujan', 'Rohan', 'Ayush', 'Vaibhav', 'Priya', 'Akansha'], 'Branch': ['IT', 'CS', 'Mech', 'IT', 'CS', 'ET'], 'Age': [19, 20, 20, 21, 22, 23]}

To convert this dictionary to a DataFrame, we use the DataFrame() function from the pandas library.

```python
dataframe = pd.DataFrame(data)
```

```python
dataframe
```
       Roll     Name Branch  Age
    0     1   Srujan     IT   19
    1     2    Rohan     CS   20
    2     3    Ayush   Mech   20
    3     4  Vaibhav     IT   21
    4     5    Priya     CS   22
    5     6  Akansha     ET   23
# How To Read a Data-Set

Based on your dataset's file extension, pandas provides various methods to read data. In our example, the dataset file is a CSV file, so we will use the read_csv() method to read our data.
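read_csv() is just one of a family of readers, one per file format (read_excel(), read_json(), read_sql(), and so on). Here is a minimal sketch of read_csv(), using an in-memory string instead of a real file so it runs anywhere:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a file on disk
csv_text = "a,b\n1,2\n3,4\n"
df = pd.read_csv(io.StringIO(csv_text))

# For other formats, pandas offers e.g. pd.read_excel("file.xlsx"),
# pd.read_json("file.json"), pd.read_sql(query, connection)
print(df.shape)  # (2, 2)
```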

You can download the dataset from the below link-->

https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv


```python
data = pd.read_csv(r"auto.csv")
```

Now, in order to know how many rows and columns (i.e., how many instances and features) your dataset has, we type


```python
data.shape
```




    (204, 26)



This means that our dataset contains 204 rows and 26 columns.
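Since shape is a plain Python tuple, you can unpack it directly into row and column counts; a tiny sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# .shape is a (rows, columns) tuple, so it unpacks like any tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2
```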

Let's take a quick look at our dataset; for this, we will use the head() function. The head() function takes as input the number of rows you want to display.


```python
data.head(3)
```
        3    ?  alfa-romero  gas  std  two   convertible  rwd  front  88.6  ...  130  mpfi  3.47  2.68   9.0  111  5000  21  27  13495
    0   3    ?  alfa-romero  gas  std  two   convertible  rwd  front  88.6  ...  130  mpfi  3.47  2.68   9.0  111  5000  21  27  16500
    1   1    ?  alfa-romero  gas  std  two   hatchback    rwd  front  94.5  ...  152  mpfi  2.68  3.47   9.0  154  5000  19  26  16500
    2   2  164  audi         gas  std  four  sedan        fwd  front  99.8  ...  109  mpfi  3.19  3.40  10.0  102  5500  24  30  13950

    3 rows × 26 columns
Can you see any headings in this dataset? No. This happens because the dataset file does not contain any headings; it just has the values.

Don't worry, we can add the heading by ourselves.


```python
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
```


```python
data = pd.read_csv(r"auto.csv", names = headers) 
```


```python
data.head(2)
```
       symboling normalized-losses         make fuel-type aspiration num-of-doors  body-style drive-wheels engine-location  wheel-base  ...  engine-size fuel-system  bore stroke compression-ratio horsepower peak-rpm  city-mpg  highway-mpg  price
    0          3                 ?  alfa-romero       gas        std          two convertible          rwd           front        88.6  ...          130        mpfi  3.47   2.68               9.0        111     5000        21           27  13495
    1          3                 ?  alfa-romero       gas        std          two convertible          rwd           front        88.6  ...          130        mpfi  3.47   2.68               9.0        111     5000        21           27  16500

    2 rows × 26 columns
This dataset is from an automobile company. Let's explore it. We can use the info() method to get the details of each column of the dataset.


```python
data.info()
```

    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 205 entries, 0 to 204
    Data columns (total 26 columns):
    symboling            205 non-null int64
    normalized-losses    205 non-null object
    make                 205 non-null object
    fuel-type            205 non-null object
    aspiration           205 non-null object
    num-of-doors         205 non-null object
    body-style           205 non-null object
    drive-wheels         205 non-null object
    engine-location      205 non-null object
    wheel-base           205 non-null float64
    length               205 non-null float64
    width                205 non-null float64
    height               205 non-null float64
    curb-weight          205 non-null int64
    engine-type          205 non-null object
    num-of-cylinders     205 non-null object
    engine-size          205 non-null int64
    fuel-system          205 non-null object
    bore                 205 non-null object
    stroke               205 non-null object
    compression-ratio    205 non-null float64
    horsepower           205 non-null object
    peak-rpm             205 non-null object
    city-mpg             205 non-null int64
    highway-mpg          205 non-null int64
    price                205 non-null object
    dtypes: float64(5), int64(5), object(16)
    memory usage: 41.8+ KB
    

The info() method returns the number of values and the datatype of the values each column has.
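If you only need the column-to-dtype mapping, the dtypes attribute gives a lighter view than info(); sketched here on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [1, 2]})

# .dtypes maps each column name to its dtype, without the
# null counts and memory usage that info() also prints
print(df.dtypes)
```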

The first thing is to understand your dataset. Only when you understand the dataset well can you process it for further analysis.

Let's understand some columns

## fuel-type

From data.info() we can observe that fuel-type has 205 values, all non-null and of the object type.

From data.head(2) you can observe that it shows gas as the value, but do all 205 cells contain the same value?

We can check this using the value_counts() function.


```python
data['fuel-type'].value_counts()
```




    gas       185
    diesel     20
    Name: fuel-type, dtype: int64



So, here we can see that fuel-type contains only 2 values, gas and diesel, where gas occurs 185 times and diesel occurs 20 times.

Let's explore more about that column


```python
data['fuel-type'].describe()
```




    count     205
    unique      2
    top       gas
    freq      185
    Name: fuel-type, dtype: object



Here, count is the total number of values the column has.

unique shows that there are only 2 unique values, i.e., gas and diesel.

top tells us that the most common value is "gas".

freq tells us that "gas" appears 185 times in total.

## value_counts()

Remember that the value_counts() function is most useful where a column has categorical or repeating values.
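value_counts() can also report proportions instead of raw counts via normalize=True; a small sketch on a made-up Series mirroring the fuel-type counts above:

```python
import pandas as pd

s = pd.Series(["gas"] * 185 + ["diesel"] * 20)

counts = s.value_counts()                 # raw counts per value
shares = s.value_counts(normalize=True)   # proportions instead of counts
print(counts["gas"], round(shares["gas"], 3))  # 185 0.902
```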

### Let's look at the column normalized-losses


```python
data["normalized-losses"].describe()
```




    count     205
    unique     52
    top         ?
    freq       41
    Name: normalized-losses, dtype: object




```python
data["normalized-losses"].head(5)
```




    0      ?
    1      ?
    2      ?
    3    164
    4    164
    Name: normalized-losses, dtype: object



Observe: when I used the describe() function, it showed that the most frequently occurring value is "?".

# Dealing with missing values

In a dataset, missing values are represented by ?, NaN, or simply a blank cell.

To deal with missing values we can:
    1. Remove the rows that contain the missing values.
    2. Fill the missing values with the mean value of that column.
    
Removing rows is preferable when only a few values are missing.

Otherwise, we fill the missing values with the mean.
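Both strategies can be sketched on a tiny made-up DataFrame using dropna() and fillna():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Strategy 1: remove every row that contains any NaN
dropped = df.dropna()

# Strategy 2: fill each NaN with its column's mean
filled = df.fillna(df.mean(numeric_only=True))

print(len(dropped))          # 1  (only the first row has no NaN)
print(filled["a"].iloc[1])   # 2.0 (mean of 1.0 and 3.0)
```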

Let us find out which columns have the missing values and by how much

# The isnull() Function

The isnull() function returns a DataFrame of True and False values, where True means a missing value and False means not a missing value.


```python
data.isnull()
```
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 False False False False False False False False False False ... False False False False False False False False False False
201 False False False False False False False False False False ... False False False False False False False False False False
202 False False False False False False False False False False ... False False False False False False False False False False
203 False False False False False False False False False False ... False False False False False False False False False False
204 False False False False False False False False False False ... False False False False False False False False False False
205 rows × 26 columns
Here you can observe that all the values are False, which means that there are no missing values in the dataset.

But wait: while observing the normalized-losses column earlier, we saw some "?" values.

This happens because pandas only considers NaN values as missing or null. So, to correct our dataset, we first have to replace all the "?" values with NaN.

```python
import numpy as np
```

# The replace() function

for more details visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html


```python
data.replace("?", np.nan, inplace = True)
```

Here, inplace=True means the changes are saved back to the same DataFrame instead of being returned as a new one.
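For contrast, here is a minimal sketch (with a made-up column) of what happens without inplace=True: replace() returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["?", "1", "2"]})

# Without inplace=True, replace() returns a modified copy;
# df itself still contains the "?" placeholder
cleaned = df.replace("?", np.nan)
print(df["col"].iloc[0])              # ?
print(cleaned["col"].isnull().sum())  # 1
```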

Now, let us use the isnull() function again


```python
data.isnull().head(5)
```

symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False True False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
5 rows × 26 columns
Observe the normalized-losses column, they have some True values.

Now, let's get a short, easy-to-read summary of which columns have missing values. We will use the sum() function together with isnull(), which gives us the count of null values per column.


```python
data.isnull().sum()
```




    symboling             0
    normalized-losses    41
    make                  0
    fuel-type             0
    aspiration            0
    num-of-doors          2
    body-style            0
    drive-wheels          0
    engine-location       0
    wheel-base            0
    length                0
    width                 0
    height                0
    curb-weight           0
    engine-type           0
    num-of-cylinders      0
    engine-size           0
    fuel-system           0
    bore                  4
    stroke                4
    compression-ratio     0
    horsepower            2
    peak-rpm              2
    city-mpg              0
    highway-mpg           0
    price                 4
    dtype: int64



So, here we can see that the columns:
    
    1. normalized-losses has 41 null values
    2. num-of-doors has 2 null values
    3. bore has 4
    4. stroke has 4
    5. horsepower has 2
    6. peak-rpm has 2
    7. price has 4
    

Let's take care of these columns. For ease, let's begin with the columns that have fewer missing values.

## num-of-doors


```python
data['num-of-doors'].describe()
```




    count      203
    unique       2
    top       four
    freq       114
    Name: num-of-doors, dtype: object




```python
data['num-of-doors'].value_counts()
```




    four    114
    two      89
    Name: num-of-doors, dtype: int64



Here you can observe that most of the values are "four", so we will use the replace function to replace the NaN values with "four".


```python
data['num-of-doors'].replace(np.nan, "four", inplace=True)
```

Let's check if still any null values are there or not


```python
data['num-of-doors'].isnull().sum()
```




    0



## bore


```python
data['bore'].describe()
```




    count      201
    unique      38
    top       3.62
    freq        23
    Name: bore, dtype: object





Here, bore is not a categorical column; it contains numbers. So we will find the mean and replace the NaN values with the mean value. But observe that the dtype of the column is object, so first we need to convert it to float.


```python
data['bore'] =  data['bore'].astype("float")
```


```python
data['bore'].replace(np.nan, data['bore'].mean(), inplace = True)
```


```python
data['bore'].isnull().sum()
```




    0



We can repeat the same steps for stroke, horsepower, and peak-rpm. First, let's look at those columns:


```python
data[["stroke", "horsepower","peak-rpm"]].describe()
```
           stroke horsepower peak-rpm
    count     201        203      203
    unique     36         59       23
    top      3.40         68     5500
    freq       20         19       37
```python
data["stroke"] = data["stroke"].astype("float")
data["stroke"].replace(np.nan, data["stroke"].mean(), inplace=True)
```


```python
data["horsepower"] = data["horsepower"].astype("float")
data["horsepower"].replace(np.nan, data["horsepower"].mean(), inplace=True)
```


```python
data["peak-rpm"] = data["peak-rpm"].astype("float")
data["peak-rpm"].replace(np.nan, data["peak-rpm"].mean(), inplace=True)
```


```python
data["normalized-losses"] = data["normalized-losses"].astype("float")
```


```python
data["normalized-losses"].replace(np.nan, data["normalized-losses"].mean(), inplace=True)
```

# Price

We will delete the rows with missing values in the price column, because price is the target (output) column of the dataset, and we cannot put estimated values into the target column, as this could distort the relationship between the inputs and the output.


```python
# simply drop whole row with NaN in "price" column
data.dropna(subset=["price"], axis=0, inplace=True)

# reset index, because we dropped four rows
data.reset_index(drop=True, inplace=True)
```


```python
data.isnull().sum()
```




    symboling            0
    normalized-losses    0
    make                 0
    fuel-type            0
    aspiration           0
    num-of-doors         0
    body-style           0
    drive-wheels         0
    engine-location      0
    wheel-base           0
    length               0
    width                0
    height               0
    curb-weight          0
    engine-type          0
    num-of-cylinders     0
    engine-size          0
    fuel-system          0
    bore                 0
    stroke               0
    compression-ratio    0
    horsepower           0
    peak-rpm             0
    city-mpg             0
    highway-mpg          0
    price                0
    dtype: int64



Now, as you can see, our dataset is free of all NaN (missing) values.
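A quick way to verify this with a single number is to sum the per-column null counts; sketched here on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# isnull().sum() counts nulls per column; the second .sum()
# adds those counts up, so 0 means the whole frame is clean
total_missing = df.isnull().sum().sum()
print(total_missing)  # 0
```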

# groupby() function

While analyzing our data, we may want to divide our dataset into categories and study the values of every column within each category.

For example, say I want to know the mean values of every column for each value of body-style.

So let's first look at the values of the body-style column.


```python
data["body-style"].describe()
```




    count       201
    unique        5
    top       sedan
    freq         94
    Name: body-style, dtype: object




```python
data["body-style"].value_counts()
```




    sedan          94
    hatchback      68
    wagon          25
    hardtop         8
    convertible     6
    Name: body-style, dtype: int64



Here you can observe that the body-style column has 5 categories, and I want to know the mean price of each category.

For that, I need to split my dataset into the corresponding categories and then find the mean.


```python
data.groupby('body-style').mean(numeric_only=True)  # numeric_only avoids errors on text columns in newer pandas
```
                 symboling  normalized-losses  wheel-base      length      width     height  curb-weight  engine-size      bore  compression-ratio   city-mpg  highway-mpg         price
    body-style
    convertible   2.833333         127.333333   92.700000  170.383333  65.583333  51.433333  2801.666667   157.166667  3.491667           8.933333  20.500000    26.000000  21890.500000
    hardtop       1.875000         128.625000   98.500000  176.937500  66.612500  52.850000  2810.625000   176.250000  3.608750          10.725000  21.625000    27.250000  22208.500000
    hatchback     1.617647         130.897059   95.435294  166.288235  65.104412  52.133824  2322.852941   112.852941  3.236015           9.042941  26.602941    32.382353   9957.441176
    sedan         0.329787         120.893617  100.750000  178.046809  66.317021  54.387234  2625.893617   131.691489  3.345106          10.965957  25.053191    30.574468  14459.755319
    wagon        -0.160000          98.560000  102.156000  181.304000  66.256000  56.728000  2784.240000   123.840000  3.406400          10.316000  24.040000    28.720000  12371.960000
The groupby() function divides our dataset by the categories available in the body-style column, and the mean() function calculates the mean of each column for each category.

Now, since I am interested only in the price, I can do:


```python
data.groupby('body-style').mean(numeric_only=True)['price']
```




    body-style
    convertible    21890.500000
    hardtop        22208.500000
    hatchback       9957.441176
    sedan          14459.755319
    wagon          12371.960000
    Name: price, dtype: float64



When we write data.groupby('body-style').mean(), it returns a DataFrame, so we can access any of its columns using ["column name"].
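A common alternative (shown here on a small made-up frame) is to select the column before aggregating, so pandas only computes the one mean you actually need:

```python
import pandas as pd

df = pd.DataFrame({
    "body-style": ["sedan", "sedan", "wagon"],
    "price": [10000.0, 20000.0, 12000.0],
})

# Selecting "price" first means mean() runs on a single column
# instead of averaging every numeric column and discarding the rest
mean_price = df.groupby("body-style")["price"].mean()
print(mean_price["sedan"])  # 15000.0
```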

for further details on groupby() visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
# crosstab() function

While analyzing our dataset, we may often want to compare the values of 2 columns; say I want to see the relationship between fuel-type and engine-type.

I am interested in knowing which fuel type uses which engine type.

For that I will use crosstab(), which takes two parameters: index and columns.


```python
data['engine-type'].value_counts()
```




    ohc      145
    ohcf      15
    ohcv      13
    l         12
    dohc      12
    rotor      4
    Name: engine-type, dtype: int64




```python
pd.crosstab(data['fuel-type'],data['engine-type'])
```
    engine-type  dohc  l  ohc  ohcf  ohcv  rotor
    fuel-type
    diesel          0  5   15     0     0      0
    gas            12  7  130    15    13      4
You can observe that the DataFrame we got here has diesel and gas as its index values and the engine types as its columns.

It shows how many vehicles of each fuel type have each engine type.
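crosstab() can also normalize the counts; for example, normalize="index" turns each row into proportions. A sketch on a few made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "engine-type": ["ohc", "ohc", "l", "dohc"],
})

# normalize="index" divides each cell by its row total,
# giving the share of each engine type within each fuel type
table = pd.crosstab(df["fuel-type"], df["engine-type"], normalize="index")
print(round(table.loc["gas", "ohc"], 3))  # 0.667
```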

for further knowledge on crosstab() visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

So, you now have a solid grasp of how to use the pandas library to analyze and tweak your data for further analysis.

In further posts, we will discuss various machine learning algorithms and their uses.