Foundation of Data Science: Unit IV: Python Libraries for Data Wrangling

Aggregation and Grouping




 • Pandas aggregation methods are as follows:

a) count(): Total number of items

b) first(), last(): First and last item

c) mean(), median(): Mean and median

d) min(), max(): Minimum and maximum

e) std(), var(): Standard deviation and variance

f) mad(): Mean absolute deviation (deprecated in pandas 1.5 and removed in pandas 2.0)

g) prod(): Product of all items

h) sum(): Sum of all items.
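The methods listed above can be tried on a small Series. The following is a minimal sketch; the values in `durations` are made up for illustration and are not taken from the sample file. Because `first()` and `last()` are GroupBy aggregations, positional indexing stands in for them on a plain Series, and the mean absolute deviation is computed by hand as an assumed replacement for `mad()` on recent pandas versions.

```python
import pandas as pd

# Hypothetical call durations in seconds (illustrative values only).
durations = pd.Series([10.0, 20.0, 30.0, 40.0])

summary = {
    "count": durations.count(),    # number of non-null items
    "first": durations.iloc[0],    # first item (positional stand-in for first())
    "last": durations.iloc[-1],    # last item (positional stand-in for last())
    "mean": durations.mean(),
    "median": durations.median(),
    "min": durations.min(),
    "max": durations.max(),
    "std": durations.std(),        # sample standard deviation (ddof=1)
    "var": durations.var(),        # sample variance (ddof=1)
    "prod": durations.prod(),      # product of all items
    "sum": durations.sum(),
}

# Mean absolute deviation around the mean, computed manually
# because Series.mad() is not available in pandas 2.0 and later.
mad = (durations - durations.mean()).abs().mean()

print(summary)
print("mad:", mad)
```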

• Sample CSV file is as follows:

• The date column can be parsed using the extremely handy dateutil library.

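  import pandas as pd

  import dateutil.parser

  # Load data from the CSV file, using the first column as the index
  # (DataFrame.from_csv was removed in pandas 1.0; read_csv replaces it)

  data = pd.read_csv('phone_data.csv', index_col=0)

  # Convert the date column from strings to datetime objects

  data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)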
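As a possible alternative to applying the dateutil parser row by row, pandas offers `pd.to_datetime`, which also accepts `dayfirst=True`. The sketch below uses two hypothetical date strings in an assumed day/month/year layout; the real file's date format may differ.

```python
import pandas as pd

# Hypothetical day-first date strings (assumed layout: DD/MM/YYYY HH:MM).
raw = pd.Series(["15/10/2014 06:58", "01/11/2014 17:13"])

# dayfirst=True reads "01/11/2014" as 1 November, not 11 January.
parsed = pd.to_datetime(raw, dayfirst=True)

print(parsed)
```

The result is a datetime64 Series, so accessors such as `parsed.dt.month` become available.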

• Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, the mean, maximum, minimum, standard deviation and more are easy to calculate for each column:

# How many rows are in the dataset?

data['item'].count()

Out[38]: 830

# What was the longest phone call / data entry?

data['duration'].max()

Out[39]: 10528.0

# How many seconds of phone calls are recorded in total?

data['duration'][data['item'] == 'call'].sum()

Out[40]: 92321.0

# How many entries are there for each month?

data['month'].value_counts()

Out[41]:

2014-11 230

2015-01 205

2014-12 157

2015-02 137

2015-03 101

dtype: int64

# Number of non-null unique network entries

data['network'].nunique()

Out[42]: 9
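Since the sample file itself is not reproduced here, the same queries can be run on a small stand-in. This is a sketch only: the column names follow the text above, but every value in the frame is invented for illustration. It also uses `.loc` with a boolean mask, which is one common way to write the filtered sum.

```python
import pandas as pd

# Hypothetical miniature of the phone log; values are made up.
data = pd.DataFrame({
    "item": ["call", "data", "call", "sms", "call"],
    "duration": [120.0, 34.4, 60.0, 1.0, 300.0],
    "month": ["2014-11", "2014-11", "2014-12", "2014-12", "2014-12"],
    "network": ["Vodafone", "data", "Meteor", "Vodafone", None],
})

n_rows = data["item"].count()                 # non-null entries in 'item'
longest = data["duration"].max()              # largest duration value
# Select the 'duration' values of rows where item == 'call', then sum them.
call_seconds = data.loc[data["item"] == "call", "duration"].sum()
per_month = data["month"].value_counts()      # entries per month, largest first
n_networks = data["network"].nunique()        # distinct non-null networks

print(n_rows, longest, call_seconds, n_networks)
print(per_month)
```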

groupby() function:

• groupby() essentially splits the data into different groups depending on a variable of the user's choice.

• The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split. The GroupBy object's groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group.
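The groups dictionary can be inspected directly. The following is a minimal sketch on a hypothetical three-row frame; the `month` and `duration` column names mirror the text, and the values are invented.

```python
import pandas as pd

# Hypothetical three-row frame with the default integer index 0, 1, 2.
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [120.0, 34.4, 60.0],
})

grouped = data.groupby("month")

# .groups maps each group key to the index labels of the rows in that group.
# The labels are converted to plain lists here only to make them easy to print.
groups = {key: list(labels) for key, labels in grouped.groups.items()}

print(groups)
```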

• Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to obtain summary statistics for each group.

• The GroupBy object supports column indexing in the same way as the DataFrame and returns a modified GroupBy object.
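The last two points can be sketched together: an aggregation applied to the whole GroupBy object, and an aggregation applied after selecting a single column. The frame below is hypothetical and only borrows the column names used in this section.

```python
import pandas as pd

# Hypothetical phone-log rows; values are made up for illustration.
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "item": ["call", "data", "call", "sms"],
    "duration": [120.0, 34.4, 60.0, 1.0],
})

grouped = data.groupby("month")

# Aggregating the whole GroupBy object: first non-null value of each column per month.
first_rows = grouped.first()

# Column indexing yields a GroupBy over one column;
# aggregating it gives a Series indexed by month.
total_duration = grouped["duration"].sum()
longest = grouped["duration"].max()

print(first_rows)
print(total_duration)
print(longest)
```

Selecting the column before aggregating keeps the computation limited to the values of interest, which matters on wide frames.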
