Foundation of Data Science: Unit IV: Python Libraries for Data Wrangling

Combining Datasets

Python Libraries for Data Wrangling

Whether you are concatenating several datasets from different CSV files or merging sets of aggregated data from different Google Analytics accounts, combining data from various sources is critical to drawing the right conclusions and extracting full value from data analytics.


• When using pandas, data scientists often have to concatenate multiple DataFrames, either vertically (adding rows) or horizontally (adding columns).

DataFrame.append

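• This method appends the rows of another DataFrame to an existing one and returns a new DataFrame. Columns with matching names are concatenated together, while columns that appear in only one of the DataFrames are filled with NA. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, use pandas.concat (described below), which produces the same result.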

>>> df1
   ints  bools
0     0   True
1     1  False
2     2   True

>>> df2
   ints  floats
0     3     1.5
1     4     2.5
2     5     3.5

>>> df1.append(df2)
   ints  bools  floats
0     0   True     NaN
1     1  False     NaN
2     2   True     NaN
0     3    NaN     1.5
1     4    NaN     2.5
2     5    NaN     3.5
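The transcript above does not show how df1 and df2 were built. The following is a minimal sketch that constructs them from the values shown (the constructor calls are an assumption, not taken from the original) and stacks them with pd.concat, which also runs on pandas versions where DataFrame.append is no longer available. The name combined is chosen here for illustration.

```python
import pandas as pd

# Assumed construction of the two frames shown in the transcript above.
df1 = pd.DataFrame({"ints": [0, 1, 2], "bools": [True, False, True]})
df2 = pd.DataFrame({"ints": [3, 4, 5], "floats": [1.5, 2.5, 3.5]})

# Stack the rows of df2 under df1. The shared "ints" column is joined;
# "bools" and "floats" exist in only one frame, so the gaps become NaN.
combined = pd.concat([df1, df2])

print(combined)
```

The original row labels 0, 1, 2 are kept from each input, so they appear twice in the result.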

• In addition, DataFrame.append offers further options: resetting the resulting index (ignore_index=True), sorting the resulting columns (sort=True), or raising an error when the resulting index contains duplicate labels (verify_integrity=True).
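These three options can be tried out with pd.concat, which accepts the same ignore_index, sort and verify_integrity keywords. The sketch below assumes the two frames from the earlier transcript; the variable names renumbered, sorted_cols and duplicate_error are illustrative only.

```python
import pandas as pd

# Assumed construction of the frames used in the examples above.
df1 = pd.DataFrame({"ints": [0, 1, 2], "bools": [True, False, True]})
df2 = pd.DataFrame({"ints": [3, 4, 5], "floats": [1.5, 2.5, 3.5]})

# ignore_index=True discards the original row labels and renumbers from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

# sort=True orders the combined columns alphabetically when they differ.
sorted_cols = pd.concat([df1, df2], sort=True)

# verify_integrity=True raises ValueError here, because the row labels
# 0, 1 and 2 occur in both inputs.
try:
    pd.concat([df1, df2], verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(list(renumbered.index))
print(list(sorted_cols.columns))
print(duplicate_error)
```

Resetting the index is usually the simplest choice when the original row labels carry no meaning, since it avoids duplicate labels in the result.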

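pandas.concat

• We can concatenate DataFrames both vertically (axis=0) and horizontally (axis=1) with the pandas.concat function, usually called as pd.concat. Unlike DataFrame.append, pandas.concat is not a DataFrame method but a top-level function that takes a list of objects as input. As with DataFrame.append, columns with different labels are filled with NA values. In the example below, the ints column is displayed as floating point because df3 has no ints column and NaN cannot be stored in an integer column.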

>>> df3
   bools  floats
0  False     4.5
1   True     5.5
2  False     6.5

>>> pd.concat([df1, df2, df3])
   ints  bools  floats
0   0.0   True     NaN
1   1.0  False     NaN
2   2.0   True     NaN
0   3.0    NaN     1.5
1   4.0    NaN     2.5
2   5.0    NaN     3.5
0   NaN  False     4.5
1   NaN   True     5.5
2   NaN  False     6.5
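The example above covers only the vertical case. As a minimal sketch of the horizontal case (axis=1), the code below places the floats column of df3 next to df1, aligning rows on the index. It also shows join="inner", a pd.concat option not discussed in the text above, which keeps only the columns shared by all inputs instead of filling gaps with NA. The frame construction and the names wide and shared_only are assumptions made for illustration.

```python
import pandas as pd

# Assumed construction of the frames shown in the transcripts above.
df1 = pd.DataFrame({"ints": [0, 1, 2], "bools": [True, False, True]})
df2 = pd.DataFrame({"ints": [3, 4, 5], "floats": [1.5, 2.5, 3.5]})
df3 = pd.DataFrame({"bools": [False, True, False], "floats": [4.5, 5.5, 6.5]})

# axis=1 adds columns: rows are matched by index label (0, 1, 2).
# Only the "floats" column of df3 is used to avoid a second "bools" column.
wide = pd.concat([df1, df3[["floats"]]], axis=1)

# join="inner" keeps only the columns present in every input ("ints" here),
# so no NaN padding is needed.
shared_only = pd.concat([df1, df2], join="inner")

print(wide)
print(shared_only)
```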
