Foundation of Data Science: Unit IV: Python Libraries for Data Wrangling

Hierarchical indexing

Hierarchical indexing is a method of creating structured group relationships in data.

Hierarchical Indexing

• A MultiIndex, or hierarchical index, is useful when our data has more than two dimensions. As we already know, a Series is a one-dimensional labelled array and a DataFrame is a two-dimensional table whose columns are Series. Some more sophisticated analysis and manipulation, however, involves data with more dimensions than a two-dimensional table can label directly.

• A MultiIndex adds at least one more dimension to the data by placing two or more index levels on the same axis. As the name suggests, a hierarchical index orders these levels by rank, from the outermost level to the innermost.

• To illustrate, we create a DataFrame with the ratings of a few players from the FIFA 19 dataset.

In [1]: import pandas as pd

In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
                             'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],
                'Name': ['De Gea', 'Courtois', 'Allison', 'Van Dijk',
                         'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne',
                         'Ronaldo', 'Messi', 'Neymar'],
                'Overall': ['91', '88', '89', '89', '91', '90',
                            '91', '90', '92', '94', '93', '92'],
                'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd',
                         '2nd', '3rd', '1st', '1st', '2nd', '3rd']}

In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank']) 

In [4]: fifa19

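Out[4]:
    Position       Name  Overall  Rank
0         GK     De Gea       91   1st
1         GK   Courtois       88   3rd
2         GK    Allison       89   2nd
3         DF   Van Dijk       89   3rd
4         DF      Ramos       91   1st
5         DF      Godin       90   2nd
6         MF     Hazard       91   2nd
7         MF      Kante       90   3rd
8         MF  De Bruyne       92   1st
9         CF    Ronaldo       94   1st
10        CF      Messi       93   2nd
11        CF     Neymar       92   3rd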

• In the DataFrame above, the index is the default pandas integer index, and the 'Position' and 'Rank' columns both contain repeated values. This can make the data harder to analyse. What we would like is a meaningful index that uniquely identifies each row and makes it easier to get a sense of the data we are working with. This is where a MultiIndex, or hierarchical indexing, comes in.

• We do this with the set_index() method. For hierarchical indexing, we pass set_index() a list of column labels that together identify each row uniquely.

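In [5]: fifa19.set_index(['Position', 'Rank'], drop=False)

Out[5]:
               Position       Name  Overall  Rank
Position Rank
GK       1st         GK     De Gea       91   1st
         3rd         GK   Courtois       88   3rd
         2nd         GK    Allison       89   2nd
DF       3rd         DF   Van Dijk       89   3rd
         1st         DF      Ramos       91   1st
         2nd         DF      Godin       90   2nd
MF       2nd         MF     Hazard       91   2nd
         3rd         MF      Kante       90   3rd
         1st         MF  De Bruyne       92   1st
CF       1st         CF    Ronaldo       94   1st
         2nd         CF      Messi       93   2nd
         3rd         CF     Neymar       92   3rd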

• We can see above that 'Position' and 'Rank' now form the index, but the same two columns also still appear in the table. This is because we passed drop=False, which keeps the columns where they are. The default, however, is drop=True, so without drop=False the two columns are moved into the index and removed from the table's columns.

In [7]: fifa19.set_index(['Position', 'Rank'])

Out[7]:
                    Name  Overall
Position Rank
GK       1st      De Gea       91
         3rd    Courtois       88
         2nd     Allison       89
DF       3rd    Van Dijk       89
         1st       Ramos       91
         2nd       Godin       90
MF       2nd      Hazard       91
         3rd       Kante       90
         1st   De Bruyne       92
CF       1st     Ronaldo       94
         2nd       Messi       93
         3rd      Neymar       92
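
The effect of the drop parameter can be checked directly. The following is a minimal sketch that assumes a cut-down, three-row version of the table; the names small, kept and dropped are illustrative and are not part of the original example.

```python
import pandas as pd

# Cut-down, illustrative version of the ratings table (three rows only).
small = pd.DataFrame({
    'Position': ['GK', 'GK', 'DF'],
    'Rank': ['1st', '2nd', '1st'],
    'Name': ['De Gea', 'Allison', 'Ramos'],
})

# drop=False: the key columns are copied into the index and also kept as columns.
kept = small.set_index(['Position', 'Rank'], drop=False)

# drop=True (the default): the key columns move into the index.
dropped = small.set_index(['Position', 'Rank'])

print(list(kept.columns))     # ['Position', 'Rank', 'Name']
print(list(dropped.columns))  # ['Name']

# set_index() returns a new DataFrame, so `small` keeps its default integer
# index unless the result is assigned back to it.
print(list(small.index))      # [0, 1, 2]
```

Because set_index() returns a new DataFrame, the result has to be assigned to a name if the hierarchical index should persist.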

• We pass set_index() an ordered list of column labels to build the new index levels. To verify that the DataFrame now has a hierarchical index, we assign the result back to fifa19 and inspect the .index attribute.

In [8]: fifa19 = fifa19.set_index(['Position', 'Rank'])

In [9]: fifa19.index

Out[9]: MultiIndex(levels=[['CF', 'DF', 'GK', 'MF'], ['1st', '2nd', '3rd']],
           codes=[[2, 2, 2, 1, 1, 1, 3, 3, 3, 0, 0, 0], [0, 2, 1, 2, 0, 1, 1, 2, 0, 0, 1, 2]],
           names=['Position', 'Rank'])
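
The levels and codes shown in Out[9] can also be read as attributes of the index. Recent pandas versions print a MultiIndex as a list of tuples instead of levels and codes, but the attributes remain available. The sketch below assumes pandas 0.24 or later (where the attribute is called codes) and an illustrative three-row frame named small, which is not part of the original example.

```python
import pandas as pd

# Illustrative three-row frame, indexed the same way as fifa19 above.
small = pd.DataFrame({
    'Position': ['GK', 'DF', 'GK'],
    'Rank': ['1st', '1st', '2nd'],
    'Name': ['De Gea', 'Ramos', 'Allison'],
}).set_index(['Position', 'Rank'])

idx = small.index

print(idx.nlevels)            # 2
print(list(idx.names))        # ['Position', 'Rank']

# Each level stores the sorted unique labels of one index column ...
print(list(idx.levels[0]))    # ['DF', 'GK']
# ... and the codes are positions into that level, one per row.
print(idx.codes[0].tolist())  # [1, 0, 1]

# True when every (Position, Rank) pair occurs only once.
print(idx.is_unique)          # True
```

Reading the codes against the levels reproduces the original rows: code 1 in the first level is 'GK' and code 0 is 'DF'.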
