Foundation of Data Science: Unit IV: Python Libraries for Data Wrangling


UNIT IV: Python Libraries for Data Wrangling

Syllabus

Basics of NumPy arrays - aggregations - computations on arrays - comparisons, masks, boolean logic - fancy indexing - structured arrays - Data manipulation with Pandas - data indexing and selection - operating on data - missing data - hierarchical indexing - combining datasets - aggregation and grouping - pivot tables.

Data Wrangling

• Data wrangling is the process of transforming data from its original "raw" form into a more digestible format and organizing data sets from various sources into a single coherent whole for further processing.

• Data wrangling is also called data munging.

• The primary purpose of data wrangling is to get data into a coherent shape: in other words, to make raw data usable so that it provides a sound basis for further processing.

• Data wrangling covers the following processes:

1. Gathering data from various sources into one place (see the sketch after this list).

2. Piecing the data together according to the determined structure.

3. Cleaning the data of noise and of erroneous or missing elements.
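
For example, a minimal pandas sketch of the first two processes, gathering extracts into one place and piecing them together, might look like the following. The file names (sales_jan.csv, sales_feb.csv, customers.csv) and the key column customer_id are hypothetical:

import pandas as pd

# Hypothetical monthly extracts produced by two different sources
jan = pd.read_csv("sales_jan.csv")
feb = pd.read_csv("sales_feb.csv")

# Gather: stack the extracts into a single table
sales = pd.concat([jan, feb], ignore_index=True)

# Piece together: attach customer attributes from another source via a shared key
customers = pd.read_csv("customers.csv")
combined = sales.merge(customers, on="customer_id", how="left")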

• Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.

• There are typically six iterative steps that make up the data wrangling process (illustrative pandas sketches for these steps follow the list):

1. Discovering: Before you can dive deeply, you must better understand what is in your data, which will inform how you want to analyze it. How you wrangle customer data, for example, may be informed by where they are located, what they bought, or what promotions they received.

2. Structuring: This means organizing the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis; one column may become two. Data is reshaped in this way to make computation and analysis easier.

3. Cleaning: What happens when errors and outliers skew your data? You clean the data. What happens when state data is entered as AP or Andhra Pradesh or Arunachal Pradesh? You clean the data. Null values are replaced and standard formatting is implemented, ultimately increasing data quality.

4. Enriching: Here you take stock of your data and strategize about how other, additional data might augment it. Questions asked during this step might be: What new data can I derive from what I already have? What other information would better inform decision making about the current data?

5. Validating: Validation rules are repeatable programming sequences that verify data consistency, quality and security. Examples of validation include checking that an attribute follows its expected distribution (e.g. birth dates) or confirming the accuracy of fields through cross-checks against other data.

6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular user or a piece of software, and document any steps taken or logic used to wrangle the data. Experienced data wranglers understand that acting on insights depends on how easily others can access and use the published data.
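
As an illustration of steps 1 and 2, the sketch below first inspects a small dataset and then restructures it. The DataFrame and its column names (name, city_state, jan, feb) are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "city_state": ["Chennai, TN", "Mumbai, MH"],
    "jan": [120, 90],
    "feb": [150, 110],
})

# Discovering: get a first feel for what is in the data
print(df.head())      # sample rows
print(df.dtypes)      # column types
print(df.describe())  # summary statistics for numeric columns

# Structuring: one column becomes two
df[["city", "state"]] = df["city_state"].str.split(", ", expand=True)

# Structuring: a single wide row turns into several long rows
long_df = df.melt(id_vars=["name", "city", "state"],
                  value_vars=["jan", "feb"],
                  var_name="month", value_name="sales")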
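
Steps 3 and 4 might be sketched in pandas as follows; the inconsistent state entries and the birth_date column are hypothetical, and the mapping of AP to Andhra Pradesh is just one possible business rule:

import pandas as pd

df = pd.DataFrame({
    "state": ["AP", "Andhra Pradesh", "Arunachal Pradesh", None],
    "birth_date": ["1990-05-01", "1985-11-23", None, "2000-02-29"],
})

# Cleaning: standardize inconsistent entries (assumed rule: AP means Andhra Pradesh)
df["state"] = df["state"].replace({"AP": "Andhra Pradesh"})

# Cleaning: change null values and enforce a standard format
df["state"] = df["state"].fillna("Unknown")
df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")

# Enriching: derive a new column from data already present
df["birth_year"] = df["birth_date"].dt.year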
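
Steps 5 and 6 can be reduced to repeatable checks followed by an export; the rules and the output file name below are illustrative only:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "birth_date": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-29"]),
    "sales": [120, 90, 150],
})

# Validating: repeatable rules that verify consistency and quality
assert df["customer_id"].is_unique, "duplicate customer ids"
assert df["birth_date"].notna().all(), "missing birth dates"
assert (df["sales"] >= 0).all(), "negative sales values"

# Publishing: hand the wrangled table downstream in a documented format
df.to_csv("customers_clean.csv", index=False)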
