Foundation of Data Science: Unit II: Describing Data

Describing Data with Averages

• Averages consist of numbers (or words) about which the data are, in some sense, centered. They are often referred to as measures of central tendency, which are covered in section 1.12.1.

1.12.1 Measuring the Central Tendency

• We look at various ways to measure the central tendency of data, including: Mean, Weighted mean, Trimmed mean, Median, Mode and Midrange.

1. Mean :

• The mean of a data set is the average of all the data values. The sample mean x̄ is the point estimator of the population mean μ.
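
As a minimal sketch of the mean as "sum of the values divided by the number of observations", the snippet below reuses the seven observations from the median example in this section. The variable names `sample` and `x_bar` are illustrative choices, not part of the original notes.

```python
from statistics import mean

# The seven observations used in the median example of this section
sample = [26, 18, 27, 12, 14, 29, 19]

# Sample mean: sum of the n observations divided by n
x_bar = sum(sample) / len(sample)

print(round(x_bar, 2))         # 20.71
print(round(mean(sample), 2))  # the standard-library mean gives the same value
```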

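• Sample mean:

$$\bar{x} = \frac{\text{Sum of the values of the } n \text{ observations}}{\text{Number of observations in the sample}} = \frac{\sum_{i=1}^{n} x_i}{n}$$

• Population mean:

$$\mu = \frac{\text{Sum of the values of the } N \text{ observations}}{\text{Number of observations in the population}} = \frac{\sum_{i=1}^{N} x_i}{N}$$

2. Median :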

• The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location.

• The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean.

• For an odd number of observations:

7 observations = 26, 18, 27, 12, 14, 29, 19

Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29

• The median is the middle value.

Median = 19

• For an even number of observations :

8 observations = 26, 18, 29, 12, 14, 27, 30, 19

Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30

• The median is the average of the middle two values.

Median = (19 + 26) / 2 = 22.5
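
The two cases above can be sketched in a few lines of Python. The helper name `median_by_sorting` is a hypothetical choice for illustration. It sorts the values and then picks the middle one, or averages the middle two when the count is even.

```python
def median_by_sorting(values):
    """Return the median by sorting and inspecting the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [26, 18, 27, 12, 14, 29, 19]
even_sample = [26, 18, 29, 12, 14, 27, 30, 19]

print(median_by_sorting(odd_sample))   # 19
print(median_by_sorting(even_sample))  # 22.5
```

Python's standard library offers `statistics.median`, which returns the same results for these inputs.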

3. Mode:

• The mode of a data set is the value that occurs with greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
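
A minimal sketch of this classification, assuming Python 3.8 or later for `statistics.multimode`. The helper `describe_modes`, the label "unimodal" for the single-mode case, and the sample lists are illustrative assumptions rather than part of the original notes.

```python
from statistics import multimode

def describe_modes(values):
    """Return the modes of `values` and a label for how many modes there are."""
    if not values:
        raise ValueError("values must not be empty")
    modes = multimode(values)  # every value tied for the greatest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

# Hypothetical data sets, chosen to hit each case
print(describe_modes([5, 1, 5, 2]))        # ([5], 'unimodal')
print(describe_modes([1, 2, 2, 3, 3, 4]))  # ([2, 3], 'bimodal')
print(describe_modes([1, 1, 2, 2, 3, 3]))  # ([1, 2, 3], 'multimodal')
```

Note that when every value occurs exactly once, `multimode` returns all of them, so this sketch would label such data multimodal. Whether that is the desired reading is a judgment call.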

• Weighted mean : Sometimes, each value in a set may be associated with a weight. The weights reflect the significance, importance or occurrence frequency attached to their respective values.
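
The notes do not spell out the formula. The sketch below assumes the usual definition, in which each value is multiplied by its weight and the total is divided by the sum of the weights. The function name `weighted_mean` and the scores and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical example: three scores, the first counted twice as heavily
scores = [70, 80, 90]
weights = [2, 1, 1]

print(weighted_mean(scores, weights))    # 77.5
print(weighted_mean(scores, [1, 1, 1]))  # 80.0, equal weights give the ordinary mean
```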

• Trimmed mean: A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the mean. The trimmed mean is the mean obtained after cutting off values at the high and low extremes.

• For example, we can sort the values and remove the top and bottom 2 % before computing the mean. We should avoid trimming too large a portion (such as 20 %) at both ends as this can result in the loss of valuable information.
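
A minimal sketch of the sort-and-cut procedure described above. The function name `trimmed_mean`, its `proportion` parameter (the fraction removed from each end), and the data are assumptions made for illustration. The number of values dropped from each end is rounded down, which is one of several possible conventions.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values to drop at each end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value (500)
data = [12, 14, 18, 19, 26, 27, 29, 30, 31, 500]

print(sum(data) / len(data))     # 70.6, inflated by the outlier
print(trimmed_mean(data, 0.10))  # 24.25, after dropping 12 and 500
print(trimmed_mean(data, 0.02))  # 70.6, since 2% of 10 values rounds down to 0
```

With only ten values, a 2% trim removes nothing. Small trimming proportions only take effect on larger data sets.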

A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset.
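
The notes do not say which measures are holistic. The sketch below assumes the standard examples: the mean can be rebuilt from per-subset sums and counts, while the median is holistic. The data and the two-way split are hypothetical, and the comparison shows only that one natural merging strategy (taking the median of the subset medians) fails. It is an illustration, not a proof.

```python
from statistics import median

# Hypothetical data set, split into two partitions
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
part_a, part_b = data[:3], data[3:]

# Mean: merging per-partition sums and counts recovers the overall mean
merged_mean = (sum(part_a) + sum(part_b)) / (len(part_a) + len(part_b))
overall_mean = sum(data) / len(data)
print(merged_mean == overall_mean)  # True

# Median: merging per-partition medians does not recover the overall median
merged_median = median([median(part_a), median(part_b)])
overall_median = median(data)
print(merged_median, overall_median)  # 4.25 5
```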
