Describing Data with Averages
• Averages consist of numbers (or words) about which the data are, in some sense, centered. They are often referred to as measures of central tendency. This topic is already covered in Section 1.12.1.
1.12.1 Measuring the Central Tendency
• We look at various ways to measure the central tendency of data, including the mean, weighted mean, trimmed mean, median, mode and midrange.
1. Mean:
• The mean of a data set is the average of all the data values. The sample mean x̄ is the point estimator of the population mean μ.
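Sample mean:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{\text{sum of the values of the } n \text{ observations}}{\text{number of observations in the sample}}$$
Population mean:
$$\mu = \frac{\sum x_i}{N} = \frac{\text{sum of the values of the } N \text{ observations}}{\text{number of observations in the population}}$$
2. Median: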
• The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data, because a few extremely large incomes or property values can inflate the mean.
• For an odd number of observations:
7 observations = 26, 18, 27, 12, 14, 29, 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29
• The median is the middle value.
Median = 19
• For an even number of observations:
8 observations = 26, 18, 29, 12, 14, 27, 30, 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30
• The median is the average of the middle two values.
Median = (19 + 26) / 2 = 22.5
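As a quick check of the two worked examples above, the following Python sketch uses the standard `statistics` module. The variable names are illustrative, and the value 290 is a made-up outlier (not from the text), used only to suggest how a single extreme value can shift the mean while leaving the median unchanged.

```python
import statistics

# Observations from the two worked examples above.
odd_data = [26, 18, 27, 12, 14, 29, 19]
even_data = [26, 18, 29, 12, 14, 27, 30, 19]

# statistics.median sorts the data internally; for an even number of
# observations it averages the two middle values.
median_odd = statistics.median(odd_data)
median_even = statistics.median(even_data)

# Assumption for illustration only: replace 29 with an extreme value.
with_outlier = [290 if x == 29 else x for x in odd_data]

mean_before = statistics.mean(odd_data)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("median (7 observations):", median_odd)
print("median (8 observations):", median_even)
print("mean before and after the outlier:", mean_before, mean_after)
print("median after the outlier:", median_after)
```

In this sketch the mean moves from about 20.7 to 58 once the outlier is introduced, while the median stays at 19.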
3. Mode:
• The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
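The mode definitions above can be sketched with `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency. The helper name `describe_modes`, the label "unimodal" for the single-mode case, and the sample lists are assumptions made for illustration; they do not come from the text.

```python
from statistics import multimode


def describe_modes(values):
    """Return (modes, label) for a non-empty list of values."""
    if not values:
        raise ValueError("values must not be empty")
    # multimode lists all values that share the greatest frequency,
    # in the order they first appear in the data.
    modes = multimode(values)
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label


# Hypothetical data sets, chosen only to show each case.
print(describe_modes([3, 5, 5, 7]))           # one mode
print(describe_modes([3, 3, 5, 7, 7]))        # two modes
print(describe_modes([1, 1, 2, 2, 3, 3, 4]))  # more than two modes
```

Note that when every value occurs exactly once, `multimode` returns all of them, so this simple labeling is only meaningful when some values actually repeat.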
• Weighted mean: Sometimes each value in a set is associated with a weight. The weights reflect the significance, importance or occurrence frequency attached to their respective values, and the weighted mean averages the values in proportion to these weights.
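One common way to write the weighted mean is the sum of each value times its weight, divided by the sum of the weights. The sketch below assumes that form; the function name `weighted_mean` and the sample scores, credits and frequencies are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical example 1: weights as importance (course credits).
scores = [80, 90]
credits = [1, 3]
print(weighted_mean(scores, credits))

# Hypothetical example 2: weights as occurrence frequencies.
# The value 2 occurs three times and the value 3 occurs once, so the
# result matches the plain mean of [2, 2, 2, 3].
print(weighted_mean([2, 3], [3, 1]))
```

When all weights are equal, this reduces to the ordinary mean.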
• Trimmed mean: A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the mean. The trimmed mean is the mean obtained after cutting off values at the high and low extremes.
• For example, we can sort the values and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
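The sort-and-cut procedure described above can be sketched as follows. This is a minimal version under stated assumptions: the function name `trimmed_mean` is hypothetical, the number of values cut from each end is rounded down, and the sample data (1 to 99 plus one made-up outlier of 10000) is invented for illustration.

```python
def trimmed_mean(values, proportion):
    """Mean after removing `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    n = len(ordered)
    # Number of values to cut from each end (rounded down; an assumption).
    k = int(n * proportion)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("no values left after trimming")
    return sum(kept) / len(kept)


# Hypothetical data: 1..99 plus a single extreme value.
data = list(range(1, 100)) + [10_000]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.02)  # drop the top and bottom 2%

print("plain mean:", plain)
print("2% trimmed mean:", trimmed)
```

With 100 values, a 2% trim removes two values from each end, so the outlier no longer affects the result.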
• A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset.
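To make the idea concrete, the sketch below contrasts the mean with the median, which is a common example of a holistic measure. The mean can be rebuilt from per-subset sums and counts, whereas merging per-subset medians does not in general give the median of the whole data set. The partition of the eight observations used earlier is an arbitrary choice made for illustration.

```python
import statistics

# The 8 observations from the median example, split into two subsets.
# This particular split is an assumption made for illustration.
part_a = [12, 14, 18]
part_b = [19, 26, 27, 29, 30]
full = part_a + part_b

# Mean: merging per-subset sums and counts reproduces the overall mean.
merged_mean = (sum(part_a) + sum(part_b)) / (len(part_a) + len(part_b))
full_mean = statistics.mean(full)

# Median: combining the per-subset medians does not reproduce the
# median of the whole data set for this partition.
median_of_medians = statistics.median(
    [statistics.median(part_a), statistics.median(part_b)]
)
full_median = statistics.median(full)

print("mean (merged vs full):", merged_mean, full_mean)
print("median (merged vs full):", median_of_medians, full_median)
```

For some partitions the merged medians happen to match the true median, but that agreement is a coincidence and cannot be relied on.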
Foundation of Data Science: Unit II: Describing Data: Describing Data with Averages
CS3352 | 3rd Semester CSE Dept | 2021 Regulation