UNIT I : Introduction
Syllabus
Data Science: Benefits and uses - Facets of data - Defining research goals - Retrieving data - Data preparation - Exploratory data analysis - Build the model - Presenting findings and building applications - Warehousing - Basic statistical descriptions of data.
Data Science
• Data is measurable units of information gathered or captured from the activity of people, places and things.
• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions. Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.
• Data science uses advanced analytical theory and methods such as time series analysis to predict the future from historical data. Instead of only reporting how many products were sold in the previous quarter, data science helps forecast future product sales and revenue more accurately.
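As a rough illustration of forecasting from historical data, the sketch below fits a straight-line trend to past quarterly sales and extrapolates one quarter ahead. The sales figures and the function names `fit_linear_trend` and `forecast_next` are assumptions made up for this example. A real time series analysis would also consider seasonality and noise, so treat this as a minimal sketch rather than a recommended method.

```python
# Hypothetical quarterly unit sales, oldest first (made-up numbers).
quarterly_sales = [120.0, 135.0, 150.0, 165.0]


def fit_linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast_next(values, steps_ahead=1):
    """Extrapolate the fitted trend steps_ahead periods past the last value."""
    intercept, slope = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


print(forecast_next(quarterly_sales))
```

With these made-up figures, sales grow by 15 units per quarter, so the trend line predicts 180 units for the next quarter.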
• Data science is devoted to the extraction of clean information from raw data to form actionable insights. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio and more to produce artificial intelligence systems that perform tasks which ordinarily require human intelligence.
• The data science field is growing rapidly and is revolutionizing many industries. It brings benefits to business, research and everyday life.
• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data, and they often use advanced algorithms and implement machine learning models to help businesses and organizations make accurate assessments and predictions. Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.
• Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data extraction.
2. Maintain: Data warehousing, data cleansing, data staging, data processing and data architecture.
3. Process: Data mining, clustering and classification, data modeling and data summarization.
4. Analyze: Exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.
5. Communicate: Data reporting, data visualization, business intelligence and decision making.
• Big data can be defined as very large volumes of data, available from various sources, in varying degrees of complexity, generated at different speeds (velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or commercial off-the-shelf solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
• Characteristics of big data are volume, velocity and variety. These three dimensions are often referred to as the three V's of big data.
1. Volume: The volume of data is larger than conventional relational database infrastructure can cope with, typically terabytes or petabytes of data.
2. Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines its real potential. Such data is created in or near real time.
3. Variety: It refers to heterogeneous sources and the nature of the data, both structured and unstructured.
• Two other characteristics of big data are veracity and value.
a) Veracity:
• Veracity refers to source reliability, information credibility and content validity.
• Veracity refers to the trustworthiness of the data. Can the manager rely on the data being representative? Every good manager knows that there are inherent discrepancies in all collected data.
• Spatial veracity: For vector data (imagery based on points, lines and polygons), the quality varies. It depends on whether the points were determined by GPS, manually or by an unknown method. Resolution and projection issues can also alter veracity.
• For geo-coded points, there may be errors in the address tables and in the point location algorithms associated with addresses.
• For raster data (imagery based on pixels), veracity depends on the accuracy of the recording instruments in satellites or aerial devices and on timeliness.
b) Value:
• It represents the business value to be derived from big data.
• The ultimate objective of any big data project should be to generate some sort of value for the company doing the analysis. Otherwise, the user is just performing a technological task for technology's sake.
• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
• Exploration of data trends can include spatial proximities and relationships.
• Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.
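Of the spatial analytics listed above, the location quotient is the simplest to compute by hand. In its usual form it compares a sector's share of a local total with the same sector's share in a larger reference region. A value above 1 suggests the sector is more concentrated locally than in the reference region. The sketch below assumes this standard definition. The function name `location_quotient` and the job counts are hypothetical and serve only as an illustration.

```python
def location_quotient(local_sector, local_total, reference_sector, reference_total):
    """Return the local share of a sector divided by its reference-region share."""
    if local_total <= 0 or reference_total <= 0 or reference_sector <= 0:
        raise ValueError("totals and the reference sector count must be positive")
    local_share = local_sector / local_total
    reference_share = reference_sector / reference_total
    return local_share / reference_share


# Hypothetical counts: 2,000 of 20,000 local jobs are in logistics,
# versus 50,000 of 1,000,000 jobs in the reference region.
lq = location_quotient(2_000, 20_000, 50_000, 1_000_000)
print(round(lq, 2))
```

Here the local share is 10% against a reference share of 5%, so the quotient is about 2, meaning that logistics is roughly twice as concentrated locally.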
• Data science examples and applications:
a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as "important"
c) Forecasting: Sales, revenue and customer retention
d) Pattern detection: Weather patterns, financial market patterns
e) Recognition: Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines can refer users to movies, restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on amenities
h) Optimization: Scheduling ride-share pickups and package deliveries
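To make one of these applications concrete, the sketch below shows a very simple form of anomaly detection. It flags values that lie far from the mean, measured in standard deviations (a z-score). The transaction amounts, the threshold of 2.5 and the function name `find_anomalies` are assumptions chosen for illustration. Production fraud detection relies on far richer models.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.5):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - center) / spread > threshold
    ]


# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500, 49]
print(find_anomalies(amounts))
```

On this sample only the 500 entry is flagged. With n values, the largest possible z-score under the population standard deviation is sqrt(n - 1), so the threshold has to be chosen with the sample size in mind.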
• Benefits of big data:
1. Improved customer service
2. Businesses can use outside intelligence when making decisions
3. Reduced maintenance costs
4. Product redevelopment: Big data can also help us understand how others perceive our products so that we can adapt them, or our marketing, if need be.
5. Early identification of risks to products or services, if any
6. Better operational efficiency
• Some examples of big data are:
1. Social media: Social media is one of the biggest contributors to the flood of data we have today. Facebook generates more than 500 terabytes of data every day in the form of user-generated content such as status messages, photo and video uploads, messages and comments.
2. Stock exchange: Data generated by stock exchanges also runs to terabytes per day. Most of this data is the trade data of users and companies.
3. Aviation industry: A single jet engine can generate around 10 terabytes of data during a 30-minute flight.
4. Survey data: Online or offline surveys conducted on various topics typically have hundreds or thousands of responses, which need to be processed for analysis and visualization by clustering the population and their associated responses.
5. Compliance data: Organizations in sectors such as healthcare, hospitals, life sciences and finance have to file compliance reports.
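To get a feel for the velocity behind such figures, the quoted volumes can be converted into average data rates. The sketch below takes the numbers above at face value and assumes decimal units (1 terabyte = 10^12 bytes). The function name `average_rate_gb_per_s` is made up for this illustration.

```python
BYTES_PER_TERABYTE = 10**12  # assumption: decimal (SI) terabytes
BYTES_PER_GIGABYTE = 10**9


def average_rate_gb_per_s(terabytes, seconds):
    """Convert a data volume produced over a time span into gigabytes per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return terabytes * BYTES_PER_TERABYTE / seconds / BYTES_PER_GIGABYTE


# 500 terabytes per day (social media example).
social_rate = average_rate_gb_per_s(500, 24 * 60 * 60)
# 10 terabytes in a 30-minute flight (jet engine example).
engine_rate = average_rate_gb_per_s(10, 30 * 60)

print(f"social media: {social_rate:.2f} GB/s")
print(f"jet engine:   {engine_rate:.2f} GB/s")
```

Both figures work out to an average of between 5 and 6 gigabytes per second, which illustrates why such streams are hard to handle with conventional storage and processing tools.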
Foundation of Data Science
CS3352 | 3rd Semester CSE Dept | 2021 Regulation