Foundation of Data Science: Unit I: Introduction

Data Science and Big Data

Definition, Characteristics, Comparison, Benefits, Uses


UNIT I : Introduction

Syllabus

Data Science: Benefits and uses - Facets of data - Defining research goals - Retrieving data - Data preparation - Exploratory data analysis - Build the model - Presenting findings and building applications - Warehousing - Basic statistical descriptions of data.

Data Science

• Data are measurable units of information gathered or captured from the activity of people, places and things.

• Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. At its core, Data Science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions. Data science combines math and statistics, specialized programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.

• Data science uses advanced analytical theory and methods such as time series analysis to predict the future from historical data. Instead of only knowing how many products were sold in the previous quarter, a business can use data science to forecast future product sales and revenue more accurately.

• Data science is devoted to the extraction of clean information from raw data to form actionable insights. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio and more to produce artificial intelligence systems that perform tasks ordinarily requiring human intelligence.

• The data science field is growing rapidly and transforming many industries, with benefits for business, research and everyday life.

• As a general rule, data scientists are skilled in detecting patterns hidden within large volumes of data and they often use advanced algorithms and implement machine learning models to help businesses and organizations make accurate assessments and predictions. Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.

• Life cycle of data science:

1. Capture: Data acquisition, data entry, signal reception and data extraction.

2. Maintain: Data warehousing, data cleansing, data staging, data processing and data architecture.

3. Process: Data mining, clustering and classification, data modeling and data summarization.

4. Analyze: Exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.

5. Communicate: Data reporting, data visualization, business intelligence and decision making.
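The forecasting idea described earlier in this section (using historical sales to estimate the next quarter) can be illustrated with a very small sketch. It fits a straight-line trend by ordinary least squares and extends it one step ahead. The function names and the quarterly figures are hypothetical, and a linear trend is only one simple stand-in for the time series methods the text mentions.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * x by least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_next(values, steps=1):
    """Extend the fitted trend `steps` periods past the last observation."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(steps)]


# Hypothetical units sold in the last four quarters (illustrative data only).
quarterly_sales = [120.0, 130.0, 140.0, 150.0]
next_quarter = forecast_next(quarterly_sales)[0]
print(f"Forecast for next quarter: {next_quarter:.1f}")
```

Real sales data usually shows seasonality and noise, so in practice a dedicated time series model would be a better fit than a single straight line.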

Big Data

• Big data can be defined as very large volumes of data, available from various sources, in varying degrees of complexity, generated at different speeds (velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or commercial off-the-shelf solutions.

• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.

Characteristics of Big Data

• Characteristics of big data are volume, velocity and variety. They are often referred to as the three V's.

1. Volume: The volume of data is larger than conventional relational database infrastructure can cope with. It typically consists of terabytes or petabytes of data.

2. Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data. Big data is created in or near real time.

3. Variety: It refers to heterogeneous sources and the nature of data, both structured and unstructured.

• These three dimensions are also called the three V's of big data.

• Two other characteristics of big data are veracity and value.

a) Veracity:

• Veracity refers to source reliability, information credibility and content validity.

• Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is representative? Every good manager knows that there are inherent discrepancies in all the data collected.

• Spatial veracity: For vector data (imagery based on points, lines and polygons), the quality varies. It depends on whether the points were determined by GPS, manually or by a method of unknown origin. Resolution and projection issues can also alter veracity.

• For geo-coded points, there may be errors in the address tables and in the point location algorithms associated with addresses.

• For raster data (imagery based on pixels), veracity depends on accuracy of recording instruments in satellites or aerial devices and on timeliness.

b) Value :

• It represents the business value to be derived from big data.

• The ultimate objective of any big data project should be to generate some sort of value for the company doing the analysis. Otherwise, the user is just performing a technological task for technology's sake.

• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in spatial phenomena such as climate, traffic, social-media-based attitudes and massive inventory locations.

• Exploration of data trends can include spatial proximities and relationships.

• Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.
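Of the spatial analytics named above, the location quotient is the simplest to show in code. As a rough sketch, it is commonly computed as the share of an activity in a local area divided by the share of the same activity in a larger reference region. The function name, parameter names and counts below are hypothetical and chosen only for illustration.

```python
def location_quotient(local_count, local_total, ref_count, ref_total):
    """Local share of an activity divided by its share in a reference region.

    A value above 1 suggests the activity is more concentrated locally
    than in the reference region; a value below 1 suggests the opposite.
    """
    if local_total <= 0 or ref_total <= 0 or ref_count <= 0:
        raise ValueError("totals and the reference count must be positive")
    local_share = local_count / local_total
    ref_share = ref_count / ref_total
    return local_share / ref_share


# Hypothetical example: 200 of 1,000 local jobs are in logistics,
# compared with 5,000 of 50,000 jobs in the wider region.
lq = location_quotient(200, 1000, 5000, 50000)
print(f"Location quotient: {lq:.2f}")
```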

Difference between Data Science and Big Data

Comparison between Cloud Computing and Big Data

Benefits and Uses of Data Science

• Data science examples and applications:

a) Anomaly detection: Fraud, disease and crime

b) Classification: Background checks; an email server classifying emails as "important"

c) Forecasting: Sales, revenue and customer retention

d) Pattern detection: Weather patterns, financial market patterns

e) Recognition : Facial, voice and text

f) Recommendation: Based on learned preferences, recommendation engines can refer users to movies, restaurants and books

g) Regression: Predicting food delivery times, predicting home prices based on amenities

h) Optimization: Scheduling ride-share pickups and package deliveries
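To make the first application in the list above, anomaly detection, more concrete, here is a minimal sketch that flags values lying far from the mean in terms of standard deviations (a z-score rule). This is one simple technique among many, not a method prescribed by these notes. The transaction amounts and the threshold of 2.0 are assumptions made for the example.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical card transaction amounts; the last one is unusually large.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
print(find_anomalies(amounts))
```

With very small samples the z-score of a single outlier is bounded, which is why a modest threshold is used here. A production fraud system would rely on far richer features and models.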

Benefits and Uses of Big Data

• Benefits of Big Data :

1. Improved customer service

2. Businesses can utilize outside intelligence when making decisions

3. Reducing maintenance costs

4. Re-develop our products : Big Data can also help us understand how others perceive our products so that we can adapt them or our marketing, if need be.

5. Early identification of risk to the product/services, if any

6. Better operational efficiency

• Some examples of big data are:

1. Social media: Social media is one of the biggest contributors to the flood of data we have today. Facebook generates more than 500 terabytes of data every day in the form of user-generated content such as status messages, photo and video uploads, messages and comments.

2. Stock exchange : Data generated by stock exchanges is also in terabytes per day. Most of this data is the trade data of users and companies.

3. Aviation industry: A single jet engine can generate around 10 terabytes of data during a 30 minute flight.

4. Survey data: Online or offline surveys conducted on various topics typically have hundreds or thousands of responses, which need to be processed for analysis and visualization by clustering the population and the associated responses.

5. Compliance data: Many organizations in sectors such as healthcare, life sciences and finance have to file compliance reports.
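The volume figures quoted in the examples above can be turned into approximate data rates, which ties them back to the velocity characteristic. The sketch below is plain arithmetic on the numbers given in the text. It assumes decimal terabytes (10^12 bytes) and a constant rate over the period, both of which are simplifications.

```python
TB = 10**12  # bytes in a decimal terabyte (assumption; a binary TiB is 2**40)


def bytes_per_second(terabytes, seconds):
    """Average data rate in bytes per second over the given period."""
    return terabytes * TB / seconds


# Figures taken from the examples in the text.
jet_engine_rate = bytes_per_second(10, 30 * 60)          # 10 TB in 30 minutes
social_media_rate = bytes_per_second(500, 24 * 60 * 60)  # 500 TB in one day

print(f"Jet engine:   {jet_engine_rate / 1e9:.2f} GB/s")
print(f"Social media: {social_media_rate / 1e9:.2f} GB/s")
```

Both sources average several gigabytes per second, a rate at which storing and processing the data on a single conventional database server becomes impractical.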
