Facets of Data
• Big data and data science generate and work with very large amounts of data. This data comes in various types, and the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
• Structured data is arranged in rows and columns. This format makes it easy for applications to retrieve and process the data. A database management system is typically used to store structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data is a database, where specific information is stored in columns and rows.
• Structured data is also searchable by data type within its content. It is understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
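As a minimal sketch of why rows and columns make retrieval easy, the following example stores a few records in an in-memory SQLite database and queries them by a typed column. The table name, column names and sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate row/column storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO employees (id, name, age) VALUES (?, ?, ?)",
    [(1, "Asha", 31), (2, "Ravi", 45), (3, "Meena", 28)],
)

# Every row has the same typed columns, so retrieval is a simple query.
rows = conn.execute(
    "SELECT name, age FROM employees WHERE age > ? ORDER BY age", (30,)
).fetchall()
print(rows)
conn.close()
```

Because the `age` column has a declared type, the filter `age > 30` can be expressed directly, without first parsing free text.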
• Unstructured data is data that does not follow a specified format. It is not organized in rows and columns and has no identifiable structure, so it is difficult to retrieve the required information from it.
• Unstructured data can take the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.
• Even today, in most organizations more than 80% of the data is commonly estimated to be in unstructured form. It carries a great deal of information, but extracting that information from such varied sources is a major challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequences for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
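To illustrate why extracting information from unstructured text takes extra work, here is a small sketch that pulls an email address and an order number out of a free-form message using regular expressions. The message text and both patterns are assumptions made for this example; real email bodies vary far more, which is exactly the difficulty described above.

```python
import re

# Hypothetical free-form email body: no rows, no columns, no fixed layout.
email_body = (
    "Hi team, the order 48213 arrived late. "
    "Please call me on 555-0142 or write to priya@example.com. Thanks, Priya"
)

# Simplified patterns for illustration; they do not cover every valid form.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORDER_RE = re.compile(r"\border\s+(\d+)\b", re.IGNORECASE)

emails = EMAIL_RE.findall(email_body)
orders = ORDER_RE.findall(email_body)
print(emails, orders)
```

Each new kind of fact needs its own pattern or model, whereas in structured data the same fact would simply be a column.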
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences, and then apply meaning and understanding to that information. This helps machines understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many modern real-world applications. The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, the input must go through speech recognition, natural language understanding and machine translation. It is an iterative process composed of several layers of text analysis.
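The idea of layered text analysis can be sketched with a toy pipeline: split text into sentences, split sentences into words, then attach a crude meaning (a sentiment score). The word lists and function names below are hypothetical, and this lexicon approach is a deliberately simple stand-in for the statistical models used in practice.

```python
import re

# Tiny hand-made sentiment lexicon; an assumption for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def split_sentences(text):
    """Layer 1: recognize sentence boundaries at ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Layer 2: recognize lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def sentiment_score(tokens):
    """Layer 3: positive word count minus negative word count."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

text = "The camera is great. The battery is terrible! I love the screen."
sentences = split_sentences(text)
scores = [sentiment_score(tokenize(s)) for s in sentences]
for sentence, score in zip(sentences, scores):
    print(score, sentence)
```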
• Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. Data entered manually by an end user is therefore not considered machine-generated.
• Machine data contains a definitive record of the activity and behavior of customers, users, transactions, applications, servers, networks, factory machinery and so on.
• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data. Machine data is generated continuously by every processor-based system, as well as by many consumer-oriented systems.
• Machine data can be either structured or unstructured. In recent years, the volume of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
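Web server logs are a typical case of machine data that becomes structured once it is parsed. The sketch below assumes log lines in the widely used Common Log Format; the sample lines and the regular expression are illustrative, and real server logs may use a different layout.

```python
import re
from collections import Counter

# Pattern for Common Log Format lines (assumed layout for this example).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical log lines written by a server, not typed by a person.
log_lines = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.10 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 512',
]

records = []
for line in log_lines:
    match = LOG_RE.match(line)
    if match:  # skip lines that do not fit the assumed format
        records.append(match.groupdict())

status_counts = Counter(record["status"] for record in records)
print(status_counts)
```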
• Graphs are data structures that describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and a collection of interactions between pairs of nodes called edges.
• Nodes represent entities, which can be any type of object relevant to the problem domain. Connecting nodes with edges produces a graph (network) of nodes.
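A minimal way to hold nodes and edges in memory is an adjacency list, sketched below for an undirected graph. The names in the edge list are made up for the example.

```python
from collections import defaultdict

# Hypothetical edges: each pair is an interaction between two nodes.
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Dave")]

# Adjacency list: each node maps to the set of nodes it is connected to.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)  # undirected: store the edge in both directions

neighbors_of_alice = sorted(graph["Alice"])
print(neighbors_of_alice)
```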
• A graph database stores nodes and relationships instead of tables or documents. Data is stored much as we might sketch ideas on a whiteboard: it is not restricted to a predefined model, which allows a very flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use relationships to process financial and purchase transactions in near real time. With fast graph queries, we can detect, for example, that a potential purchaser is using the same email address and credit card as those involved in a known fraud case.
• Graph databases can also help users easily detect relationship patterns, such as multiple people associated with one personal email address or multiple people sharing the same IP address but residing at different physical addresses.
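The shared-attribute pattern described above can be sketched without a graph database by grouping people under the email addresses and IP addresses they use. All identifiers and values here are hypothetical, and a production system would run an equivalent query inside the graph database itself.

```python
from collections import defaultdict

# Hypothetical links: (person, attribute kind, attribute value).
links = [
    ("cust_1", "email", "a@example.com"),
    ("cust_2", "email", "a@example.com"),
    ("cust_2", "ip", "198.51.100.7"),
    ("cust_3", "ip", "198.51.100.7"),
    ("cust_4", "email", "d@example.com"),
]
known_fraud = {"cust_1"}  # assumed to come from earlier investigations

# Treat each attribute value as a node connected to the people who use it.
people_by_attribute = defaultdict(set)
for person, kind, value in links:
    people_by_attribute[(kind, value)].add(person)

# Attributes shared by more than one person are the suspicious patterns.
shared = {
    attr: people for attr, people in people_by_attribute.items() if len(people) > 1
}

# Flag anyone who directly shares an attribute with a known fraud case.
flagged = set()
for people in shared.values():
    if people & known_fraud:
        flagged |= people - known_fraud
print(sorted(flagged))
```

Only direct sharing is checked here; following longer chains (for example `cust_3` through `cust_2`) is where graph queries become more convenient.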
• Graph databases are a good choice for recommendation applications. With graph databases, we can store relationships between information categories such as customer interests, friends and purchase history in a graph. We can then use a highly available graph database to make product recommendations to a user based on which products were purchased by others who follow the same sport and have a similar purchase history.
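As a rough sketch of recommending from purchase relationships, the example below scores items bought by other customers according to how much their purchase history overlaps with the target user's. The data, the `recommend` function and the overlap-based scoring rule are assumptions for illustration, not a description of any particular graph database feature.

```python
from collections import Counter

# Hypothetical customer -> purchased items relationships.
purchases = {
    "ana": {"racket", "balls", "shoes"},
    "ben": {"racket", "balls", "grip"},
    "cy": {"balls", "grip", "bag"},
    "dee": {"novel"},
}

def recommend(user, purchases, top_n=2):
    """Suggest items bought by customers with overlapping purchase history."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(owned & items)  # number of shared purchases
        if overlap == 0:
            continue  # no similarity, ignore this customer
        for item in items - owned:
            scores[item] += overlap
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

recommendations = recommend("ana", purchases)
print(recommendations)
```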
• Graph theory was probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis to determine important features of the network, such as its nodes and links (for example, influencers and their followers).
• Influencers on a social network are users who have an impact on the activities or opinions of other users, through followership or through influence on the decisions made by other users on the network, as shown in Fig. 1.2.1.
• Graph theory has proved to be very effective on large-scale datasets such as social network data, because it can bypass building an actual visual representation of the data and run directly on data matrices.
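Running directly on a data matrix can be illustrated with an adjacency matrix of "follows" relationships: counting the entries in each column gives the number of followers per user, a simple in-degree measure of influence. The user names and matrix are invented for the example, and in-degree is only one of several possible influence measures.

```python
# Hypothetical users; follows[i][j] == 1 means users[i] follows users[j].
users = ["u0", "u1", "u2", "u3"]
follows = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]

# In-degree (follower count) of user j is the sum of column j.
follower_counts = [sum(row[j] for row in follows) for j in range(len(users))]

# The user with the most followers is treated as the top influencer here.
top_index = max(range(len(users)), key=lambda j: follower_counts[j])
top_influencer = users[top_index]
print(follower_counts, top_influencer)
```

No drawing of the network is needed; the computation works on the matrix alone.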
• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Digital audio and video recordings are encoded with audio and video codecs and can be uncompressed, losslessly compressed or lossy compressed, depending on the desired quality and use case.
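The difference between uncompressed, lossless and lossy storage can be sketched with a synthetic tone. In this example `zlib` is a general-purpose compressor standing in for a real lossless audio codec, and dropping the low 8 bits of each sample is a crude stand-in for a lossy codec; the sample rate and tone frequency are arbitrary assumptions.

```python
import math
import struct
import zlib

SAMPLE_RATE = 8000  # samples per second (assumed for this example)
FREQUENCY = 440     # tone frequency in Hz (assumed)

# One second of a sine tone as signed 16-bit sample values.
samples = [
    int(32767 * math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]

# Uncompressed: raw little-endian 16-bit PCM, 2 bytes per sample.
raw = struct.pack(f"<{len(samples)}h", *samples)

# Lossless: compressed bytes decompress back to exactly the same data.
lossless = zlib.compress(raw)
restored = zlib.decompress(lossless)

# Lossy (crude): discard the low 8 bits; the original cannot be recovered.
lossy_samples = [(s >> 8) << 8 for s in samples]

print(len(raw), len(lossless))
print(restored == raw, lossy_samples == samples)
```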
• Multimedia data is one of the most important sources of information and knowledge, and its integration, transformation and indexing bring significant challenges in data management and analysis. The challenges to be addressed include big data, the multidisciplinary nature of data science and heterogeneity.
• Data science plays an important role in addressing these challenges. Multimedia data usually contains various forms of media, such as text, images, video, geographic coordinates and even pulse waveforms, which come from multiple sources. Data science can be a key instrument, covering big data, machine learning and data mining solutions to store, handle and analyze such heterogeneous data.
• Streaming data is data that is generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes (on the order of kilobytes).
• Streaming data includes a wide variety of data, such as log files generated by customers using mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services, and telemetry from connected devices or instrumentation in data centers.
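Because streaming records arrive continuously, they are usually processed one at a time rather than loaded as a whole. The sketch below simulates a small telemetry stream with a generator and keeps a rolling average over the most recent readings; the sensor name, the values and the window size are hypothetical.

```python
from collections import deque

def sensor_stream():
    """Simulated source: yields one small telemetry record at a time."""
    for value in [21.0, 21.5, 22.0, 30.0, 22.5]:
        yield {"sensor": "s1", "temp": value}

def rolling_average(stream, window=3):
    """Yield the average of the last `window` readings after each record."""
    recent = deque(maxlen=window)  # old readings drop out automatically
    for record in stream:
        recent.append(record["temp"])
        yield sum(recent) / len(recent)

averages = list(rolling_average(sensor_stream()))
print(averages)
```

Only the last few readings are kept in memory, so the same approach would work on a stream that never ends.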
Foundation of Data Science
CS3352 3rd Semester CSE Dept | 2021 Regulation