Foundation of Data Science: Unit I: Introduction

Facets of Data

Data Science



• Big data and data science applications generate very large amounts of data. This data comes in various types, and the main categories are as follows:

a) Structured

b) Natural language

c) Graph-based

d) Streaming

e) Unstructured

f) Machine-generated

g) Audio, video and images

Structured Data

• Structured data is arranged in rows and columns. This format makes it easy for applications to retrieve and process the data. A database management system is typically used to store structured data.

• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data is a database, where specific information is stored in columns and rows.

• Structured data is also searchable by data type within content. Structured data is understood by computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
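The row-and-column idea can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration: the `customers` table, its columns and the sample rows are hypothetical and do not come from any real dataset.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, age INTEGER)"
)
rows = [
    (1, "Asha", "Chennai", 34),
    (2, "Ravi", "Pune", 28),
    (3, "Meena", "Chennai", 41),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Because every row has the same typed columns, retrieval is a simple query.
chennai_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY age", ("Chennai",)
    )
]
print(chennai_names)
conn.close()
```

The fixed schema is what allows the query to filter by `city` and sort by `age` without inspecting the content of each record.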

Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.

• Unstructured data can take the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.

• Even today, it is commonly estimated that more than 80% of the data in most organizations is unstructured. This data carries a lot of information, but extracting that information from such varied sources is a major challenge.

• Characteristics of unstructured data:

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restrictions or sequences for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
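To give a feel for why extraction from unstructured text is hard, the sketch below pulls email addresses out of a free-text customer message with a regular expression. The message and the pattern are illustrative assumptions; the pattern is deliberately simplified and is not a complete email validator.

```python
import re

# Hypothetical free-text feedback with no fixed layout.
feedback = (
    "Hi team, the delivery was late again. Please contact me at "
    "priya@example.com or call 555-0142. My colleague (sam.lee@example.org) "
    "had the same problem last week."
)

# Simplified pattern (an assumption for this sketch), not a full email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

emails = EMAIL_RE.findall(feedback)
print(emails)
```

Each new kind of information (phone numbers, dates, product names) would need its own pattern or model, which is part of what makes unstructured data costly to process.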

Natural Language

• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and sentences, then apply meaning and understanding to that information. This helps machines to understand language as humans do.

• Natural language processing is the driving force behind machine intelligence in many modern real-world applications. The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion and sentiment analysis.

• For natural language processing to help machines understand human language, the input typically passes through stages such as speech recognition, natural language understanding and machine translation. It is an iterative process composed of several layers of text analysis.
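Two of the simplest layers of text analysis, splitting text into words and scoring sentiment, can be sketched as follows. The word lists are a toy lexicon invented for this example; practical sentiment analysis relies on much larger lexicons or trained models.

```python
import re

# Toy sentiment lexicon (hypothetical, for illustration only).
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Count positive words minus negative words."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokenize(text))


reviews = [
    "The service was great and delivery was fast.",
    "Terrible app, slow and bad support.",
]
scores = [sentiment_score(r) for r in reviews]
print(scores)
```

A word-counting approach like this ignores negation and context ("not good" would still score as positive), which is one reason the field moved toward statistical and neural methods.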

Machine-Generated Data

• Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers, users, transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs and telemetry.

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data. Machine data is generated continuously by every processor-based system, as well as many consumer-oriented systems.

• Machine data can be either structured or unstructured. In recent years, the volume of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
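Web server logs are a typical case of machine data that looks like free text but follows a regular layout. The sketch below parses one line in the widely used Common Log Format into a structured record. The sample line and the function name `parse_line` are made up for illustration, and the pattern covers only this basic layout.

```python
import re
from datetime import datetime

# Pattern for the basic Common Log Format (an assumption about the input).
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


# Hypothetical log line using a documentation-reserved IP address.
sample = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_line(sample)
print(record["host"], record["status"], record["size"])
```

Once parsed, the records can be stored and queried like any other structured data.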

Graph-based or Network Data

• Graphs are data structures that describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and a collection of interactions between pairs of nodes called edges.

• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.

• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model, allowing a very flexible way of thinking about and using it.

• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use relationships to process financial and purchase transactions in near-real time. With fast graph queries, we are able to detect that, for example, a potential purchaser is using the same email address and credit card as included in a known fraud case.

• Graph databases can also help users easily detect relationship patterns, such as multiple people associated with a personal email address or multiple people sharing the same IP address but residing at different physical addresses.

• Graph databases are a good choice for recommendation applications. In a graph, we can store relationships between information categories such as customer interests, friends and purchase history. We can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have a similar purchase history.

• Graph theory was probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as its nodes and links (for example, influencers and their followers).

• Influencers on a social network are identified as users who have an impact on the activities or opinions of other users, by way of followership or influence on decisions made by other users on the network, as shown in Fig. 1.2.1.

• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is because it can bypass building an actual visual representation of the data and run directly on data matrices.
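The node-and-edge model, the shared-email pattern and the influencer idea described in this section can be sketched in plain Python, without a graph database. All node names and relationship labels below are hypothetical, and counting followers (in-degree) is only one simple way to approximate influence.

```python
from collections import defaultdict

# Edges as (source, relationship, target); all values are hypothetical.
edges = [
    ("alice", "USES_EMAIL", "a@example.com"),
    ("bob", "USES_EMAIL", "shared@example.com"),
    ("carol", "USES_EMAIL", "shared@example.com"),
    ("bob", "FOLLOWS", "alice"),
    ("carol", "FOLLOWS", "alice"),
    ("alice", "FOLLOWS", "dave"),
]

# Adjacency list: node -> list of (relationship, neighbour).
adjacency = defaultdict(list)
for src, rel, dst in edges:
    adjacency[src].append((rel, dst))

# Pattern 1: email addresses used by more than one person.
users_by_email = defaultdict(set)
for src, rel, dst in edges:
    if rel == "USES_EMAIL":
        users_by_email[dst].add(src)
shared_emails = {e: sorted(u) for e, u in users_by_email.items() if len(u) > 1}

# Pattern 2: number of followers (in-degree) as a rough influence measure.
followers = defaultdict(int)
for src, rel, dst in edges:
    if rel == "FOLLOWS":
        followers[dst] += 1
top_influencer = max(followers, key=followers.get)

print(shared_emails)
print(top_influencer)
```

A graph database performs the same kind of traversal, but with indexes and a query language suited to much larger networks.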

Audio, Image and Video

• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.

• The terms audio and video commonly refer to time-based media storage formats for sound or music and for moving pictures. Digital audio and video recordings are encoded with audio and video codecs and can be uncompressed, losslessly compressed or lossy compressed, depending on the desired quality and use case.

• Multimedia data is one of the most important sources of information and knowledge, and the integration, transformation and indexing of multimedia data bring significant challenges in data management and analysis. Many challenges have to be addressed, including big data, the multidisciplinary nature of data science, and heterogeneity.

• Data science plays an important role in addressing these challenges in multimedia data. Multimedia data usually contains various forms of media, such as text, image, video, geographic coordinates and even pulse waveforms, which come from multiple sources. Data science can be a key instrument, covering big data, machine learning and data mining solutions to store, handle and analyze such heterogeneous data.
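The difference between uncompressed, lossless and lossy storage mentioned above can be illustrated with a synthetic tone. This is a rough sketch under stated assumptions: the sample rate and tone are arbitrary, `zlib` stands in for a lossless codec, and dropping the low 8 bits of each sample stands in for a lossy one. Real audio codecs use far more sophisticated techniques.

```python
import math
import struct
import zlib

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

# One second of a 440 Hz sine tone as 16-bit integer samples.
samples = [
    int(20000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]

# Uncompressed: raw 16-bit signed little-endian PCM, 2 bytes per sample.
raw = struct.pack(f"<{len(samples)}h", *samples)

# Lossless: the round trip reproduces the original bytes exactly.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)

# Lossy (crude): discard the low 8 bits of every sample.
lossy = [(s >> 8) << 8 for s in samples]
max_error = max(abs(a - b) for a, b in zip(samples, lossy))

print(len(raw), len(packed), restored == raw, max_error)
```

The lossless path recovers every sample, while the lossy path trades a bounded error per sample for needing fewer bits.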

Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes (on the order of kilobytes).

• Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services and telemetry from connected devices or instrumentation in data centers.
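Because streaming records arrive continuously, they are usually processed one at a time or over a small sliding window rather than loaded all at once. The sketch below simulates this with a Python generator. The `sensor_stream` source, its device names and its readings are hypothetical stand-ins for a real data feed.

```python
import itertools
from collections import deque


def sensor_stream():
    """Hypothetical endless source that yields one small record at a time."""
    for i in itertools.count():
        yield {"device": f"sensor-{i % 3}", "reading": 20.0 + (i % 5)}


def rolling_mean(records, window=4):
    """Yield the mean of the most recent `window` readings after each record."""
    recent = deque(maxlen=window)
    for rec in records:
        recent.append(rec["reading"])
        yield sum(recent) / len(recent)


# Consume only the first six results from the otherwise unbounded stream.
means = list(itertools.islice(rolling_mean(sensor_stream()), 6))
print(means)
```

Only the last few readings are held in memory at any time, which is what makes this style of processing practical for unbounded streams.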