A distributed database system consists of loosely coupled sites (computer) that share no physical components and each site is associated a database system.
Unit: V: Advanced Topics
Syllabus
Distributed Databases: Architecture,
Data Storage, Transaction Processing, Query processing and optimization - NOSQL
Databases: Introduction - CAP Theorem - Document Based systems - Key value
Stores - Column Based Systems - Graph Databases. Database Security: Security
issues - Access control based on privileges - Role Based access control - SQL
Injection - Statistical Database security - Flow control - Encryption and
Public Key infrastructures - Challenges.
Distributed Databases
AU: Dec.-04,07,16,17,19,
May-03,06,14,16,17,19, Marks 16
Definition of distributed databases:
• A distributed database system
consists of loosely coupled sites (computer) that share no physical components
and each site is associated a database system.
• The software that maintains and manages the working
of distributed databases is called distributed database management system.
• The database system that runs on
each site is independent of each other. Refer Fig. 5.1.1.
The transactions can access data at one or more sites.
Advantages of distributed database system
(1) There is fast data processing as several sites participate in
request processing. 2.
(2) Reliability and availability
of this system is high.
(3) It possess reduced operating cost.
(4) It is easier to expand the
system by adding more sites.
(5) It has improved sharing ability and local
autonomy.
Disadvantages of distributed database system
(1) The system becomes complex to manage and control.
(2) The security issues must be carefully managed.
(3) The system require deadlock handling during the
transaction processing otherwise the entire system may be in inconsistent
state.
(4) There is need of some standardization for
processing of distributed database system.
Difference between distributed DBMS and centralized
DBMS
Uses of distributed system:
(1) Often distributed databases are used by organizations that have
numerous offices in different geographical locations. Typically an individual
branch is interacting primarily with the data that pertain to its own
operations, with a much less frequent need for general company data. In such a
situation, distributed systems are useful.
(2) Using distributed system, one can give permissions to single
sections of the overall database, for better internal and external protection.
(3) If we need to add a new location to a business, it is simple to
create an additional node within the database, making distribution highly
scalable.
There are two types of distributed databases -
(1) Homogeneous databases
• The homogeneous databases are kind
of database systems in which all sites have identical software running on them.
Refer Fig. 5.1.2.
• In this system, all the sites are aware of the
other sites present in the system and they all cooperate in processing user's
request.
• Each site present in the system, surrenders part of
its autonomy in terms of right to change schemas or software.
• The homogeneous database system
appears as a single system to the user.
(2) Heterogeneous databases
• The heterogeneous databases are kind
of database systems in which different sites have different schema or software.
Refer Fig. 5.1.3.
• The participating sites are not aware of other
sites present in the system.
• These sites provide limited
facilities for cooperation in transaction processing.
• Following is an architecture of
distributed databases. In this architecture the local database is maintained by
each site.
• Each
site is interconnected by communication network.
When user makes a request for particular data at site Si
then it is first searched at the local database. If the data is not present in
the local database then the request for that data is passed to all the other
sites via communication network. Each site then searches for that data at its
local database. When data is found at particular site say Sj then it is
transmitted to site Si via communication network.
There are two approaches of storing relation r in distributed database -
(1) Replication: System maintains multiple copies of data,
stored in different sites, for grind faster retrieval and fault tolerance.
(2) Fragmentation: Relation is partitioned
into several fragments stored in distinct sites.
• Concept: Data replication means storing a
copy or replica of a relation fragments in two or more sites.
• There are two methods of data
replication replication. (1) Full replication (2) Partial
replication
• Full replication: In this approach the
entire relation is stored at all the sites. In this approach full redundant
databases are those in which every site contains a copy of entire database.
• Partial replication: In this approach only
some fragments of relation are replicated on the sites.
Advantages:
(1) Availability: Data
replication facilitates increased availability of data.
(2) Parallelism: Queries can be processed by several sites in
parallel.
(3) Faster accessing: The relation r is locally available at each site, hence data accessing
becomes faster.
Disadvantages:poi
(1) Increased cost of update: The major disadvantage of data
replication is increased betcost of updated. That means each replica of
relation r must be updated from all the sites if user makes a request for some
updates in relation.
(2) Increased complexity in concurrency control: It becomes complex to
implement the concurrency control mechanism for all the sites containing
replica.
• Concept: Data fragmentation is a division of relation r into fragments r1,r2,
r3,...,rn which contain sufficient information to reconstruct relation r.
• There are two approaches of data fragmentation -
(1) Horizontal fragmentation and (2) Vertical fragmentation.
• Horizontal fragmentation: In this approach, each tuple of r is assigned to one or more fragments.
If relation R is fragmented in r1 and r2 fragments, then to bring these
fragments back to R we must use union operation. That means R=r1ur2
• Vertical fragmentation: In this approach, the relation r is fragmented based on one or more
columns. If relation R is fragmented into r1 and r2 fragments using vertical
fragmentation then to bring these fragments back to original relation R we must
use join operation. That means R= r1 r2
• For example - Consider following
relation r
Student(RollNo, Marks, City)
The values in this schema are inserted as
Horizontal Fragmentation 1:
SELECT * FROM Student WHERE Marks >50 AND
City='Pune'
We will get
Horizontal Fragmentation 2:
SELECT * FROM Student WHERE Marks >50 AND City="Mumbai'
We will get
Vertical Fragmentation 1 :
SELECT * FROM RollNo
Vertical Fragmentation 2:
SELECT * FROM city
Basic Concepts
In distributed system transaction initiated at one site can access or
update data at other sites. Let us discuss various basic concepts used during
transaction processing in distributed systems -
• Local and global transactions :
Local transaction Ti is said to be local if it is initiated
at site Si and can access or update data at site Si only.
Global transaction Ti initiated by site Si is said to be
global if it can access or update data at site Si, Sj,Sk and so on.
• Coordinating and participating sites:
The site at which the transaction is initiated is called coordinating
site. The participating sites are those sites at which the sub-transactions are
executing. For example - If site S1 initiates the transaction T1 then it is
called coordinating site. Now assume that transaction T1 (initiated at S1) can
access site S2 and S3. Then sites S2 and S3 are called participating sites.
To access the data on site S2, the transaction T1 needs
another transaction T12 on site S2 similarly to access the data on site S3, the
transaction T2 needs some transaction say T13 on site S3. Then transactions T12
and T13 are called sub-transactions. The above described scenario can be
represented by following Fig. 5.1.6.
• Transaction manager :
The transaction manager manages the execution of those transactions (or
subtransactions) that access data stored in a local site.
(1) To maintain the log for recovery purpose.
(2) Participating in coordinating the concurrent execution of the
transactions executing balls at that site.
• Transaction coordinator:
The transaction coordinator coordinates the execution of the
various transactions (both local and global) initiated at that site.
The tasks of Transaction coordinator are -
(1) Starting the execution of transactions that originate at the site.
(2) Distributing subtransactions at appropriate sites for
execution
Let TC denotes the transaction coordinator and TM denotes the
transaction manager, then the system architecture can be represented as,
There are four types of failure modes,
1. Failure of site
2. Loss of messages
3. Failure of communication link
4. Network partition
The most common type of failure in distributed system is
loss or corruption of messages. The system uses Transmission Control
Protocol(TCP) to handle such error. This is a standard connection oriented
protocol in which message is transmitted from one end to another using wired
connection.
• If two nodes are not
directly connected, messages from one to another must be routed through
sequence of communication links. If the communication link fails, the messages
are rerouted by alternative links.
• A system is partitioned if it has
been split into two subsystems. This is called partitions. Lack of connection
between the subsystems also cause failure in distributed system.
Two Phase Commit Protocol
• The atomicity is an important property of any
transaction processing. What is this atomicity property? This property means
either the transaction will execute completely or it won't execute at all.
• The commit protocol ensures the atomicity across
the sites in following ways -
i) A transaction which executes at multiple sites must
either be committed at all the sites, or aborted at all the sites.
ii) Not acceptable to have a transaction committed at one site and aborted at another.
• There are two types of important
sites involving in this protocol -
• One Coordinating site
• One or more participating sites.
Two phase commit protocol
This protocol works in two phases - i) Voting phase and ii) Decision
phase.
Phase 1: Obtaining decision or voting phase
Step 1: Coordinator
site Ci asks all participants to prepare to commit transaction Ti.
• Ci adds the records <prepareT> to the log and writes
the log to stable storage.
• It then sends prepare T messages to all participating sites at which T will get executed.
Step 2: Upon
receiving message, transaction manager at participating site determines if it
can commit the transaction
• If not, add a record <no T> to the log and send abort
T message to coordinating site Ci.
• If the transaction can be committed, then :
• Add the record <ready T> to the log
• Force all records for T to stable storage
• Send ready T message to
Ci.
Phase 2: Recoding decision phase
• T can be committed of Ci received a ready T message
from all the participating sites otherwise T must be aborted.o
• Coordinator adds a decision record, <commit
T> or <abort T>, to the log and forces record onto stable storage.
Once the record stable storage it is irrevocable (even if failures occur)
• Coordinator sends a message to each
participant informing it of the decision
(commit or abort)
• Participants take appropriate action locally.
Failure of site
There are various cases at which failure may occur,
(1) Failure of participating sites
• If any of the participating sites gets failed then
when participating site Si recovers, it examines the log entry made by it to
take the decision about executing transaction.
• If the log contains <commit T> record: participating site executes
redo (T)
• If the log contains <abort T> record: participating site executes
undo (T)
• If the log contains <ready T> record: participating site must
consult Coordinating site to take decision about execution of transaction T.
• If T committed, redo (T)
• If T aborted, undo (T)
• If the log of participating site contains no record
then that means Si gets failed before responding to Prepare T message from
coordinating site. In this case it must abort T
(2) Failure of coordinator
• If coordinator fails while the commit protocol for
T is executing then participating sites must take decision about execution
of transaction T:
i) If an active participating
site contains a <commit T> record in its log, then T site must be
committed.
ii) If an active participating site contains an <abort T> record
in its log, then T must be aborted.
iii) If some active participating site does not contain a <ready
T> record in its log, then the failed coordinator Ci cannot have decided to
commit T. Can therefore abort T.
iv) If none of the above cases holds, then all participating active
sites must have a <ready T> record in their logs, but no additional
control records (such as <abort T> of <commit T>). In this case
active sites must wait for coordinator site Ci to recover, to find decision.
Two phase locking protocol has blocking problem.
What is blocking problem?
It is a stage at which active participating sites may have to wait for
failed coordinator site to recover.
The solution to this problem is to use three phase locking protocol.
Three Phase Commit Protocol
• The three phase locking
is an extension of two phase locking protocol in which eliminates the blocking
problem.
• Various assumptions that
are made for three phase commit protocol are -
• No network partitioning.
• At any point at least one site must be up.
• At the most k sites (participating as well as coordinating) can fail.
• Phase 1: This phase is similar to phase 1 of two phase protocol. That means
Coordinator site Ci asks all participants to prepare to commit transaction Ti.
The coordinator then makes the decision about commit or abort based on the response
from all the participating sites.
• Phase 2: In phase 2 coordinator makes a decision as in 2 Phase Commit which is
called the pre-commit decision <Pre-commit, T>, and records it in
multiple (at least K) participating sites.
• Phase 3: In phase 3, coordinator sends commit/abort message to all participating
Brits sites.
• Under three phase protocol, the
knowledge of pre-commit decision can be used to commit despite coordinator site
failure. That means if the coordinating site in case gets failed then one of
the participating site becomes the coordinating site and der consults other
participating sites to know the Pre-commit message which they possess. Thus
using this pre-commit t message the decision about commit/abort is taken by
this new coordinating site.
• This protocol avoids blocking problem as long as
less than k sites fail.
Advantage of three phase commit protocol
(1) It avoid blocking problem.
Disadvantage of three phase commit protocol
(1) The overhead is increased.
Distributed database query is processed using following
steps-
(1) Query Mapping:
• The input query on distributed data is specified
using query language.
• This query language is then translated into
algebraic query.
• During this translation the global conceptual
schema is referred.
• During the translation some important actions such
as normalization, analysis for semantic errors, simplification are carried out
then input query is restructured into algebraic query.
(2) Localization:
• In this step, the replication of information is
handled.
• The distributed query is mapped on
the global schema to separate queries on individual fragments.
(3) Global Query Optimization: optimization means selecting a strategy from list of candidate queries
which is closest to optimal. For optimization, the cost is computed. The total
cost is combination of CPU cost, I/O cost and communication costs.
(4) Local Query Optimization: This step is common to all sites in
distributed database. The techniques of local query optimization are similar to
those used in centralized systems.
Review Questions
1. What are the various features of distributed
database versus centralized database system? AU:
Dec.-17, Marks 6, May-17, Marks 8
2. Explain the architecture of a distributed
database. AU: Dec.-16, Marks 7
3. Explain about distributed databases and their
characteristics, functions and advantages and disadvantages. AU:
Dec.-07, May-14, Marks 8, May-16, Marks 16
4. Explain design of distributed database. AU: Dec.-04, Marks 8
5. Discuss homogeneous and heterogeneous databases reference
to distributed databases. AU:
May-03, Marks 8
6. Discuss in detail about the
distributed databases. AU: May-19,
Marks 13
7. What are data fragmentations? Explain various
approaches for fragmenting a relation with example. AU:
May-06, Marks 6
8. Explain in detail various approaches used for
storing a relation in distributed databases. AU: Dec.-04, Marks 8, Dec.-19, Marks
9
Database Management System: Unit V: Advanced Topics : Tag: : Advanced Topics - Database Management System - Distributed Databases
Database Management System
CS3492 4th Semester CSE Dept | 2021 Regulation | 4th Semester CSE Dept 2021 Regulation