Retrieving required data is second phase of data science project. Sometimes Data scientists need to go into the field and design a data collection process.
Retrieving Data
• Retrieving
required data is second phase of data science project. Sometimes Data
scientists need to go into the field and design a data collection process. Many
companies will have already collected and stored the data and what they don't
have can often be bought from third parties.
• Most
of the high quality data is freely available for public and commercial use.
Data can be stored in various format. It is in text file format and tables in
database. Data may be internal or external.
1. Start working on internal data, i.e. data
stored within the company
• First
step of data scientists is to verify the internal data. Assess the relevance
and quality of the data that's readily in company. Most companies have a
program for maintaining key data, so much of the cleaning work may already be
done. This data can be stored in official data repositories such as databases,
data marts, data warehouses and data lakes maintained by a team of IT
professionals.
• Data
repository is also known as a data library or data archive. This is a general
term to refer to a data set isolated to be mined for data reporting and
analysis. The data repository is a large database infrastructure, several
databases that collect, manage and store data sets for data analysis, sharing
and reporting.
• Data
repository can be used to describe several ways to collect and store data:
a) Data
warehouse is a large data repository that aggregates data usually from multiple
sources or segments of a business, without the data being necessarily related.
b) Data
lake is a large data repository that stores unstructured data that is
classified and tagged with metadata.
c) Data
marts are subsets of the data repository. These data marts are more targeted to
what the data user needs and easier to use.
d)
Metadata repositories store data about data and databases. The metadata
explains where the data source, how it was captured and what it represents.
e) Data
cubes are lists of data with three or more dimensions stored as a table.
Advantages of data repositories:
i. Data
is preserved and archived.
ii. Data
isolation allows for easier and faster data reporting.
iii.
Database administrators have easier time tracking problems.
iv.
There is value to storing and analyzing data.
Disadvantages of data repositories :
i.
Growing data sets could slow down systems.
ii. A
system crash could affect all the data.
iii.
Unauthorized users can access all sensitive data more easily than if it was
distributed across several locations.
2. Do not be afraid to shop around
• If
required data is not available within the company, take the help of other
company, which provides such types of database. For example, Nielsen and GFK
are provides data for retail industry. Data scientists also take help of
Twitter, LinkedIn and Facebook.
• Government's
organizations share their data for free with the world. This data can be of
excellent quality; it depends on the institution that creates and manages it.
The information they share covers a broad range of topics such as the number of
accidents or amount of drug abuse in a certain region and its demographics.
3. Perform data quality checks to avoid later
problem
•
Allocate or spend some time for data correction and data cleaning. Collecting suitable,
error free data is success of the data science project.
• Most
of the errors encounter during the data gathering phase are easy to spot, but
being too careless will make data scientists spend many hours solving data
issues that could have been prevented during data import.
• Data
scientists must investigate the data during the import, data preparation and
exploratory phases. The difference is in the goal and the depth of the
investigation.
• In
data retrieval process, verify whether the data is right data type and data is
same as in the source document.
• With
data preparation process, more elaborate checks performed. Check any shortcut
method is used. For example, check time and data format.
• During
the exploratory phase, Data scientists focus shifts to what he/she can learn
from the data. Now Data scientists assume the data to be clean and look at the
statistical properties such as distributions, correlations and outliers.
Foundation of Data Science: Unit I: Introduction : Tag: : Data Science - Retrieving Data
Foundation of Data Science
CS3352 3rd Semester CSE Dept | 2021 Regulation | 3rd Semester CSE Dept 2021 Regulation