What is Data Profiling?
Data profiling is the process of reviewing data to assess its structure, content, interrelationships, and quality. It provides data scientists with a high-level view of a data set's overall level of quality. Data profiling is generally not a manual process — instead, data scientists use analytical algorithms to examine the following characteristics:
- Data types.
- Mean.
- Minimum and maximum.
- Percentile.
- Frequency.
- Lengths.
- Recurring patterns.
- Metadata.
- Distributions.
- Key candidates and foreign-key candidates.
- Functional dependencies.
- Embedded value dependencies.
- Anomalies.
Common use cases for data profiling include:
- Preparing data for use in a data warehouse or a business intelligence project.
- Identifying data quality issues during conversion and migration projects.
- Performing a general assessment of a source system's data quality.
Why is Data Profiling Important?
Data profiling plays a critical role in ensuring the quality and integrity of the data within a system. Without data profiling, a business may end up making crucial strategic decisions based on faulty or incomplete data. With data profiling, however, an organization may benefit from:
- Improved data quality.
- More accurate, credible data.
- More effective strategic decision making.
- More efficient, streamlined database management.
Data Mining vs. Data Profiling
Although there is some overlap between data profiling and data mining, the end goal of each process is different. Data mining simply identifies patterns or trends in a data set. Data profiling, meanwhile, analyzes and summarizes a data set's characteristics with the goal of supporting data management processes.
The Primary Data Profiling Techniques
There are many different data profiling techniques, but some of the most crucial include:
- Distinct count and percent: This identifies natural keys — values that identify records in a data set.
- Percent of zero/blank/null values: Flags data that's either missing or unknown.
- String Length: Identifies general data types and sizes in a database.
- Cardinality: Assesses relationships between different datasets, an important functionality for business intelligence tools.
- Key Integrity: Flags orphan keys in data, and ensures that keys are otherwise always present.
- Pattern/Frequency Distributions: Used to determine whether data fields are properly formatted.
These techniques each falls into one of the following categories:
- Structure Discovery: Validates consistency and formatting. Leverages basic mathematical checks and statistics to provide information about data accuracy and validity.
- Content Discovery: Assesses individual records to determine data quality and identify systemic issues such as an incorrectly formatting phone number or street address.
- Relationship Discovery: Identifies key relationships between and within datasets. This could include relationships between database tables or connections between cells in spreadsheets.
Understanding the Data Profiling Process
Data profiling can be a fairly complex process. How it looks also varies depending on the dataset being evaluated and which data profiling technique you use. It also tends to be an iterative process rather than an incremental one.
With that said, most data profiling tends to follow the same basic framework:
- Identify the data set and determine if it's suitable for analysis and profiling.
- Identify the data set's source, origin, and context.
- Inspect a sample of the data to identify its basic characteristics.
- Identify and address any quality issues present in the data set, a process known as data cleansing.
- Apply your chosen data profiling techniques.