Data Requirements Analysis

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

9.4 The Data Requirements Analysis Process

The data requirements analysis process employs a top-down approach that emphasizes business-driven needs, so the analysis is conducted to ensure the identified requirements are relevant and feasible. The process incorporates data discovery and assessment in the context of explicitly qualified business data consumer needs. Having identified the data requirements, candidate data sources are determined and their quality is assessed using the data quality assessment process described in chapter 11. Any inherent issues that can be resolved immediately are addressed using the approaches described in chapter 12, and those requirements can be used for instituting data quality control, as described in chapter 13.

The data requirements analysis process consists of these phases:

1. Identifying the business contexts
2. Conducting stakeholder interviews
3. Synthesizing expectations and requirements
4. Developing source-to-target mappings

Once these steps are completed, the resulting artifacts are reviewed to define data quality rules in relation to the dimensions of data quality described in chapter 8.
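
To make the link to data quality rules concrete, the sketch below shows how requirements-derived expectations might be expressed as simple, checkable rules tied to dimensions such as completeness and validity. It is a minimal illustration only: the field names, thresholds, sample records, and reference values are assumptions, not taken from the book.

```python
# Minimal sketch: expressing requirements-derived expectations as checkable
# data quality rules tied to dimensions (completeness, validity).
# Rule names, thresholds, and sample records are illustrative assumptions.

records = [
    {"customer_id": "C001", "email": "a@example.com", "country": "US"},
    {"customer_id": "C002", "email": None,            "country": "US"},
    {"customer_id": "C003", "email": "c@example.com", "country": "XX"},
]

VALID_COUNTRIES = {"US", "CA", "MX"}   # assumed reference data set

def completeness(records, field):
    """Fraction of records with a non-null value for the field."""
    populated = sum(1 for r in records if r.get(field) is not None)
    return populated / len(records)

def validity(records, field, valid_values):
    """Fraction of records whose value falls in the agreed reference set."""
    valid = sum(1 for r in records if r.get(field) in valid_values)
    return valid / len(records)

# Each rule: (dimension, description, measured score, required threshold)
rules = [
    ("completeness", "email must be populated",
     completeness(records, "email"), 0.95),
    ("validity", "country must be a known code",
     validity(records, "country", VALID_COUNTRIES), 1.00),
]

for dimension, description, score, threshold in rules:
    status = "PASS" if score >= threshold else "FAIL"
    print(f"[{status}] {dimension}: {description} "
          f"(score={score:.2f}, threshold={threshold:.2f})")
```

In practice such rules would be generated from the documented requirements and evaluated by a profiling or monitoring tool rather than ad hoc scripts.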

9.4.1 Identifying the Business Contexts

The business contexts associated with data consumption and reuse provide the scope for the determination of data requirements. Conferring with enterprise architects to understand where system boundaries intersect with lines of business will provide a good starting point for determining how (and under what circumstances) data sets are used.

Figure 9.2 shows the steps in this phase of the process:

Figure 9.2. Identifying the business contexts.

1.

Identify relevant stakeholders: Stakeholders may be identified through a review of existing system documentation or may be identified by the data quality team through discussions with business analysts, enterprise analysts, and enterprise architects. The pool of relevant stakeholders may include business program sponsors, business application owners, business process managers, senior management, information consumers, system owners, as well as frontline staff members who are the beneficiaries of shared or reused data.

2.

Acquire documentation: The data quality analyst must become familiar with overall goals and objectives of the target information platforms to provide context for identifying and assessing specific information and data requirements. To do this, it is necessary to review existing artifacts that provide details about the consuming systems, requiring a review of project charters, project scoping documents, requirements, design, and testing documentation. At this stage, the analysts should accumulate any available documentation artifacts that can help in determining collective data use.

3.

Document goals and objectives: Determining existing performance measures and success criteria provides a baseline representation of high-level system requirements for summarization and categorization. Conceptual data models may exist that can provide further clarification and guidance regarding the functional and operational expectations of the collection of target systems.

4.

Summarize scope of capabilities: Create graphic representations that convey the high-level functions and capabilities of the targeted systems and provide detail on functional requirements and target user profiles. When combined with other context knowledge, these representations can be assembled into a business context diagram or document that summarizes and illustrates the key data flows, functions, and capabilities of the downstream information consumers.

5.

Document impacts and constraints: Constraints are conditions that affect or prevent the implementation of system functionality, whereas impacts are potential changes to characteristics of the environment to accommodate the implementation of system functionality. Identifying and understanding all relevant impacts and constraints to the target systems are critical, because the impacts and constraints often define, limit, and frame the data controls and rules that will be managed as part of the data quality environment. Not only that, source-to-target mappings may be impacted by constraints or dependencies associated with the selection of candidate data sources.

The resulting artifacts describe the high-level functions of downstream systems, and how organizational data is expected to meet those systems' needs. Any identified impacts or constraints of the targeted systems, such as legacy system dependencies, global reference tables, existing standards and definitions, and data retention policies, will be documented. In addition, this phase will provide a preliminary view of global reference data requirements that may impact source data element selection and transformation rules. Time stamps and organization standards for time, geography, availability and capacity of potential data sources, frequency and approaches for data extractions, and transformations are additional data points for identifying potential impacts and requirements.

9.4.2 Conduct Stakeholder Interviews

Reviewing existing documentation only provides a static snapshot of what may (or may not) be true about the state of the data environment. A more complete picture can be assembled by collecting what might be deemed “hard evidence” from the key individuals associated with the business processes that use data. Therefore, our next phase (shown in Figure 9.3) is to conduct conversations with the previously identified key stakeholders, note their critical areas of concern, and summarize those concerns as a way to identify gaps to be filled in the form of data requirements.

Figure 9.3. Conducting stakeholder interviews.

This phase of the process consists of these five steps:

1.

Identify candidates and review roles: Review the general roles and responsibilities of the interview candidates to guide and focus the interview questions within their specific business process (and associated application) contexts.

2.

Develop interview questions: The next step in interview preparation is to create a set of questions designed to elicit the business information requirements. The formulation of questions can be driven by the context information collected during the initial phase of the process. There are two broad categories of questions: directed questions, which are specific and aimed at gathering details about the functions and processes within a department or area, and open-ended questions, which are less specific, often lead to dialogue and conversation, and focus on understanding the information requirements for operational management and decision making.

3.

Schedule and conduct interviews: Interviews with executive stakeholders should be scheduled earlier, because their time is difficult to secure. Information obtained during executive stakeholder interviews provides additional clarity regarding overall goals and objectives and may result in refinement of subsequent interviews. Interviews should be scheduled at a location where the participants will not be interrupted.

4.

Summarize and identify gaps: Review and organize the notes from the interviews, including the attendees list, general notes, and answers to the specific questions. By considering the business definitions that were clarified related to various aspects of the business (especially in relation to known reference data dimensions, such as time, geography, and regulatory issues), one continues to formulate a fuller determination of system constraints and data dependencies.

5.

Resolve gaps and finalize results: Completion of the initial interview summaries will identify additional questions or clarifications required from the interview candidates. At that point the data quality practitioner can cycle back with the interviewee to resolve outstanding issues.

Once any outstanding questions have been answered, the interview results can be combined with the business context information (as described in section 9.4.1) to enable the data quality analyst to define specific steps and processes for requesting and documenting business information requirements.

9.4.3 Synthesize Requirements

This next phase synthesizes the results of the documentation scan and the interviews to collect metadata and data expectations as part of the business process flows. The analysts will review the downstream applications' use of business information (as well as questions to be answered) to identify named data concepts and types of aggregates, and associated data element characteristics.

Figure 9.4 shows the sequence of these steps:

Figure 9.4. Synthesizing the results.

1.

Document information workflow: Create an information flow model that depicts the sequence, hierarchy, and timing of process activities. The goal is to use this workflow to identify locations within the business processes where data quality controls can be introduced for continuous monitoring and measurement.

2.

Identify required data elements: Reviewing the business questions will help segregate the required (or commonly used) data concepts (party, product, agreement, etc.) from the characterizations or aggregation categories (e.g., grouped by geographic region). This drives the determination of required reference data and potential master data items.

3.

Specify required facts: These facts represent specific pieces of business information that are tracked, managed, used, shared, or forwarded to a reporting and analytics facility in which they are counted or measured (such as quantity or volume). In addition, the data quality analyst must document any qualifying characteristics of the data that represent conditions or dimensions used to filter or organize the facts (such as time or location). The metadata for these data concepts and facts will be captured within a metadata repository for further analysis and resolution (a sketch of such a metadata record appears after this list).

4.

Harmonize data element semantics: A metadata glossary captures all the business terms associated with the business workflows and classifies the hierarchical composition of any aggregated or analyzed data concepts. Most glossaries contain a core set of terms shared across similar projects along with additional project-specific terms. When possible, use existing metadata repositories to capture the approved organizational definition.
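
As a rough illustration of how the facts, qualifying dimensions, and glossary terms from steps 2 through 4 might be captured for later analysis, the sketch below uses simple in-memory structures. The field names and the toy "repository" are assumptions for illustration, not the API of any particular metadata tool.

```python
# Minimal sketch of capturing facts, qualifying dimensions, and glossary
# terms for later analysis. Field names and the in-memory "repository"
# are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class FactDefinition:
    name: str                      # business name of the measured fact
    description: str               # agreed business definition
    data_concept: str              # owning concept (party, product, agreement, ...)
    qualifying_dimensions: list    # conditions used to filter/organize the fact
    source_business_process: str   # workflow in which the fact is produced

@dataclass
class GlossaryTerm:
    term: str
    definition: str
    related_concepts: list = field(default_factory=list)

# A toy "metadata repository": keyed dictionaries stand in for a real tool.
metadata_repository = {"facts": {}, "glossary": {}}

def register_fact(fact: FactDefinition):
    metadata_repository["facts"][fact.name] = fact

def register_term(term: GlossaryTerm):
    metadata_repository["glossary"][term.term] = term

register_fact(FactDefinition(
    name="order_quantity",
    description="Number of units ordered on a single order line",
    data_concept="agreement",
    qualifying_dimensions=["time", "geographic_region", "product_category"],
    source_business_process="order capture",
))
register_term(GlossaryTerm(
    term="customer",
    definition="A party that has entered into at least one agreement",
    related_concepts=["party", "agreement"],
))

print(metadata_repository["facts"]["order_quantity"])
```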

The use of common terms becomes a challenge in data requirements analysis, particularly when common use precludes the existence of agreed-to definitions. These issues become acute when aggregations are applied to counts of objects that may share the same name but don't really share the same meaning. This situation will lead to inconsistencies in reporting, analyses, and operational activities, which in turn will lead to loss of trust in data. Harmonization and metadata resolution are discussed in greater detail in chapter 10.

9.4.4 Source-to-Target Mapping

The goal of source-to-target mapping is to clearly specify the source data elements that are used in downstream applications. In most situations, the consuming applications may use similar data elements from multiple data sources; the data quality analyst must determine if any consolidation and/or aggregation requirements (i.e., transformations) are required, and determine the level of atomic data needed for drill-down, if necessary. Any transformations specify how upstream data elements are modified for downstream consumption and business rules applied as part of the information flow. During this phase, the data analyst may identify the need for reference data sets. As we will see in chapter 10, reference data sets are often used by data elements that have low cardinality and rely on standardized values.
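
As a small, hedged illustration of that last point, the sketch below validates a low-cardinality, coded data element against a managed reference data set; the status codes and source values are invented for the example.

```python
# Minimal sketch: a low-cardinality data element validated against a managed
# reference data set of standardized values. Codes and source values are
# illustrative assumptions.
STATUS_REFERENCE = {"ACTIVE", "CLOSED", "SUSPENDED"}   # assumed standardized domain

def conforms_to_reference(value, reference_set, case_insensitive=True):
    """Check whether a source value conforms to the reference data set."""
    if value is None:
        return False
    normalized = value.upper() if case_insensitive else value
    return normalized in reference_set

source_values = ["active", "Closed", "pending", None]
for v in source_values:
    print(v, "->", conforms_to_reference(v, STATUS_REFERENCE))
```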

Figure 9.5 shows the sequence of these steps:

Figure 9.5. Source-to-target mapping.

1.

Propose target models: Evaluate the catalog of identified data elements and look for those that are frequently created, referenced, or modified. By considering both the conceptual and the logical structures of these data elements and their enclosing data sets, the analyst can identify potential differences and anomalies inherent in the metadata, and then resolve any critical anomalies across data element sizes, types, or formats. These will form the core of a data sharing model, which represents the data elements to be taken from the sources, potentially transformed, validated, and then provided to the consuming applications.

2.

Identify candidate data sources: Consult the data management teams to review the candidate data sources containing the identified data elements, and review the collection of data facts needed by the consuming applications. For each fact, determine whether it corresponds to a defined data concept or data element, exists in any data sets in the organization, or is a computed value (and if so, what are the data elements that are used to compute that value), and then document each potential data source.

3.

Develop source-to-target mappings: Because this analysis should provide enough input to specify which candidate data sources can be extracted, the next step is to consider how that data is to be transformed into a common representation that is then normalized in preparation for consolidation. The consolidation processes collect the sets of objects and prepare them for populating the consuming applications. During this step, the analysts enumerate which source data elements contribute to each target data element, specify the transformations to be applied, and note where the mapping relies on standardizations and normalizations revealed during earlier stages of the process; a sketch of such a mapping specification follows.
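
The following sketch illustrates what such a source-to-target mapping specification might look like in code. The element names, transformations, and sample record are assumptions made for the example; real mappings are usually captured in mapping documents or ETL metadata rather than inline functions.

```python
# Minimal sketch of a source-to-target mapping: which source data elements
# feed each target element, and what transformation is applied.
# Element names, transformations, and the sample record are assumptions.

def to_upper(value):
    return value.strip().upper() if isinstance(value, str) else value

def full_name(first, last):
    return f"{last.strip().upper()}, {first.strip().upper()}"

# Each entry: target element -> (contributing source elements, transformation)
source_to_target = {
    "CUSTOMER_NAME": (["first_name", "last_name"],
                      lambda r: full_name(r["first_name"], r["last_name"])),
    "COUNTRY_CODE":  (["country"],
                      lambda r: to_upper(r["country"])),
    "ORDER_TOTAL":   (["unit_price", "quantity"],
                      lambda r: round(r["unit_price"] * r["quantity"], 2)),
}

def apply_mapping(source_record, mapping):
    """Produce a target record by applying each mapped transformation."""
    return {target: transform(source_record)
            for target, (_, transform) in mapping.items()}

source_record = {"first_name": " Ada ", "last_name": "Lovelace",
                 "country": "gb", "unit_price": 19.99, "quantity": 3}
print(apply_mapping(source_record, source_to_target))
```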

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000099

Data Requirements Analysis

David Loshin, in Business Intelligence (Second Edition), 2013

Summary

This provides a good starting point in the data requirements analysis process that can facilitate the data selection process. By the end of these exercises (which may require multiple iterations), you may be able to identify source applications whose data subsystems contain instances that are suitable for integration into a business analytics environment. Yet there are still other considerations: just because the data sets are available and accessible does not mean they can satisfy the analytics consumers’ needs, especially if the data sets are not of a high enough level of quality. It is therefore also critically important to assess the data quality expectations and apply a validation process to determine if the quality levels of candidate data sources can meet the collected downstream user needs, and this will be covered in subsequent chapters.

URL: https://www.sciencedirect.com/science/article/pii/B9780123858894000077

Bringing It All Together

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

20.2.4 Data Requirements Analysis

Organizational data quality management almost does not make sense outside of the context of growing information reuse, alternating opinions regarding centralization/decentralization of data, or increasing scrutiny from external parties. The fact that data sets are reused for purposes that were never intended implies a greater need for identifying, clarifying, and documenting the collected data requirements from across the application landscape, as well as instituting accountability for ensuring that the quality characteristics expected by all data consumers are met.

Inconsistencies due to intermediate transformations and cleansings have plagued business reporting and analytics, requiring recurring time investments for reviews and reconciliations. However, attempts to impose restrictions upstream are often pushed back, resulting in a less than optimal situation. Data requirements analysis is a process intended to accumulate data requirements from across the spectrum of downstream data consumers, demonstrating that all applications are accountable for making the best effort to ensure the quality of data for all downstream purposes and that the organization benefits as a whole when those requirements are met.

Whereas traditional requirements analysis centers on functional needs, data requirements analysis complements the functional requirements process and focuses on the information needs, providing a standard set of procedures for identifying, analyzing, and validating data requirements and quality for data-consuming applications. Data requirements analysis helps in:

Articulating a clear understanding of data needs of all consuming business processes,

Identifying relevant data quality dimensions associated with those data needs,

Assessing the quality and suitability of candidate data sources,

Aligning and standardizing the exchange of data across systems,

Implementing production procedures for monitoring the conformance to expectations and correcting data as early as possible in the production flow, and

Continually reviewing to identify improvement opportunities in relation to downstream data needs.

Analysis of system goals, objectives, and stakeholder desires is conducted to elicit business information characteristics that drive the definition of data and information requirements that are relevant, add value, and can be observed. The data requirements analysis process employs a top-down approach that incorporates data discovery and assessment in the context of explicitly qualified business data consumer needs. Candidate data sources are determined, assessed, and qualified within the context of the requirements, and any inherent issues that can be resolved immediately are addressed using the approaches described in chapter 12. The data requirements analysis process consists of these phases:

1. Identifying the business contexts
2. Conducting stakeholder interviews
3. Synthesizing expectations and requirements
4. Developing source-to-target mappings

Data quality rules defined as a result of the requirements analysis process can be engineered into the organization's system development life cycle (SDLC) for validation, monitoring, and observance of agreed-to data quality standards.

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000208

Domain modeling

Marco Brambilla, Piero Fraternali, in Interaction Flow Modeling Language, 2015

3.11.1 Designing the Core Subschema

The process of defining a core subschema from the description of the core concepts identified in the data requirements analysis is straightforward:

1.

The core concept is represented by a class (called core class).

2.

Properties with a single, atomic value become attributes of the core class. The identifying properties become the primary key of the core class.

3.

Properties with multiple or structured values become internal components of the core class.

Internal components are represented as classes connected to the core class via a part-of association. Two cases are possible, which differ in the multiplicity constraints of the association connecting the component to the core class:

1.

If the connecting association has a 1:1 multiplicity constraint for the component, the component is a proper subpart of the core concept. In this case, no instance of the internal component can exist in absence of the core class instance it belongs to, and multiple core objects cannot share the same instance of the internal component. Internal components of this kind are sometimes called “weak classes” in data modeling terminology, or “part-of components” in object-oriented terminology.

2.

If the association between the core class and the component has 0:∗ multiplicity for the internal component, the notion of “component” is interpreted in a broader sense. The internal component is considered a part of the core concept, even if an instance of it may exist independently of the connection to a core class instance and can be shared among different core objects. Nonetheless, the internal component is not deemed an essential data asset of the application and thus is not elevated to the status of a core concept.

Figure 3.17 illustrates the typical domain model of a core subschema, including one core class, two proper nonshared internal components, and one shared component.

Figure 3.17. Typical core subschema.

Note that a shared component may be part of one or more concepts, but it is not treated as an independent object for the purpose of the application. Such a consideration is useful for building the front-end model, which should present or manage components as parts of their “enclosing” core concepts and not as standalone objects.
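
As a rough code illustration of the mapping rules above, the sketch below represents a core class with one nonshared internal component and one shared component. The class names are invented, and Python dataclasses only approximate the UML multiplicity constraints that a proper domain model would enforce.

```python
# Rough illustration of a core subschema: a core class, a nonshared internal
# component (part-of, existing only inside one owner), and a shared component
# (attachable to many owners). Class names are invented; Python only
# approximates the UML multiplicity constraints described in the text.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Review:
    """Nonshared internal component: lives only inside one Product."""
    author: str
    text: str

@dataclass
class Category:
    """Shared component: may be attached to many Products, but is not
    treated as a core concept of the application in its own right."""
    name: str

@dataclass
class Product:
    """Core class; 'code' plays the role of the identifying property."""
    code: str                                                  # primary key
    name: str
    price: float
    reviews: List[Review] = field(default_factory=list)        # part-of, not shared
    categories: List[Category] = field(default_factory=list)   # shared component

electronics = Category("Electronics")
phone = Product("P-001", "Phone", 399.0)
phone.reviews.append(Review("alice", "Great battery life"))
phone.categories.append(electronics)

tablet = Product("P-002", "Tablet", 499.0, categories=[electronics])  # shares Category
print(phone, tablet, sep="\n")
```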

URL: https://www.sciencedirect.com/science/article/pii/B9780128001080000035

Inspection, Monitoring, Auditing, and Tracking

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

17.6 Putting It Together

Data quality incident management combines different technologies to enable proactive management of existing, known data quality rules derived from both the data requirements analysis and the data quality assessment processes, including data profiling, metadata management, and rule validation. The introduction of an incident management system provides a forum for collecting knowledge about emergent and outstanding data quality issues and can guide the governance activities to ensure that data errors are prioritized, the right individuals are notified, and that the actions taken are aligned with the expectations set out in the data quality service level agreement.

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000178

Remediation and Improvement Planning

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

12.1 Triage

Limitations to staffing will influence the data quality team to consider the best allocation of resources to address issues. There will always be a backlog of issues for review and consideration, revealed either by direct reports from data consumers or results of data quality assessments. But in order to achieve the “best bang for the buck,” and most effectively use the available staff and resources, one can prioritize the issues for review and potential remediation as a by-product of weighing feasibility and cost effectiveness of a solution against the recognized business impact of the issue. In essence, one gets the optimal value when the lowest costs are incurred to resolve the issues with the greatest perceived negative impact.

When a data quality issue has been identified, the triage process will take into account these aspects of the identified issue:

Criticality: the degree to which the business processes are impaired by the existence of the issue

Frequency: how often the issue has appeared

Feasibility of correction: the likelihood of expending the effort to correct the results of the failure

Feasibility of prevention: the likelihood of expending the effort to eliminate the root cause or institute continuous monitoring to detect the issues

The triage process is performed to understand these aspects in terms of the business impact, the size of the problem, as well as the number of individuals or systems affected. Triage enables the data quality practitioner to review the general characteristics of the problem and business impacts in preparation for assigning a level of severity and priority.

12.1.1 The Prioritization Matrix

By its very nature, the triage process must employ some protocols for immediate assessment of any issue that has been identified, as well as prioritize those issues in the context of existing issues. A prioritization matrix is a tool that can help provide clarity for deciding relative importance, getting agreement on priorities, and then determining the actions that are likely to provide best results within appropriate time frames. Collecting data about the issue's criticality, frequency, and the feasibility of the corrective and preventative actions enables a more confident decision-making process for prioritization.

Different approaches can be taken to assemble a prioritization matrix, especially when determining weighting strategies and allocations. In the example shown in Table 12.1, the columns of the matrix show the evaluation criteria and there is one row for each data quality issue. Weights are assigned to the criteria based on the degree to which each score contributes to the overall prioritization; here, the highest weight is assigned to criticality. The data quality practitioner gathers information as input to the scoring process, and each criterion's weighted score is calculated and summed into the total.

Table 12.1. Example Prioritization Matrix

Criteria | Criticality (Weight = 4) | Frequency (Weight = 1) | Correction Feasibility (Weight = 1) | Prevention Feasibility (Weight = 2) | Total
Issues | Score / Weighted score | Score / Weighted score | Score / Weighted score | Score / Weighted score |

The weights must be determined in relation to the business context and the expectations as directed by the results of the data requirements analysis process (as discussed in chapter 9). As these requirements are integrated into a data quality service level agreement (or DQ SLA, as is covered in chapter 13), the criteria for weighting and evaluation are adjusted accordingly. In addition, the organization's level of maturity in data quality and data governance may also inform the determination of scoring protocols as well as weightings.
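
A minimal sketch of the weighted scoring behind such a matrix appears below. The weights mirror those shown in Table 12.1; the issues, raw scores, and the 1-to-5 scoring scale are illustrative assumptions.

```python
# Minimal sketch of the weighted scoring behind a prioritization matrix.
# Weights mirror Table 12.1; the issues, raw scores, and 1-5 scale are
# illustrative assumptions.

weights = {"criticality": 4, "frequency": 1,
           "correction_feasibility": 1, "prevention_feasibility": 2}

issues = {
    "Duplicate customer records": {"criticality": 4, "frequency": 3,
                                   "correction_feasibility": 2,
                                   "prevention_feasibility": 4},
    "Missing postal codes":       {"criticality": 2, "frequency": 5,
                                   "correction_feasibility": 4,
                                   "prevention_feasibility": 3},
}

def total_score(raw_scores, weights):
    """Sum of each criterion's score multiplied by its weight."""
    return sum(raw_scores[criterion] * weight
               for criterion, weight in weights.items())

ranked = sorted(issues.items(),
                key=lambda item: total_score(item[1], weights),
                reverse=True)

for name, raw in ranked:
    print(f"{name}: total weighted score = {total_score(raw, weights)}")
```

The resulting totals can then be compared against agreed scoring bands when assigning a severity classification such as those shown later in Table 12.2.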

12.1.2 Gathering Knowledge

There may be little to no background information associated with any identified or reported data quality issue, so the practitioner will need to gather knowledge to evaluate the prioritization criteria, using guidance based on the data requirements. The assignment of points can be based on the answers to a sequence of questions intended to tease out the details associated with criticality and frequency, such as the following:

Have any business processes/activities been impacted by the data issue?

If so, how many business processes/activities are impacted by the data issue?

What business applications have failed as a result of the data issue?

If so, how many business processes have failed?

How many individuals are affected?

How many systems are affected?

What types of systems are affected?

How many records are affected?

How many times has this issue been reported? Within what time frame?

How long has this been an issue?

Then, based on the list of individuals and systems affected, the data quality analyst can review business impacts within the context of both known and newly discovered issues, asking questions such as these:

What are the potential business impacts?

Is this an issue that has already been anticipated based on the data requirements analysis process?

Has this issue introduced delays or halts in production information processing that must be performed within existing constraints?

Has this issue introduced delays in the development or deployment of critical business systems?

The next step is to evaluate what data sets have been affected and what, if any, immediate corrective actions need to be taken, such as whether any data sets need to be recreated, modified, or corrected, or if any business processes need to be rolled back to a previous state. The following types of questions are used in this evaluation:

Are there short-term corrective measures that can be taken to restart halted processes?

Are there long-term measures that can be taken to identify when the issue occurs in the future?

Are there system modifications that can be performed to eliminate the issue's occurrence altogether?

The answers to these questions will present alternatives for correction as well as prevention, which can be assessed in terms of their feasibility.

12.1.3 Assigning Criticality

Having collected knowledge about each issue, the data quality analyst can synthesize the intentions of the data quality requirements with what has been learned during the triage process to determine the level of severity and assign priority for resolution. The collected information can be used to populate the prioritization matrix, assign scores, and apply weights. Issues can be assigned a priority score based on the results of the weightings applied in the prioritization matrix. In turn, each issue can be prioritized, from both a relative standpoint (i.e., which issues take precedence compared to others) and an absolute standpoint (i.e., whether a specific issue is high or low priority). This prioritization can also be assigned in the context of those issues identified during a finite time period (“this past week”) or in relation to the full set of open data quality issues.

Data issue priority will be defined by the members of the various data governance groups. As an example, an organization may define four levels of priority, such as those shown in Table 12.2.

Table 12.2. Example Classifications of Severity or Criticality

Classification | Description | Implications
Business critical | The existence of a business critical problem prevents necessary business activities from completing, and it must be resolved before those activities can continue. | Addressing the issue demands immediate attention and overrules activities associated with issues of a lower priority.
Serious | Serious issues pose measurably high impacts to the business, but the issue does not prevent critical business processes from completing. | These issues require evaluation and must be addressed, but are superseded by business critical issues.
Tolerable | With tolerable issues, there are identified impacts to the business, but they require additional research to determine whether correction and elimination are economically feasible. | It is not clear whether the negative business impacts exceed the total costs of remediation; further investigation is necessary.
Acknowledged | Acknowledged issues are recognized and documented, but the scale of the business impact does not warrant additional investment in remediation. | It is clear that the negative business impacts do not exceed the total costs of remediation; no further investigation is necessary.

Depending on the scoring process, the weighting, and the assessment, any newly reported issue can be evaluated and assigned a priority that should direct the initiation of specific remediation actions. Issues can be recategorized as well. For example, issues categorized as tolerable may be downgraded to acknowledged once the evaluation determines that the costs for remediation exceed the negative impact. Similarly, once a work-around has been determined for a business critical issue, that issue may no longer prevent necessary business activities from continuing, in which case it could be reclassified as a serious issue.

URL: https://www.sciencedirect.com/science/article/pii/B9780123737175000129

Coordination

David Loshin, in Master Data Management, 2009

2.3 Stakeholders

Who are the players in an MDM environment? There are many potential stakeholders across the enterprise:

Senior management

Business clients

Application owners

Information architects

Data governance and data quality practitioners

Metadata analysts

System developers

Operations staff

Here we explore who the stakeholders are and what their expected participation should be over the course of program development.

2.3.1 Senior Management

Clearly, without the support of the senior management, it would be difficult to execute any enterprise activity. At the senior level, managers are motivated to demonstrate that their (and their teams’) performances have contributed to the organization's successful achievement of its business objectives. Transitioning to a master data environment should enable more nimbleness and agility in both ensuring the predictable behavior of existing applications and systems and rapidly developing support for new business initiatives. This core message drives senior-level engagement.

Senior management also plays a special role in ensuring that the rest of the organization remains engaged. Adopting a strategic view to oversee the long-term value of the transition and migration should trump short-term tactical business initiatives. In addition, the senior managers should also prepare the organization for the behavioral changes that will be required by the staff as responsibilities and incentives evolve from focusing on vertical business area success to how line-of-business triumphs contribute to overall organizational success.

2.3.2 Business Clients

For each of the defined lines of business, there are representative clients whose operations and success rely on the predictable, high availability of application data. For the most part, unless the business client is intricately involved in the underlying technology associated with the business processes, it almost doesn't matter how the system works, but rather that the system works. Presuming that the data used within the existing business applications meet the business user's expectations, incorporating the business client's data into a master repository is only relevant to the business client if the process degrades data usability.

However, the business client may derive value from improvements in data quality as a by-product of data consolidation, and future application development will be made more efficient when facilitated through a service model that supports application integration with enterprise master data services. Supporting the business client implies a number of specific actions and responsibilities, two of which are particularly relevant. First, the MDM program team must capture and document the business client's data expectations and application service-level expectations and assure the client that those expectations will be monitored and met. Second, because it is essential for the team to understand the global picture of master object use, it is important for the technical team to assess which data objects are used by the business applications and how those objects are used. Therefore, as subject matter experts, it is imperative that the business clients participate in the business process modeling and data requirements analysis process.

2.3.3 Application Owners

Any applications that involve the use of data objects to be consolidated within an MDM environment will need to be modified to adjust to the use of master data instead of local versions or replicas. This means that the use of the master data asset must be carefully socialized with the application owners, because they become the “gatekeepers” to MDM success. As with the business owners, each application owner will be concerned with ensuring predictable behavior of the business applications and may even see master data management as a risk to continued predictable behavior, as it involves a significant transition from one underlying (production) data asset to a potentially unproven one.

The application owner is a key stakeholder, then, as the successful continued predictable operation of the application depends on the reliability and quality of the master repository. When identifying data requirements in preparation for developing a master data model, it will be necessary to engage the application owner to ensure that operational requirements are documented and incorporated into the model (and component services) design.

2.3.4 Information Architects

Underlying any organizational information initiative is a need for information models in an enterprise architecture. The models for master data objects must accommodate the current needs of the existing applications while supporting the requirements for future business changes. The information architects must collaborate to address both aspects of application needs and fold those needs into the data requirements process for the underlying models and the representation framework that will be employed.

2.3.5 Data Governance and Data Quality

An enterprise initiative introduces new constraints on the ways that individuals create, access and use, modify, and retire data. To ensure that these constraints are not violated, the data governance and data quality staff must introduce stewardship, ownership, and management policies as well as the means to monitor observance to these policies.

A success factor for MDM is its ubiquity; the value becomes apparent to the organization as more lines of business participate, both as data suppliers and as master data consumers. This suggests that MDM needs governance to encourage collaboration and participation across the enterprise, but it also drives governance by providing a single point of truth. Ultimately, the use of the master data asset as an acknowledged high-quality resource is driven by transparent adherence to defined information policies specifying the acceptable levels of data quality for shared information. MDM programs require some layer of governance, whether that means incorporating metadata analysis and registration, developing “rules of engagement” for collaboration, defining data quality expectations and rules, monitoring and managing quality of data and changes to master data, providing stewardship to oversee automation of linkage and hierarchies, or offering processes for researching root causes and the subsequent elimination of sources of flawed data.

2.3.6 Metadata Analysts

Metadata represent a key component to MDM as well as the governance processes that underlie it, and managing metadata must be closely linked to information and application architecture as well as data governance. Managing all types of metadata (not just technical or structural) will provide the “glue” to connect these together. In this environment, metadata incorporate the consolidated view of the data elements and their corresponding definitions, formats, sizes, structures, data domains, patterns, and the like, and they provide an excellent platform for metadata analysts to actualize the value proposed by a comprehensive enterprise metadata repository.

2.3.7 System Developers

Aspects of performance and storage change as replicated data instances are absorbed into the master data system. Again, the determination of the underlying architecture approach will impact production systems as well as new development projects and will change the way that the application framework uses the underlying data asset (as is discussed in Chapters 9, 11, and 12). System analysts and developers will need to restructure their views of systemic needs as the ability to formulate system services grows at the core level, at a level targeted at the ways that conceptual data objects are used, and at the application interface level.

2.3.8 Operations Staff

One of the hidden risks of moving toward a common repository for master data is the fact that often, to get the job done, operations staff may need to bypass the standard protocols for data access and modification. In fact, in some organizations, this approach to bypassing standard interfaces is institutionalized, with metrics associated with the number of times that “fixes” or modifications are applied to data using direct access (e.g., updates via SQL) instead of going through the preferred channels.

Alternatively, desktop applications are employed to supplement existing applications and as a way to gather the right amount of information to complete a business process. Bypassing standard operating procedures and desktop supplements pose an interesting challenge to the successful MDM program, in absorbing what might be termed “finely grained distributed data” into the master framework as well as taming the behavior that essentially allows for leaks in the enterprise master data framework. In other words, the folks with their boots on the ground may need to change their habits as key data entities are captured and migrated into a master environment.

URL: https://www.sciencedirect.com/science/article/pii/B9780123742254000023

Data Consolidation and Integration

David Loshin, in Master Data Management, 2009

10.6 Consolidation

Consolidation is the result of the tasks applied to data integration. Identifying values are parsed, standardized, and normalized across the same data domains; subjected to classification and blocking schemes; and then submitted to the unique identification service to analyze duplicates, look for hierarchical groupings, locate an existing record, or determine that one does not exist. The existence of multiple instances of the same entity raises some critical questions shown in the sidebar.

The answers to the questions frame the implementation and tuning of matching strategies and the resulting consolidation algorithms. The decision to merge records into a single repository depends on a number of different inputs, and these are explored in greater detail in Chapter 9.

Critical Questions about Multiple Instances of Entities

What are the thresholds that indicate when matches exist?

When are multiple instances merged into a single representation in a master repository as opposed to registration within a master registry?

If the decision is to merge into a single record, are there any restrictions or constraints on how that merging may be done?

At what points in the processing stream is consolidation performed?

If merges can be done in the hub, can consuming systems consume and apply that merge event?

Which business rules determine which values are forwarded into the master copy—in other words, what are the survivorship rules?

If the merge occurs and is later found to be incorrect, can you undo the action?

How do you apply transactions against the merged entity to the subsequently unmerged entities after the merge is undone?

10.6.1 Similarity Thresholds

When performing approximate matching, what criteria are used for distinguishing a match from a nonmatch? With exact matching, it is clear whether or not two records refer to the same object. With approximate matching, however, there is often not a definitive answer, but rather some point along a continuum indicating the degree to which two records match. Therefore, it is up to the information architect and the business clients to define the point at which two values are considered to be a match, and this is specified using a threshold score.

When identifying values are submitted to the integration service, a search is made through the master index for potential matches, and then a pair-wise comparison is performed to determine the similarity score. If that similarity score is above the threshold, the pair is considered a match. We can be more precise and define three score ranges: a high threshold above which the pair is considered a match; a low threshold below which the pair is considered a nonmatch; and any scores between those thresholds, which require manual review to determine whether the identifying values should be matched or not.
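
A minimal sketch of this three-zone decision appears below, assuming a crude string-similarity score and invented threshold values; production matching services use much richer scoring across multiple identifying attributes.

```python
# Minimal sketch of the three-zone threshold decision: match, nonmatch, or
# manual review. The similarity function and threshold values are
# illustrative assumptions.
from difflib import SequenceMatcher

HIGH_THRESHOLD = 0.90   # at or above: automatic match
LOW_THRESHOLD = 0.70    # below: automatic nonmatch

def similarity(a: str, b: str) -> float:
    """Crude pairwise similarity score between two identifying values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_decision(a: str, b: str) -> str:
    score = similarity(a, b)
    if score >= HIGH_THRESHOLD:
        return f"match ({score:.2f})"
    if score < LOW_THRESHOLD:
        return f"nonmatch ({score:.2f})"
    return f"manual review ({score:.2f})"   # gray zone between the thresholds

print(match_decision("Jonathan Smith", "Jonathon Smith"))
print(match_decision("Jonathan Smith", "Acme Corporation"))
print(match_decision("Jon Smith", "Jonathan Smith"))
```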

Incorporating people into the matching process can have its benefits, especially in a learning environment. The user may begin by specifying initial thresholds, but as the process integrates user decisions about which kinds of questionable similarity values indicate matches and which do not, a learning heuristic may automatically adjust both the thresholds and the similarity scoring to yield finer accuracy of similarity measurement.

10.6.2 Survivorship

Consolidation, to some extent, implies merging of information, and essentially there are two approaches: on the one hand, there is value in ensuring the existence of a “golden copy” of data, which suggests merging multiple instances as a cleansing process performed before persistence (if using a hub). On the other hand, different applications have different requirements for how data is used, and merging records early in the work streams may introduce inconsistencies for downstream processing, which suggests delaying the merging of information until the actual point of use. These questions help to drive the determination of the underlying architecture.

Either way, operational merging raises the concept of survivorship. Survivorship is the process applied when two (or more) records representing the same entity contain conflicting information to determine which record's value survives in the resulting merged record. This process must incorporate data or business rules into the consolidation process, and these rules reflect the characterization of the quality of the data sources as determined during the source data analysis described in Chapter 2, the kinds of transactions being performed, and the business client data quality expectations as discussed in Chapter 5.

Ultimately, every master data attribute depends on the data values within the corresponding source data sets identified as candidates and validated through the data requirements analysis process. The master data attribute's value is populated as directed by a source-to-target mapping based on the quality and suitability of every candidate source. Business rules delineate valid source systems, their corresponding priority, qualifying conditions, transformations, and the circumstances under which these rules are applied. These rules are applied at different locations within the processing streams depending on the business application requirements and how those requirements have directed the underlying system and service architectures.
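
The sketch below illustrates one simple form of survivorship rule, in which the value from the highest-priority qualifying source survives and ties are broken by recency. The source names, priorities, and tie-breaking rule are assumptions for the example.

```python
# Minimal sketch of a survivorship rule: when candidate records conflict, the
# value from the highest-priority qualifying source survives, with ties broken
# by the most recent update. Source names, priorities, and sample records are
# illustrative assumptions.
SOURCE_PRIORITY = {"CRM": 1, "BILLING": 2, "LEGACY": 3}   # 1 = most trusted source

candidates = [
    {"source": "LEGACY",  "email": "old@example.com",  "updated": "2019-03-01"},
    {"source": "BILLING", "email": "bill@example.com", "updated": "2023-06-15"},
    {"source": "CRM",     "email": None,               "updated": "2024-01-10"},
]

def surviving_value(candidates, attribute):
    """Return the attribute value from the most trusted source that has one,
    preferring the most recently updated record among equally trusted sources."""
    qualified = [c for c in candidates if c.get(attribute) is not None]
    if not qualified:
        return None
    # Find the best (lowest) priority among candidates that carry a value.
    top_priority = min(SOURCE_PRIORITY.get(c["source"], 99) for c in qualified)
    same_priority = [c for c in qualified
                     if SOURCE_PRIORITY.get(c["source"], 99) == top_priority]
    # ISO date strings compare correctly as text, so max() picks the newest.
    return max(same_priority, key=lambda c: c["updated"])[attribute]

# CRM is the most trusted source but has no email, so the BILLING value survives.
print(surviving_value(candidates, "email"))   # -> bill@example.com
```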

Another key concept to remember with respect to survivorship is the retention policy for source data associated with the master view. Directed data cleansing and data value survivorship applied when each data instance is brought into the environment provides a benefit when those processes ensure the correctness of the single view at the point of entry. Yet because not all data instances imported into the system are used, cleansing them may turn out to be additional work that might not have been immediately necessary. Cleansing the data on demand would limit the work to what is needed by the business process, but it introduces complexity in managing multiple instances and history regarding when the appropriate survivorship rules should have been applied.

A hybrid approach is to apply the survivorship rules to determine each record's standard form, yet always maintain a record of the original (unmodified) input data. The reason is that a variation in a name or address provides extra knowledge about the master object, such as an individual's nickname or a variation in product description that may occur in other situations. Reducing each occurrence of a variation into a single form removes knowledge associated with potential aliased identifying data, which ultimately reduces your global knowledge of the underlying object. But if you can determine that the input data is just a variation of one (or more) records that are already known, storing newly acquired versions linked to the cleansed form will provide greater knowledge moving forward, as well as enabling traceability.

10.6.3 Integration Errors

We have already introduced the concepts of the two types of errors that may be encountered during data integration. The first type of error is called a false positive, and it occurs when two data instances representing two distinct real-life entities are incorrectly assumed to refer to the same entity and are inadvertently merged into a single master representation. False positives violate the uniqueness constraint that a master representation exists for every unique entity. The second type of error is called a false negative, and it occurs when two data instances representing the same real-world entity are not determined to match, with the possibility of creating a duplicate master representation. False negatives violate the uniqueness constraint that there is one and only one master representation for every unique entity.

Despite the program's best laid plans, it is likely that a number of both types of errors will occur during the initial migration of data into the master repository, as additional data sets are merged in and as data come into the master environment from applications in production. Preparing for this eventuality is an important task:

Determine the risks and impacts associated with both types of errors and raise the level of awareness appropriately. For example, false negatives in a marketing campaign may lead to a prospective customer being contacted more than once, whereas a false negative for a terrorist screening may have a more devastating impact. False positives in product information management may lead to confused inventory management in some cases, whereas in other cases they may lead to missed opportunities for responding to customer requests for proposals.

Devise an impact assessment and resolution scheme. Provide a process for separating the unique identities from the merged instance upon identification of a false positive, in which two entities are incorrectly merged, and determine the distinguishing factors that can be reincorporated into the identifying attribute set, if necessary. Likewise, provide a means for resolving duplicated data instances and determining what prevented those two instances from being identified as the same entity.

These tasks both suggest maintaining historical information about the way that the identity resolution process was applied, what actions were taken, and ways to unravel these actions when either false positives or false negatives are identified.

10.6.4 Batch versus Inline

There are two operational paradigms for data consolidation: batch and inline. The batch approach collects static views of a number of data sets and imports them into a single location (such as a staging area or loaded into a target database), and then the combined set of data instances is subjected to the consolidation tasks of parsing, standardization, blocking, and matching, as described in Section 10.4. The inline approach embeds the consolidation tasks within operational services that are available at any time new information is brought into the system. Inlined consolidation compares every new data instance with the existing master registry to determine if an equivalent instance already exists within the environment. In this approach, newly acquired data instances are parsed and standardized in preparation for immediate comparison against the versions managed within the master registry, and any necessary modifications, corrections, or updates are applied as the new instance either is matched against existing data or is identified as an entity that has not yet been seen.

The approaches taken depend on the selected base architecture and the application requirements for synchronization and for consistency. Batch consolidation is often applied as part of the migration process to accumulate the data from across systems that are being folded into a master environment. The batch processing allows for the standardization of the collected records to seek out unique entities and resolve any duplicates into a single identity. Inlined consolidation is the approach used in operations mode to ensure that as data come into the environment, they are directly synchronized with the master.
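
A minimal sketch of the inline flow appears below: each incoming record is standardized and compared against a master registry as it arrives, and is either linked to an existing master identifier or registered as a new entity. The standardization, scoring function, and threshold are illustrative assumptions.

```python
# Minimal sketch of inline consolidation: each incoming record is standardized
# and compared against the master registry as it arrives, then linked to an
# existing master entry or registered as a new one. The standardization,
# scoring, and threshold are illustrative assumptions.
import uuid
from difflib import SequenceMatcher

master_registry = {}   # master_id -> standardized identifying value
MATCH_THRESHOLD = 0.90

def standardize(name: str) -> str:
    return " ".join(name.lower().split())

def consolidate_inline(raw_name: str) -> str:
    """Return the master id for the incoming record, creating one if needed."""
    candidate = standardize(raw_name)
    for master_id, existing in master_registry.items():
        score = SequenceMatcher(None, candidate, existing).ratio()
        if score >= MATCH_THRESHOLD:
            return master_id                 # matched an existing master entry
    new_id = str(uuid.uuid4())               # no match: register a new entity
    master_registry[new_id] = candidate
    return new_id

print(consolidate_inline("ACME  Corporation"))
print(consolidate_inline("Acme Corporation"))   # links to the same master id
```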

10.6.5 History and Lineage

Knowing that both false positives and false negatives will occur directs the inclusion of a means to roll back modifications to master objects on determination that an error has occurred. The most obvious way to enable this capability is to maintain a full history associated with every master data value. In other words, every time a modification is made to a value in a master record, the system must log the change that was made, the source of the modification (e.g., the data source and set of rules triggered to modify the value), and the date and time that the modification was made. Using this information, when the existence of an error is detected, a lineage service can traverse the historical record for any master data object to determine at which point a change was made that introduced the error.

Addressing the error is more complicated, because not only does the error need to be resolved through a data value rollback to the point in time that the error was introduced, but any additional modifications dependent on that flawed master record must also be identified and rolled back. The most comprehensive lineage framework will allow for backtracking as well as forward tracking from the rollback point to seek out and resolve any possible errors that the identified flaw may have triggered. However, the forward tracking may be overkill if the business requirements do not insist on complete consistency—in this type of situation, the only relevant errors are the ones that prevent business tasks from successfully completing; proactively addressing potential issues may not be necessary until the impacted records are actually used.
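
The sketch below illustrates one way such attribute-level history might support rollback: every change is logged with its source and timestamp, and the value can be reconstructed as of any point in time. The log structure and field names are assumptions for the example.

```python
# Minimal sketch of attribute-level history that supports rollback: every
# change to a master value is logged with its source and timestamp, and a
# rollback replays the history only up to a chosen point. The log structure
# and field names are illustrative assumptions.
from datetime import datetime

history = []   # ordered change log (here, for a single master attribute)

def record_change(master_id, attribute, new_value, source, when):
    history.append({
        "master_id": master_id,
        "attribute": attribute,
        "value": new_value,
        "source": source,        # data source or rule that made the change
        "timestamp": when,
    })

def value_as_of(master_id, attribute, cutoff):
    """Reconstruct the value by replaying changes up to (and including) cutoff."""
    value = None
    for change in history:
        if (change["master_id"] == master_id
                and change["attribute"] == attribute
                and change["timestamp"] <= cutoff):
            value = change["value"]
    return value

record_change("M-1", "email", "a@old.com", "LEGACY load", datetime(2022, 1, 1))
record_change("M-1", "email", "a@new.com", "CRM feed rule R17", datetime(2023, 5, 2))
record_change("M-1", "email", "bad@typo.com", "manual fix", datetime(2024, 2, 9))

# Roll back to the state before the erroneous 2024 change was applied.
print(value_as_of("M-1", "email", datetime(2023, 12, 31)))   # -> a@new.com
```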

URL: https://www.sciencedirect.com/science/article/pii/B9780123742254000102

What are the main approaches of requirement analysis?

A detailed analysis of the requirements is provided by the business analyst using the approaches listed below:

State and project concepts
Data flow diagrams
Entity relationship diagrams

What is the best method for gathering requirements?

Common requirements gathering techniques for agile product teams include:

Interviews
Questionnaires or surveys
User observation
Document analysis
Interface analysis
Workshops
Brainstorming
Role-play

What are the two main techniques of requirement analysis?

Below is a list of different business requirements analysis techniques:

Business process modeling notation (BPMN)
UML (Unified Modeling Language)

What is requirement approach?

A requirements approach is a roadmap for developing the requirements for a project. The roadmap should cover all phases of requirements development:

Planning: develop a plan for gathering and communicating requirements.
Eliciting and validating: extract information to prepare to document the requirements.