Designing Data Integration in a Multidatabase with a Unified Data Model
Data integration within a multi database context involves several critical steps to ensure that disparate data sources can be effectively combined and queried. Below is a comprehensive breakdown of the process, enhanced with additional information to aid in your studies.
Source Schema Identification:
Identify the various schemas of data sources involved in the integration process. Understanding the structure and organization of these schemas is crucial for determining how data can be combined.
Sources may include databases, files, or external APIs, each with its unique schema.
Source Schema Reverse Engineering:
This involves analyzing the data sources' schemas to produce conceptual schemas, i.e., detailed representations of the data structures and their relationships.
Techniques like entity-relationship modeling or UML (Unified Modeling Language) diagrams can be used to visualize the schema components.
Conceptual Schemas Integration and Restructuring:
Merge the identified schemas into a unified conceptual schema. This step addresses potential conflicts between the schemas, such as naming inconsistencies or differing data types.
Conflict resolution strategies may include renaming attributes, converting data types, or creating a hierarchy of attributes.
Conceptual to Logical Translation:
Translate the integrated conceptual schema into a logical schema that can be implemented in a specific database management system.
This process often involves normalization, which organizes data to reduce redundancy and improve data integrity.
Mapping Between the Global Logical Schema and Individual Source Schemas:
Define the relationships between the global logical schema and each source schema. This step, also known as logical view definition, is essential for querying the integrated system.
Various mapping techniques, such as Global As View (GAV) and Local As View (LAV), can be employed to facilitate this process:
GAV (Global As View): In this approach, the global schema is defined in terms of the data sources. Each element of the global schema is expressed as a view (a query) over the source schemas.
LAV (Local As View): The global schema is created independently of the source schemas. Instead, each source schema is treated as a view of the global schema, defining how local data fits into the overarching model.
Post-Integration Query Answering through Data Views:
Once integration is complete, queries can be made against the global schema. The system will determine how to retrieve the corresponding data from the individual sources based on the defined mappings.
This may involve translating queries from the global schema context back into the source schemas.
Definition
A data integration system can be formally defined as a triple $\langle G, S, M \rangle$, where:
$G$: represents the global schema, which provides a unified view of the integrated data.
$S$: denotes the set of source schemas from which data is being integrated.
$M$: consists of the set of mappings that specify how the data from the sources relates to the global schema.
When a query is issued against the integrated system, it is formulated in terms of the global schema $G$. The challenge lies in accurately identifying which pieces of real data in the underlying data sources correspond to the virtual data defined in $G$.
When integrating data from sources that employ different data models (e.g., relational, XML, NoSQL), model transformation may be required. This involves converting data from one model to another to facilitate integration and querying. The mapping techniques discussed earlier (GAV and LAV) remain applicable, but additional considerations may arise depending on the complexity and characteristics of the different data models involved.
Global As View (GAV) Approach
In the Global As View (GAV) approach, the global schema is constructed based on the integration of the individual data source schemas. Each element of the global schema is explicitly defined in terms of queries over the data source schemas, meaning that the global schema is essentially a set of views derived from these sources.
Definition
A GAV mapping consists of a set of assertions that describe how each element of the global schema is computed from the source schemas. Formally, for every element $g$ in $G$, there is a corresponding query $q_S$ over the source schemas $S$:

$$g \leadsto q_S$$
This mapping ensures that each global schema element can be directly traced to specific data in the source schemas, making it easier to understand how global data is derived.
Example
Consider two data sources, SOURCE1 and SOURCE2: SOURCE1 stores products and their versions in the tables SOURCE1.Product and SOURCE1.Version, while SOURCE2 stores products in SOURCE2.Product. The global view GLOB-PROD is defined over these tables as follows:
```sql
CREATE VIEW GLOB-PROD AS
SELECT Code AS PCode, VersionCode AS VCode, Version.Name AS Name, Size, Color,
       Version.Description AS Description, CatID, Price, Stock
FROM SOURCE1.Product, SOURCE1.Version
WHERE Code = ProductCode
UNION
SELECT Code AS PCode, null AS VCode, Name, Size, Color, Description,
       Type AS CatID, Price, Q.ty AS Stock
FROM SOURCE2.Product
```
From this definition we can see how the global schema is constructed by combining data from the two sources. Queries against the global schema will be translated into queries against the source schemas, allowing the system to retrieve the relevant data from the sources. If we have a query like:
```sql
SELECT PCode, VCode, Price, Stock
FROM GLOB-PROD
WHERE Size = "V" AND Color = "Red"
```
the system will translate this query into the following form to retrieve the necessary data from the source schemas:
```sql
SELECT Code, VersionCode, Price, Stock
FROM SOURCE1.Product, SOURCE1.Version
WHERE Code = ProductCode AND Size = "V" AND Color = "Red"
UNION
SELECT Code, null, Price, Q.ty
FROM SOURCE2.Product
WHERE Size = "V" AND Color = "Red"
```
The GAV approach works well when the data sources are stable and well-defined, allowing for explicit mappings between the global schema and the source schemas. However, it can be challenging to extend the system when new data sources are added or existing sources change, as the global schema may need to be redefined to accommodate these modifications.
| Pros | Cons |
|------|------|
| Works well with stable data sources | Difficult to extend when new data sources need to be added |
| Mapping is explicit and clear | Requires redefinition of the global schema with source changes |
| Easier to optimize queries due to fixed mappings | Less flexible for dynamic or evolving data environments |
Local As View (LAV) Approach
In the Local As View (LAV) approach, the global schema is designed independently of the data source schemas. Instead of deriving the global schema from the sources (as in GAV), each data source schema is defined as a view over the global schema. The LAV approach allows for more flexibility, especially when new sources are added or existing sources change.
Definition
A LAV mapping is composed of assertions that define how each element of a source relates to the global schema. For every element $s$ in a source schema $S$, there is a query $q_G$ over the global schema $G$ that describes how this source element can be viewed in terms of the global schema:

$$s \leadsto q_G$$
This type of mapping indicates that the content of each source is described as a view over the global schema.
Example
Consider the same example with the two data sources SOURCE1 and SOURCE2, and the global schema GLOB-PROD. In the LAV approach, the source schemas are defined as views over the global schema:
```sql
-- SOURCE 1
CREATE VIEW SOURCE1.Product AS
SELECT PCode AS Code, Name, Description, Warnings, Notes, CatID
FROM GLOB-PROD

CREATE VIEW SOURCE1.Version AS
SELECT PCode AS ProductCode, VCode AS VersionCode, Size, Color, Name, Description, Stock, Price
FROM GLOB-PROD

-- SOURCE 2
CREATE VIEW SOURCE2.Product AS
SELECT PCode AS Code, Name, Size, Color, Description, CatID AS Type, Price, Stock AS Q.ty
FROM GLOB-PROD
```
In this case, the source schemas are defined as views over the global schema, indicating how the local data fits into the overarching model. Queries against the global schema will be processed by extracting the relevant data from the sources based on these views.
Query Processing in LAV
One of the key challenges in the LAV approach is query processing. Since queries are expressed in terms of the global schema, but the data resides in the sources, the system must perform reasoning to “invert” the views. Unlike in GAV, where queries can be unfolded directly through simple mappings, LAV requires more complex strategies.
Inverse View Problem:
The system must figure out how to retrieve relevant data from the sources, given that the mapping defines the opposite transformation. This means the system needs to “rewrite” queries in terms of the source schemas and combine the results accordingly.
Query Rewriting and Execution:
Query processing involves constructing a plan that rewrites the query into a set of subqueries to be executed against the sources. The system then combines these partial results to answer the original query. For example, if a query requires data from both SOURCE1.VERSION and SOURCE2.PRODUCT, the system needs to retrieve and integrate this information from both sources.
Reasoning for Query Execution:
Since there is no simple rule for unfolding the views (as in GAV), query execution requires more reasoning. The system needs to infer which data from the sources can satisfy the query based on the global schema, and how to combine results from different sources.
| Pros | Cons |
|------|------|
| Stability and Scalability: highly modular, suitable for frequently added or updated sources | Complex Query Processing: inverse transformation needed to retrieve data from sources |
| Flexibility in Evolving Systems: simplifies adding new data sources or modifying existing ones | Increased Complexity in Reasoning: requires significant computational effort and time for query execution |
The LAV approach provides greater flexibility and scalability than GAV, particularly in systems where data sources are dynamic and subject to change. However, it comes at the cost of more complex query processing, which requires reasoning and more sophisticated query rewriting techniques. This makes LAV a better choice for environments with evolving data sources, while GAV may be more appropriate for systems with stable and well-defined data sources.
Comparison Between GAV and LAV
| GAV | LAV |
|-----|-----|
| Mapping quality depends on how well the sources have been compiled into the global schema | Mapping quality depends on how well the sources have been characterized in relation to the global schema |
| When a source changes or a new one is added, the global schema may need to be redefined | High modularity and extensibility: source changes or additions only require updating the source definition |
| Query processing is straightforward due to direct mapping | Query processing is more complex and requires reasoning |
Example: Query Processing in LAV with Two Data Sources
Consider a schema with two sources, S1 and S2, and a global schema G.
S1 contains information about American comedies made after 1960, defined by the view R1(Title, Year, Director).
S2 contains information about movies made after 1990 and their reviews, defined by the view R2(Title, Review).
The global schema G combines data from both sources: conceptually, it covers movies (title, year, director, genre) together with their reviews. Suppose a query over G asks for the titles and reviews of the comedies produced since 1950. Since we are using the LAV approach, the query must be "rewritten" in terms of the views exposed by the data sources (S1 and S2). Here's how the query processing works:
Extract Titles from S1: S1 contains comedies filmed after 1960 with American directors. Therefore, we retrieve all the titles of comedies from R1.
Title ← R1(Title, Year, Director)
Extract Reviews from S2: S2 provides reviews for movies filmed after 1990. We retrieve all the reviews from R2.
Review ← R2(Title, Review)
Join the Results at the Global Level:
At the global level, we now need to join the results from S1 (comedies) and S2 (reviews) based on the movie title.
Due to the limited information in S1 and S2, the result will include only comedies directed by American directors and filmed after 1990, with their corresponding reviews. The information from S2 is incomplete with respect to the query: it only covers movies made after 1990, while the query asks for comedies since 1950, so no reviews can be returned for the earlier films.
In general, when using LAV, sources are often incomplete with respect to the global schema. The global schema might represent a broader scope (e.g., all comedies since 1950), but the available data sources may only cover a subset (e.g., comedies after 1960, or reviews after 1990). Therefore, the query results can be incomplete if the data sources do not fully represent the global schema. This is a typical scenario in data integration, where sources provide only partial information about what is expected in the global schema.
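To make this incompleteness concrete, here is a minimal Python sketch of the global-level join described above. The relation names R1 and R2 follow the example; the movie titles, years, and reviews are purely hypothetical illustrative data.

```python
# LAV-style query answering over incomplete sources (illustrative sketch).

# S1: American comedies made after 1960 -> R1(Title, Year, Director)
R1 = [
    ("Airplane!", 1980, "Abrahams"),
    ("Groundhog Day", 1993, "Ramis"),
]

# S2: movies made after 1990 and their reviews -> R2(Title, Review)
R2 = [
    ("Groundhog Day", "A perfect loop of a film."),
]

def comedies_with_reviews(since_year):
    """Global-level query: comedies produced since `since_year`, with reviews.
    Each source contributes only what it covers, so the answer may be incomplete."""
    reviews = {title: review for title, review in R2}
    return [
        (title, year, reviews[title])
        for title, year, _director in R1
        if year >= since_year and title in reviews
    ]

# The query asks for comedies since 1950, but S2 only knows post-1990 movies,
# so "Airplane!" (1980) has no review available and drops out of the join.
print(comedies_with_reviews(1950))  # [('Groundhog Day', 1993, 'A perfect loop of a film.')]
```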
Sound, Complete, and Exact Mappings
Definitions
A mapping defined over some data sources is sound when it provides a subset of the data available in the sources that corresponds to its definition.
A mapping is complete if it provides a superset of that data.
A mapping is exact if it provides all and only the data corresponding to its definition: it is both sound and complete.
In GAV (with integrity constraints) the mapping can be exact or only sound. In LAV, the global schema is defined independently of the sources' schemata, and the mapping can be exact or only complete: consider S2 in the earlier example, which contained products but not versions, i.e., only part of the information of that kind present in the global system.
The word "complete" may cause confusion. In the LAV case the mapping can be exact or only complete because of the incompleteness of one or more sources: the mapping is complete because the global schema covers the source contents, but the sources are incomplete because they do not cover all the data "expected" by the global schema, which has been defined independently of the source contents.
Operators
Take into account the following tables:
R1

| SSN | NAME | AGE | SALARY |
|-----|------|-----|--------|
| 123456789 | JOHN | 34 | 30K |
| 234567891 | KETTY | 27 | 25K |
| 345678912 | WANG | 39 | 32K |

R2

| SSN | NAME | SALARY | PHONE |
|-----|------|--------|-------|
| 234567891 | KETTY | 20K | 1234567 |
| 345678912 | WANG | 22K | 2345678 |
| 456789123 | MARY | 34K | 3456789 |
where R1 and R2 are two sources. On these tables, we can perform the following operations:
Outer union (R1 OU R2): the union of the two tables, over the attributes of both. If some attributes are missing in one of the tables, they are filled with NULL values.

| SSN | NAME | AGE | SALARY | PHONE |
|-----|------|-----|--------|-------|
| 123456789 | JOHN | 34 | 30K | NULL |
| 234567891 | KETTY | 27 | 25K | NULL |
| 345678912 | WANG | 39 | 32K | NULL |
| 234567891 | KETTY | NULL | 20K | 1234567 |
| 345678912 | WANG | NULL | 22K | 2345678 |
| 456789123 | MARY | NULL | 34K | 3456789 |
Join (R1 JOIN R2): the join of the two tables, over the attributes of both. This operator removes all the tuples whose SSN is not present in both tables. If two attributes have the same name, both are kept in the result (with different names).

| SSN | NAME | AGE | SALARY1 | SALARY2 | PHONE |
|-----|------|-----|---------|---------|-------|
| 234567891 | KETTY | 27 | 25K | 20K | 1234567 |
| 345678912 | WANG | 39 | 32K | 22K | 2345678 |
Outer join (R1 OJ R2): the (full) outer join of the two tables, over the attributes of both. This operator keeps all the tuples of both tables, filling with NULL the attributes that a tuple's source does not provide; when the two sources report conflicting values for the same attribute (here, SALARY for KETTY and WANG), the result contains NULL.

| SSN | NAME | AGE | SALARY | PHONE |
|-----|------|-----|--------|-------|
| 123456789 | JOHN | 34 | 30K | NULL |
| 234567891 | KETTY | 27 | NULL | 1234567 |
| 345678912 | WANG | 39 | NULL | 2345678 |
| 456789123 | MARY | NULL | 34K | 3456789 |
Generalization (R1 Ge R2): the generalization of the two tables, which keeps only the attributes common to both tables and all the tuples of both. As above, when the two sources report conflicting values for the same attribute, the result contains NULL.

| SSN | NAME | SALARY |
|-----|------|--------|
| 123456789 | JOHN | 30K |
| 234567891 | KETTY | NULL |
| 345678912 | WANG | NULL |
| 456789123 | MARY | 34K |
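As a rough illustration of how two of these operators behave, the following Python sketch applies an outer union and a generalization to the R1 and R2 tables above, represented as lists of dictionaries; None plays the role of NULL, and the conflict-to-NULL behaviour mirrors the result tables shown above. This is an illustrative re-implementation, not a definition of the operators.

```python
# R1 and R2 from the tables above, as lists of attribute/value dictionaries.
R1 = [
    {"SSN": "123456789", "NAME": "JOHN",  "AGE": 34, "SALARY": "30K"},
    {"SSN": "234567891", "NAME": "KETTY", "AGE": 27, "SALARY": "25K"},
    {"SSN": "345678912", "NAME": "WANG",  "AGE": 39, "SALARY": "32K"},
]
R2 = [
    {"SSN": "234567891", "NAME": "KETTY", "SALARY": "20K", "PHONE": "1234567"},
    {"SSN": "345678912", "NAME": "WANG",  "SALARY": "22K", "PHONE": "2345678"},
    {"SSN": "456789123", "NAME": "MARY",  "SALARY": "34K", "PHONE": "3456789"},
]

def outer_union(r1, r2):
    """Union of the tuples over the union of the attributes; missing values become None."""
    attrs = list(dict.fromkeys(a for t in r1 + r2 for a in t))
    return [{a: t.get(a) for a in attrs} for t in r1 + r2]

def generalization(r1, r2, key="SSN"):
    """Keep only the common attributes; merge tuples on the key and
    set conflicting values to None (NULL)."""
    common = [a for a in r1[0] if a in r2[0]]
    by_key = {}
    for t in r1 + r2:
        merged = by_key.setdefault(t[key], {})
        for a in common:
            if a in merged and merged[a] != t[a]:
                merged[a] = None            # conflict between the two sources
            else:
                merged.setdefault(a, t[a])
    return list(by_key.values())

print(outer_union(R1, R2))
print(generalization(R1, R2))
```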
Data Inconsistencies in Data Integration
When integrating data from multiple sources, inconsistencies can arise, leading to challenges in accurately representing real-world entities. These inconsistencies may occur whether a schema exists or not. When different databases contain instances of the same real-world object, discrepancies in the values can occur.
Record Linkage (Entity Resolution): The process of identifying and linking records that refer to the same real-world entities across different datasets. This step is crucial for ensuring that integrated data accurately reflects the entities they represent.
Data Fusion: Once it has been established that two items refer to the same entity, data fusion involves reconciling any inconsistent information between these records. This might include determining which values to retain, merge, or discard based on various criteria or rules.
Data integration frequently requires recognizing strings that refer to the same real-world entities, such as variations in names or addresses.
Example
For example, “Politecnico di Milano” and “Politecnico Milano” might refer to the same institution, while “Via Ponzio 34” and “Via Ponzio 34/5” may represent the same address.
Record Linkage (Entity Resolution)
Regardless of the data model used, identifying when two datasets contain overlapping or identical information is essential. Although the relational data model is often used in these discussions due to its prevalence, the principles are applicable across different models. When fields are distinct, it becomes easier to spot similarities. For example, if one dataset contains “Lonardo” and another contains “Leonardo,” recognizing that “Leonardo” is a campus of Politecnico would allow the system to infer that these represent the same entity.
To effectively perform record linkage, we need to determine which pairs of strings (representing entities) match. Each pair identified as referring to the same entity is called a match.
Similarity Measure:
A similarity measure $sim(x, y)$ quantifies how closely two strings (or sets of strings) match, resulting in a value that ranges from 0 to 1. Here:
A value of 1 indicates a perfect match (the strings are identical).
Values closer to 0 indicate a lower degree of similarity.
We can establish a threshold $t$ (where $0 < t \le 1$) to determine whether two strings match: if $sim(x, y) \ge t$, we consider $x$ and $y$ to be a match.
There are various types of similarity measures used for record linkage:
Sequence-Based Similarity: These measures focus on the sequences of characters in the strings. Examples include:
Edit Distance: Measures the minimum number of edits required to transform one string into another.
Needleman-Wunsch: Used for global sequence alignment.
Smith-Waterman: Used for local sequence alignment.
Jaro and Jaro-Winkler: Focus on the number and order of matching characters.
Set-Based Similarity: These measures treat strings as sets of elements. Examples include:
Overlap: Measures the size of the intersection of the two token sets.
Jaccard: Measures the size of the intersection divided by the size of the union, useful for comparing sets.
TF/IDF: Measures the importance of a term in a document relative to a collection of documents.
Hybrid Similarity: Combines elements of both sequence-based and set-based measures. Examples include:
Generalized Jaccard: An extension of the Jaccard index.
Soft TF/IDF: A modified version of TF/IDF that takes into account similarities in word usage.
Monge-Elkan: A measure designed for comparing two sets of words.
Phonetic Similarity: These measures determine how similar two strings sound when pronounced. An example is:
Soundex: An algorithm for encoding names based on their phonetic similarity.
Edit (Levenshtein) Distance
The edit distance (or Levenshtein distance) $d(x, y)$ is a widely used metric that quantifies the minimum number of operations required to transform one string $x$ into another string $y$. The allowable operations are:
Character Insertion: Adding a new character into the string.
Character Deletion: Removing an existing character from the string.
Character Replacement: Changing one character in the string to another character.
The similarity between two strings can be derived from the edit distance:

$$sim(x, y) = 1 - \frac{d(x, y)}{\max(|x|, |y|)}$$
Example
As an illustration (using the same strings as the set-based example below), the edit distance between "pino" and "pin" is $d = 1$: a single character deletion turns "pino" into "pin". The resulting similarity is $sim = 1 - 1/4 = 0.75$, which indicates a high degree of similarity between the two strings.
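A minimal dynamic-programming implementation of the edit distance and of the derived similarity, shown here in Python for illustration:

```python
# Levenshtein distance and the similarity sim(x, y) = 1 - d(x, y) / max(|x|, |y|).

def edit_distance(x: str, y: str) -> int:
    # dp[i][j] = minimum edits needed to turn x[:i] into y[:j]
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(y) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1   # replacement cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / replacement
    return dp[len(x)][len(y)]

def edit_similarity(x: str, y: str) -> float:
    return 1 - edit_distance(x, y) / max(len(x), len(y))

print(edit_distance("pino", "pin"))    # 1 (one deletion)
print(edit_similarity("pino", "pin"))  # 0.75
```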
Calculating the similarity for all pairs of strings in a dataset results in a quadratic time complexity, $O(n^2)$, where $n$ is the number of strings. Various algorithms and techniques, such as blocking or canopy clustering, have been proposed to optimize the matching process by reducing the number of comparisons needed.
Set-Based Similarity Measures
Set-based similarity measures treat strings as multisets of tokens. This involves tokenizing the strings and computing a similarity measure based on the resulting sets of tokens. For instance, if we tokenize strings into bigrams (substrings of length 2), the Jaccard similarity can be defined as follows:

$$J(x, y) = \frac{|B(x) \cap B(y)|}{|B(x) \cup B(y)|}$$

where $B(s)$ denotes the set of bigrams of the string $s$.
Example
For the strings "pino" and "pin" the sets of tokens are:

$$B(\text{pino}) = \{\text{pi}, \text{in}, \text{no}\}, \qquad B(\text{pin}) = \{\text{pi}, \text{in}\}$$

Using the Jaccard measure:

$$J(\text{pino}, \text{pin}) = \frac{|\{\text{pi}, \text{in}\}|}{|\{\text{pi}, \text{in}, \text{no}\}|} = \frac{2}{3} \approx 0.67$$

This indicates a moderate similarity based on the overlap of tokens.
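The same computation can be sketched in a few lines of Python (bigram tokenization followed by the Jaccard ratio):

```python
# Jaccard similarity over bigram token sets, matching the "pino"/"pin" example.

def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(x: str, y: str) -> float:
    bx, by = bigrams(x), bigrams(y)
    return len(bx & by) / len(bx | by)

print(bigrams("pino"))                    # {'pi', 'in', 'no'} (order may vary)
print(bigrams("pin"))                     # {'pi', 'in'}
print(round(jaccard("pino", "pin"), 2))   # 0.67
```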
Phonetic Similarity Measures
Phonetic similarity measures assess how alike two strings sound, rather than their literal character composition. One of the most common phonetic algorithms is Soundex. This method generates a four-character code based on the pronunciation of a word, grouping similar-sounding letters under the same code.
Soundex is particularly useful in genealogical research, where names may evolve phonetically over generations.
Example
For example, “Legoff” and “Legough” may yield the same Soundex code, indicating that they sound similar despite being spelled differently.
While Soundex can effectively identify similar-sounding names, it may lead to false positives with common words that sound alike, such as “coughing” and “coffin.” Additionally, phonetic measures are heavily language-dependent, as pronunciation can vary significantly between languages. For instance, the combination “gn” is pronounced differently in English (as in “gnome”) compared to Italian (where it’s pronounced as “ny”).
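Below is a simplified Soundex sketch in Python: it keeps the first letter, maps consonants to the standard digit groups, collapses runs of the same digit, and pads the code to four characters, but omits the official special rule for 'h' and 'w' between same-coded consonants. The Robert/Rupert pair is a classic illustration of two spellings mapping to the same code.

```python
# Simplified Soundex encoding (assumption: ASCII names, no 'h'/'w' rule).

CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name: str) -> str:
    name = name.lower()
    code, prev = name[0].upper(), CODES.get(name[0], "")
    for ch in name[1:]:
        digit = CODES.get(ch, "")          # "" for vowels, h, w, y
        if digit and digit != prev:
            code += digit
        prev = digit                       # a vowel resets the current run
    return (code + "000")[:4]              # pad / truncate to four characters

# Differently spelled names with the same pronunciation get the same code.
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```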
Record Matching
Record matching is a critical component in data integration and involves identifying and linking records that refer to the same real-world entity across different databases. Various approaches to record matching can be classified into three primary categories:
Rule-based Matching
Learning-based Matching (supervised or unsupervised)
Probabilistic Matching
Before implementing matching techniques, blocking is often employed. This process involves dividing the dataset into smaller subsets (blocks) of item pairs that are likely to match, significantly reducing the number of comparisons needed.
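A minimal sketch of blocking, assuming a cheap hypothetical blocking key (here simply the first letter of the name); only pairs that share a block are ever compared:

```python
# Blocking sketch: group records by a cheap key, then compare only within blocks.

from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> str:
    # Hypothetical blocking key: first letter of the name (a phonetic code
    # such as Soundex could be used instead).
    return record["NAME"][0].upper()

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for block in blocks.values():
        # Only records within the same block are ever compared.
        yield from combinations(block, 2)

records = [
    {"NAME": "KETTY", "SSN": "234567891"},
    {"NAME": "KATY",  "SSN": "234567891"},
    {"NAME": "WANG",  "SSN": "345678912"},
]
print(list(candidate_pairs(records)))   # only the KETTY/KATY pair is compared
```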
Rule-Based Matching
Rule-based matching relies on manually defined rules that specify conditions under which two tuples are considered a match. For example, two records may be identified as referring to the same individual if they have the same SSN. While straightforward, this method is time-consuming and labor-intensive since rules must be crafted for each attribute.
Example
Given two tables:

| SSN | NAME | AGE | SALARY | POSITION |
|-----|------|-----|--------|----------|
| 123456789 | JOHN | 34 | 30K | ENGINEER |
| 234567891 | KETTY | 27 | 25K | ENGINEER |
| 345678912 | WANG | 39 | 32K | MANAGER |

| SSN | NAME | AGE | SALARY | PHONE |
|-----|------|-----|--------|-------|
| 234567891 | KETTY | 25 | 20K | 1234567 |
| 345678912 | WANG | 38 | 22K | 2345678 |
| 456789123 | MARY | 42 | 34K | 3456789 |
A potential rule for matching could require identical SSNs, or combine per-attribute similarity scores into a weighted sum that must exceed a threshold; a sketch of such a rule follows below. Here, different similarity measures could be applied to each attribute based on its relevance.
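The sketch below illustrates such a rule in Python for the two tables above; the attribute weights, the threshold, and the use of difflib's ratio as the string similarity are illustrative choices, not values prescribed by the notes.

```python
# Hand-crafted matching rule: same SSN, or a weighted similarity above a threshold.

from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # Any string similarity measure could be plugged in per attribute.
    return SequenceMatcher(None, a, b).ratio()

def match(t1: dict, t2: dict, threshold: float = 0.8) -> bool:
    if t1["SSN"] == t2["SSN"]:
        return True
    score = 0.7 * sim(t1["NAME"], t2["NAME"]) + 0.3 * sim(t1["SALARY"], t2["SALARY"])
    return score >= threshold

t1 = {"SSN": "234567891", "NAME": "KETTY", "SALARY": "25K"}
t2 = {"SSN": "234567891", "NAME": "KETTY", "SALARY": "20K"}
print(match(t1, t2))   # True: identical SSN
```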
Learning Matching Rules
Learning-based matching can be conducted through supervised or unsupervised methods.
In supervised learning, a classification approach is used where a model is trained on labeled data. The process includes:
Training Phase:
Pairs of tuples are labeled as “yes” (if they match) or “no” (if they don’t).
The system learns the significance of each attribute during this phase.
Weight Definition: Each attribute’s importance in the final matching score is determined.
Application: The learned model is then applied to new pairs of tuples for matching.
While effective, supervised learning requires a substantial amount of training data, which can be a limitation.
In contrast, unsupervised learning approaches, such as clustering techniques, group similar records without the need for labeled training data. This method can identify patterns in the data to suggest potential matches based on inherent similarities.
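A small sketch of the supervised variant, assuming scikit-learn is available: each candidate pair is turned into a vector of per-attribute similarities, a logistic regression learns the attribute weights from labelled pairs, and the model is then applied to a new pair. The training pairs below are hypothetical.

```python
# Supervised matching sketch: per-attribute similarity features + a classifier.

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(t1: dict, t2: dict) -> list:
    # One similarity score per shared attribute.
    return [SequenceMatcher(None, str(t1[a]), str(t2[a])).ratio()
            for a in ("NAME", "SALARY")]

# Hypothetical labelled training pairs: 1 = match, 0 = non-match.
pairs = [
    ({"NAME": "KETTY", "SALARY": "25K"}, {"NAME": "KETTY", "SALARY": "20K"}, 1),
    ({"NAME": "WANG",  "SALARY": "32K"}, {"NAME": "WANG",  "SALARY": "22K"}, 1),
    ({"NAME": "JOHN",  "SALARY": "30K"}, {"NAME": "MARY",  "SALARY": "34K"}, 0),
    ({"NAME": "KETTY", "SALARY": "25K"}, {"NAME": "WANG",  "SALARY": "22K"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)   # learns one weight per attribute
new_pair = ({"NAME": "KETY", "SALARY": "25K"}, {"NAME": "KETTY", "SALARY": "25K"})
print(model.predict([features(*new_pair)]))   # predicted label (1 = match)
```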
Probabilistic Matching
Probabilistic matching utilizes probability distributions to model the matching domain, allowing for more nuanced decision-making based on the likelihood of a match.
| Pros | Cons |
|------|------|
| Provides a principled framework that can naturally incorporate diverse domain knowledge | Computationally expensive |
| Leverages established probabilistic representation and reasoning techniques from AI and database communities | Often challenging to understand and debug matching decisions |
| Offers a reference framework for comparing and explaining various matching approaches | |
This approach is beneficial when dealing with uncertain data and can improve matching accuracy by considering the probabilities associated with various matching scenarios.
Data Fusion
Data fusion is the process of integrating and reconciling information from multiple sources to produce a coherent and accurate representation of the same real-world entity. Even after establishing that some data represent the same entity, discrepancies in the information can arise. The challenge then becomes determining how to resolve these inconsistencies.
Resolution Function
Inconsistencies in data can result from various factors, including:
Incorrect Data: One or both of the data sources may contain inaccuracies.
Partial Views: Each source might provide a correct but incomplete view of the entity. For instance:
The total salary might be the sum of the amounts reported by different sources.
A list of authors for a book may not include all contributors.
One source may only provide the first letter of a middle name while another includes the full name.
When reconciling discrepancies, several strategies can be employed. The correct value may be derived from a function of the original values, allowing for a more comprehensive integration. Here are a few examples of how values might be combined:
Summation: Combine values directly, such as total salaries: $v = v_1 + v_2$
Weighted Average: Assign different weights to each value to reflect their reliability: $v = w_1 v_1 + w_2 v_2$,
where $w_1 + w_2 = 1$.
Maximum or Minimum: Choose the maximum or minimum value from the conflicting data: $v = \max(v_1, v_2)$ or $v = \min(v_1, v_2)$
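The following sketch shows what such resolution functions might look like in Python for the salary attribute, parsing the "K" notation used in the tables below; which function is appropriate (sum, weighted average, maximum) depends on the application.

```python
# Attribute-level resolution functions for data fusion (illustrative sketch).

def parse_k(v: str) -> int:
    return int(v.rstrip("K"))

def fuse_sum(v1: str, v2: str) -> str:
    # e.g. each source reports only part of the total salary
    return f"{parse_k(v1) + parse_k(v2)}K"

def fuse_weighted(v1: str, v2: str, w1: float = 0.5) -> str:
    # w1 + w2 = 1; weights reflect the reliability of each source
    return f"{w1 * parse_k(v1) + (1 - w1) * parse_k(v2):.0f}K"

def fuse_max(v1: str, v2: str) -> str:
    return f"{max(parse_k(v1), parse_k(v2))}K"

print(fuse_sum("25K", "45K"))        # 70K, as in the KETTY example below
print(fuse_weighted("25K", "45K"))   # 35K with equal weights
print(fuse_max("25K", "45K"))        # 45K
```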
Example
Given the following tables representing employee data from two different sources:
Source 1:
| SSN | NAME | AGE | SALARY | POSITION |
|-----|------|-----|--------|----------|
| 123456789 | JOHN | 34 | 30K | ENGINEER |
| 234567891 | KETTY | 27 | 25K | ENGINEER |
| 345678912 | WANG | 39 | 32K | MANAGER |
Source 2:
| SSN | NAME | AGE | SALARY | PHONE |
|-----|------|-----|--------|-------|
| 123456789 | JOHN | 34 | 30K | NULL |
| 234567891 | KETTY | 27 | 45K | 1234567 |
| 345678912 | WANG | 39 | 54K | 2345678 |
| 456789123 | MARY | 42 | 34K | 3456789 |
In this scenario, we can fuse the information based on SSN. If we aim to resolve the salary for KETTY, we can sum the salaries from both sources: 25K + 45K = 70K.
Thus, a new integrated table might look like:
| SSN | NAME | AGE | SALARY | POSITION | PHONE |
|-----|------|-----|--------|----------|-------|
| 123456789 | JOHN | 34 | 30K | ENGINEER | NULL |
| 234567891 | KETTY | 27 | 70K | ENGINEER | 1234567 |
| 345678912 | WANG | 39 | 54K | MANAGER | 2345678 |
| 456789123 | MARY | 42 | 34K | NULL | 3456789 |
Data fusion is particularly complex in modern contexts where:
There are numerous and heterogeneous data sources, including structured databases, semi-structured sources (like XML), and unstructured data (like text documents).
Different terminologies and operational contexts can create confusion and ambiguity.
Data is often time-variant, especially in environments like the web where information changes rapidly.
Sources may be mobile or transient, adding another layer of complexity to the fusion process.
Data Integration in the Multidatabase
In a multidatabase environment, integrating data from various sources is a complex but essential process. This integration allows users to query and manipulate data as if it were coming from a single source, even when it is actually distributed across multiple databases with different schemas and structures.
Wrappers in Data Integration
Definition
A Wrapper is a software module that translates queries and results between the global schema and the data source schemata.
Here’s how wrappers function in data integration:
Query Translation: Wrappers convert high-level queries issued to the global schema into specific queries that can be understood by the target data sources. This may involve transforming the syntax or semantics of the query based on the specific database model (e.g., relational, object-oriented, NoSQL).
Result Formatting: Once a query is executed in the data source, the results may be returned in a format that is not directly usable by the application or user. Wrappers translate these results back into a format compatible with the global schema.
Type Conversions: Wrappers can also facilitate type conversions, allowing data types from different databases to be harmonized for consistency and compatibility.
Handling Unstructured Data: While wrappers are relatively straightforward to implement for structured data (like relational or object-oriented data), they become more complex when dealing with unstructured data (like documents, images, etc.). Special techniques may be required to parse and extract meaningful information from such data.
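As a rough sketch of the first three responsibilities, the Python class below translates a simple attribute-level filter on the global GLOB-PROD view into SOURCE2's vocabulary (CatID → Type, Stock → Q.ty, PCode → Code) and renames result columns back; the dictionary-based query representation is purely illustrative.

```python
# Minimal wrapper sketch for SOURCE2, following the attribute correspondences
# used in the GAV/LAV examples earlier. A real wrapper would also parse and
# rewrite full queries and handle type conversions.

GLOBAL_TO_SOURCE2 = {"PCode": "Code", "CatID": "Type", "Stock": "Q.ty",
                     "Name": "Name", "Size": "Size", "Color": "Color",
                     "Description": "Description", "Price": "Price"}

class Source2Wrapper:
    def translate_query(self, filters: dict) -> str:
        # Query translation: global attribute names -> source attribute names.
        where = " AND ".join(f"{GLOBAL_TO_SOURCE2[a]} = '{v}'" for a, v in filters.items())
        return f"SELECT * FROM SOURCE2.Product WHERE {where}"

    def translate_result(self, row: dict) -> dict:
        # Result formatting: source attribute names -> global attribute names.
        back = {src: glob for glob, src in GLOBAL_TO_SOURCE2.items()}
        return {back.get(col, col): val for col, val in row.items()}

w = Source2Wrapper()
print(w.translate_query({"Size": "V", "Color": "Red"}))
print(w.translate_result({"Code": "P1", "Type": "C10", "Q.ty": 5}))
```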
Design Steps for Data Integration in a Multidatabase
To effectively integrate data from multiple sources, the following design steps are typically followed:
Reverse Engineering:
This step involves analyzing the existing databases to produce a conceptual schema that represents the underlying data and its relationships. It helps in understanding how the different sources function and what data they hold.
Conceptual Schemata Integration:
Once the conceptual schemas are produced, the next step is to integrate these schemas. This involves identifying commonalities and discrepancies among the schemas and resolving them to form a unified conceptual model.
Choice of Target Logical Data Model:
After integration, the target logical data model needs to be chosen. This model serves as the basis for the global schema. The global conceptual schema is then translated into this logical model, ensuring that it meets the needs of the application and users.
Definition of the Language Translation (Wrapping):
At this stage, the specifics of how queries and results will be translated between the global schema and individual data sources are defined. This includes specifying the syntax and semantic mappings that wrappers will use.
Definition of Data Views:
Finally, data views are defined to present the integrated data to users or applications. These views encapsulate the complexity of the underlying data sources and provide a simplified interface for data access.