In business applications, the majority of queries are designed to yield exact results. These results are often organized in a specific order to improve accessibility and readability, but there is no inherent preference among the outcomes—each result holds equal relevance and importance. This contrasts with consumer applications, where certain queries necessitate ranking the results based on a notion of “preferability.” In such cases, the simultaneous optimization of various criteria is often required.

For example, objects in a dataset may be described by multiple attributes, and a general problem arises: how to identify the best objects based on a defined measure of "goodness."

Example

Consider practical scenarios: search engines rank web pages according to relevance, e-commerce platforms prioritize product listings based on user preferences, and recommender systems suggest items tailored to individual tastes.

In machine learning, this principle appears in tasks such as finding the k-nearest neighbors. Here, given n data points in a d-dimensional metric space and a query point q, the objective is to identify the k points closest to q. This approach underpins classification algorithms, feature selection, and various other domains.

To address such problems, two major approaches are commonly employed: ranking queries and skyline queries.

  • Ranking queries involve selecting the top k objects based on a scoring function, such as relevance or utility.
  • In contrast, skyline queries identify a set of non-dominated objects—those that are not outperformed by any other object in all criteria.

These approaches have broad applicability in both academic research and real-world systems, forming the backbone of methods to rank, classify, and filter information effectively.

Historical Perspective: Rank Aggregation

The challenge of rank aggregation—combining multiple ranked lists to produce a consensus ranking—has been a focus of study for centuries, primarily within the realm of social choice theory. Early contributions date back to Ramon Llull in the 13th century, who introduced methods for aggregating preferences. Later, Borda (1770) and the Marquis de Condorcet (1785) proposed more systematic methods to tackle this problem.

Rank aggregation addresses a fundamental question: given a set of candidates and a set of voters, how can one determine the overall ranking of candidates based on the preferences of all voters? Unlike modern scoring systems, early approaches did not assign visible scores to candidates but instead relied solely on ordinal rankings. Key problems include identifying the “best” candidate and resolving inconsistencies that arise when individual rankings conflict—a hallmark of Condorcet’s paradox, where collective preferences can form cycles.

Borda’s and Condorcet’s Proposals

Borda and Condorcet proposed two influential methods for aggregating preferences in elections, each rooted in distinct philosophical principles.

Borda’s Method organizes elections based on the principle of order of merit. In this system, candidates receive points corresponding to their rank in each voter’s preference list. Specifically, the first-place candidate receives 1 point, the second-place candidate receives 2 points, and so on, down to n-th place (with n candidates), which receives n points. The penalty for each candidate is the sum of their points across all voters’ rankings. The candidate with the lowest penalty is declared the Borda winner, emphasizing overall favorability across the electorate rather than dominance in head-to-head comparisons.

In contrast, Condorcet’s Method focuses on pairwise majority elections. A candidate is considered the Condorcet winner if they defeat every other candidate in a direct comparison based on majority rule. This criterion captures the idea of dominance but introduces the possibility of cyclic preferences, where no single candidate wins all pairwise contests. In such cases, the Condorcet winner does not exist, revealing the limitations of majority rule in ranking aggregation.

Example: Comparing Borda and Condorcet Winners

Consider the following voter preferences, where each column reports the rank that the corresponding voter assigns to each candidate:

| Voter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 3 |
| C | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |
| B | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 |

Using Borda’s scoring, we calculate the penalties:

  • A: 6·1 + 4·3 = 18
  • C: 6·2 + 4·1 = 16
  • B: 6·3 + 4·2 = 26

Here, candidate C has the lowest penalty and is the Borda winner.


Using Condorcet’s criterion, we compare candidates in pairwise elections:

  • A defeats B (6 votes to 4).
  • A defeats C (6 votes to 4).

Thus, A is the Condorcet winner, as it wins against both B and C in direct comparisons. This highlights a potential divergence: while C is the Borda winner, A is the Condorcet winner, illustrating the differing priorities of the two methods.
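Both winners can be recomputed mechanically; here is a minimal Python sketch (our own naming) over the preference table above:

```python
ranks = {  # candidate -> rank assigned by voters 1..10 (1 = best)
    "A": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3],
    "C": [2, 2, 2, 2, 2, 2, 1, 1, 1, 1],
    "B": [3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
}

# Borda: penalty = sum of ranks; the lowest penalty wins.
penalty = {c: sum(r) for c, r in ranks.items()}
print(penalty)                        # {'A': 18, 'C': 16, 'B': 26}
print(min(penalty, key=penalty.get))  # 'C' -> Borda winner

# Condorcet: x beats y if a majority of voters rank x above y.
def beats(x, y):
    return sum(rx < ry for rx, ry in zip(ranks[x], ranks[y])) > len(ranks[x]) / 2

print([c for c in ranks
       if all(beats(c, o) for o in ranks if o != c)])  # ['A'] -> Condorcet winner
```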

Approaches to Rank Aggregation

Rank aggregation remains central to various applications, from elections to search engine optimization.

Definition

A common mathematical formulation seeks a consensus ranking ρ that minimizes the total distance to the initial rankings ρ1, …, ρm, i.e., that minimizes $\sum_{i=1}^{m} d(\rho, \rho_i)$.

Two widely used metrics for comparing rankings are:

  1. Kendall Tau Distance (K(ρ1, ρ2)): This measures the number of exchanges of adjacent elements (pairwise swaps) needed to transform one ranking into another.

    For example, bubble sort’s count of exchanges to sort ρ1 into ρ2 provides this distance.

    While intuitive, finding an exact solution that minimizes the total Kendall tau distance is computationally hard (NP-complete).

  2. Spearman’s Footrule Distance (F(ρ1, ρ2)): This sums the absolute differences between the ranks of corresponding items in ρ1 and ρ2: $F(\rho_1, \rho_2) = \sum_i |\rho_1(i) - \rho_2(i)|$. Unlike Kendall tau, Spearman’s footrule distance is computationally tractable (PTIME) and can be efficiently approximated. This makes it practical for large-scale problems, where median ranking serves as a useful heuristic.

The computational challenges in rank aggregation depend on the metric used:

  • For Kendall tau distance, finding an optimal aggregation is NP-complete, making it infeasible for large datasets without approximation techniques.
  • For Spearman’s footrule, efficient polynomial-time solutions exist, with approximations available for even faster results in high-dimensional cases.
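As a concrete illustration, here is a minimal sketch of the two distances, with rankings encoded as lists of items, best first:

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Number of item pairs ordered differently in the two rankings."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum((pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
               for a, b in combinations(r1, 2))

def footrule(r1, r2):
    """Sum of the absolute rank displacements of each item."""
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(abs(i - pos2[item]) for i, item in enumerate(r1))

print(kendall_tau(["a", "b", "c"], ["c", "a", "b"]))  # 2
print(footrule(["a", "b", "c"], ["c", "a", "b"]))     # 4
```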

Combining Opaque Rankings with MedRank

When dealing with ranked lists where only the position of elements matters—without access to associated scores—specific techniques like MedRank can be employed to combine rankings effectively. MedRank, based on the concept of the median, provides an approximate solution to Footrule-optimal aggregation. This method is particularly useful when rankings are opaque, such as hotel rankings based on price, rating, and distance, where only the order of items in each list is known.

Algorithm

The algorithm takes as input:

  • An integer k, representing the desired number of top elements.
  • m ranked lists L1, …, Lm, each containing N elements.

Output: The top k elements based on median ranking.

Steps:

  1. Sorted Access: MedRank performs one sorted access at a time on each list, retrieving elements sequentially.
  2. Selection Criteria: It continues until there are k elements that appear in more than m/2 of the lists.
  3. Output: These elements are returned as the top k, with their rank determined by their median positions across the lists.

Example: Ranking Hotels

Consider three ranked lists of hotels based on different criteria—price, rating, and distance. The goal is to find the top 3 hotels using median ranking.

| Rank | Price | Rating | Distance |
|---|---|---|---|
| 1 | Ibis | Crillon | Le Roch |
| 2 | Etap | Novotel | Lodge In |
| 3 | Novotel | Sheraton | Ritz |
| 4 | Mercure | Hilton | Lutetia |
| 5 | Hilton | Ibis | Novotel |
| 6 | Sheraton | Ritz | Sheraton |
| 7 | Crillon | Lutetia | Mercure |

To determine the top 3 hotels, MedRank accesses elements in order from each list and calculates their median ranks:

  • Novotel: positions 3, 2, 5 across the lists → median rank 3.
  • Hilton: positions 5 and 4 in the price and rating lists (not yet seen under distance) → median rank 5.
  • Ibis: positions 1 and 5 in the price and rating lists (not yet seen under distance) → median rank 5.

The algorithm stops when it identifies the top 3 elements based on their median ranks. Here, the depth, or the maximum number of sorted accesses on any list, is 5. Ideally, when all median ranks are distinct, the result matches the Footrule-optimal aggregation.
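The run above can be reproduced with a short sketch of MedRank (our own encoding of the lists; once an item has been seen in more than half of the lists, its median position is known, since any unseen position can only be larger than the current depth):

```python
def medrank(k, lists):
    m = len(lists)
    seen = {}  # item -> positions at which it has been seen so far
    for depth in range(max(len(l) for l in lists)):
        for l in lists:
            if depth < len(l):
                seen.setdefault(l[depth], []).append(depth + 1)
        # items seen in more than half of the lists have a known median rank
        winners = {x: ps for x, ps in seen.items() if len(ps) > m / 2}
        if len(winners) >= k:
            median = {x: sorted(ps)[len(ps) // 2] for x, ps in winners.items()}
            return sorted(winners, key=median.get)[:k], depth + 1
    return list(seen), max(len(l) for l in lists)

price = ["Ibis", "Etap", "Novotel", "Mercure", "Hilton", "Sheraton", "Crillon"]
rating = ["Crillon", "Novotel", "Sheraton", "Hilton", "Ibis", "Ritz", "Lutetia"]
distance = ["Le Roch", "Lodge In", "Ritz", "Lutetia", "Novotel",
            "Sheraton", "Mercure"]

print(medrank(3, [price, rating, distance]))
# (['Novotel', 'Ibis', 'Hilton'], 5) -- Ibis and Hilton tie at median rank 5
```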

MedRank is instance-optimal for combining rankings accessed in sorted order. Instance optimality means that for any given input:

  • MedRank’s execution cost is at most a constant factor higher than the execution cost of any other algorithm tailored to that input.
  • MedRank guarantees the best performance (up to a constant factor) across all possible instances where rankings are accessed sequentially.

This is a stronger guarantee than average-case or worst-case optimality. For instance:

  • Binary search is worst-case optimal because it performs O(log n) comparisons, unlike sequential search, which takes O(n) in the worst case.
  • However, binary search is not instance-optimal, as its performance can be significantly worse than a specialized algorithm for specific inputs.

MedRank’s optimality ratio c is 2.

This means that its execution cost is at most twice the cost of any other algorithm designed for the same instance.

While not optimal in all scenarios, this property makes MedRank robust and efficient for practical use cases.

MedRank’s utility lies in its simplicity and effectiveness for aggregating rankings when only positional information is available. Its ability to approximate Footrule-optimal aggregation efficiently makes it applicable in various domains, such as:

  • Recommender systems: Combining user preferences across categories.
  • Multi-criteria decision-making: Selecting top options based on multiple factors.
  • Search engines: Aggregating rankings from different sources without explicit scoring.

Ranking Queries (top-k Queries)

Ranking queries aim to retrieve the top k best results from a potentially vast dataset, where “best” is determined by relevance, importance, or interest.

This is crucial in applications such as e-commerce, web search, multimedia systems, and scientific databases, where users seek the most relevant answers rather than exhaustive listings.

The Naïve Approach to Ranking

A common, straightforward method for ranking is to use a scoring function S, which assigns a numerical score to each tuple based on its attributes.

Algorithm

Taking as input the cardinality k, a dataset D, and a scoring function S, to obtain the k highest-scored tuples with respect to S, the naive approach is as follows:

  1. Compute the score S(t) for every tuple t in the dataset D.
  2. Sort all tuples by their scores in descending order.
  3. Select the top k tuples as the result.
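A minimal sketch of the naive algorithm in Python (the player data anticipates the example below):

```python
def naive_topk(k, dataset, score):
    # score everything, sort all tuples, keep the k best
    return sorted(dataset, key=score, reverse=True)[:k]

players = [
    ("Shaquille O'Neal", 1669, 760),
    ("Tracy McGrady", 2003, 484),
    ("Kobe Bryant", 1819, 392),
    ("Yao Ming", 1465, 669),
    ("Dwyane Wade", 1854, 397),
]
top = naive_topk(2, players, score=lambda t: t[1] + t[2])  # points + rebounds
print(top)  # [('Tracy McGrady', 2003, 484), ("Shaquille O'Neal", 1669, 760)]
```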

Example

For example, S could combine a basketball player’s statistics, such as S = Points + Rebounds.

| Name | Points | Rebounds | Score (Points + Rebounds) |
|---|---|---|---|
| Shaquille O’Neal | 1669 | 760 | 2429 |
| Tracy McGrady | 2003 | 484 | 2487 |
| Kobe Bryant | 1819 | 392 | 2211 |
| Yao Ming | 1465 | 669 | 2134 |
| Dwyane Wade | 1854 | 397 | 2251 |

While simple, this approach becomes computationally expensive for large datasets because it requires sorting all tuples, which has a complexity of O(N log N). This inefficiency is compounded when the query involves joining multiple tables, especially if the join is many-to-many.

SQL offers tools to handle ranking queries, though historically these have evolved over time:

  • Sorting: Achieved through the ORDER BY clause.
  • Limiting results: Only natively supported in SQL since the SQL:2008 standard using FETCH FIRST k ROWS ONLY. Earlier databases relied on non-standard syntaxes like LIMIT k (PostgreSQL, MySQL) or SELECT TOP k (SQL Server).

Example

For example, consider ranking used cars by the weighted score 0.8·Price + 0.2·Miles:

  • Query A filters cars by price (Price ≤ 21000) and orders the results:
    SELECT * 
    FROM UsedCarsTable 
    WHERE Vehicle = 'Audi/A4' AND Price <= 21000 
    ORDER BY 0.8*Price + 0.2*Miles;
    This query is efficient but may exclude relevant results that slightly exceed the price limit but have better mileage (near-miss problem).
  • Query B removes the price filter, returning all cars and ordering them by their weighted score:
    SELECT * 
    FROM UsedCarsTable 
    WHERE Vehicle = 'Audi/A4' 
    ORDER BY 0.8*Price + 0.2*Miles;
    While this query ensures no potentially relevant results are missed, it risks overwhelming users with an exhaustive list of results (information overload problem).

Two primary abilities are required to perform efficient and relevant ranking queries:

  1. Sorting by Scores: Implemented through the ORDER BY clause.
  2. Limiting the Result Set: Controlled by clauses like LIMIT or FETCH FIRST k ROWS ONLY.

However, predefined filters (e.g., price limits) can create trade-offs:

  • Near-Miss Problem: Strict filters may exclude near-optimal results.
  • Information Overload: Broad queries may return too many results to be useful.

Weighting factors like 0.8 (price) and 0.2 (mileage) reflect user preferences and enable fine-tuned ranking. The weights normalize different attributes, allowing multi-dimensional comparisons.

Given the inefficiencies of the naive approach, more sophisticated techniques are often employed to improve performance, such as:

  • Indexing: Creating indexes on frequently queried attributes can reduce query time.
  • Approximate Methods: Sampling or heuristic-based algorithms may reduce computation time while maintaining acceptable accuracy.
  • Incremental Approaches: Algorithms that generate results progressively, stopping as soon as the top k tuples are identified, avoid processing the entire dataset.

Semantics of Top-k Queries

Top-k queries aim to retrieve the k most relevant tuples according to a scoring function or ordering criterion. The semantics allow for non-determinism if multiple sets of tuples equally satisfy the ordering criteria. The first k tuples that meet the criteria are part of the result, but the selection among equally ranked tuples may vary.

For example, given:

SELECT * FROM R ORDER BY Price FETCH FIRST 3 ROWS ONLY

If multiple tuples have identical prices that qualify them for the k-th position, any of these tuples can appear in the result, leading to non-deterministic behavior.

Examples of Top-k Queries

  • Best NBA Player by Points and Rebounds
SELECT * 
FROM NBA 
ORDER BY Points + Rebounds DESC 
FETCH FIRST 1 ROW ONLY;
  • Two Cheapest Chinese Restaurants
SELECT * 
FROM RESTAURANTS 
WHERE Cuisine = 'Chinese' 
ORDER BY Price 
FETCH FIRST 2 ROWS ONLY;
  • Top 5 Audi/A4 by Weighted Score of Price and Mileage
SELECT * 
FROM USEDCARS 
WHERE Vehicle = 'Audi/A4' 
ORDER BY 0.8*Price + 0.2*Miles 
FETCH FIRST 5 ROWS ONLY;

Top-k query evaluation depends on two key factors:

  1. Query Type: Single relation, multi-relation joins, or aggregated results.
  2. Access Paths: Availability of indexes for ranking attributes.

In the simple case of a single relation:

  • Pre-sorted Input: If tuples are already sorted by the scoring function S, only the first k tuples are needed, minimizing cost.
  • Unsorted Input: Perform an in-memory sort using a heap for efficiency. With N tuples, this requires O(N log k) operations.
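For the unsorted case, a sketch of the heap-based scan (Python's heapq), which keeps only the k best tuples seen so far:

```python
import heapq

def topk_unsorted(k, tuples, score):
    heap = []  # min-heap of (score, tuple); the root is the worst of the kept k
    for t in tuples:
        s = score(t)
        if len(heap) < k:
            heapq.heappush(heap, (s, t))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, t))  # drop the current worst, keep t
    return sorted(heap, reverse=True)  # best first

print(topk_unsorted(2, [3, 9, 1, 7, 5], score=lambda x: x))  # [(9, 9), (7, 7)]
```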

Geometric Interpretation in 2D Space

For queries involving multiple attributes (e.g., Price and Mileage), tuples can be visualized as points in a 2D space: t = (Price, Mileage).

The scoring function S determines the “closeness” of a point to the ideal target point—here the origin (0, 0), i.e., zero price and zero mileage. The objective is to find the k points that minimize S, equivalent to finding those closest to the target.

Example: Weighted Preferences and Their Impact

Consider the line of equation $w_1 x + w_2 y = c$, where c is a constant representing the score. This can also be written as $y = -\frac{w_1}{w_2}x + \frac{c}{w_2}$, from which we can derive that all the lines with the same slope $-w_1/w_2$ are parallel and represent different scores. By definition, all the points on the same line are equally good, with higher values of c indicating less desirable points.

  1. Equal weights, e.g., $(w_1, w_2) = (0.5, 0.5)$:
    • The best car is the point closest to the target along iso-score lines of slope −1.
  2. Skewed weights, e.g., $(w_1, w_2) = (0.8, 0.2)$:
    • Different cars become the best, as the weights change the slope of the iso-score lines and shift preferences.

k-Nearest Neighbors (k-NN)

The k-NN model evaluates top-k queries using distances in a multidimensional attribute space. Instead of ranking by scores, tuples are ranked by their proximity to a target point q, using a distance function d.

Components of the k-NN Model:

  • Attribute Space (A): A set of d ranking attributes $A = \{A_1, \dots, A_d\}$, each with a domain dom(Ai).
  • Relation (R): $R(A_1, \dots, A_d, B_1, \dots)$, where the Bj are non-ranking attributes.
  • Target Point (q): Query point $q = (q_1, \dots, q_d)$, with any $q_i \in \text{dom}(A_i)$.
  • Distance Function (d): Measures closeness between the tuples of R and q.

Algorithm

Given a point q, a relation R, an integer k, and a distance function d, the k-NN algorithm retrieves the k tuples closest to q:

  1. Compute Distances: For each tuple t in R, calculate the distance d(t, q).
  2. Sort and Select: Sort the tuples in R by their distances to q and select the top k.
  3. Return Results: Output the k tuples with the smallest distances.
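A minimal sketch of this naive k-NN evaluation, assuming the Euclidean distance and points encoded as coordinate tuples:

```python
import heapq
import math

def knn(k, points, q):
    # distance computation + selection of the k smallest in one pass
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

data = [(1, 2), (4, 4), (0, 0), (3, 1)]
print(knn(2, data, q=(1, 1)))  # [(1, 2), (0, 0)]
```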

The most commonly used distance functions are the Lp-norms (Minkowski distances): $d(t, q) = \left(\sum_{i=1}^{d} |t_i - q_i|^p\right)^{1/p}$.

Relevant cases are:

  1. Euclidean Distance (L2): $d(t, q) = \left(\sum_{i=1}^{d} (t_i - q_i)^2\right)^{1/2}$

    Suitable for circular search areas (hyper-ellipsoids with weights).

  2. Manhattan Distance (L1): $d(t, q) = \sum_{i=1}^{d} |t_i - q_i|$

    Ideal for diamond-shaped search areas (hyper-rhomboids with weights).

  3. Chebyshev Distance (L∞): $d(t, q) = \max_{i} |t_i - q_i|$

    Useful for box-shaped search areas (hyper-rectangles with weights).

Weights can adjust the importance of dimensions:

  • Weighted Euclidean Distance: $d(t, q) = \left(\sum_{i=1}^{d} w_i\,(t_i - q_i)^2\right)^{1/2}$
  • Weighted Manhattan Distance: $d(t, q) = \sum_{i=1}^{d} w_i\,|t_i - q_i|$
  • Weighted Chebyshev Distance: $d(t, q) = \max_{i} w_i\,|t_i - q_i|$
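A small helper reflecting these weighted distances (with p = ∞ for Chebyshev); the multiplicative-weight convention used here matches the formulas above:

```python
def weighted_lp(t, q, w, p):
    if p == float("inf"):  # Chebyshev: max_i w_i |t_i - q_i|
        return max(wi * abs(ti - qi) for wi, ti, qi in zip(w, t, q))
    return sum(wi * abs(ti - qi) ** p for wi, ti, qi in zip(w, t, q)) ** (1 / p)

t, q, w = (2.0, 3.0), (0.0, 0.0), (1.0, 0.5)
print(weighted_lp(t, q, w, 1))             # 3.5 (Manhattan)
print(weighted_lp(t, q, w, 2))             # 2.915... = sqrt(8.5) (Euclidean)
print(weighted_lp(t, q, w, float("inf")))  # 2.0 (Chebyshev)
```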

The geometric impact of the weights in distance functions is significant: with the definitions above, a larger weight $w_i$ increases the importance of the corresponding attribute $A_i$, effectively compressing the search region along that axis. Conversely, a smaller weight decreases the importance of $A_i$, stretching the search region in that direction. This adjustment allows the ranking or nearest neighbor search to reflect user preferences by emphasizing or de-emphasizing specific attributes in the multidimensional space.

Top-k Join Queries

In top-k join queries, the goal is to rank results based on combined information from multiple relations.

SELECT   <attributes>
FROM     R_1, R_2, ..., R_n
WHERE    <join and local conditions>
ORDER BY S(p_1, p_2, ..., p_m) [DESC]
FETCH FIRST k ROWS ONLY;

Top-k 1-1 Join Queries

All joins are based on a common key attribute(s), making this the simplest case to manage and serving as the foundation for more complex scenarios. This approach is practically relevant and involves two main scenarios:

  1. An index is available for retrieving tuples according to each preference.
  2. The relation is distributed across multiple sites, with each site providing information about only a portion of the objects (the “middleware” scenario).

Each input list supports sorted access, where each access returns the ID of the next best object along with its partial (local) score and potentially other attributes. Additionally, each input list supports random access, allowing retrieval of the partial score of an object given its ID (requiring an index on the primary key). The ID of an object remains consistent across all inputs (possibly after reconciliation). Each input contains the same set of objects.

Example

Consider two tables, EatWell and BreadAndWine, that describe the same set of restaurants: the first scores how well eating at each restaurant matches the user’s preferences, while the second scores its bread and wine offering.

The goal is to find the top k objects based on a scoring function that combines these local scores.

SELECT * 
FROM EatWell EW, BreadAndWine BW 
WHERE EW.Name = BW.Name
ORDER BY EW.Score + BW.Score DESC
FETCH FIRST 1 ROW ONLY;

The resulting table ranks the objects by their combined scores from both tables. However, the winner is not the best locally scored object in either table, but rather the one that maximizes the combined score across both tables.

| Name | Global Score |
|---|---|
| Cheers! | 16.8 |
| The old mill | 16.7 |
| Da Gino | 16.5 |
| The canteen | 16.0 |
| Chez Paul | 13.0 |
| Let’s eat! | 12.4 |
| Los pollos hermanos | 11.5 |

A Model for Scoring Functions

Scoring functions are essential to rank objects based on their attributes. Each object o has local (or partial) scores si(o), one for each scoring dimension i = 1, …, m. These scores are normalized for consistency and mapped to a score space $[0, 1]^m$, where m is the number of dimensions.

Definition

The global score combines these local scores into a single value that represents the overall ranking of the object. Formally: $S(o) = S(s_1(o), \dots, s_m(o))$.

In the score space, objects are represented as points defined by their local scores.

Example

For example, in a two-dimensional attribute space such as (Price, Mileage):

Assume each attribute value is normalized into a local score in [0, 1], with 1 assigned to the best value (e.g., the lowest price) and 0 to the worst.

These transformations map objects into the score space while preserving the relative ranking of scores along each axis.

There are several common scoring functions, each with different properties and implications for ranking:

  1. SUM (or AVG): $S(o) = \sum_{i=1}^{m} s_i(o)$. Assumes equal weight for all attributes.

  2. WSUM (Weighted Sum): $S(o) = \sum_{i=1}^{m} w_i\, s_i(o)$. Assigns different weights to attributes to reflect their importance.

  3. MIN (Minimum): $S(o) = \min_i s_i(o)$. Ranks objects by their worst local score, ensuring robust evaluation.

  4. MAX (Maximum): $S(o) = \max_i s_i(o)$. Ranks objects by their best local score.

For any scoring function, the goal is to retrieve the k objects with the highest global scores. However, efficiency depends on the scoring function used, as some are computationally more demanding.

Definition

Iso-score curves are sets of points in the score space with the same global score.

Example

For example:

  • For S = WSUM, iso-score curves are lines in $[0, 1]^2$ with slope $-w_1/w_2$, determined by the weights $w_1$ and $w_2$.
  • For S = MIN, the iso-score curve for value c consists of horizontal and vertical segments intersecting at the point (c, c).

These curves allow visualization of how scoring functions distribute rankings in the score space, analogous to iso-distance curves in geometric spaces.

Efficient Computation of Top-k Results

Challenge

Efficiently compute top-k results for a 1-1 join query when the scoring function is complex.

  1. Baseline Algorithm (B0):

    • Retrieve the k best objects from each input list Li.

    • Rank these objects by S.

    • Output the top-k objects.

    Limitation: B0 works well for S = MAX but may fail for functions like SUM or MIN.

  2. Advanced Algorithms:

    • Use index structures or dynamic programming for efficient query evaluation.
    • Leverage sorted access to fetch tuples in ranked order.
    • Utilize random access to compute scores dynamically for objects retrieved from different relations.

Summary

| Algorithm | Function | Data Access | Notes |
|---|---|---|---|
| B0 | MAX | Sorted | Instance-optimal |
| FA | Monotone | Sorted and Random | Cost independent of the scoring function |
| TA | Monotone | Sorted and Random | Instance-optimal |
| NRA | Monotone | Sorted | Instance-optimal, but provides approximate scores due to lack of random access |

The B0 Algorithm

The B0 algorithm is a foundational method designed to solve top-k queries efficiently when the scoring function is MAX. Its simplicity lies in leveraging ranked lists of objects without requiring complex operations such as random accesses or full computation of global scores for all objects. The algorithm is considered instance-optimal, meaning its cost on any input is at most twice the cost of the best algorithm for that input.

Algorithm

The B0 algorithm accepts as input:

  1. An integer k, representing the number of top-ranked objects to retrieve.
  2. m ranked lists L1, …, Lm, where each list is sorted by a local scoring criterion.

The output is a set of k objects with the highest global scores, computed using the MAX scoring function.


  1. Perform k sorted accesses on each of the m ranked lists. Store the retrieved objects and their corresponding partial scores in a buffer B.
  2. For each object in B, calculate its global score as the maximum of its available partial scores.
  3. Return the k objects with the highest global scores.

This method avoids retrieving missing partial scores for objects, as it relies solely on the maximum value among the available scores. It also performs no random accesses, making it computationally efficient when MAX is the scoring function.

Example

Consider a query with k = 2 and three ranked lists with partial scores s1, s2, and s3. The goal is to identify the two objects with the highest global scores.

| OID | s1 | OID | s2 | OID | s3 | OID | S |
|---|---|---|---|---|---|---|---|
| o7 | 0.7 | o2 | 0.9 | o7 | 1.0 | o7 | 1.0 |
| o3 | 0.65 | o3 | 0.6 | o2 | 0.8 | o2 | 0.9 |
| o4 | 0.6 | o7 | 0.4 | o4 | 0.75 | o3 | 0.65 |
| o2 | 0.5 | o4 | 0.2 | o3 | 0.7 | o4 | 0.6 |

After performing k = 2 sorted accesses per list and computing S = MAX, the top two objects are o7 and o2, with scores 1.0 and 0.9, respectively.
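A sketch of B0 on this instance, with each ranked list encoded as (object, score) pairs in decreasing score order:

```python
def b0(k, lists):
    best = {}  # object -> max of the partial scores seen so far
    for l in lists:
        for obj, s in l[:k]:  # k sorted accesses per list, no random accesses
            best[obj] = max(best.get(obj, 0.0), s)
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]

l1 = [("o7", 0.7), ("o3", 0.65), ("o4", 0.6), ("o2", 0.5)]
l2 = [("o2", 0.9), ("o3", 0.6), ("o7", 0.4), ("o4", 0.2)]
l3 = [("o7", 1.0), ("o2", 0.8), ("o4", 0.75), ("o3", 0.7)]
print(b0(2, [l1, l2, l3]))  # [('o7', 1.0), ('o2', 0.9)]
```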

While B0 is effective for MAX, it fails for other scoring functions like SUM or MIN.

For instance, consider a case where S = MIN and k = 2. Run on the following lists, the algorithm selects objects based on the partial scores seen alone, which can lead to incorrect results.

| OID | s1 | OID | s2 | OID | s3 |
|---|---|---|---|---|---|
| o7 | 0.9 | o2 | 0.95 | o7 | 1.0 |
| o3 | 0.65 | o3 | 0.7 | o2 | 0.8 |
| o2 | 0.6 | o4 | 0.6 | o4 | 0.75 |

Performing random accesses to complete the scores of the retrieved objects and computing S = MIN yields:

| OID | S |
|---|---|
| o2 | 0.6 |
| o7 | 0.5 |

However, objects with partial scores still missing, such as o3 (whose s3 has not been seen), might have a global score exceeding the retrieved values, resulting in incorrect results. This issue arises because B0 lacks a mechanism to evaluate lower bounds or complete scores for all objects.

Practical Insights:

  1. Efficiency: The simplicity of B0 makes it computationally efficient for MAX-based queries, avoiding unnecessary accesses and computation.
  2. Scalability: B0 scales well when the number of lists or objects is large, as it processes only a small subset of the data.
  3. Use Case Limitations: For applications requiring other scoring functions, alternative algorithms like TA (Threshold Algorithm) or NRA (No Random Access) must be employed to guarantee correctness and efficiency.

Fagin’s Algorithm (FA)

Fagin’s Algorithm (FA) is a robust approach for computing the top- results from multiple ranked lists, where the scoring function combines these lists through a monotone aggregation function. Unlike algorithms tailored to specific scoring functions, FA operates with a general-purpose stopping condition, making it versatile but less optimized in some cases.

Algorithm

Input: an integer k, a monotone function S combining m ranked lists

Output: the top-k ⟨object, score⟩ pairs

  1. Extract the same number of objects by sorted accesses in each list until there are at least k objects in common
  2. For each extracted object, compute its overall score S by making random accesses wherever needed
  3. Among these, output the k objects with the best overall score

The algorithm proceeds in three main phases:

  1. Sorted Access Phase: Objects are accessed sequentially from each ranked list until at least k objects appear in all lists (i.e., k common objects are found across the lists). This ensures enough candidates for the final top-k computation.

  2. Score Computation Phase: For each object encountered in the sorted access phase, its global score is computed by making random accesses to retrieve any missing partial scores. These scores are then aggregated using the monotone function S.

  3. Selection Phase: Among all objects with fully computed scores, the k objects with the highest global scores are selected as the output.

The algorithm guarantees correctness and is efficient when the scoring function is monotone, as this ensures that objects with higher partial scores in individual lists are favored.

FA is sublinear in the total number of objects N: with m lists, its cost is $O(N^{(m-1)/m}\, k^{1/m})$. For example, when combining two lists (m = 2), the cost reduces to $O(\sqrt{kN})$. Despite this efficiency, FA is not instance-optimal, meaning there may be scenarios where other algorithms perform better for a given input.

Example: Hotels in Paris

Consider a query to find the k = 2 hotels in Paris with the best combined cheapness and rating. The scoring function is defined as a weighted average (with equal weights here): $S(o) = 0.5 \cdot s_{\text{cheapness}}(o) + 0.5 \cdot s_{\text{rating}}(o)$.

FA proceeds as follows:

  1. Sorted Access:
    • Hotels are accessed sequentially from both the cheapness and rating lists.
    • The process continues until at least k = 2 hotels appear in both lists; here this happens at depth 5, where Novotel, Hilton, and Ibis have all been seen in both.

| Cheapness | Score | Rating | Score |
|---|---|---|---|
| Ibis | 0.92 | Crillon | 0.90 |
| Etap | 0.91 | Novotel | 0.90 |
| Novotel | 0.85 | Sheraton | 0.80 |
| Mercure | 0.85 | Hilton | 0.70 |
| Hilton | 0.825 | Ibis | 0.70 |
  2. Score Computation:
    • Random accesses retrieve missing scores for the hotels seen under sorted access, allowing the computation of their global scores.

    • For example: S(Novotel) = 0.5·0.85 + 0.5·0.90 = 0.875 and S(Crillon) = 0.5·0.75 + 0.5·0.90 = 0.825.

  3. Selection: The top k = 2 hotels by global score are Novotel (0.875) and Crillon (0.825).
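A sketch of FA reproducing this example; dictionary lookups stand in for true random accesses, and hotels absent from a list (e.g., Etap in the rating list) default to score 0.0, an assumption of this sketch:

```python
def fagin(k, lists, agg):
    m, seen, depth = len(lists), {}, 0
    # Phase 1: sorted accesses until at least k objects are seen in all lists
    while sum(len(ls) == m for ls in seen.values()) < k:
        for i, l in enumerate(lists):
            obj, _ = l[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # Phase 2: random accesses to complete the scores of every object seen
    ra = [dict(l) for l in lists]
    scores = {o: agg([ra[i].get(o, 0.0) for i in range(m)]) for o in seen}
    # Phase 3: pick the k best
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

cheapness = [("Ibis", .92), ("Etap", .91), ("Novotel", .85), ("Mercure", .85),
             ("Hilton", .825), ("Sheraton", .80), ("Crillon", .75)]
rating = [("Crillon", .90), ("Novotel", .90), ("Sheraton", .80), ("Hilton", .70),
          ("Ibis", .70), ("Ritz", .70), ("Lutetia", .60)]
avg = lambda ss: sum(ss) / len(ss)

print(fagin(2, [cheapness, rating], avg))
# [('Novotel', 0.875), ('Crillon', 0.825)] after 5 sorted accesses per list
```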

Definition

The threshold point (τ) is defined as the point whose coordinates are the smallest values seen in each list during the sorted access phase.

Graphically, FA stops once the yellow region (fully seen objects) contains at least k points. The gray regions represent partially seen objects, while the blue region contains unseen objects that cannot beat the scores in the yellow region.

While FA is versatile, it has several limitations:

  1. Scoring Function Independence: The algorithm does not exploit specific properties of the scoring function S. This generality can result in higher costs compared to specialized algorithms.

  2. Memory Requirements: FA buffers all objects accessed during the sorted access phase. For large datasets, this can lead to significant memory consumption.

  3. Inefficiency in Random Access: The sequential nature of sorted access followed by random access can be inefficient. Some improvements, such as interleaving random accesses with score computation, can reduce unnecessary work.

  4. Stopping Condition: The stopping condition of finding k common objects can be suboptimal. Algorithms like TA (Threshold Algorithm) refine this by dynamically updating score bounds.

FA is well-suited for scenarios where:

  • The scoring function is complex or not explicitly given.
  • The dataset is relatively small, or memory usage is not a constraint.
  • General-purpose solutions are preferred over fine-tuned approaches.

Threshold Algorithm (TA)

The Threshold Algorithm (TA) is a powerful and efficient method for computing the top-k results from multiple ranked lists. Unlike Fagin’s Algorithm (FA), TA is instance-optimal among all algorithms that utilize both random and sorted accesses. This means that no other algorithm in its class can guarantee better performance on every possible input.

Algorithm

Input: an integer k, a monotone function S combining m ranked lists

Output: the top-k ⟨object, score⟩ pairs

  1. Do a sorted access in parallel in each list Li
  2. For each object o retrieved, do random accesses in the other lists Lj (j ≠ i), thus extracting all its partial scores sj(o)
  3. Compute the overall score S(o). If the value is among the k highest seen so far, remember o (in a buffer)
  4. Let $\underline{s}_i$ be the last score seen under sorted access in list Li
  5. Define the threshold $T = S(\underline{s}_1, \dots, \underline{s}_m)$
  6. If the score of the k-th object is worse than T, go to step 1
  7. Return the current top-k objects

TA operates by combining sorted accesses (to retrieve objects sequentially from each ranked list) and random accesses (to gather missing scores of partially seen objects). The algorithm iteratively refines a threshold score to determine when it can safely terminate.

  1. Parallel Sorted Access:

    • TA begins by sequentially retrieving objects from each ranked list using sorted access.
    • For each new object o encountered in any list, random accesses are made to obtain its scores from the other lists, ensuring that its global score can be computed.
  2. Score Computation:

    • The global score S(o) is calculated for each object by aggregating its partial scores using the monotone function S.
    • If o’s score is among the k highest seen so far, it is added to the top-k buffer.
  3. Threshold Update: The threshold is dynamically updated based on the scores of the last objects retrieved via sorted access in each list. The threshold score is given by $T = S(\underline{s}_1, \dots, \underline{s}_m)$, where $\underline{s}_i$ is the last score seen in list Li.

  4. Stopping Condition: TA terminates when the k-th highest score in the top-k buffer is no worse than the current threshold T. At this point, no unseen object can have a score higher than the worst score in the top-k buffer.

  5. Final Output: TA returns the k objects with the highest scores.

Example: Hotels in Paris with TA

Imagine a query to find the top k = 2 hotels in Paris based on price and rating, using the scoring function defined as $S(o) = 0.5 \cdot s_{\text{cheapness}}(o) + 0.5 \cdot s_{\text{rating}}(o)$.

Two ranked lists are available, one for cheapness and one for rating:

| Hotels | Cheapness | Hotels | Rating |
|---|---|---|---|
| Ibis | 0.92 | Crillon | 0.90 |
| Etap | 0.91 | Novotel | 0.90 |
| Novotel | 0.85 | Sheraton | 0.80 |
| Mercure | 0.85 | Hilton | 0.70 |
| Hilton | 0.825 | Ibis | 0.70 |
| Sheraton | 0.80 | Ritz | 0.70 |
| Crillon | 0.75 | Lutetia | 0.60 |

Steps of TA:

  1. Perform sorted accesses on both lists:

    • From the cheapness list: Ibis (0.92), Etap (0.91), Novotel (0.85), …
    • From the rating list: Crillon (0.90), Novotel (0.90), Sheraton (0.80), …
  2. Use random accesses to complete missing scores. Example: for Crillon, retrieve cheapness = 0.75 and rating = 0.90. Compute S(Crillon) = 0.825.

  3. Update the threshold T: after the third round of sorted accesses, the last scores seen are 0.85 (cheapness) and 0.80 (rating), so the threshold point is τ = (0.85, 0.80) and T = 0.5·0.85 + 0.5·0.80 = 0.825.

  4. Stop when the score of the k-th object in the buffer is no worse than T: here, the second-best score, 0.825, matches T, so TA stops.

Result:

| Top 2 Hotels | Score |
|---|---|
| Novotel | 0.875 |
| Crillon | 0.825 |
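A sketch of TA reproducing this run, under the same conventions as the FA sketch (dictionary lookups as random accesses, missing scores defaulting to 0.0):

```python
def ta(k, lists, agg):
    ra = [dict(l) for l in lists]
    buf = {}  # object -> exact global score, trimmed to the best k
    for depth in range(max(len(l) for l in lists)):
        last = []  # last score seen under sorted access in each list
        for l in lists:
            obj, s = l[depth]
            last.append(s)
            buf[obj] = agg([r.get(obj, 0.0) for r in ra])  # random accesses
        buf = dict(sorted(buf.items(), key=lambda kv: -kv[1])[:k])
        threshold = agg(last)  # T = S applied to the threshold point
        if len(buf) == k and min(buf.values()) >= threshold:
            return buf  # no unseen object can beat the current k-th score
    return buf

cheapness = [("Ibis", .92), ("Etap", .91), ("Novotel", .85), ("Mercure", .85),
             ("Hilton", .825), ("Sheraton", .80), ("Crillon", .75)]
rating = [("Crillon", .90), ("Novotel", .90), ("Sheraton", .80), ("Hilton", .70),
          ("Ibis", .70), ("Ritz", .70), ("Lutetia", .60)]
avg = lambda ss: sum(ss) / len(ss)

print(ta(2, [cheapness, rating], avg))
# {'Novotel': 0.875, 'Crillon': 0.825}, stopping after 3 rounds (T = 0.825)
```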

The threshold point τ bounds the best possible score of unseen objects based on the current state of sorted access. By design:

  • Objects fully seen (yellow region) have accurate scores and can be directly compared.
  • Partially seen objects (gray region) can only compete if their unknown partial scores allow them to dominate τ.
  • Unseen objects (blue region) cannot have scores exceeding the threshold, ensuring correctness.

The dashed red line separates the region of scores dominating τ from the rest, reinforcing that no unseen object can outperform the k-th best seen so far.

TA is highly efficient because it adapts to the specific scoring function and minimizes unnecessary accesses. The cost model for TA is expressed as:

$\text{cost} = SA \cdot c_{SA} + RA \cdot c_{RA}$

where:

  • SA and RA are the total numbers of sorted and random accesses, respectively.
  • $c_{SA}$ and $c_{RA}$ are the unit costs for sorted and random accesses.

In many real-world scenarios, random accesses (such as database lookups) are significantly more expensive than sorted accesses (such as index scans), meaning that the cost per random access ($c_{RA}$) is greater than the cost per sorted access ($c_{SA}$). In some extreme cases, random access may not be possible at all (i.e., $c_{RA} = \infty$), a case handled by the No Random Access (NRA) algorithm described below, which operates using only sorted accesses.

The main advantage of TA is its ability to efficiently compute top-k results while minimizing the number of random accesses, making it particularly suitable for applications with large datasets and complex scoring functions.

  • Instance Optimality: TA is optimal across all instances using both access types, unlike FA, which does not leverage scoring function properties.
  • Scoring Function Adaptation: TA dynamically adjusts its operations based on the monotone function S, ensuring efficient processing.
  • Lower Middleware Costs: By reducing unnecessary accesses, TA often outperforms FA in practical applications.

TA is ideal for applications where:

  • Random access costs are not prohibitively high.
  • The scoring function is monotone and well-defined.
  • Efficient use of resources (e.g., memory, access time) is crucial.

The No Random Access (NRA) Algorithm

The NRA (No Random Access) algorithm is designed for scenarios where random access to data is not feasible or practical.

Its primary goal is to identify the top-k objects from a set of m ranked lists while keeping the cost of computation as low as possible.

Unlike algorithms that rely on precise scores for ranking, NRA acknowledges uncertainty and computes bounds for scores to approximate the top- results.

Key concepts of NRA:

  1. Score Bounds:
    For each object o retrieved via sorted access, the algorithm maintains:

    • A lower bound, S⁻(o), which represents the minimum possible score of o based on the data retrieved so far (missing partial scores count as 0).
    • An upper bound, S⁺(o), which represents the maximum possible score of o (missing partial scores are bounded by the last value seen in the corresponding list).
  2. Buffer Management:
    The algorithm employs a buffer B with unlimited capacity to store objects retrieved during sorted access. This buffer is kept sorted in descending order of the objects’ lower bound scores, S⁻.

  3. Threshold Point:
    A threshold point τ, whose coordinates are the last scores seen under sorted access in each list, is maintained: S(τ) is an upper bound on the scores of unseen objects. This threshold provides an additional constraint for evaluating whether further iterations are necessary.

  4. Termination Condition:
    By maintaining a set Res of the current best k objects, the algorithm halts when it becomes clear that no object not in Res can outperform any of the objects within it. So

    $\min_{o \in Res} S^-(o) \;\geq\; \max\Big(\max_{o' \notin Res} S^+(o'),\; S(\tau)\Big)$

    i.e., the smallest lower bound in Res beats both the max value of S⁺ among the buffered objects not in Res and the score of the threshold point τ.

Algorithm

Input: an integer k, a monotone function S combining m ranked lists

Output: the top-k objects according to S

  1. Initialize:
    Begin with an empty buffer B and make a sorted access to each of the m ranked lists L1, …, Lm.

  2. Update Buffer:
    For each object o retrieved, calculate its lower bound S⁻(o) and upper bound S⁺(o). Store o in B, ensuring it remains sorted by S⁻.

  3. Iterative Refinement:
    Repeat the process of accessing the lists and updating the bounds until the termination condition is met. This ensures the algorithm identifies the top-k objects.

  4. Output:
    Return the set of k objects in B with the highest lower bound scores.
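A sketch of NRA for S = SUM on a small synthetic instance (illustrative data, not the example below); missing partial scores are bounded below by 0 and above by the last score seen in the corresponding list:

```python
def nra(k, lists):
    m = len(lists)
    partial = {}  # object -> {list index: seen score}
    for depth in range(max(len(l) for l in lists)):
        for i, l in enumerate(lists):
            obj, s = l[depth]
            partial.setdefault(obj, {})[i] = s
        bottom = [l[depth][1] for l in lists]  # last scores seen per list
        lower = {o: sum(d.values()) for o, d in partial.items()}
        upper = {o: sum(d.get(i, bottom[i]) for i in range(m))
                 for o, d in partial.items()}
        res = sorted(lower, key=lower.get, reverse=True)[:k]
        rest = [upper[o] for o in partial if o not in res]
        tau = sum(bottom)  # upper bound for completely unseen objects
        if len(res) == k and min(lower[o] for o in res) >= max(rest + [tau]):
            return [(o, lower[o], upper[o]) for o in res]
    return [(o, lower[o], upper[o]) for o in res]

l1 = [("o1", 0.9), ("o2", 0.8), ("o3", 0.3)]
l2 = [("o2", 0.7), ("o1", 0.6), ("o3", 0.5)]
print(nra(1, [l1, l2]))  # [('o1', 1.5, 1.5)] -- bounds met after 2 rounds
```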

The computational cost of the NRA algorithm does not always increase with larger values of k. In certain cases, identifying the top-k objects can require less work than finding the top-(k−1), due to the way score bounds are refined and objects are eliminated from consideration.

NRA is instance-optimal among algorithms that rely solely on sorted access and avoid random access. For any given input, it guarantees the best possible performance within its class, with an optimality ratio equal to m, the number of ranked lists being combined.

A distinctive aspect of NRA is its handling of uncertain scores. Since random access is not used, the algorithm may only be able to provide approximate scores for some objects in the final result, reflecting the uncertainty inherent in the available data.

Example of NRA in Action

Consider a scenario with three objects o1, o2, and o3 ranked in a number of lists:

  • k = 1:
    The algorithm identifies o1 as the top object. NRA requires accessing all but the last rank of the lists to confirm this result.

  • k = 2:
    To determine the top-2 objects (o1 and o2), only three rounds of access are necessary, as o2 quickly emerges as the second-best option.

Skyline Queries

Skyline queries are a powerful tool in multi-criteria decision-making, often used to identify the best options among a dataset according to multiple perspectives. They rely on the concept of dominance, which is central to their operation.

Intuitively, a tuple in a dataset is said to dominate another if it is equal or better across all dimensions and strictly better in at least one dimension. In practical terms, lower values are typically preferred (e.g., lower price or distance).

The result of a skyline query is the set of non-dominated tuples, representing the most desirable options under various criteria.

Definition

Mathematically, a tuple t dominates another tuple u, denoted $t \prec u$, if and only if (assuming lower values are preferred):

  1. For every dimension i (1 ≤ i ≤ d): t[i] ≤ u[i] (i.e., t is nowhere worse than u).
  2. For at least one dimension j (1 ≤ j ≤ d): t[j] < u[j] (i.e., t is strictly better in at least one aspect).

The set of all non-dominated tuples is called the skyline of the relation. This concept is also known as the Pareto-optimal set in multi-objective optimization or the maximal vectors problem in computational geometry. Graphically, in two dimensions, the skyline can resemble the contour of the dataset when visualized.

Skyline queries differ from top-k queries in that they do not rely on a single scoring function. Instead, they identify all potentially optimal tuples, regardless of any specific weighting of attributes. This flexibility makes them particularly useful when user preferences are not explicitly defined or when exploring trade-offs among attributes.

SQL Representation of Skyline Queries

Skyline queries are not part of the SQL standard, but proposed syntaxes have been developed for their implementation. A possible syntax might look like this:

SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
SKYLINE OF [DISTINCT] d1 [MIN | MAX | DIFF], ..., dm [MIN | MAX | DIFF]
ORDER BY ...

and a possible example of a skyline query could be:

SELECT * FROM Hotels WHERE city = 'Paris' 
SKYLINE OF price MIN, distance MIN

This query identifies hotels in Paris that are optimal based on the minimal price and distance criteria. Internally, this can be translated into standard SQL using subqueries and dominance conditions:

SELECT * FROM Hotels h WHERE h.city = 'Paris' AND NOT EXISTS (
    SELECT * FROM Hotels h1 WHERE h1.city = h.city 
        AND h1.distance <= h.distance AND h1.price <= h.price 
        AND (h1.distance < h.distance OR h1.price < h.price)
)

This translation explicitly checks for dominance by ensuring that no other tuple dominates the current one.

Algorithms for Computing Skylines

Skyline queries can be computed using different algorithms, each with distinct trade-offs in terms of efficiency and practicality.

  • The Block Nested Loop (BNL) algorithm is simple and intuitive, iteratively comparing each point to those already identified as potential skyline candidates. However, its quadratic time complexity (O(n²)) makes it unsuitable for large datasets.
  • In contrast, the Sort-Filter-Skyline (SFS) algorithm first sorts the data using a monotone function, which allows skyline points to be identified more efficiently during a single pass. This reduces the computational complexity to O(n log n), making SFS preferable for larger datasets.

While BNL is easier to implement and requires no pre-processing, SFS offers significant performance improvements by leveraging sorting and early elimination of dominated points.

| Algorithm | Approach | Strengths | Limitations | Complexity |
|---|---|---|---|---|
| BNL | Compare each point to the current skyline set | Simple, no pre-processing | Slow for large datasets | O(n²) |
| SFS | Sort, then filter for skyline points | Efficient for large data | Requires sorting upfront | O(n log n) |

Block Nested Loop (BNL)

The Block Nested Loop (BNL) algorithm is a straightforward approach to computing skylines.

Algorithm

Input: a dataset D of multi-dimensional points

Output: the skyline of D

  1. Initialize the skyline set W as an empty set.
  2. For every point p in D, if p is not dominated by any point in W, then:
    1. Remove from W the points dominated by p.
    2. Add p to W.
  3. Return the skyline set W.

While conceptually simple, BNL has a computational complexity of O(n²), where n is the number of points in the dataset. This makes it inefficient for large datasets.
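A minimal sketch of BNL under min-dominance (lower is better on every axis):

```python
def dominates(p, q):
    """p dominates q: nowhere worse, strictly better at least once."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def bnl(points):
    window = []  # current skyline candidates
    for p in points:
        if not any(dominates(w, p) for w in window):
            # p survives: evict the candidates it dominates, then keep it
            window = [w for w in window if not dominates(p, w)] + [p]
    return window

print(bnl([(3, 4), (1, 2), (2, 1), (2, 2), (5, 1)]))  # [(1, 2), (2, 1)]
```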

Sort-Filter-Skyline (SFS)

The Sort-Filter-Skyline (SFS) algorithm improves upon BNL by leveraging pre-sorting. By sorting the dataset according to a monotone function of its attributes, SFS ensures that no later tuple can dominate an earlier one.

Algorithm

Input: a dataset D of multi-dimensional points

Output: the skyline of D

  1. Sorting Phase: The dataset is sorted based on a monotone function, such as the sum of all attribute values. This ensures that points are processed in an order that facilitates skyline identification.
  2. Initialize a skyline set W as an empty set.
  3. For every point p in the sorted dataset D, if p is not dominated by any point in W, then add p to the skyline set W.
  4. Return the skyline set W.

Pre-sorting reduces unnecessary comparisons and allows immediate identification of skyline points during iteration: since no later point can dominate an earlier one, a point added to W is never removed. The computational complexity is reduced to O(n log n), making it more practical for larger datasets.

Example

Consider again the example of hotels in Paris. Using SFS, we can efficiently compute the skyline by first sorting the hotels by the sum of their (normalized) price and complaints, where lower is better for both.

| Hotel | Price | Complaints |
|---|---|---|
| Ibis | 0.08 | 0.3 |
| Novotel | 0.15 | 0.1 |
| Hilton | 0.175 | 0.3 |
| Crillon | 0.25 | 0.1 |
| Sheraton | 0.2 | 0.2 |

The list sorted by the sum of price and complaints would look like this:

| Hotel | Price | Complaints | Sum ⬆️ |
|---|---|---|---|
| Novotel | 0.15 | 0.1 | 0.25 |
| Crillon | 0.25 | 0.1 | 0.35 |
| Ibis | 0.08 | 0.3 | 0.38 |
| Sheraton | 0.2 | 0.2 | 0.4 |
| Hilton | 0.175 | 0.3 | 0.475 |

Then, scanning the sorted list, we add each hotel to the window W if it is not dominated by any point already in W (with this sorting, no removals are ever needed). The final window of skyline points includes Novotel and Ibis, as they are not dominated by any other hotel in terms of both price and complaints.

| Hotel | Price | Complaints |
|---|---|---|
| Novotel | 0.15 | 0.1 |
| Ibis | 0.08 | 0.3 |
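A sketch of SFS reproducing this skyline; sorting by the attribute sum guarantees that no later point can dominate an earlier one, so points are only ever appended:

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def sfs(points):
    skyline = []
    for p in sorted(points, key=sum):  # monotone pre-sorting phase
        if not any(dominates(w, p) for w in skyline):
            skyline.append(p)          # once added, never evicted
    return skyline

hotels = {"Ibis": (0.08, 0.3), "Novotel": (0.15, 0.1), "Hilton": (0.175, 0.3),
          "Crillon": (0.25, 0.1), "Sheraton": (0.2, 0.2)}
sky = sfs(list(hotels.values()))
print([name for name, p in hotels.items() if p in sky])  # ['Ibis', 'Novotel']
```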

Strengths and Limitations of Skyline Queries

Skyline queries are highly effective for identifying potentially interesting objects when user preferences are unknown or undefined. They provide a comprehensive overview of all non-dominated options, requiring no parameters or pre-specified scoring functions. However, this agnostic approach comes with limitations.

For large datasets, particularly those with anti-correlated attributes, the number of skyline points can become excessively large, making the results difficult to interpret. Additionally, the computational cost remains high for quadratic algorithms like BNL. While SFS mitigates this to some extent, it may still struggle with extremely large datasets or complex dominance relationships.

The k-Skyband

To address the issue of excessive result sizes, the concept of k-skybands has been introduced. A k-skyband includes all tuples that are dominated by fewer than k other tuples. This allows for finer control over the size of the result set. For example:

  • The skyline is equivalent to the 1-skyband.
  • A top-k result set (under any monotone scoring function) is always contained within the k-skyband.

By adjusting k, users can balance the comprehensiveness of the results with their cardinality.
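A minimal sketch computing the k-skyband by counting dominators (quadratic, in the spirit of BNL, with min-dominance as before):

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyband(points, k):
    # keep the points dominated by fewer than k other points
    return [p for p in points if sum(dominates(q, p) for q in points) < k]

pts = [(0.15, 0.1), (0.25, 0.1), (0.08, 0.3), (0.2, 0.2), (0.175, 0.3)]
print(skyband(pts, 1))  # the skyline: [(0.15, 0.1), (0.08, 0.3)]
print(skyband(pts, 2))  # adds (0.25, 0.1) and (0.2, 0.2)
```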

Comparing Skyline Queries to Ranking Queries

Skyline queries and ranking queries serve different purposes and are suited to different scenarios. While skyline queries provide a broad overview of all potentially optimal options, ranking queries focus on identifying a small, prioritized subset based on a specific scoring function.

| Aspect | Ranking Queries | Skyline Queries |
|---|---|---|
| Simplicity | Moderate complexity | Very simple |
| Overview of interesting results | Limited | Comprehensive |
| Control of result size | Yes | No |
| Trade-offs among attributes | Explicit control | Implicit (dominance) |

Ranking queries are ideal when user preferences are well-defined and a limited result set is desired. Skyline queries excel in exploratory analysis, where trade-offs among attributes need to be understood without committing to a specific scoring function.