Performance in computing can be viewed from two key perspectives: the purchasing perspective and the design perspective.

  • From the purchasing perspective, the focus is on selecting the most efficient machine from a collection of available options. This involves evaluating which machine offers the best performance for the least cost or the best performance-to-cost ratio. The goal is to balance the cost of the machine with the performance it can deliver.

  • In contrast, the design perspective applies when choosing the most efficient design or architectural option for a system being built. This involves evaluating which design delivers the greatest performance improvement for the least cost, or the best performance-to-cost ratio.

Both of these perspectives require clear comparison metrics and consolidated evaluation criteria to make well-informed decisions.

When discussing performance, we typically focus on two distinct metrics: execution time and throughput.

Definition

  1. Execution Time (Latency): This refers to the time it takes to complete a task. Execution time, also known as response time or latency, is the period it takes from the initiation of a task until its completion. Lower execution time is usually desirable, as it indicates faster processing.

  2. Throughput (Performance): Throughput refers to the number of tasks (or operations) that can be completed within a specific time frame, such as per second, minute, or hour. It is often referred to as performance or bandwidth. High throughput is desirable as it indicates a system’s ability to handle a greater volume of tasks.

However, latency and throughput are often inversely related.

For example, a system that can process a large number of tasks (high throughput) may experience higher latency, meaning that it takes longer to complete individual tasks. Conversely, a system optimized for low latency may struggle with processing a high volume of tasks at once.

Performance Improvement and Execution Time

When discussing the relationship between the performance of two systems, we often express this as a percentage improvement. If we say that “system $X$ is $n\%$ faster than system $Y$”, this means that the execution time of system $Y$ is $\left(1 + \frac{n}{100}\right)$ times the execution time of system $X$. Essentially, performance is the inverse of execution time. This means that:

  • A performance improvement is an increase in the system’s ability to execute tasks, represented by a decrease in execution time.
  • A decrease in execution time is a direct indication of improved performance. Therefore, the goal is often to minimize execution time while maximizing throughput.

The relationship between execution time and performance can be formalized in the following formulas:

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = 1 + \frac{n}{100}$$
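
As a quick numeric check (values chosen purely for illustration): if system $X$ runs a program in 10 s and system $Y$ takes 15 s, then

$$\frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = \frac{15}{10} = 1.5$$

so $X$ is $50\%$ faster than $Y$.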

Another important factor in performance evaluation is the clock cycle. The clock cycle time ($T_{clk}$) is the duration of a single clock cycle in seconds, while the clock frequency ($f_{clk}$) refers to the number of clock cycles per second. These two are inversely related, with the clock frequency being the reciprocal of the clock cycle time: $f_{clk} = \frac{1}{T_{clk}}$.
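
For instance (an illustrative value), a processor clocked at $f_{clk} = 2\,\text{GHz}$ has

$$T_{clk} = \frac{1}{2 \times 10^9\,\text{Hz}} = 0.5\,\text{ns}$$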

Execution Time or CPU Time

The execution time (or CPU time) is the total amount of time required to execute a program, which can be expressed as the product of the clock cycles and the clock cycle time:

$$\text{CPU Time} = \text{Clock Cycles} \times T_{clk}$$

To optimize the performance of a system, we need to focus on reducing the execution time, which can be achieved by:

  • Reducing the number of clock cycles needed to execute a program.
  • Decreasing the clock cycle time ($T_{clk}$).
  • Increasing the clock frequency ($f_{clk}$).

Formula for CPU Time

The formula for CPU time can be expressed in another way, focusing on the number of instructions executed and the number of clock cycles per instruction:

$$\text{CPU Time} = \text{IC} \times \text{CPI} \times T_{clk}$$

Where:

  • Instruction Count (IC) refers to the total number of instructions executed by the program.
  • CPI (Clock Cycles Per Instruction) is the number of clock cycles required to execute a single instruction.
  • $T_{clk}$ is the clock cycle time.

Alternatively, CPU time can also be expressed as:

$$\text{CPU Time} = \frac{\text{Total Clock Cycles}}{f_{clk}}$$

where the total clock cycles are calculated as:

$$\text{Total Clock Cycles} = \sum_{i} \text{IC}_i \times \text{CPI}_i$$

with $\text{IC}_i$ and $\text{CPI}_i$ denoting the instruction count and CPI of each instruction class $i$.
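
To make these formulas concrete, here is a minimal Python sketch (the clock rate and instruction mix are made-up values, not data from any real processor) that computes CPU time from per-class instruction counts and CPIs:

```python
# Minimal sketch: CPU time from instruction classes (illustrative values).
clock_rate_hz = 2e9          # f_clk = 2 GHz
t_clk = 1 / clock_rate_hz    # T_clk = 0.5 ns

# (instruction count, CPI) per instruction class -- assumed numbers.
instruction_mix = {
    "alu":    (50_000_000, 1),
    "load":   (20_000_000, 2),
    "store":  (10_000_000, 2),
    "branch": (20_000_000, 3),
}

# Total Clock Cycles = sum over classes of IC_i * CPI_i.
total_cycles = sum(ic * cpi for ic, cpi in instruction_mix.values())
cpu_time = total_cycles * t_clk

total_ic = sum(ic for ic, _ in instruction_mix.values())
avg_cpi = total_cycles / total_ic

print(f"Total cycles: {total_cycles:,}")        # 170,000,000
print(f"Average CPI:  {avg_cpi:.2f}")           # 1.70
print(f"CPU time:     {cpu_time * 1e3:.1f} ms") # 85.0 ms
```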

Several factors affect the CPU time, which in turn determines system performance. These include the instruction count, the CPI, and the clock rate.

| Factor | Affects Instruction Count | Affects CPI | Affects Clock Rate |
| --- | --- | --- | --- |
| Program | ✔️ | | |
| Compiler | ✔️ | ✔️ | |
| Instruction Set | ✔️ | ✔️ | ✔️ |
| Organization | | ✔️ | ✔️ |
| Technology | | | ✔️ |

MIPS: Millions of Instructions Per Second

The MIPS (Millions of Instructions Per Second) metric measures the number of millions of instructions a processor can execute per second. It can be calculated as:

$$\text{MIPS} = \frac{\text{IC}}{\text{Execution Time} \times 10^6} = \frac{f_{clk}}{\text{CPI} \times 10^6}$$

While MIPS can be a useful measure, it does not directly correlate to system performance since it does not account for the complexity of individual instructions or the time taken for each. Therefore, it is essential to consider the context and other performance factors when interpreting MIPS values.
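
As a quick illustration (assumed values): a 2 GHz processor with an average CPI of 2 achieves

$$\text{MIPS} = \frac{2 \times 10^9}{2 \times 10^6} = 1000$$

yet a machine with a higher MIPS rating may still be slower in practice if its instruction set requires more instructions to accomplish the same task.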

Amdahl’s Law

Amdahl’s Law provides a theoretical formula for predicting the speedup in the latency of executing a task when certain system resources are improved. The law highlights the impact of optimizing a portion of the system, emphasizing that the overall improvement is constrained by the fraction of time spent on the enhanced resource.

Formula for Speedup

To define the speedup introduced by an enhancement $E$, we use the following formula:

$$\text{Speedup}(E) = \frac{\text{Execution Time}_{\text{without } E}}{\text{Execution Time}_{\text{with } E}} = \frac{\text{Performance}_{\text{with } E}}{\text{Performance}_{\text{without } E}}$$

This formula compares the performance of a system before and after applying an enhancement.

Assume an enhancement speeds up a fraction $F$ of the execution time of the task by a factor $S$. The remaining portion of the task, which is unaffected by the enhancement, remains at its original execution time. The overall speedup, $\text{Speedup}_{\text{overall}}$, can be expressed as:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - F) + \frac{F}{S}}$$

where:

  • $F$ is the fraction of the task that benefits from the enhancement.
  • $S$ is the speedup factor of the enhanced portion.
  • $(1 - F)$ is the fraction of the task that remains unaffected.

This formula illustrates that the speedup is limited by the portion of the task that cannot be enhanced. Even if $S$ (the speedup of the enhanced part) is very large, if $F$ (the fraction of the task that benefits from the enhancement) is small, the overall speedup will be modest: in the limit $S \to \infty$, the overall speedup approaches $\frac{1}{1 - F}$.

The Limitation of Amdahl’s Law

A key insight from Amdahl’s Law is that the performance improvement achievable from using a faster execution mode is limited by the proportion of time that the faster mode can be applied. Specifically, the overall speedup is constrained by two factors:

  1. The fraction of the computation time in the original system that can be optimized with the enhancement.
  2. The improvement factor ($S$) provided by the enhancement itself.

If only a small fraction of the task is accelerated, then the potential improvement in overall execution time is limited, regardless of how much the enhancement improves the accelerated portion.

Example of Amdahl's Law

Consider a system where 60% of the task can be sped up by a factor of 10, while the remaining 40% of the task cannot be enhanced. According to Amdahl’s Law, the overall speedup is calculated as:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.6) + \frac{0.6}{10}} = \frac{1}{0.4 + 0.06} = \frac{1}{0.46} \approx 2.17$$

Thus, even with a significant speedup (a factor of 10) for 60% of the task, the overall system is only about 2.17 times faster than the original.
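
A minimal Python sketch of Amdahl’s Law (the function name and input values are illustrative) that reproduces this example:

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of execution time is sped up by s."""
    return 1 / ((1 - f) + f / s)

# The example above: 60% of the task sped up by a factor of 10.
print(amdahl_speedup(0.6, 10))   # ~2.17

# The unenhanced fraction bounds the gain: even as s grows without
# bound, the speedup cannot exceed 1 / (1 - f).
print(amdahl_speedup(0.6, 1e9))  # ~2.5 = 1 / 0.4
```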

Basis of Evaluation

Performance evaluation is a difficult task: it depends on many different factors and can be performed in many different ways.

| Evaluation Method | Pros | Cons |
| --- | --- | --- |
| Actual Target Workload | Representative | Very specific; non-portable; difficult to run or measure; hard to identify the cause of performance effects |
| Full Application Benchmarks | Portable; widely used; improvements useful in practice | Less representative; easy to “fool” |
| Small “Kernel” Benchmarks | Easy to run, early in the design cycle | Easy to “fool” |
| Microbenchmarks | Identify peak capability and potential bottlenecks | “Peak” may be a long way from application performance |

Performance Evaluation in Pipelined Processors

In a pipelined processor, the execution of a single instruction may take slightly longer than in an unpipelined processor because of the time needed to manage and synchronize different pipeline stages. Additionally, the slowest stage in the pipeline determines the overall clock period, meaning that if there is an imbalance in the time taken by different stages, the clock period cannot be reduced below the time taken by the slowest stage. This imbalance, along with other overheads such as delays introduced by pipeline registers and clock skew, contributes to the overall performance limitations of pipelining.

Despite the increase in latency for individual instructions, pipelining improves throughput by allowing multiple instructions to be processed simultaneously, which leads to an overall increase in the number of instructions completed in a given amount of time.

Pipeline Speedup

The speedup due to pipelining can be evaluated by comparing the average execution time of an unpipelined processor with that of a pipelined processor. The formula for pipeline speedup is given by:

$$\text{Speedup} = \frac{\text{Average instruction time}_{\text{unpipelined}}}{\text{Average instruction time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times T_{clk,\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times T_{clk,\text{pipelined}}}$$

where:

  • $\text{CPI}$ refers to cycles per instruction.
  • $T_{clk}$ refers to the clock cycle time.

If we assume that the pipeline stages are perfectly balanced and that pipelining introduces no significant overhead, the clock cycle times of the unpipelined and pipelined processors can be considered equal, and the ideal pipelined CPI is 1. In this case, the speedup formula simplifies to:

$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Stall cycles per instruction}}$$

Here, stall cycles per instruction represents the additional cycles that are needed when the pipeline experiences delays, such as when waiting for data dependencies or branch instructions.

Ideal Pipeline Speedup

In the ideal case, where there are no pipeline stalls (i.e., no delays between instructions), the speedup is equal to the depth of the pipeline. This is because each instruction can move to the next stage of the pipeline every clock cycle, leading to a theoretical maximum speedup of:

$$\text{Speedup}_{\text{ideal}} = \text{Pipeline depth}$$

For example, if a processor has a 5-stage pipeline, the theoretical speedup, assuming ideal conditions, would be 5.

Impact of Branch Instructions

One of the main factors that limit the speedup in pipelined processors is the handling of branch instructions. Branch instructions cause pipeline stalls because the processor must wait to determine the outcome of the branch before proceeding with subsequent instructions. This delay is often referred to as a branch penalty.

When we factor in the performance impact of branches, the pipeline speedup formula becomes:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

where:

  • Branch Frequency is the percentage of instructions that are branches.
  • Branch Penalty is the number of cycles lost due to branch instructions.

In real-world scenarios, branch penalties can significantly affect the speedup provided by pipelining. If branches occur frequently and have a large penalty, the speedup will be lower than the ideal case.
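
As a rough numeric sketch (the pipeline depth, branch frequency, and branch penalty below are assumed values, not measurements), the formulas above can be evaluated as follows:

```python
def pipeline_speedup(depth: int, branch_freq: float = 0.0,
                     branch_penalty: float = 0.0) -> float:
    """Pipeline speedup over an unpipelined processor.

    With branch_freq = 0 this reduces to the ideal speedup,
    i.e., the pipeline depth.
    """
    return depth / (1 + branch_freq * branch_penalty)

# Illustrative numbers: a 5-stage pipeline where 20% of the
# instructions are branches, each costing 3 stall cycles.
print(pipeline_speedup(5))          # ideal: 5.0
print(pipeline_speedup(5, 0.2, 3))  # 5 / (1 + 0.6) = 3.125
```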