Abstract

Nowadays, many new low-power ASIC applications have emerged. This market trend has made the designer’s task of meeting timing and routability requirements within the power budget more challenging. One of the major sources of power consumption in modern integrated circuits (ICs) is the interconnect. In this paper, we present a novel Power and Timing-Driven global Placement (PTDP) algorithm. Its principle is to wrap a commercial timing-driven placer with a net weighting mechanism that calculates net weights based on their timing and power consumption. The newly calculated weight drives the placement engine to place the cells connected by power-critical or timing-critical nets close to each other, which reduces the parasitic capacitance of the interconnects and, consequently, improves the timing and power consumption of the design. This approach not only improves the design power consumption but also facilitates routability, with only a minor impact on the timing closure of a few designs. Experiments carried out on 40 industrial designs of different nodes, sizes, and complexities demonstrate that the proposed algorithm achieves significant improvements in Quality of Results (QoR) compared with a commercial timing-driven placement flow. We reduce the interconnect power by an average of 11.5%, which leads to a total power improvement of 5.4%, timing improvements of 9.4% in Worst Negative Slack (WNS) and 13.7% in Total Negative Slack (TNS), and a total wirelength reduction of 3.2%.

1. Introduction

Power consumption has become a major concern in diverse areas. Industrial and home appliances, automotive applications, telecommunication circuits, and high-speed superscalar microprocessors are the primary drivers of the low-power IC market. To keep up with this growing market demand, modern System-on-Chip (SoC) implementation has become very challenging as CMOS technology is scaled into the nanometer regime. Scaling tremendously benefits the performance of the cells (propagation delay and power consumption) as transistors shrink, but it significantly worsens the interconnect delay because the wire resistance and capacitance increase. Thus, interconnects are now a determining parameter of circuit performance.

In recent technology nodes, the interconnect power exceeds the standard cell power and becomes dominant in the overall circuit performance (speed and power) [1]. A large amount of interconnect power is dissipated as heat and causes temperature fluctuation. This degrades the circuit performance and reliability and affects its behavior and functionality. Consequently, to meet the market requirements and to increase circuit performance and reliability, interconnect power dissipation should be optimized as much as possible in all development phases.

One of the main steps in the physical design process is the placement stage. The quality of its outcome has a big impact on the circuit power consumption, and it is an essential step to achieve design timing closure. The objective of traditional placers was total wirelength reduction, as reducing the total wirelength helps timing and routability closure. However, it has been shown that minimizing the wirelength alone is insufficient to close the design timing and misses many opportunities to improve the timing in the placement stage, which leaves a harder problem for subsequent stages. Therefore, timing optimization techniques were incorporated within the placer, giving birth to a new placement approach called timing-driven placement. Timing-Driven Placement (TDP) techniques are divided into two main branches, Global [2–4] and Incremental [5–7].

A global placement algorithm solves the placement problem from scratch, which means that none of the cells has been assigned to a specific location. Its objective is to close timing by assigning high weights to timing-critical nets and trying to place the cells connected by these nets close to each other. Placing the cells on critical paths close to each other reduces not only the net delay but also the cell delay, since the cells have to drive a smaller load. The limitation of the global placement approach is that it may create new violations while trying to fix the original ones, and it may also create congestion hotspots that complicate the routing stage. Incremental placement aims to improve the global placement solution by recalculating the placement of the cells on timing-violating paths.

Most of the available TDP approaches focus only on reducing the timing of critical paths [1, 2, 4, 6]. Their general principle is to reduce the interconnect wirelength which reduces their equivalent RC component and hence the overall signal propagation delay. But as the technology advances, the total resistance (R) and capacitance (C) of interconnect have increased significantly due to the decreasing width and spacing between the wires and have become a dominant factor affecting the dynamic power consumption. This new physical constraint should be taken into consideration as early as possible in the design development cycle.
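To make the link between wirelength, delay, and switched capacitance concrete, the sketch below uses the first-order distributed RC (Elmore) approximation for an unbuffered wire, in which the delay grows with the square of the length while the capacitance grows linearly. The per-unit-length resistance and capacitance values are illustrative assumptions, not data from the designs studied in this paper.

```python
def wire_delay(length_um, r_per_um, c_per_um):
    """Elmore delay of a distributed RC line: 0.5 * r * c * L^2 (seconds)."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2


def wire_capacitance(length_um, c_per_um):
    """Total wire capacitance (farads), linear in the wire length."""
    return c_per_um * length_um


# Assumed, illustrative per-unit-length parasitics.
R_PER_UM = 1.0        # ohm per micrometer (assumption)
C_PER_UM = 0.2e-15    # farad per micrometer (assumption)

long_delay = wire_delay(200.0, R_PER_UM, C_PER_UM)
short_delay = wire_delay(100.0, R_PER_UM, C_PER_UM)
long_cap = wire_capacitance(200.0, C_PER_UM)
short_cap = wire_capacitance(100.0, C_PER_UM)

print(f"delay ratio (200 um vs 100 um): {long_delay / short_delay:.2f}")
print(f"capacitance ratio (200 um vs 100 um): {long_cap / short_cap:.2f}")
```

Under this model, halving the length of a net quarters its wire delay and halves the capacitance that must be charged on every transition, which is why shortening a net can help both timing and dynamic power.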

Recently, more attention has been given to reducing net power consumption from the placement stage onward, but most research has focused on clock network power reduction. Reference [8] proposed an algorithm for register clustering to reduce the clock wirelength and hence the power consumption. Reference [9] developed an approach to enable the synthesis of a low-power clock network by modifying the register placement. A new method of register placement calculation was proposed in [10, 11] to cluster the registers in a few groups after the first placement round to decrease the clusters proportionally in order to reduce the wirelength and potential skew. Considering only the clock network and moving the registers without taking into account the full timing picture leads to power and timing degradation in data nets, which may exceed the gain achieved in the clock tree [12].

In this paper, we give more attention to the signal nets. In the 40 industrial designs that we used to measure the power dissipation components, we noticed that clock net power represents only ~15% of total net power consumption; hence, more work should be done to reduce the data net power consumption (75% of total nets power). To do so, we addressed the weight calculation problem, which was based only on the delay in traditional TDP algorithms. We have changed its formula to include not only the timing but also the power factor in a nonlinear formula that prioritizes the timing-critical and power-critical nets exponentially. In this new approach, the input is a circuit in the preplacement stage, which means that the floor-planning and the power routing are already done. We apply the newly calculated weights and run the design through the placement and routing stages to examine the impact of the new placement not only on the timing and power but also on the design routability.

This paper is organized as follows: Section 2 presents the related works; Section 3 explores the adopted new weighting formula and its integration in the P&R flow to reduce timing and power simultaneously. An evaluation on a small multivoltage design to illustrate the concept and to calibrate the algorithm is also presented in Section 3. Section 4 discusses the experimental results gathered after comparing the results of 40 industrial designs. Finally, Section 5 draws the conclusion and the perspectives.

2. Related Works

TDP is a crucial step to achieve design closure in the standard cell place and route ASIC flow. It can be categorized into two types: net-based [2–4] and path-based or incremental TDP [5–7] algorithms.

The net-based approach prioritizes nets based on their timing criticality, and the prioritization is done by weights assignment which represents the attraction forces between circuit nodes. In [13], the authors proposed a systematic net weighting adjustment algorithm that is runtime and wirelength effective but did not evaluate timing and power impact. The power limitation was treated in [14, 15] where the net switching activity was used to drive the placement engine to be power aware (PDP). It assigns a net weight proportional to the product of the switching rate and the pin count. Significant power and heat reduction were found, but no timing, area, or routing statistics were provided to assess the global impact. Another approach was presented in [16]; it proposed an activity based net weighting that reduces net switching power by assigning a linear combination of activity and timing weights to the nets with higher switching rates or more critical timing and was able to achieve a power reduction of 11.4% but has degraded the timing by 2% and the area by 1.2%. Also, [16] calculates the net weights using the switching activity instead of the total net power, which may be misleading in some cases, since the net power is a factor of the frequency and the voltage also.

After weight assignment, the layout area is partitioned into several global bins. All cells of the circuit will be distributed into these global bins to minimize certain placement objectives, such as wirelength, timing, power, or routing congestion. During global placement, constraints, such as design rules, are relaxed to speed up the process. If a cell is distributed into a particular global bin, it will be placed within the area of this bin in the final layout. As we proceed to finer levels, the number of global bins increases and the physical size of global bins decreases. Thus, we can get more and more detailed information about the physical locations of cells as we proceed. It terminates when there are only a few cells in each global bin.
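The refinement schedule described above can be sketched as follows. The sketch assumes that every bin is split into 2×2 sub-bins at each level (quadrisection) and that refinement stops once the average number of cells per bin falls below a threshold; both the splitting scheme and the threshold value are assumptions made for illustration, not details of any particular placer.

```python
def bin_levels(num_cells, die_width, die_height, max_cells_per_bin=4):
    """Return one (level, bin_count, bin_width, bin_height) tuple per level.

    Assumes quadrisection: level k has 2^k x 2^k equally sized bins.
    Refinement stops when the average cell count per bin is small enough.
    """
    levels = []
    level = 0
    while True:
        bins_per_side = 2 ** level
        bin_count = bins_per_side ** 2
        levels.append((level,
                       bin_count,
                       die_width / bins_per_side,
                       die_height / bins_per_side))
        if num_cells / bin_count <= max_cells_per_bin:
            break
        level += 1
    return levels


# Hypothetical example: 1000 cells on a 1000 x 1000 die.
schedule = bin_levels(num_cells=1000, die_width=1000.0, die_height=1000.0)
for level, bin_count, width, height in schedule:
    print(f"level {level}: {bin_count} bins of {width} x {height}")
```

Each level multiplies the number of bins by four and halves their dimensions, so the location of every cell is known with increasing precision as the hierarchy deepens.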

The main advantage of this type of algorithm is its capacity to handle a huge number of nets simultaneously. However, the timing picture gets less accurate during the algorithm iterations, since it does not account for the impact of neighboring gates on the placed cell, which results in new timing violations. Resolving the remaining timing violations is the objective of the path-based approach. It straightens the critical paths to reduce their wirelength, which reduces their parasitic capacitance and resistance and hence improves the signal propagation on the critical nets.

For net-based global placement, Lagrangian relaxation based algorithms, to optimize both performance and total wirelength, were introduced to embed the timing in the problem formulation along with the spreading constraints and showed significant timing improvement compared with commercial wirelength-driven placement flows [17]. A new TDP variation was introduced in [18]; it uses a smooth timing analysis that constructs the timing-cost function as a smooth function of cell placement and uses a net routing algorithm to provide accurate wire delay. While most proposed algorithms focus on late slack optimization, [19] came up with an algorithm to improve the early slack while preserving an optimized late slack by accurately predicting the optimal Steiner tree topologies after each move in the TDP algorithm.

For incremental global placement, a flow was proposed in [20] to address routability issues from the placement stage; it proposes a routing-aware incremental timing-driven placement technique to reduce early and late negative slacks while considering global routing congestion. Another formulation of the problem was proposed in [21]; it presents a quadratic formulation for incremental timing-driven placement that includes the delay model in the quadratic objective function and performs path smoothing by optimizing the distance between neighboring critical pins, which outperformed the state-of-the-art results in terms of timing closure.

As has been shown, TDP has been an active research field in the last decade, and researchers are trying to find a balance between the timing, the power consumption, and the routability requirements. To overcome the limitations of the previously presented approaches such as [14–16], we calculate the net weights using the real estimated total net power, which is a function of not only the switching activity but also the voltage domain and the operating frequency. Also, instead of the linear weighting equation proposed in [16], we use an exponential formula to give more priority to the nets with high power or timing impact. As a further enhancement over [16], we include the fanout number in the weight calculation to reduce the congestion and to improve the routability of the designs.

3. The Proposed Power and Timing-Driven Placement Flow

In the advanced technology nodes with a very high density and more than 10 metal layers, the parasitic resistance and capacitance become a limiting factor for speed and power consumption. Traditional physical design flows focus on timing and Design Rule Constraints (DRC) closure and leave the power optimization until the end of the flow. The most known techniques for power reduction in P&R phase are gates sizing, equivalent pin reordering, and high-voltage threshold (HVT) cells usage.

Gate sizing consists of substituting big cells in the subcritical path by smaller equivalent gates that satisfy the delay requirement. Such technique is widely used in the industry for timing, area, and power optimization.

Equivalent pin reordering involves connecting the input with high capacitance to the net with low switching activity. Logically equivalent pins may not have identical circuit characteristics, which means that the pins have a different delay or power consumption. Such property is exploited for low power design. Also, by using HVT cells, the amount of charges stored into the parasitic capacitances of the transistors is reduced and hence the dissipated power.

By relying only on such power reduction techniques, the traditional P&R flow misses some very good opportunities for power reduction and leaves the designer with a limited set of solutions to reduce the power consumption. Taking the power constraint into account from the placement stage allows more of the power reduction to be obtained in early P&R stages and helps to meet the allocated power budget more easily at the design tape-out phase.

3.1. Description of the New PTDP Flow

The main focus of our approach is to minimize power consumption and timing violations simultaneously while placing the design, without degrading the routability, which is represented by the congestion metrics. We have developed a weighting algorithm that wraps industrial net-based and path-based TDP flows. It relies on two major stages: (I) running the net-based TDP wrapped by the weighting process and (II) running the path-based TDP wrapped by the weighting process. The routing flow is then run to assess the QoR. The baseline flow is the same flow without the new weighting process.

The net-based, path-based TDPs, Timer, RC Extractor, Power Analysis, and Router used in our experiments are Nitro-SoC™ internal engines [22].

The outline of our PTDP flow is illustrated in Figure 1. The proposed flow improves the timing and the power while considering the routing congestion estimated by the global routing server. In the proposed flow we run first the weighting algorithm to generate and apply the nets weights, and then we call the net-based TDP to do a global placement based on the applied weights. A pass of legalization, global routing, and timing update is made at this stage to estimate the timing and routability impact. A second call of the weighting algorithm is executed to calculate the new weights based on the updated timing, power, and congestion pictures. Then, a second incremental (path-based) TDP pass is performed to improve the delay and the power consumption followed by a global routing call to repair the routing topology. A final route track and search and repair pass is needed to confirm that the estimated results persist after detail routing stage.

The difference between the two weights applied before global placement and the incremental placement is that the wire capacitance is not taken into account in the first weight calculation, while it is estimated based on the global routing topology in the second pass. In the rest of this section, we will discuss in detail the proposed weighting algorithm.

3.2. Description of the New Nets Weighting Algorithm

In this section, we describe the detailed implementation of the net weighting algorithm, which is the main contribution of our paper. It combines the timing, power, and fanout information in an exponential formulation to calculate the net weights. The objective is to generate good results in terms of timing and power without impacting the design routability.

Our algorithm starts by filtering the signal nets and estimating their power. It analyses power consumption by calling the Nitro-SoC™ internal power estimation engine, which uses the design constraints and a switching activity annotation file (SAIF) that profiles the net switching activity. If the SAIF file is not available, the user can set the toggle rate of the primary inputs; the tool then propagates it through the netlist and performs a probabilistic simulation internally to estimate the switching activity of each net in the design. The main advantage of using the net power instead of the switching activity is that the switching activity (used in [23, 24]) may be misleading for a power reduction objective, since the power also depends on the voltage and the operating frequency.
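The point about switching activity alone being misleading can be illustrated with the standard first-order expression for dynamic switching power, P = a · C · V² · f, where a is taken here as the probability of a power-consuming transition per clock cycle. This is a textbook approximation and not the power engine used in this work; the activity and capacitance values below are assumptions, while the two supply voltages and the clock frequency are borrowed from the testcase of Section 3.3.

```python
def net_dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order dynamic power of a net: a * C * V^2 * f (watts)."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Two nets with identical switching activity and load (assumed values)
# that sit in different voltage domains.
ACTIVITY = 0.2      # transitions per cycle (assumption)
CAP_F = 10e-15      # 10 fF total net capacitance (assumption)
FREQ_HZ = 400e6     # 400 MHz main clock

p_high = net_dynamic_power(ACTIVITY, CAP_F, 0.95, FREQ_HZ)
p_low = net_dynamic_power(ACTIVITY, CAP_F, 0.85, FREQ_HZ)

print(f"0.95 V net: {p_high:.3e} W")
print(f"0.85 V net: {p_low:.3e} W")
print(f"power ratio: {p_high / p_low:.3f}")
```

Although both nets toggle equally often, the net in the 0.95 V domain dissipates roughly 25% more power than the one at 0.85 V, so an activity-only weight would rank them as equal while a power-based weight would not.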

It then calculates a power-based weight (wp) for each net. wp is computed from the ratio of the net power (NP) to the design average interconnect power (Avg(NPs)) in an exponential form, which gives more emphasis to the high power-consuming nets.

After calculating the power-based weight (wp), the algorithm calculates a timing-based weight (wt) by combining the path delay (PD) and the design average negative slack (Avg(NSs)) as shown in (2). This weight is applied only if the net is in a timing violated path.

The algorithm also accounts for high fanout nets. It uses the fanout number (NFout) to reduce the high fanout net weights. Experiments on small testcases showed that if a higher weight is given for a high fanout net, the placer will put all the cells connected to that net in the same bin, which can lead to a severe congestion and legalization problem and consequently a nonroutable design.

Power weight wp, timing weight wt, and net fanout (NFout) are combined in a linear formula (3) to calculate the final weight w.

In (3), α is a power ratio between 0 and 1. It controls the contribution of the power weight to the final weight and provides a knob to trade off between timing and power objectives in the placer. Comparing exponential versus linear weighting with a fixed α (0.5) on a small testcase (~2000 instances) showed improvements of 6% in total power and 2% in setup timing.
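The following sketch illustrates one way such a weighting could be put together. The exact expressions are assumptions made for illustration and should not be read as formulas (1) to (3) of this paper: the power weight is taken as exp(NP / Avg(NPs)), the timing weight as exp(PD / |Avg(NSs)|) for nets on violating paths and 0 otherwise, and the final weight as the α-weighted linear combination of the two divided by the fanout. All function names are hypothetical.

```python
import math


def power_weight(net_power, avg_net_power):
    """Assumed form: exponential of the net power relative to the average."""
    return math.exp(net_power / avg_net_power)


def timing_weight(path_delay, avg_negative_slack, violating):
    """Assumed form: exponential of path delay over |average negative slack|.

    Only nets on timing-violating paths receive a timing weight.
    """
    if not violating:
        return 0.0
    return math.exp(path_delay / abs(avg_negative_slack))


def net_weight(wp, wt, fanout, alpha=0.25):
    """Assumed combination: linear mix of wp and wt, scaled down by fanout."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * wp + (1.0 - alpha) * wt) / max(fanout, 1)


# Hypothetical nets: same fanout, no timing violation, different power.
w_avg = net_weight(power_weight(1.0, 1.0), 0.0, fanout=2)
w_hot = net_weight(power_weight(3.0, 1.0), 0.0, fanout=2)
# Same power as the hot net, but a much larger fanout.
w_hot_wide = net_weight(power_weight(3.0, 1.0), 0.0, fanout=32)

print(f"average-power net:   {w_avg:.3f}")
print(f"3x-average net:      {w_hot:.3f}")
print(f"3x-average, fanout 32: {w_hot_wide:.3f}")
```

With an exponential form, a net that consumes three times the average power receives a weight several times larger than an average net, whereas a linear form would only triple it. Dividing by the fanout keeps wide nets from pulling all of their sinks into the same bin.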

3.3. Power-Timing Trade-Off Power Ratio

In order to determine the best value of the power prioritization parameter α, the PTDP flow is run with varying power ratios (from 0 to 1 with a step of 0.05) using a multivoltage testcase presented in Figure 2.
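The sweep itself is straightforward to script. The sketch below generates the 21 values of α and selects one using a hypothetical `evaluate` callback that stands in for a full PTDP run and returns the timing and power gains in percent. The selection rule (maximize the smaller of the two gains) is an assumption chosen for illustration; in this work the value is chosen by inspecting the results in Figure 4.

```python
def alpha_sweep(step=0.05):
    """Return the power ratios from 0 to 1 inclusive with the given step."""
    count = round(1.0 / step)
    # Index-based generation avoids accumulating floating-point error.
    return [round(i * step, 10) for i in range(count + 1)]


def best_alpha(evaluate, alphas):
    """Pick the alpha whose worse gain (timing or power) is the largest.

    `evaluate(alpha)` is a hypothetical callback returning
    (timing_gain_percent, power_gain_percent) for one flow run.
    """
    return max(alphas, key=lambda a: min(evaluate(a)))


def toy_evaluate(alpha):
    """Toy stand-in for a flow run; not measured data."""
    distance = abs(alpha - 0.25)
    return (10.0 - 10.0 * distance, 6.0 - 10.0 * distance)


alphas = alpha_sweep()
print(len(alphas), alphas[:3], alphas[-1])
print("selected alpha:", best_alpha(toy_evaluate, alphas))
```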

The testcase used in this experiment (Figures 2 and 3) is a 45 nm multivoltage design built with open source libraries. It has the following characteristics:
(i) Technology: 45 nm
(ii) Number of instances: 250k
(iii) Equivalent NAND2 gate count: 700k
(iv) Number of RAMs: 48
(v) Targeted utilization: 60–65%
(vi) Main clock frequency in normal operation mode: 400 MHz

The operating voltages of the chip are as follows:
(i) Always on (0.95 V): CPU, USB_PHY, DDR, and PLLs (colored red in Figure 2).
(ii) Switchable (0.95 V/OFF): USB controller (colored green in Figure 2).
(iii) Switchable (0.95 V/0.85 V/OFF): nova0 and nova1 (colored blue in Figure 2).

The results summarized in Figure 4 show that the best timing reduction is achieved with α = 0.25, α = 0.5, and α = 0.9. It shows also that the best power reduction is achieved with α = 0.25 and α = 0.4. We choose α = 0.25 to achieve both timing and power reduction.

In the next section, we fix α = 0.25 and run the flow on multiple industrial designs of different technologies in order to generalize the study.

4. Experimental Results

The motivational example presented in Section 3 gives evidence that taking the power factor into account while calculating the net weights can lead to an overall better solution in terms of timing and power. To evaluate the effectiveness of the proposed weighting process, we implemented it in the TCL programming language, integrated it in the flow with α = 0.25 as shown in Figure 1, and compared the generated results with those of a commercial EDA tool (Nitro-SoC™).

The experiments were conducted on a set of 40 designs from different technologies, sizes, and complexities. Table 1 presents the main characteristics of the designs used. Our baseline is the default placement generated by the default Nitro-SoC™ TDP which runs the global placement while concurrently optimizing for wirelength, spread, and timing. Its goal is to find the best trade-off for all of the objectives. In the end, we compared the QoR (timing, power, and routability) metrics of both runs.

As shown in Figure 5, the interconnect dynamic power gain of the new approach over the default one is 11.5% on average, and all the designs benefit in terms of power reduction. This is mainly due to the clustering of the cells connected by the high power-consuming nets. This interconnect power gain results in an overall power consumption decrease of 5.4% (Figure 6). Along with the power gain, a timing improvement of 13.7% in TNS (Figure 7), 9.4% in WNS (Figure 8), and 8.7% in Total Transition Violations (Figure 9) is realized with only a 6% runtime increase (Figure 11), caused by the additional RC extraction and timing/power analysis during the weighting process.

The power, timing, and transition time gains are the electrical effects of a 3.2% reduction in total wirelength (Figure 10). In a few cases, we see timing and wirelength increases because the chosen α parameter may prioritize power-critical nets over timing-critical ones and drive the placer to shorten the power-critical nets and lengthen the timing-critical ones.

Total wirelength usually increases when extra net weighting is added. This phenomenon is seen in the classical usage of net weighting, which is applied after a primary placement by increasing the weight of critical nets. This approach is used, for instance, to help pull the level shifter (LS) and isolation (ISO) cells near their power domain region boundaries. By applying a high net weight to the net segment between the LS/ISO cells and their associated power domain region boundary, the placer gives additional priority to minimizing the length of that net segment and therefore pulls the cells nearer to the power domain region boundary. Another classical usage of net weighting, in the clock path, is when the clock-gating cells are right before the sink pins and the net fanout is small enough that buffering must not occur. By applying additional weight to the clock-gating output, the registers are pulled near the clock-gating cell.

In our new PTDP, we weight all data nets using timing, power, and fanout information. Since net power is a function of the voltage, the capacitance, and the switching activity, using the power parameter helps pull together the cells connected by high-power nets, which are long nets in most cases (depending additionally on voltage and switching activity).

Another side effect of the traditional net weighting usage is seen in high fanout nets. All the cells on the net are pulled together, leading to a congestion problem and additional router detours. In our weighting mechanism, we added the fanout parameter to resolve this issue by reducing the weight of high fanout nets according to their fanout number. This has contributed to additional wirelength reduction and routability improvement.

5. Conclusion & Perspective

In this paper, we have proposed a new weighting approach that takes the power factor into account while calculating the net weights before and during the global placement stage, in order to generate a power-friendly placement without degrading either the timing or the electrical Design Rule Constraint (eDRC), represented by the Max Transition Constraint in our study. It takes into consideration the net switching activity, the net power domain, and the operating frequency. By evaluating this new weighting approach on a wide variety of designs, we achieved an average gain of 11.5% in interconnect dynamic power and 5.4% in total power consumption. The power gain is achieved while improving timing and eDRC (TNS gain = 13.7%, WNS gain = 9.4%, and transition violations gain = 8.7%).

Taking the power factor into consideration early in the physical design process can lead to a better power reduction throughout the P&R flow. More work can be done in the same direction in earlier stages such as Floor-Planning, especially in the Macro Placement phase to explore new possibilities and opportunities for low power design methodologies.

Data Availability

The test cases used to support the findings of this study have not been made available because they are the intellectual property of Mentor Graphics and its customers.

Conflicts of Interest

The authors declare that they have no conflicts of interest.