Fault-tolerant TMR and DMR circuits with latchup protection switches

https://doi.org/10.1016/j.microrel.2014.04.001

Highlights

  • The ASIC design flow is modified to protect circuits from single event effects.

  • The protection switch is verified by measurements in a radiation environment.

  • TMR and DMR shift-registers with latchup protection are designed and tested.

  • DMR with protection switch outperforms TMR in terms of failure-free probability.

  • The fault-tolerant DMR middleware switch processor is designed and implemented.

Abstract

The paper presents CMOS ASICs that tolerate single event upsets (SEUs), single event transients (SETs), and single event latchup (SEL). The proposed approach is based on triple and double modular redundant (TMR and DMR) circuits combined with SEL protection switches (SPS). The SPS was designed, characterized, and verified before it became a standard library cell. A few additional steps have been introduced during logic synthesis and layout generation to implement the redundant net-lists and power domains and to place the latchup protection switches. The approach and accompanying techniques have been verified on a shift-register and a middleware switch processor.

Introduction

Malfunctions of electronic devices caused by radiation-induced single event effects are observed not only in space and airborne equipment, but also in mainstream applications. As the integration and scaling of electronic chips progress, their susceptibility to errors increases. Current space microelectronics development and high energy physics research require shorter design times and cheaper solutions for radiation- and fault-tolerant ASIC designs [1], [2], [3]. The main idea is to provide ASICs that function correctly and reliably in a radiation environment while using standard semiconductor technologies and design flows. Therefore, the design of advanced fault-tolerant digital integrated circuits needs scientific research and progress in three important aspects:

  • a. Analysis of irradiation effects and circuit faults.

  • b. Development of fault models and simulation test benches.

  • c. Design of fault-tolerant circuits and systems.

Definitions and descriptions of the basic irradiation effects and fault mechanisms can be found in the literature [4], [5], [6]. The thesis referenced in [7] provides an overview of different radiation environments and investigates the interaction mechanisms between energetic particles and matter. It also introduces a new semi-empirical model for estimating the electronic stopping force of heavy ions in solids. The most common irradiation effect is a single event effect (SEE) induced by a cosmic particle strike. Three common types of SEE are known: single event upset (SEU), single event transient (SET), and single event latchup (SEL). A single event upset changes the state of a storage element. It affects memory cells and sequential logic. A single event transient causes a short pulse (and a wrong logic state) at the output of combinational logic. The wrong logic state propagates if it is present at the active clock edge. A single event latchup, on the other hand, causes excessive current flow through an NPNP structure in CMOS circuits. Unlike SEU and SET, a single event latchup is a potentially destructive state [8], [9], [10], [11].

To analyze the final impact of single event effects and provide high fault coverage at the circuit and system levels, it is necessary to develop practical, accurate, and simple simulation fault models. A good fault model should preserve the correct function of a circuit (NAND, NOR, flip-flop, etc.) when no fault is present and cause the circuit to malfunction when a fault is present. Besides the logic function, the most relevant modeling parameters are the effect duration, the minimal time of current discharge, and the current pulse intensity. The easiest way to simulate a fault in a logic circuit is to use a fault injector: an XOR or XNOR gate. Fault injectors can be added once logic synthesis has been completed and a design net-list has been generated. Parsers are used for automatic insertion of the fault injectors [12], [13], [14], [15]. Simulation-based fault injection is the most widely used approach for validating fault-tolerant systems. The proposed fault models can be used during behavioral and net-list simulation. The models are usually described in a hardware description language (VHDL or Verilog).
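The XOR fault injector described above can be illustrated with a minimal behavioral sketch. It is written in Python rather than VHDL or Verilog purely for brevity, and the names `nand2`, `inject`, and `nand2_with_injector` are hypothetical, not taken from the paper. The sketch only shows the logical effect of the injector; it does not model pulse duration or current intensity.

```python
def nand2(a: int, b: int) -> int:
    """Fault-free 2-input NAND on 0/1 values."""
    return 1 - (a & b)


def inject(signal: int, fault: int) -> int:
    """XOR fault injector: passes `signal` unchanged when fault == 0
    and inverts it when fault == 1."""
    return signal ^ fault


def nand2_with_injector(a: int, b: int, fault: int = 0) -> int:
    """NAND gate whose output net is routed through an XOR injector
    (hypothetical wiring, chosen for illustration)."""
    return inject(nand2(a, b), fault)


# Exhaustive check: correct function without a fault, inverted output with one.
for a in (0, 1):
    for b in (0, 1):
        good = nand2(a, b)
        assert nand2_with_injector(a, b, fault=0) == good
        assert nand2_with_injector(a, b, fault=1) == 1 - good
```

With the fault input tied to 0 the gate keeps its correct function, and asserting the fault input forces a malfunction, which matches the two requirements placed on a good fault model.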

The known SEU, SET, and SEL fault-tolerant techniques can be clearly separated into two categories: circuit level techniques [16] (hardened-cell design [17], triple modular redundancy (TMR) [18], double modular redundancy (DMR) [19], and error detection and correction for memories [20]) and layout level techniques [21] (enclosed layout transistors, guard rings, and trench isolation). There are also many mitigation techniques based on expensive technology changes [21].

For SEU and SET, the most common fault-tolerant techniques are TMR and DMR. A triple modular redundant circuit consists of three identical modules and a 3-input majority voter. The voter passes the majority value of its inputs to the output. In digital circuits, the modules are memory elements such as flip-flops or latches. The main disadvantage of this technique is that the system fails if the voter is faulty. Therefore, triple voting logic was developed to complete the circuit redundancy: each of the three voters is fed from the outputs of all three memory modules. This technique is known in the literature as full triple modular redundancy. A detailed analysis of triple modular redundancy was presented in [22]. Higher redundancy provides higher fault tolerance and circuit reliability but increases the chip area, energy consumption, and cost. The goal is therefore to trade off the redundancy level against the reliability requirements, taking the prospective application into account. To reduce the high hardware overhead of TMR while keeping the design reliability high, double modular redundancy with self-voting was developed [19]. A novel double/triple modular redundancy (DTMR) technique for dynamically reconfigurable devices, tested on a finite state machine, was presented in [23].
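The voting scheme described above can be sketched in a few lines. The example below is a behavioral illustration in Python, not the circuit from the paper. The function names are hypothetical, and reloading each module from its own voter is an assumption about how the voter outputs are used; the text only states that each voter is fed from all three memory modules.

```python
from typing import List


def majority(a: int, b: int, c: int) -> int:
    """3-input majority voter on 0/1 values."""
    return (a & b) | (a & c) | (b & c)


def full_tmr_refresh(modules: List[int]) -> List[int]:
    """Full TMR: three voters, each fed from all three modules.
    Assumption: module i is reloaded from voter i on every refresh."""
    a, b, c = modules
    return [majority(a, b, c) for _ in range(3)]


single_upset = [1, 0, 1]   # one module flipped by an SEU
double_upset = [0, 0, 1]   # two modules flipped

assert full_tmr_refresh(single_upset) == [1, 1, 1]   # upset is masked
assert full_tmr_refresh(double_upset) == [0, 0, 0]   # majority is wrong
```

A single upset is outvoted and corrected, while two simultaneous upsets defeat the majority. This is the basic limit of the technique, independent of how many voters are used.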

The SEL mitigation techniques can be classified into three main groups:

  • a. The first approach uses current sensors at board level to detect the excessive current induced by latchup. The power supply of the affected device is switched off and, after a sufficiently long period, restored. This approach suffers from a serious drawback: the circuit state is destroyed and cannot be recovered.

  • b. The second approach [24] is based on the introduction of an epitaxial buried silicon layer and reduction of the well resistivity. However, this modification incurs additional costs and may impact circuit performance (breakdown voltage, for example).

  • c. The third approach [25] uses guard rings (additional N-type and P-type regions) that break the parasitic bipolar transistor structure. This solution is very efficient but can result in excessive circuit area and, therefore, price.
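The first, board-level approach in the list above amounts to a small supervisory state machine. The sketch below is a hypothetical Python model of that behavior. The class name `PowerCycler`, the threshold, and the tick-based off period are all assumptions made for illustration, and none of the values come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PowerCycler:
    """Board-level latchup supervisor (illustrative model only)."""
    threshold_ma: float      # assumed latchup detection threshold
    off_ticks: int           # assumed supply-off period, must be >= 1
    powered: bool = True
    remaining_off: int = 0

    def step(self, current_ma: float) -> bool:
        """Advance one tick and return the supply state after the tick."""
        if self.powered:
            if current_ma > self.threshold_ma:
                # Excessive current: cut the supply and start the off period.
                self.powered = False
                self.remaining_off = self.off_ticks
        else:
            self.remaining_off -= 1
            if self.remaining_off == 0:
                # Off period elapsed: restore the supply.
                self.powered = True
        return self.powered


supervisor = PowerCycler(threshold_ma=100.0, off_ticks=2)
samples_ma = [20.0, 25.0, 400.0, 0.0, 0.0, 20.0]   # latchup at the third sample
trace = [supervisor.step(s) for s in samples_ma]
```

The model makes the drawback visible: while the supply is off, every register in the affected device loses its contents, so the circuit restarts from an unknown state once power returns.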

A new SEL mitigation scheme that combines error correcting codes with intelligent power line implementation was presented in [26]. It prevents circuit damage and transparently corrects errors caused by latchup. The area cost is low, and the scheme can also be used to correct SEUs and reduce power dissipation.

In spite of the huge research efforts and respectable scientific achievements, there are still challenges regarding the use of commercial ASIC technologies in space and safety-critical applications. The most important are:

  • a. The radiation-hardened technologies are expensive (qualification and quality requirements are severe) and commercially not attractive (small volume production).

  • b. There is no standard integrated framework of circuit design techniques that provides simultaneous SEU, SET, and SEL fault-tolerance.

  • c. There is no standard design automation flow of fault-tolerant digital ASICs and SOCs.

This paper presents a design flow for fault-tolerant ASICs that is based on redundant circuits with latchup protection and on additional implementation steps during logic synthesis and layout generation. The proposed approach protects ASICs from SEU, SET, and SEL faults by combining and integrating different fault-tolerant techniques. The design methodology has been described in [27].

The following paper sections describe a SEL protection switch (Section 2), redundant (TMR and DMR) circuits with latchup protection (Section 3), and test results of the implemented fault-tolerant circuits (Section 4). The conclusions are summarized in Section 5.

Section snippets

SEL protection switch

Instead of using the external current sensors and expensive technological modifications, we have introduced a single event latchup protection switch (SPS) which can be integrated with standard cells and control their supply current. The proposed SEL protection switch [28], [29] is shown in Fig. 1.

The function of this circuit is to switch off the power supply in case of the excessive supply current. A current sensor/driver (P5 transistor) is used for supplying the power to the logic that needs

Redundant circuits with latchup protection

Redundancy is always required for a fault-tolerant design. The higher the redundancy, the better the protection, but also the larger the chip area, power consumption, and cost. Therefore, it is necessary to trade off the redundancy rate against the cost for each application. However, the redundant circuits (TMR and DMR) need to be protected from the latchup condition too. This requires significant modifications of the standard TMR and DMR circuit designs. We have proposed the SEL

Test circuits design, implementation and verification

Fault-tolerant ASICs can be implemented using standard design automation tools [33], [34], [35] by introducing a few additional steps in the design flow. One extra step compared to the standard design flow is necessary to generate a new net-list that includes the redundant cells and voters. The other two extra steps (definition of the power domains and placement of the SPS standard cells) have to be made in the layout phase [27].
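The net-list generation step can be pictured as a simple transformation over the flip-flops of a synthesized design. The sketch below is a hypothetical Python illustration, not the tool used in the paper. The cell names `DFF` and `MAJ3`, the pin names, and the `_r0`/`_r1`/`_r2` naming scheme are assumptions. Clock pins are omitted, and the two layout steps (power domains and SPS placement) are not modeled.

```python
from typing import Dict, List, Tuple

# (cell type, instance name, pin name -> net name)
Cell = Tuple[str, str, Dict[str, str]]


def triplicate_flops(flops: List[Tuple[str, str, str]]) -> List[Cell]:
    """Replace each flip-flop (name, d_net, q_net) by three copies and a voter.
    The voter drives the original q_net, so downstream logic is unchanged."""
    cells: List[Cell] = []
    for name, d_net, q_net in flops:
        copy_nets = [f"{q_net}_r{i}" for i in range(3)]
        for i, copy_net in enumerate(copy_nets):
            cells.append(("DFF", f"{name}_r{i}", {"D": d_net, "Q": copy_net}))
        cells.append((
            "MAJ3",
            f"{name}_vote",
            {"A": copy_nets[0], "B": copy_nets[1], "C": copy_nets[2], "Z": q_net},
        ))
    return cells


netlist = triplicate_flops([("u_ff0", "n_d0", "n_q0")])
for cell_type, inst, pins in netlist:
    print(cell_type, inst, pins)
```

Because the voter output reuses the original net name, the rest of the net-list does not need to be rewritten, which is one reason such a step is easy to automate with a parser.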

Table 6 presents features of the primary, TMR, and DMR circuits

Conclusion

The standard design flow has been modified to protect digital ASICs against the known single event effects (SEU, SET, and SEL) occurring in a radiation environment. The modifications of the standard flow (generation of a redundant net-list, definition of the power domains, and placement of the SPS cells) are minor and fully automated.

The SPS cells have been characterized and verified by measurements in a radiation environment. The results, based on the redundant circuit

References (38)

  • Dressnandt N, Newcomer M, Rohne O, Passmore S. Radiation hardness: design approach and measurements of the ASDBLR ASIC...
  • Tarrillo J, Chipana R, Chielle E, Kastensmidt FL. Designing and analyzing a SpaceWire router IP for soft errors...
  • Jung Y-K. Fault-recovery non-FPGA-based adaptable computing system design. In: Proc second NASA/ESA conference on...
  • A. Holmes-Siedle et al. Handbook of radiation effects (2002)
  • Javanainen A. Particle radiation in microelectronics. PhD thesis, Department of Physics, University of Jyväskylä,...
  • R.R. Troutman. Latchup in CMOS technology: the problem and its cure (1986)
  • S.H. Voldman. Latchup (2007)
  • N.H.E. Weste et al. CMOS VLSI design: a circuits and systems perspective (2011)
  • Ye L, Xiaohan G, Weiwei X, Zhiliang H, Killat D. An experimental extracted model for latchup analysis in CMOS process....
  • H. Mei-Chen et al. Fault injection techniques and tools. IEEE Comput (1997)
  • A. Benso et al. Fault injection techniques and tools for embedded systems reliability evaluation (2003)
  • Sheng W, Xiao L, Mao Z. An automated fault injection technique based on VHDL syntax analysis and stratified sampling....
  • Sharma A, Singh B. Simulation of fault injection of microprocessor system using VLSI architecture system. In: Proc IEEE...
  • R.C. Lacoe. Improving integrated circuit performance through the application of hardness-by-design methodology. IEEE Trans Nucl Sci (2008)
  • M.P. Baze et al. A digital CMOS design technique for SEU hardening. IEEE Trans Nucl Sci (2000)
  • R.E. Lyons et al. The use of triple-modular redundancy to improve computer reliability. IBM J Res Dev (1962)
  • J. Teifel. Self-voting dual-modular-redundancy circuits for single-event-transient mitigation. IEEE Trans Nucl Sci (2008)