TL;DR: This article gives an overview of the PQC hardware extensions for the OpenTitan Big Number (OTBN) accelerator hardware IP block selected for integration into the second generation of the OpenTitan® design.

The article starts by listing the limitations of the OTBN hardware as taped out for the first production OpenTitan silicon and by discussing the PQC requirements of the OpenTitan project. In line with these requirements, the focus of the selected extensions lies on 1) enabling implementation hardening against SCA and FI attacks to achieve CC certification, 2) limiting the increase in area and critical path delay to keep enabling integrations at scale into both discrete and integrated RoT designs, and 3) meeting performance requirements. To motivate the selection of extensions, this article provides a cost-benefit analysis with respect to these focus areas. The key takeaway message is that purely software-based SCA hardening involves a 100x run time overhead and further optimizing the vectorized OTBN datapaths cannot alleviate this. In contrast, the mask conversion acceleration extension selected by the OpenTitan project allows reducing this run time overhead to below 30% compared to an unhardened implementation. 

This article is based on the project-internal RFC: PQC Support for OTBN (see public GitHub issue #26846) which has been approved by the OpenTitan project in Dec 2025. For this public article, any proprietary information has been removed and the article has been updated to provide additional insight on the implementation progress, latest implementation results and the next steps.

The OpenTitan project is stewarded by lowRISC® C.I.C.

Background

The first production OpenTitan silicon can run post-quantum cryptographic (PQC) algorithms on its main processor core Ibex and as a matter of fact, it already supports hardware-accelerated SLH-DSA (SPHINCS+) based signature verification for Secure Boot. However, OpenTitan features a dedicated coprocessor for asymmetric cryptographic operations called OpenTitan Big Number (OTBN) accelerator, which should be used whenever side-channel analysis (SCA) resilience and performance matter. And when it comes to the NIST standardized PQC algorithms ML-DSA (Dilithium) [1] and ML-KEM (Kyber) [2] as required for CNSA 2.0 compliance [3], the version of OTBN which has been taped out as part of the first production OpenTitan silicon is not suitable for the following reasons:

  • Limited memory sizes: This version of OTBN features 8 KiB of IMEM and 4 KiB of DMEM, but the new PQC algorithms require considerably more data memory, in particular when considering hardening. Likewise, more instruction memory is required to fit even a basic unhardened implementation.
  • No KMAC application interface: The new PQC algorithms make use of Keccak operations, and with the KMAC hardware IP block, OpenTitan features a hardened Keccak accelerator. But without a KMAC application interface for OTBN, either the operations have to be implemented in software running on OTBN (thereby increasing the instruction memory footprint and limiting the performance), or software running on Ibex has to take care of moving intermediate data between KMAC and OTBN (thereby exposing intermediate results to software, which may weaken security).
  • No support for single-instruction, multiple-data (SIMD) execution: The bignum ALU of OTBN is optimized for handling big numbers while the new PQC algorithms mostly operate on 12-, 24- and 32-bit elements. The lack of SIMD support means that large parts of the 256-bit wide bignum ALU remain unused when implementing the new PQC algorithms on this version of OTBN.
  • No hardware support for efficient conversion from boolean to arithmetic masking for SCA hardening: The new PQC algorithms combine operations most efficiently protected against SCA using arithmetic (lattice operations) or boolean masks (hash operations). To harden the full algorithm against SCA, implementations either need to mask one type of operation using a sub-optimal masking scheme or when switching from one operation type to the other, the masks need to be converted. Both these options incur a substantial performance penalty.

For these reasons, strategies for accelerating PQC on OTBN have been discussed in various meetings of the OpenTitan project-internal Security, Silicon and Integrated Working Groups throughout 2024 and 2025, as well as in the public OpenTitan GitHub issue #26846.

In autumn 2025, lowRISC compiled the requirements for PQC implementations on OpenTitan and condensed the previous relevant discussions into an implementation proposal (RFC) meeting these requirements. This RFC was formally approved by the OpenTitan Technical Committee in Dec 2025. The OpenTitan project then started implementing the approved proposal and upstreaming the changes into the OpenTitan code base.

The goal of this article is to share more insight on the rationale behind the implementation approach chosen by the partners of the OpenTitan project, and to give an update on the latest results and the progress of lowRISC’s Security Team towards a PQC-ready OTBN version for integration into next-generation OpenTitan systems.

Requirements

OpenTitan’s PQC requirements can be grouped along multiple dimensions. For a summary of the requirements, see Requirements Summary.

Security Level

The FIPS standards for ML-DSA [1] and ML-KEM [2] define different parameter sets to trade off memory requirements, computational complexity and security. As module-lattice-based public-key cryptography schemes, including the newly standardized PQC algorithms, are less well studied than more widely used and known public-key schemes such as RSA and ECC, multiple OpenTitan partners have recommended using at least Level 3 for ML-DSA, i.e., ML-DSA-65. For compatibility with CNSA 2.0 [3], the highest security level, i.e., Level 5 for both algorithms or more precisely ML-DSA-87 and ML-KEM-1024, needs to be supported.

Security Hardening

For certification under Common Criteria (CC) PP-0084 and/or CC PP-0117, the implementation in OpenTitan needs to be resistant to attacks performed by an attacker possessing High attack potential, requiring security countermeasures against physical attacks such as side-channel analysis (SCA) and fault injection (FI). Physical attacks on PQC implementations as well as the design of effective countermeasures against such attacks are very active research areas. The general understanding is thus that security countermeasures will mostly be implemented in software to enable flexibility, and that there needs to be some headroom in the IMEM to further improve the hardening in the future.

SCA

There is consensus to aim for 1st-order masking against SCA. As with existing SCA-hardened non-PQC algorithmic implementations running on OTBN, the masking countermeasures are expected to be implemented in software for improved flexibility and area efficiency. Hardware masking implementations are only considered for performance critical parts of the PQC algorithms and only for well-researched countermeasures (e.g. boolean-to-arithmetic mask conversion).    

To facilitate the implementation of masking countermeasures in software, the bignum datapath (ALU, MAC, register file read and write paths) of OTBN implements a blanking countermeasure to minimize glitching activity on datapath control signals and to force inputs of datapath elements to zero which are unused by the currently executed instruction, thereby reducing unnecessary and unintentional switching activity in the design. The understanding is that the blanking countermeasure must be extended to comprise any newly added extensions to the bignum datapath of OTBN. 

FI

Unlike Ibex, OTBN doesn’t feature a holistic lockstep countermeasure to harden the entire execution pipeline against FI in hardware. Instead, OTBN implements a set of countermeasures to selectively harden the hardware design against FI and relies on additional software countermeasures to be added at the algorithmic level. The same approach will be used to harden the PQC algorithms against FI. The design of additional FI hardware countermeasures will be considered where they can also be beneficial for the implementation of non-PQC algorithms but such general FI hardening improvements are outside of the scope of this article.

Performance

ML-DSA will be used in Secure Boot which is on the critical path between power-on/wake-up and first firmware-based operations. For consideration of integrations into PCIe-compliant devices, Secure Boot needs to complete within 120 ms of power-on. The project has agreed to aim to finish Secure Boot within 100 ms to meet this requirement with some headroom.

Secure Boot involves verifying the signature of two boot stages with ML-DSA (ROM_EXT and BL0). When performing e.g. a firmware update, signatures additionally need to be generated in one or both boot stages for attestation. However, if the device attestation can be done after completion of the firmware update via a separate reset flow, the signature generation can be completely moved off the critical path and only the two verification iterations remain. This is desirable because the ML-DSA sign operation: 

  • is computationally much more intense compared to the verify operation,
  • involves the private secret, meaning it requires more security hardening than the verify operation, which further increases the computational complexity, and
  • has a highly variable run time which makes it impossible to give hard guarantees without overengineering the implementation (e.g. 3.85 iterations of the main loop are required for ML-DSA-87 signature generation on average, but only with 16 loop iterations, the success probability is greater than 99%).
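
The numbers in the last bullet can be reproduced with a short calculation. The sketch below assumes a geometric model, i.e., that every main-loop iteration succeeds independently with probability 1/3.85 as derived from the average stated above. This independence assumption and the helper names are ours and only serve as an illustration.

```python
import math

AVG_ITERATIONS = 3.85  # average main-loop iterations for ML-DSA-87 (from the text)


def success_probability(max_iterations: int, avg_iterations: float = AVG_ITERATIONS) -> float:
    """Probability that signing succeeds within max_iterations loop iterations.

    Assumption: each iteration succeeds independently with p = 1 / avg_iterations.
    """
    p = 1.0 / avg_iterations
    return 1.0 - (1.0 - p) ** max_iterations


def iterations_for(target: float, avg_iterations: float = AVG_ITERATIONS) -> int:
    """Smallest iteration count whose success probability reaches the target."""
    p = 1.0 / avg_iterations
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"{n:2d} iterations: success probability {success_probability(n):.4f}")
    print("iterations needed for 99%:", iterations_for(0.99))
```

Under this model, budgeting for the average of roughly 4 iterations covers only about 70% of all signing attempts, which is why hard run time guarantees would require budgeting for a multiple of the average.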

Assuming a discrete chip like production OpenTitan Earl Grey silicon running at 100 MHz¹, the 100 ms to complete Secure Boot correspond to 10 M clock cycles. As an upper bound, a single ML-DSA-87 signature verification must take less than 5 M clock cycles.

If the signature generation cannot be moved off the critical path (i.e., the reset flow for doing device attestation after completion of a firmware update cannot be supported), the signature verification must take less than 1.5 M clock cycles and the signature generation must take less than 7 M clock cycles for the average number of loop iterations (this assumes either splitting the update of ROM_EXT and BL0 over two boots or doing a double-update boot to make more time, see #26846).

The OpenTitan project has decided to aim for supporting a separate reset flow to move the ML-DSA-87 signature generation off the critical path for Secure Boot. This means there is no strict performance requirement for ML-DSA-87 signature generation, but ML-DSA-87 signature verification must take less than 5 M clock cycles.
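
The cycle budgets above follow from simple arithmetic. The following sketch spells it out, assuming two signature verifications (ROM_EXT and BL0) on the critical path and, for the fallback case, one additional signature generation. The latter is our reading of the numbers above, and the constant and function names are purely illustrative.

```python
CLOCK_HZ = 100_000_000  # discrete chip clock assumed in the text (100 MHz)
BOOT_BUDGET_MS = 100    # project target for completing Secure Boot
BOOT_STAGES = 2         # ROM_EXT and BL0 are both verified with ML-DSA


def total_cycle_budget(clock_hz: int = CLOCK_HZ, budget_ms: int = BOOT_BUDGET_MS) -> int:
    """Clock cycles available within the Secure Boot time budget."""
    return clock_hz * budget_ms // 1000


def verify_budget(stages: int = BOOT_STAGES) -> int:
    """Upper bound per verification when only verifications are on the critical path."""
    return total_cycle_budget() // stages


def critical_path_cycles(verify_cycles: int, sign_cycles: int, stages: int = BOOT_STAGES) -> int:
    """Cycles spent when one signature generation stays on the critical path (assumption)."""
    return stages * verify_cycles + sign_cycles


if __name__ == "__main__":
    print("total budget:", total_cycle_budget())                            # 10 M cycles
    print("per verification:", verify_budget())                             # 5 M cycles
    print("fallback targets:", critical_path_cycles(1_500_000, 7_000_000))  # 10 M cycles
```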

There are currently no performance requirements for ML-KEM.

Area and Timing

OpenTitan is designed to be of commercial quality and has been successfully taped out both as integrated RoT (iRoT) IP in larger SoCs as well as discrete RoT chips such as the first production OpenTitan silicon. And especially when it comes to mass production, the silicon area is no longer “just” one of multiple optimization dimensions but the key figure for our industry partners. With respect to timing, discrete RoT chips typically run at moderate clock speeds in the order of 100 MHz. In contrast, iRoTs are getting taped out in the most recent technology nodes and typically clock above 1 GHz, making timing closure for the iRoT much more challenging. Specifically considering Darjeeling, the OpenTitan top-level design targeting iRoTs, OTBN is on the overall critical path of the design, i.e., there is not much room for increasing the critical path delay inside OTBN.

The PQC extensions must be implemented to fit both of these core use cases of OpenTitan and our industry partners. This means they need to be as efficient as possible in terms of silicon area (logic and memory) without inflating the critical path delay.

As demonstrated by previous research [4, 5], the memory of OTBN needs to be substantially increased for ML-KEM and especially for ML-DSA. Even for a baseline implementation, leaving any performance and hardening requirements out of the picture, an IMEM size in the range of 12 to 32 KiB and a DMEM size in the range of 12 to 128 KiB are required to fit ML-DSA-87 signature generation.

Compared to the taped out OTBN configuration (8 KiB IMEM, 4 KiB DMEM), this means more than double the area spent for memory. And since already for this configuration, the memory (including the scrambling logic) accounts for roughly 30% of the silicon area of OTBN, it can be anticipated that the area increase due to the larger memory likely dominates the overall area increase required by the PQC extensions.

The primary goal is thus to minimize the increase in memory while meeting the performance and hardening requirements, and leaving some IMEM headroom to allow for future hardening improvements. The addition of more logic is avoided unless it helps minimize the memory increase directly (e.g. by offloading entire operations and thereby reducing the IMEM footprint) or indirectly (e.g. by substantially improving the computational performance and thereby enabling additional DMEM optimizations while still meeting performance requirements).

Requirements Summary

In short, the requirements can be summarized as follows:

  • Security level: ML-DSA-87 and ML-KEM-1024
  • Security hardening:
    • SCA: 1st-order masking (mostly in software), extension of existing blanking countermeasure to comprise newly added bignum datapath elements. 
    • FI: mostly in software and at the algorithmic level, consideration is being given to additional FI hardware countermeasures beneficial also for the hardening of non-PQC algorithms.
  • Performance
    • Maximum 5 M clock cycles for ML-DSA-87 signature verification, provided that attestation after a firmware update can be done via a separate reset flow. This is the preferred option of the OpenTitan project.
    • Otherwise, i.e., if attestation after a firmware update cannot be done via a separate reset flow: maximum 1.5 M clock cycles for ML-DSA-87 signature verification and 7 M clock cycles for signature generation with the average number of 3.85 loop iterations.
  • Area & timing
    • Minimize increase in memory sizes while meeting performance and hardening requirements.
    • Leave some IMEM headroom for future hardening improvements.
    • Only consider additional logic if it helps minimize the memory increase directly or indirectly.
    • Don’t inflate critical path delay.

OpenTitan’s PQC Extensions for OTBN

Below, we first give a high-level overview of the implementation approach chosen by the OpenTitan project before diving into the details in separate sections.

Extensions Overview

The following modifications and extensions to OTBN have been approved by the OpenTitan project and are currently being implemented to address the previously stated requirements. For details, refer to the sections further below. The estimated area overhead considering relaxed timing constraints and an implementation in the open FreePDK45 process is given for each modification in parentheses² ³.

  • Increasing OTBN memory
    • Increase IMEM from 8 KiB to 16 KiB (+50 kGE, +8%).
    • Increase DMEM from 4 KiB to 32 KiB (+440 kGE, +72%).
  • Adding a KMAC application interface (+35 kGE, +6%)
    • Beneficial for reducing IMEM and DMEM footprint, and for improving SCA hardening and performance. 
  • Adding a 32-bit SIMD ISA extension (+23 kGE, +4%)
    • Add vectorized instructions interpreting the 256-bit wide data registers as vectors of 8 32-bit operands. Beneficial for reducing IMEM footprint and improving performance.
    • Add instructions for efficiently packing and unpacking such vectors into dense vectors of 24-bit elements. Beneficial for reducing DMEM footprint.
  • Adding hardware acceleration for mask conversion  (+57 kGE, +9%)
    • Add new instructions for A2B and B2A mask conversions and the involved SecureAdd operation. This not only improves the performance of ML-DSA signature generation but also benefits classical and/or symmetric algorithms.

In total, the area overhead for all these modifications together amounts to roughly 604 kGE (+99%) of which 80% is spent on increasing the OTBN memory. The remaining 20% spent on more logic helps limit the memory increase. How this works is explained in more detail in the following sections.
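
As a quick cross-check, the totals can be recomputed from the per-item estimates listed above. The sketch below only sums the rounded figures from the list (the dictionary keys and helper name are ours), so the result can deviate from the stated total by a rounding error.

```python
# (area overhead in kGE, overhead in % of the previous OTBN configuration), from the list above
OVERHEADS = {
    "IMEM 8 -> 16 KiB": (50, 8),
    "DMEM 4 -> 32 KiB": (440, 72),
    "KMAC application interface": (35, 6),
    "32-bit SIMD ISA extension": (23, 4),
    "Mask conversion acceleration": (57, 9),
}
MEMORY_ITEMS = ("IMEM 8 -> 16 KiB", "DMEM 4 -> 32 KiB")


def summarize(overheads=OVERHEADS, memory_items=MEMORY_ITEMS):
    """Return (total kGE, total percent, share of the total spent on memory)."""
    total_kge = sum(kge for kge, _ in overheads.values())
    total_pct = sum(pct for _, pct in overheads.values())
    memory_kge = sum(overheads[name][0] for name in memory_items)
    return total_kge, total_pct, memory_kge / total_kge


if __name__ == "__main__":
    total_kge, total_pct, memory_share = summarize()
    print(f"total: {total_kge} kGE (+{total_pct}%), memory share: {memory_share:.0%}")
```

Summing the rounded per-item values gives 605 kGE, which agrees with the stated total of roughly 604 kGE up to rounding, and a memory share of about 81%.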

Increasing OTBN Memory

As discussed above, it is infeasible to implement ML-DSA on the previous OTBN configuration with 8 KiB of IMEM and 4 KiB of DMEM.

Increasing the OTBN IMEM to 16 KiB

It is known from previous research that ML-DSA-87 implementations for OTBN require between 21 and 32 KiB of IMEM depending on hardware extensions [4]. While the lower bound assumes a SIMD ISA extension as well as a dedicated KMAC application interface for OTBN, the upper bound is without any hardware extensions. Note that this analysis has two shortcomings: 1) It does not consider any SCA or FI hardening, and 2) it does not consider optimizations required to reduce the DMEM footprint as analyzed in depth in other research which may in turn impact the code size [5].

Thanks to the work of lowRISC’s Security Team, who are responsible for hardening the OpenTitan cryptographic library to get it ready for CC certification, we now know that the signature verification algorithm together with SCA- and FI-hardened versions of the signature generation and key generation algorithms for ML-DSA-87 can be implemented on OTBN with slightly more than 14 KiB of IMEM, assuming the other PQC extensions for OTBN approved by the project are implemented as well. For this reason, the OTBN IMEM has been increased from 8 KiB to 16 KiB in the upstream repository with PR #29318.

Increasing the IMEM from 8 KiB to 16 KiB creates an area increase of around 50 kGE which corresponds to +8% on top of the previous OTBN configuration when assuming relaxed timing constraints.

Increasing the OTBN DMEM to 32 KiB 

The minimum DMEM size to support any PQC algorithm is primarily driven by the ML-DSA-87 signature generation algorithm and substantially impacted by two factors: 1) whether it is on the critical path for Secure Boot or not, and 2) the SCA hardening. The following two research papers are particularly relevant for translating our requirements into a minimum DMEM size:

  • J. W. Bos et al. have come up with a set of optimizations for reducing the data memory footprint of an unhardened ML-DSA-87 signature generation implementation from 113 KiB down to 8.1 KiB [5] (not including the keys).
  • M. Azouaoui et al. have investigated the design of masking countermeasures for hardening ML-DSA signature generation against SCA and provide insight on which variables and operations actually need to be masked thereby increasing the data memory footprint and increasing the computational complexity [7]. 

As discussed in the Requirements section, the OpenTitan project has decided to aim for supporting a separate reset flow to move the ML-DSA-87 signature generation off the critical path for Secure Boot. There is thus no strict performance requirement for ML-DSA-87 signature generation and all techniques proposed by J. W. Bos et al. [5] can be leveraged to trade off DMEM size against computational complexity. Assuming the other PQC extensions for OTBN approved by the project are implemented as well, this additional computational complexity can be handled reasonably well. The following techniques are particularly relevant:

  • Streaming of A: Re-compute matrix A on the fly (once per main loop) instead of storing it in DMEM.
  • Streaming of y: Re-compute vector y on the fly (twice per main loop) instead of storing it in DMEM.
  • Dense vectors: Store vectors of 24-bit elements densely in DMEM. Requires instructions for efficiently packing and unpacking vectors.
  • Memory slotting: Implement a DMEM allocation scheme to use DMEM slots for storing multiple variables over time.
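
To illustrate the dense vector technique, the following sketch packs a vector of eight 24-bit elements, stored in 32-bit lanes, into 192 bits and unpacks it again. It is a generic bit-packing model with a lane ordering chosen by us. It does not claim to reproduce the exact semantics of the OTBN packing and unpacking instructions.

```python
ELEM_BITS = 24  # ML-DSA coefficients are below q = 8380417 < 2**23 and thus fit in 24 bits
LANE_BITS = 32  # element size used by the vectorized arithmetic
LANES = 8       # 8 x 32 bit = one 256-bit wide data register


def pack24(elems):
    """Pack LANES values below 2**24 into one 192-bit integer (lane 0 in the low bits)."""
    if len(elems) != LANES:
        raise ValueError("expected exactly 8 elements")
    packed = 0
    for i, elem in enumerate(elems):
        if not 0 <= elem < (1 << ELEM_BITS):
            raise ValueError("element does not fit in 24 bits")
        packed |= elem << (i * ELEM_BITS)
    return packed


def unpack24(packed):
    """Inverse of pack24: recover the 8 elements, one per 32-bit lane."""
    mask = (1 << ELEM_BITS) - 1
    return [(packed >> (i * ELEM_BITS)) & mask for i in range(LANES)]


def dmem_saving():
    """Relative DMEM saving of the dense representation over 32-bit lanes."""
    return 1.0 - ELEM_BITS / LANE_BITS


if __name__ == "__main__":
    vec = [8380416, 0, 1, 2**23, 12345, 4190208, 7, 8380000]
    dense = pack24(vec)
    assert unpack24(dense) == vec
    print(f"dense size: {LANES * ELEM_BITS // 8} bytes, saving: {dmem_saving():.0%}")
```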

Using these techniques and additional know-how, lowRISC’s Security Team has been able to reduce the DMEM footprint of the hardened ML-DSA-87 signature generation implementation from 164 to below 32 KiB. For this reason, the OTBN DMEM has been increased from 4 KiB to 32 KiB in the upstream repository with PR #29318.

Increasing the DMEM from 4 to 32 KiB induces an area overhead of around 440 kGE or +72% on top of the previous OTBN configuration without PQC extensions.

Performance-wise, the upstreamed implementation of the ML-DSA-87 signature verification operation takes around 300k clock cycles and a first, internal but SCA-hardened ML-DSA-87 signature generation implementation takes around 1.2 M clock cycles per loop iteration. This means, even if for a particular OpenTitan integration, a separate reset flow to move the ML-DSA-87 signature generation off the critical path for Secure Boot could not be supported, the now tighter performance targets (1.5 M clock cycles for signature verification and 7 M clock cycles for the average 3.85 loop iterations of signature generation) can be met. But as mentioned in the Requirements section, there is no guarantee that 4 loop iterations are sufficient for the signature generation, meaning a separate reset flow is still desirable to guarantee compliance with PCIe integration guidelines.
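
The comparison against the tighter targets can be made explicit with the numbers given above. The sketch below is plain arithmetic on these figures, and the names are illustrative.

```python
VERIFY_CYCLES = 300_000                # upstreamed ML-DSA-87 signature verification
SIGN_CYCLES_PER_ITERATION = 1_200_000  # internal SCA-hardened signature generation
AVG_ITERATIONS = 3.85                  # average main-loop iterations for ML-DSA-87

VERIFY_TARGET = 1_500_000              # tighter targets if signing stays on the critical path
SIGN_TARGET = 7_000_000


def average_sign_cycles() -> int:
    """Expected signature generation cycles for the average iteration count."""
    return round(SIGN_CYCLES_PER_ITERATION * AVG_ITERATIONS)


def iterations_within_target() -> int:
    """Number of complete loop iterations that fit into the signing target."""
    return SIGN_TARGET // SIGN_CYCLES_PER_ITERATION


if __name__ == "__main__":
    print("verification meets target:", VERIFY_CYCLES <= VERIFY_TARGET)
    print("average signing cycles:", average_sign_cycles())
    print("iterations fitting into the signing target:", iterations_within_target())
```

The average case (about 4.6 M cycles) fits comfortably, but only 5 iterations fit into the 7 M cycle target, so signature generations needing more iterations would exceed it.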

Adding a KMAC Application Interface

Adding a KMAC application interface to OTBN to offload the Keccak hashing operations of the PQC algorithms from OTBN to the KMAC hardware IP block as proposed by Abdulrahman et al. [4] has three main benefits:

  • It allows reducing the IMEM and DMEM footprints of unhardened implementations by roughly 6 and 1 KiB, respectively [4], as the software-based Keccak implementation is no longer needed. A hardened software-based solution would be significantly larger. The IMEM footprint reduction alone amounts to roughly 35 kGE.
  • It substantially reduces the run times of unhardened ML-DSA-87 signature generation and verification implementations by 62% and 79%, respectively [4].
  • The KMAC block is hardened against SCA already. Considering also SCA hardening, the area and run time reductions are even higher.

lowRISC’s Security Team has come up with a specification and simulator implementation of this interface, and the RTL implementation work is currently ongoing. The core idea is to add wide special purpose registers (WSRs) through which OTBN can push data to the KMAC hardware IP block and read the digest from it. To control the operation of the KMAC block, additional regular CSRs are added.
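
To give a rough feeling for this usage model, the sketch below mimics the data flow in Python: a control write starts a hash operation, the message is pushed in 256-bit words, and the digest is read back in 256-bit words. All class, method and register names are hypothetical and do not reflect the actual specification. SHAKE256 from Python's hashlib merely stands in for the KMAC hardware IP block.

```python
import hashlib

WSR_BYTES = 32  # wide registers of OTBN are 256 bits wide


class KmacAppInterfaceModel:
    """Hypothetical behavioral model of pushing data to and reading digests from KMAC."""

    def __init__(self):
        self._xof = None
        self._words_read = 0

    def csr_start(self):
        """Stands in for a control CSR write that starts a new SHAKE256 operation."""
        self._xof = hashlib.shake_256()
        self._words_read = 0

    def wsr_write_msg(self, word: bytes, valid_bytes: int = WSR_BYTES):
        """Stands in for a WSR write pushing up to 32 message bytes."""
        if len(word) != WSR_BYTES:
            raise ValueError("a wide register holds exactly 32 bytes")
        self._xof.update(word[:valid_bytes])

    def wsr_read_digest(self) -> bytes:
        """Stands in for a WSR read returning the next 32 digest bytes."""
        self._words_read += 1
        return self._xof.digest(self._words_read * WSR_BYTES)[-WSR_BYTES:]


def hash_via_interface(msg: bytes, digest_words: int) -> bytes:
    """Hash msg by pushing 256-bit words and reading digest_words words back."""
    kmac = KmacAppInterfaceModel()
    kmac.csr_start()
    for offset in range(0, len(msg), WSR_BYTES):
        chunk = msg[offset:offset + WSR_BYTES]
        kmac.wsr_write_msg(chunk.ljust(WSR_BYTES, b"\x00"), valid_bytes=len(chunk))
    return b"".join(kmac.wsr_read_digest() for _ in range(digest_words))


if __name__ == "__main__":
    message = bytes(range(100))
    print(hash_via_interface(message, 2).hex())
```

The point of the model is that intermediate data moves directly between the two blocks in register-sized words, without being handled by software running on Ibex.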

The cost of interfacing OTBN with KMAC through an application interface is currently estimated to amount to roughly 35 kGE or +6% of the OTBN configuration without PQC extensions and considering relaxed timing.

Adding a SIMD ISA Extension

To support efficient execution of PQC algorithms, the OpenTitan project has decided to add new vectorized instructions interpreting the 256-bit wide data registers (WDRs) of OTBN as vectors of 8 32-bit elements, resulting in SIMD execution. This work has been upstreamed into the OpenTitan repository with PRs #29344 and #29395. As part of this upstreaming effort, the implementation has undergone thorough RTL and security review with a special focus on the blanking countermeasure employed throughout the bignum ALU and MAC datapaths of OTBN. The aim of this SCA countermeasure is to minimize glitching activity on datapath control signals and to force inputs of datapath elements to zero which are unused by the currently executed instruction, thereby reducing unnecessary and unintentional switching activity in the design and simplifying the SCA hardening of OTBN software. While required for security, the blanking countermeasure can have a notable impact on area and critical path delay. But thanks to the work of lowRISC’s Security Team, the blanking countermeasure could be integrated into the OTBN datapaths in a way that minimizes these undesirable side effects. The resulting implementation in the upstream repository is considerably more area-efficient, faster and better hardened than the previous prototype implementations. Also, the new instructions have been added to the UVM-based verification environment.

In the following, we discuss the added instructions as well as the motivation for choosing them, and we also discuss the impact of the SIMD extension on area and timing.

Supported Vector Instructions

Following the reasoning in [4, 10], 4 new instruction types have been added, each with subvariants, making 13 new instructions in total. These instructions are generic so that they can also be used for cryptographic computations other than ML-DSA and ML-KEM.

  • The first instruction type includes bn.pack and bn.unpk which enable loading 24-bit elements from memory into 32-bit element vectors and vice versa. This can reduce parts of the ML-DSA DMEM footprint by up to 25% as the actual numbers can be represented within 24 bits. As these instructions are generic, they are also beneficial for other workloads where 24 bits are sufficient, for example ML-KEM.
  • The second group targets polynomial arithmetic, which is a central part of both ML-DSA and ML-KEM and a great target for vectorization. In particular, the NTT and INTT naturally lend themselves to parallelization because of the independence of the individual butterfly operations on each layer. For this, the instructions bn.addv(m), bn.subv(m), and bn.mulv(m)(l) have been added which offer SIMD (modular) addition, subtraction, and multiplication, respectively. These instructions not only increase performance but also enable IMEM savings as one instruction can cover multiple computations.
    • The multiplication instructions are implemented in a pipelined fashion. Instead of extending the multiplier to a full vectorized 256-bit multiplier which would be costly both in terms of silicon area and critical path delay, the existing 64-bit multiplier is vectorized to handle two 32-bit multiplications in parallel.
    • The bn.mulv(l) instructions compute a regular multiplication and take 4 cycles to process a full vector.
    • The bn.mulvm(l) instructions implement the Montgomery multiplication algorithm. This algorithm efficiently computes a*b mod q by avoiding expensive division operations. Computing one Montgomery multiplication requires three regular multiplications. These multiplications are performed sequentially on OTBN’s vectorized 64-bit multiplier. The total execution therefore requires 12 cycles (3 multiplications * 4 cycles each). To reduce the hardware overhead of implementing the Montgomery multiplication, its conditional subtraction step has not been implemented in hardware as it can be performed with a bn.addvm instruction.
  • The third instruction group contains the instructions bn.trn1 and bn.trn2 which enable interleaving data inside two WDRs when interpreting them as vectors of multiple elements. This is especially useful for NTT and INTT to shuffle the vector elements when the stride between elements is smaller than what two WDRs provide. These instructions also operate on 64-bit and 128-bit vector elements.
  • The last instruction type is a bit-shifting instruction, bn.shv. This allows, for example, fully vectorizing the decomposition in ML-DSA and enables a more efficient implementation of the coefficient sampling step.
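
The Montgomery multiplication described above can be sketched as follows. This is a textbook model under our own assumptions (R = 2^32, the ML-DSA modulus q = 8380417, two 32-bit products per cycle), with illustrative function names. It is not a description of the actual OTBN datapath, and the conditional subtraction is modelled as a separate step because it is not part of the multiplication hardware.

```python
Q = 8380417          # ML-DSA modulus
LANES = 8            # 32-bit elements per 256-bit WDR
MULS_PER_CYCLE = 2   # two 32-bit products per cycle on the vectorized 64-bit multiplier
R_BITS = 32
R = 1 << R_BITS
Q_NEG_INV = (-pow(Q, -1, R)) % R  # -q^-1 mod R, a precomputed constant


def mont_mul_lane(a: int, b: int) -> int:
    """Return a value congruent to a*b*R^-1 mod Q in the range [0, 2Q)."""
    t = a * b                                  # multiplication 1
    m = ((t & (R - 1)) * Q_NEG_INV) & (R - 1)  # multiplication 2
    return (t + m * Q) >> R_BITS               # multiplication 3, division by R is a shift


def cond_sub_lane(u: int) -> int:
    """Final conditional subtraction, done by a separate instruction in the text."""
    return u - Q if u >= Q else u


def mulvm_model(va, vb):
    """Lane-wise Montgomery multiplication of two 8-element vectors (unreduced)."""
    return [mont_mul_lane(a, b) for a, b in zip(va, vb)]


def cycles_per_mulvm() -> int:
    """Three sequential multiplications, each taking LANES / MULS_PER_CYCLE cycles."""
    return 3 * (LANES // MULS_PER_CYCLE)


if __name__ == "__main__":
    va = [(i * 1_000_003) % Q for i in range(LANES)]
    vb = [(i * 7_654_321 + 5) % Q for i in range(LANES)]
    vb_mont = [(b * R) % Q for b in vb]  # move one operand into the Montgomery domain
    out = [cond_sub_lane(u) for u in mulvm_model(va, vb_mont)]
    assert out == [(a * b) % Q for a, b in zip(va, vb)]
    print(out, "cycles:", cycles_per_mulvm())
```

Keeping one operand, e.g. a precomputed twiddle factor, in the Montgomery domain makes the factor R^-1 cancel, so the product comes out in the normal domain.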

Supported Vector Element Sizes

The new PQC algorithms mostly operate on 32-bit as well as 24-bit (ML-DSA) and 12-bit elements (ML-KEM) suggesting the addition of support for both 16- and 32-bit vector elements in the SIMD ISA extension for maximum benefit. However, our investigations in collaboration with experts in the design of vector engines for high-performance computing at ETH Zürich revealed that adding support for 16-bit vector elements increases the overheads of the SIMD extension by 50% and 33% in terms of silicon area and critical path delay, respectively.

Since these overheads are considerable and because the project doesn’t currently have performance requirements for ML-KEM where the support for 16-bit vector elements would benefit most, the OpenTitan project has decided to restrict the SIMD extension of OTBN to supporting 32-bit vector elements.

It’s further worth noting that having 16-bit support on top of 32-bit support can at most double the performance. As we’ll illustrate below, the additional area is better spent on the mask conversion acceleration hardware which offers much higher performance gains (see Adding Hardware Acceleration for Mask Conversion).

Impact on Area and Timing

When synthesizing the upstreamed SIMD design (see PRs #29344 and #29395) for the open FreePDK45 process, the area overhead of the SIMD ISA extension can be quantified as 23 kGE. In turn, having this extension enables valuable memory area savings:

  • For the IMEM, it enables code size reductions of around 18% (4.6 KiB) [4]. This translates to an area saving of roughly 30 kGE (-5%).
  • The bn.pack and bn.unpk instructions allow reducing the DMEM footprint of vectors and matrices by 25%. This translates to an area saving of roughly 130 kGE (-21%).
  • The much improved performance (2.58x faster ML-DSA-87 signature generation and 1.8x faster signature verification according to [4]) paves the way for applying the DMEM optimization techniques proposed by [5] to reduce the DMEM footprint for a hardened ML-DSA-87 signature generation implementation from 164 KiB down to below 32 KiB while still meeting performance requirements.

To assess the impact on timing concerning integrations into bigger SoCs as part of an iRoT, we have synthesized the design with a commercial synthesis tool (Cadence Genus) targeting the open ASAP7 PDK. The design is analyzed in the slow-slow corner under the following clock sweep where the actual synthesis has a clock overconstraining of 15%:

The resulting AT curves are shown in the following plot. The plot considers overconstraining and the achievable slack, meaning the points are shifted accordingly.

The “Baseline” design corresponds to the OTBN version taped out in the first production OpenTitan silicon. The “SIMD” design corresponds to the current upstream version of OTBN including the SIMD extension and the increased memory (commit 1b83ebf1). The memory area itself has been removed for producing the plot to fully focus on the overheads of our SIMD extension. As shown in the AT plot, the minimum clock period increases by 0.229 ns from 1.5 ns to 1.729 ns which corresponds to a reduction in the maximum clock frequency from 666 MHz to 580 MHz (-13%).

So to summarize, the upstreamed implementation of our SIMD ISA extension has a great cost-benefit ratio: The area cost itself is small (23 kGE, 4%) but it enables notable IMEM and DMEM savings which outweigh the area cost of the SIMD extension by far. Timing wise, there is a reduction in maximum clock frequency (-13%) but this reduction is sufficiently small to not risk integrations into iRoT designs which clock a lot faster than discrete RoTs.

Adding Hardware Acceleration for Mask Conversion

As outlined in the Requirements section, hardening the implementation against physical attacks such as side-channel analysis (SCA) is a key requirement for the OpenTitan project. For this reason, lowRISC’s Security Team started investigating this challenge early on by doing extensive literature research and by getting advice from experts like Markku-Juhani O. Saarinen, Professor of Practice at Tampere University, who has substantial practical experience when it comes to SCA attacks and hardening of PQC implementations. The team identified highly relevant research works such as the paper by Azouaoui et al. investigating the design of masking countermeasures for hardening ML-DSA signature generation against SCA and providing insight on which variables and operations actually need to be protected [7]. Based on these insights, the team then collaborated with ETH Zürich on implementing the proposed techniques to come up with an SCA hardened version of ML-DSA running on OTBN and experimentally verifying the SCA hardening of this implementation on a ChipWhisperer FPGA platform [9].

Breakdown of Masking Run Time Overhead

As part of this work, the team also measured the run time overhead relative to an unhardened baseline implementation. For the ML-DSA-65 signature generation function, the run time increases from 4.5 M clock cycles without hardening to 107 M clock cycles with hardening, i.e., it increases by roughly 24x when adding purely software-based hardening. In other words, more than 95% of the run time is spent on the software-based SCA countermeasures. Note that this work used a version of OTBN with larger memory but without any SIMD support and without a KMAC application interface. If these two extensions were taken into account as well, roughly 99% of the run time would be spent on the software-based SCA countermeasures.
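The relation between the slowdown factor and the share of run time spent on countermeasures can be made explicit with a short calculation. The sketch below uses only the cycle counts quoted above; the helper names are chosen for illustration.

```python
def slowdown(hardened_cycles: float, baseline_cycles: float) -> float:
    """How many times slower the hardened implementation is."""
    return hardened_cycles / baseline_cycles


def countermeasure_share(hardened_cycles: float, baseline_cycles: float) -> float:
    """Fraction of the hardened run time that is pure hardening overhead."""
    return (hardened_cycles - baseline_cycles) / hardened_cycles


def slowdown_for_share(share: float) -> float:
    """Slowdown factor implied by a given countermeasure share."""
    return 1.0 / (1.0 - share)


factor = slowdown(107e6, 4.5e6)
share = countermeasure_share(107e6, 4.5e6)

print(f"slowdown: {factor:.1f}x, countermeasure share: {share:.1%}")
print(f"a 99% share implies a slowdown of {slowdown_for_share(0.99):.0f}x")
```

A 99% share corresponds to a slowdown of about 100x, which matches the run time overhead of purely software-based hardening quoted in the summary of this article.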

The figure below presents a breakdown of these overheads for the signature generation. Most of the run time overhead is spent on two main functions: The SecDecompose function [7] takes roughly 55 M cycles (54% of the overhead) and the SecBoundCheck function [7] about 46 M cycles (44%).

These two functions can further be broken down into the following three main subfunctions or masking gadgets:

  • Arithmetic-to-Boolean mask conversion (A2B): This is generally required when moving outputs of lattice operations into hash operations. Around 46% of the run time overhead is spent on this.
  • Boolean-to-Arithmetic mask conversion (B2A): This is generally required when moving outputs of hash operations into lattice operations. Around 33% of the run time overhead is spent on this.
  • Secure addition (SecureAdd): This computes the arithmetic sum of two Boolean-masked operands, e.g., as they come out of a hash operation. Around 10% of the run time overhead is spent on this.

Of these three functions, SecureAdd is the most fundamental one: Assuming a hardware accelerator for SecureAdd is available, both A2B and B2A can be implemented on top of that. This means with a single core building block, 89% of the SCA hardening run time overhead can be accelerated.
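To illustrate why SecureAdd is sufficient as a core building block, the following Python sketch builds A2B and B2A on top of a first-order (two-share) Boolean-masked adder. It is a functional model only and makes several assumptions that are not taken from the OTBN design: it works modulo 2^32 rather than with a flexible modulus, it uses a Kogge-Stone carry network with an ISW-style masked AND, and all function names are hypothetical. Python code like this offers no side-channel protection by itself; it only shows that the arithmetic works out.

```python
import secrets

K = 32                     # word size in bits (assumption for this sketch)
MASK = (1 << K) - 1


def bool_mask(x):
    """Split x into two Boolean shares (s0, s1) with s0 ^ s1 == x."""
    r = secrets.randbits(K)
    return (x ^ r) & MASK, r


def bool_unmask(s):
    return s[0] ^ s[1]


def sec_xor(x, y):
    """XOR is linear, so it is applied share by share."""
    return x[0] ^ y[0], x[1] ^ y[1]


def sec_shl(x, n):
    """Left shift is linear as well."""
    return (x[0] << n) & MASK, (x[1] << n) & MASK


def sec_and(x, y):
    """First-order masked AND using one fresh random word."""
    r = secrets.randbits(K)
    z0 = (x[0] & y[0]) ^ r
    z1 = (x[1] & y[1]) ^ ((r ^ (x[0] & y[1])) ^ (x[1] & y[0]))
    return z0, z1


def secure_add(x, y):
    """Boolean shares of (x + y) mod 2^K from Boolean shares of x and y."""
    p = sec_xor(x, y)              # propagate bits
    g = sec_and(x, y)              # generate bits
    shift = 1
    while shift < K:               # log2(K) carry-network stages
        g = sec_xor(g, sec_and(p, sec_shl(g, shift)))
        p = sec_and(p, sec_shl(p, shift))
        shift *= 2
    return sec_xor(sec_xor(x, y), sec_shl(g, 1))


def a2b(a):
    """Arithmetic shares (a0 + a1 == x) to Boolean shares of x."""
    # Each arithmetic share is uniformly random on its own, so it can be
    # re-shared in the Boolean domain and the two results added securely.
    return secure_add(bool_mask(a[0]), bool_mask(a[1]))


def b2a(b):
    """Boolean shares of x to arithmetic shares (a0 + a1 == x)."""
    a1 = secrets.randbits(K)
    minus_a1 = bool_mask((-a1) & MASK)
    # x - a1 is uniformly random because a1 is, so it may be recombined.
    a0 = bool_unmask(secure_add(b, minus_a1))
    return a0, a1
```

For example, `b2a(bool_mask(x))` returns a pair whose sum modulo 2^32 equals `x`, and feeding that pair into `a2b` yields Boolean shares that recombine to `x` again. Both conversions spend almost all of their work inside `secure_add`, which is the reason a single hardware primitive can accelerate all three gadgets.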

OTBN’s Masking Accelerator Interface (MAI)

For this reason, the OpenTitan project has decided to invest in designing a WSR/CSR-controlled Masking Accelerator Interface (MAI). At its core, the MAI uses a SecureAdd primitive to build a SecureAddMod function (SecureAdd with a flexible modulus), from which A2B and B2A accelerator functions can be constructed using some additional arithmetic. Thanks to the flexible modulus, the MAI is not only suitable for ML-DSA and ML-KEM but can also efficiently accelerate masked implementations of classical algorithms, including symmetric algorithms.

lowRISC’s Security Team has written a specification of this new interface and the implementation work is currently ongoing. Internally, the team has a first RTL implementation of the SecureAdd core component which passes pre-silicon masking verification using CocoAlma (formal) [13] and PROLEAD (simulation-based) [14]. The design leverages Hardware Private Circuit (HPC) masking gadgets [11] to implement a fully pipelined 32-bit Sklansky adder [12] with a latency of 6 clock cycles. This design is able to add two vectors of eight 32-bit Boolean-masked values in just 13 clock cycles instead of 1.6 M clock cycles when implementing this in software, meaning it’s more than five orders of magnitude faster. Considering the full, SCA-hardened ML-DSA-87 signature generation algorithm, the MAI reduces the run time to roughly 1.2 M clock cycles, which corresponds to a speedup on the order of 80x. Compared to an unhardened implementation featuring SIMD support and a KMAC application interface, the resulting run time overhead is below 30% [4]. In terms of area, the SecureAdd core consumes roughly 25 kGE. The total MAI is expected to consume roughly 57 kGE (+9%), but considering the 80x speedup, this area is well invested.
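The adder structure and the cycle count can be illustrated with a small functional model. The sketch below is an unmasked, plain-Python Sklansky parallel-prefix adder, so it shows the carry network only and none of the HPC masking that the actual RTL applies. The pipeline model assumes that one new element enters the pipeline per clock cycle; this is an assumption made here because it is consistent with the quoted figures, not a statement about the RTL. All names are hypothetical.

```python
K = 32
MASK = (1 << K) - 1


def sklansky_add(x: int, y: int):
    """Return ((x + y) mod 2^K, number of prefix levels used)."""
    g = [(x >> i) & (y >> i) & 1 for i in range(K)]      # generate bits
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(K)]    # propagate bits
    span, levels = 1, 0
    while span < K:
        new_g, new_p = g[:], p[:]
        for i in range(K):
            if i & span:
                # Upper half of a block of size 2 * span: combine with the
                # topmost bit of the lower half of the same block.
                j = (i // (2 * span)) * (2 * span) + span - 1
                new_g[i] = g[i] | (p[i] & g[j])
                new_p[i] = p[i] & p[j]
        g, p = new_g, new_p
        span *= 2
        levels += 1
    carries = sum(bit << i for i, bit in enumerate(g))   # carry out of bit i
    return ((x ^ y) ^ (carries << 1)) & MASK, levels


def pipelined_cycles(n_items: int, latency: int) -> int:
    """Cycles to push n_items through a fully pipelined unit, one per cycle."""
    return latency + n_items - 1


total, levels = sklansky_add(0xFFFFFFFF, 1)
print(total, levels)                 # wraps around to 0 using 5 prefix levels
print(pipelined_cycles(8, 6))        # eight elements, latency of 6 cycles
```

With a latency of 6 cycles and eight elements, the model gives 6 + 8 - 1 = 13 cycles, which is the figure quoted above for adding two vectors of eight 32-bit values.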

Current Status & Next Steps

Out of the four discussed extensions to enable PQC support on OTBN, two hardware extensions have been fully upstreamed into the public OpenTitan repository (see PR #29318 for increasing the memory sizes, and PRs #29344 and #29395 for the SIMD ISA extension).

Concerning the two remaining hardware extensions, lowRISC’s Security Team has designed the corresponding specifications and is now working on the RTL implementations. The next RTL implementation to get upstreamed will be the Masking Accelerator Interface (MAI). 

Concerning the actual ML-DSA-87 implementation that will run on the extended OTBN hardware, the team has completed upstreaming a first version of the signature verification operation integrated into the cryptolib (see mldsa87 for details). This implementation consumes roughly 300k clock cycles on the simulated version of OTBN assuming all extensions presented in this article.

Internally, the team also has a first SCA-hardened version of the ML-DSA-87 signature generation operation ready, for which upstreaming is about to start and which takes roughly 1.2 M clock cycles per loop iteration. This means that even if a particular OpenTitan integration cannot support a separate reset flow to move the ML-DSA-87 signature generation off the critical path for Secure Boot, this implementation can meet the performance requirement of 7 M clock cycles for signature generation in the average case.
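The average-case claim can be checked with a simple budget calculation. ML-DSA signature generation uses rejection sampling, so the number of loop iterations varies from signature to signature. The sketch below assumes an expected repetition count of about 3.85 for ML-DSA-87, as listed in FIPS 204 [1]; this value is not stated in this article and is treated as an assumption here. It also ignores any run time spent outside the loop.

```python
CYCLES_PER_ITERATION = 1_200_000     # SCA-hardened loop iteration (from the text)
BUDGET_CYCLES = 7_000_000            # performance requirement (from the text)
EXPECTED_ITERATIONS = 3.85           # assumption: expected repetitions, FIPS 204

# Largest whole number of loop iterations that still fits into the budget.
max_iterations = BUDGET_CYCLES // CYCLES_PER_ITERATION

# Average-case run time under the assumed repetition count.
expected_cycles = EXPECTED_ITERATIONS * CYCLES_PER_ITERATION

print(f"iterations within budget: {max_iterations}")
print(f"expected cycles: {expected_cycles / 1e6:.2f} M "
      f"(budget {BUDGET_CYCLES / 1e6:.0f} M)")
```

Under these assumptions the average case stays well within the budget, while signatures that need more iterations than `max_iterations` would exceed it.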

Get Involved

The OpenTitan project is stewarded by lowRISC C.I.C., a not-for-profit engineering company that creates and maintains commercial-grade open source silicon designs through its collaborative Silicon Commons® approach. As the post-quantum world comes ever closer, the OpenTitan partnership and lowRISC are committed to a roadmap that addresses these challenges head on.

If you would like to find out more about OpenTitan and its approach to PQC, contact us at get-involved@opentitan.org.

References

  1. FIPS 204: Module-Lattice-Based Digital Signature Standard, https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.204.pdf 
  2. FIPS 203: Module-Lattice-Based Key-Encapsulation Mechanism Standard, https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf
  3. Commercial National Security Algorithm Suite 2.0, CNSA 2.0, https://media.defense.gov/2025/May/30/2003728741/-1/-1/0/CSA_CNSA_2.0_ALGORITHMS.PDF 
  4. A. Abdulrahman et al., Towards ML-KEM & ML-DSA on OpenTitan, https://eprint.iacr.org/2024/1192.pdf
  5. J.W. Bos et al., Dilithium for Memory Constrained Devices, https://eprint.iacr.org/2022/323.pdf 
  6. P. Etterli et al., Design and Optimization of a PQC ISA Extension for OTBN, semester project at ETH Zürich: Slides, Report, Repo.
  7. M. Azouaoui et al., Protecting Dilithium against Leakage, https://eprint.iacr.org/2022/1406.pdf 
  8. A.R. Shahmirzadi et al., Efficient Boolean-to-Arithmetic Mask Conversion in Hardware, https://eprint.iacr.org/2024/1633.pdf
  9. H. Filali, P. Nasahl, C. Reinwardt, P. Vogel, Power Side-Channel Evaluation and Hardening of PQC Algorithms on OpenTitan, M.S. thesis, ETH Zurich, Mar. 2025, https://www.research-collection.ethz.ch/entities/publication/d573d76d-9cae-48d3-b149-5bddd86a14cf
  10. E. Urquhart et al., Acceleration of Core Post-quantum Cryptography Primitive on Open-Source Silicon Platform Through Hardware/Software Co-design, https://link.springer.com/chapter/10.1007/978-981-97-8013-6_7 
  11. G. Cassiers et al., Compress: Generate Small and Fast Masked Pipelined Circuits, https://eprint.iacr.org/2023/1600.pdf
  12. J. Sklansky et al., Conditional-Sum Addition Logic, https://ieeexplore.ieee.org/abstract/document/5219822 
  13. V. Hadžić and R. Bloem, COCOALMA: A Versatile Masking Verifier, https://ieeexplore.ieee.org/document/9617707/ 
  14. N. Müller and A. Moradi, PROLEAD – A Probing-Based Hardware Leakage Detection Tool, https://tches.iacr.org/index.php/TCHES/article/view/9822

1Note that if future versions of the discrete OpenTitan Earl Grey top level were taped out in a more advanced technology node, the frequency of the main clock and accordingly the maximum allowable cycle counts would likely increase by 50% or 100%.

2The percentage increase for every modification is relative to the total area of the OTBN configuration taped out as part of the first production OpenTitan silicon.

3The estimated area overheads were obtained by converting the relative overheads of the individual extensions, e.g., as presented in the cited works or by synthesizing the implemented extensions, to the open FreePDK45 process. The area overheads for the memories are rough estimates.