Since presenting the Diamond iO paper and its partial implementation last year, we have focused on improving the efficiency of BGG+ encodings from both theoretical and engineering perspectives. We believe this direction will make Diamond iO practical for larger input sizes and more complex circuits, as shown in our roadmap.
All codes are available at our mxx repository.
GPU Implementation
We have developed GPU implementations of the fundamental operations underlying BGG+ encodings and lattice-based iO. Specifically, these include:
- arithmetic operations on polynomials and digit decomposition,
- arithmetic operations on polynomial matrices, and
- preimage sampling.
Compared with the CPU implementation, the GPU implementation achieved roughly a 1,000-fold speedup for polynomial matrix multiplication and roughly a 20-fold speedup for preimage sampling.
New Lookup Table Evaluation over BGG+ Encodings
Last year, we have introduced a new technique to evaluate lookup tables (LUTs) over encodings [SB25]. However, for large circuits, this is still far from practical. One bottleneck is that, if we let $T$ denote the size of a LUT and $G_L$ the number of gates that evaluate the LUT, then an evaluation key generated by a trusted party, which is a part of the obfuscated circuit, must contain $\mathcal{O}(TM)$ lattice preimages.
To address this bottleneck, we built a new LUT evaluation technique that reduces the number of these preimages to $\mathcal{O}(T + G_L)$. We believe the construction is secure under the LWE and the private-coin evasive LWE assumptions.
Evaluating Modulo-q Multiplication over BGG+ Encodings with Large Parameters
Using the new LUT evaluation technique, we succeeded in evaluating a modulo-$q$ multiplication over BGG+ encodings, where $q$ is a prime factor of the encoding modulus $Q$ and is about 24 bits. The parameters used in the evaluation were sufficiently large for correctness and security; in particular, the ring dimension is $n=2^{14}$, and the encoding modulus size is $\log_2 Q = 264$ bits. However, the encoding modulus size was chosen by considering correctness only, so the actual modulus size may need to be larger in order to ensure security (e.g., for noise flooding).
With 8 GPUs (RTXPro 6000) running in parallel, preprocessing (corresponding to obfuscation in iO) took about 1.5 hours, and online evaluation (corresponding to obfuscated-circuit evaluation) finishes in under 1 minutes. The preprocessing output size, corresponding to a part of an obfuscated circuit, was 731 GB.
Notably, we confirmed that the main preprocessing bottleneck—preimage sampling for all LUT entries (23,114 entries in our experiment)—parallelizes well. This implies that the preprocessing time can be reduced by scaling out the number of identical GPUs.
Slotwise Operations over BGG+ Encodings
To ultimately support FHE multiplication, we extended modulo-$q$ multiplication over our BGG+ encodings to polynomial multiplication. To achieve this, the BGG+ encodings had to support encoding the value of each evaluation slot of a polynomial separately, as well as a slot transfer operation that transforms values across slots.
To enable these new features without increasing the preprocessing output size for LUTs with the number of slots, we introduced a variant of BGG+ encodings in which the public key is shared across all slots, while the secret key is specific to each slot. Furthermore, by introducing a mechanism for switching the secret key across different slots, we realized slot transfer for this variant of the encodings. The preprocessing data size for slot transfer is bounded by $\mathcal{O}(NG_S)$, where $G_S$ is the number of slot transfer gates, and $N$ is the number of slots, namely the ring dimension.
GSW FHE Multiplication over BGG+ Encodings
By combining all of the above new techniques, we succeeded in implementing a circuit for GSW FHE evaluation that can be evaluated over BGG+ encodings. As a concrete example, we measured a benchmark by computing a Goldreich's PRG on input an encoding of an encrypted seed.
The evaluation was conducted on a single H200 GPU with ring dimension $n = 2^{16}$ and modulus size $\log_2 Q = 1120$ bits. Because the circuit size and total computational cost were too large to complete the full computation, we evaluated only the smallest parallelizable unit, such as the evaluation of a single gate in each circuit layer, and estimated the benchmark results by multiplying these measurements by the corresponding number of parallel computation units.
The estimated benchmark results are as follows:
- In preprocessing (corresponding to obfuscation in iO), the minimum latency is 10 mins when more than $10^{12}$ GPUs are available.
- In online evaluation (corresponding to evaluation in iO), the minimum latency is 51 mins when more than $10^{18}$ GPUs are available.