verify everything

most zk bugs are silent. your prover generates something that looks valid, the verifier accepts it, and everything feels fine until someone exploits a soundness hole you never knew existed.

when i started building gunmetal, i knew i wouldn’t stop at just one backend. Metal today, CUDA tomorrow, Vulkan after that. but every new backend is a fresh opportunity to introduce subtle bugs that only manifest on specific hardware, specific inputs, specific cosmic ray alignments. hand-writing test cases for all of that is miserable and easy to mess up.

so i built gunmetal-verify to formally verify the prover. three tiers of paranoia:

tier one: lean proofs for core arithmetic. field operations actually form a field, curve addition is associative, pairing equations hold. if lean accepts it, it’s definitively correct.

tier two: 193 proptest properties covering MSM, NTT/FFT, QAP/R1CS, and Groth16. instead of checking 5 handpicked inputs we verify mathematical properties across thousands of random ones, which catches edge cases deterministic tests miss (sketched below).

tier three: differential testing. run the same computation on GPU and CPU, compare bit-for-bit. any divergence is a bug.
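to make tier two concrete, here’s the shape of a property test. a minimal sketch, not gunmetal’s actual suite - it assumes arkworks’ bn254 scalar field as a stand-in for the real field type, with proptest driving the random inputs:

```rust
// a property-test sketch, not gunmetal's actual suite: arkworks' bn254 scalar
// field stands in for the real field type, proptest drives the random inputs
use ark_bn254::Fr;
use ark_ff::{Field, PrimeField};
use proptest::array::uniform32;
use proptest::prelude::*;

// strategy: 32 random bytes reduced mod the field order
fn arb_fr() -> impl Strategy<Value = Fr> {
    uniform32(any::<u8>()).prop_map(|bytes| Fr::from_le_bytes_mod_order(&bytes))
}

proptest! {
    // distributivity has to hold for every random triple, not 5 handpicked ones
    #[test]
    fn mul_distributes_over_add(a in arb_fr(), b in arb_fr(), c in arb_fr()) {
        prop_assert_eq!(a * (b + c), a * b + a * c);
    }

    // every nonzero element has a multiplicative inverse - "actually a field"
    #[test]
    fn nonzero_elements_invert(a in arb_fr()) {
        prop_assume!(a != Fr::from(0u64));
        prop_assert_eq!(a * a.inverse().unwrap(), Fr::from(1u64));
    }
}
```

proptest also shrinks failing inputs toward a minimal counterexample, which matters when the failing input is an arbitrary 254-bit scalar.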

the framework also scans for broken primitives, insufficient rounds, and FreeLunch attack patterns (ePrint 2024/347). if you’re not familiar: certain prover optimizations create soundness holes that let you forge proofs for false statements. the forged proof passes every functional test while completely breaking your security model.

backend-specific bugs are a fascinating failure mode. your cuda kernel might be 10x faster but introduce a one-in-a-million case where the result is wrong by a single bit. that single bit can break soundness entirely.
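this is exactly what the differential tier is for. a minimal sketch, assuming arkworks as the CPU reference; `metal_msm` is a hypothetical stand-in for the GPU backend’s entry point:

```rust
// differential tier in miniature: arkworks is the CPU reference, `metal_msm`
// is a hypothetical stand-in for the GPU backend's entry point
use ark_bn254::{Fr, G1Affine, G1Projective};
use ark_ec::VariableBaseMSM;
use ark_serialize::CanonicalSerialize;

// hypothetical GPU entry point, stubbed so the sketch stands alone
fn metal_msm(_bases: &[G1Affine], _scalars: &[Fr]) -> G1Projective {
    unimplemented!("dispatch the metal MSM kernel here")
}

fn assert_backends_agree(bases: &[G1Affine], scalars: &[Fr]) {
    let cpu = G1Projective::msm(bases, scalars).unwrap();
    let gpu = metal_msm(bases, scalars);

    // compare canonical encodings, not raw projective coordinates: one point
    // has many (x, y, z) representations, so "bit-for-bit" means the
    // normalized serialization
    let (mut cpu_bytes, mut gpu_bytes) = (Vec::new(), Vec::new());
    cpu.serialize_compressed(&mut cpu_bytes).unwrap();
    gpu.serialize_compressed(&mut gpu_bytes).unwrap();
    assert_eq!(cpu_bytes, gpu_bytes, "GPU/CPU divergence: backend bug");
}
```

feed it the same random strategies as the property tests and the one-in-a-million single-bit case eventually surfaces.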

implement, run suite, pass = cryptographically correct by construction. that’s the whole point.

heavy metal

built a groth16 prover for mac metal.

this perhaps makes zero sense. proofs run on servers. gnark, arkworks, rapidsnark, snarkjs, icicle - that’s the stack. nobody’s proving on laptops. but i had the itch & was curious how the M-series Ultra would hold up with its unified memory.

120ms for 134k constraints on M3 Ultra. 2.9x faster than CPU.

here’s why.

custom field arithmetic instead of relying on arkworks. GLV endomorphism with zero-allocation Barrett reduction gives ~20x faster scalar decomposition - arkworks uses heap-allocating BigInt, we use stack arrays. CIOS Montgomery multiplication fully unrolled in Metal shaders.
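for flavor, here’s the shape of CIOS montgomery multiplication, written in rust for readability. a sketch with placeholder constants - the real thing is the fully unrolled metal shader:

```rust
// CIOS montgomery multiplication over a 4x64-bit prime field - a sketch with
// placeholder constants, not the unrolled metal shader itself.
const N: usize = 4;
const MODULUS: [u64; N] = [0, 0, 0, 0]; // placeholder: your field's prime p
const INV: u64 = 0;                     // placeholder: -p^{-1} mod 2^64

/// multiply-accumulate: (low, high) of a + b*c + carry
#[inline(always)]
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let t = a as u128 + (b as u128) * (c as u128) + carry as u128;
    (t as u64, (t >> 64) as u64)
}

/// subtract-with-borrow: (difference, borrow) of a - b - borrow
#[inline(always)]
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let t = (a as u128).wrapping_sub(b as u128 + borrow as u128);
    (t as u64, (t >> 127) as u64)
}

/// montgomery product: given aR and bR, returns abR mod p.
/// everything lives on stack arrays - zero heap allocation.
fn mont_mul(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let mut t = [0u64; N + 2];
    for i in 0..N {
        // multiplication phase: t += a * b[i]
        let mut carry = 0;
        for j in 0..N {
            let (lo, hi) = mac(t[j], a[j], b[i], carry);
            t[j] = lo;
            carry = hi;
        }
        let (lo, hi) = mac(t[N], 0, 0, carry);
        t[N] = lo;
        t[N + 1] = hi;

        // reduction phase: add m*p so the lowest limb cancels, then shift down
        let m = t[0].wrapping_mul(INV);
        let (_, mut carry) = mac(t[0], m, MODULUS[0], 0); // low limb is 0 by construction
        for j in 1..N {
            let (lo, hi) = mac(t[j], m, MODULUS[j], carry);
            t[j - 1] = lo;
            carry = hi;
        }
        let (lo, hi) = mac(t[N], 0, 0, carry);
        t[N - 1] = lo;
        t[N] = t[N + 1] + hi;
    }
    // final conditional subtraction: result < 2p, so subtract p at most once
    // (assumes p < 2^255, so the t[N] carry limb is zero here)
    let r = [t[0], t[1], t[2], t[3]];
    let mut s = [0u64; N];
    let mut borrow = 0;
    for j in 0..N {
        let (d, br) = sbb(r[j], MODULUS[j], borrow);
        s[j] = d;
        borrow = br;
    }
    if borrow == 0 { s } else { r }
}
```

in the shaders both loops are unrolled completely, so every limb lives in registers and there’s no loop overhead.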

pipelining is where it gets interesting. QAP computation (the polynomial math) overlaps with MSM execution (the elliptic curve math). while the GPU grinds through G1 MSMs, the CPU handles G2 MSMs in parallel. two expensive operations, zero waiting.
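the G1/G2 overlap in miniature, using scoped threads. `gpu_g1_msm` and `cpu_g2_msm` are hypothetical stand-ins for the real dispatch paths:

```rust
// the CPU/GPU overlap in miniature; types and entry points are hypothetical
// stand-ins for gunmetal's actual ones
use std::thread;

struct G1Point;
struct G2Point;
struct Scalar;

fn gpu_g1_msm(_bases: &[G1Point], _scalars: &[Scalar]) -> G1Point {
    unimplemented!("encode + commit a metal command buffer, wait for completion")
}
fn cpu_g2_msm(_bases: &[G2Point], _scalars: &[Scalar]) -> G2Point {
    unimplemented!("multithreaded pippenger on the CPU")
}

fn overlapped_msms(g1: &[G1Point], g2: &[G2Point], k: &[Scalar]) -> (G1Point, G2Point) {
    thread::scope(|s| {
        // hand the G1 work to the GPU on its own thread...
        let gpu = s.spawn(|| gpu_g1_msm(g1, k));
        // ...and keep this thread busy with the G2 MSM in the meantime
        let cpu = cpu_g2_msm(g2, k);
        (gpu.join().unwrap(), cpu)
    })
}
```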

stockham FFT for NTT computation - 10x faster than CPU. avoids bit-reversal permutations and enables coalesced memory access.
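here’s a compact stockham radix-2 NTT sketch. it runs over the goldilocks prime purely to keep the arithmetic short - not gunmetal’s actual field - but the access pattern is the point:

```rust
// stockham radix-2 NTT sketch - goldilocks prime for brevity, not gunmetal's
// field. the point is the access pattern: no bit-reversal, contiguous stages.
const P: u64 = 0xffff_ffff_0000_0001; // 2^64 - 2^32 + 1, NTT-friendly

fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }
fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }
fn sub(a: u64, b: u64) -> u64 { add(a, P - b) }
fn pow(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = mul(acc, b); }
        b = mul(b, b);
        e >>= 1;
    }
    acc
}

/// `omega` must be a primitive n-th root of unity mod P, n a power of two.
/// ping-pongs between two buffers; every stage reads and writes contiguous
/// runs (coalesced on a GPU) and the output lands in natural order.
fn stockham_ntt(input: &[u64], omega: u64) -> Vec<u64> {
    let n = input.len();
    let (mut a, mut b) = (input.to_vec(), vec![0u64; n]);
    let (mut l, mut m) = (n / 2, 1);
    while l >= 1 {
        for j in 0..l {
            let w = pow(omega, (j * (n / (2 * l))) as u64); // stage twiddle
            for k in 0..m {
                let c0 = a[k + j * m];
                let c1 = a[k + j * m + l * m];
                b[k + 2 * j * m] = add(c0, c1);             // butterfly, sorted output
                b[k + 2 * j * m + m] = mul(w, sub(c0, c1));
            }
        }
        std::mem::swap(&mut a, &mut b);
        l /= 2;
        m *= 2;
    }
    a // natural order - the bit-reversal permutation never happened
}
```

every stage touches memory in contiguous runs, which is exactly the coalesced pattern a GPU wants.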

apple silicon’s unified memory architecture means zero-copy between CPU & GPU. no PCIe bottleneck. no staging buffers. the GPU just… sees the data. interleaved point layout for cache locality means memory access patterns are basically free performance. this is a massive advantage most people sleep on.
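in code the zero-copy story is almost boring. a sketch, assuming the metal-rs bindings:

```rust
// zero-copy, sketched with the metal-rs bindings: one StorageModeShared
// allocation that the CPU fills and the GPU kernel reads directly
use metal::{Device, MTLResourceOptions};

fn upload_scalars(device: &Device, scalars: &[u64]) -> metal::Buffer {
    let len = (scalars.len() * std::mem::size_of::<u64>()) as u64;
    // a single unified-memory allocation, visible to both sides:
    // no staging buffer, no PCIe transfer
    let buf = device.new_buffer(len, MTLResourceOptions::StorageModeShared);
    // the CPU writes straight into the pages the kernel will read
    // (in a real pipeline you'd build the data in this buffer to begin with)
    unsafe {
        std::ptr::copy_nonoverlapping(scalars.as_ptr(), buf.contents() as *mut u64, scalars.len());
    }
    buf
}
```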

pippenger algorithm with 13-bit windows & local threadgroup atomics. 128 threads per threadgroup hit the sweet spot after profiling - beats the 256 baseline. classic MSM acceleration but tuned specifically for metal’s execution model.
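the bucket method itself, in toy form. the “group” here is just integers under addition mod M so the structure is easy to follow - the real kernel does this per-threadgroup on curve points, accumulating buckets with threadgroup atomics:

```rust
// pippenger in toy form: the "group" is integers under addition mod M so the
// structure is easy to check. the real kernel does this per-threadgroup on
// curve points, accumulating buckets with threadgroup atomics.
type G = u64;
const M: u64 = 0xffff_ffff_ffff_ffc5;

fn g_add(a: G, b: G) -> G { ((a as u128 + b as u128) % M as u128) as G }
fn g_dbl(a: G) -> G { g_add(a, a) }

/// computes sum_i scalars[i] * points[i] with W-bit windows
fn pippenger(points: &[G], scalars: &[u64]) -> G {
    const W: usize = 13;             // gunmetal's window width
    let windows = (64 + W - 1) / W;  // W-bit windows covering a 64-bit scalar
    let mut acc: G = 0;
    for w in (0..windows).rev() {
        // shift the running total up by one window (W doublings; no-op at start)
        for _ in 0..W { acc = g_dbl(acc); }
        // bucket phase: drop each point into the bucket for its window digit
        let mut buckets = vec![0 as G; (1 << W) - 1]; // bucket d-1 holds digit d
        for (pt, sc) in points.iter().zip(scalars) {
            let digit = ((*sc >> (w * W)) & ((1u64 << W) - 1)) as usize;
            if digit != 0 {
                buckets[digit - 1] = g_add(buckets[digit - 1], *pt);
            }
        }
        // aggregation phase: sum_d d*bucket[d] via a running suffix sum,
        // so no scalar multiplications are needed
        let (mut running, mut window_sum): (G, G) = (0, 0);
        for b in buckets.iter().rev() {
            running = g_add(running, *b);
            window_sum = g_add(window_sum, running);
        }
        acc = g_add(acc, window_sum);
    }
    acc
}
```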

prior art

the zkmopro team and EluAegis were onto something with their metal MSM work. but i needed something holistic to support future platforms, and icicle’s GPU backends are closed source commercial stuff anyway.

now i’m itching to add cuda, rocm, maybe mali. who knows.

the beginning

so i’m gonna start dropping random tech thoughts here… whatever bubbles up in this weird vibe coding wasteland we’re all living in now. llms writing code, humans reviewing it, nobody quite sure who’s driving anymore. feels like we’re all just along for the ride at this point. should be fun.