I have a gaming PC. RTX 5090, Windows 11, the works. The problem is I don’t actually game that much. Most days it just sits there, an expensive space heater drawing idle power while I work on a Mac.
After building gunmetal (a Groth16 prover on Apple Metal), I wanted to add CUDA support. But developing on Windows is painful, and I needed a proper Linux build target anyway. So the obvious idea formed: run Linux as the base OS, run Windows as a VM. Boot Windows when I want to game, use Linux for development the rest of the time. With GPU passthrough, Windows gets the 5090 and Linux uses the integrated adapter. Two build servers from one machine.
Simple plan. One complication: 800 GB of storage, applications, and games. I wasn’t about to reinstall everything from scratch. I’m not that eager to waste time. I’d clone the existing Windows install and restore it inside a VM.
At the time, I had no idea what I was getting myself into.
Cloning is the easy part
Macrium Reflect X turned out to be a solid cloning tool. Image the drive, get a backup in their mrimgx format, restore it somewhere else. Straightforward.
But I needed to test the restore on Linux before wiping the machine. And Macrium Reflect is Windows-only. You can’t restore an mrimgx image on Linux. There’s no tool for it.
I dug around and found that Macrium had released specs for their file format, along with a Visual Studio C++ project you can build on Windows. That meant either installing Visual Studio and the C++ SDKs on the gaming machine, or using their pre-built img_to_vhdx demo executable.
Neither option appealed to me. So I did the reasonable thing and wrote an mrimgx parser in Rust.
One parser leads to another
The Rust parser worked great. I could read the Macrium image on my Mac and convert it to QCOW2 for QEMU. (Yes, yes, QEMU has tools to convert from VHDX too. That’s beside the point.)
I extended the parser to understand partition tables so I could open the disk properly. Write to QCOW2, no problems.
Then I tried to boot it.
Blue screen.
Windows was looking for native disk drivers that don’t exist in a virtual machine. The fix is to modify the Windows registry to disable those drivers before booting. Specifically, you edit the registry HIVE files on the NTFS volume.
But to do that, you need to read NTFS. The existing Rust crate for NTFS is read-only, and its MFT parsing didn’t go deep enough to avoid recursive indexing, so I wrote my own. Read and write.
Then I needed to modify the registry. There were existing Rust crates for reading registry HIVE files, but none of them supported writing, so I built a registry HIVE writer too. Only later did I discover that peitaosu had contributed his own HIVE editor with read/write support. Oh well.
The chain looked like this: unpack the mrimgx image, open the partition table, find the NTFS volume, mount it, locate the registry hive, modify it to disable the incompatible driver, write everything back, boot in QEMU.
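The registry step at the end of that chain is worth making concrete. A Windows service entry with `Start = 4` (SERVICE_DISABLED) is skipped at boot, which is how you silence a driver the VM can’t satisfy. Here’s a minimal sketch with the hive modeled as a plain in-memory map; the key path shape is real, but the driver name is illustrative and the actual hive is a binary format that archer parses and rewrites:

```rust
use std::collections::HashMap;

// Stand-in for the real hive writer: the registry as a map from key path
// to named DWORD values. Setting Start = 4 (SERVICE_DISABLED) on the
// incompatible storage driver is what makes the cloned image bootable.
type Hive = HashMap<String, HashMap<String, u32>>;

fn disable_driver(hive: &mut Hive, service: &str) {
    let key = format!("ControlSet001\\Services\\{service}");
    hive.entry(key).or_default().insert("Start".to_string(), 4);
}

fn main() {
    // the SYSTEM hive from the cloned NTFS volume would be parsed into this shape
    let mut system_hive: Hive = HashMap::new();
    disable_driver(&mut system_hive, "vendorraid"); // hypothetical driver name
    println!("{:?}", system_hive);
}
```

The real version does the same mutation against the binary hive structure and writes the result back through the NTFS layer.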
It worked.
The rabbit hole has no end
I should have stopped there. I had what I needed. The Windows VM booted, the gaming PC was free to become a Linux machine. Mission accomplished.
But I didn’t stop. Before I quite understood what was happening, I had added support for 31 filesystems. With most of their features. Including RAID. All in pure Rust. I called it archer.
Here’s what it reads now.
Windows/DOS: NTFS, FAT12, FAT16, FAT32, exFAT, ReFS
Linux: ext2, ext3, ext4, XFS, Btrfs, ZFS, JFS, ReiserFS, Reiser4, F2FS, SquashFS, DwarFS
macOS/Apple: APFS, HFS, HFS+
BSD: UFS, UFS2
DragonFly BSD: HAMMER, HAMMER2
Other: VxFS, bcachefs, ISO 9660, UDF
Every single one parsed from first principles in Rust. No C bindings, no FUSE, no shelling out to system tools.
“All your disks are belong to us” (iykyk)
If you can read filesystems, you need to read the containers they live in. Archer supports 12 disk image formats:
VMDK, QCOW2, VHD, VHDX, VDI, Parallels, Raw/IMG, DMG, OVA, IPSW, ISO 9660
Backup formats: Veeam (VBK/VIB), Acronis (TIB/TIBX), and Macrium Reflect X (.mrimgx).
Encryption is not optional
Real disks are encrypted. If you want to open actual disk images, encryption support isn’t optional. Archer supports:
BitLocker (with recovery key), LUKS/LUKS2, QCOW2 LUKS, DMG AES-128/AES-256, and Veeam encryption.
It also handles LVM2, Linux RAID (md), and ZFS pools for volume management. Because real-world disks aren’t just a single partition with a single filesystem. They’re layered, encrypted, striped, mirrored, and nested in ways that make you question the sanity of whoever set them up.
Obviously, it runs in the browser
I compiled the whole thing to WASM.

You can open an 880 GB disk image in under a second. Browse the filesystem, search for files, edit the Windows registry, explore partitions. All in the browser. No installation, no downloads, no plugins. Just open the page and point it at a disk image.
This works because the WASM code only reads the bytes it needs. It parses partition tables, filesystem metadata, and directory structures on demand. The 880 GB file isn’t loaded into memory. It’s accessed through range requests, reading only the sectors that matter for whatever you’re looking at right now.
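The core idea is a read-at-offset abstraction between the parsers and the storage. A sketch, with a counting stub where the browser build would issue an HTTP Range request (the trait and struct names are mine, not archer’s actual API):

```rust
use std::cell::Cell;

// Everything above the storage layer asks for byte ranges, never whole files.
trait ReadAt {
    fn read_at(&self, offset: u64, buf: &mut [u8]);
}

// Stub backend that counts how much data "opening" an image actually pulls.
// In the browser this would be a fetch with a `Range: bytes=...` header.
struct CountingBackend {
    size: u64,
    bytes_fetched: Cell<u64>,
}

impl ReadAt for CountingBackend {
    fn read_at(&self, _offset: u64, buf: &mut [u8]) {
        self.bytes_fetched.set(self.bytes_fetched.get() + buf.len() as u64);
        buf.fill(0);
    }
}

fn open_image(disk: &dyn ReadAt) {
    let mut mbr = [0u8; 512]; // partition table: one sector
    disk.read_at(0, &mut mbr);
    let mut superblock = [0u8; 4096]; // filesystem metadata: a few KB
    disk.read_at(1024 * 1024, &mut superblock);
}

fn main() {
    let disk = CountingBackend { size: 880_000_000_000, bytes_fetched: Cell::new(0) };
    open_image(&disk);
    println!("fetched {} of {} bytes", disk.bytes_fetched.get(), disk.size);
}
```

Opening the image touches kilobytes, not gigabytes; everything else is fetched lazily as you browse.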
How we got here
Every step was logical at the time. I needed to clone a disk, so I wrote a parser. The parser needed to understand partitions, so I added that. The VM wouldn’t boot, so I needed NTFS, then a registry editor. Once you have NTFS you think “ext4 isn’t that different” and then “well, ZFS would be interesting” and a few moments later you’re reading specs for filesystems you’ve never even used.
The scope is absurd. I know that. But the thing works, it’s fast, it runs everywhere, and every piece of it exists because at some point I genuinely needed it, maybe. Who cares.
Now what
I registered a domain for it. I think there’s something genuinely useful here for anyone who needs to look inside a disk image without installing platform-specific tools or booting up a specific OS. IT forensics, data recovery, migration work, or just curiosity about what’s on an old drive.
Oh, and I still haven’t wiped that gaming PC.
I needed to support a proprietary file format. There was no public documentation, no open-source parser, and the vendor had no interest in changing that. The only path forward was to open up the binary and figure out what it does.
Which is how I ended up back in a disassembler for the first time in decades.
I want to get this out of the way: proprietary file formats are almost never technically justified. They exist because someone decided that lock-in was more valuable than interoperability. The data inside is usually trivial. A header, some structured fields, maybe a compression layer. Nothing that couldn’t be a well-documented open format. Nothing that benefits from being secret.
Every proprietary format creates a little ecosystem of people reverse engineering the same thing, independently, repeatedly, across years. All that collective effort could have gone toward building something useful instead. It’s pure waste.
But here we are. The format exists, I need to read it, and nobody is going to hand me a spec. So let’s talk about tools.
Three disassemblers, three tradeoffs
I hadn’t used IDA since Ilfak Guilfanov released it as shareware in the early 90s. I was running it on DOS. That’s how long it had been. I’m not going to talk about what I was using it for, but iykyk ;) So I came into this round with fresh eyes, tried the three that people actually use for serious work, and came away with opinions.
Ghidra is free, open source, and built by the NSA. Its real strengths are multi-binary project support, where you can load an application and all its libraries at once, and a built-in decompiler that covers every architecture it supports without charging extra. It’s also fully extensible since you can read and modify the source. But the interface feels like it was designed by committee in 2005 and nobody has been allowed to touch it since. It works. It’s not pleasant. And yes, the UI is built on Java Swing.
Binary Ninja is the opposite experience. The UI is modern, responsive, and genuinely well-designed. Using it feels like someone who cares about user experience actually built a disassembler. You can orient yourself in an unfamiliar binary quickly, and the workflow just makes sense. For a lot of reverse engineering tasks, this is what I’d reach for first.
IDA Pro is the gold standard and it earns that reputation. The disassembler and decompiler produce output that is a noticeable step above the other two. The decompiled output is richer, the analysis is deeper, the type propagation is more accurate. For serious reverse engineering work, the quality of the output matters enormously, and IDA’s output is still the best I’ve seen. The user experience, on the other hand, feels like it has accumulated thirty years of interface decisions without ever rethinking any of them. You learn to live with it because the results are worth it.
Let the interns handle it
Here’s where it gets interesting.
Staring at decompiled output of a proprietary format, manually tracing data structures, labeling fields, testing hypotheses about what each byte means. It’s tedious, detail-oriented work. Exactly the kind of work I don’t want to do by hand for hours on end.
So I hooked IDA Pro up as an MCP server (by the shockingly talented Duncan Ogilvie) and pointed my unreliable interns with amnesia at the problem, as he would have said.
If you’ve used LLMs for any kind of technical work, you know the type. They’re enthusiastic, they work fast, they sometimes produce genuinely brilliant insights, and they forget everything between conversations. They’ll confidently label a field as a checksum, then in the next session ask you what that same field is. Classic intern behavior.
But for reverse engineering a file format, this turns out to be a surprisingly good fit. The work is inherently exploratory. You make hypotheses, test them, refine them. Having an agent that can read disassembly, propose structure definitions, and iterate on them faster than I can type is genuinely useful. The amnesia is annoying but manageable. You keep notes. You feed context back in. You learn to work with the limitations.
The MCP integration means the agent can actually navigate IDA’s analysis directly. It can look up cross-references, read decompiled functions, examine data segments. All the things I’d normally do by clicking around in the UI, except the agent does it programmatically and faster. My job shifts from doing the tedious work to directing the tedious work and verifying the results.
It’s not perfect. The interns still need supervision. But they’ve gotten surprisingly reliable, and they get me to 80% in a fraction of the time. The remaining 20% is the interesting part anyway.
The actual workflow
The loop looks like this: point the agent at a function that seems to handle file parsing, let it propose a data structure, validate that structure against known sample files, correct the mistakes, feed the corrections back in, and move to the next function. Repeat until you have a complete format specification.
What would have taken me days of manual analysis took an afternoon of supervised agent work. The proprietary format is no longer proprietary to me.
And honestly, that felt really good.
Most zero-knowledge bugs don’t announce themselves.
Your prover generates a proof. The verifier accepts it. Your tests pass. Everything looks correct. And it might be. Right up until someone discovers a soundness hole that lets them forge a proof for a false statement. You never saw it coming because none of your tests could have caught it. The proof was structurally broken in a way that functional testing can’t reach.
This is the fundamental problem with testing cryptographic systems: the thing you’re most afraid of, a soundness failure, is exactly the thing that’s hardest to detect. A wrong answer looks identical to a right one. Both pass the verifier. Both look fine in your CI pipeline. The difference only shows up when an adversary exploits it.
Why multi-backend makes this worse
When I started building gunmetal, I knew it wouldn’t stay Metal-only for long. CUDA comes next, then Vulkan, then whatever else makes sense. But here’s the thing: every new backend is a fresh surface for subtle bugs. Not the obvious kind that crash your program. The quiet kind that only show up on specific hardware, with specific inputs, in specific edge cases that nobody would think to test by hand.
A GPU kernel might be 10x faster than its CPU equivalent but introduce a one-in-a-million case where the result differs by a single bit. In most software, a single wrong bit is a rounding error. In cryptography, a single wrong bit can break soundness entirely. Your proof is now forgeable, and you have no idea.
So I built a verification framework for the prover itself. Three layers, each catching a different class of problem.
The foundation is Lean proofs for core arithmetic. These verify that field operations actually form a field, that curve addition is associative, that pairing equations hold. If Lean accepts the proof, the math is correct. Not probably correct. Not correct for all inputs we tested. Definitively correct, in the mathematical sense. Honestly, building a complete Lean formalization of Groth16 was a monumental undertaking.
This matters because everything else in the prover builds on these operations. If your field multiplication has a subtle bug, every layer above it (MSM, FFT, the entire Groth16 protocol) inherits that bug. Getting the foundation right isn’t optional.
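To make the first layer concrete, here’s a minimal sketch of what such statements look like in Lean 4 with mathlib. The modulus is a generic prime `p` standing in for the actual curve constants; mathlib already knows `ZMod p` is a field when `p` is prime, which is the kind of fact everything above builds on:

```lean
import Mathlib.Data.ZMod.Basic

-- multiplication in ZMod p is commutative (holds for any modulus)
example (p : ℕ) (a b : ZMod p) : a * b = b * a :=
  mul_comm a b

-- for a prime modulus, every nonzero element has a multiplicative inverse:
-- ZMod p is a field, so the group-with-zero lemma applies
example (p : ℕ) [Fact p.Prime] (a : ZMod p) (h : a ≠ 0) : a * a⁻¹ = 1 :=
  mul_inv_cancel₀ h
```

The real work is the layers on top: stating and proving that the concrete, limb-level implementations compute these abstract operations.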
Property-based testing at scale
Above the formal proofs, there are 193 proptest properties covering MSM, NTT/FFT, QAP/R1CS, and Groth16 end-to-end. The idea is simple but powerful: instead of checking five carefully chosen inputs, you verify mathematical properties across thousands of randomly generated ones.
For example, rather than testing “does this specific multiplication produce this specific result,” you test “is multiplication commutative” and “is multiplication associative” and “does multiplying by one give back the original value”, across thousands of random field elements. This is how you catch the edge cases that deterministic tests miss: the off-by-one in a field reduction, the carry propagation error that only triggers on a specific bit pattern.
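As a self-contained illustration of the pattern (gunmetal uses the proptest crate; here a small LCG generates random elements and a toy 31-bit prime field stands in for the real one, so this compiles without dependencies):

```rust
// Property-based testing sketch over the toy field F_p, p = 2^31 - 1.
// The properties are the kind the real suite checks; only the field and
// the random generator are simplified.
const P: u64 = 2_147_483_647; // Mersenne prime 2^31 - 1

fn mul(a: u64, b: u64) -> u64 { a * b % P }     // products fit in u64: a, b < 2^31
fn add(a: u64, b: u64) -> u64 { (a + b) % P }

fn main() {
    let mut state: u64 = 2024;
    let mut next = move || {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (state >> 33) % P // a pseudo-random field element
    };
    for _ in 0..10_000 {
        let (a, b, c) = (next(), next(), next());
        assert_eq!(mul(a, b), mul(b, a), "commutativity");
        assert_eq!(mul(mul(a, b), c), mul(a, mul(b, c)), "associativity");
        assert_eq!(mul(a, 1), a, "multiplicative identity");
        assert_eq!(mul(a, add(b, c)), add(mul(a, b), mul(a, c)), "distributivity");
    }
    println!("4 properties x 10000 random triples: all held");
}
```

Swap the toy field for a 254-bit one with a subtle reduction bug, and one of these assertions fires within a few thousand samples.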
Differential testing across hardware
The third layer runs the same computation on GPU and CPU and compares the results bit-for-bit. Any divergence is a bug, full stop.
This layer exists because GPU arithmetic can be subtly wrong in ways that are nearly impossible to catch otherwise. The CPU implementation serves as a reference. If the GPU disagrees with it, something is broken, even if both results look plausible on their own. If my GPU implementation had been externally audited or formally verified, I could have trusted it outright. But since I’ll be working on more backends, it’s best to have a proven source of truth (arkworks).
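The shape of the check is easy to show. In this sketch a Montgomery-form multiplication path plays the role of the GPU kernel and plain modular arithmetic plays the reference, over a toy 31-bit field instead of BN254 (in gunmetal the reference path is arkworks, not this hand-rolled one):

```rust
// Differential test: two independent implementations of field
// multiplication, compared bit-for-bit over random inputs.
const P: u64 = 2_147_483_647; // toy prime 2^31 - 1; the real field is 254-bit

// -P^{-1} mod 2^32 via Newton iteration (valid because P is odd)
fn neg_p_inv() -> u32 {
    let mut x: u32 = 1;
    for _ in 0..5 {
        x = x.wrapping_mul(2u32.wrapping_sub((P as u32).wrapping_mul(x)));
    }
    x.wrapping_neg()
}

// Montgomery reduction with R = 2^32: returns t * R^-1 mod P
fn redc(t: u64, npi: u32) -> u64 {
    let m = (t as u32).wrapping_mul(npi) as u64;
    let u = (t + m * P) >> 32;
    if u >= P { u - P } else { u }
}

// the "GPU" path: into Montgomery form, multiply, back out
fn mul_montgomery(a: u64, b: u64) -> u64 {
    let npi = neg_p_inv();
    let (a, b) = (a % P, b % P);
    let (am, bm) = ((a << 32) % P, (b << 32) % P); // a*R, b*R mod P
    redc(redc(am * bm, npi), npi)                  // (abR^2)R^-1 R^-1 = ab
}

// the reference path
fn mul_reference(a: u64, b: u64) -> u64 {
    (a % P) * (b % P) % P
}

fn main() {
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15;
    for _ in 0..100_000 {
        x = x.wrapping_mul(6364136223846793005).wrapping_add(1);
        let (a, b) = (x >> 33, x & 0xFFFF_FFFF);
        // any divergence is a bug, full stop
        assert_eq!(mul_montgomery(a, b), mul_reference(a, b));
    }
    println!("100000 random products agree bit-for-bit");
}
```

The value of the layer is exactly that the two paths share no code: a bug has to exist in both, identically, to slip through.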
Scanning for known attack patterns
The framework also scans for broken primitives, insufficient rounds, and FreeLunch attack patterns (ePrint 2024/347). If you’re not familiar with FreeLunch: certain prover optimizations create soundness holes that let an attacker forge proofs for false statements. The proof passes every functional test you could write. It completely breaks your security model. And it comes from an optimization, something that was supposed to make things faster, not less secure.
This is exactly the class of bug that you cannot find by testing outputs. You have to verify the structure of the computation itself.
The point
Implement a new backend. Run the verification suite. Pass everything. Now you have a cryptographically sound implementation. Not by hope, not by spot-checking a handful of test vectors, but by construction.
At least that was the point when I did all that.
I built a Groth16 prover that runs on Apple Metal, called it gunmetal for now.
If you work in zero-knowledge cryptography, your first reaction is probably: why? And honestly, that’s fair. ZK proofs run on servers. The entire ecosystem is built around that assumption. gnark, arkworks, rapidsnark, snarkjs, icicle. These are the standard tools, and they all target server-class hardware. Nobody proves on a laptop.
But Apple Silicon has a property that server GPUs don’t: unified memory. The CPU and GPU share the same physical memory, with no copying between them. In a prover, where you’re constantly moving large amounts of data between CPU-side computation and GPU-side computation, eliminating that copy overhead felt like it could matter more than people assumed. I wanted to find out.
It does.
The result
120ms for 134,000 constraints on an M3 Ultra. That’s 2.9x faster than the same workload running on CPU alone. Not bad.
Here’s how it works, and more importantly, why it’s fast.
Custom arithmetic, all the way down
The first decision was to build field arithmetic from scratch instead of relying on existing libraries. This sounds like unnecessary masochism, but it’s where a huge chunk of the performance comes from. The key move is keeping everything on the stack. Fixed-size arrays instead of heap-allocated big integers. When you’re doing millions of field multiplications, the difference between a stack allocation and a heap allocation adds up fast.
GLV endomorphism with zero-allocation Barrett reduction gets roughly 20x faster scalar decomposition than the standard approach. Montgomery multiplication is fully unrolled in Metal shaders, so the GPU spends its time doing math instead of managing memory.
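A flavor of what “everything on the stack” means in practice. A 256-bit field element is a fixed array of four limbs, and multi-precision addition is an explicit carry chain; no heap allocation anywhere, which also translates directly into shader code (this is a sketch of the representation, not gunmetal’s actual code):

```rust
// A 256-bit value as four little-endian u64 limbs on the stack.
type Fe = [u64; 4];

// Multi-precision addition with an explicit carry chain; returns the sum
// and whether the carry overflowed past the top limb.
fn add_wide(a: &Fe, b: &Fe) -> (Fe, bool) {
    let mut out = [0u64; 4];
    let mut carry = false;
    for i in 0..4 {
        let (s, c1) = a[i].overflowing_add(b[i]);
        let (s, c2) = s.overflowing_add(carry as u64);
        out[i] = s;
        carry = c1 || c2;
    }
    (out, carry)
}

fn main() {
    let a: Fe = [u64::MAX, 0, 0, 0];
    let b: Fe = [1, 0, 0, 0];
    let (sum, overflow) = add_wide(&a, &b);
    println!("{:?} overflow={}", sum, overflow); // carry ripples into limb 1
}
```

A modular add is this plus a conditional subtraction of the modulus; multiplication builds on the same limb-by-limb discipline, fully unrolled in the Metal version.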
Never let the hardware sit idle
This is where the real gains live. A Groth16 proof involves two expensive categories of work: QAP computation (the polynomial math) and MSM execution (the elliptic curve math). The natural instinct is to do one, then the other. But if you pipeline them, running QAP computation on one piece of hardware while MSM runs on another, you can overlap them almost entirely.
In practice, the GPU handles G1 MSMs while the CPU handles G2 MSMs in parallel. Two of the most expensive operations in the entire proving pipeline, running concurrently. Zero idle time.
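Structurally, the overlap is just two lanes over shared data. A sketch with scoped threads, where toy sums stand in for the Metal G1 dispatch and the CPU G2 path (the real code drives a Metal command queue instead of a closure):

```rust
use std::thread;

// Run the "G1" and "G2" workloads concurrently; thread::scope lets both
// lanes borrow the same witness data without copies.
fn prove_pipelined(scalars: &[u64]) -> (u64, u64) {
    thread::scope(|s| {
        // "GPU" lane: G1 MSM stand-in on its own thread
        let g1 = s.spawn(|| scalars.iter().map(|x| x * 3).sum::<u64>());
        // "CPU" lane: G2 MSM stand-in, overlapped on the current thread
        let g2: u64 = scalars.iter().map(|x| x * 7).sum();
        (g1.join().unwrap(), g2)
    })
}

fn main() {
    let witness: Vec<u64> = (1..=1_000).collect();
    let (g1, g2) = prove_pipelined(&witness);
    println!("g1 stand-in: {g1}, g2 stand-in: {g2}");
}
```

On unified memory the borrow in the sketch is literal: the GPU lane reads the same physical pages the CPU lane does.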
The FFT that actually fits the GPU
Number-theoretic transforms are another major bottleneck, and the algorithm choice here matters enormously. I went with a Stockham FFT, which avoids the bit-reversal permutations that traditional FFT approaches require. That turns out to be a big deal on a GPU, because bit-reversal creates scattered memory access patterns that destroy throughput. Stockham keeps memory access coalesced, and the result is about 10x faster than the CPU baseline.
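Here’s the shape of the algorithm, as a radix-2 Stockham NTT over the toy field F_257 (a small prime with an 8th root of unity, 64), standing in for the BN254 scalar field and the Metal kernel. The point to notice is the access pattern: each stage reads and writes sequentially into a second buffer, and the output lands in natural order with no bit-reversal pass:

```rust
const P: u64 = 257; // toy prime; 64 has multiplicative order 8 mod 257

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1;
    while e > 0 {
        if e & 1 == 1 { r = r * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    r
}

/// Radix-2 Stockham NTT: ping-pongs between two buffers instead of
/// permuting indices, which keeps memory access coalesced on a GPU.
/// `input.len()` must be a power of two and `root` must have that order.
fn stockham_ntt(input: &[u64], root: u64) -> Vec<u64> {
    let n = input.len();
    let mut x = input.to_vec();
    let mut y = vec![0u64; n];
    let (mut m, mut s) = (n, 1);
    while m > 1 {
        let mh = m / 2;
        let w_m = pow_mod(root, (n / m) as u64); // m-th root of unity
        for p in 0..mh {
            let wp = pow_mod(w_m, p as u64); // stage twiddle factor
            for q in 0..s {
                let a = x[q + s * p];
                let b = x[q + s * (p + mh)];
                y[q + s * (2 * p)] = (a + b) % P;
                y[q + s * (2 * p + 1)] = (a + P - b) % P * wp % P;
            }
        }
        std::mem::swap(&mut x, &mut y);
        m = mh;
        s *= 2;
    }
    x
}

fn main() {
    let x: Vec<u64> = (1..=8).collect();
    let out = stockham_ntt(&x, 64);
    // check against the naive O(n^2) DFT: X_k = sum_j x_j * root^(j*k)
    for k in 0..8 {
        let mut acc = 0u64;
        for j in 0..8 {
            acc = (acc + x[j] * pow_mod(64, (j * k) as u64)) % P;
        }
        assert_eq!(out[k], acc, "mismatch at index {k}");
    }
    println!("Stockham NTT matches naive DFT: {:?}", out);
}
```

In the Metal version each `q` iteration maps to a GPU thread, and the sequential reads and writes per stage are what keep the memory system fed.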
Unified memory is the whole game
This is the part most people miss when comparing Apple Silicon to discrete GPUs. On a traditional setup with an NVIDIA card, you have to copy data from CPU memory to GPU memory across the PCIe bus. You need staging buffers. You pay transfer costs every time the GPU needs new data.
On Apple Silicon, there’s none of that. The GPU reads directly from the same physical memory the CPU wrote to. No copies, no bus, no staging. The data is just there. Combine that with interleaved point layout for cache locality, and memory access patterns become essentially free performance. This architectural advantage is what makes the whole approach viable.
The MSM implementation uses the Pippenger algorithm with 13-bit windows and local threadgroup atomics. One thing I found during profiling: 128 threads per threadgroup consistently outperforms the conventional 256-thread baseline on Metal hardware. This is the kind of thing you can only discover by measuring on actual hardware. The textbook answer isn’t always right for a specific execution model.
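For readers who haven’t met Pippenger’s bucket method, here’s a toy version where the group is `(u64, +)` with wrapping arithmetic instead of BN254 G1, so “scalar times point” is plain multiplication and the result is easy to check against the naive sum. The 13-bit window matches the text; everything else (thread mapping, threadgroup atomics) is stripped away:

```rust
// Toy Pippenger MSM: split each scalar into 13-bit digits, accumulate
// points into one bucket per digit value, then collapse the buckets with
// the running-sum trick. Horner's rule combines the windows.
fn pippenger_msm(scalars: &[u64], points: &[u64]) -> u64 {
    const C: u32 = 13;
    let mask: u64 = (1 << C) - 1;
    let windows = (64 + C - 1) / C; // 5 windows cover a 64-bit scalar
    let mut acc: u64 = 0;
    for w in (0..windows).rev() {
        acc <<= C; // the "double c times" step, in additive-integer terms
        // bucket accumulation: one bucket per possible c-bit digit
        let mut buckets = vec![0u64; 1 << C];
        for (s, p) in scalars.iter().zip(points) {
            let digit = ((s >> (w * C)) & mask) as usize;
            buckets[digit] = buckets[digit].wrapping_add(*p);
        }
        // running-sum trick: sum(d * buckets[d]) in one reverse pass
        let (mut running, mut window_sum) = (0u64, 0u64);
        for d in (1..buckets.len()).rev() {
            running = running.wrapping_add(buckets[d]);
            window_sum = window_sum.wrapping_add(running);
        }
        acc = acc.wrapping_add(window_sum);
    }
    acc
}

fn main() {
    let scalars = [3u64, 5, 1 << 40];
    let points = [10u64, 100, 2];
    let naive: u64 = scalars.iter().zip(&points)
        .fold(0, |a, (s, p)| a.wrapping_add(s.wrapping_mul(*p)));
    assert_eq!(pippenger_msm(&scalars, &points), naive);
    println!("bucket method matches naive MSM: {naive}");
}
```

On the GPU, the bucket accumulation is the contended part, which is where the local threadgroup atomics and the 128-thread groups come in.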
Prior art
The zkmopro team and EluAegis did some cool work on Metal MSM. What I needed, though, was a complete proving pipeline, not just MSM, and one that’s architecturally ready to support multiple GPU backends over time.
Next up: CUDA and ROCm backends. Maybe Mali. Someday. The architecture is designed for it. First I have some other battles to overcome.
I’ve been building things for a long time without ever writing about them. For some reason I decided to write a logbook.
This isn’t going to be a polished publication with an editorial calendar. It’s a workbench log. I build something, I learn something, I write it down. Right now most of what I’m working on is cryptography engineering, zero-knowledge provers, GPU acceleration, formal verification, file systems, deduplication, LLMs. Anything, really. I go where the curiosity takes me, and my curiosity has no boundaries.
The reason I’m starting this is simple. I keep running into stuff where the existing implementations don’t go far enough, the academic papers are impenetrable, and the only way forward is to write code and see how far I can possibly take it. When I eventually figure something out, I may leave a trail. Partly for future me, partly for anyone else who’s stuck in the same hopeless wasteland.
So that’s what this is. Technical notes from the workbench. No schedule, no rules about format or length. Just whatever I think is genuinely worth logging.