Speed up scalar Sobol with closed-form evaluation#97
Conversation
|
Hi @wantonsushi. I've just given the PR a scan and it is looking good. This is great work, thank you. Very interesting to see the comparison. I'll follow up with a closer read of the code details sometime tomorrow. |
|
These changes are stellar, thank you. I would be happy to add this into the project. I've added some minor comments to the PR. There are also a few additional of files we should update:
Once we are happy with the PR, next steps will be rebase onto main, and squash the commits down to a single commit. You will need to digitally sign the commit. Info on how to digitally sign, and commit message format can be found here: |
|
Everything is looking good to me. Thank you for adding those additional changes. I think we are now ready for next steps:
Once that is done, the pipeline should allow us to merge the PR 🚀 |
The scalar Sobol path evaluates each dimension with a per-bit matrix-vector product over 16 columns. Ahmed (EGSR 2024) computes the same result more efficiently using a sequence of shift-mask-xor steps, avoiding the loop over the columns. Replace the scalar sobolReversedIndex loop with closed-form steps for dimensions 0 to 3, steps emitted by a new matrices CLI tool, and cite the technique in the comments and docs. Leave the SIMD paths unchanged, as experiments showed Ahmed to be slower since the SIMD path also avoids the loop by packing the 16 columns into a wide register. Add test to validate the new path against old classic implementation. Signed-off-by: wantonsushi <realeuanhughes@gmail.com>
c676881 to
467eaf9
Compare
|
Pushed. Should be good to go now, let me know if anything else is required. Thanks for the opportunity to contribute! Might do more in the future :) |
|
This is cool! I hadn't figured out how to extend Ahmed's construction to arbitrary matrices as you've done here. If I am reading the code right though, aside from the Pascal matrix (dimension 1) which has a compact O(log(n)) evaluation, we should expect to get roughly O(n) steps for an arbitrary matrix? So if we wanted to extend the code to 32-bit indices we would expect roughly twice as many steps? Ahmed's follow up paper on SZ sequences has the nice property that higher order terms also have a factored representation so they are all O(log(n)) to evaluate, which allows extending even to 64-bit sequences (potentially useful if you are sharing a single sequence across pixels). Also since all pairs of dimensions are themselves (0,2) sequences, you can benefit from the super fast pascal matrix evaluation for every other term. |
|
Hi Chris. Exciting to see you comment here. Is it weird to say I'm a big fan of your work at SPI? Yes, you're right: dim 1 is O(log n), whereas dims 2/3 are O(n). I verified this, for dim 2:
so roughly doubling like you said. The SZ construction seems to make sense: every dim is a Pascal matrix P(a), and the S-P-Z decomposition evaluates it as the shared Pascal plus precision-independent corrections, so all dims stay O(log n) even at 64-bit (with every other dim reducing to plain Pascal)? I guess the catch is it's a different sequence past dim 1, so for OpenQMC, this isn't a simple drop-in replacement. @joshbainbridge what do you think? I think I'll try making an SZ prototype for OpenQMC and benchmark it against the current path when I get a chance, I will follow up. Thanks for the comment. |
|
I played a bit with the code and got the same results as you for the Kuo Sobol matrices which OpenQMC is using. The SZ matrices have a slightly less random structure, so more stuff cancels out and you get only 12 steps required for 32-bit input for dimensions 2,3 (which is fairly close to what you need with the factored implementation and is easier to reason about). I'll do some tests to see if it makes any big difference in practice. One small optimization, if instead of: You do this: you can skip the final call to |
|
@fpsunflower Thanks for the suggestion. I applied it, but ran into an issue. The conjugated form measured 0.83x on GPU for me. To my understanding, right shifts contend with the logic pipe, while the original left shifts lower to multiplies on the multiply pipe. I tried right-shift-as-multiply-high, but __umulhi expands on Ada and I measured 0.59x. So I kept the old method for the GPU path. Maybe there is something I overlooked. For CPU though, works great. With I've left this as an independent commit so you and @joshbainbridge can easily verify the changes. I'll squash it to one commit if we're happy with it. |
fde20ff to
fbf7ec2
Compare
|
Interesting. That difference may depend on the GPU architecture in practice, I hadn't considered the fact that left-shifts are potentially faster. Still, its better if the I'll also point out that in OpenQMC, Also, while the 16-bit implementation is sufficient for a randomized seed (all the index bits are used for the sample index), when using the Z-sampler construction, part of the index gets used for the pixel coordinates and you need a fully 32-bit evaluation. With the optimized method used here, I don't think it should be that much slower than the 16-bit version and would scale to higher sample counts. |
Hello OpenQMC maintainers!
This PR implements the Ahmed 2024 paper mentioned in issue #67 for the scalar path of
sobolReversedIndex. I used the existing benchmark tool to check timings andgenerateto confirm the output is unchanged.Changes
include/oqmc/owen.h: the scalar path uses the closed form.src/tools/cli/matrices.cpp: derives the steps from the generator matrices and prints them, the same way it already printsdirections[].src/tests/owen.cpp:OwenTest.SobolReversedIndexchecks the result against the direction matrices for every 16-bit index and dimension.I also tried to implement a SIMD version (sobol_simd_experiment.cpp). It's not in this PR since it doesn't beat the existing SIMD, but I'm sharing it so you can reproduce the SIMD timings below.
Timings
Using
benchmark sobol samples, median of 9 runs, in microseconds:The speedup is small because generation is only part of a draw, along with scrambling and state hashing. The SIMD implementation fits one dimension per lane (four lanes). The existing SIMD path is faster because it packs all sixteen matrix columns into a wider register, so I left the SIMD paths alone.
Verification
generate soboloutput is byte-identical to before on scalar, SSE and AVXclang-formatandclang-tidyclean