Use touchstone by VisruthSK · Pull Request #352 · stan-dev/loo

VisruthSK · 2026-04-08T23:22:10Z

Use touchstone to evaluate PRs' impact on performance. Starting with a monolithic approach to testing, calling just loo().

Closes #348.

codecov-commenter · 2026-04-08T23:25:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.80%. Comparing base (60f012d) to head (c73be4d).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #352   +/-   ##
=======================================
  Coverage   92.80%   92.80%           
=======================================
  Files          31       31           
  Lines        3004     3004           
=======================================
  Hits         2788     2788           
  Misses        216      216


VisruthSK · 2026-04-09T06:05:09Z

Should lower reps to maybe 5, perhaps also use smaller data? Current ones are fully LLM generated, so maybe more careful creation could cut down on execution time while still being informative tests.

@jgabry any thoughts? The main file is touchstone/script.R. The code rn is ugly, I'll fix it tomorrow probably, but any thoughts on the actual tests? The run is taking 1hr, is that way too slow?

jgabry · 2026-04-09T19:36:54Z

If we switch to the example that @avehtari suggested, is that a bigger log lik matrix/array than what you have here or smaller?

VisruthSK · 2026-04-10T02:36:53Z

Wine one is a decent bit larger. How long do you figure is too long?

jgabry · 2026-04-10T19:14:37Z

Is there an easy way to tell it to skip the touchstone step? For important PRs it’s fine for it to be slow, even very slow, but for simple PRs that we know won’t affect runtime at all (or only trivially) it would be good to be able to skip it. I guess we could always just merge the PR without waiting for it to finish, but sometimes we’re not checking the CI results immediately, and then we’d end up just wasting a bunch of computation, which I think might prevent other things in the Stan org from running.

avehtari · 2026-04-10T20:01:23Z

Storing that log_lik_matrix and using the stored matrix for loo takes a couple seconds. I don't think that's too slow.

jgabry · 2026-04-10T20:14:17Z

I agree, that’s not too slow. Maybe I’m just confused. I thought that example was bigger than the one Visruth is currently using, and the current run takes 1hr. It runs things many times to compute the benchmarks, but running something that takes a few seconds many times should still take way less than an hour. So then why does the current example take 1hr? Sorry if I’m missing something simple here.
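The back-of-the-envelope reasoning above can be sketched as follows. The per-call time, repetition count, and number of benchmarked expressions are hypothetical placeholders, not values read from touchstone/script.R, and the sketch assumes each benchmark is timed on both the base branch and the PR branch.

```python
def estimated_benchmark_minutes(seconds_per_call, n_reps, n_benchmarks, n_branches=2):
    """Rough wall-clock estimate: each benchmark runs n_reps times on each branch."""
    total_seconds = seconds_per_call * n_reps * n_benchmarks * n_branches
    return total_seconds / 60


# Hypothetical inputs: 3 s per loo() call, 20 reps, 3 benchmarked methods.
minutes = estimated_benchmark_minutes(seconds_per_call=3, n_reps=20, n_benchmarks=3)
print(f"{minutes:.1f} minutes of pure benchmark time")
```

Under assumptions like these, the benchmark calls themselves account for minutes rather than an hour, so any remaining time would have to come from somewhere other than the timed calls.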

VisruthSK · 2026-04-10T20:17:31Z

I think the LLM code is doing the wrong thing--I'll run the wine thing and store the log lik (probably in the touchstone dir?) and just read from it and run loo. I think the LLMd code is just abysmal and so is too slow. Will try to get this in by Monday.

jgabry · 2026-04-10T21:22:24Z

> I think the LLM code is doing the wrong thing

It might be, I haven't had time to dig into it. Either way, thanks for setting this up. I took a quick look, and I do like that it's also testing the loo.function method in addition to the matrix and array methods. But yeah, let's use @avehtari's example.

avehtari · 2026-04-11T08:26:01Z

I tested the current benchmark code by running it manually on my laptop, and it takes less than 10s

VisruthSK · 2026-04-11T17:43:47Z

I looked into the logs and Aki is right--the code is fine. I think the first run was slow because it needed to install R, deps, etc. Did a rerun and it took 5 min. If the code examples look reasonable, it could be good to merge. I think the commenting only works when merged.

avehtari · 2026-04-11T17:54:37Z

I think the log_liks the current code generates are bad, there are tens of warnings. Please use the log_lik I suggested, as I know that it's based on a real model, has nice properties, and can also be used for the srs_diff_est example.

VisruthSK · 2026-04-11T18:37:00Z

Swapped to wine data, but didn't write a helper to expose the data, just accessing it through loo:::.wine_log_lik_matrix. The second test is still AI generated. Touchstone won't work since the wine data isn't on master.

jgabry · 2026-04-13T18:31:45Z

On the master branch sysdata.rda is <250 KB but now it's almost 40 MB. That's going to be an issue for CRAN I think. Is the new example really that big? If so, then we probably can't include it with the package but we could still put it in the repo and use it here for benchmarking.

VisruthSK · 2026-04-13T18:40:39Z

Maybe I did something wrong, but when I saved the loglik to RDS by itself it's 39.5 MB. Forgot CRAN had limits on that. If we put the file in the touchstone directory it should be fine, right?

Edit: putting it in the touchstone dir and excluding it from the build means we can't use it directly in an example, but maybe that's okay?

jgabry · 2026-04-13T19:29:50Z

If we want to use it in examples too then we can create it again in the example by fitting the brms model. Creating a log lik matrix of that size should be ok from CRAN's perspective, just not storing it in the package. The policy is:

> As a general rule, neither data nor documentation should exceed 5MB

They do offer exceptions sometimes (RStan is huge), but they are pretty strict with the exceptions so I don't think it's worth asking in a case like this.
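As a rough sanity check on why a stored log lik matrix can land in the tens of MB, here is a minimal sketch of the size arithmetic. The matrix dimensions are hypothetical (the thread does not state them), and the estimate ignores any compression applied when the file is written, so the on-disk size may differ.

```python
BYTES_PER_DOUBLE = 8   # R stores numeric matrices as 8-byte doubles
CRAN_LIMIT_MB = 5      # from the policy quoted above


def dense_matrix_mb(n_draws, n_obs):
    """Uncompressed size in MB (10^6 bytes) of an n_draws x n_obs double matrix."""
    return n_draws * n_obs * BYTES_PER_DOUBLE / 1e6


# Hypothetical dimensions, chosen only for illustration.
size_mb = dense_matrix_mb(n_draws=4000, n_obs=1200)
print(f"{size_mb:.1f} MB uncompressed; over the 5 MB guideline: {size_mb > CRAN_LIMIT_MB}")
```

With a few thousand posterior draws and on the order of a thousand observations, the uncompressed matrix is already several times the guideline, which fits with keeping the file in the repo for benchmarking rather than shipping it in the package.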

jgabry · 2026-04-13T19:32:31Z

And I don't think you did anything wrong necessarily. 40MB is plausible.

VisruthSK · 2026-04-13T19:47:32Z

I'll see how to make the file only available to touchstone then.

Barebones touchstone setup

ce9689c

VisruthSK mentioned this pull request Apr 8, 2026

use posterior::gpdfit and posterior::qgeneralized_pareto() #305

Open

jgabry mentioned this pull request Apr 9, 2026

Rely on posterior for pareto smooth tails #290

Open

VisruthSK added 2 commits April 8, 2026 18:14

Generated tests

faaca77

Fixed script calling

c73be4d

VisruthSK added 3 commits April 11, 2026 11:23

Use wine data

bb68267

Fixed typo

b7c7719

Renamed file [no ci]

e61f350
