Skip to content

Use touchstone#352

Draft
VisruthSK wants to merge 6 commits intomasterfrom
use-touchstone
Draft

Use touchstone#352
VisruthSK wants to merge 6 commits intomasterfrom
use-touchstone

Conversation

@VisruthSK
Copy link
Copy Markdown
Member

@VisruthSK VisruthSK commented Apr 8, 2026

Use touchstone to evaluate PRs' impact on performance. Starting with a monolithic approach to testing, calling just loo().

Closes #348.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.80%. Comparing base (60f012d) to head (c73be4d).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #352   +/-   ##
=======================================
  Coverage   92.80%   92.80%           
=======================================
  Files          31       31           
  Lines        3004     3004           
=======================================
  Hits         2788     2788           
  Misses        216      216           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@VisruthSK
Copy link
Copy Markdown
Member Author

VisruthSK commented Apr 9, 2026

Should lower reps to maybe 5, perhapsalso use smaller data? Current ones are fully LLM generated so maybe more careful creation could cut down on execution time while still being informative tests.

@jgabry any thoughts? The main file is touchstone/script.R The code rn is ugly, I'll fix it tomorrow probably, but any thoughts on the actual tests? The run is taking 1hr, is that way too slow?

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 9, 2026

If we switch to the example that @avehtari suggested, is that a bigger log lik matrix/array than what you have here or smaller?

@VisruthSK
Copy link
Copy Markdown
Member Author

VisruthSK commented Apr 10, 2026

Wine one is a decent bit larger. How long do you figure is too long?

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 10, 2026

Is there an easy way to tell it to skip the touchdown step? For important PRs it’s fine for it to be slow, even very slow, but for simple PRs that we know won’t affect runtime at all (or trivially) it would be good to be able to skip it. I guess we could always just merge the PR without waiting for it to finish, but sometimes we’re not checking the CI results immediately and then we’d end up just wasting a bunch of computation which I think might prevent other things in the Stan org from running.

@avehtari
Copy link
Copy Markdown
Member

Storing that log_lik_matrix and using the stored matrix for loo takes a couple seconds. I don't think that's too slow.

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 10, 2026

I agree, that’s not too slow. Maybe I’m just confused. I thought that example was bigger than the one Visruth is currently using, and the current run takes 1hr. It runs things many times to compute the benchmarks, but running something that takes a few seconds many times should still take way less than an hour. So then why does the current example take 1hr? Sorry if I’m missing something simple here.

@VisruthSK
Copy link
Copy Markdown
Member Author

I think the LLM code is doing the wrong thing--I'll run the wine thing and store the log lik (probably in the touchstone dir?) and just read from it and run loo. I think the LLMd code is just abysmal and so is too slow. Will try to get this in by Monday.

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 10, 2026

I think the LLM code is doing the wrong thing

It might be, I haven't had time to dig into it. Either way, thanks for setting this up. I took a quick look, and I do like that it's also testing the loo.function method in addition to the matrix and array methods. But yeah, let's use @avehtari's example.

@avehtari
Copy link
Copy Markdown
Member

I tested the current benchmark code by running it manually on my laptop, and it takes less than 10s

@VisruthSK
Copy link
Copy Markdown
Member Author

VisruthSK commented Apr 11, 2026

I looked into the logs and Aki is right--the code is fine. I think the first run was slow because it needed to install R, deps, etc. Did a rerun and it took 5 min . If the code examples look reasonable could be good to merge. I think the commenting only works when merged.

@avehtari
Copy link
Copy Markdown
Member

I think the log_liks the current code generates are bad, there are tens of warnings. Please use the log_lik I suggested, as I know that it's based on real model, has nice properties, and can be used also for srs_diff_est example

@VisruthSK
Copy link
Copy Markdown
Member Author

Swapped to wine data, but didn't write a helper to expose the data, just accessing it through loo:::.wine_log_lik_matrix. The second test is still AI generated. Touchstone won't work since the wine data isn't on master.

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 13, 2026

On the master branch sysdata.R is <250 KB but now it's almost 40 MB. That's going to be an issue for CRAN I think. Is the new example really that big? If so, then we probably can't include it with the package but we could still put it in the repo and use it here for benchmarking.

@VisruthSK
Copy link
Copy Markdown
Member Author

VisruthSK commented Apr 13, 2026

Maybe I did something wrong but when I saved the loglik to RDS by itself its 39.5 MB. Forgot CRAN had limits on that. If we put the file in the touchstone directory it should be fine right?

Edit: putting in touchstone dir and excluding from build means we can't uses it directly in example, but maybe that's okay?

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 13, 2026

If we want to use it in examples too then we can create it again in the example by fitting the brms model. Creating a log lik matrix of that size should be ok from CRAN's perspective, just not storing it in the package. The policy is:

As a general rule, neither data nor documentation should exceed 5MB

They do offer exceptions sometimes (RStan is huge), but they are pretty strict with the exceptions so I don't think it's worth asking in a case like this.

@jgabry
Copy link
Copy Markdown
Member

jgabry commented Apr 13, 2026

And I don't think you did anything wrong necessarily. 40MB is plausible.

@VisruthSK
Copy link
Copy Markdown
Member Author

I'll see how to make the file only available to touchstone then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance benchmarks for PRs?

4 participants