feat: add PinSource data source for pins boards#246
Conversation
Add PinSource R6 class that reads data from a pins board into DuckDB. Supports lazy file reading (parquet, csv, json via DuckDB native readers) and in-memory path (rds, arrow via pin_read + dbWriteTable). Pin metadata (title, description, tags) auto-populates data_description when not provided by the user.
Add PinSource class that reads data from a pins board into DuckDB. Supports lazy file reading (parquet, csv, json) and in-memory path for other pin types. Pin metadata auto-populates data_description. Pins is an optional dependency (pip install querychat[pins]).
- Use Any type for DuckDB type parameter instead of unavailable DuckDBPyType - Refactor _make_column_meta to reduce return statements - Fix pandas import alias and try-except-pass lint issues - Apply ruff formatting
R: 39 tests covering lazy path (parquet, csv, json), in-memory path (rds), error handling, schema, table_name, get_data_description, and DuckDB security lockdown. Python: 19 tests covering lazy path, schema generation, metadata, table naming, and security lockdown.
Combine board creation, pin_write, and PinSource construction into a single local_pin_source() helper. Suppress pin_write messages to keep test output clean.
- Export PinSource in NAMESPACE and generate man page - Track auto-derived data_description so it refreshes when data_source changes (prevents stale pin metadata in system prompt) - Sanitize pin names into valid SQL identifiers when used as table names - Fix overly broad exception type in Python security test
- Clarify R docs: non-file pins must produce a data frame, not "Arrow, etc." - Add QueryChat integration tests verifying auto-fill from pin metadata, explicit data_description wins, and auto-description clears on source change
- Support multi-file pins by passing all paths from pin_download() to DuckDB file readers instead of only the first file - Add pandas to the Python pins optional-dependency extra since PinSource always returns pandas DataFrames
- Error when pin_download() returns != 1 file; pin_write() always produces single-file pins, and multi-file pins from pin_upload() are typically heterogeneous and not safely unionable. - Wrap DuckDB connection setup in cleanup guards (on.exit in R, try/except in Python) so the connection is always closed on failure. - Regenerate PinSource.Rd to remove stale Arrow reference.
Wrap the query as a subquery before appending LIMIT 1, matching the pattern used by PolarsLazySource and IbisSource. The previous approach broke valid SQL statements with trailing semicolons.
Clean up the previous data source when QueryChat.data_source is reassigned, preventing DuckDB connection leaks in long-lived apps. Add tests for multi-file pin rejection (both R and Python) and trailing-semicolon handling in PinSource.test_query().
- Replace bare on.exit() disarm with a con_owned flag so other exit
handlers in the same frame aren't cleared.
- Change R sanitize_table_name() to strip all trailing underscores
(sub("_+$",...)) matching Python's rstrip("_") behavior, so the same
pin name produces the same table name in both languages.
…implifications - Add get_data_description() to base DataSource class (returns "" by default) in both R and Python, matching get_semantic_views_description() pattern. Remove isinstance(PinSource) checks from QueryChat — any DataSource subclass can now provide a description polymorphically. - Remove unused .board and .pin_name private fields from R PinSource. - Simplify Python PinSource in-memory path: use a staging virtual table name for registration instead of the create-tmp/unregister/rename dance (4 ops → 3, no name collision risk). - Refactor Python tests to use pytest fixtures with try/finally for cleanup, preventing DuckDB connection leaks on test failure.
Extract duckdb_column_meta(), duckdb_column_stats(), and duckdb_get_schema() as module-level functions in _datasource.py. Both DataFrameSource and PinSource now call the same code path for schema generation. This replaces DataFrameSource's narwhals-based type introspection with DuckDB's native type metadata (which is more accurate — e.g. Python date objects are correctly typed as DATE instead of Object/TEXT). Also fixes a pre-existing bug where narwhals Datetime was mapped to "TIME" instead of "TIMESTAMP". PinSource drops ~90 lines of reimplemented schema logic.
Add assertions for table_name non-null where it's guaranteed by control flow, and guard _auto_fill_data_description against None data_source.
cpsievert
left a comment
There was a problem hiding this comment.
Thanks, this is a nice addition! Here are some initial thoughts from scanning the Python logic. I haven't looked yet at the R changes.
Moves the inline try/import/isinstance board detection from _querychat_base.py into a reusable TypeGuard function co-located with the rest of the pins-specific logic.
- Add BaseBoard to data_source type unions in __init__, data_source setter, and normalize_data_source() (behind TYPE_CHECKING guard) - Replace inline try/import/isinstance board detection with the is_pins_board() TypeGuard from _pin_source.py - Replace _data_description_auto bool with _data_description_mode Literal["supplied", "inferred", "empty"] for clarity
Users can pass a pins board directly to QueryChat(), so direct PinSource construction is uncommon. Available as querychat.types.PinSource for users who need it.
Adds a test that explicitly confirms a user-provided data_description is preserved (not overwritten by pin metadata) when the data source is replaced via the setter.
Add PinSource to pkgdown reference index and fix nanoparquet dependency issue. pins soft-depends on nanoparquet for parquet I/O, which wasn't available on CI. Default test pin type changed from parquet to csv since most tests don't need a specific format; parquet-specific tests now explicitly set type and skip when nanoparquet is missing.
|
Have you considered switching
Practically it'd mean |
Thanks for catching this; I meant to call it out in the description. I was hoping to get a vibe check from about whether that was a good idea or not. Seems like it would be! |
DuckDB's read_json_auto() requires the json extension, which isn't bundled and can't be auto-installed in all environments. Explicitly install and load it before reading json pins in both R and Python.
PinSource now returns nw.DataFrame, matching the pattern used by SQLAlchemySource. Uses polars (.pl()) when available for faster DuckDB result conversion, falls back to pandas (.df()) otherwise. Both are wrapped via nw.from_native(), giving consumers the choice of .to_native(), .to_pandas(), or .to_polars() on results. Drops the explicit pandas dependency from the pins extra since pins itself hard-depends on pandas.
Arrow pins can't be read natively by DuckDB, but when polars is installed, PinSource now uses pl.read_ipc() to read arrow pin files directly — skipping the pin_read() → pandas deserialization overhead. Falls back to the existing pin_read() path when polars isn't available.
Okay, done in ad18fa7 and def2051. PinSource now returns def _convert_result(result: duckdb.DuckDBPyConnection) -> nw.DataFrame:
if _has_polars():
return nw.from_native(result.pl())
return nw.from_native(result.df())Polars gives you the fast I also added arrow/IPC pin support: when polars is available, PinSource uses |
…_mode - Extract DuckDB security lockdown SQL into shared duckdb_lock_down() helper in both R (utils-duckdb.R) and Python (_datasource.py), replacing inline duplicates in DataFrameSource and PinSource. - Rename R .data_description_auto (boolean) to .data_description_mode (3-state: "supplied"/"inferred"/"empty") for consistency with Python.
Ruff S101 disallows assert in non-test code.
There was a problem hiding this comment.
Pull request overview
Adds a new PinSource data source for both the R and Python querychat packages to query datasets stored on pins boards via DuckDB (direct file reads where possible, otherwise in-memory registration), with automatic prompt metadata and post-load DuckDB security lockdown.
Changes:
- Introduces
PinSourceimplementations (R + Python) with parquet/csv/json “lazy” loading and other formats viapin_read(). - Adds automatic
data_descriptioninference from data source metadata and cleans up old data sources when replaced. - Refactors DuckDB schema/stat helpers (Python) and centralizes DuckDB lockdown logic (R + Python).
Reviewed changes
Copilot reviewed 16 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Adds pins as an optional extra dependency for Python. |
| pkg-r/tests/testthat/test-PinSource.R | Adds comprehensive R tests for PinSource, security, and QueryChat integration. |
| pkg-r/R/utils-duckdb.R | Introduces shared duckdb_lock_down() helper. |
| pkg-r/R/utils-check.R | Adds sanitize_table_name() helper for pin/table name normalization. |
| pkg-r/R/QueryChat.R | Adds data description inference and pins-board normalization support; updates cleanup behavior. |
| pkg-r/R/PinSource.R | New R PinSource backed by pins + DuckDB with metadata-derived descriptions. |
| pkg-r/R/DataSource.R | Adds get_data_description() default method to the DataSource interface. |
| pkg-r/R/DataFrameSource.R | Uses shared duckdb_lock_down() helper. |
| pkg-r/pkgdown/_pkgdown.yml | Updates reference index to include all *Source classes automatically. |
| pkg-r/NAMESPACE | Exports PinSource. |
| pkg-r/man/TblSqlSource.Rd | Documents newly inherited get_data_description() method. |
| pkg-r/man/PinSource.Rd | Adds generated documentation for PinSource. |
| pkg-r/man/DBISource.Rd | Documents newly inherited get_data_description() method. |
| pkg-r/man/DataSource.Rd | Documents new get_data_description() method. |
| pkg-r/man/DataFrameSource.Rd | Documents newly inherited get_data_description() method. |
| pkg-r/DESCRIPTION | Adds pins (and nanoparquet) to Suggests for R. |
| pkg-py/tests/test_pin_source.py | Adds Python tests for PinSource, security, and QueryChat integration. |
| pkg-py/src/querychat/types/init.py | Exports PinSource from the public types module. |
| pkg-py/src/querychat/_querychat_base.py | Adds pins-board support, description inference, and old-source cleanup behavior. |
| pkg-py/src/querychat/_pin_source.py | New Python PinSource backed by pins + DuckDB with metadata-derived descriptions. |
| pkg-py/src/querychat/_datasource.py | Extracts shared DuckDB schema/stats helpers and adds get_data_description() default. |
Files not reviewed (5)
- pkg-r/man/DBISource.Rd: Generated file
- pkg-r/man/DataFrameSource.Rd: Generated file
- pkg-r/man/DataSource.Rd: Generated file
- pkg-r/man/PinSource.Rd: Generated file
- pkg-r/man/TblSqlSource.Rd: Generated file
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return nw.from_native(result.df()) | ||
|
|
||
|
|
||
| class PinSource(DataSource[nw.DataFrame]): |
There was a problem hiding this comment.
I think what we have here is solid, but I also just want to point out that we could probably also have a PinLazySource(DataSource[nw.LazyFrame]) that skips the whole download+import to DuckDB step.
Now that I'm saying this out loud though, maybe we could just document how to feed a pins URL into a pl.LazyFrame() and lean into the existing PolarsLazySource class?
There was a problem hiding this comment.
Good idea. I added documentation for this pattern in e420c94 — both the data-sources vignettes (R and Python) and the PinSource class docs now have a "Lazy queries with pins" section showing how to use pin_download() + tbl_sql (R) or pl.scan_parquet() + PolarsLazySource (Python) for large parquet pins.
The docs note that pin_download() still fetches files to a local cache, so "lazy" here means deferred query execution, not deferred data access. They also call out that the lazy path skips the DuckDB security lockdown that PinSource applies.
A proper PinLazySource or upstream pin_url() could be interesting follow-ups if there's demand.
cpsievert
left a comment
There was a problem hiding this comment.
LGTM once outstanding comments are addressed
When data_source is a pins board and table_name is omitted, fail fast with a clear error instead of letting missing_arg() propagate to PinSource and produce a confusing downstream error.
The data_source setter called old_source$cleanup() unconditionally after normalization. Since normalize_data_source() passes DataSource objects through as-is, reassigning the same instance would close the connection on the still-active object.
Same fix as the R side — normalize_data_source() passes DataSource objects through as-is, so reassigning the same instance would call cleanup() on the still-active object.
Documents PinSource usage in both R and Python data sources vignettes, including the lazy query pattern (tbl_sql for R, Polars LazyFrame for Python) for large parquet pins. Also adds the same guidance to the PinSource class-level docs.
Summary
Adds a new
PinSourcedata source to both the R and Python querychat packages, letting users chat with data stored as pins. Users can pass aPinSourcedirectly or hand a pins board toQueryChat/querychat_app()and letnormalize_data_source()create one automatically.Key design decisions:
pin_download()+CREATE TABLE AS SELECT * FROM read_parquet(...), skipping R/Python deserialization entirely. For other pin types (rds, arrow, etc.), falls back topin_read()+ in-memory registration.enable_external_access = false,autoinstall_known_extensions = false,autoload_known_extensions = false— prevents LLM-generated SQL from accessing the filesystem.data_descriptionfrom pin metadata (title, description, tags) when the user hasn't provided their own.normalize_data_source()now accepts a pins board directly asdata_source, creating aPinSourceautomatically.QueryChatinferstable_namefromDataSource.table_namewhen not provided, andPinSourcesanitizes pin names (e.g."alex.chisholm/posit_ai_interest"→"alex_chisholm_posit_ai_interest").duckdb_column_meta(),duckdb_column_stats(), andduckdb_get_schema()as shared functions, eliminating ~90 lines of duplication betweenDataFrameSourceandPinSource.Additional fixes included:
cleanup()) when replaced via thedata_sourcesetter, preventing DuckDB connection leaks in long-lived apps.pinsis an optional dependency in both R (Suggests) and Python ([project.optional-dependencies]).Verification
R
Python