Skip to content

feat: add PinSource data source for pins boards#246

Merged
gadenbuie merged 34 commits into
mainfrom
feat/pins-source
Jun 12, 2026
Merged

feat: add PinSource data source for pins boards#246
gadenbuie merged 34 commits into
mainfrom
feat/pins-source

Conversation

@gadenbuie

Copy link
Copy Markdown
Contributor

Summary

Adds a new PinSource data source to both the R and Python querychat packages, letting users chat with data stored as pins. Users can pass a PinSource directly or hand a pins board to QueryChat/querychat_app() and let normalize_data_source() create one automatically.

Key design decisions:

  • Two loading paths: For parquet, CSV, and JSON pins, DuckDB reads the file directly via pin_download() + CREATE TABLE AS SELECT * FROM read_parquet(...), skipping R/Python deserialization entirely. For other pin types (rds, arrow, etc.), falls back to pin_read() + in-memory registration.
  • DuckDB security lockdown after data loading: enable_external_access = false, autoinstall_known_extensions = false, autoload_known_extensions = false — prevents LLM-generated SQL from accessing the filesystem.
  • Auto-populates data_description from pin metadata (title, description, tags) when the user hasn't provided their own.
  • Pins board shorthand: normalize_data_source() now accepts a pins board directly as data_source, creating a PinSource automatically.
  • Table name inference: QueryChat infers table_name from DataSource.table_name when not provided, and PinSource sanitizes pin names (e.g. "alex.chisholm/posit_ai_interest""alex_chisholm_posit_ai_interest").
  • Shared DuckDB helpers (Python): Extracted duckdb_column_meta(), duckdb_column_stats(), and duckdb_get_schema() as shared functions, eliminating ~90 lines of duplication between DataFrameSource and PinSource.

Additional fixes included:

  • Old data source is now cleaned up (cleanup()) when replaced via the data_source setter, preventing DuckDB connection leaks in long-lived apps.
  • Multi-file pins are rejected with a clear error message.
  • pins is an optional dependency in both R (Suggests) and Python ([project.optional-dependencies]).

Verification

R

library(pins)
library(querychat)

board <- board_folder(withr::local_tempdir())
pin_write(board, mtcars, "mtcars", title = "Motor Trend Car Data")

# Direct PinSource usage
ps <- PinSource$new(board, "mtcars")
ps$execute_query("SELECT mpg, cyl FROM mtcars LIMIT 5")
ps$get_data_description()  # "Motor Trend Car Data"

# Shorthand: pass board directly
app <- querychat_app(board, "mtcars", greeting = "Ask me about cars!")
shiny::runApp(app)

Python

import pins
from querychat import PinSource
import pandas as pd

board = pins.board_folder("/tmp/test-board")
board.pin_write(pd.DataFrame({"x": [1, 2], "y": [3, 4]}), "mydata", title="Test Data", type="parquet")

ps = PinSource(board, "mydata")
ps.execute_query("SELECT * FROM mydata")
ps.get_data_description()  # "Test Data"

gadenbuie and others added 16 commits June 11, 2026 15:31
Add PinSource R6 class that reads data from a pins board into DuckDB.
Supports lazy file reading (parquet, csv, json via DuckDB native readers)
and in-memory path (rds, arrow via pin_read + dbWriteTable). Pin metadata
(title, description, tags) auto-populates data_description when not
provided by the user.
Add PinSource class that reads data from a pins board into DuckDB.
Supports lazy file reading (parquet, csv, json) and in-memory path
for other pin types. Pin metadata auto-populates data_description.
Pins is an optional dependency (pip install querychat[pins]).
- Use Any type for DuckDB type parameter instead of unavailable DuckDBPyType
- Refactor _make_column_meta to reduce return statements
- Fix pandas import alias and try-except-pass lint issues
- Apply ruff formatting
R: 39 tests covering lazy path (parquet, csv, json), in-memory path
(rds), error handling, schema, table_name, get_data_description, and
DuckDB security lockdown.

Python: 19 tests covering lazy path, schema generation, metadata,
table naming, and security lockdown.
Combine board creation, pin_write, and PinSource construction into a
single local_pin_source() helper. Suppress pin_write messages to keep
test output clean.
- Export PinSource in NAMESPACE and generate man page
- Track auto-derived data_description so it refreshes when data_source
  changes (prevents stale pin metadata in system prompt)
- Sanitize pin names into valid SQL identifiers when used as table names
- Fix overly broad exception type in Python security test
- Clarify R docs: non-file pins must produce a data frame, not "Arrow, etc."
- Add QueryChat integration tests verifying auto-fill from pin metadata,
  explicit data_description wins, and auto-description clears on source change
- Support multi-file pins by passing all paths from pin_download() to
  DuckDB file readers instead of only the first file
- Add pandas to the Python pins optional-dependency extra since PinSource
  always returns pandas DataFrames
- Error when pin_download() returns != 1 file; pin_write() always
  produces single-file pins, and multi-file pins from pin_upload()
  are typically heterogeneous and not safely unionable.
- Wrap DuckDB connection setup in cleanup guards (on.exit in R,
  try/except in Python) so the connection is always closed on failure.
- Regenerate PinSource.Rd to remove stale Arrow reference.
Wrap the query as a subquery before appending LIMIT 1, matching the
pattern used by PolarsLazySource and IbisSource. The previous approach
broke valid SQL statements with trailing semicolons.
Clean up the previous data source when QueryChat.data_source is
reassigned, preventing DuckDB connection leaks in long-lived apps.

Add tests for multi-file pin rejection (both R and Python) and
trailing-semicolon handling in PinSource.test_query().
- Replace bare on.exit() disarm with a con_owned flag so other exit
  handlers in the same frame aren't cleared.

- Change R sanitize_table_name() to strip all trailing underscores
  (sub("_+$",...)) matching Python's rstrip("_") behavior, so the same
  pin name produces the same table name in both languages.
…implifications

- Add get_data_description() to base DataSource class (returns "" by
  default) in both R and Python, matching get_semantic_views_description()
  pattern. Remove isinstance(PinSource) checks from QueryChat — any
  DataSource subclass can now provide a description polymorphically.

- Remove unused .board and .pin_name private fields from R PinSource.

- Simplify Python PinSource in-memory path: use a staging virtual table
  name for registration instead of the create-tmp/unregister/rename
  dance (4 ops → 3, no name collision risk).

- Refactor Python tests to use pytest fixtures with try/finally for
  cleanup, preventing DuckDB connection leaks on test failure.
Extract duckdb_column_meta(), duckdb_column_stats(), and
duckdb_get_schema() as module-level functions in _datasource.py. Both
DataFrameSource and PinSource now call the same code path for schema
generation.

This replaces DataFrameSource's narwhals-based type introspection with
DuckDB's native type metadata (which is more accurate — e.g. Python
date objects are correctly typed as DATE instead of Object/TEXT). Also
fixes a pre-existing bug where narwhals Datetime was mapped to "TIME"
instead of "TIMESTAMP".

PinSource drops ~90 lines of reimplemented schema logic.
Add assertions for table_name non-null where it's guaranteed by control
flow, and guard _auto_fill_data_description against None data_source.
@gadenbuie gadenbuie marked this pull request as ready for review June 11, 2026 21:51
@gadenbuie gadenbuie requested a review from cpsievert June 11, 2026 21:51

@cpsievert cpsievert left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a nice addition! Here are some initial thoughts from scanning the Python logic. I haven't looked yet at the R changes.

Comment thread pkg-py/src/querychat/_querychat_base.py Outdated
Comment thread pkg-py/src/querychat/_querychat_base.py Outdated
Comment thread pkg-py/src/querychat/_querychat_base.py Outdated
Comment thread pkg-py/src/querychat/_querychat_base.py
Comment thread pkg-py/src/querychat/_querychat_base.py Outdated
Comment thread pkg-py/src/querychat/__init__.py Outdated
Moves the inline try/import/isinstance board detection from
_querychat_base.py into a reusable TypeGuard function co-located
with the rest of the pins-specific logic.
- Add BaseBoard to data_source type unions in __init__, data_source
  setter, and normalize_data_source() (behind TYPE_CHECKING guard)
- Replace inline try/import/isinstance board detection with the
  is_pins_board() TypeGuard from _pin_source.py
- Replace _data_description_auto bool with _data_description_mode
  Literal["supplied", "inferred", "empty"] for clarity
Users can pass a pins board directly to QueryChat(), so direct
PinSource construction is uncommon. Available as
querychat.types.PinSource for users who need it.
Adds a test that explicitly confirms a user-provided data_description
is preserved (not overwritten by pin metadata) when the data source
is replaced via the setter.
Add PinSource to pkgdown reference index and fix nanoparquet dependency
issue. pins soft-depends on nanoparquet for parquet I/O, which wasn't
available on CI. Default test pin type changed from parquet to csv since
most tests don't need a specific format; parquet-specific tests now
explicitly set type and skip when nanoparquet is missing.
@cpsievert

Copy link
Copy Markdown
Contributor

Have you considered switching PinSource to return narwhals DataFrames (backed by polars) instead of pandas?

  • DuckDB's .pl() skips the Arrow→pandas conversion that .df() does, so every query gets faster for free. Wrapping the result in nw.from_native() gives consumers a uniform type.
  • SQLAlchemySource already sets this precedent — it's DataSource[nw.DataFrame] because, like PinSource, it doesn't have a user-provided input format to mirror back.
  • With polars available, we could also add arrow/feather to the direct-read set via pl.read_ipc() + DuckDB register. That would put all four common tabular pin types on the fast path instead of falling through to pin_read().

Practically it'd mean PinSource(DataSource[nw.DataFrame]), .pl()nw.from_native() instead of .df(), and swapping the pins extra to ["pins>=0.9.1", "polars", "pyarrow"].

@gadenbuie

Copy link
Copy Markdown
Contributor Author

Have you considered switching PinSource to return narwhals DataFrames (backed by polars) instead of pandas?

Thanks for catching this; I meant to call it out in the description. I was hoping to get a vibe check from about whether that was a good idea or not. Seems like it would be!

DuckDB's read_json_auto() requires the json extension, which isn't
bundled and can't be auto-installed in all environments. Explicitly
install and load it before reading json pins in both R and Python.
Comment thread pkg-py/src/querychat/_pin_source.py Outdated
Comment thread pkg-py/src/querychat/_pin_source.py
Comment thread pkg-r/R/QueryChat.R Outdated
PinSource now returns nw.DataFrame, matching the pattern used by
SQLAlchemySource. Uses polars (.pl()) when available for faster
DuckDB result conversion, falls back to pandas (.df()) otherwise.
Both are wrapped via nw.from_native(), giving consumers the choice
of .to_native(), .to_pandas(), or .to_polars() on results.

Drops the explicit pandas dependency from the pins extra since pins
itself hard-depends on pandas.
Arrow pins can't be read natively by DuckDB, but when polars is
installed, PinSource now uses pl.read_ipc() to read arrow pin files
directly — skipping the pin_read() → pandas deserialization overhead.
Falls back to the existing pin_read() path when polars isn't available.
@gadenbuie

Copy link
Copy Markdown
Contributor Author

Have you considered switching PinSource to return narwhals DataFrames (backed by polars) instead of pandas?

Okay, done in ad18fa7 and def2051.

PinSource now returns nw.DataFrame, consistent with SQLAlchemySource. The conversion uses a polars-if-available pattern (same as read_sql() in _df_compat.py):

def _convert_result(result: duckdb.DuckDBPyConnection) -> nw.DataFrame:
    if _has_polars():
        return nw.from_native(result.pl())
    return nw.from_native(result.df())

Polars gives you the fast .pl() path but isn't required — pandas fallback works fine. I dropped the explicit pandas from the pins extra since pins itself hard-depends on it.

I also added arrow/IPC pin support: when polars is available, PinSource uses pl.read_ipc() to read arrow pins directly (DuckDB has no native IPC reader), falling back to pin_read() otherwise. So now the four common tabular pin types (parquet, csv, json, arrow) all have a fast path when polars is installed.

gadenbuie and others added 4 commits June 12, 2026 11:13
…_mode

- Extract DuckDB security lockdown SQL into shared duckdb_lock_down()
  helper in both R (utils-duckdb.R) and Python (_datasource.py),
  replacing inline duplicates in DataFrameSource and PinSource.
- Rename R .data_description_auto (boolean) to .data_description_mode
  (3-state: "supplied"/"inferred"/"empty") for consistency with Python.
Ruff S101 disallows assert in non-test code.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new PinSource data source for both the R and Python querychat packages to query datasets stored on pins boards via DuckDB (direct file reads where possible, otherwise in-memory registration), with automatic prompt metadata and post-load DuckDB security lockdown.

Changes:

  • Introduces PinSource implementations (R + Python) with parquet/csv/json “lazy” loading and other formats via pin_read().
  • Adds automatic data_description inference from data source metadata and cleans up old data sources when replaced.
  • Refactors DuckDB schema/stat helpers (Python) and centralizes DuckDB lockdown logic (R + Python).

Reviewed changes

Copilot reviewed 16 out of 21 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pyproject.toml Adds pins as an optional extra dependency for Python.
pkg-r/tests/testthat/test-PinSource.R Adds comprehensive R tests for PinSource, security, and QueryChat integration.
pkg-r/R/utils-duckdb.R Introduces shared duckdb_lock_down() helper.
pkg-r/R/utils-check.R Adds sanitize_table_name() helper for pin/table name normalization.
pkg-r/R/QueryChat.R Adds data description inference and pins-board normalization support; updates cleanup behavior.
pkg-r/R/PinSource.R New R PinSource backed by pins + DuckDB with metadata-derived descriptions.
pkg-r/R/DataSource.R Adds get_data_description() default method to the DataSource interface.
pkg-r/R/DataFrameSource.R Uses shared duckdb_lock_down() helper.
pkg-r/pkgdown/_pkgdown.yml Updates reference index to include all *Source classes automatically.
pkg-r/NAMESPACE Exports PinSource.
pkg-r/man/TblSqlSource.Rd Documents newly inherited get_data_description() method.
pkg-r/man/PinSource.Rd Adds generated documentation for PinSource.
pkg-r/man/DBISource.Rd Documents newly inherited get_data_description() method.
pkg-r/man/DataSource.Rd Documents new get_data_description() method.
pkg-r/man/DataFrameSource.Rd Documents newly inherited get_data_description() method.
pkg-r/DESCRIPTION Adds pins (and nanoparquet) to Suggests for R.
pkg-py/tests/test_pin_source.py Adds Python tests for PinSource, security, and QueryChat integration.
pkg-py/src/querychat/types/init.py Exports PinSource from the public types module.
pkg-py/src/querychat/_querychat_base.py Adds pins-board support, description inference, and old-source cleanup behavior.
pkg-py/src/querychat/_pin_source.py New Python PinSource backed by pins + DuckDB with metadata-derived descriptions.
pkg-py/src/querychat/_datasource.py Extracts shared DuckDB schema/stats helpers and adds get_data_description() default.
Files not reviewed (5)
  • pkg-r/man/DBISource.Rd: Generated file
  • pkg-r/man/DataFrameSource.Rd: Generated file
  • pkg-r/man/DataSource.Rd: Generated file
  • pkg-r/man/PinSource.Rd: Generated file
  • pkg-r/man/TblSqlSource.Rd: Generated file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg-r/R/QueryChat.R
Comment thread pkg-r/R/QueryChat.R
Comment thread pkg-py/src/querychat/_querychat_base.py
return nw.from_native(result.df())


class PinSource(DataSource[nw.DataFrame]):

@cpsievert cpsievert Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we have here is solid, but I also just want to point out that we could probably also have a PinLazySource(DataSource[nw.LazyFrame]) that skips the whole download+import to DuckDB step.

Now that I'm saying this out loud though, maybe we could just document how to feed a pins URL into a pl.LazyFrame() and lean into the existing PolarsLazySource class?

@gadenbuie gadenbuie Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I added documentation for this pattern in e420c94 — both the data-sources vignettes (R and Python) and the PinSource class docs now have a "Lazy queries with pins" section showing how to use pin_download() + tbl_sql (R) or pl.scan_parquet() + PolarsLazySource (Python) for large parquet pins.

The docs note that pin_download() still fetches files to a local cache, so "lazy" here means deferred query execution, not deferred data access. They also call out that the lazy path skips the DuckDB security lockdown that PinSource applies.

A proper PinLazySource or upstream pin_url() could be interesting follow-ups if there's demand.

@cpsievert cpsievert left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once outstanding comments are addressed

gadenbuie and others added 5 commits June 12, 2026 12:17
When data_source is a pins board and table_name is omitted,
fail fast with a clear error instead of letting missing_arg()
propagate to PinSource and produce a confusing downstream error.
The data_source setter called old_source$cleanup() unconditionally
after normalization. Since normalize_data_source() passes DataSource
objects through as-is, reassigning the same instance would close the
connection on the still-active object.
Same fix as the R side — normalize_data_source() passes DataSource
objects through as-is, so reassigning the same instance would call
cleanup() on the still-active object.
Documents PinSource usage in both R and Python data sources vignettes,
including the lazy query pattern (tbl_sql for R, Polars LazyFrame for
Python) for large parquet pins. Also adds the same guidance to the
PinSource class-level docs.
@gadenbuie gadenbuie merged commit 07c99cd into main Jun 12, 2026
18 checks passed
@gadenbuie gadenbuie deleted the feat/pins-source branch June 12, 2026 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants