[spark] Add load_csv and export_csv procedures #7898
Add two Spark procedures for CSV data exchange with Paimon tables:
- load_csv: Import CSV files into an existing Paimon table, with schema matching by column name, nested type support via from_json, and corrupt record tracking.
- export_csv: Export Paimon table data to a single CSV file, with an optional WHERE filter and nested types serialized as JSON strings.
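The export behavior described in this PR (nested values written as JSON strings, every field quoted, optional filtering) can be illustrated with a small plain-Python sketch. This is not the procedure's implementation and does not use Spark or Paimon: the function name `export_rows`, its parameters, and the sample rows are hypothetical, and Python's `csv` module escapes embedded quotes by doubling them, which may differ from the Spark CSV writer's settings.

```python
import csv
import io
import json


def export_rows(rows, columns, nested_columns=(), where=None):
    """Serialize dict rows to CSV text, quoting every field.

    Hypothetical helper for illustration only; it mimics the described
    behavior, not the actual export_csv procedure.
    """
    buffer = io.StringIO()
    # QUOTE_ALL plays the role of quoteAll=true: JSON values that
    # contain commas stay inside a single CSV field.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        # Optional filter, standing in for the WHERE condition.
        if where is not None and not where(row):
            continue
        out = []
        for col in columns:
            value = row.get(col)
            if value is None:
                out.append("")
            elif col in nested_columns:
                # Nested values are written as compact JSON strings.
                out.append(json.dumps(value, separators=(",", ":")))
            else:
                out.append(value)
        writer.writerow(out)
    return buffer.getvalue()


sample = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]
text = export_rows(
    sample, ["id", "tags"], nested_columns={"tags"},
    where=lambda r: r["id"] == 1,
)
print(text)
```

Without quoting, the comma inside the JSON array would split the `tags` value into two CSV fields; quoting every field avoids that without having to inspect each value.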
Cool!
Maybe it is better to support
@JingsongLi Thanks for the suggestion. After checking existing systems, I found two common directions:
I prefer starting with Option 1: support Databricks-style COPY INTO for import first, with CSV as the first supported format, and keep export as a procedure. This is closer to the Spark ecosystem and keeps the initial scope smaller. Snowflake-style export can be discussed separately later if needed.
Hi @JunRuiLee, I think we can look directly at this from Snowflake's perspective and see if there are any substantial bottlenecks.
Thanks @JingsongLi for the suggestion, I'll take a look.
Closing this PR as its contents have been superseded by PR #7926.
Purpose
In our scenario, many algorithm engineers work directly with datasets in CSV format. This PR adds Spark load_csv and export_csv procedures to make it easy to move data between CSV files and Paimon tables without writing custom Spark jobs.
load_csv imports CSV files into an existing Paimon table. It matches CSV header columns to target table columns by exact name, writes missing columns as null, drops extra columns, and always uses Spark CSV PERMISSIVE mode so malformed rows are counted in invalid_count and skipped. Nested columns are restored from JSON strings.
export_csv exports a Paimon table to a Spark CSV output directory, with optional where filtering. Nested columns are serialized as JSON strings, and quoteAll=true is enabled by default so JSON values containing commas are quoted correctly. Existing output paths are overwritten.
Tests
Added CsvProcedureTest.
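For readers who want to see the load_csv matching rules from the Purpose section in action, here is a minimal plain-Python sketch. It is not the procedure's code and does not use Spark: `load_rows`, its parameters, and the sample data are hypothetical. It also assumes a row is "malformed" when its field count differs from the header, which is a simplification of what Spark's PERMISSIVE mode reports, and it leaves all non-nested values as strings instead of casting them to the table's types.

```python
import csv
import io
import json


def load_rows(csv_text, target_columns, nested_columns=()):
    """Return (rows, invalid_count) for CSV text with a header line.

    Hypothetical helper for illustration only; it mimics the described
    matching rules, not the actual load_csv procedure.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # assumes the input has a header line
    rows, invalid_count = [], 0
    for fields in reader:
        # Assumption: a wrong field count marks the row as malformed;
        # such rows are counted and skipped.
        if len(fields) != len(header):
            invalid_count += 1
            continue
        record = dict(zip(header, fields))
        row = {}
        for col in target_columns:
            # Exact-name match; columns missing from the CSV become None.
            value = record.get(col)
            if value is not None and col in nested_columns:
                # Nested columns are restored from JSON strings
                # (assumes the JSON itself is valid).
                value = json.loads(value)
            row[col] = value
        # CSV columns not listed in target_columns are dropped.
        rows.append(row)
    return rows, invalid_count


csv_text = (
    'id,tags,extra\n'
    '1,"[""a"",""b""]",x\n'
    '2,"[]"\n'
    '3,"[""c""]",z\n'
)
rows, invalid_count = load_rows(
    csv_text, ["id", "tags", "score"], nested_columns={"tags"}
)
print(rows, invalid_count)
```

In this sample the `extra` column is dropped, `score` is filled with `None` because the CSV has no such column, and the second data row is counted as invalid because it has only two fields.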