A hands-on, copy-paste-friendly tour of the dataframe ecosystem.
By the end of this tutorial you will have:
- loaded
iris.parquetinto aDataFrame - inspected columns and rows
- filtered to a subset of rows
- derived a few new columns using typed expressions (
Expr a) - grouped + aggregated columns
- made a quick plot
- written results back out to CSV
We'll be using a notebook for this tutorial. So load up a notebook interface anyway you prefer:
- through binder - this takes some time to load/build the environment.
- Or you can run the docker image locally by following the QuickStart instructions.
All three ways already come pre-loaded with the dataframe environment.
Select the Haskell in the notebook tab. This should open a new, untitled notebook file. Save this file as iris.ipynb.
You can see the status of the Haskell kernel on the bottom left. That's it - we are ready to get to work!
Let's start by reading the data and showing the first 10 rows.
df <- D.readParquet "./data/iris.parquet"
D.take 10 dfIn early exploration, selecting a few columns makes everything easier to read.
Let's, for some time, just look at the petal columns.
:set -XOverloadedStrings
import DataFrame.Operators
df |> D.select ["petal.width", "petal.length", "variety"]
|> D.take 5We introduce two new concepts here.
Firstly, because Haskell has many different kinds of strings we turn on the OverloadedStrings extension so we can use them equally in all contexts.
We also import the "chaining" operator which allows us to write operations such that the result of the previous operation is applied to the next. So instead of writing:
D.take 5 (D.select ["petal.width", "petal.length", "variety"] df)In the example above we first select the three columns from the dataframe. Then, the outer function take 5 retrieves only the first 5 rows.
We can rewrite this with chaining as follows:
df |> D.select ["petal.width", "petal.length", "variety"]
|> D.take 5This says:
- start with the dataframe,
- pass it to the end of the select function
- take that result and pass it to the end of the take function.
This is a common idiom in languages like R and Julia, where it's sometimes called piping.
Let's say we want to filter all flowers whose petal width is greater than 6. We can use the filterWhere function. filterWhere takes a boolean expression (Expr Bool). Dataframes have an expression syntax (similar to Polars) that allow us to write logic on top of columns.
There are a couple of core functions we can use to define columns:
F.col @Int "x" -- this says "I have an int column called 'x'"
F.lit @Int 5 -- this says "I have a literal value `5` that is an integer`.Expressions support regular math operations:
-- Take the value at x and add 5 to it.
-- The type of this is `Expr Int`
F.col @Int "x" + F.lit @Int 5
-- Compute the sine of the column x and add the cosine of 0.5.
-- The type of this is `Expr Double`
sin (F.col @Double "x") + cos (F.lit @Double 0.5)However, for comparison operations we have a special syntax. Same-type (non-nullable) operators are wrapped in dots on both sides; nullable-aware operators use a single leading dot.
-- Is x greater than 5? (both sides are non-nullable Int)
-- The type of this is `Expr Bool`
F.col @Int "x" .>=. F.lit @Int 5The expression language eliminates a class of bugs from column operations (like adding a string to an integer).
Armed with this knowledge, we can go back and filter all flowers with petal length greater than 6.
We can see from the sample we printed before that petal.length is of type Double. We write the expression as follows:
df |> D.filterWhere (F.col @Double "petal.length" .>. F.lit @Double 6)
|> D.select ["petal.width", "petal.length", "variety"]Suppose we write out the wrong types in the expression. That is suppose we say that petal.length is a Int instead. This will cause a runtime failure:
-- This fails at runtime because "petal.length" is Double, not Int:
df |> D.filterWhere (F.col @Int "petal.length" .>. F.lit @Int 6)A typo like petal_length would also cause the same sort of runtime failures. We'd ideally like to catch these kinds of errors earlier so we don't have unexpected failures while running a long pipeline.
We can ask Haskell to generate the correct references and use them without fear. The declareColumns function does exactly that. It takes the column name and creates a variable with the column name as an all lowercase string, with all special characters replaced with underscores.
In our case petal.width becomes petal_width or if it were Petal Width (cm) it would become petal_width_cm_.
Once we run declareColumns (which requires TemplateHaskell to be enabled) we get the column names as completion options in the notebook.
:set -XTemplateHaskell
F.declareColumns dfIn fact, we can make this even shorter. Since Haskell knows how to create any num instance from literals, we don't have to write F.lit @Int 6. It knows, from the context, to wrap 6 so it's an Expr Double. Thus, we can write:
df |> D.filterWhere (petal_length .>. 6)
|> D.select ["petal.width", "petal.length", "variety"]You’ll often want new columns that are:
- arithmetic combinations (ratio, area),
- boolean flags, or,
- normalized features (zScore)
We can use the derive function which takes a column name and an expression. It will evaluate the expression and put the results in a new column with the given name. Again, our type-safe column references make this code safer and easier to write.
df |> D.derive "petal_area" (petal_width * petal_length)
|> D.select ["petal.width", "petal.length", "petal_area"]
|> D.take 5We can also derive many columns at once using a variant called deriveMany.
df |> D.deriveMany [ "petal_area" .= petal_width * petal_length
, "sepal_area" .= sepal_width * sepal_length
, "petal_ratio" .= petal_width / petal_length
]
|> D.select ["petal_area", "sepal_area", "petal_ratio"]
|> D.take 5You can also use custom Haskell functions to manipulate dataframe columns. Say you had the following Haskell function that takes in the petal length and bucketizes it.
let bucketize x
| x < 2 = "small"
| x < 5 = "medium"
| otherwise = "large"
df |> D.derive "petal_bucket" (F.lift bucketize petal_length)
|> D.select ["petal.length", "petal_bucket"]
|> D.take 5If we wanted to apply a function that takes in two variables to our columns we would use the lift2 function. For example, we can define petal_area as "petal_area" .= F.lift2 (*) petal_width petal_length.
Let's create a "report" that computes counts and basic stats per group.
We use the groupBy to group by some columns and we use aggregate to combine column values.
df |> D.groupBy ["variety"]
|> D.aggregate [ "n" .= F.count petal_length
, "meanPL" .= F.mean petal_length
, "sdPL" .= F.stddev petal_length
, "minPW" .= F.minimum petal_width
, "maxPW" .= F.maximum petal_width
]Leaving a column unaggregated causes a runtime error:
df |> D.groupBy ["variety"]
|> D.aggregate ["some_random_thing" .= petal_width]All base columns should be aggregated in the final aggregation expression.
We can show the sepal width and length by variety with the following terminal plot:
D.plotScatterBy "sepal.width" "sepal.length" "variety" dfAdd the following to the end of your notebook:
df2 = df |> D.groupBy ["variety"]
|> D.aggregate
[ "n" .= F.count petal_length
, "meanPL" .= F.mean petal_length
, "sdPL" .= F.stddev petal_length
, "minPW" .= F.minimum petal_width
, "maxPW" .= F.maximum petal_width
]
D.writeCsv "out/report.csv" df2Run and check that the CSV exists.











