Srihari Thyagarajan committed on
Commit d46f7d6 · unverified · 2 Parent(s): c419769 69973db

Merge pull request #104 from jesshart/tutorial-dataframe-transformer

Files changed (1)
  1. polars/06_Dataframe_Transformer.py +376 -0
polars/06_Dataframe_Transformer.py ADDED
@@ -0,0 +1,376 @@
+ # /// script
+ # dependencies = [
+ #     "marimo",
+ #     "numpy==2.2.3",
+ #     "plotly[express]==6.0.0",
+ #     "polars==1.28.1",
+ #     "requests==2.32.3",
+ # ]
+ # [tool.marimo.runtime]
+ # auto_instantiate = false
+ # ///
+
+ import marimo
+
+ __generated_with = "0.14.10"
+ app = marimo.App(width="medium")
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     # Polars with Marimo's Dataframe Transformer
+
+     *By [jesshart](https://github.com/jesshart)*
+
+     The goal of this notebook is to explore Marimo's data exploration capabilities alongside the power of Polars. Feel free to reference the latest about these Marimo features here: https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(requests):
+     json_data = requests.get(
+         "https://raw.githubusercontent.com/jesshart/fake-datasets/refs/heads/main/orders.json"
+     )
+     return (json_data,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     # Loading Data
+     Let's start by loading our data and converting it to a `pl.LazyFrame` with `.lazy()` so Polars can optimize our transformations and queries before running them.
+
+     Read more about `.lazy()` here: https://docs.pola.rs/user-guide/lazy/
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(json_data, pl):
+     demand: pl.LazyFrame = pl.read_json(json_data.content).lazy()
+     demand
+     return (demand,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     Above, you will notice that when you reference the object on its own, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from.
+
+     - 💡 Try out the `Table` view! You can click the `Preview data` button to get a quick view of your data.
+     - 💡 Take a look at the `Query plan`. Learn more about Polars' query plan here: https://docs.pola.rs/user-guide/lazy/query-plan/
+     """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     ## marimo's Native Dataframe UI
+
+     There are a few ways to leverage marimo's native dataframe UI. One is what we saw above: referencing a `pl.LazyFrame` directly. Here are the options:
+
+     - Reference a `pl.LazyFrame` (we already did this!)
+     - Reference a `pl.DataFrame` and see how it differs from its corresponding lazy version
+     - Use `mo.ui.table`
+     - Use `mo.ui.dataframe`
+     """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     ## Reference a `pl.DataFrame`
+     Let's reference the same frame as before, but this time as a `pl.DataFrame` by calling `.collect()` on it.
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(demand: "pl.LazyFrame"):
+     demand.collect()
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     Note how much functionality we have right out of the box. Click on column names to see rich features like sorting, freezing, filtering, searching, and more!
+
+     Notice how `order_quantity` has a green bar chart under it indicating the distribution of values for the field!
+
+     Don't miss the `Download` feature as well, which supports downloading in CSV, JSON, or Parquet format!
+     """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     ## Use `mo.ui.table`
+     `mo.ui.table` lets you select rows and then use that selection as a filtered set of rows downstream.
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(demand: "pl.LazyFrame", mo):
+     demand_table = mo.ui.table(demand, label="Demand Table")
+     return (demand_table,)
+
+
+ @app.cell
+ def _(demand_table):
+     demand_table
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""I like to use this feature to select groupings based on summary statistics so I can quickly explore subsets of categories. Let me show you what I mean.""")
+     return
+
+
+ @app.cell
+ def _(demand: "pl.LazyFrame", pl):
+     summary: pl.LazyFrame = demand.group_by("product_family").agg(
+         pl.mean("order_quantity").alias("mean"),
+         pl.sum("order_quantity").alias("sum"),
+         pl.std("order_quantity").alias("std"),
+         pl.min("order_quantity").alias("min"),
+         pl.max("order_quantity").alias("max"),
+         pl.col("order_quantity").null_count().alias("null_count"),
+     )
+     return (summary,)
+
+
+ @app.cell
+ def _(mo, summary: "pl.LazyFrame"):
+     summary_table = mo.ui.table(summary)
+     return (summary_table,)
+
+
+ @app.cell
+ def _(summary_table):
+     summary_table
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     Now, instead of manually creating a filter for what I want to take a closer look at, I simply select from the UI and do a simple join to see the detail rows behind the aggregated rows I picked.
+
+     The following cell takes the output of the `mo.ui.table` selection, selects its unique keys, and joins them to the original table to get the selected subset.
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(demand: "pl.LazyFrame", pl, summary_table):
+     selection_keys: pl.LazyFrame = (
+         summary_table.value.lazy().select("product_family").unique()
+     )
+     selection: pl.LazyFrame = selection_keys.join(
+         demand, on="product_family", how="left"
+     )
+     selection.collect()
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md("""You can learn more about joins in Polars by checking out my other interactive notebook here: https://marimo.io/p/@jesshart/basic-polars-joins""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""## Use `mo.ui.dataframe`""")
+     return
+
+
+ @app.cell
+ def _(demand: "pl.LazyFrame", mo):
+     demand_cached = demand.collect()
+     mo_dataframe = mo.ui.dataframe(demand_cached)
+     return demand_cached, mo_dataframe
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Below I simply call the object into view. We will play with it in the following cells.""")
+     return
+
+
+ @app.cell
+ def _(mo_dataframe):
+     mo_dataframe
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""One way to summarize this data directly in Polars is to group by product family and take the mean of `order_quantity`. This is how it is done in Polars:""")
+     return
+
+
+ @app.cell
+ def _(demand_cached, pl):
+     demand_agg: pl.DataFrame = demand_cached.group_by("product_family").agg(
+         pl.mean("order_quantity").name.suffix("_mean")
+     )
+     demand_agg
+     return (demand_agg,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         f"""
+     ## Try Before You Buy
+
+     1. Now try to do the same summary using Marimo's `mo.ui.dataframe` object above. Also, note how your aggregated column is already renamed! Nice touch!
+     2. Try (1) again, but use a select statement first. (This is actually better Polars practice anyway, since it reduces the frame before you aggregate.)
+
+     *When you are ready, check the `Python Code` tab at the top of the table to compare your output to the answer below.*
+     """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _():
+     mean_code = """
+     This may seem verbose compared to what I came up with, but quick-and-dirty output like this is really helpful for exploring the data and learning the Polars library at the same time.
+     ```python
+     df_next = df
+     df_next = df_next.group_by(
+         [pl.col("product_family")], maintain_order=True
+     ).agg(
+         [
+             pl.col("order_date").mean().alias("order_date_mean"),
+             pl.col("order_quantity").mean().alias("order_quantity_mean"),
+             pl.col("product").mean().alias("product_mean"),
+         ]
+     )
+     ```
+     """
+
+     mean_again_code = """
+     ```python
+     df_next = df
+     df_next = df_next.select(["product_family", "order_quantity"])
+     df_next = df_next.group_by(
+         [pl.col("product_family")], maintain_order=True
+     ).agg(
+         [
+             pl.col("order_quantity").mean().alias("order_quantity_mean"),
+         ]
+     )
+     ```
+     """
+     return mean_again_code, mean_code
+
+
+ @app.cell(hide_code=True)
+ def _(mean_again_code, mean_code, mo):
+     mo.accordion(
+         {
+             "Show Code (1)": mean_code,
+             "Show Code (2)": mean_again_code,
+         }
+     )
+     return
+
+
+ @app.cell
+ def _(demand_agg: "pl.DataFrame", mo, px):
+     bar_graph = px.bar(
+         demand_agg,
+         x="product_family",
+         y="order_quantity_mean",
+         title="Mean Quantity over Product Family",
+     )
+
+     note: str = """
+     Note: This graph is drawn from `demand_agg` above; your `mo.ui.dataframe` result should match it!
+
+     If you want more on interactive graphs, check out https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py
+     """
+
+     mo.vstack(
+         [
+             mo.md(note),
+             bar_graph,
+         ]
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     # About this Notebook
+     Polars and Marimo are both relatively new to the data wrangling space, but their power (and the thrill of their use) cannot be overstated. Well, I suppose it could, but you get the meaning. In this notebook, you learn how to leverage basic Polars skills to load and explore your data in concert with Marimo's powerful UI elements.
+
+     ## 📚 Documentation References
+
+     - **Marimo: Dataframe Transformation Guide**
+       https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
+
+     - **Polars: Lazy API Overview**
+       https://docs.pola.rs/user-guide/lazy/
+
+     - **Polars: Query Plan Explained**
+       https://docs.pola.rs/user-guide/lazy/query-plan/
+
+     - **Marimo Notebook: Basic Polars Joins (by jesshart)**
+       https://marimo.io/p/@jesshart/basic-polars-joins
+
+     - **Marimo Learn: Interactive Graphs with Polars**
+       https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py
+     """
+     )
+     return
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     return (mo,)
+
+
+ @app.cell
+ def _():
+     import polars as pl
+     import requests
+     import json
+     import plotly.express as px
+     return pl, px, requests
+
+
+ if __name__ == "__main__":
+     app.run()