Srihari Thyagarajan committed on
Commit
a01a779
·
unverified ·
2 Parent(s): f9be907 fa5f4bb

Merge pull request #98 from peter-gy/peter-gy/daft-ch01

Daft Course Kick-off via `What Makes Daft Special?` notebook

Files changed (3)
  1. README.md +1 -0
  2. daft/01_what_makes_daft_special.py +316 -0
  3. daft/README.md +27 -0
README.md CHANGED
@@ -30,6 +30,7 @@ notebooks for educators, students, and practitioners.
 - ❄️ Polars
 - πŸ”₯ Pytorch
 - πŸ—„οΈ Duckdb
+- πŸ’œ Daft
 - πŸ“ˆ Altair
 - πŸ“ˆ Plotly
 - πŸ“ˆ matplotlib
daft/01_what_makes_daft_special.py ADDED
@@ -0,0 +1,316 @@
+# /// script
+# requires-python = ">=3.12"
+# dependencies = [
+#     "daft==0.4.14",
+#     "marimo",
+# ]
+# ///
+
+import marimo
+
+__generated_with = "0.13.6"
+app = marimo.App(width="medium")
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        # What Makes Daft Special?
+
+        > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
+
+        Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst.
+        """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ## 🎯 Introducing Daft: A Unified Data Engine
+
+        Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand.
+
+        The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster.
+
+        Let's go ahead and `pip install daft` to see it in action!
+        """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(df_with_discount, discount_slider, mo):
+    mo.vstack(
+        [
+            discount_slider,
+            df_with_discount.collect(),
+        ]
+    )
+    return
+
+
+@app.cell
+def _(daft, discount_slider):
+    # Let's create a very simple Daft DataFrame
+    df = daft.from_pydict(
+        {
+            "id": [1, 2, 3],
+            "product_name": ["Laptop", "Mouse", "Keyboard"],
+            "price": [1200, 25, 75],
+        }
+    )
+
+    # Perform a basic operation: calculate a new price after discount
+    df_with_discount = df.with_column(
+        "discounted_price",
+        df["price"] * (1 - discount_slider.value),
+    )
+    return (df_with_discount,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    discount_slider = mo.ui.slider(
+        start=0.05,
+        stop=0.5,
+        step=0.05,
+        label="Discount Rate:",
+        show_value=True,
+    )
+    return (discount_slider,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ## πŸ¦€ Built with Rust: Performance and Simplicity
+
+        One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users:
+
+        * **Performance**: [Rust](https://www.rust-lang.org/) is known for its speed and memory efficiency. Unlike systems built on the Java Virtual Machine (JVM), Rust doesn't have a garbage collector that can introduce unpredictable pauses. This often translates to faster execution and more predictable performance.
+        * **Efficient Python Integration**: Daft uses Rust's native Python bindings. This allows Python code (like your DataFrame operations or User-Defined Functions, which we'll cover later) to interact closely with the Rust engine. This can reduce the overhead often seen when bridging Python with JVM-based systems (e.g., PySpark), especially for custom Python logic.
+        * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning than JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies.
+
+        Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance.
+        """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.center(
+        mo.image(
+            src="https://minio.peter.gy/static/assets/marimo/learn/daft/daft-anti-spark-social-club.jpeg",
+            alt="Daft Anti Spark Social Club Meme",
+            caption="πŸ’‘ Fun Fact: Creators of Daft are proud members of the 'Anti Spark Social Club'.",
+            width=512,
+            height=682,
+        )
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!""")
+    return
+
+
+@app.cell
+def _(daft):
+    trillion_rows_df = (
+        daft.range(1_000_000_000_000)
+        .with_column("times_2", daft.col("id") * 2)
+        .filter(daft.col("id") % 2 == 0)
+    )
+    trillion_rows_df
+    return (trillion_rows_df,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo, trillion_rows_df):
+    mo.mermaid(trillion_rows_df.explain(format="mermaid").split("\nSet")[0][11:-3])
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ## 🌐 Scale Your Work: From Laptop to Cluster
+
+        Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run:
+
+        * **Locally**: Utilizing multiple cores on your laptop or a single powerful machine for development or processing moderately sized datasets.
+        * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines.
+
+        This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter.
+        """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ## πŸ–ΌοΈ Handling More Than Just Tables: Multimodal Data Support
+
+        Modern datasets often contain more than just numbers and text. They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON.
+
+        Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common.
+
+        As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame:
+        """
+    )
+    return
+
+
+@app.cell
+def _(daft):
+    (
+        # Fetch open data from the National Gallery of Art
+        daft.read_csv(
+            "https://github.com/NationalGalleryOfArt/opendata/raw/refs/heads/main/data/published_images.csv"
+        )
+        # Work only with the first 5 rows to reduce image-fetching latency in this example
+        .limit(5)
+        # Select the object ID and the image thumbnail URL
+        .select(
+            daft.col("depictstmsobjectid").alias("objectid"),
+            daft.col("iiifthumburl")
+            # Download the content from the URL (string -> bytes)
+            .url.download(on_error="null")
+            # Decode the image bytes into an image object (bytes -> image)
+            .image.decode()
+            .alias("thumbnail"),
+        )
+        # Use Daft's built-in image resizing function to create smaller thumbnails
+        .with_column(
+            "thumbnail_resized",
+            # Resize the 'thumbnail' image column
+            daft.col("thumbnail").image.resize(w=32, h=32),
+        )
+        # Execute the plan and bring the results into memory
+        .collect()
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ## πŸ§‘β€πŸ’» Designed for Developers: Python and SQL Interfaces
+
+        Daft aims to be developer-friendly by offering flexible ways to interact with your data:
+
+        * **Pythonic DataFrame API**: If you've used Pandas, Polars, or similar libraries, Daft's Python API for DataFrames will feel quite natural. It provides a rich set of methods for data manipulation, transformation, and analysis.
+        * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system.
+
+        This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills.
+        """
+    )
+    return
+
+
+@app.cell
+def _(daft):
+    df_simple = daft.from_pydict(
+        {
+            "item_code": [101, 102, 103, 104],
+            "quantity": [5, 0, 12, 7],
+            "region": ["North", "South", "North", "East"],
+        }
+    )
+    return (df_simple,)
+
+
+@app.cell
+def _(df_simple):
+    # Pandas-flavored API
+    df_simple.where(
+        (df_simple["quantity"] > 0) & (df_simple["region"] == "North")
+    ).collect()
+    return
+
+
+@app.cell
+def _(daft, df_simple):
+    # Polars-flavored API
+    df_simple.where(
+        (daft.col("quantity") > 0) & (daft.col("region") == "North")
+    ).collect()
+    return
+
+
+@app.cell
+def _(daft):
+    # SQL Interface
+    daft.sql(
+        "SELECT * FROM df_simple WHERE quantity > 0 AND region = 'North'"
+    ).collect()
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ## 🟣 Daft's Value Proposition
+
+        So, what makes Daft special? It's the combination of these design choices:
+
+        * A **Rust-based core engine** provides a solid foundation for performance and memory management.
+        * **Built-in scalability** means your code can transition from local development to distributed clusters (with Ray) with minimal changes.
+        * **Native handling of multimodal data** opens doors for complex ML/AI and analytics tasks that go beyond traditional tabular data.
+        * **Developer-centric Python and SQL APIs** offer flexibility and ease of use.
+
+        These elements combine to make Daft a versatile tool for tackling modern data challenges.
+
+        And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows πŸš€.
+        """
+    )
+    return
+
+
+@app.cell
+def _():
+    import daft
+    import marimo as mo
+
+    return daft, mo
+
+
+if __name__ == "__main__":
+    app.run()
daft/README.md ADDED
@@ -0,0 +1,27 @@
+# Learn Daft
+
+_🚧 This collection is a work in progress. Please help us add notebooks!_
+
+This collection of marimo notebooks is designed to teach you the basics of
+Daft, a distributed dataframe engine that unifies data engineering, analytics & ML/AI workflows.
+
+**Help us build this course! ⚒️**
+
+We're seeking contributors to help us build these notebooks. Every contributor
+will be acknowledged as an author in this README and in their contributed
+notebooks. Head over to the [tracking
+issue](https://github.com/marimo-team/learn/issues/43) to sign up for a planned
+notebook or propose your own.
+
+**Running notebooks.** To run a notebook locally, use
+
+```bash
+uvx marimo edit <file_url>
+```
+
+You can also open notebooks in our online playground by prefixing a notebook's URL with marimo.app/.
+
+**Authors.**
+
+* [Péter Gyarmati](https://github.com/peter-gy)
+Thanks to all our notebook authors!