Merge pull request #98 from peter-gy/peter-gy/daft-ch01
Daft Course Kick-off via `What Makes Daft Special?` notebook
- README.md +1 -0
- daft/01_what_makes_daft_special.py +316 -0
- daft/README.md +27 -0
README.md
CHANGED
@@ -30,6 +30,7 @@ notebooks for educators, students, and practitioners.
 - Polars
 - 🔥 Pytorch
 - Duckdb
+- Daft
 - Altair
 - Plotly
 - matplotlib
daft/01_what_makes_daft_special.py
ADDED
@@ -0,0 +1,316 @@
+# /// script
+# requires-python = ">=3.12"
+# dependencies = [
+#     "daft==0.4.14",
+#     "marimo",
+# ]
+# ///
+
+import marimo
+
+__generated_with = "0.13.6"
+app = marimo.App(width="medium")
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    # What Makes Daft Special?
+
+    > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
+
+    Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst.
+    """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## 🎯 Introducing Daft: A Unified Data Engine
+
+    Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand.
+
+    The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster.
+
+    Let's go ahead and `pip install daft` to see it in action!
+    """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(df_with_discount, discount_slider, mo):
+    mo.vstack(
+        [
+            discount_slider,
+            df_with_discount.collect(),
+        ]
+    )
+    return
+
+
+@app.cell
+def _(daft, discount_slider):
+    # Let's create a very simple Daft DataFrame
+    df = daft.from_pydict(
+        {
+            "id": [1, 2, 3],
+            "product_name": ["Laptop", "Mouse", "Keyboard"],
+            "price": [1200, 25, 75],
+        }
+    )
+
+    # Perform a basic operation: calculate a new price after discount
+    df_with_discount = df.with_column(
+        "discounted_price",
+        df["price"] * (1 - discount_slider.value),
+    )
+    return (df_with_discount,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    discount_slider = mo.ui.slider(
+        start=0.05,
+        stop=0.5,
+        step=0.05,
+        label="Discount Rate:",
+        show_value=True,
+    )
+    return (discount_slider,)
+
+
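The display cell reads `df_with_discount` and `discount_slider` before the cells that define them appear in the file. This works because marimo orders cells by the variables they read and define, not by their position in the file. The sketch below illustrates that idea with the standard library's `graphlib`. The cell labels (`display`, `compute`, `slider`, `imports`) are hypothetical names chosen for illustration, and this is not marimo's actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical summary of four cells from this notebook:
# label -> (variables the cell reads, variables the cell defines)
cells = {
    "display": ({"df_with_discount", "discount_slider", "mo"}, set()),
    "compute": ({"daft", "discount_slider"}, {"df_with_discount"}),
    "slider": ({"mo"}, {"discount_slider"}),
    "imports": (set(), {"daft", "mo"}),
}

# Map each variable to the cell that defines it.
defined_by = {var: name for name, (_, defs) in cells.items() for var in defs}

# A cell depends on every cell that defines one of the variables it reads.
graph = {
    name: {defined_by[var] for var in refs}
    for name, (refs, _) in cells.items()
}

# static_order() yields each cell only after all of its dependencies.
run_order = list(TopologicalSorter(graph).static_order())
print(run_order)
```

Under these assumptions the only valid order is imports, slider, compute, display, which is why the `import daft` cell can sit at the bottom of the file.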
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## 🦀 Built with Rust: Performance and Simplicity
+
+    One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users:
+
+    * **Performance**: [Rust](https://www.rust-lang.org/) is known for its speed and memory efficiency. Unlike systems built on the Java Virtual Machine (JVM), Rust doesn't have a garbage collector that can introduce unpredictable pauses. This often translates to faster execution and more predictable performance.
+    * **Efficient Python Integration**: Daft uses Rust's native Python bindings. This allows Python code (like your DataFrame operations or User-Defined Functions, which we'll cover later) to interact closely with the Rust engine. This can reduce the overhead often seen when bridging Python with JVM-based systems (e.g., PySpark), especially for custom Python logic.
+    * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning compared to JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies.
+
+    Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance.
+    """
+    )
+    return
+
+
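To make "zero-copy data sharing" concrete, here is a small standard-library analogy. It does not use Arrow or Daft: an `array` stands in for a column buffer and `memoryview` stands in for a second consumer of the same memory. Arrow applies the same principle across languages through a standardized columnar layout.

```python
from array import array

# A contiguous buffer of 64-bit integers, standing in for a column of prices.
column = array("q", [1200, 25, 75])

# A memoryview exposes the same memory without copying it.
view = memoryview(column)

# Slicing a memoryview produces another view, still without a copy.
tail = view[1:]

# A write through the original buffer is visible through both views,
# which shows that all three names refer to the same bytes.
column[1] = 30
print(view.tolist())
print(tail.tolist())
```

Because no bytes are duplicated, handing the data to another consumer costs the same regardless of how large the buffer is.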
+@app.cell(hide_code=True)
+def _(mo):
+    mo.center(
+        mo.image(
+            src="https://minio.peter.gy/static/assets/marimo/learn/daft/daft-anti-spark-social-club.jpeg",
+            alt="Daft Anti Spark Social Club Meme",
+            caption="💡 Fun Fact: Creators of Daft are proud members of the 'Anti Spark Social Club'.",
+            width=512,
+            height=682,
+        )
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop: usually not a great prospect for your device's memory!""")
+    return
+
+
+@app.cell
+def _(daft):
+    trillion_rows_df = (
+        daft.range(1_000_000_000_000)
+        .with_column("times_2", daft.col("id") * 2)
+        .filter(daft.col("id") % 2 == 0)
+    )
+    trillion_rows_df
+    return (trillion_rows_df,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan*, a blueprint of the transformations you've defined. You can inspect this plan:""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo, trillion_rows_df):
+    mo.mermaid(trillion_rows_df.explain(format="mermaid").split("\nSet")[0][11:-3])
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.""")
+    return
+
+
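The lazy-execution idea can be sketched in plain Python. The `LazyRange` class below is a hypothetical toy, not Daft's implementation: it records operations as a plan and only produces rows when `show` is called. Daft's optimizer and physical execution are not modeled here.

```python
from itertools import islice


class LazyRange:
    """Toy lazy frame over a single integer column named "id"."""

    def __init__(self, n, steps=()):
        self.n = n
        self.steps = tuple(steps)  # recorded operations, not results

    def with_column(self, name, fn):
        # Return a new plan with one more step; nothing is computed yet.
        return LazyRange(self.n, self.steps + (("with_column", name, fn),))

    def filter(self, predicate):
        return LazyRange(self.n, self.steps + (("filter", None, predicate),))

    def explain(self):
        parts = [f"range({self.n})"]
        parts += [f"{op} {name}" if name else op for op, name, _ in self.steps]
        return " -> ".join(parts)

    def _rows(self):
        # Rows are generated one at a time, only when something consumes them.
        for i in range(self.n):
            row = {"id": i}
            keep = True
            for op, name, fn in self.steps:
                if op == "with_column":
                    row[name] = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                yield row

    def show(self, limit):
        # Execution happens here, and stops after `limit` rows.
        return list(islice(self._rows(), limit))


plan = (
    LazyRange(1_000_000_000_000)
    .with_column("times_2", lambda row: row["id"] * 2)
    .filter(lambda row: row["id"] % 2 == 0)
)
print(plan.explain())
print(plan.show(3))
```

Defining `plan` is instant even for a trillion rows, because only `show` walks the data, and it stops after the requested number of rows.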
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## Scale Your Work: From Laptop to Cluster
+
+    Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run:
+
+    * **Locally**: Utilizing multiple cores on your laptop or a single powerful machine for development or processing moderately sized datasets.
+    * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines.
+
+    This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter.
+    """
+    )
+    return
+
+
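The "write once, scale anywhere" idea amounts to keeping the pipeline logic separate from the thing that runs it. The sketch below is only an analogy using the standard library: a thread pool stands in for a distributed runner, and the helper names `double_evens` and `run` are hypothetical. Daft's Ray integration distributes work across machines, which this toy does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor


def double_evens(partition):
    """Pipeline logic, written once: keep even ids and double them."""
    return [i * 2 for i in partition if i % 2 == 0]


def run(partitions, mapper):
    """Apply the pipeline to every partition using the given map function."""
    return [row for part in mapper(double_evens, partitions) for row in part]


partitions = [range(0, 5), range(5, 10)]

# "Local" runner: the built-in map, one partition after another.
local_result = run(partitions, map)

# "Parallel" runner: the same pipeline dispatched through a thread pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    pooled_result = run(partitions, pool.map)

print(local_result == pooled_result)
```

Only the `mapper` argument changes between the two runs; `double_evens` is untouched, which is the property the section describes.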
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## 🖼️ Handling More Than Just Tables: Multimodal Data Support
+
+    Modern datasets often contain more than just numbers and text. They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON.
+
+    Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common.
+
+    As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame:
+    """
+    )
+    return
+
+
+@app.cell
+def _(daft):
+    (
+        # Fetch open data from the National Gallery of Art
+        daft.read_csv(
+            "https://github.com/NationalGalleryOfArt/opendata/raw/refs/heads/main/data/published_images.csv"
+        )
+        # Working only with first 5 rows to reduce latency of image fetching during this example
+        .limit(5)
+        # Select the object ID and the image thumbnail URL
+        .select(
+            daft.col("depictstmsobjectid").alias("objectid"),
+            daft.col("iiifthumburl")
+            # Download the content from the URL (string -> bytes)
+            .url.download(on_error="null")
+            # Decode the image bytes into an image object (bytes -> image)
+            .image.decode()
+            .alias("thumbnail"),
+        )
+        # Use Daft's built-in image resizing function to create smaller thumbnails
+        .with_column(
+            "thumbnail_resized",
+            # Resize the 'thumbnail' image column
+            daft.col("thumbnail").image.resize(w=32, h=32),
+        )
+        # Execute the plan and bring the results into memory
+        .collect()
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.""")
+    return
+
+
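The pipeline above passes `on_error="null"` to the download step, so a failed fetch becomes a null value instead of aborting the query. The sketch below models that behavior in plain Python with an in-memory stand-in for the network. The functions `download`, `decode`, and `resize`, the `fake_store` dictionary, and the example URLs are all hypothetical. That later steps pass nulls through unchanged is an assumption of this toy, not a statement about Daft's internals.

```python
def download(url, store):
    """Return bytes for a URL, or None on failure (mimics on_error="null")."""
    return store.get(url)


def decode(data):
    """Pretend decoder: parse (width, height) from bytes such as b"200x150"."""
    if data is None:
        return None  # a null input stays null
    width, height = data.decode().split("x")
    return int(width), int(height)


def resize(image, w, h):
    """Pretend resize: a non-null image always takes the requested size."""
    if image is None:
        return None
    return (w, h)


# Hypothetical in-memory "web": one URL resolves, the other does not.
fake_store = {"https://example.com/a.jpg": b"200x150"}
urls = ["https://example.com/a.jpg", "https://example.com/missing.jpg"]

thumbnails = [decode(download(u, fake_store)) for u in urls]
resized = [resize(img, w=32, h=32) for img in thumbnails]
print(thumbnails)
print(resized)
```

The row with the missing URL survives every step as `None`, so one bad link does not prevent the other rows from being processed.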
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## 🧑‍💻 Designed for Developers: Python and SQL Interfaces
+
+    Daft aims to be developer-friendly by offering flexible ways to interact with your data:
+
+    * **Pythonic DataFrame API**: If you've used Pandas, Polars or similar libraries, Daft's Python API for DataFrames will feel quite natural. It provides a rich set of methods for data manipulation, transformation, and analysis.
+    * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system.
+
+    This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills.
+    """
+    )
+    return
+
+
+@app.cell
+def _(daft):
+    df_simple = daft.from_pydict(
+        {
+            "item_code": [101, 102, 103, 104],
+            "quantity": [5, 0, 12, 7],
+            "region": ["North", "South", "North", "East"],
+        }
+    )
+    return (df_simple,)
+
+
+@app.cell
+def _(df_simple):
+    # Pandas-flavored API
+    df_simple.where(
+        (df_simple["quantity"] > 0) & (df_simple["region"] == "North")
+    ).collect()
+    return
+
+
+@app.cell
+def _(daft, df_simple):
+    # Polars-flavored API
+    df_simple.where(
+        (daft.col("quantity") > 0) & (daft.col("region") == "North")
+    ).collect()
+    return
+
+
+@app.cell
+def _(daft):
+    # SQL Interface
+    daft.sql(
+        "SELECT * FROM df_simple WHERE quantity > 0 AND region = 'North'"
+    ).collect()
+    return
+
+
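All three cells above express the same predicate. To check what result to expect without installing Daft, the sketch below runs the same data through a plain Python filter and through the notebook's SQL text using the standard library's `sqlite3` as a stand-in engine. SQLite is not Daft, so this only shows that the two styles describe the same rows.

```python
import sqlite3

# Same data as df_simple, as (item_code, quantity, region) tuples.
rows = [
    (101, 5, "North"),
    (102, 0, "South"),
    (103, 12, "North"),
    (104, 7, "East"),
]

# Expression-style filter in plain Python.
python_result = [r for r in rows if r[1] > 0 and r[2] == "North"]

# The same predicate as SQL, executed by an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df_simple (item_code INTEGER, quantity INTEGER, region TEXT)"
)
conn.executemany("INSERT INTO df_simple VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT * FROM df_simple WHERE quantity > 0 AND region = 'North'"
).fetchall()
conn.close()

# Sort before comparing, since SQL does not guarantee row order here.
print(sorted(sql_result) == python_result)
```

Both approaches keep items 101 and 103, the two rows with a positive quantity in the North region.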
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## Daft's Value Proposition
+
+    So, what makes Daft special? It's the combination of these design choices:
+
+    * A **Rust-based core engine** provides a solid foundation for performance and memory management.
+    * **Built-in scalability** means your code can transition from local development to distributed clusters (with Ray) with minimal changes.
+    * **Native handling of multimodal data** opens doors for complex ML/AI and analytics tasks that go beyond traditional tabular data.
+    * **Developer-centric Python and SQL APIs** offer flexibility and ease of use.
+
+    These elements combine to make Daft a versatile tool for tackling modern data challenges.
+
+    And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows.
+    """
+    )
+    return
+
+
+@app.cell
+def _():
+    import daft
+    import marimo as mo
+
+    return daft, mo
+
+
+if __name__ == "__main__":
+    app.run()
daft/README.md
ADDED
@@ -0,0 +1,27 @@
+# Learn Daft
+
+_🚧 This collection is a work in progress. Please help us add notebooks!_
+
+This collection of marimo notebooks is designed to teach you the basics of
+Daft, a distributed dataframe engine that unifies data engineering, analytics & ML/AI workflows.
+
+**Help us build this course!**
+
+We're seeking contributors to help us build these notebooks. Every contributor
+will be acknowledged as an author in this README and in their contributed
+notebooks. Head over to the [tracking
+issue](https://github.com/marimo-team/learn/issues/43) to sign up for a planned
+notebook or propose your own.
+
+**Running notebooks.** To run a notebook locally, use
+
+```bash
+uvx marimo edit <file_url>
+```
+
+You can also open notebooks in our online playground by appending marimo.app/ to a notebook's URL.
+
+**Authors.**
+
+* [Péter Gyarmati](https://github.com/peter-gy)
+Thanks to all our notebook authors!