Srihari Thyagarajan committed on
Commit
68f7784
·
unverified ·
2 Parent(s): 5e07263 34f04d3

Merge pull request #83 from Azmi-84/duckdb

Add DuckDB getting started guide with interactive examples.

Files changed (1)
  1. duckdb/01_getting_started.py +1792 -0
duckdb/01_getting_started.py ADDED
@@ -0,0 +1,1792 @@
+ # /// script
+ # requires-python = ">=3.11"
+ # dependencies = [
+ #     "marimo",
+ #     "duckdb==1.2.2",
+ #     "polars==1.27.0",
+ #     "numpy==2.2.4",
+ #     "pyarrow==19.0.1",
+ #     "pandas==2.2.3",
+ #     "sqlglot==26.12.1",
+ #     "plotly==5.23.1",
+ #     "statsmodels==0.14.4",
+ # ]
+ # ///
15
+
16
+ import marimo
17
+
18
+ __generated_with = "0.13.4"
19
+ app = marimo.App(width="medium")
20
+
21
+
22
+ @app.cell(hide_code=True)
23
+ def _(mo):
24
+ mo.md(
25
+ rf"""
26
+ <p align="center">
27
+ <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
28
+ </p>
29
+ """
30
+ )
31
+ return
32
+
33
+
34
+ @app.cell(hide_code=True)
35
+ def _(mo):
36
+ mo.md(
37
+ rf"""
38
+ # 🦆 **DuckDB**: An Embeddable Analytical Database System
39
+
40
+ ## What is DuckDB?
41
+
42
+ [DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.
43
+
44
+ ---
45
+
46
+ ## ⚡ Key Features
47
+
48
+ | Feature | Description |
49
+ |:---------|:-------------|
50
+ | **In-Process Architecture** | Runs directly within your application's memory space - no separate server needed, simplifying deployment |
51
+ | **Columnar Storage** | Data stored in columns instead of rows, dramatically improving performance for analytical queries |
52
+ | **Vectorized Execution** | Processes data in batches (vectors) of column values rather than row by row, significantly speeding up data processing |
53
+ | **ACID Transactions** | Ensures data integrity and reliability across operations |
54
+ | **Multi-Language Support** | Provides APIs for `Python`, `R`, `Java`, `C++`, and more |
55
+ | **Zero External Dependencies** | No external dependencies to install or manage, making setup and deployment straightforward |
56
+ | **High Portability** | Works across various operating systems (Windows, macOS, Linux) and hardware architectures |
57
+
58
+ ---
59
+
60
+ ## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file):
61
+
62
+ - **Data Analysis and Exploration:** DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis.
63
+ - **Embedded Analytics in Applications:** You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server.
64
+ - **ETL (Extract, Transform, Load) Processes:** DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline.
65
+ - **Data Science and Machine Learning Workflows:** It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models.
66
+ - **Rapid Prototyping of Data Analysis Pipelines:** Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment.
67
+ - **Small to Medium Datasets:** DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server.
68
+
69
+ ---
70
+
71
+ ### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python):
72
+
73
+ - Python installation:
74
+ ```
75
+ pip install duckdb
76
+ ```
77
+ ```
78
+ conda install python-duckdb -c conda-forge
79
+ ```
80
+
83
+ /// attention | Note
84
+ DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
85
+ ///
86
+ """
87
+ )
88
+ return
89
+
90
+
91
+ @app.cell(hide_code=True)
92
+ def _(mo):
93
+ mo.md(
94
+ r"""
95
+ # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)
96
+
97
+ DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.
98
+
99
+ ---
100
+
101
+ | Feature | In-Memory Connection | File-Based Connection |
102
+ |:---------|:---------------------|:----------------------|
103
+ | Persistence | Temporary (lost when session ends) | Stored on disk (persists between sessions) |
104
+ | Use Cases | Quick analysis, ephemeral data, testing | Long-term storage, data that needs to be accessed later |
105
+ | Performance | No disk I/O, so usually faster | Can be slower because of disk I/O, but data persists |
106
+ | Creation | `duckdb.connect(':memory:')` | `duckdb.connect('filename.db')` |
107
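+ | Multiple Connection Access | Each `duckdb.connect(':memory:')` call opens its own private database | Several connections within one process can share the file; multiple processes can share it only if all of them open it read-only |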
108
+ """
109
+ )
110
+ return
111
+
112
+
113
+ @app.cell(hide_code=True)
114
+ def _(os):
115
+ # Remove previous database if it exists
116
+ if os.path.exists("example.db"):
117
+ os.remove("example.db")
118
+
119
+ if not os.path.exists("data"):
120
+ os.makedirs("data")
121
+ return
122
+
123
+
124
+ @app.cell(hide_code=True)
125
+ def _(mo):
126
+ _df = mo.sql(
127
+ f"""
128
+ -- Print the DuckDB version
129
+ SELECT version() AS version_info
130
+ """
131
+ )
132
+ return
133
+
134
+
135
+ @app.cell(hide_code=True)
136
+ def _(mo):
137
+ mo.md(
138
+ """
139
+ ## Creating DuckDB Connections
140
+
141
+ Let's create both types of DuckDB connections and explore their characteristics.
142
+
143
+ 1. **In-memory connection**: Data exists only during the current session
144
+ 2. **File-based connection**: Data persists between sessions
145
+
146
+ We'll then demonstrate the key differences between these connection types.
147
+ """
148
+ )
149
+ return
150
+
151
+
152
+ @app.cell(hide_code=True)
153
+ def _(duckdb):
154
+ # Create an in-memory DuckDB connection
155
+ memory_db = duckdb.connect(":memory:")
156
+
157
+ # Create a file-based DuckDB connection
158
+ file_db = duckdb.connect("example.db")
159
+ return file_db, memory_db
160
+
161
+
162
+ @app.cell(hide_code=True)
163
+ def _(file_db, memory_db):
164
+ # Test both connections
165
+ memory_db.execute(
166
+ "CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)"
167
+ )
168
+ memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')")
169
+
170
+ file_db.execute(
171
+ "CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)"
172
+ )
173
+ file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')")
174
+ return
175
+
176
+
177
+ @app.cell(hide_code=True)
178
+ def _(mo):
179
+ mo.md(
180
+ r"""
181
+ ## Testing Connection Persistence
182
+
183
+ Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.
184
+
185
+ 1. First, we'll query our tables to confirm the data was properly inserted
186
+ 2. Then, we'll simulate an application restart by creating new connections
187
+ 3. Finally, we'll check which data persists after the "restart"
188
+ """
189
+ )
190
+ return
191
+
192
+
193
+ @app.cell(hide_code=True)
194
+ def _(mo):
195
+ mo.md(r"""## Current Database Contents""")
196
+ return
197
+
198
+
199
+ @app.cell(hide_code=True)
200
+ def _(mem_test, memory_db, mo):
201
+ _df = mo.sql(
202
+ f"""
203
+ SELECT * FROM mem_test
204
+ """,
205
+ engine=memory_db
206
+ )
207
+ return
208
+
209
+
210
+ @app.cell(hide_code=True)
211
+ def _(file_db, file_test, mo):
212
+ _df = mo.sql(
213
+ f"""
214
+ SELECT * FROM file_test
215
+ """,
216
+ engine=file_db
217
+ )
218
+ return
219
+
220
+
221
+ @app.cell
222
+ def _():
223
+ # We don't actually close the connections here since we need them for later cells
224
+ # Just a placeholder for the concept
225
+ return
226
+
227
+
228
+ @app.cell(hide_code=True)
229
+ def _(mo):
230
+ mo.md(rf"""## 🔄 Simulating Application Restart...""")
231
+ return
232
+
233
+
234
+ @app.cell(hide_code=True)
235
+ def _(duckdb):
236
+ # Create new connections (simulating restart)
237
+ new_memory_db = duckdb.connect(":memory:")
238
+ new_file_db = duckdb.connect("example.db")
239
+ return new_file_db, new_memory_db
240
+
241
+
242
+ @app.cell(hide_code=True)
243
+ def _(new_memory_db):
244
+ # Try to query tables in the new memory connection
245
+ try:
246
+ new_memory_db.execute("SELECT * FROM mem_test").df()
247
+ memory_persistence = "✅ Data persisted in memory (unexpected)"
248
+ memory_data_available = True
249
+ except Exception as e:
250
+ memory_persistence = "❌ Data lost from memory (expected behavior)"
251
+ memory_data_available = False
252
+ return memory_data_available, memory_persistence
253
+
254
+
255
+ @app.cell(hide_code=True)
256
+ def _(new_file_db):
257
+ # Try to query tables in the new file connection
258
+ try:
259
+ file_data = new_file_db.execute("SELECT * FROM file_test").df()
260
+ file_persistence = "✅ Data persisted in file (expected behavior)"
261
+ file_data_available = True
262
+ except Exception as e:
263
+ file_persistence = "❌ Data lost from file (unexpected)"
264
+ file_data_available = False
265
+ file_data = None
266
+ return file_data, file_data_available, file_persistence
267
+
268
+
269
+ @app.cell(hide_code=True)
270
+ def _(
271
+ file_data_available,
272
+ file_persistence,
273
+ memory_data_available,
274
+ memory_persistence,
275
+ mo,
276
+ ):
277
+ # Create an interactive display to show persistence results
278
+ persistence_results = mo.ui.table(
279
+ {
280
+ "Connection Type": ["In-Memory Database", "File-Based Database"],
281
+ "Persistence Status": [memory_persistence, file_persistence],
282
+ "Data Available After Restart": [
283
+ memory_data_available,
284
+ file_data_available,
285
+ ],
286
+ }
287
+ )
288
+ return (persistence_results,)
289
+
290
+
291
+ @app.cell(hide_code=True)
292
+ def _(mo, persistence_results):
293
+ mo.vstack(
294
+ [
295
+ mo.vstack([mo.md(f"""## Persistence Test Results""")], align="center"),
296
+ persistence_results,
297
+ ],
298
+ gap=2,
299
+ justify="space-between",
300
+ )
301
+ return
302
+
303
+
304
+ @app.cell(hide_code=True)
305
+ def _(file_data, file_data_available, mo):
306
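+     # marimo renders only a cell's last top-level expression, so nothing
+     # called inside an if block is displayed; build one output value instead.
+     (
+         mo.vstack([mo.md("### Persisted File-Based Data:"), mo.ui.table(file_data)])
+         if file_data_available
+         else None
+     )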
309
+ return
310
+
311
+
312
+ @app.cell(hide_code=True)
313
+ def _(mo):
314
+ mo.md(
315
+ r"""
316
+ # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)
317
+
318
+ DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.
319
+
320
+ ## Table Creation Options
321
+
322
+ DuckDB supports various table creation options, including:
323
+
324
+ - **Basic tables** with column definitions
325
+ - **Temporary tables** that exist only during the session
326
+ - **CREATE OR REPLACE** to recreate tables
327
+ - **Primary keys** and other constraints
328
+ - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
329
+ """
330
+ )
331
+ return
332
+
333
+
334
+ @app.cell(hide_code=True)
335
+ def _(file_db, new_memory_db):
336
+ # For the memory database
337
+ try:
338
+ new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
339
+ except:
340
+ pass
341
+
342
+ # For the file database
343
+ try:
344
+ file_db.execute("DROP TABLE IF EXISTS users_file")
345
+ except:
346
+ pass
347
+ return
348
+
349
+
350
+ @app.cell(hide_code=True)
351
+ def _(file_db, new_memory_db):
352
+ # Create advanced users table in memory database with primary key
353
+ new_memory_db.execute("""
354
+ CREATE TABLE users_memory (
355
+ id INTEGER PRIMARY KEY,
356
+ name VARCHAR NOT NULL,
357
+ age INTEGER CHECK (age > 0),
358
+ email VARCHAR UNIQUE,
359
+ registration_date DATE DEFAULT CURRENT_DATE,
360
+ last_login TIMESTAMP,
361
+ account_balance DECIMAL(10,2) DEFAULT 0.00
362
+ )
363
+ """)
364
+
365
+ # Create users table in file database
366
+ file_db.execute("""
367
+ CREATE TABLE users_file (
368
+ id INTEGER PRIMARY KEY,
369
+ name VARCHAR NOT NULL,
370
+ age INTEGER CHECK (age > 0),
371
+ email VARCHAR UNIQUE,
372
+ registration_date DATE DEFAULT CURRENT_DATE,
373
+ last_login TIMESTAMP,
374
+ account_balance DECIMAL(10,2) DEFAULT 0.00
375
+ )
376
+ """)
377
+ return
378
+
379
+
380
+ @app.cell(hide_code=True)
381
+ def _(new_memory_db):
382
+ # Get table schema information from the standard information_schema views
383
+ memory_schema = new_memory_db.execute("""
384
+ SELECT column_name, data_type, is_nullable
385
+ FROM information_schema.columns
386
+ WHERE table_name = 'users_memory'
387
+ ORDER BY ordinal_position
388
+ """).df()
389
+ return (memory_schema,)
390
+
391
+
392
+ @app.cell(hide_code=True)
393
+ def _(memory_schema, mo):
394
+ mo.vstack(
395
+ [
396
+ mo.vstack(
397
+ [mo.md(f"""## 🔍 Table Schema Information """)], align="center"
398
+ ),
399
+ mo.ui.table(memory_schema),
400
+ ],
401
+ gap=2,
402
+ justify="space-between",
403
+ )
404
+ return
405
+
406
+
407
+ @app.cell(hide_code=True)
408
+ def _(mo):
409
+ mo.md(
410
+ r"""
411
+ # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)
412
+
413
+ DuckDB supports multiple ways to insert data:
414
+
415
+ 1. **INSERT INTO VALUES**: Insert specific values
416
+ 2. **INSERT INTO SELECT**: Insert data from query results
417
+ 3. **Parameterized inserts**: Using prepared statements
418
+ 4. **Bulk inserts**: For efficient loading of multiple rows
419
+
420
+ Let's demonstrate these different insertion methods:
421
+ """
422
+ )
423
+ return
424
+
425
+
426
+ @app.cell(hide_code=True)
427
+ def _(date):
428
+ today = date.today()
429
+
430
+
431
+ # First check if records already exist to avoid duplicate key errors
432
+ def safe_insert(connection, table_name, data):
433
+ """
434
+ Safely insert data into a table by checking for existing IDs first
435
+ """
436
+ # Check which IDs already exist in the table
437
+ existing_ids = (
438
+ connection.execute(f"SELECT id FROM {table_name}")
439
+ .fetchdf()["id"]
440
+ .tolist()
441
+ )
442
+
443
+ # Filter out data with IDs that already exist
444
+ new_data = [record for record in data if record[0] not in existing_ids]
445
+
446
+ if not new_data:
447
+ print(
448
+ f"No new records to insert into {table_name}. All IDs already exist."
449
+ )
450
+ return 0
451
+
452
+ # Prepare the placeholders for the SQL statement
453
+ placeholders = ", ".join(
454
+ ["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data)
455
+ )
456
+
457
+ # Flatten the list of tuples for parameter binding
458
+ flat_data = [item for sublist in new_data for item in sublist]
459
+
460
+ # Perform the insertion
461
+ if flat_data:
462
+ columns = "(id, name, age, email, registration_date, last_login, account_balance)"
463
+ connection.execute(
464
+ f"INSERT INTO {table_name} {columns} VALUES {placeholders}",
465
+ flat_data,
466
+ )
467
+ return len(new_data)
468
+ return 0
469
+ return (safe_insert,)
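
The placeholder construction inside `safe_insert` is easier to follow on a tiny input. This sketch reproduces only that string-building step in plain Python; `demo_table` and the two sample rows are made up for illustration.

```python
# Two sample rows with three columns each (illustrative data).
rows = [(1, "Alice", 25), (2, "Bob", 30)]

# One "(?, ?, ?)" group per row, with one ? per column.
row_placeholder = "(" + ", ".join(["?"] * len(rows[0])) + ")"
placeholders = ", ".join([row_placeholder] * len(rows))

# Flatten the rows so the values line up with the placeholders in order.
flat_params = [value for row in rows for value in row]

sql = f"INSERT INTO demo_table VALUES {placeholders}"
print(sql)
print(flat_params)
```

The statement and the flattened parameter list must stay in the same order, since each `?` is bound positionally.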
470
+
471
+
472
+ @app.cell(hide_code=True)
473
+ def _():
474
+ # Prepare the data
475
+ user_data = [
476
+ (
477
+ 1,
478
+ "Alice",
479
+ 25,
480
+ "alice@example.com",
481
+ "2021-01-01",
482
+ "2023-01-15 14:30:00",
483
+ 1250.75,
484
+ ),
485
+ (
486
+ 2,
487
+ "Bob",
488
+ 30,
489
+ "bob@example.com",
490
+ "2021-02-01",
491
+ "2023-02-10 09:15:22",
492
+ 750.50,
493
+ ),
494
+ (
495
+ 3,
496
+ "Charlie",
497
+ 35,
498
+ "charlie@example.com",
499
+ "2021-03-01",
500
+ "2023-03-05 17:45:10",
501
+ 3200.25,
502
+ ),
503
+ (
504
+ 4,
505
+ "David",
506
+ 40,
507
+ "david@example.com",
508
+ "2021-04-01",
509
+ "2023-04-20 10:30:45",
510
+ 1800.00,
511
+ ),
512
+ (
513
+ 5,
514
+ "Emma",
515
+ 45,
516
+ "emma@example.com",
517
+ "2021-05-01",
518
+ "2023-05-12 11:20:30",
519
+ 2500.00,
520
+ ),
521
+ (
522
+ 6,
523
+ "Frank",
524
+ 50,
525
+ "frank@example.com",
526
+ "2021-06-01",
527
+ "2023-06-18 16:10:15",
528
+ 900.25,
529
+ ),
530
+ ]
531
+ return (user_data,)
532
+
533
+
534
+ @app.cell(hide_code=True)
535
+ def _(file_db, new_memory_db, safe_insert, user_data):
536
+ # Safely insert data into memory database
537
+ safe_insert(new_memory_db, "users_memory", user_data)
538
+
539
+ # Safely insert data into file database
540
+ safe_insert(file_db, "users_file", user_data)
541
+ return
542
+
543
+
544
+ @app.cell(hide_code=True)
545
+ def _():
546
+ # If you need to add just one record, you can use a similar approach:
547
+ new_user = (
548
+ 7,
549
+ "Grace",
550
+ 28,
551
+ "grace@example.com",
552
+ "2021-07-01",
553
+ "2023-07-22 13:45:10",
554
+ 1675.50,
555
+ )
556
+ return (new_user,)
557
+
558
+
559
+ @app.cell(hide_code=True)
560
+ def _(new_memory_db, new_user):
561
+ # Check if the ID exists before inserting
562
+ if not new_memory_db.execute(
563
+ "SELECT id FROM users_memory WHERE id = ?", [new_user[0]]
564
+ ).fetchone():
565
+ new_memory_db.execute(
566
+ """
567
+ INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance)
568
+ VALUES (?, ?, ?, ?, ?, ?, ?)
569
+ """,
570
+ new_user,
571
+ )
572
+ print(f"Added user {new_user[1]} to users_memory")
573
+ else:
574
+ print(f"User with ID {new_user[0]} already exists in users_memory")
575
+ return
576
+
577
+
578
+ @app.cell(hide_code=True)
579
+ def _(file_db, new_user):
580
+ # Do the same for the file database
581
+ if not file_db.execute(
582
+ "SELECT id FROM users_file WHERE id = ?", [new_user[0]]
583
+ ).fetchone():
584
+ file_db.execute(
585
+ """
586
+ INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
587
+ VALUES (?, ?, ?, ?, ?, ?, ?)
588
+ """,
589
+ new_user,
590
+ )
591
+ print(f"Added user {new_user[1]} to users_file")
592
+ else:
593
+ print(f"User with ID {new_user[0]} already exists in users_file")
594
+ return
595
+
596
+
597
+ @app.cell(hide_code=True)
598
+ def _(new_memory_db):
599
+ # First try to update
600
+ cursor = new_memory_db.execute(
601
+ """
602
+ UPDATE users_memory
603
+ SET name = ?, age = ?, email = ?,
604
+ registration_date = ?, last_login = ?, account_balance = ?
605
+ WHERE id = ?
606
+ """,
607
+ (
608
+ "Henry",
609
+ 33,
610
+ "henry@example.com",
611
+ "2021-08-01",
612
+ "2023-08-05 09:10:15",
613
+ 3100.75,
614
+ 8, # ID should be the last parameter
615
+ ),
616
+ )
617
+ return (cursor,)
618
+
619
+
620
+ @app.cell(hide_code=True)
621
+ def _(cursor, mo, new_memory_db):
622
+ # If no rows were updated, perform an insert
623
+ if cursor.rowcount == 0:
624
+ new_memory_db.execute(
625
+ """
626
+ INSERT INTO users_memory
627
+ (id, name, age, email, registration_date, last_login, account_balance)
628
+ VALUES (?, ?, ?, ?, ?, ?, ?)
629
+ """,
630
+ (
631
+ 8,
632
+ "Henry",
633
+ 33,
634
+ "henry@example.com",
635
+ "2021-08-01",
636
+ "2023-08-05 09:10:15",
637
+ 3100.75,
638
+ ),
639
+ )
640
+
641
+ mo.md(
642
+ f"""
643
+ Upserted Henry into users_memory.
644
+ """
645
+ )
646
+ return
647
+
648
+
649
+ @app.cell(hide_code=True)
650
+ def _(file_db, mo):
651
+ # For DuckDB using ON CONFLICT, we need to specify the conflict target column
652
+ file_db.execute(
653
+ """
654
+ INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
655
+ VALUES (?, ?, ?, ?, ?, ?, ?)
656
+ ON CONFLICT (id) DO UPDATE SET
657
+ name = EXCLUDED.name,
658
+ age = EXCLUDED.age,
659
+ email = EXCLUDED.email,
660
+ registration_date = EXCLUDED.registration_date,
661
+ last_login = EXCLUDED.last_login,
662
+ account_balance = EXCLUDED.account_balance
663
+ """,
664
+ (
665
+ 8,
666
+ "Henry",
667
+ 33,
668
+ "henry@example.com",
669
+ "2021-08-01",
670
+ "2023-08-05 09:10:15",
671
+ 3100.75,
672
+ ),
673
+ )
674
+
675
+ mo.md(
676
+ f"""
677
+ Upserted Henry into users_file.
678
+ """
679
+ )
680
+ return
681
+
682
+
683
+ @app.cell(hide_code=True)
684
+ def _(new_memory_db):
685
+ # Display memory data using DuckDB's query capabilities
686
+ memory_results = new_memory_db.execute("""
687
+ SELECT
688
+ id,
689
+ name,
690
+ age,
691
+ email,
692
+ registration_date,
693
+ last_login,
694
+ account_balance
695
+ FROM users_memory
696
+ ORDER BY id
697
+ """).df()
698
+ return (memory_results,)
699
+
700
+
701
+ @app.cell(hide_code=True)
702
+ def _(file_db):
703
+ # Display file data with formatting
704
+ file_results = file_db.execute("""
705
+ SELECT
706
+ id,
707
+ name,
708
+ age,
709
+ email,
710
+ registration_date,
711
+ last_login,
712
+ CAST(account_balance AS DECIMAL(10,2)) AS account_balance
713
+ FROM users_file
714
+ ORDER BY id
715
+ """).df()
716
+ return (file_results,)
717
+
718
+
719
+ @app.cell(hide_code=True)
720
+ def _(file_results, memory_results, mo):
721
+ tabs = mo.ui.tabs(
722
+ {
723
+ "In-Memory Database": mo.ui.table(memory_results),
724
+ "File-Based Database": mo.ui.table(file_results),
725
+ }
726
+ )
727
+
728
+ mo.vstack(
729
+ [
730
+ mo.vstack(
731
+ [mo.md(f"""## 📊 Database Contents After Insertion""")],
732
+ align="center",
733
+ ),
734
+ tabs,
735
+ ],
736
+ gap=2,
737
+ justify="space-between",
738
+ )
739
+ return
740
+
741
+
742
+ @app.cell(hide_code=True)
743
+ def _(mo):
744
+ mo.md(
745
+ r"""
746
+ # [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)
747
+
748
+ There are multiple ways to leverage DuckDB's SQL capabilities in marimo:
749
+
750
+ 1. **Direct execution**: Using DuckDB connections to execute SQL
751
+ 2. **marimo SQL**: Using marimo's built-in SQL support (`mo.sql`), which runs on DuckDB
752
+ 3. **Interactive queries**: Combining UI elements with SQL execution
753
+
754
+ Let's explore these approaches:
755
+ """
756
+ )
757
+ return
758
+
759
+
760
+ @app.cell(hide_code=True)
761
+ def _(mo):
762
+ mo.vstack(
763
+ [
764
+ mo.vstack([mo.md(f"""## 🔍 Query with marimo SQL""")], align="center"),
765
+ mo.md(
766
+ "### marimo has [built-in SQL support](https://docs.marimo.io/guides/working_with_data/sql/), powered by DuckDB, that can query DataFrames directly."
767
+ ),
768
+ ],
769
+ gap=2,
770
+ justify="space-between",
771
+ )
772
+ return
773
+
774
+
775
+ @app.cell(hide_code=True)
776
+ def _(memory_results, mo):
777
+ # Create a slider for the minimum-age threshold
778
+ age_threshold = mo.ui.slider(
779
+ 25, 50, value=30, label="Minimum Age", full_width=True, show_value=True
780
+ )
781
+
782
+
783
+ # Create a function to filter users based on the slider value
784
+ def filtered_users():
785
+ # Filter the pandas DataFrame directly instead of issuing a SQL query
786
+ filtered_df = memory_results[memory_results["age"] >= age_threshold.value]
787
+ filtered_df = filtered_df.sort_values("age")
788
+ return mo.ui.table(filtered_df)
789
+ return age_threshold, filtered_users
790
+
791
+
792
+ @app.cell(hide_code=True)
793
+ def _(age_threshold, filtered_users, mo):
794
+ layout = mo.vstack(
795
+ [
796
+ mo.md("### Select minimum age:"),
797
+ age_threshold,
798
+ mo.md("### Users meeting age criteria:"),
799
+ filtered_users(),
800
+ ],
801
+ gap=2,
802
+ justify="space-between",
803
+ )
804
+
805
+ layout
806
+ return
807
+
808
+
809
+ @app.cell(hide_code=True)
810
+ def _(mo):
811
+ mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""")
812
+ return
813
+
814
+
815
+ @app.cell(hide_code=True)
816
+ def _(pl):
817
+ # Create a Polars DataFrame
818
+ polars_df = pl.DataFrame(
819
+ {
820
+ "id": [101, 102, 103],
821
+ "name": ["Product A", "Product B", "Product C"],
822
+ "price": [29.99, 49.99, 19.99],
823
+ "category": ["Electronics", "Furniture", "Books"],
824
+ }
825
+ )
826
+ return (polars_df,)
827
+
828
+
829
+ @app.cell(hide_code=True)
830
+ def _(mo, polars_df):
831
+ mo.vstack(
832
+ [
833
+ mo.vstack(
834
+ [mo.md(f"""## Original Polars DataFrame:""")], align="center"
835
+ ),
836
+ mo.ui.table(polars_df),
837
+ ],
838
+ gap=2,
839
+ justify="space-between",
840
+ )
841
+ return
842
+
843
+
844
+ @app.cell(hide_code=True)
845
+ def _(new_memory_db, polars_df):
846
+ # Register the Polars DataFrame as a DuckDB table in memory connection
847
+ new_memory_db.register("products_polars", polars_df)
848
+
849
+ # Query the registered table
850
+ polars_query_result = new_memory_db.execute(
851
+ "SELECT * FROM products_polars WHERE price > 25"
852
+ ).df()
853
+ return (polars_query_result,)
854
+
855
+
856
+ @app.cell(hide_code=True)
857
+ def _(mo, polars_query_result):
858
+ mo.vstack(
859
+ [
860
+ mo.vstack(
861
+ [mo.md(f"""## DuckDB Query Result (From Polars Data):""")],
862
+ align="center",
863
+ ),
864
+ mo.ui.table(polars_query_result),
865
+ ],
866
+ gap=2,
867
+ justify="space-between",
868
+ )
869
+ return
870
+
871
+
872
+ @app.cell(hide_code=True)
873
+ def _(new_memory_db):
874
+ # Demonstrate a more complex query
875
+ complex_query_result = new_memory_db.execute("""
876
+ SELECT
877
+ category,
878
+ COUNT(*) as product_count,
879
+ AVG(price) as avg_price,
880
+ MIN(price) as min_price,
881
+ MAX(price) as max_price
882
+ FROM products_polars
883
+ GROUP BY category
884
+ ORDER BY avg_price DESC
885
+ """).df()
886
+ return (complex_query_result,)
887
+
888
+
889
+ @app.cell(hide_code=True)
890
+ def _(complex_query_result, mo):
891
+ mo.vstack(
892
+ [
893
+ mo.vstack(
894
+ [mo.md(f"""## Aggregated Product Data by Category:""")],
895
+ align="center",
896
+ ),
897
+ mo.ui.table(complex_query_result),
898
+ ],
899
+ gap=2,
900
+ justify="space-between",
901
+ )
902
+ return
903
+
904
+
905
+ @app.cell(hide_code=True)
906
+ def _(mo):
907
+ mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""")
908
+ return
909
+
910
+
911
+ @app.cell(hide_code=True)
912
+ def _(new_memory_db):
913
+ # Create another table to join with
914
+ new_memory_db.execute("""
915
+ CREATE TABLE IF NOT EXISTS departments (
916
+ id INTEGER,
917
+ department_name VARCHAR,
918
+ manager_id INTEGER
919
+ )
920
+ """)
921
+ return
922
+
923
+
924
+ @app.cell(hide_code=True)
925
+ def _(new_memory_db):
926
+ new_memory_db.execute("""
927
+ INSERT INTO departments VALUES
928
+ (101, 'Engineering', 1),
929
+ (102, 'Marketing', 2),
930
+ (103, 'Finance', NULL)
931
+ """)
932
+ return
933
+
934
+
935
+ @app.cell(hide_code=True)
936
+ def _(new_memory_db):
937
+ # Execute a join query
938
+ join_result = new_memory_db.execute("""
939
+ SELECT
940
+ u.id,
941
+ u.name,
942
+ u.age,
943
+ d.department_name
944
+ FROM users_memory u
945
+ LEFT JOIN departments d ON u.id = d.manager_id
946
+ ORDER BY u.id
947
+ """).df()
948
+ return (join_result,)
949
+
950
+
951
+ @app.cell(hide_code=True)
952
+ def _(mo):
953
+ mo.md(
954
+ rf"""
955
+ <!-- Display the join result -->
956
+ ## Join Result (Users and Departments):
957
+ """
958
+ )
959
+ return
960
+
961
+
962
+ @app.cell
963
+ def _(join_result, mo):
964
+ mo.ui.table(join_result)
965
+ return
966
+
967
+
968
+ @app.cell(hide_code=True)
969
+ def _(mo):
970
+ mo.md(
971
+ rf"""
972
+ <!-- Demonstrate different types of joins -->
973
+ ## Different Types of Joins
974
+ """
975
+ )
976
+ return
977
+
978
+
979
+ @app.cell(hide_code=True)
980
+ def _(new_memory_db):
981
+ # Inner join
982
+ inner_join = new_memory_db.execute("""
983
+ SELECT u.id, u.name, d.department_name
984
+ FROM users_memory u
985
+ INNER JOIN departments d ON u.id = d.manager_id
986
+ """).df()
987
+
988
+ # Right join
989
+ right_join = new_memory_db.execute("""
990
+ SELECT u.id, u.name, d.department_name
991
+ FROM users_memory u
992
+ RIGHT JOIN departments d ON u.id = d.manager_id
993
+ """).df()
994
+
995
+ # Full outer join
996
+ full_join = new_memory_db.execute("""
997
+ SELECT u.id, u.name, d.department_name
998
+ FROM users_memory u
999
+ FULL OUTER JOIN departments d ON u.id = d.manager_id
1000
+ """).df()
1001
+
1002
+ # Cross join
1003
+ cross_join = new_memory_db.execute("""
1004
+ SELECT u.id, u.name, d.department_name
1005
+ FROM users_memory u
1006
+ CROSS JOIN departments d
1007
+ """).df()
1008
+
1009
+ # Self join (Joining user table with itself to find users with the same age)
1010
+ self_join = new_memory_db.execute("""
1011
+ SELECT u1.id, u1.name, u2.name AS same_age_user
1012
+ FROM users_memory u1
1013
+ JOIN users_memory u2 ON u1.age = u2.age AND u1.id <> u2.id
1014
+ """).df()
1015
+
1016
+ # Semi join (Finding users who are also managers)
1017
+ semi_join = new_memory_db.execute("""
1018
+ SELECT u.id, u.name, u.age
1019
+ FROM users_memory u
1020
+ WHERE EXISTS (
1021
+ SELECT 1 FROM departments d
1022
+ WHERE u.id = d.manager_id
1023
+ )
1024
+ """).df()
1025
+
1026
+ # Anti join (Finding users who are not managers)
1027
+ anti_join = new_memory_db.execute("""
1028
+ SELECT u.id, u.name, u.age
1029
+ FROM users_memory u
1030
+ WHERE NOT EXISTS (
1031
+ SELECT 1 FROM departments d
1032
+ WHERE u.id = d.manager_id
1033
+ )
1034
+ """).df()
1035
+ return (
1036
+ anti_join,
1037
+ cross_join,
1038
+ full_join,
1039
+ inner_join,
1040
+ right_join,
1041
+ self_join,
1042
+ semi_join,
1043
+ )
1044
+
1045
+
1046
+ @app.cell(hide_code=True)
1047
+ def _(mo, new_memory_db):
1048
+ # Display the base tables in separate tabs
1049
+ users = new_memory_db.execute("SELECT * FROM users_memory").df()
1050
+ departments = new_memory_db.execute("SELECT * FROM departments").df()
1051
+
1052
+ base_tables = mo.vstack(
1053
+ [
1054
+ mo.vstack([mo.md(f"""# Base Tables""")], align="center"),
1055
+ mo.ui.tabs(
1056
+ {
1057
+ "User Table": mo.ui.table(users),
1058
+ "Departments Table": mo.ui.table(departments),
1059
+ }
1060
+ ),
1061
+ ]
1062
+ )
1063
+ base_tables
1064
+ return
1065
+
1066
+
1067
+ @app.cell(hide_code=True)
1068
+ def _(
1069
+ anti_join,
1070
+ cross_join,
1071
+ full_join,
1072
+ inner_join,
1073
+ join_result,
1074
+ mo,
1075
+ right_join,
1076
+ self_join,
1077
+ semi_join,
1078
+ ):
1079
+ join_description = {
1080
+ "Left Join": "Shows all records from the left table and matching records from the right table. Non-matches filled with NULL.",
1081
+ "Inner Join": "Shows only the records where there's a match in both tables.",
1082
+ "Right Join": "Shows all records from the right table and matching records from the left table. Non-matches filled with NULL.",
1083
+ "Full Outer Join": "Shows all records from both tables, with NULL values where there's no match.",
1084
+ "Cross Join": "Returns the Cartesian product - all possible combinations of rows from both tables.",
1085
+ "Self Join": "Joins a table with itself, used to compare rows within the same table.",
1086
+ "Semi Join": "Returns rows from the first table where one or more matches exist in the second table.",
1087
+ "Anti Join": "Returns rows from the first table where no matches exist in the second table.",
1088
+ }
1089
+
1090
+
1091
+ join_tabs = mo.ui.tabs(
1092
+ {
1093
+ "Left Join": mo.ui.table(join_result),
1094
+ "Inner Join": mo.ui.table(inner_join),
1095
+ "Right Join": mo.ui.table(right_join),
1096
+ "Full Outer Join": mo.ui.table(full_join),
1097
+ "Cross Join": mo.ui.table(cross_join),
1098
+ "Self Join": mo.ui.table(self_join),
1099
+ "Semi Join": mo.ui.table(semi_join),
1100
+ "Anti Join": mo.ui.table(anti_join),
1101
+ }
1102
+ )
1103
+ return join_description, join_tabs
1104
+
1105
+
1106
+ @app.cell(hide_code=True)
1107
+ def _(join_description, join_tabs, mo):
1108
+ join_display = mo.vstack(
1109
+ [
1110
+ mo.vstack([mo.md(f"""# SQL Join Operations""")], align="center"),
1111
+ mo.md(f"**{join_tabs.value}**: {join_description[join_tabs.value]}"),
1112
+ mo.md("## Join Results"),
1113
+ join_tabs,
1114
+ ],
1115
+ gap=2,
1116
+ justify="space-between",
1117
+ )
1118
+
1119
+ join_display
1120
+ return
1121
+
1122
+
1123
+ @app.cell(hide_code=True)
1124
+ def _(mo):
1125
+ mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""")
1126
+ return
1127
+
1128
+
1129
+ @app.cell(hide_code=True)
1130
+ def _(new_memory_db):
1131
+ # Execute an aggregate query
1132
+ agg_result = new_memory_db.execute("""
1133
+ SELECT
1134
+ AVG(age) as avg_age,
1135
+ MAX(age) as max_age,
1136
+ MIN(age) as min_age,
1137
+ COUNT(*) as total_users,
1138
+ SUM(account_balance) as total_balance
1139
+ FROM users_memory
1140
+ """).df()
1141
+ return (agg_result,)
1142
+
1143
+
1144
+ @app.cell(hide_code=True)
1145
+ def _(agg_result, mo):
1146
+ mo.vstack(
1147
+ [
1148
+ mo.vstack(
1149
+ [mo.md(f"""## Aggregate Results (All Users):""")], align="center"
1150
+ ),
1151
+ mo.ui.table(agg_result),
1152
+ ],
1153
+ gap=2,
1154
+ justify="space-between",
1155
+ )
1156
+ return
1157
+
1158
+
1159
+ @app.cell(hide_code=True)
1160
+ def _(new_memory_db):
1161
+ age_groups = new_memory_db.execute("""
1162
+ SELECT
1163
+ CASE
1164
+ WHEN age < 30 THEN 'Under 30'
1165
+ WHEN age BETWEEN 30 AND 40 THEN '30 to 40'
1166
+ ELSE 'Over 40'
1167
+ END as age_group,
1168
+ COUNT(*) as count,
1169
+ AVG(age) as avg_age,
1170
+ AVG(account_balance) as avg_balance
1171
+ FROM users_memory
1172
+ GROUP BY 1
1173
+ ORDER BY 1
1174
+ """).df()
1175
+ return (age_groups,)
1176
+
1177
+
1178
+ @app.cell(hide_code=True)
1179
+ def _(age_groups, mo):
+     mo.vstack(
+         [
+             mo.vstack(
+                 [mo.md(f"""## Aggregate Results (Grouped by Age Range):""")],
+                 align="center",
+             ),
+             mo.ui.table(age_groups),
+         ],
+         gap=2,
+         justify="space-between",
+     )
+     return
1193
+
1194
+
1195
+ @app.cell(hide_code=True)
1196
+ def _(new_memory_db):
1197
+ window_result = new_memory_db.execute("""
1198
+ SELECT
1199
+ id,
1200
+ name,
1201
+ age,
1202
+ account_balance,
1203
+ RANK() OVER (ORDER BY account_balance DESC) as balance_rank,
1204
+ account_balance - AVG(account_balance) OVER () as diff_from_avg,
1205
+ account_balance / SUM(account_balance) OVER () * 100 as pct_of_total
1206
+ FROM users_memory
1207
+ ORDER BY balance_rank
1208
+ """).df()
1209
+ return (window_result,)
1210
+
1211
+
1212
+ @app.cell(hide_code=True)
1213
+ def _(mo, window_result):
1214
+ mo.vstack(
1215
+ [
1216
+ mo.vstack([mo.md(f"""## Window Functions Example""")], align="center"),
1217
+ mo.ui.table(window_result),
1218
+ ],
1219
+ gap=2,
1220
+ justify="space-between",
1221
+ )
1222
+ return
1223
+
1224
+
1225
+ @app.cell(hide_code=True)
1226
+ def _(mo):
1227
+ mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""")
1228
+ return
1229
+
1230
+
1231
+ @app.cell(hide_code=True)
1232
+ def _(new_memory_db):
1233
+ polars_result = new_memory_db.execute(
1234
+ """SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
1235
+ ).pl()
1236
+ return (polars_result,)
1237
+
1238
+
1239
+ @app.cell(hide_code=True)
1240
+ def _(mo, polars_result):
1241
+ mo.vstack(
1242
+ [
1243
+ mo.vstack(
1244
+ [mo.md(f"""## Query Result as Polars DataFrame:""")],
1245
+ align="center",
1246
+ ),
1247
+ mo.ui.table(polars_result),
1248
+ ],
1249
+ gap=2,
1250
+ justify="space-between",
1251
+ )
1252
+ return
1253
+
1254
+
1255
+ @app.cell(hide_code=True)
1256
+ def _(new_memory_db):
1257
+ pandas_result = new_memory_db.execute(
1258
+ """SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
1259
+ ).fetch_df()
1260
+ return (pandas_result,)
1261
+
1262
+
1263
+ @app.cell(hide_code=True)
1264
+ def _(mo, pandas_result):
1265
+ mo.vstack(
1266
+ [
1267
+ mo.vstack(
1268
+ [mo.md(f"""## Same Query Result as Pandas DataFrame:""")],
1269
+ align="center",
1270
+ ),
1271
+ mo.ui.table(pandas_result),
1272
+ ],
1273
+ gap=2,
1274
+ justify="space-between",
1275
+ )
1276
+ return
1277
+
1278
+
1279
+ @app.cell(hide_code=True)
1280
+ def _(mo):
1281
+ mo.vstack(
1282
+ [
1283
+ mo.vstack(
1284
+ [mo.md(f"""## Differences in DataFrame Handling""")],
1285
+ align="center",
1286
+ ),
1287
+ mo.vstack(
1288
+ [
1289
+ mo.md(
1290
+ f"""## Polars: Filter users over 35 and calculate average balance"""
1291
+ )
1292
+ ],
1293
+ align="start",
1294
+ ),
1295
+ ],
1296
+ gap=2, justify="space-between",
1297
+ )
1298
+ return
1299
+
1300
+
1301
+ @app.cell(hide_code=True)
1302
+ def _(mo, pl, polars_result):
1303
+ def _():
1304
+ polars_filtered = polars_result.filter(pl.col("age") > 35)
1305
+ polars_avg = polars_filtered.select(
1306
+ pl.col("account_balance").mean().alias("avg_balance")
1307
+ )
1308
+
1309
+ layout = mo.vstack(
1310
+ [
1311
+ mo.md("### Filtered Polars DataFrame (Age > 35):"),
1312
+ mo.ui.table(polars_filtered),
1313
+ mo.md("### Average Account Balance:"),
1314
+ mo.ui.table(polars_avg),
1315
+ ],
1316
+ gap=2,
1317
+ )
1318
+ return layout
1319
+
1320
+
1321
+ _()
1322
+ return
1323
+
1324
+
1325
+ @app.cell(hide_code=True)
1326
+ def _(mo, pandas_result):
1327
+ pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean()
1328
+ mo.vstack(
1329
+ [
1330
+ mo.vstack(
1331
+ [mo.md(f"""## Pandas: Same operation in pandas style""")],
1332
+ align="center",
1333
+ ),
1334
+ mo.vstack(
1335
+ [mo.md(f"""### Average balance: {pandas_avg:.2f}""")],
1336
+ align="start",
1337
+ ),
1338
+ ]
1339
+ )
1340
+ return
1341
+
1342
+
1343
+ @app.cell(hide_code=True)
1344
+ def _(mo):
1345
+ mo.md("""# 9. Data Visualization with DuckDB and Plotly""")
1346
+ return
1347
+
1348
+
1349
+ @app.cell(hide_code=True)
1350
+ def _(age_groups, mo, new_memory_db, plotly_express):
1351
+ # User distribution by age group
1352
+ fig1 = plotly_express.bar(
1353
+ age_groups,
1354
+ x="age_group",
1355
+ y="count",
1356
+ title="User Distribution by Age Group",
1357
+ labels={"count": "Number of Users", "age_group": "Age Group"},
1358
+ color="age_group",
1359
+ color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
1360
+ )
1361
+ fig1.update_traces(
1362
+ text=age_groups["count"],
1363
+ textposition="outside",
1364
+ )
1365
+ fig1.update_layout(
1366
+ height=450,
1367
+ margin=dict(t=50, b=50, l=50, r=25),
1368
+ hoverlabel=dict(bgcolor="white", font_size=12),
1369
+ template="plotly_white",
1370
+ )
1371
+
1372
+
1373
+ # Average balance by age group
1374
+ fig2 = plotly_express.bar(
1375
+ age_groups,
1376
+ x="age_group",
1377
+ y="avg_balance",
1378
+ title="Average Account Balance by Age Group",
1379
+ labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"},
1380
+ color="age_group",
1381
+ color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
1382
+ )
1383
+ fig2.update_traces(
1384
+ text=[f"${val:.2f}" for val in age_groups["avg_balance"]],
1385
+ textposition="outside",
1386
+ )
1387
+ fig2.update_layout(
1388
+ height=450,
1389
+ margin=dict(t=50, b=50, l=50, r=25),
1390
+ hoverlabel=dict(bgcolor="white", font_size=12),
1391
+ template="plotly_white",
1392
+ )
1393
+
1394
+
1395
+ # Age vs Account Balance scatter plot
1396
+ scatter_data = new_memory_db.execute(
1397
+ """
1398
+ SELECT
1399
+ name,
1400
+ age,
1401
+ account_balance
1402
+ FROM users_memory
1403
+ ORDER BY age
1404
+ """
1405
+ ).df()
1406
+
1407
+ fig3 = plotly_express.scatter(
1408
+ scatter_data,
1409
+ x="age",
1410
+ y="account_balance",
1411
+ title="Age vs. Account Balance",
1412
+ labels={"account_balance": "Account Balance ($)", "age": "Age"},
1413
+ color_discrete_sequence=["#FF7F0E"],
1414
+ trendline="ols",
1415
+ hover_data=["age", "account_balance"],
1416
+ size_max=15,
1417
+ )
1418
+ fig3.update_traces(marker=dict(size=12))
1419
+ fig3.update_layout(
1420
+ height=450,
1421
+ margin=dict(t=50, b=50, l=50, r=25),
1422
+ hoverlabel=dict(bgcolor="white", font_size=12),
1423
+ template="plotly_white",
1424
+ )
1425
+
1426
+
1427
+ # Distribution of account balances
1428
+ balance_data = new_memory_db.execute(
1429
+ """
1430
+ SELECT
1431
+ name,
1432
+ account_balance
1433
+ FROM users_memory
1434
+ ORDER BY account_balance DESC
1435
+ """
1436
+ ).df()
1437
+
1438
+ fig4 = plotly_express.pie(
1439
+ balance_data,
1440
+ names="name",
1441
+ values="account_balance",
1442
+ title="Distribution of Account Balances",
1443
+ labels={"account_balance": "Account Balance ($)", "name": "User"},
1444
+ color_discrete_sequence=plotly_express.colors.qualitative.Pastel,
1445
+ )
1446
+ fig4.update_traces(textinfo="percent+label", textposition="inside")
1447
+ fig4.update_layout(
1448
+ height=450,
1449
+ margin=dict(t=50, b=50, l=50, r=25),
1450
+ hoverlabel=dict(bgcolor="white", font_size=12),
1451
+ template="plotly_white",
1452
+ )
1453
+
1454
+
1455
+ category_tabs = mo.ui.tabs(
1456
+ {
1457
+ "Age Group Analysis": mo.vstack(
1458
+ [
1459
+ mo.ui.tabs(
1460
+ {
1461
+ "User Distribution": mo.ui.plotly(fig1),
1462
+ "Average Balance": mo.ui.plotly(fig2),
1463
+ }
1464
+ )
1465
+ ],
1466
+ gap=2,
1467
+ justify="space-between",
1468
+ ),
1469
+ "Financial Analysis": mo.vstack(
1470
+ [
1471
+ mo.ui.tabs(
1472
+ {
1473
+ "Age vs Balance": mo.ui.plotly(fig3),
1474
+ "Balance Distribution": mo.ui.plotly(fig4),
1475
+ }
1476
+ )
1477
+ ],
1478
+ gap=2,
1479
+ justify="space-between",
1480
+ ),
1481
+ },
1482
+ lazy=True,
1483
+ )
1484
+
1485
+ mo.vstack(
1486
+ [
1487
+ mo.vstack(
1488
+ [mo.md(f"""## Select a visualization category:""")],
1489
+ align="start",
1490
+ ),
1491
+ category_tabs,
1492
+ ],
1493
+ gap=2,
1494
+ justify="space-between",
1495
+ )
1496
+ return
1497
+
1498
+
1499
+ @app.cell(hide_code=True)
1500
+ def _(mo):
1501
+ mo.md(
1502
+ r"""
1503
+ /// admonition |
1504
+ ## Database Management Best Practices
1505
+ ///
1506
+
1507
+ ### Closing Connections
1508
+
1509
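+ Close database connections when you are done with them. This matters most for file-based connections, because an open read-write connection holds a lock on the database file and prevents other processes from opening it: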
1510
+
1511
+ ```python
1512
+ memory_db.close()
1513
+ file_db.close()
1514
+ ```
1515
+
1516
+ ### Transaction Management
1517
+
1518
+ DuckDB supports transactions, which can be useful for more complex operations:
1519
+
1520
+ ```python
1521
+ conn = duckdb.connect('mydb.db')
+ conn.begin()  # Start transaction
+
+ try:
+     conn.execute("INSERT INTO users VALUES (1, 'Test User')")
+     conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1")
+     conn.commit()  # Commit both changes together
+ except Exception:
+     conn.rollback()  # Undo both changes if either statement fails
+     raise
1531
+ ```
1532
+
1533
+ ### Query Performance
1534
+
1535
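+ DuckDB is optimized for analytical (OLAP) queries that scan and aggregate many rows. For best performance:
+
+ - Use the narrowest appropriate data types (for example, `INTEGER` or `DATE` instead of `VARCHAR`)
+ - Create indexes only for highly selective lookups; large scans and aggregations rarely benefit from them
+ - For large datasets, consider partitioned files (for example, Hive-partitioned Parquet)
+ - Use prepared statements (`?` placeholders) for repeated queries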
1541
+ """
1542
+ )
1543
+ return
1544
+
1545
+
1546
+ @app.cell(hide_code=True)
1547
+ def _(mo):
1548
+ mo.md(r"""# 10. Interactive DuckDB Dashboard with marimo and Plotly""")
1549
+ return
1550
+
1551
+
1552
+ @app.cell(hide_code=True)
1553
+ def _(mo):
1554
+ # Create an interactive filter for age range
1555
+ min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age")
1556
+ max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age")
1557
+ return max_age, min_age
1558
+
1559
+
1560
+ @app.cell(hide_code=True)
1561
+ def _(max_age, min_age, new_memory_db):
1562
+ # Create a function to filter data and update visualizations
1563
+ def get_filtered_data(min_val=min_age.value, max_val=max_age.value):
1564
+ # Get filtered data based on slider values using parameterized query for safety
1565
+ return new_memory_db.execute(
1566
+ """
1567
+ SELECT
1568
+ id,
1569
+ name,
1570
+ age,
1571
+ email,
1572
+ account_balance,
1573
+ registration_date
1574
+ FROM users_memory
1575
+ WHERE age >= ? AND age <= ?
1576
+ ORDER BY age
1577
+ """,
1578
+ [min_val, max_val],
1579
+ ).df()
1580
+ return (get_filtered_data,)
1581
+
1582
+
1583
+ @app.cell(hide_code=True)
1584
+ def _(get_filtered_data):
1585
+ def get_metrics(data=get_filtered_data()):
1586
+ return {
1587
+ "user count": len(data),
1588
+ "avg_balance": data["account_balance"].mean() if len(data) > 0 else 0,
1589
+ "total_balance": data["account_balance"].sum() if len(data) > 0 else 0,
1590
+ }
1591
+ return (get_metrics,)
1592
+
1593
+
1594
+ @app.cell(hide_code=True)
1595
+ def _(get_metrics, mo):
1596
+ def metrics_display(metrics=get_metrics()):
1597
+ return mo.hstack(
1598
+ [
1599
+ mo.vstack(
1600
+ [
1601
+ mo.md("### Selected Users"),
1602
+ mo.md(f"## {metrics['user count']}"),
1603
+ ],
1604
+ align="center",
1605
+ ),
1606
+ mo.vstack(
1607
+ [
1608
+ mo.md("### Average Balance"),
1609
+ mo.md(f"## ${metrics['avg_balance']:.2f}"),
1610
+ ],
1611
+ align="center",
1612
+ ),
1613
+ mo.vstack(
1614
+ [
1615
+ mo.md("### Total Balance"),
1616
+ mo.md(f"## ${metrics['total_balance']:.2f}"),
1617
+ ],
1618
+ align="center",
1619
+ ),
1620
+ ],
1621
+ justify="space-between",
1622
+ gap=2,
1623
+ )
1624
+ return (metrics_display,)
1625
+
1626
+
1627
+ @app.cell(hide_code=True)
1628
+ def _(get_filtered_data, max_age, min_age, mo, plotly_express):
1629
+ def create_visualization(
1630
+ data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value
1631
+ ):
1632
+ if len(data) == 0:
1633
+ return mo.md("No data available for the selected age range.")
1634
+
1635
+ # Create visualizations for filtered data
1636
+ fig1 = plotly_express.bar(
1637
+ data,
1638
+ x="name",
1639
+ y="account_balance",
1640
+ title=f"Account Balance by User (Age {min_val} - {max_val})",
1641
+ labels={"account_balance": "Account Balance ($)", "name": "User"},
1642
+ color="account_balance",
1643
+ color_continuous_scale=plotly_express.colors.sequential.Plasma,
1644
+ text_auto=".2s",
1645
+ )
1646
+ fig1.update_layout(
1647
+ height=400,
1648
+ xaxis_tickangle=-45,
1649
+ margin=dict(t=50, b=70, l=50, r=30),
1650
+ hoverlabel=dict(bgcolor="white", font_size=12),
1651
+ template="plotly_white",
1652
+ )
1653
+ fig1.update_traces(
1654
+ textposition="outside",
1655
+ )
1656
+
1657
+ fig2 = plotly_express.histogram(
1658
+ data,
1659
+ x="age",
1660
+ nbins=min(10, len(set(data["age"]))),
1661
+ title=f"Age Distribution (Age {min_val} - {max_val})",
1662
+ color_discrete_sequence=["#4C78A8"],
1663
+ opacity=0.8,
1664
+ histnorm="probability density",
1665
+ )
1666
+ fig2.update_layout(
1667
+ height=400,
1668
+ margin=dict(t=50, b=70, l=50, r=30),
1669
+ bargap=0.1,
1670
+ hoverlabel=dict(bgcolor="white", font_size=12),
1671
+ template="plotly_white",
1672
+ )
1673
+
1674
+ fig3 = plotly_express.scatter(
1675
+ data,
1676
+ x="age",
1677
+ y="account_balance",
1678
+ title=f"Age vs. Account Balance (Age {min_val} - {max_val})",
1679
+ labels={"account_balance": "Account Balance ($)", "age": "Age"},
1680
+ color="age",
1681
+ color_continuous_scale="Viridis",
1682
+ size_max=25,
1683
+ size="account_balance",
1684
+ hover_name="name",
1685
+ )
1686
+ fig3.update_layout(
1687
+ height=400,
1688
+ margin=dict(t=50, b=70, l=50, r=30),
1689
+ hoverlabel=dict(bgcolor="white", font_size=12),
1690
+ template="plotly_white",
1691
+ )
1692
+
1693
+ return mo.ui.tabs(
1694
+ {
1695
+ "Account Balance by User": mo.ui.plotly(fig1),
1696
+ "Age Distribution": mo.ui.plotly(fig2),
1697
+ "Age vs. Account Balance": mo.ui.plotly(fig3),
1698
+ }
1699
+ )
1700
+ return (create_visualization,)
1701
+
1702
+
1703
+ @app.cell(hide_code=True)
1704
+ def _(
1705
+ create_visualization,
1706
+ get_filtered_data,
1707
+ max_age,
1708
+ metrics_display,
1709
+ min_age,
1710
+ mo,
1711
+ ):
1712
+ def dashboard(
1713
+ min_val=min_age.value,
1714
+ max_val=max_age.value,
1715
+ metrics=metrics_display(),
1716
+ data=get_filtered_data(),
1717
+ visualization=create_visualization(),
1718
+ ):
1719
+ return mo.vstack(
1720
+ [
1721
+ mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"),
1722
+ metrics,
1723
+ mo.md("### Data Table"),
1724
+ mo.ui.table(data, page_size=5),
1725
+ mo.md("### Visualizations"),
1726
+ visualization,
1727
+ ],
1728
+ gap=2,
1729
+ justify="space-between",
1730
+ )
1731
+
1732
+
1733
+ dashboard()
1734
+ return
1735
+
1736
+
1737
+ @app.cell(hide_code=True)
1738
+ def _(mo):
1739
+ mo.md(
1740
+ rf"""
1741
+ # Summary and Key Takeaways
1742
+
1743
+ In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
1744
+
1745
+ 1. **Connection types**: We learned the difference between in-memory databases (temporary) and file-based databases (persistent).
1746
+
1747
+ 2. **Table creation**: We created tables with various data types, constraints, and primary keys.
1748
+
1749
+ 3. **Data insertion**: We demonstrated different ways to insert data, including single inserts and bulk loading.
1750
+
1751
+ 4. **SQL queries**: We executed various SQL queries directly and through marimo's UI components.
1752
+
1753
+ 5. **Integration with Polars**: We showed how DuckDB can work seamlessly with Polars DataFrames.
1754
+
1755
+ 6. **Joins and relationships**: We performed JOIN operations between tables to combine related data.
1756
+
1757
+ 7. **Aggregation**: We used aggregate functions to summarize and analyze data.
1758
+
1759
+ 8. **Data conversion**: We converted DuckDB results to both Polars and Pandas DataFrames.
1760
+
1761
+ 9. **Best practices**: We reviewed best practices for managing DuckDB connections and transactions.
1762
+
1763
+ 10. **Visualization**: We created interactive visualizations and dashboards with Plotly and marimo.
1764
+
1765
+ DuckDB is well suited to analytical workloads. Because it runs in-process, there is no server to set up or connect to, and its SQL dialect is accessible to anyone familiar with SQL databases.
1766
+
1767
+ ### Next Steps
1768
+
1769
+ - Try loading larger datasets into DuckDB
1770
+ - Experiment with more complex queries and window functions
1771
+ - Use DuckDB's COPY functionality to import/export data from/to files
1772
+ - Create more advanced interactive dashboards with marimo and Plotly
1773
+ """
1774
+ )
1775
+ return
1776
+
1777
+
1778
+ @app.cell(hide_code=True)
1779
+ def _():
1780
+ import marimo as mo
1781
+ import duckdb
1782
+ import polars as pl
1783
+ import os
1784
+ from datetime import date
1785
+ import plotly.express as plotly_express
1786
+ import plotly.graph_objects as plotly_graph_objects
1787
+ import numpy as np
1788
+ return date, duckdb, mo, os, pl, plotly_express
1789
+
1790
+
1791
+ if __name__ == "__main__":
1792
+ app.run()