Azmi-84 committed on
Commit
e5fc993
·
1 Parent(s): dd84d5b

Enhance and redesign DuckDB introductory notebook

This commit addresses and resolves the suggestions provided in the review, including:

- Ensuring the notebook follows the best practices outlined in the contribution guidelines.
- Removing irrelevant markdown blocks and using marimo features.

Additionally, the notebook has been completely redesigned with:
- Improved structure and flow for better readability and learning experience.
- Enhanced examples and interactive content for database connections, table creation, and data manipulation.
- Better integration of visuals using Plotly and Marimo for basic interactive analysis.
- Updated dependency management using an inline `# /// script` metadata header for reproducibility.

The notebook now provides a polished and user-friendly guide to DuckDB, ensuring a high-quality learning experience for users.

Files changed (1)
  1. duckdb/01_getting_started.py +1531 -121
duckdb/01_getting_started.py CHANGED
@@ -1,232 +1,1638 @@
1
  import marimo
2
 
3
- __generated_with = "0.11.30"
4
- app = marimo.App(width="medium")
5
 
6
 
7
  @app.cell(hide_code=True)
8
- def _introduction(mo):
9
  mo.md(
10
- """
11
- # DuckDB: An Embeddable Analytical Database System
12
 
13
- ### What is DuckDB?
14
 
15
- [DuckDB](https://duckdb.org/) is a high-performance, in-process analytical database management system (DBMS) designed for speed and simplicity. It's particularly well-suited for analytical query workloads, offering a robust SQL interface and efficient data processing capabilities. This document highlights key features and aspects of DuckDB relevant for a course on database systems or data analysis.
16
 
17
- ### [Key Features](https://duckdb.org/why_duckdb):
18
 
19
- - In-Process: Easy integration, zero external dependencies.
20
- - Portable: Works on various OS and architectures.
21
- - Columnar Storage: Efficient for analytical queries.
22
- - Vectorized Execution: Speeds up data processing.
23
- - ACID Transactions: Ensures data integrity.
24
- - Multi-Language APIs: Python, R, Java, etc.
25
 
26
- ### [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file):
27
 
28
- - Data analysis and exploration
29
- - Embedded analytics in applications
30
- - ETL (Extract, Transform, Load) processes
31
- - Data science and machine learning workflows
32
- - Rapid prototyping of data analysis pipelines.
33
 
34
- ### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python):
35
 
36
- - The DuckDB Python API can be installed using pip:
37
- ```
38
- pip install duckdb
39
- ```
40
- - It is also possible to install DuckDB using conda:
41
- ```
42
- conda install python-duckdb -c conda-forge.
43
- ```
44
 
45
- **Python version:** DuckDB requires Python 3.7 or newer.
46
- """
47
  )
48
  return
49
 
50
 
51
  @app.cell(hide_code=True)
52
  def _(mo):
53
  mo.md(
54
  r"""
55
- # [1. DuckDB Basic Connection](https://duckdb.org/docs/stable/connect/overview.html)
56
-
57
- DuckDB can run entirely in your computer's RAM, known as in-memory mode, which you can enable by using `:memory:` as the database name or by not providing a database file. It's crucial to understand that this means all data is temporary and will be completely erased when the program closes, as it isn't saved to disk.
58
- """
59
  )
60
  return
61
 
62
 
63
  @app.cell
64
- def _database_connection(duckdb):
65
- # Create a connection to an in-memory database
66
- database_connection = duckdb.connect(database=":memory:")
67
- print(
68
- f"DuckDB version: {database_connection.execute('SELECT version()').fetchone()[0]}"
69
- )
70
- return (database_connection,)
71
 
72
 
73
  @app.cell(hide_code=True)
74
  def _(mo):
75
- mo.md(
76
- r"""# [2. Creating Tables](https://duckdb.org/docs/stable/sql/statements/create_table.html)"""
77
- )
78
  return
79
 
80
 
81
  @app.cell
82
- def _create_users_table(database_connection):
83
- database_connection.execute(
84
- """
85
- CREATE TABLE users (
86
  id INTEGER,
87
- name VARCHAR,
88
- age INTEGER,
89
- registration_date DATE
90
  )
91
  """
92
  )
93
  return
94
 
95
 
96
  @app.cell(hide_code=True)
97
  def _(mo):
98
  mo.md(
99
- r"""# [3. Instering data into table](https://duckdb.org/docs/stable/sql/statements/insert)"""
100
  )
101
  return
102
 
103
 
104
  @app.cell
105
- def _insert_user_data(database_connection):
106
- database_connection.execute(
107
- """
108
- INSERT INTO users VALUES
109
- (1, 'Alice', 25, '2021-01-01'),
110
- (2, 'Bob', 30, '2021-02-01'),
111
- (3, 'Charlie', 35, '2021-03-01')
112
- """
113
  )
114
  return
115
 
116
 
117
  @app.cell(hide_code=True)
118
  def _(mo):
119
  mo.md(
120
- r"""# [4. Basic Queries](https://duckdb.org/docs/stable/sql/query_syntax/select)"""
121
  )
122
  return
123
 
124
 
125
  @app.cell
126
- def _basic_queries(database_connection):
127
- # Select all data
128
- user_results = database_connection.execute("SELECT * FROM users").fetchall()
129
- for user_row in user_results:
130
- print(user_row)
131
- return user_results, user_row
132
 
133
 
134
  @app.cell(hide_code=True)
135
  def _(mo):
136
  mo.md(
137
- r"""# [5. Working with Polars](https://duckdb.org/docs/stable/guides/python/polars.html)"""
138
  )
139
  return
140
 
141
 
142
  @app.cell
143
- def _polars_dataframe(database_connection, pl):
144
- # Create a Polars DataFrame
145
- polars_dataframe = pl.DataFrame(
146
- {
147
- "id": [1, 2, 3],
148
- "name": ["Alice", "Bob", "Charlie"],
149
- "age": [25, 30, 35],
150
- "registration_date": ["2021-01-01", "2021-02-01", "2021-03-01"],
151
- }
152
  )
153
 
154
- # Register the Polars DataFrame as a DuckDB table
155
- database_connection.register("users_polars", polars_dataframe)
156
 
157
- # Query the Polars DataFrame using DuckDB
158
- polars_results = database_connection.execute(
159
- "SELECT * FROM users_polars"
160
- ).fetchall()
161
- print("New Table:")
162
- for polars_row in polars_results:
163
- print(polars_row)
164
- return polars_dataframe, polars_results, polars_row
165
 
166
 
167
  @app.cell(hide_code=True)
168
  def _(mo):
169
  mo.md(
170
- r"""# [6. Join Operations](https://duckdb.org/docs/stable/guides/performance/join_operations.html)"""
171
  )
172
  return
173
 
174
 
175
  @app.cell
176
- def _join_operations(database_connection):
177
- join_results = database_connection.execute(
178
- """
179
- SELECT u.id, u.name, u.age, nu.registration_date
180
- FROM users u
181
- JOIN users_polars nu ON u.age < nu.age
182
  """
183
  )
184
- print("Join Result:")
185
- for join_row in join_results.fetchall():
186
- print(join_row)
187
- return join_results, join_row
188
 
189
 
190
  @app.cell(hide_code=True)
191
  def _(mo):
192
  mo.md(
193
- r"""# [7. Aggregate Functions](https://duckdb.org/docs/stable/sql/functions/aggregates.html)"""
194
  )
195
  return
196
 
197
 
198
  @app.cell
199
- def _aggregate_functions(database_connection):
200
- aggregate_results = database_connection.execute(
201
- """
202
- SELECT AVG(age) as avg_age, MAX(age) as max_age, MIN(age) as min_age
203
- FROM (SELECT * FROM users UNION ALL SELECT * FROM users_polars) AS all_users
204
  """
205
- ).fetchall()
206
- print(
207
- f"Average Age: {aggregate_results[0][0]:.1f}, "
208
- f"Max Age: {aggregate_results[0][1]}, "
209
- f"Min Age: {aggregate_results[0][2]}"
210
  )
211
- return (aggregate_results,)
212
 
213
 
214
  @app.cell(hide_code=True)
215
  def _(mo):
216
  mo.md(
217
- r"""# [8. Converting Results to Polars DataFrames](https://duckdb.org/docs/stable/guides/python/polars.html)"""
218
  )
219
  return
220
 
221
 
222
  @app.cell
223
- def _convert_to_polars(database_connection):
224
- # -- 8. Converting Results to Polars DataFrames --
225
- # Convert the result to a Polars DataFrame
226
- polars_result_df = database_connection.execute("SELECT * FROM users").df()
227
- print("Result as Polars DataFrame:")
228
- print(polars_result_df)
229
- return (polars_result_df,)
230
 
231
 
232
  @app.cell(hide_code=True)
@@ -234,8 +1640,12 @@ def _():
234
  import marimo as mo
235
  import duckdb
236
  import polars as pl
237
-
238
- return duckdb, mo, pl
239
 
240
 
241
  if __name__ == "__main__":
 
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "marimo",
5
+ # "duckdb==1.2.2",
6
+ # "polars==1.27.0",
7
+ # "numpy==2.2.4",
8
+ # "pyarrow==19.0.1",
9
+ # "pandas==2.2.3",
10
+ # "sqlglot==26.12.1",
11
+ # "plotly==5.23.1",
12
+ # ]
13
+ # ///
14
+
15
  import marimo
16
 
17
+ __generated_with = "0.13.4"
18
+ app = marimo.App(width="medium")
19
+
20
+
21
+ @app.cell(hide_code=True)
22
+ def _(mo):
23
+ mo.md(
24
+ rf"""
25
+ <p align="center">
26
+ <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
27
+ </p>
28
+ """
29
+ )
30
+ return
31
+
32
+
33
+ @app.cell(hide_code=True)
34
+ def _(mo):
35
+ mo.md(
36
+ rf"""
37
+ # 🦆 **DuckDB**: An Embeddable Analytical Database System
38
+
39
+ ## What is DuckDB?
40
+
41
+ [DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.
42
+
43
+ ---
44
+
45
+ ## ⚡ Key Features
46
+
47
+ | Feature | Description |
48
+ |:---------|:-------------|
49
+ | **In-Process Architecture** | Runs directly within your application's memory space - no separate server needed, simplifying deployment |
50
+ | **Columnar Storage** | Data stored in columns instead of rows, dramatically improving performance for analytical queries |
51
+ | **Vectorized Execution** | Processes data in batches (vectors) of values rather than one row at a time, significantly speeding up data processing |
52
+ | **ACID Transactions** | Ensures data integrity and reliability across operations |
53
+ | **Multi-Language Support** | Provides APIs for `Python`, `R`, `Java`, `C++`, and more |
54
+ | **Zero External Dependencies** | Minimal dependencies, making setup and deployment straightforward |
55
+ | **High Portability** | Works across various operating systems (Windows, macOS, Linux) and hardware architectures |
56
+
57
+ ---
58
+
59
+ ## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file):
60
+
61
+ - **Data Analysis and Exploration:** DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis.
62
+ - **Embedded Analytics in Applications:** You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server.
63
+ - **ETL (Extract, Transform, Load) Processes:** DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline.
64
+ - **Data Science and Machine Learning Workflows:** It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models.
65
+ - **Rapid Prototyping of Data Analysis Pipelines:** Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment.
66
+ - **Small to Medium Datasets:** DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server.
67
+
68
+ ---
69
+
70
+ ## [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python):
71
+
72
+ - Python installation:
73
+ ```
74
+ pip install duckdb
75
+ ```
76
+ ```
77
+ conda install python-duckdb -c conda-forge
78
+ ```
79
+
80
+ <!-- >**_Note_:** DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system. -->
81
+
82
+ /// attention | Note
83
+ DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
84
+ ///
85
+ """
86
+ )
87
+ return
88
+
89
+
90
+ @app.cell(hide_code=True)
91
+ def _(mo):
92
+ mo.md(
93
+ r"""
94
+ # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)
95
+
96
+ DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.
97
+
98
+ ---
99
+
100
+ | Feature | In-Memory Connection | File-Based Connection |
101
+ |:---------|:---------------------|:----------------------|
102
+ | Persistence | Temporary (lost when session ends) | Stored on disk (persists between sessions) |
103
+ | Use Cases | Quick analysis, ephemeral data, testing | Long-term storage, data that needs to be accessed later |
104
+ | Performance | Faster for most operations | Slightly slower but provides persistence |
105
+ | Creation | `duckdb.connect(':memory:')` | `duckdb.connect('filename.db')` |
106
+ | Multiple Connection Access | Each `duckdb.connect(':memory:')` call opens its own separate database | Connections in the same process can share the file; separate processes can share it only in read-only mode |
107
+
108
+ """
109
+ )
110
+ return
111
+
112
+
113
+ @app.cell
114
+ def _(os):
115
+ # Remove previous database if it exists
116
+ if os.path.exists("example.db"):
117
+ os.remove("example.db")
118
+
119
+ if not os.path.exists("data"):
120
+ os.makedirs("data")
121
+ return
122
+
123
+
124
+ @app.cell
125
+ def _(mo):
126
+ _df = mo.sql(
127
+ f"""
128
+ -- Print the DuckDB version
129
+ SELECT version() AS version_info
130
+ """
131
+ )
132
+ return
133
+
134
+
135
+ @app.cell(hide_code=True)
136
+ def _(mo):
137
+ mo.md(
138
+ """
139
+ ## Creating DuckDB Connections
140
+
141
+ Let's create both types of DuckDB connections and explore their characteristics.
142
+
143
+ 1. **In-memory connection**: Data exists only during the current session
144
+ 2. **File-based connection**: Data persists between sessions
145
+
146
+ We'll then demonstrate the key differences between these connection types.
147
+ """
148
+ )
149
+ return
150
+
151
+
152
+ @app.cell
153
+ def _(duckdb):
154
+ # Create an in-memory DuckDB connection
155
+ memory_db = duckdb.connect(":memory:")
156
+
157
+ # Create a file-based DuckDB connection
158
+ file_db = duckdb.connect("example.db")
159
+ return file_db, memory_db
160
+
161
+
162
+ @app.cell
163
+ def _(file_db, memory_db):
164
+ # Test both connections
165
+ memory_db.execute(
166
+ "CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)"
167
+ )
168
+ memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')")
169
+
170
+ file_db.execute(
171
+ "CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)"
172
+ )
173
+ file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')")
174
+ return
175
+
176
+
177
+ @app.cell(hide_code=True)
178
+ def _(mo):
179
+ mo.md(
180
+ r"""
181
+ ## Testing Connection Persistence
182
+
183
+ Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.
184
+
185
+ 1. First, we'll query our tables to confirm the data was properly inserted
186
+ 2. Then, we'll simulate an application restart by creating new connections
187
+ 3. Finally, we'll check which data persists after the "restart"
188
+ """
189
+ )
190
+ return
191
+
192
+
193
+ @app.cell(hide_code=True)
194
+ def _(mo):
195
+ mo.md(r"""## Current Database Contents""")
196
+ return
197
+
198
+
199
+ @app.cell
200
+ def _(mem_test, memory_db, mo):
201
+ _df = mo.sql(
202
+ f"""
203
+ SELECT * FROM mem_test
204
+ """,
205
+ engine=memory_db
206
+ )
207
+ return
208
+
209
+
210
+ @app.cell
211
+ def _(file_db, file_test, mo):
212
+ _df = mo.sql(
213
+ f"""
214
+ SELECT * FROM file_test
215
+ """,
216
+ engine=file_db
217
+ )
218
+ return
219
+
220
+
221
+ @app.cell
222
+ def _():
223
+ # We don't actually close the connections here since we need them for later cells
224
+ # Just a placeholder for the concept
225
+ return
226
+
227
+
228
+ @app.cell(hide_code=True)
229
+ def _file_query(mo):
230
+ mo.md(rf"""## 🔄 Simulating Application Restart...""")
231
+ return
232
+
233
+
234
+ @app.cell
235
+ def _(duckdb):
236
+ # Create new connections (simulating restart)
237
+ new_memory_db = duckdb.connect(":memory:")
238
+ new_file_db = duckdb.connect("example.db")
239
+ return new_file_db, new_memory_db
240
+
241
+
242
+ @app.cell
243
+ def _(new_memory_db):
244
+ # Try to query tables in the new memory connection
245
+ try:
246
+ new_memory_db.execute("SELECT * FROM mem_test").df()
247
+ memory_persistence = "✅ Data persisted in memory (unexpected)"
248
+ memory_data_available = True
249
+ except Exception as e:
250
+ memory_persistence = "❌ Data lost from memory (expected behavior)"
251
+ memory_data_available = False
252
+ return memory_data_available, memory_persistence
253
+
254
+
255
+ @app.cell
256
+ def _(new_file_db):
257
+ # Try to query tables in the new file connection
258
+ try:
259
+ file_data = new_file_db.execute("SELECT * FROM file_test").df()
260
+ file_persistence = "✅ Data persisted in file (expected behavior)"
261
+ file_data_available = True
262
+ except Exception as e:
263
+ file_persistence = "❌ Data lost from file (unexpected)"
264
+ file_data_available = False
265
+ file_data = None
266
+ return file_data, file_data_available, file_persistence
267
+
268
+
269
+ @app.cell
270
+ def _(
271
+ file_data_available,
272
+ file_persistence,
273
+ memory_data_available,
274
+ memory_persistence,
275
+ mo,
276
+ ):
277
+ # Create an interactive display to show persistence results
278
+ persistence_results = mo.ui.table(
279
+ {
280
+ "Connection Type": ["In-Memory Database", "File-Based Database"],
281
+ "Persistence Status": [memory_persistence, file_persistence],
282
+ "Data Available After Restart": [
283
+ memory_data_available,
284
+ file_data_available,
285
+ ],
286
+ }
287
+ )
288
+
289
+ mo.md("### Persistence Test Results")
290
+ return (persistence_results,)
291
+
292
+
293
+ @app.cell
294
+ def _(persistence_results):
295
+ persistence_results
296
+ return
297
+
298
+
299
+ @app.cell
300
+ def _(file_data, file_data_available, mo):
301
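+ # Only a cell's last expression is rendered, so build a single output value
302
+ _heading = mo.md("### Persisted File-Based Data:")
303
+ mo.vstack([_heading, mo.ui.table(file_data)]) if file_data_available else None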
304
+ return
305
+
306
+
307
+ @app.cell(hide_code=True)
308
+ def _(mo):
309
+ mo.md(
310
+ r"""
311
+ # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)
312
+
313
+ DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.
314
+
315
+ ## Table Creation Options
316
+
317
+ DuckDB supports various table creation options, including:
318
+
319
+ - **Basic tables** with column definitions
320
+ - **Temporary tables** that exist only during the session
321
+ - **CREATE OR REPLACE** to recreate tables
322
+ - **Primary keys** and other constraints
323
+ - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
324
+ """
325
+ )
326
+ return
327
+
328
+
329
+ @app.cell
330
+ def _create_users_tables(file_db, new_memory_db):
331
+ # For the memory database
332
+ try:
333
+ new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
334
+ except:
335
+ pass
336
+
337
+ # For the file database
338
+ try:
339
+ file_db.execute("DROP TABLE IF EXISTS users_file")
340
+ except:
341
+ pass
342
+ return
343
+
344
+
345
+ @app.cell
346
+ def _(file_db, new_memory_db):
347
+ # Create advanced users table in memory database with primary key
348
+ new_memory_db.execute("""
349
+ CREATE TABLE users_memory (
350
+ id INTEGER PRIMARY KEY,
351
+ name VARCHAR NOT NULL,
352
+ age INTEGER CHECK (age > 0),
353
+ email VARCHAR UNIQUE,
354
+ registration_date DATE DEFAULT CURRENT_DATE,
355
+ last_login TIMESTAMP,
356
+ account_balance DECIMAL(10,2) DEFAULT 0.00
357
+ )
358
+ """)
359
+
360
+ # Create users table in file database
361
+ file_db.execute("""
362
+ CREATE TABLE users_file (
363
+ id INTEGER PRIMARY KEY,
364
+ name VARCHAR NOT NULL,
365
+ age INTEGER CHECK (age > 0),
366
+ email VARCHAR UNIQUE,
367
+ registration_date DATE DEFAULT CURRENT_DATE,
368
+ last_login TIMESTAMP,
369
+ account_balance DECIMAL(10,2) DEFAULT 0.00
370
+ )
371
+ """)
372
+ return
373
+
374
+
375
+ @app.cell
376
+ def _(mo, new_memory_db):
377
+ # Get table schema information from the standard information_schema.columns view
378
+ memory_schema = new_memory_db.execute("""
379
+ SELECT column_name, data_type, is_nullable
380
+ FROM information_schema.columns
381
+ WHERE table_name = 'users_memory'
382
+ ORDER BY ordinal_position
383
+ """).df()
384
+
385
+ # Display the schema using marimo's UI components
386
+ mo.md("### 🔍 Table Schema Information")
387
+ return (memory_schema,)
388
+
389
+
390
+ @app.cell(hide_code=True)
391
+ def _(memory_schema, mo):
392
+ mo.ui.table(memory_schema)
393
+ return
394
+
395
+
396
+ @app.cell(hide_code=True)
397
+ def _(mo):
398
+ mo.md(
399
+ r"""
400
+ # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)
401
+
402
+ DuckDB supports multiple ways to insert data:
403
+
404
+ 1. **INSERT INTO VALUES**: Insert specific values
405
+ 2. **INSERT INTO SELECT**: Insert data from query results
406
+ 3. **Parameterized inserts**: Using prepared statements
407
+ 4. **Bulk inserts**: For efficient loading of multiple rows
408
+
409
+ Let's demonstrate these different insertion methods:
410
+ """
411
+ )
412
+ return
413
+
414
+
415
+ @app.cell
416
+ def _insert_user_data(date):
417
+ today = date.today()
418
+
419
+
420
+ # First check if records already exist to avoid duplicate key errors
421
+ def safe_insert(connection, table_name, data):
422
+ """
423
+ Safely insert data into a table by checking for existing IDs first
424
+ """
425
+ # Check which IDs already exist in the table
426
+ existing_ids = (
427
+ connection.execute(f"SELECT id FROM {table_name}")
428
+ .fetchdf()["id"]
429
+ .tolist()
430
+ )
431
+
432
+ # Filter out data with IDs that already exist
433
+ new_data = [record for record in data if record[0] not in existing_ids]
434
+
435
+ if not new_data:
436
+ print(
437
+ f"No new records to insert into {table_name}. All IDs already exist."
438
+ )
439
+ return 0
440
+
441
+ # Prepare the placeholders for the SQL statement
442
+ placeholders = ", ".join(
443
+ ["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data)
444
+ )
445
+
446
+ # Flatten the list of tuples for parameter binding
447
+ flat_data = [item for sublist in new_data for item in sublist]
448
+
449
+ # Perform the insertion
450
+ if flat_data:
451
+ columns = "(id, name, age, email, registration_date, last_login, account_balance)"
452
+ connection.execute(
453
+ f"INSERT INTO {table_name} {columns} VALUES {placeholders}",
454
+ flat_data,
455
+ )
456
+ return len(new_data)
457
+ return 0
458
+ return (safe_insert,)
459
+
460
+
461
+ @app.cell
462
+ def _():
463
+ # Prepare the data
464
+ user_data = [
465
+ (
466
+ 1,
467
+ "Alice",
468
+ 25,
469
+ "alice@example.com",
470
+ "2021-01-01",
471
+ "2023-01-15 14:30:00",
472
+ 1250.75,
473
+ ),
474
+ (
475
+ 2,
476
+ "Bob",
477
+ 30,
478
+ "bob@example.com",
479
+ "2021-02-01",
480
+ "2023-02-10 09:15:22",
481
+ 750.50,
482
+ ),
483
+ (
484
+ 3,
485
+ "Charlie",
486
+ 35,
487
+ "charlie@example.com",
488
+ "2021-03-01",
489
+ "2023-03-05 17:45:10",
490
+ 3200.25,
491
+ ),
492
+ (
493
+ 4,
494
+ "David",
495
+ 40,
496
+ "david@example.com",
497
+ "2021-04-01",
498
+ "2023-04-20 10:30:45",
499
+ 1800.00,
500
+ ),
501
+ (
502
+ 5,
503
+ "Emma",
504
+ 45,
505
+ "emma@example.com",
506
+ "2021-05-01",
507
+ "2023-05-12 11:20:30",
508
+ 2500.00,
509
+ ),
510
+ (
511
+ 6,
512
+ "Frank",
513
+ 50,
514
+ "frank@example.com",
515
+ "2021-06-01",
516
+ "2023-06-18 16:10:15",
517
+ 900.25,
518
+ ),
519
+ ]
520
+ return (user_data,)
521
+
522
+
523
+ @app.cell
524
+ def _(mo, new_memory_db, safe_insert, user_data):
525
+ # Safely insert data into memory database
526
+ records_inserted = safe_insert(new_memory_db, "users_memory", user_data)
527
+ mo.md(
528
+ f"""
529
+ Inserted {records_inserted} new records into users_memory.
530
+ """
531
+ )
532
+ return
533
+
534
+
535
+ @app.cell
536
+ def _(file_db, safe_insert, user_data):
537
+ def _():
538
+ # Safely insert data into file database
539
+ records_inserted = safe_insert(file_db, "users_file", user_data)
540
+ return print(f"Inserted {records_inserted} new records into users_file")
541
+
542
+
543
+ _()
544
+ return
545
+
546
+
547
+ @app.cell
548
+ def _():
549
+ # If you need to add just one record, you can use a similar approach:
550
+ new_user = (
551
+ 7,
552
+ "Grace",
553
+ 28,
554
+ "grace@example.com",
555
+ "2021-07-01",
556
+ "2023-07-22 13:45:10",
557
+ 1675.50,
558
+ )
559
+ return (new_user,)
560
+
561
+
562
+ @app.cell
563
+ def _(new_memory_db, new_user):
564
+ # Check if the ID exists before inserting
565
+ if not new_memory_db.execute(
566
+ "SELECT id FROM users_memory WHERE id = ?", [new_user[0]]
567
+ ).fetchone():
568
+ new_memory_db.execute(
569
+ """
570
+ INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance)
571
+ VALUES (?, ?, ?, ?, ?, ?, ?)
572
+ """,
573
+ new_user,
574
+ )
575
+ print(f"Added user {new_user[1]} to users_memory")
576
+ else:
577
+ print(f"User with ID {new_user[0]} already exists in users_memory")
578
+ return
579
+
580
+
581
+ @app.cell
582
+ def _(file_db, new_user):
583
+ # Do the same for the file database
584
+ if not file_db.execute(
585
+ "SELECT id FROM users_file WHERE id = ?", [new_user[0]]
586
+ ).fetchone():
587
+ file_db.execute(
588
+ """
589
+ INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
590
+ VALUES (?, ?, ?, ?, ?, ?, ?)
591
+ """,
592
+ new_user,
593
+ )
594
+ print(f"Added user {new_user[1]} to users_file")
595
+ else:
596
+ print(f"User with ID {new_user[0]} already exists in users_file")
597
+ return
598
+
599
+
600
+ @app.cell
601
+ def _(new_memory_db):
602
+ # First try to update
603
+ cursor = new_memory_db.execute(
604
+ """
605
+ UPDATE users_memory
606
+ SET name = ?, age = ?, email = ?,
607
+ registration_date = ?, last_login = ?, account_balance = ?
608
+ WHERE id = ?
609
+ """,
610
+ (
611
+ "Henry",
612
+ 33,
613
+ "henry@example.com",
614
+ "2021-08-01",
615
+ "2023-08-05 09:10:15",
616
+ 3100.75,
617
+ 8, # ID should be the last parameter
618
+ ),
619
+ )
620
+ return (cursor,)
621
+
622
+
623
+ @app.cell
624
+ def _(cursor, mo, new_memory_db):
625
+ # If no rows were updated, perform an insert
626
+ if cursor.rowcount == 0:
627
+ new_memory_db.execute(
628
+ """
629
+ INSERT INTO users_memory
630
+ (id, name, age, email, registration_date, last_login, account_balance)
631
+ VALUES (?, ?, ?, ?, ?, ?, ?)
632
+ """,
633
+ (
634
+ 8,
635
+ "Henry",
636
+ 33,
637
+ "henry@example.com",
638
+ "2021-08-01",
639
+ "2023-08-05 09:10:15",
640
+ 3100.75,
641
+ ),
642
+ )
643
+
644
+ mo.md(
645
+ f"""
646
+ Upserted Henry into users_memory.
647
+ """
648
+ )
649
+ return
650
+
651
+
652
+ @app.cell
653
+ def _(file_db, mo):
654
+ # For DuckDB using ON CONFLICT, we need to specify the conflict target column
655
+ file_db.execute(
656
+ """
657
+ INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
658
+ VALUES (?, ?, ?, ?, ?, ?, ?)
659
+ ON CONFLICT (id) DO UPDATE SET
660
+ name = EXCLUDED.name,
661
+ age = EXCLUDED.age,
662
+ email = EXCLUDED.email,
663
+ registration_date = EXCLUDED.registration_date,
664
+ last_login = EXCLUDED.last_login,
665
+ account_balance = EXCLUDED.account_balance
666
+ """,
667
+ (
668
+ 8,
669
+ "Henry",
670
+ 33,
671
+ "henry@example.com",
672
+ "2021-08-01",
673
+ "2023-08-05 09:10:15",
674
+ 3100.75,
675
+ ),
676
+ )
677
+
678
+ mo.md(
679
+ f"""
680
+ Upserted Henry into users_file.
681
+ """
682
+ )
683
+ return
684
+
685
+
686
+ @app.cell
687
+ def _view_tables_after_insert(new_memory_db):
688
+ # Display memory data using DuckDB's query capabilities
689
+ memory_results = new_memory_db.execute("""
690
+ SELECT
691
+ id,
692
+ name,
693
+ age,
694
+ email,
695
+ registration_date,
696
+ last_login,
697
+ account_balance
698
+ FROM users_memory
699
+ ORDER BY id
700
+ """).df()
701
+ return (memory_results,)
702
+
703
+
704
+ @app.cell
705
+ def _(file_db):
706
+ # Display file data with formatting
707
+ file_results = file_db.execute("""
708
+ SELECT
709
+ id,
710
+ name,
711
+ age,
712
+ email,
713
+ registration_date,
714
+ last_login,
715
+ CAST(account_balance AS DECIMAL(10,2)) AS account_balance
716
+ FROM users_file
717
+ ORDER BY id
718
+ """).df()
719
+ return (file_results,)
720
+
721
+
722
+ @app.cell
723
+ def _(mo):
724
+ mo.md(
725
+ r"""
726
+ <!-- Create an interactive display with tabs using marimo components -->
727
+ ## 📊 Database Contents After Insertion
728
+ """
729
+ )
730
+ return
731
+
732
+
733
+ @app.cell(hide_code=True)
734
+ def _(file_results, memory_results, mo):
735
+ tabs = mo.ui.tabs(
736
+ {
737
+ "In-Memory Database": mo.ui.table(memory_results),
738
+ "File-Based Database": mo.ui.table(file_results),
739
+ }
740
+ )
741
+ tabs
742
+ return
743
+
744
+
745
+ @app.cell(hide_code=True)
746
+ def _(mo):
747
+ mo.md(
748
+ r"""
749
+ # [4. Using SQL Directly in Marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)
750
+
751
+ There are multiple ways to leverage DuckDB's SQL capabilities in marimo:
752
+
753
+ 1. **Direct execution**: Using DuckDB connections to execute SQL
754
+ 2. **marimo SQL**: Using marimo's `mo.sql`, which runs queries through DuckDB
755
+ 3. **Interactive queries**: Combining UI elements with SQL execution
756
+
757
+ Let's explore these approaches:
758
+ """
759
+ )
760
+ return
761
+
762
+
763
+ @app.cell(hide_code=True)
764
+ def _sql_with_marimo(mo):
765
+ mo.md(
766
+ rf"""
767
+ <!-- Using Marimo's SQL engine with direct SQL on memory_results DataFrame -->
768
+ ## 🔍 Query with Marimo SQL
769
+ """
770
+ )
771
+ return
772
 
773
 
774
  @app.cell(hide_code=True)
775
+ def _(mo):
776
  mo.md(
777
+ rf"""
778
+ marimo's SQL support (`mo.sql`) is backed by DuckDB and can query DataFrames directly.
779
+ The example below keeps things simple and filters the `memory_results` DataFrame with a slider:
780
+ """
781
+ )
782
+ return
783
+
784
+
785
+ @app.cell
786
+ def _(mo):
787
+ # Slider that selects the minimum age used to filter users
788
+ age_threshold = mo.ui.slider(25, 50, value=30, label="Minimum Age")
789
+ return (age_threshold,)
790
+
791
 
792
+ @app.cell
793
+ def _(age_threshold, memory_results, mo):
794
+ # Create a function to filter users based on the slider value
795
+ def filtered_users():
796
+ # Filter the DataFrame with pandas using the current slider value
797
+ filtered_df = memory_results[memory_results["age"] >= age_threshold.value]
798
+ filtered_df = filtered_df.sort_values("age")
799
+ return mo.ui.table(filtered_df)
800
+ return (filtered_users,)
801
 
 
802
 
803
+ @app.cell
804
+ def _(age_threshold, filtered_users, mo):
805
+ layout = mo.vstack(
806
+ [
807
+ mo.md("### Select minimum age:"),
808
+ age_threshold,
809
+ mo.md("### Users meeting age criteria:"),
810
+ filtered_users(),
811
+ ],
812
+ gap=1.5,
813
+ )
814
+ layout
815
+ return
816
 
 
 
 
 
 
 
817
 
818
+ @app.cell(hide_code=True)
819
+ def _(mo):
820
+ mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""")
821
+ return
822
 
 
 
 
 
 
823
 
824
+ @app.cell
825
+ def _polars_integration(pl):
826
+ # Create a Polars DataFrame
827
+ polars_df = pl.DataFrame(
828
+ {
829
+ "id": [101, 102, 103],
830
+ "name": ["Product A", "Product B", "Product C"],
831
+ "price": [29.99, 49.99, 19.99],
832
+ "category": ["Electronics", "Furniture", "Books"],
833
+ }
834
+ )
835
+ return (polars_df,)
836
 
 
 
 
 
 
 
 
 
837
 
838
+ @app.cell
839
+ def _(mo):
840
+ mo.md(
841
+ rf"""
842
+ <!-- Display the Polars DataFrame -->
843
+ ## Original Polars DataFrame:
844
+ """
845
  )
846
  return
847
 
848
 
849
+ @app.cell
850
+ def _(mo, polars_df):
851
+ mo.ui.table(polars_df)
852
+ return
853
+
854
+
855
+ @app.cell
856
+ def _(new_memory_db, polars_df):
857
+ # Register the Polars DataFrame as a DuckDB table in memory connection
858
+ new_memory_db.register("products_polars", polars_df)
859
+
860
+ # Query the registered table
861
+ polars_query_result = new_memory_db.execute(
862
+ "SELECT * FROM products_polars WHERE price > 25"
863
+ ).df()
864
+ return (polars_query_result,)
865
+
866
+
867
  @app.cell(hide_code=True)
868
  def _(mo):
869
  mo.md(
870
  r"""
871
+ <!-- Display the query result -->
872
+ ## DuckDB Query Result (From Polars Data):
873
+ """
 
874
  )
875
  return
876
 
877
 
878
  @app.cell
879
+ def _(mo, polars_query_result):
880
+ mo.ui.table(polars_query_result)
881
+ return
882
+
883
+
884
+ @app.cell
885
+ def _(mo, new_memory_db):
886
+ # Demonstrate a more complex query
887
+ complex_query_result = new_memory_db.execute("""
888
+ SELECT
889
+ category,
890
+ COUNT(*) as product_count,
891
+ AVG(price) as avg_price,
892
+ MIN(price) as min_price,
893
+ MAX(price) as max_price
894
+ FROM products_polars
895
+ GROUP BY category
896
+ ORDER BY avg_price DESC
897
+ """).df()
898
+
899
+ mo.md("## Aggregated Product Data by Category:")
900
+ return (complex_query_result,)
901
+
902
+
903
+ @app.cell
904
+ def _(complex_query_result, mo):
905
+ mo.ui.table(complex_query_result)
906
+ return
907
 
908
 
909
  @app.cell(hide_code=True)
910
  def _(mo):
911
+ mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""")
 
 
912
  return
913
 
914
 
915
  @app.cell
916
+ def _join_operations(new_memory_db):
917
+ # Create another table to join with
918
+ new_memory_db.execute("""
919
+ CREATE TABLE IF NOT EXISTS departments (
920
  id INTEGER,
921
+ department_name VARCHAR,
922
+ manager_id INTEGER
 
923
  )
924
+ """)
925
+ return
926
+
927
+
928
+ @app.cell
929
+ def _(new_memory_db):
930
+ new_memory_db.execute("""
931
+ INSERT INTO departments VALUES
932
+ (101, 'Engineering', 1),
933
+ (102, 'Marketing', 2),
934
+ (103, 'Finance', NULL)
935
+ """)
936
+ return
937
+
938
+
939
+ @app.cell
940
+ def _(new_memory_db):
941
+ # Execute a join query
942
+ join_result = new_memory_db.execute("""
943
+ SELECT
944
+ u.id,
945
+ u.name,
946
+ u.age,
947
+ d.department_name
948
+ FROM users_memory u
949
+ LEFT JOIN departments d ON u.id = d.manager_id
950
+ ORDER BY u.id
951
+ """).df()
952
+ return (join_result,)
953
+
954
+
955
+ @app.cell(hide_code=True)
956
+ def _(mo):
957
+ mo.md(
958
+ rf"""
959
+ <!-- Display the join result -->
960
+ ## Join Result (Users and Departments):
961
  """
962
  )
963
  return
964
 
965
 
966
+ @app.cell
967
+ def _(join_result, mo):
968
+ mo.ui.table(join_result)
969
+ return
970
+
971
+
972
  @app.cell(hide_code=True)
973
  def _(mo):
974
  mo.md(
975
+ rf"""
976
+ <!-- Demonstrate different types of joins -->
977
+ ## Different Types of Joins
978
+ """
979
  )
980
  return
981
 
982
 
983
  @app.cell
984
+ def _(new_memory_db):
985
+ # Inner join
986
+ inner_join = new_memory_db.execute("""
987
+ SELECT u.id, u.name, d.department_name
988
+ FROM users_memory u
989
+ INNER JOIN departments d ON u.id = d.manager_id
990
+ """).df()
991
+
992
+ # Right join
993
+ right_join = new_memory_db.execute("""
994
+ SELECT u.id, u.name, d.department_name
995
+ FROM users_memory u
996
+ RIGHT JOIN departments d ON u.id = d.manager_id
997
+ """).df()
998
+
999
+ # Full outer join
1000
+ full_join = new_memory_db.execute("""
1001
+ SELECT u.id, u.name, d.department_name
1002
+ FROM users_memory u
1003
+ FULL OUTER JOIN departments d ON u.id = d.manager_id
1004
+ """).df()
1005
+ return full_join, inner_join, right_join
1006
+
1007
+
1008
+ @app.cell
1009
+ def _(full_join, inner_join, join_result, mo, right_join):
1010
+ join_tabs = mo.ui.tabs(
1011
+ {
1012
+ "Left Join": mo.ui.table(join_result),
1013
+ "Inner Join": mo.ui.table(inner_join),
1014
+ "Right Join": mo.ui.table(right_join),
1015
+ "Full Outer Join": mo.ui.table(full_join),
1016
+ }
1017
  )
1018
+
1019
+ join_tabs
1020
+ return
1021
+
1022
+
1023
+ @app.cell(hide_code=True)
1024
+ def _(mo):
1025
+ mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""")
1026
  return
1027
 
1028
 
1029
+ @app.cell
1030
+ def _aggregate_operations(new_memory_db):
1031
+ # Execute an aggregate query
1032
+ agg_result = new_memory_db.execute("""
1033
+ SELECT
1034
+ AVG(age) as avg_age,
1035
+ MAX(age) as max_age,
1036
+ MIN(age) as min_age,
1037
+ COUNT(*) as total_users,
1038
+ SUM(account_balance) as total_balance
1039
+ FROM users_memory
1040
+ """).df()
1041
+ return (agg_result,)
1042
+
1043
+
1044
  @app.cell(hide_code=True)
1045
  def _(mo):
1046
  mo.md(
1047
+ rf"""
1048
+ <!-- Display the aggregate result -->
1049
+ ## Aggregate Results (All Users):
1050
+ """
1051
  )
1052
  return
1053
 
1054
 
1055
  @app.cell
1056
+ def _(agg_result, mo):
1057
+ mo.ui.table(agg_result)
1058
+ return
 
 
 
1059
 
1060
 
1061
  @app.cell(hide_code=True)
1062
  def _(mo):
1063
  mo.md(
1064
+ rf"""
1065
+ <!-- More complex aggregate query with grouping -->
1066
+ ## Aggregate Results (Grouped by Age Range):
1067
+ """
1068
  )
1069
  return
1070
 
1071
 
1072
  @app.cell
1073
+ def _(new_memory_db):
1074
+ age_groups = new_memory_db.execute("""
1075
+ SELECT
1076
+ CASE
1077
+ WHEN age < 30 THEN 'Under 30'
1078
+ WHEN age BETWEEN 30 AND 40 THEN '30 to 40'
1079
+ ELSE 'Over 40'
1080
+ END as age_group,
1081
+ COUNT(*) as count,
1082
+ AVG(age) as avg_age,
1083
+ AVG(account_balance) as avg_balance
1084
+ FROM users_memory
1085
+ GROUP BY 1
1086
+ ORDER BY 1
1087
+ """).df()
1088
+ return (age_groups,)
1089
+
1090
+
1091
+ @app.cell
1092
+ def _(age_groups, mo):
1093
+ mo.ui.table(age_groups)
1094
+ return
1095
+
1096
+
1097
+ @app.cell
1098
+ def _(mo):
1099
+ mo.md(
1100
+ r"""
1101
+ <!-- Window functions demo -->
1102
+ ### Window Functions Example:
1103
+ """
1104
  )
1105
+ return
1106
+
1107
 
1108
+ @app.cell
1109
+ def _(mo, new_memory_db):
1110
+ window_result = new_memory_db.execute("""
1111
+ SELECT
1112
+ id,
1113
+ name,
1114
+ age,
1115
+ account_balance,
1116
+ RANK() OVER (ORDER BY account_balance DESC) as balance_rank,
1117
+ account_balance - AVG(account_balance) OVER () as diff_from_avg,
1118
+ account_balance / SUM(account_balance) OVER () * 100 as pct_of_total
1119
+ FROM users_memory
1120
+ ORDER BY balance_rank
1121
+ """).df()
1122
+
1123
+ mo.ui.table(window_result)
1124
+ return
1125
+
1126
+
1127
+ @app.cell(hide_code=True)
1128
+ def _(mo):
1129
+ mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""")
1130
+ return
1131
 
1132
+
1133
+ @app.cell
1134
+ def _convert_results(new_memory_db):
1135
+ polars_result = new_memory_db.execute(
1136
+ """SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
1137
+ ).pl()
1138
+ return (polars_result,)
 
1139
 
1140
 
1141
  @app.cell(hide_code=True)
1142
  def _(mo):
1143
  mo.md(
1144
+ r"""
1145
+ <!-- Display the converted results -->
1146
+ ## Query Result as Polars DataFrame:
1147
+ """
1148
  )
1149
  return
1150
 
1151
 
1152
  @app.cell
1153
+ def _(mo, polars_result):
1154
+ mo.ui.table(polars_result)
1155
+ return
1156
+
1157
+
1158
+ @app.cell
1159
+ def _(new_memory_db):
1160
+ pandas_result = new_memory_db.execute(
1161
+ """SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
1162
+ ).fetch_df()
1163
+ return (pandas_result,)
1164
+
1165
+
1166
+ @app.cell(hide_code=True)
1167
+ def _(mo):
1168
+ mo.md(r"""## Same Query Result as Pandas DataFrame:""")
1169
+ return
1170
+
1171
+
1172
+ @app.cell
1173
+ def _(mo, pandas_result):
1174
+ mo.ui.table(pandas_result)
1175
+ return
1176
+
1177
+
1178
+ @app.cell(hide_code=True)
1179
+ def _(mo):
1180
+ mo.md(
1181
+ r"""
1182
+ <!-- Demonstrate the differences in handling -->
1183
+ ## Differences in DataFrame Handling
1184
  """
1185
  )
1186
+ return
 
 
 
1187
 
1188
 
1189
  @app.cell(hide_code=True)
1190
  def _(mo):
1191
  mo.md(
1192
+ r"""
1193
+ <!-- Polars operation -->
1194
+ ## Polars: Filter users over 35 and calculate average balance
1195
+ """
1196
  )
1197
  return
1198
 
1199
 
1200
  @app.cell
1201
+ def _(mo, pl, polars_result):
1202
+ def _():
1203
+ polars_filtered = polars_result.filter(pl.col("age") > 35)
1204
+ polars_avg = polars_filtered.select(
1205
+ pl.col("account_balance").mean().alias("avg_balance")
1206
+ )
1207
+
1208
+ layout = mo.vstack(
1209
+ [
1210
+ mo.md("### Filtered Polars DataFrame (Age > 35):"),
1211
+ mo.ui.table(polars_filtered),
1212
+ mo.md("### Average Account Balance:"),
1213
+ mo.ui.table(polars_avg),
1214
+ ],
1215
+ gap=1.5,
1216
+ )
1217
+ return layout
1218
+
1219
+
1220
+ _()
1221
+ return
1222
+
1223
+
1224
+ @app.cell(hide_code=True)
1225
+ def _(mo):
1226
+ mo.md(
1227
+ r"""
1228
+ <!-- Pandas equivalent (using pandas style) -->
1229
+ ## Pandas: Same operation in pandas style
1230
  """
 
 
 
 
 
1231
  )
1232
+ return
1233
+
1234
+
1235
+ @app.cell
1236
+ def _(mo, pandas_result):
1237
+ pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean()
1238
+ mo.md(f"Average balance: {pandas_avg:.2f}")
1239
+ return
1240
+
1241
+
1242
+ @app.cell(hide_code=True)
1243
+ def _(mo):
1244
+ mo.md("""## 9. Data Visualization with DuckDB and Plotly""")
1245
+ return
1246
+
1247
+
1248
+ @app.cell
1249
+ def _(age_groups, mo, new_memory_db, plotly_express):
1250
+ # User distribution by age group
1251
+ fig1 = plotly_express.bar(
1252
+ age_groups,
1253
+ x="age_group",
1254
+ y="count",
1255
+ title="User Distribution by Age Group",
1256
+ labels={"count": "Number of Users", "age_group": "Age Group"},
1257
+ color="age_group",
1258
+ color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
1259
+ )
1260
+ fig1.update_traces(
1261
+ text=age_groups["count"],
1262
+ textposition="outside",
1263
+ )
1264
+ fig1.update_layout(height=450, margin=dict(t=50, b=50))
1265
+
1266
+
1267
+ # Average balance by age group
1268
+ fig2 = plotly_express.bar(
1269
+ age_groups,
1270
+ x="age_group",
1271
+ y="avg_balance",
1272
+ title="Average Account Balance by Age Group",
1273
+ labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"},
1274
+ color="age_group",
1275
+ color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
1276
+ )
1277
+ fig2.update_traces(
1278
+ text=[f"${val:.2f}" for val in age_groups["avg_balance"]],
1279
+ textposition="outside",
1280
+ )
1281
+ fig2.update_layout(height=450, margin=dict(t=50, b=50))
1282
+
1283
+
1284
+ # Age vs Account Balance scatter plot
1285
+ scatter_data = new_memory_db.execute(
1286
+ """
1287
+ SELECT
1288
+ name,
1289
+ age,
1290
+ account_balance
1291
+ FROM users_memory
1292
+ ORDER BY age
1293
+ """
1294
+ ).df()
1295
+
1296
+ fig3 = plotly_express.scatter(
1297
+ scatter_data,
1298
+ x="age",
1299
+ y="account_balance",
1300
+ title="Age vs. Account Balance",
1301
+ labels={"account_balance": "Account Balance ($)", "age": "Age"},
1302
+ color_discrete_sequence=["#FF7F0E"],
1303
+ trendline="ols",  # OLS trendline; requires the statsmodels package
1304
+ hover_data=["age", "account_balance"],
1305
+ size_max=15,
1306
+ )
1307
+ fig3.update_traces(marker=dict(size=12))
1308
+ fig3.update_layout(height=450, margin=dict(t=50, b=50))
1309
+
1310
+
1311
+ # Distribution of account balances
1312
+ balance_data = new_memory_db.execute(
1313
+ """
1314
+ SELECT
1315
+ name,
1316
+ account_balance
1317
+ FROM users_memory
1318
+ ORDER BY account_balance DESC
1319
+ """
1320
+ ).df()
1321
+
1322
+ fig4 = plotly_express.pie(
1323
+ balance_data,
1324
+ names="name",
1325
+ values="account_balance",
1326
+ title="Distribution of Account Balances",
1327
+ labels={"account_balance": "Account Balance ($)", "name": "User"},
1328
+ color_discrete_sequence=plotly_express.colors.qualitative.Pastel,
1329
+ )
1330
+ fig4.update_traces(textinfo="percent+label", textposition="inside")
1331
+ fig4.update_layout(height=450, margin=dict(t=50, b=50))
1332
+
1333
+
1334
+ category_tabs = mo.ui.tabs(
1335
+ {
1336
+ "Age Group Analysis": mo.vstack(
1337
+ [
1338
+ mo.ui.tabs(
1339
+ {
1340
+ "User Distribution": mo.ui.plotly(fig1),
1341
+ "Average Balance": mo.ui.plotly(fig2),
1342
+ }
1343
+ )
1344
+ ]
1345
+ ),
1346
+ "Financial Analysis": mo.vstack(
1347
+ [
1348
+ mo.ui.tabs(
1349
+ {
1350
+ "Age vs Balance": mo.ui.plotly(fig3),
1351
+ "Balance Distribution": mo.ui.plotly(fig4),
1352
+ }
1353
+ )
1354
+ ]
1355
+ ),
1356
+ },
1357
+ lazy=True,
1358
+ )
1359
+
1360
+ mo.vstack(
1361
+ [
1362
+ mo.md("### Select a visualization category:"),
1363
+ category_tabs,
1364
+ ],
1365
+ gap=1.5,
1366
+ )
1367
+ return
1368
 
1369
 
1370
  @app.cell(hide_code=True)
1371
  def _(mo):
1372
  mo.md(
1373
+ r"""
1374
+ # 9. Database Management Best Practices
1375
+
1376
+ ### Closing Connections
1377
+
1378
+ It's important to close database connections when you're done with them, especially for file-based connections:
1379
+
1380
+ ```python
1381
+ memory_db.close()
1382
+ file_db.close()
1383
+ ```
1384
+
1385
+ ### Transaction Management
1386
+
1387
+ DuckDB supports transactions, which can be useful for more complex operations:
1388
+
1389
+ ```python
1390
+ conn = duckdb.connect('mydb.db')
1391
+ conn.begin() # Start transaction
1392
+
1393
+ try:
1394
+ conn.execute("INSERT INTO users VALUES (1, 'Test User')")
1395
+ conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1")
1396
+ conn.commit() # Commit changes
1397
+ except Exception:
1398
+ conn.rollback() # Undo changes if error
1399
+ raise
1400
+ ```
1401
+
1402
+ ### Query Performance
1403
+
1404
+ DuckDB is optimized for analytical queries. For best performance:
1405
+
1406
+ - Use appropriate data types
1407
+ - Create indexes only for highly selective lookups; DuckDB already keeps min-max statistics that speed up analytical scans
1408
+ - For large datasets, consider partitioning
1409
+ - Use prepared statements for repeated queries
1410
+ """
1411
  )
1412
  return
1413
 
1414
 
1415
+ @app.cell(hide_code=True)
1416
+ def _interactive_dashboard(mo):
1417
+ mo.md(rf"""## 10. Interactive DuckDB Dashboard with Marimo and Plotly""")
1418
+ return
1419
+
1420
+
1421
+ @app.cell
1422
+ def _(mo):
1423
+ # Create an interactive filter for age range
1424
+ min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age")
1425
+ max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age")
1426
+ return max_age, min_age
1427
+
1428
+
1429
+ @app.cell
1430
+ def _(max_age, min_age, new_memory_db):
1431
+ # Query the users whose age falls within the selected range
1432
+ def get_filtered_data(min_val=min_age.value, max_val=max_age.value):
1433
+ # Get filtered data based on slider values using parameterized query for safety
1434
+ return new_memory_db.execute(
1435
+ """
1436
+ SELECT
1437
+ id,
1438
+ name,
1439
+ age,
1440
+ email,
1441
+ account_balance,
1442
+ registration_date
1443
+ FROM users_memory
1444
+ WHERE age >= ? AND age <= ?
1445
+ ORDER BY age
1446
+ """,
1447
+ [min_val, max_val],
1448
+ ).df()
1449
+ return (get_filtered_data,)
1450
+
1451
+
1452
+ @app.cell
1453
+ def _(get_filtered_data):
1454
+ def get_metrics(data=get_filtered_data()):
1455
+ return {
1456
+ "user count": len(data),
1457
+ "avg_balance": data["account_balance"].mean() if len(data) > 0 else 0,
1458
+ "total_balance": data["account_balance"].sum() if len(data) > 0 else 0,
1459
+ }
1460
+ return (get_metrics,)
1461
+
1462
+
1463
+ @app.cell
1464
+ def _(get_metrics, mo):
1465
+ def metrics_display(metrics=get_metrics()):
1466
+ return mo.hstack(
1467
+ [
1468
+ mo.vstack(
1469
+ [
1470
+ mo.md("### Selected Users"),
1471
+ mo.md(f"## {metrics['user count']}"),
1472
+ ],
1473
+ align="center",
1474
+ ),
1475
+ mo.vstack(
1476
+ [
1477
+ mo.md("### Average Balance"),
1478
+ mo.md(f"## ${metrics['avg_balance']:.2f}"),
1479
+ ],
1480
+ align="center",
1481
+ ),
1482
+ mo.vstack(
1483
+ [
1484
+ mo.md("### Total Balance"),
1485
+ mo.md(f"## ${metrics['total_balance']:.2f}"),
1486
+ ],
1487
+ align="center",
1488
+ ),
1489
+ ],
1490
+ justify="space-between",
1491
+ gap=1.5,
1492
+ )
1493
+ return (metrics_display,)
1494
+
1495
+
1496
+ @app.cell
1497
+ def _(get_filtered_data, max_age, min_age, mo, plotly_express):
1498
+ def create_visualization(
1499
+ data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value
1500
+ ):
1501
+ if len(data) == 0:
1502
+ return mo.md("No data available for the selected age range.")
1503
+
1504
+ # Create visualizations for filtered data
1505
+ fig1 = plotly_express.bar(
1506
+ data,
1507
+ x="name",
1508
+ y="account_balance",
1509
+ title=f"Account Balance by User (Age {min_val} - {max_val})",
1510
+ labels={"account_balance": "Account Balance ($)", "name": "User"},
1511
+ color="account_balance",
1512
+ color_continuous_scale=plotly_express.colors.sequential.Plasma,
1513
+ text_auto=".2s",
1514
+ )
1515
+ fig1.update_layout(
1516
+ height=400,
1517
+ xaxis_tickangle=-45,
1518
+ margin=dict(t=50, b=70, l=50, r=30),
1519
+ )
1520
+ fig1.update_traces(
1521
+ textposition="outside",
1522
+ )
1523
+
1524
+ fig2 = plotly_express.histogram(
1525
+ data,
1526
+ x="age",
1527
+ nbins=min(10, len(set(data["age"]))),
1528
+ title=f"Age Distribution (Age {min_val} - {max_val})",
1529
+ color_discrete_sequence=["#4C78A8"],
1530
+ opacity=0.8,
1531
+ histnorm="probability density",
1532
+ )
1533
+ fig2.update_layout(
1534
+ height=400,
1535
+ margin=dict(t=50, b=70, l=50, r=30),
1536
+ bargap=0.1,
1537
+ )
1538
+
1539
+ fig3 = plotly_express.scatter(
1540
+ data,
1541
+ x="age",
1542
+ y="account_balance",
1543
+ title=f"Age vs. Account Balance (Age {min_val} - {max_val})",
1544
+ labels={"account_balance": "Account Balance ($)", "age": "Age"},
1545
+ color="age",
1546
+ color_continuous_scale="Viridis",
1547
+ size_max=25,
1548
+ size="account_balance",
1549
+ hover_name="name",
1550
+ )
1551
+ fig3.update_layout(
1552
+ height=400,
1553
+ margin=dict(t=50, b=70, l=50, r=30),
1554
+ )
1555
+
1556
+ return mo.ui.tabs(
1557
+ {
1558
+ "Account Balance by User": mo.ui.plotly(fig1),
1559
+ "Age Distribution": mo.ui.plotly(fig2),
1560
+ "Age vs. Account Balance": mo.ui.plotly(fig3),
1561
+ }
1562
+ )
1563
+ return (create_visualization,)
1564
+
1565
+
1566
  @app.cell
1567
+ def _(
1568
+ create_visualization,
1569
+ get_filtered_data,
1570
+ max_age,
1571
+ metrics_display,
1572
+ min_age,
1573
+ mo,
1574
+ ):
1575
+ def dashboard(
1576
+ min_val=min_age.value,
1577
+ max_val=max_age.value,
1578
+ metrics=metrics_display(),
1579
+ data=get_filtered_data(),
1580
+ visualization=create_visualization()
1581
+ ):
1582
+ return mo.vstack(
1583
+ [
1584
+ mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"),
+ mo.hstack([min_age, max_age]),  # show the sliders so the age range can be changed
1585
+ metrics,
1586
+ mo.md("### Data Table"),
1587
+ mo.ui.table(data, page_size=5),
1588
+ mo.md("### Visualizations"),
1589
+ visualization,
1590
+ ],
1591
+ gap=2
1592
+ )
1593
+ dashboard()
1594
+ return
1595
+
1596
+
1597
+ @app.cell(hide_code=True)
1598
+ def _conclusion(mo):
1599
+ mo.md(
1600
+ rf"""
1601
+ # Summary and Key Takeaways
1602
+
1603
+ In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
1604
+
1605
+ 1. **Connection types**: We learned the difference between in-memory databases (temporary) and file-based databases (persistent).
1606
+
1607
+ 2. **Table creation**: We created tables with various data types, constraints, and primary keys.
1608
+
1609
+ 3. **Data insertion**: We demonstrated different ways to insert data, including single inserts and bulk loading.
1610
+
1611
+ 4. **SQL queries**: We executed various SQL queries directly and through Marimo's UI components.
1612
+
1613
+ 5. **Integration with Polars**: We showed how DuckDB can work seamlessly with Polars DataFrames.
1614
+
1615
+ 6. **Joins and relationships**: We performed JOIN operations between tables to combine related data.
1616
+
1617
+ 7. **Aggregation**: We used aggregate functions to summarize and analyze data.
1618
+
1619
+ 8. **Data conversion**: We converted DuckDB results to both Polars and Pandas DataFrames.
1620
+
1621
+ 9. **Best practices**: We reviewed best practices for managing DuckDB connections and transactions.
1622
+
1623
+ 10. **Visualization**: We created interactive visualizations and dashboards with Plotly and Marimo.
1624
+
1625
+ DuckDB is an excellent tool for data analysis, especially for analytical workloads. Its in-process nature makes it fast and easy to use, while its SQL compatibility makes it accessible for anyone familiar with SQL databases.
1626
+
1627
+ ### Next Steps
1628
+
1629
+ - Try loading larger datasets into DuckDB
1630
+ - Experiment with more complex queries and window functions
1631
+ - Use DuckDB's COPY functionality to import/export data from/to files
1632
+ - Create more advanced interactive dashboards with Marimo and Plotly
1633
+ """
1634
+ )
1635
+ return
1636
 
1637
 
1638
  @app.cell(hide_code=True)
 
1640
  import marimo as mo
1641
  import duckdb
1642
  import polars as pl
1643
+ import os
1644
+ from datetime import date
1645
+ import plotly.express as plotly_express
1646
+ import plotly.graph_objects as plotly_graph_objects
1647
+ import numpy as np
1648
+ return date, duckdb, mo, os, pl, plotly_express
1649
 
1650
 
1651
  if __name__ == "__main__":