Spaces: YuITC/Vietnamese-Legal-Doc-Retrieval

YuITC committed on Apr 24

Commit

4d6a130

1 Parent(s): e1e5cda

refactor: refactor all files

Files changed (9)

README.md +195 -1
assets/gradio_demo.png +3 -0
main.py +0 -2
requirements.txt +2 -2
settings.py +4 -4
step_01_Prepare_Data.ipynb +59 -16
step_02_Finetune_SBERT.ipynb +2 -11
step_03_Eval_with_MTEB.ipynb +3 -57
step_04_Retrieval.ipynb +30 -224

README.md CHANGED

@@ -12,4 +12,198 @@ datasets:
 - YuITC/Vietnamese-Legal-Doc-Retrieval-Data
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 - YuITC/Vietnamese-Legal-Doc-Retrieval-Data
 ---
+# Vietnamese Legal Document Retrieval System
+[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/YuITC/Vietnamese-Legal-Doc-Retrieval)
+[![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Model-HF%20Hub-yellow)](https://huggingface.co/YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs)
+[![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-HF%20Hub-green)](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data)
+A retrieval system for Vietnamese legal documents, built on a fine-tuned SBERT (Sentence-BERT) model.
+## 📌 Overview
+This project implements a system that retrieves relevant Vietnamese legal documents for a user query. It uses a fine-tuned multilingual BERT model to encode legal queries and documents into a shared semantic vector space, so retrieval is based on meaning rather than keyword matching alone.
+![Gradio Interface Demo](assets/gradio_demo.png)
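As a rough illustration of meaning-based retrieval, the sketch below ranks documents by the inner product of unit-length vectors, which equals cosine similarity. It is not this project's code: the three-dimensional vectors are made-up stand-ins for real SBERT embeddings, the brute-force scan stands in for the FAISS index, and the names `normalize` and `search` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k highest inner products."""
    q = normalize(query_vec)
    scores = [
        (idx, sum(a * b for a, b in zip(q, normalize(doc))))
        for idx, doc in enumerate(corpus_vecs)
    ]
    # Highest similarity first, then keep only the top_k entries.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy "embeddings" (assumed values, purely illustrative).
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
results = search([1.0, 0.05, 0.0], corpus, top_k=2)
for idx, score in results:
    print(f"index={idx}, score={score:.4f}")
```

On normalized vectors, an exact inner-product index such as FAISS `IndexFlatIP` should give the same ranking as this scan; the index exists to make the search fast at corpus scale.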
+## 🔑 Key features
+- Step-by-step notebooks that walk through each stage of the pipeline.
+- Fine-tuned SBERT model specialized for Vietnamese legal document retrieval.
+- FAISS indexing for efficient vector search.
+- Evaluation based on MTEB.
+- Interactive web interface for quick legal document search.
+- High-performance retrieval of relevant legal passages.
+## 🛠️ Installation & Usage
+```bash
+# Install dependencies
+conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
+conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
+pip install -r requirements.txt
+# Run the application
+python main.py
+```
+The application will start a local web server with the Gradio interface, allowing you to enter legal queries and retrieve relevant documents.
+## 📂 Project Structure
+```
+Vietnamese-Legal-Doc-Retrieval/
+├── assets/                   # Visual assets for documentation
+│   └── gradio_demo.png       # Screenshot of the Gradio demo interface
+├── cache/                    # Cached model files
+│   └── VN-legalDocs-SBERT/   # Cached BERT model files
+├── data/                     # Dataset files
+│   ├── original/             # Original downloaded dataset
+│   │   ├── corpus.csv        # Raw corpus documents
+│   │   ├── train_split.csv   # Training data
+│   │   ├── val_split.csv     # Validation data
+│   │   └── ...
+│   ├── processed/            # Processed dataset files
+│   │   ├── corpus_data.parquet  # Processed corpus for embedding
+│   │   ├── train_data.parquet   # Processed training data
+│   │   └── test_data.parquet    # Processed test data
+│   └── retrieval/            # Files for retrieval system
+│       └── legal_faiss.index # FAISS index for fast vector search
+├── models/                   # Trained model files
+│   └── VN-legalDocs-SBERT/   # Fine-tuned BERT model for legal documents
+│       ├── model.safetensors # Model weights
+│       ├── config.json       # Model configuration
+│       └── checkpoint-*/     # Training checkpoints
+├── results/                  # Evaluation results
+├── Dockerfile                # Docker configuration for deployment
+├── main.py                   # Main application entry point
+├── requirements.txt          # Python dependencies
+├── settings.py               # Configuration settings
+└── step_*_*.ipynb            # Jupyter notebooks for each step of the process
+```
+## 💾 Dataset
+The system is trained on a Vietnamese legal document corpus that contains:
+- Legal texts from various domains
+- Query-document pairs for training and evaluation
+- Data processed and structured for semantic search training
+The dataset is available on [Hugging Face](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data). It is a modified version of the base dataset credited under Acknowledgments.
+## 📊 Model Training Process
+The project follows a systematic approach to build the retrieval system:
+1. **Data Preparation** (`step_01_Prepare_Data.ipynb`):
+   - Processes raw legal documents
+   - Creates query-document pairs for training
+   - Formats data for the embedding model
+2. **SBERT Fine-tuning** (`step_02_Finetune_SBERT.ipynb`):
+   - Fine-tunes a multilingual BERT model with legal document pairs
+   - Uses `CachedMultipleNegativesRankingLoss` for training
+   - Optimizes for semantic similarity in legal context
+3. **Evaluation** (`step_03_Eval_with_MTEB.ipynb`):
+   - Evaluates model performance using retrieval metrics
+   - Compares with baseline models
+4. **Retrieval System Setup** (`step_04_Retrieval.ipynb`):
+   - Creates FAISS index from document embeddings
+   - Implements efficient search functionality
+   - Prepares for deployment
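The idea behind the ranking loss used in step 2 can be sketched in plain Python. This is an illustration under stated assumptions, not the Sentence Transformers implementation: it takes a precomputed similarity matrix in which entry (i, j) is the cosine similarity between query i and document j, treats document i as the only positive for query i, and treats every other document in the batch as a negative. The function name `in_batch_negatives_loss` and the `scale` value of 20.0 are illustrative choices.

```python
import math


def in_batch_negatives_loss(sim_matrix, scale=20.0):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Negative log-probability assigned to the matching document.
        total += log_sum_exp - logits[i]
    return total / len(sim_matrix)


# Matching pairs score highest on the diagonal: low loss.
good = [[0.9, 0.1], [0.2, 0.8]]
# Matching pairs score lowest on the diagonal: high loss.
bad = [[0.1, 0.9], [0.8, 0.2]]

loss_good = in_batch_negatives_loss(good)
loss_bad = in_batch_negatives_loss(bad)
print(f"good batch: {loss_good:.6f}, bad batch: {loss_bad:.6f}")
```

Larger batches supply more negatives per query. The cached variant named in step 2 is meant to compute this kind of loss with a lower memory footprint so that larger batches fit; the library documentation has the details.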
+## 🔍 Usage Examples
+The system accepts natural language queries in Vietnamese related to legal topics. Example queries:
+- "Tội xúc phạm danh dự?" (Crimes against honor?)
+- "Quyền lợi của người lao động?" (Rights of workers?)
+- "Thủ tục đăng ký kết hôn?" (Marriage registration procedures?)
+## 🧪 Performance
+The fine-tuned model was evaluated using the [MTEB benchmark](https://github.com/embeddings-benchmark/mteb) on the BKAILegalDocRetrieval dataset. Key results:
+| Metric       | @k  | Pre-trained model score (%) | Fine-tuned model score (%) |
+|--------------|-----|-----------------------------|-----------------------------|
+| **NDCG**     | 1   | 0.007                       | 42.425                      |
+|              | 5   | 0.011                       | 57.387                      |
+|              | 10  | 0.023                       | 60.389                      |
+|              | 20  | 0.049                       | 62.160                      |
+|              | 100 | 0.147                       | 63.894                      |
+| **MAP**      | 1   | 0.007                       | 40.328                      |
+|              | 5   | 0.009                       | 52.297                      |
+|              | 10  | 0.014                       | 53.608                      |
+|              | 20  | 0.021                       | 54.136                      |
+|              | 100 | 0.033                       | 54.418                      |
+| **Recall**   | 1   | 0.007                       | 40.328                      |
+|              | 5   | 0.017                       | 70.466                      |
+|              | 10  | 0.054                       | 79.407                      |
+|              | 20  | 0.157                       | 86.112                      |
+|              | 100 | 0.713                       | 94.805                      |
+| **Precision**| 1   | 0.007                       | 42.425                      |
+|              | 5   | 0.003                       | 15.119                      |
+|              | 10  | 0.005                       | 8.587                       |
+|              | 20  | 0.008                       | 4.687                       |
+|              | 100 | 0.007                       | 1.045                       |
+| **MRR**      | 1   | 0.007                       | 42.418                      |
+|              | 5   | 0.010                       | 54.337                      |
+|              | 10  | 0.014                       | 55.510                      |
+|              | 20  | 0.021                       | 55.956                      |
+|              | 100 | 0.033                       | 56.172                      |
+- **NDCG@k (Normalized Discounted Cumulative Gain)**
+  Measures ranking quality by summing the relevance of the top k results, discounting each logarithmically by its rank, and normalizing by the score of an ideal ranking.
+- **MAP@k (Mean Average Precision)**
+  For each query, averages the precision values measured at the rank of each relevant document retrieved within the top k, then takes the mean across all queries.
+- **Recall@k**
+  The proportion of all relevant documents that are retrieved in the top k results.
+- **Precision@k**
+  The proportion of the top k retrieved documents that are relevant.
+- **MRR@k (Mean Reciprocal Rank)**
+  The average of the reciprocal of the rank position of the first relevant document across all queries.
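For a single query, four of these metrics can be computed as in the sketch below. It assumes binary relevance (a document is either relevant or not) and made-up document ids; MTEB's own implementation may differ in details, and the helper names are hypothetical. Averaging each value over all queries gives corpus-level scores of the kind reported in the table.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant documents found in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant result in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = [7, 3, 9, 1, 5]   # system output, best first (made-up ids)
relevant = {3, 5}          # ground-truth relevant ids (made-up)

for k in (1, 5):
    print(
        k,
        precision_at_k(ranked, relevant, k),
        recall_at_k(ranked, relevant, k),
        reciprocal_rank_at_k(ranked, relevant, k),
        round(ndcg_at_k(ranked, relevant, k), 3),
    )
```

Because precision divides by k, it shrinks as k grows when a query has only a few relevant documents, even while recall keeps improving. The fine-tuned model's Precision and Recall rows in the table show that pattern.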
+The fine-tuned model far outperforms the pre-trained model on every metric in the table. The main evaluation score (NDCG@10) rises from 0.023% to 60.389%, which indicates strong performance on Vietnamese legal document retrieval.
+## 🐳 Docker Deployment
+The project includes a Docker configuration for easy deployment. The Docker image is built on `continuumio/miniconda3` and includes GPU support via PyTorch CUDA and FAISS-GPU.
+```bash
+# Build the Docker image
+docker build -t vietnamese-legal-retrieval .
+# Run the container
+docker run -p 7860:7860 vietnamese-legal-retrieval
+```
+The container:
+- Uses Python 3.10 with CUDA 12.1 support
+- Installs required dependencies from requirements.txt
+- Exposes port 7860 for the Gradio web interface
+- Sets proper environment variables for security and performance
+- Runs as a non-root user for enhanced security
+You can access the web interface by navigating to `http://localhost:7860` after starting the container.
+## 📜 License
+This project is licensed under the MIT License – feel free to modify and distribute it as needed.
+## 🤝 Acknowledgments
+Thanks to:
+- [BKAI Legal Retrieval Dataset](https://huggingface.co/datasets/tmnam20/BKAI-Legal-Retrieval) for the original data
+- [Sentence Transformers](https://www.sbert.net/) library for the embedding model architecture
+- [Hugging Face](https://huggingface.co/) for hosting the model and dataset
+If you find this project useful, consider ⭐️ starring the repository or contributing to further improvements!
+## 📬 Contact
+For any questions or collaboration opportunities, feel free to reach out:
+📧 Email: tainguyenphu2502@gmail.com

assets/gradio_demo.png ADDED

Git LFS Details

SHA256: 3bda9a0ed45befedf1a3d9c447831a5984ffc4e786130d78ee95e7efaad6bb58
Pointer size: 131 Bytes
Size of remote file: 244 kB

main.py CHANGED

@@ -1,10 +1,8 @@
-import os
 import numpy as np
 import pandas as pd
 import gradio as gr
 import faiss
-from datasets import load_dataset
 from huggingface_hub import hf_hub_download
 from sentence_transformers import SentenceTransformer

 import numpy as np
 import pandas as pd
 import gradio as gr
 import faiss
 from huggingface_hub import hf_hub_download
 from sentence_transformers import SentenceTransformer

requirements.txt CHANGED

@@ -1,6 +1,7 @@
 # !conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
 # !conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
 transformers
 sentence-transformers
 accelerate
@@ -8,5 +9,4 @@ datasets
 mteb
 tqdm
 pandas
-gradio
-huggingface-hub

 # !conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
 # !conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
+huggingface-hub
 transformers
 sentence-transformers
 accelerate
 mteb
 tqdm
 pandas
+gradio

settings.py CHANGED

@@ -4,14 +4,14 @@ import random
 import numpy as np
-# Data settings
 os.makedirs('data', exist_ok=True)
 os.makedirs('data/original', exist_ok=True)
 os.makedirs('data/processed', exist_ok=True)
 os.makedirs('data/retrieval', exist_ok=True)
-# Model settings
 MODEL_ID   = 'google-bert/bert-base-multilingual-cased'
 MODEL_NAME = 'VN-legalDocs-SBERT'
@@ -22,7 +22,7 @@ os.makedirs(CACHE_DIR, exist_ok=True)
 os.makedirs(OUTPUT_DIR, exist_ok=True)
-# Reproducibility
 SEED = 42
 random.seed(SEED)
 np.random.seed(SEED)
@@ -35,7 +35,7 @@ torch.backends.cudnn.deterministic = False
 torch.backends.cudnn.benchmark     = True
-# Hyperparameters
 MAX_SEQ_LEN = 512
 EPOCHS      = 5
 LR          = 3e-5

 import numpy as np
+# ===== Data settings =====
 os.makedirs('data', exist_ok=True)
 os.makedirs('data/original', exist_ok=True)
 os.makedirs('data/processed', exist_ok=True)
 os.makedirs('data/retrieval', exist_ok=True)
+# ===== Model settings =====
 MODEL_ID   = 'google-bert/bert-base-multilingual-cased'
 MODEL_NAME = 'VN-legalDocs-SBERT'
 os.makedirs(OUTPUT_DIR, exist_ok=True)
+# ===== Reproducibility =====
 SEED = 42
 random.seed(SEED)
 np.random.seed(SEED)
 torch.backends.cudnn.benchmark     = True
+# ===== Hyperparameters =====
 MAX_SEQ_LEN = 512
 EPOCHS      = 5
 LR          = 3e-5

step_01_Prepare_Data.ipynb CHANGED

@@ -31,6 +31,15 @@
     "import pandas as pd"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 5,
@@ -54,7 +63,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
    "id": "4fe0c4f8",
    "metadata": {},
    "outputs": [
@@ -68,6 +77,7 @@
     }
    ],
    "source": [
     "corpus_data = pd.read_csv('data/original/corpus.csv')\n",
     "train_split = pd.read_csv('data/original/train_split.csv')\n",
     "test_split  = pd.read_csv('data/original/val_split.csv')\n",
@@ -155,9 +165,47 @@
     "corpus_data.head()"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
    "id": "3d32d13a",
    "metadata": {},
    "outputs": [
@@ -263,20 +311,6 @@
     }
    ],
    "source": [
-    "# 'cid' column: '[1 2 3]'\n",
-    "train_split['cid'] = train_split['cid'].apply(lambda x: [int(i) for i in x[1:-1].split()])\n",
-    "test_split['cid']  = test_split['cid'].apply(lambda x: [int(i) for i in x[1:-1].split()])\n",
-    "\n",
-    "\n",
-    "# Mapping from corpus \n",
-    "mapping = dict(zip(corpus_data['cid'], corpus_data['text']))\n",
-    "\n",
-    "def get_context_list(cid_list):\n",
-    "    return [mapping[cid] for cid in cid_list if cid in mapping]\n",
-    "\n",
-    "train_split['context_list'] = train_split['cid'].apply(get_context_list)\n",
-    "test_split['context_list']  = test_split['cid'].apply(get_context_list)\n",
-    "\n",
     "train_split.head()"
    ]
   },
@@ -342,6 +376,15 @@
     "]"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 11,

     "import pandas as pd"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "657d341a",
+   "metadata": {},
+   "source": [
+    "## **1. Data Overview**\n",
+    "---"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "4fe0c4f8",
    "metadata": {},
    "outputs": [
     }
    ],
    "source": [
+    "# Overview the dataset\n",
     "corpus_data = pd.read_csv('data/original/corpus.csv')\n",
     "train_split = pd.read_csv('data/original/train_split.csv')\n",
     "test_split  = pd.read_csv('data/original/val_split.csv')\n",
     "corpus_data.head()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e435ce3e",
+   "metadata": {},
+   "source": [
+    "## **2. Data Preprocessing**\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a176681",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 'cid' column: '[1 2 3]'\n",
+    "train_split['cid'] = train_split['cid'].apply(lambda x: [int(i) for i in x[1:-1].split()])\n",
+    "test_split['cid']  = test_split['cid'].apply(lambda x: [int(i) for i in x[1:-1].split()])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65916537",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Mapping from corpus \n",
+    "mapping = dict(zip(corpus_data['cid'], corpus_data['text']))\n",
+    "\n",
+    "def get_context_list(cid_list):\n",
+    "    return [mapping[cid] for cid in cid_list if cid in mapping]\n",
+    "\n",
+    "train_split['context_list'] = train_split['cid'].apply(get_context_list)\n",
+    "test_split['context_list']  = test_split['cid'].apply(get_context_list)"
+   ]
+  },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "3d32d13a",
    "metadata": {},
    "outputs": [
     }
    ],
    "source": [
     "train_split.head()"
    ]
   },
     "]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "656c37f3",
+   "metadata": {},
+   "source": [
+    "## **3. Save Data**\n",
+    "---"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 11,
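
The preprocessing cells in this notebook parse the `cid` column and look up passage text. A self-contained sketch of the same two steps, using made-up ids and texts and a hypothetical helper name `parse_cid`, shows that ids missing from the corpus are silently skipped:

```python
def parse_cid(raw):
    """Turn a string such as '[1 2 3]' into the list [1, 2, 3]."""
    # Drop the surrounding brackets, then split on whitespace.
    return [int(token) for token in raw[1:-1].split()]


def get_context_list(cid_list, mapping):
    """Look up the text for each id, ignoring ids absent from the mapping."""
    return [mapping[cid] for cid in cid_list if cid in mapping]


# Made-up corpus mapping: cid -> passage text.
mapping = {1: "text A", 2: "text B"}

cids = parse_cid("[1 2 5]")
contexts = get_context_list(cids, mapping)
print(cids, contexts)
```

If dropped ids matter, comparing `len(cids)` with `len(contexts)` is a simple way to detect them.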

step_02_Finetune_SBERT.ipynb CHANGED

@@ -23,15 +23,7 @@
    "execution_count": null,
    "id": "0086aabe",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Using device: cuda\n"
-     ]
-    }
-   ],
    "source": [
     "import os\n",
     "import pandas as pd\n",
@@ -50,7 +42,6 @@
     "from sentence_transformers.training_args import BatchSamplers\n",
     "\n",
     "from settings import MODEL_ID, MODEL_NAME, CACHE_DIR, OUTPUT_DIR, MAX_SEQ_LEN, EPOCHS, LR, BATCH_SIZE, DEVICE\n",
-    "\n",
     "os.environ['WANDB_DISABLED'] = 'true'"
    ]
   },
@@ -199,7 +190,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
    "id": "2c751cf4",
    "metadata": {},
    "outputs": [

    "execution_count": null,
    "id": "0086aabe",
    "metadata": {},
+   "outputs": [],
    "source": [
     "import os\n",
     "import pandas as pd\n",
     "from sentence_transformers.training_args import BatchSamplers\n",
     "\n",
     "from settings import MODEL_ID, MODEL_NAME, CACHE_DIR, OUTPUT_DIR, MAX_SEQ_LEN, EPOCHS, LR, BATCH_SIZE, DEVICE\n",
     "os.environ['WANDB_DISABLED'] = 'true'"
    ]
   },
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "2c751cf4",
    "metadata": {},
    "outputs": [

step_03_Eval_with_MTEB.ipynb CHANGED

@@ -20,64 +20,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
-   "id": "b5fd917b",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "📦 PyTorch version: 2.5.1\n",
-      "🚀 CUDA available : True\n",
-      "🧠 GPU Name       : NVIDIA RTX A4000\n",
-      "📦 FAISS version : 1.9.0\n",
-      "🚀 FAISS is using GPU ✅\n"
-     ]
-    }
-   ],
-   "source": [
-    "import torch\n",
-    "\n",
-    "print(\"📦 PyTorch version:\", torch.__version__)\n",
-    "print(\"🚀 CUDA available :\", torch.cuda.is_available())\n",
-    "if torch.cuda.is_available():\n",
-    "    print(\"🧠 GPU Name       :\", torch.cuda.get_device_name(0))\n",
-    "    \n",
-    "import faiss\n",
-    "\n",
-    "print(\"📦 FAISS version :\", faiss.__version__)\n",
-    "\n",
-    "# Check whether the FAISS-GPU module works\n",
-    "try:\n",
-    "    res = faiss.StandardGpuResources()  # No error here means a GPU is available\n",
-    "    print(\"🚀 FAISS is using GPU ✅\")\n",
-    "except Exception as e:\n",
-    "    print(\"❌ FAISS is NOT using GPU:\", str(e))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
    "id": "030016c2",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "C:\\Users\\Administrator\\AppData\\Local\\Temp\\2\\ipykernel_648\\3951191562.py:5: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
-      "  from tqdm.autonotebook import tqdm\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Using device: cuda\n"
-     ]
-    }
-   ],
    "source": [
     "import os\n",
     "import json\n",
@@ -86,12 +32,12 @@
     "from tqdm.autonotebook import tqdm\n",
     "\n",
     "from sentence_transformers import SentenceTransformer\n",
     "from mteb import MTEB\n",
     "from mteb.abstasks.TaskMetadata import TaskMetadata\n",
     "from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval\n",
     "\n",
     "from settings import MODEL_NAME, OUTPUT_DIR, DEVICE, BATCH_SIZE\n",
-    "\n",
     "os.environ['WANDB_DISABLED'] = 'true'"
    ]
   },

   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "030016c2",
    "metadata": {},
+   "outputs": [],
    "source": [
     "import os\n",
     "import json\n",
     "from tqdm.autonotebook import tqdm\n",
     "\n",
     "from sentence_transformers import SentenceTransformer\n",
+    "\n",
     "from mteb import MTEB\n",
     "from mteb.abstasks.TaskMetadata import TaskMetadata\n",
     "from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval\n",
     "\n",
     "from settings import MODEL_NAME, OUTPUT_DIR, DEVICE, BATCH_SIZE\n",
     "os.environ['WANDB_DISABLED'] = 'true'"
    ]
   },

step_04_Retrieval.ipynb CHANGED

@@ -20,7 +20,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
    "id": "01589fc8",
    "metadata": {},
    "outputs": [],
@@ -33,29 +33,11 @@
     "import faiss\n",
     "from sentence_transformers import SentenceTransformer, CrossEncoder\n",
     "\n",
-    "from settings import OUTPUT_DIR, DEVICE\n",
-    "\n",
-    "os.environ['WANDB_DISABLED'] = 'true'\n",
-    "\n",
     "from transformers import logging\n",
-    "logging.set_verbosity_error()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "id": "057e852f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# data = {\n",
-    "#     'corpus': pd.read_parquet('data/processed/corpus_data.parquet'),\n",
-    "#     'train' : pd.read_parquet('data/processed/train_data.parquet'),\n",
-    "#     'test'  : pd.read_parquet('data/processed/test_data.parquet')\n",
-    "# }\n",
-    "# for split in ['train', 'test']:\n",
-    "#     data[split]['cid']          = data[split]['cid'].apply(lambda x: x.tolist())\n",
-    "#     data[split]['context_list'] = data[split]['context_list'].apply(lambda x: x.tolist())"
    ]
   },
   {
@@ -85,51 +67,51 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
    "id": "62cc0ead",
    "metadata": {},
    "outputs": [],
    "source": [
-    "passages          = pd.read_parquet('data/processed/corpus_data.parquet')['text'].tolist()\n",
-    "# corpus_embeddings = fine_tuned_model.encode(\n",
-    "#     passages, \n",
-    "#     batch_size=128,\n",
-    "#     convert_to_numpy=True, \n",
-    "#     normalize_embeddings=True,\n",
-    "#     show_progress_bar=True, \n",
-    "#     device=DEVICE,\n",
-    "# ).astype(np.float32)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
    "id": "465e8d2a",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# d         = corpus_embeddings.shape[1]  # 768\n",
-    "# cpu_index = faiss.IndexFlatIP(d)\n",
     "\n",
-    "# res       = faiss.StandardGpuResources()\n",
-    "# gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)\n",
-    "# gpu_index.add(corpus_embeddings)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
    "id": "af365371",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# final_cpu_index = faiss.index_gpu_to_cpu(gpu_index)\n",
-    "# faiss.write_index(final_cpu_index, 'data/retrieval/legal_faiss.index')"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
    "id": "9251d0db",
    "metadata": {},
    "outputs": [],
@@ -168,194 +150,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
    "id": "ece21ef6",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[Rank 1] index=76423, score=0.6417\n",
-      "Tội làm nhục người khác\n",
-      "1. Người nào xúc phạm nghiêm trọng nhân phẩm, danh dự của người khác, thì bị phạt cảnh cáo, phạt tiền từ 10.000.000 đồng đến 30.000.000 đồng hoặc phạt cải tạo không giam giữ đến 03 năm.\n",
-      "...\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 2] index=99131, score=0.6155\n",
-      "“Người nào có hành vi xâm phạm danh dự, nhân phẩm của người khác mà gây thiệt hại thì phải bồi thường.”\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 3] index=228550, score=0.5932\n",
-      "i) Điều 353, các khoản 2, 3 và 4 (tội tham ô tài sản); Điều 354, các khoản 2, 3 và 4 (tội nhận hối lộ); Điều 355, các khoản 2, 3 và 4 (tội lạm dụng chức vụ, quyền hạn chiếm đoạt tài sản); Điều 356, các khoản 2 và 3 (tội lợi dụng chức vụ, quyền hạn trong khi thi hành công vụ); Điều 357, các khoản 2 và 3 (tội lạm quyền trong khi thi hành công vụ); Điều 358, các khoản 2, 3 và 4 (tội lợi dụng chức vụ, quyền hạn gây ảnh hưởng đối với người khác để trục lợi); Điều 359, các khoản 2, 3 và 4 (tội giả mạo trong công tác); Điều 364, các khoản 2, 3 và 4 (tội đưa hối lộ); Điều 365, các khoản 2, 3 và 4 (tội làm môi giới hối lộ);\n",
-      "k) Điều 373, các khoản 3 và 4 (tội dùng nhục hình); Điều 374, các khoản 3 và 4 (tội bức cung); Điều 386, khoản 2 (tội trốn khỏi nơi giam, giữ hoặc trốn khi đang bị áp giải, dẫn giải, đang bị xét xử);\n",
-      "l) Các điều từ Điều 421 đến Điều 425 về các tội phá hoại hòa bình, chống loài người và tội phạm chiến tranh.\n",
-      "2. Phạm tội trong trường hợp lợi dụng chức vụ, quyền hạn cản trở việc phát hiện tội phạm hoặc có những hành vi khác bao che người phạm tội, thì bị phạt tù từ 02 năm đến 07 năm.\n",
-      "Điều 390. Tội không tố giác tội phạm\n",
-      "1. Người nào biết rõ một trong các tội phạm được quy định tại Điều 389 của Bộ luật này đang được chuẩn bị, đang hoặc đã được thực hiện mà không tố giác, nếu không thuộc trường hợp quy định tại khoản 2 Điều 19 của Bộ luật này, thì bị phạt cảnh cáo, phạt cải tạo không giam giữ đến 03 năm hoặc phạt tù từ 06 tháng đến 03 năm.\n",
-      "2. Người không tố giác nếu đã có hành động can ngăn người phạm tội hoặc hạn chế tác hại của tội phạm, thì có thể được miễn trách nhiệm hình sự hoặc miễn hình phạt.\n",
-      "Điều 391. Tội gây rối trật tự phiên tòa\n",
-      "1. Người nào tại phiên tòa mà thóa mạ, xúc phạm nghiêm trọng danh dự, nhân phẩm thành viên Hội đồng xét xử, những người khác có mặt tại phiên tòa hoặc có hành vi đập phá tài sản thì bị phạt tiền từ 10.000.000 đồng đến 100.000.000 đồng, phạt cải tạo không giam giữ đến 01 năm hoặc phạt tù từ 03 tháng đến 01 năm.\n",
-      "2. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 01 năm đến 03 năm:\n",
-      "a) Gây náo loạn phiên tòa dẫn đến phải dừng phiên tòa;\n",
-      "b) Hành hung thành viên Hội đồng xét xử.\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 4] index=228404, score=0.5660\n",
-      "Điều 155. Tội làm nhục người khác\n",
-      "1. Người nào xúc phạm nghiêm trọng nhân phẩm, danh dự của người khác, thì bị phạt cảnh cáo, phạt tiền từ 10.000.000 đồng đến 30.000.000 đồng hoặc phạt cải tạo không giam giữ đến 03 năm.\n",
-      "2. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 03 tháng đến 02 năm:\n",
-      "a) Phạm tội 02 lần trở lên;\n",
-      "b) Đối với 02 người trở lên;\n",
-      "c) Lợi dụng chức vụ, quyền hạn;\n",
-      "d) Đối với người đang thi hành công vụ;\n",
-      "đ) Đối với người dạy dỗ, nuôi dưỡng, chăm sóc, chữa bệnh cho mình;\n",
-      "e) Sử dụng mạng máy tính hoặc mạng viễn thông, phương tiện điện tử để phạm tội;\n",
-      "g) Gây rối loạn tâm thần và hành vi của nạn nhân từ 11% đến 45%.\n",
-      "3. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 02 năm đến 05 năm:\n",
-      "a) Gây rối loạn tâm thần và hành vi của nạn nhân 46% trở lên;\n",
-      "b) Làm nạn nhân tự sát.\n",
-      "4. Người phạm tội còn có thể bị cấm đảm nhiệm chức vụ, cấm hành nghề hoặc làm công việc nhất định từ 01 năm đến 05 năm.\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 5] index=143035, score=0.5470\n",
-      "Khoản 4. Người có hành vi xâm phạm thân thể, sức khỏe, tính mạng hoặc xúc phạm danh dự, nhân phẩm của người hành nghề và người khác làm việc tại cơ sở khám bệnh, chữa bệnh thì tùy theo tính chất, mức độ vi phạm mà bị xử lý vi phạm hành chính hoặc bị truy cứu trách nhiệm hình sự theo quy định của pháp luật.\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 6] index=57787, score=0.5443\n",
-      "\"Điều 7. Vi phạm quy định về trật tự công cộng\n",
-      "..\n",
-      "3. Phạt tiền từ 2.000.000 đồng đến 3.000.000 đồng đối với một trong những hành vi sau đây:\n",
-      "a) Có hành vi khiêu khích, trêu ghẹo, xúc phạm, lăng mạ, bôi nhọ danh dự, nhân phẩm của người khác, trừ trường hợp quy định tại điểm b khoản 2 Điều 21 và Điều 54 Nghị định này;\n",
-      "...\n",
-      "14. Biện pháp khắc phục hậu quả:\n",
-      "a) Buộc khôi phục lại tình trạng ban đầu đối với hành vi vi phạm quy định tại điểm c khoản 1, điểm l khoản 2 và điểm e khoản 4 Điều này;\n",
-      "b) Buộc cải chính thông tin sai sự thật hoặc gây nhầm lẫn đối với hành vi vi phạm quy định tại điểm a khoản 3 và điểm i khoản 4 Điều này;\n",
-      "c) Buộc xin lỗi công khai đối với hành vi vi phạm quy định tại điểm a khoản 3, các điểm d và đ khoản 5 Điều này trừ trường hợp nạn nhân có đơn không yêu cầu;\n",
-      "d) Buộc thực hiện biện pháp khắc phục tình trạng ô nhiễm môi trường đối với hành vi vi phạm quy định tại điểm h khoản 5 Điều này;\n",
-      "đ) Buộc chi trả toàn bộ chi phí khám bệnh, chữa bệnh đối với hành vi vi phạm quy định tại điểm d khoản 1, điểm c khoản 2, điểm b khoản 3 và điểm a khoản 5 Điều này.\"\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 7] index=57120, score=0.5337\n",
-      "Vi phạm quy định về trật tự công cộng\n",
-      "...\n",
-      "2. Phạt tiền từ 1.000.000 đồng đến 2.000.000 đồng đối với một trong những hành vi sau đây:\n",
-      "...\n",
-      "b) Tổ chức, tham gia tụ tập nhiều người ở nơi công cộng gây mất trật tự công cộng;\n",
-      "...\n",
-      "3. Phạt tiền từ 2.000.000 đồng đến 3.000.000 đồng đối với một trong những hành vi sau đây:\n",
-      "a) Có hành vi khiêu khích, trêu ghẹo, xúc phạm, lăng mạ, bôi nhọ danh dự, nhân phẩm của người khác, trừ trường hợp quy định tại điểm b khoản 2 Điều 21 và Điều 54 Nghị định này;\n",
-      "b) Tổ chức, thuê, xúi giục, lôi kéo, dụ dỗ, kích động người khác cố ý gây thương tích hoặc gây tổn hại cho sức khỏe người khác hoặc xâm phạm danh dự, nhân phẩm của người khác nhưng không bị truy cứu trách nhiệm hình sự;\n",
-      "...\n",
-      "4. Phạt tiền từ 3.000.000 đồng đến 5.000.000 đồng đối với một trong những hành vi sau đây:\n",
-      "a) Tổ chức thuê, xúi giục, lôi kéo, dụ dỗ hoặc kích động người khác gây rối, làm mất trật tự công cộng;\n",
-      "b) Mang theo trong người hoặc tàng trữ, cất giấu các loại vũ khí thô sơ, công cụ hỗ trợ hoặc các loại công cụ, phương tiện khác có khả năng sát thương; đồ vật, phương tiện giao thông nhằm mục đích gây rối trật tự công cộng, cố ý gây thương tích cho người khác;\n",
-      "...\n",
-      "5. Phạt tiền từ 5.000.000 đồng đến 8.000.000 đồng đối với một trong những hành vi sau đây:\n",
-      "a) Cố ý gây thương tích hoặc gây tổn hại cho sức khỏe của người khác nhưng không bị truy cứu trách nhiệm hình sự;\n",
-      "b) Gây rối trật tự công cộng mà có mang theo các loại vũ khí thô sơ, công cụ hỗ trợ hoặc công cụ, đồ vật, phương tiện khác có khả năng sát thương;\n",
-      "...\n",
-      "13. Hình thức xử phạt bổ sung:\n",
-      "a) Tịch thu tang vật, phương tiện vi phạm hành chính đối với hành vi vi phạm quy định tại các điểm a, d, đ và g khoản 2; điểm đ khoản 3; các điểm b, e và i khoản 4; các điểm a, b và c khoản 5; các khoản 6 và 10 Điều này;\n",
-      "...\n",
-      "14. Biện pháp khắc phục hậu quả:\n",
-      "...\n",
-      "b) Buộc cải chính thông tin sai sự thật hoặc gây nhầm lẫn đối với hành vi vi phạm quy định tại điểm a khoản 3 và điểm i khoản 4 Điều này;\n",
-      "c) Buộc xin lỗi công khai đối với hành vi vi phạm quy định tại điểm a khoản 3, các điểm d và đ khoản 5 Điều này trừ trường hợp nạn nhân có đơn không yêu cầu;\n",
-      "...\n",
-      "đ) Buộc chi trả toàn bộ chi phí khám bệnh, chữa bệnh đối với hành vi vi phạm quy định tại điểm d khoản 1, điểm c khoản 2, điểm b khoản 3 và điểm a khoản 5 Điều này.\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 8] index=56183, score=0.5270\n",
-      "\"Điều 155. Tội làm nhục người khác\n",
-      "1. Người nào xúc phạm nghiêm trọng nhân phẩm, danh dự của người khác, thì bị phạt cảnh cáo, phạt tiền từ 10.000.000 đồng đến 30.000.000 đồng hoặc phạt cải tạo không giam giữ đến 03 năm.\n",
-      "2. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 03 tháng đến 02 năm:\n",
-      "a) Phạm tội 02 lần trở lên;\n",
-      "b) Đối với 02 người trở lên;\n",
-      "c) Lợi dụng chức vụ, quyền hạn;\n",
-      "d) Đối với người đang thi hành công vụ;\n",
-      "đ) Đối với người dạy dỗ, nuôi dưỡng, chăm sóc, chữa bệnh cho mình;\n",
-      "e) Sử dụng mạng máy tính hoặc mạng viễn thông, phương tiện điện tử để phạm tội;\n",
-      "g) Gây rối loạn tâm thần và hành vi của nạn nhân mà tỷ lệ tổn thương cơ thể từ 31% đến 60%”.\n",
-      "3. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 02 năm đến 05 năm:\n",
-      "a) Gây rối loạn tâm thần và hành vi của nạn nhân mà tỷ lệ tổn thương cơ thể 61% trở lên”;\n",
-      "b) Làm nạn nhân tự sát.\n",
-      "4. Người phạm tội còn có thể bị cấm đảm nhiệm chức vụ, cấm hành nghề hoặc làm công việc nhất định từ 01 năm đến 05 năm.\n",
-      "Điều 156. Tội vu khống\n",
-      "1. Người nào thực hiện một trong các hành vi sau đây, thì bị phạt tiền từ 10.000.000 đồng đến 50.000.000 đồng, phạt cải tạo không giam giữ đến 02 năm hoặc phạt tù từ 03 tháng đến 01 năm:\n",
-      "a) Bịa đặt hoặc loan truyền những điều biết rõ là sai sự thật nhằm xúc phạm nghiêm trọng nhân phẩm, danh dự hoặc gây thiệt hại đến quyền, lợi ích hợp pháp của người khác;\n",
-      "b) Bịa đặt người khác phạm tội và tố cáo họ trước cơ quan có thẩm quyền.\n",
-      "2. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 01 năm đến 03 năm:\n",
-      "a) Có tổ chức;\n",
-      "b) Lợi dụng chức vụ, quyền hạn;\n",
-      "c) Đối với 02 người trở lên;\n",
-      "d) Đối với ông, bà, cha, mẹ, người dạy dỗ, nuôi dưỡng, chăm sóc, giáo dục, chữa bệnh cho mình;\n",
-      "đ) Đối với người đang thi hành công vụ;\n",
-      "e) Sử dụng mạng máy tính hoặc mạng viễn thông, phương tiện điện tử để phạm tội;\n",
-      "g) Gây rối loạn tâm thần và hành vi của nạn nhân mà tỷ lệ tổn thương cơ thể từ 31% đến 60%;\n",
-      "h) Vu khống người khác phạm tội rất nghiêm trọng hoặc đặc biệt nghiêm trọng.\n",
-      "3. Phạm tội thuộc một trong các trường hợp sau đây, thì bị phạt tù từ 03 năm đến 07 năm:\n",
-      "a) Vì động cơ đê hèn;\n",
-      "b) Gây rối loạn tâm thần và hành vi của nạn nhân mà tỷ lệ tổn thương cơ thể 61% trở lên;\n",
-      "c) Làm nạn nhân tự sát.\n",
-      "4. Người phạm tội còn có thể bị phạt tiền từ 10.000.000 đồng đến 50.000.000 đồng, cấm đảm nhiệm chức vụ, cấm hành nghề hoặc làm công việc nhất định từ 01 năm đến 05 năm.\"\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 9] index=80022, score=0.5218\n",
-      "\"Điều 20.\n",
-      "1. Mọi người có quyền bất khả xâm phạm về thân thể, được pháp luật bảo hộ về sức khoẻ, danh dự và nhân phẩm; không bị tra tấn, bạo lực, truy bức, nhục hình hay bất kỳ hình thức đối xử nào khác xâm phạm thân thể, sức khỏe, xúc phạm danh dự, nhân phẩm.\"\n",
-      "--------------------------------------------------------------------------------\n",
-      "[Rank 10] index=52682, score=0.5203\n",
-      "\"Điều 589. Thiệt hại do tài sản bị xâm phạm\n",
-      "Thiệt hại do tài sản bị xâm phạm bao gồm:\n",
-      "1. Tài sản bị mất, bị hủy hoại hoặc bị hư hỏng.\n",
-      "2. Lợi ích gắn liền với việc sử dụng, khai thác tài sản bị mất, bị giảm sút.\n",
-      "3. Chi phí hợp lý để ngăn chặn, hạn chế và khắc phục thiệt hại.\n",
-      "4. Thiệt hại khác do luật quy định.\"\n",
-      "--------------------------------------------------------------------------------\n"
-     ]
-    }
-   ],
    "source": [
     "query = 'Tội xúc phạm danh dự'\n",
     "hits  = retrieval(fine_tuned_model, query, legal_index, top_k=10)\n",
     "\n",
     "for h in hits:\n",
-    "    print(f\"[Rank {hits.index(h)+1}] index={h['index']}, score={h['score']:.4f}\")\n",
-    "    print(f\"{h['text']}\\n{'-'*80}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "1bedd1a7",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# def search(model, query, index, k=10):\n",
-    "#     query_embedding = model.encode(\n",
-    "#         query, \n",
-    "#         convert_to_numpy=True, \n",
-    "#         normalize_embeddings=True,\n",
-    "#     ).astype(np.float32).reshape(1, -1)\n",
-    "\n",
-    "#     scores, indices = index.search(query_embedding, k*3)\n",
-    "#     hits = [{'score': scores[0][i], 'index': indices[0][i]} for i in range(len(scores[0]))]\n",
-    "#     return hits"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4ef857db",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# hits = search(\n",
-    "#     model=fine_tuned_model, \n",
-    "#     query='Hợp đồng lao động là gì?', \n",
-    "#     index=legal_index, \n",
-    "#     k=10\n",
-    "# )\n",
-    "\n",
-    "# for rank, hit in enumerate(hits):\n",
-    "#     print(f\"[Rank: {rank + 1}]\")\n",
-    "#     print(f\"(Index: {hit['index']}Score: {hit['score']:.4f})\\n\")\n",
-    "#     print(passages[hit['index']])\n",
-    "#     print('-' * 100)\n",
-    "#     print()"
    ]
   }
  ],

   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "01589fc8",
    "metadata": {},
    "outputs": [],
     "import faiss\n",
     "from sentence_transformers import SentenceTransformer, CrossEncoder\n",
     "\n",
     "from transformers import logging\n",
+    "logging.set_verbosity_error()\n",
+    "\n",
+    "from settings import OUTPUT_DIR, DEVICE\n",
+    "os.environ['WANDB_DISABLED'] = 'true'"
    ]
   },
   {
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "62cc0ead",
    "metadata": {},
    "outputs": [],
    "source": [
+    "passages = pd.read_parquet('data/processed/corpus_data.parquet')['text'].tolist()\n",
+    "corpus_embeddings = fine_tuned_model.encode(\n",
+    "    passages, \n",
+    "    batch_size=128,\n",
+    "    convert_to_numpy=True, \n",
+    "    normalize_embeddings=True,\n",
+    "    show_progress_bar=True, \n",
+    "    device=DEVICE,\n",
+    ").astype(np.float32)"
    ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "465e8d2a",
    "metadata": {},
    "outputs": [],
    "source": [
+    "d         = corpus_embeddings.shape[1]  # 768\n",
+    "cpu_index = faiss.IndexFlatIP(d)\n",
     "\n",
+    "res       = faiss.StandardGpuResources()\n",
+    "gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)\n",
+    "gpu_index.add(corpus_embeddings)"
    ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "af365371",
    "metadata": {},
    "outputs": [],
    "source": [
+    "final_cpu_index = faiss.index_gpu_to_cpu(gpu_index)\n",
+    "faiss.write_index(final_cpu_index, 'data/retrieval/legal_faiss.index')"
    ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "9251d0db",
    "metadata": {},
    "outputs": [],
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "ece21ef6",
    "metadata": {},
+   "outputs": [],
    "source": [
     "query = 'Tội xúc phạm danh dự'\n",
     "hits  = retrieval(fine_tuned_model, query, legal_index, top_k=10)\n",
     "\n",
     "for h in hits:\n",
+    "    print(f\"[Rank {hits.index(h)+1}] - index={h['index']}, score={h['score']:.4f}\")\n",
+    "    print(f\"{h['text']}\")\n",
+    "    print('-' * 100)"
    ]
   }
  ],