---
title: Vietnamese Legal Doc Retrieval
emoji: 🏆
colorFrom: indigo
colorTo: pink
sdk: docker
pinned: false
short_description: Fine-tuned Retrieval System for Vietnamese Legal Documents
models:
- YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs
datasets:
- YuITC/Vietnamese-Legal-Doc-Retrieval-Data
---

# Vietnamese Legal Document Retrieval System

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/YuITC/Vietnamese-Legal-Doc-Retrieval)
[![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Model-HF%20Hub-yellow)](https://huggingface.co/YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs)
[![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-HF%20Hub-green)](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data)

A retrieval system designed for Vietnamese legal documents, built on a fine-tuned SBERT (Sentence-BERT) model.

## 📌 Overview

This project implements a system that retrieves relevant Vietnamese legal documents for a user query. It uses a fine-tuned multilingual BERT model to encode legal queries and documents into a shared semantic vector space, so results are ranked by meaning rather than by keyword overlap alone.

![Gradio Interface Demo](assets/gradio_demo.png)

## 🔑 Key features

- Step-by-step notebooks that walk through the whole pipeline.
- Fine-tuned SBERT model specialized for Vietnamese legal document retrieval.
- FAISS indexing for efficient vector search.
- Evaluation based on MTEB.
- Interactive web interface for quick legal document search.
- High-performance retrieval of relevant legal passages.

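
As a rough illustration of the idea described in the overview (retrieval by meaning through nearest-neighbour search in an embedding space), the sketch below ranks a few toy document vectors by cosine similarity to a query vector. The vectors, the document ids, and the `normalize` and `top_k` helpers are all hypothetical. In the project itself the embeddings come from the fine-tuned SBERT model and the search is handled by a FAISS index, not by this brute-force loop.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings", made up for illustration only.
docs = {
    "marriage_registration": [0.9, 0.1, 0.0],
    "labor_rights": [0.1, 0.9, 0.1],
    "defamation": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for an encoded marriage-related question

results = top_k(query, docs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A brute-force scan like this is linear in the corpus size, which is why a dedicated vector index becomes useful once the corpus grows to many thousands of passages.
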
## 🛠️ Installation & Usage

```bash
# Install dependencies
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
pip install -r requirements.txt

# Running the Application
python main.py
```

The application will start a local web server with the Gradio interface, allowing you to enter legal queries and retrieve relevant documents.

## 📂 Project Structure

```
Vietnamese-Legal-Doc-Retrieval/
├── assets/                      # Visual assets for documentation
│   └── gradio_demo.png          # Screenshot of the Gradio demo interface
├── cache/                       # Cached model files
│   └── VN-legalDocs-SBERT/      # Cached BERT model files
├── data/                        # Dataset files
│   ├── original/                # Original downloaded dataset
│   │   ├── corpus.csv           # Raw corpus documents
│   │   ├── train_split.csv      # Training data
│   │   ├── val_split.csv        # Validation data
│   │   └── ...
│   ├── processed/               # Processed dataset files
│   │   ├── corpus_data.parquet  # Processed corpus for embedding
│   │   ├── train_data.parquet   # Processed training data
│   │   └── test_data.parquet    # Processed test data
│   └── retrieval/               # Files for retrieval system
│       └── legal_faiss.index    # FAISS index for fast vector search
├── models/                      # Trained model files
│   └── VN-legalDocs-SBERT/      # Fine-tuned BERT model for legal documents
│       ├── model.safetensors    # Model weights
│       ├── config.json          # Model configuration
│       └── checkpoint-*/        # Training checkpoints
├── results/                     # Evaluation results
├── Dockerfile                   # Docker configuration for deployment
├── main.py                      # Main application entry point
├── requirements.txt             # Python dependencies
├── settings.py                  # Configuration settings
└── step_*_*.ipynb               # Jupyter notebooks for each step of the process
```

## 💾 Dataset

The system is trained on a Vietnamese legal document corpus containing:

- Legal texts from various domains
- Query-document
pairs for training and evaluation
- Data processed and structured for semantic search training

The dataset is available on [Hugging Face](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data). It is a modified version of the base dataset credited in the Acknowledgments section.

## 📊 Model Training Process

The project builds the retrieval system in four steps:

1. **Data Preparation** (`step_01_Prepare_Data.ipynb`):
   - Processes raw legal documents
   - Creates query-document pairs for training
   - Formats data for the embedding model

2. **SBERT Fine-tuning** (`step_02_Finetune_SBERT.ipynb`):
   - Fine-tunes a multilingual BERT model with legal document pairs
   - Uses `CachedMultipleNegativesRankingLoss` for training
   - Optimizes for semantic similarity in legal context

3. **Evaluation** (`step_03_Eval_with_MTEB.ipynb`):
   - Evaluates model performance using retrieval metrics
   - Compares with baseline models

4. **Retrieval System Setup** (`step_04_Retrieval.ipynb`):
   - Creates FAISS index from document embeddings
   - Implements efficient search functionality
   - Prepares for deployment

## 🔍 Usage Examples

The system accepts natural language queries in Vietnamese related to legal topics. Example queries:

- "Tội xúc phạm danh dự?" (Crimes against honor?)
- "Quyền lợi của người lao động?" (Rights of workers?)
- "Thủ tục đăng ký kết hôn?" (Marriage registration procedures?)

## 🧪 Performance

The fine-tuned model was evaluated using the [MTEB benchmark](https://github.com/embeddings-benchmark/mteb) on the BKAILegalDocRetrieval dataset.

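
As a guide to reading the scores reported in this section, the sketch below computes Precision@k, Recall@k, reciprocal rank, and NDCG@k for a single query. It assumes binary relevance (a document is either relevant or not), and the function names and toy ranking are hypothetical. It is not MTEB's implementation, and MAP is left out for brevity. The benchmark reports these per-query values averaged over all test queries.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top k, or 0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG with a log2 position discount, divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the system returned five passages, two of which are relevant.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

for k in (1, 5):
    print(
        k,
        precision_at_k(ranked, relevant, k),
        recall_at_k(ranked, relevant, k),
        reciprocal_rank_at_k(ranked, relevant, k),
        round(ndcg_at_k(ranked, relevant, k), 3),
    )
```

The toy ranking also shows why Precision@k tends to fall as k grows while Recall@k rises: each query has only a few relevant passages, so a longer result list mostly adds non-relevant entries.
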
Key results:

| Metric        | @k  | Pre-trained model score (%) | Fine-tuned model score (%) |
|---------------|-----|-----------------------------|----------------------------|
| **NDCG**      | 1   | 0.007                       | 42.425                     |
|               | 5   | 0.011                       | 57.387                     |
|               | 10  | 0.023                       | 60.389                     |
|               | 20  | 0.049                       | 62.160                     |
|               | 100 | 0.147                       | 63.894                     |
| **MAP**       | 1   | 0.007                       | 40.328                     |
|               | 5   | 0.009                       | 52.297                     |
|               | 10  | 0.014                       | 53.608                     |
|               | 20  | 0.021                       | 54.136                     |
|               | 100 | 0.033                       | 54.418                     |
| **Recall**    | 1   | 0.007                       | 40.328                     |
|               | 5   | 0.017                       | 70.466                     |
|               | 10  | 0.054                       | 79.407                     |
|               | 20  | 0.157                       | 86.112                     |
|               | 100 | 0.713                       | 94.805                     |
| **Precision** | 1   | 0.007                       | 42.425                     |
|               | 5   | 0.003                       | 15.119                     |
|               | 10  | 0.005                       | 8.587                      |
|               | 20  | 0.008                       | 4.687                      |
|               | 100 | 0.007                       | 1.045                      |
| **MRR**       | 1   | 0.007                       | 42.418                     |
|               | 5   | 0.010                       | 54.337                     |
|               | 10  | 0.014                       | 55.510                     |
|               | 20  | 0.021                       | 55.956                     |
|               | 100 | 0.033                       | 56.172                     |

- **NDCG@k (Normalized Discounted Cumulative Gain)**: Measures ranking quality by scoring the relevance of each result with a logarithmic position-based discount.
- **MAP@k (Mean Average Precision)**: Computes the average precision for each query up to rank k (the mean of the precision values at each rank where a relevant document is retrieved), then averages across all queries.
- **Recall@k**: The proportion of all relevant documents that are retrieved in the top k results.
- **Precision@k**: The proportion of the top k retrieved documents that are relevant.
- **MRR@k (Mean Reciprocal Rank)**: The average, across all queries, of the reciprocal rank of the first relevant document.

The fine-tuned model substantially outperforms the pre-trained baseline on every metric. The main evaluation score (NDCG@10) reaches 60.4%, compared with 0.023% before fine-tuning, which indicates strong performance on Vietnamese legal document retrieval.

## 🐳 Docker Deployment

The project includes a Docker configuration for easy deployment. The Docker image is built on `continuumio/miniconda3` and includes GPU support via PyTorch CUDA and FAISS-GPU.

```bash
# Build the Docker image
docker build -t vietnamese-legal-retrieval .
# Run the container
docker run -p 7860:7860 vietnamese-legal-retrieval
```

The container:

- Uses Python 3.10 with CUDA 12.1 support
- Installs required dependencies from requirements.txt
- Exposes port 7860 for the Gradio web interface
- Sets proper environment variables for security and performance
- Runs as a non-root user for enhanced security

You can access the web interface by navigating to `http://localhost:7860` after starting the container.

## 📜 License

This project is licensed under the MIT License. Feel free to modify and distribute it as needed.

## 🤝 Acknowledgments

Thanks to:

- [BKAI Legal Retrieval Dataset](https://huggingface.co/datasets/tmnam20/BKAI-Legal-Retrieval) for the original data
- [Sentence Transformers](https://www.sbert.net/) library for the embedding model architecture
- [Hugging Face](https://huggingface.co/) for hosting the model and dataset

If you find this project useful, consider ⭐️ starring the repository or contributing to further improvements!

## 📬 Contact

For any questions or collaboration opportunities, feel free to reach out:

📧 Email: tainguyenphu2502@gmail.com