Update README.md
README.md
CHANGED
@@ -1,12 +1,157 @@
|
|
1 |
---
|
2 |
-
title:
|
3 |
-
emoji:
|
4 |
-
colorFrom:
|
5 |
-
colorTo:
|
6 |
sdk: gradio
|
7 |
sdk_version: 5.33.2
|
8 |
app_file: app.py
|
9 |
-
pinned:
|
10 |
---
|
11 |
|
12 |
-
|
1 |
---
|
2 |
+
title: Siswati-English Linguistic Translation Tool
|
3 |
+
emoji: 🔬
|
4 |
+
colorFrom: blue
|
5 |
+
colorTo: green
|
6 |
sdk: gradio
|
7 |
sdk_version: 5.33.2
|
8 |
app_file: app.py
|
9 |
+
pinned: false
|
10 |
+
license: apache-2.0
|
11 |
+
tags:
|
12 |
+
- translation
|
13 |
+
- siswati
|
14 |
+
- linguistics
|
15 |
+
- african-languages
|
16 |
+
- nlp
|
17 |
+
- research
|
18 |
+
- corpus-analysis
|
19 |
+
- bantu-languages
|
20 |
+
- m2m100
|
21 |
+
- multilingual
|
22 |
---
|
23 |
|
24 |
+
# 🔬 Siswati-English Linguistic Translation Tool
|
25 |
+
|
26 |
+
A machine translation tool with built-in linguistic analysis, designed for linguists, researchers, and language documentation projects working with Siswati and English.
|
27 |
+
|
28 |
+
## Features
|
29 |
+
|
30 |
+
### Translation Capabilities
|
31 |
+
- **Bidirectional Translation**: High-quality English ↔ Siswati translation
|
32 |
+
- **Advanced Model Architecture**: Built on M2M100 transformer models
|
33 |
+
- **Batch Processing**: Process multiple texts simultaneously for corpus analysis
|
34 |
+
- **Real-time Analysis**: Instant linguistic metrics and feature detection
|
35 |
+
|
36 |
+
### Linguistic Analysis
|
37 |
+
- **Morphological Complexity**: Word length, sentence structure analysis
|
38 |
+
- **Lexical Diversity**: Vocabulary richness measurements
|
39 |
+
- **Language-Specific Features**: Siswati agglutination, click consonants, tone markers
|
40 |
+
- **Translation Ratios**: Comparative analysis between source and target languages
|
41 |
+
- **Statistical Metrics**: Character count, word count, sentence segmentation
|
42 |
+
|
43 |
+
### 🔬 Research Tools
|
44 |
+
- **Translation History**: Track and analyze translation patterns over time
|
45 |
+
- **CSV Export**: Research-ready data export for statistical analysis
|
46 |
+
- **Corpus Management**: Batch processing for linguistic corpora
|
47 |
+
- **Performance Metrics**: Processing time and efficiency tracking
|
48 |
+
|
49 |
+
## 🗣️ About Siswati
|
50 |
+
|
51 |
+
**Siswati** (also known as **Swati** or **Swazi**) is a Bantu language spoken by approximately 2.3 million people, primarily in:
|
52 |
+
- 🇸🇿 **Eswatini** (Kingdom of Eswatini) - Official language
|
53 |
+
- 🇿🇦 **South Africa** - One of 12 official languages
|
54 |
+
|
55 |
+
### Linguistic Features
|
56 |
+
- **Language Family**: Niger-Congo → Bantu → Southeast Bantu
|
57 |
+
- **Script**: Latin alphabet
|
58 |
+
- **Characteristics**: Agglutinative morphology, click consonants, tonal
|
59 |
+
- **ISO Code**: ss (ISO 639-1), ssw (ISO 639-3)
|
60 |
+
|
61 |
+
## 🤖 Model Information
|
62 |
+
|
63 |
+
This tool uses state-of-the-art transformer models developed by the **Data Science for Social Impact Research Group**:
|
64 |
+
|
65 |
+
- **English → Siswati**: `dsfsi/en-ss-m2m100-combo`
|
66 |
+
- **Siswati → English**: `dsfsi/ss-en-m2m100-combo`
|
67 |
+
|
68 |
+
Both models are based on Meta's M2M100 architecture, fine-tuned specifically for Siswati-English translation pairs.
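
As a rough illustration of how the two checkpoints could be called directly, here is a minimal sketch using the Hugging Face `transformers` M2M100 classes. It assumes the fine-tuned checkpoints ship an M2M100 tokenizer and accept the language codes `en` and `ss`; the `MODEL_IDS` table and the `translate` helper are hypothetical names, not part of this Space's `app.py`.

```python
# Hypothetical lookup table: (source, target) language codes -> model repository.
MODEL_IDS = {
    ("en", "ss"): "dsfsi/en-ss-m2m100-combo",
    ("ss", "en"): "dsfsi/ss-en-m2m100-combo",
}


def translate(text: str, src: str = "en", tgt: str = "ss") -> str:
    """Translate one text with the checkpoint for the given direction."""
    # Imported lazily so the lookup table can be used without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model_id = MODEL_IDS[(src, tgt)]
    tokenizer = M2M100Tokenizer.from_pretrained(model_id)
    model = M2M100ForConditionalGeneration.from_pretrained(model_id)

    # Assumption: the checkpoints keep M2M100's language codes ("en", "ss").
    tokenizer.src_lang = src
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling `translate("Good morning", "en", "ss")` downloads the English-to-Siswati checkpoint on first use, so expect a delay and a network requirement.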
|
69 |
+
|
70 |
+
## 🎯 Use Cases
|
71 |
+
|
72 |
+
### For Linguists & Researchers
|
73 |
+
- **Language Documentation**: Analyze translation patterns and linguistic features
|
74 |
+
- **Corpus Studies**: Process large text collections with batch translation
|
75 |
+
- **Comparative Analysis**: Study morphological and syntactic differences
|
76 |
+
- **Quality Assessment**: Evaluate translation adequacy and fluency
|
77 |
+
|
78 |
+
### For Educators & Students
|
79 |
+
- **Language Learning**: Understand translation patterns and linguistic structures
|
80 |
+
- **Academic Research**: Export data for statistical analysis and publications
|
81 |
+
- **Computational Linguistics**: Study machine translation for low-resource languages
|
82 |
+
|
83 |
+
### For Community & Cultural Projects
|
84 |
+
- **Language Preservation**: Support Siswati language documentation efforts
|
85 |
+
- **Cultural Exchange**: Facilitate communication between English and Siswati speakers
|
86 |
+
- **Content Translation**: Assist in translating educational and cultural materials
|
87 |
+
|
88 |
+
## Getting Started
|
89 |
+
|
90 |
+
1. **Single Translation**: Enter text and select translation direction
|
91 |
+
2. **Batch Processing**: Upload `.txt` files or paste multiple lines for corpus analysis
|
92 |
+
3. **Analysis Export**: Use the research tools to export translation data as CSV
|
93 |
+
4. **Linguistic Study**: Explore the real-time analysis features for detailed insights
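
For readers who want to post-process exported data, the sketch below shows one way a translation history could be serialized to CSV with Python's standard library. The column names in `FIELDS` and the `history_to_csv` helper are assumptions made for illustration; the actual export produced by the tool may use different columns.

```python
import csv
import io

# Hypothetical column layout; the tool's real export may differ.
FIELDS = ["source", "translation", "direction", "processing_time_s"]


def history_to_csv(rows: list[dict]) -> str:
    """Serialize a list of translation records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The returned string can be written to a `.csv` file and loaded into a spreadsheet or a statistics package.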
|
94 |
+
|
95 |
+
## Linguistic Metrics Explained
|
96 |
+
|
97 |
+
### Text Complexity
|
98 |
+
- **Word Count**: Total number of words in the text
|
99 |
+
- **Character Count**: Total characters including spaces and punctuation
|
100 |
+
- **Sentence Count**: Number of sentences detected
|
101 |
+
- **Average Word Length**: Mean character length per word
|
102 |
+
- **Lexical Diversity**: Ratio of unique words to total words (vocabulary richness)
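
The metrics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the function name `text_metrics`, the word pattern, and the sentence-splitting rule are all hypothetical choices.

```python
import re


def text_metrics(text: str) -> dict:
    """Compute the basic text complexity metrics for one text."""
    # Assumption: a word is a run of word characters, optionally joined
    # by apostrophes or hyphens; case is ignored when comparing words.
    words = re.findall(r"\w+(?:['-]\w+)*", text.lower())
    # Assumption: sentences end at '.', '!' or '?'; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(text),  # includes spaces and punctuation
        "sentence_count": len(sentences),
        "avg_word_length": (
            sum(len(w) for w in words) / word_count if word_count else 0.0
        ),
        "lexical_diversity": (
            len(set(words)) / word_count if word_count else 0.0
        ),
    }
```

Lexical diversity computed this way is a plain type-token ratio, so it tends to fall as texts get longer; compare texts of similar length.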
|
103 |
+
|
104 |
+
### Translation Analysis
|
105 |
+
- **Word Ratio**: Target word count / Source word count
|
106 |
+
- **Character Ratio**: Target character count / Source character count
|
107 |
+
- **Processing Time**: Time taken for model inference
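
A minimal sketch of the two ratios and the timing measurement follows. It assumes whitespace tokenization for word counts; `translation_ratios` and `timed` are hypothetical helper names.

```python
import time


def translation_ratios(source: str, target: str) -> dict:
    """Return target/source ratios for word and character counts."""
    # Assumption: words are whitespace-separated tokens.
    src_words = len(source.split())
    tgt_words = len(target.split())
    return {
        "word_ratio": tgt_words / src_words if src_words else 0.0,
        "char_ratio": len(target) / len(source) if source else 0.0,
    }


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

For an agglutinative target language, a word ratio well below 1 alongside a character ratio near or above 1 is the expected pattern, since one long word can carry what English spreads over several.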
|
108 |
+
|
109 |
+
### Siswati-Specific Features
|
110 |
+
- **Agglutination Detection**: Identification of potentially agglutinated words (>10 characters)
|
111 |
+
- **Click Consonants**: Count of clicks (c, q, x sounds)
|
112 |
+
- **Tone Markers**: Detection of acute (´) and grave (`) accent marks
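
These heuristics can be approximated as follows. The sketch assumes that clicks are counted as occurrences of the letters c, q, and x, that "long" means more than 10 characters after stripping surrounding punctuation, and that tone marks appear as combining acute (U+0301) or grave (U+0300) accents once the text is decomposed. The name `siswati_features` is hypothetical.

```python
import unicodedata

CLICK_LETTERS = set("cqx")
TONE_MARKS = {"\u0301", "\u0300"}  # combining acute, combining grave
PUNCTUATION = ".,;:!?\"'()"


def siswati_features(text: str) -> dict:
    """Count surface cues for agglutination, clicks, and tone marking."""
    # Strip surrounding punctuation so it does not inflate word length.
    words = [w.strip(PUNCTUATION) for w in text.split()]
    long_words = [w for w in words if len(w) > 10]
    # Letter-based proxy: every c, q, or x counts as one click.
    click_count = sum(ch in CLICK_LETTERS for ch in text.lower())
    # NFD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    tone_count = sum(ch in TONE_MARKS for ch in decomposed)
    return {
        "long_word_count": len(long_words),
        "click_count": click_count,
        "tone_mark_count": tone_count,
    }
```

These are orthographic proxies only: a word longer than 10 characters is not necessarily agglutinated, and a letter count cannot tell a click from a loanword spelling.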
|
113 |
+
|
114 |
+
## Academic Usage
|
115 |
+
|
116 |
+
If you use this tool in your research, please cite the original models:
|
117 |
+
|
118 |
+
```bibtex
|
119 |
+
@misc{dsfsi-siswati-translation,
|
120 |
+
title={Siswati-English Translation Models},
|
121 |
+
author={Marivate, Vukosi and Lastrucci, Richard},
|
122 |
+
year={2024},
|
123 |
+
publisher={Data Science for Social Impact Research Group},
|
124 |
+
url={https://github.com/dsfsi/}
|
125 |
+
}
|
126 |
+
```
|
127 |
+
|
128 |
+
## Related Resources
|
129 |
+
|
130 |
+
- **Model Repositories**: [En-Ss Model](https://github.com/dsfsi/en-ss-m2m100-combo) | [Ss-En Model](https://github.com/dsfsi/ss-en-m2m100-combo)
|
131 |
+
- **Research Group**: [DSFSI](https://dsfsi.github.io/)
|
132 |
+
- **Feedback**: [Research Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/viewform)
|
133 |
+
|
134 |
+
## 🤝 Contributing
|
135 |
+
|
136 |
+
We welcome contributions from the linguistic and NLP communities! Areas of interest:
|
137 |
+
- Improving translation quality
|
138 |
+
- Adding more linguistic analysis features
|
139 |
+
- Expanding to other African languages
|
140 |
+
- Enhancing the user interface for research workflows
|
141 |
+
|
142 |
+
## License
|
143 |
+
|
144 |
+
This project is licensed under the Apache 2.0 License. The underlying models may have their own licensing terms - please check the individual model repositories.
|
145 |
+
|
146 |
+
## Supporting African Languages
|
147 |
+
|
148 |
+
This tool is part of a broader effort to support African language technology and computational linguistics research. By providing advanced NLP tools for Siswati, we aim to:
|
149 |
+
|
150 |
+
- Preserve and promote African languages in the digital age
|
151 |
+
- Support linguistic research and documentation
|
152 |
+
- Enable better communication across language barriers
|
153 |
+
- Contribute to the development of multilingual AI systems
|
154 |
+
|
155 |
+
---
|
156 |
+
|
157 |
+
**Built with ❤️ for the African NLP community**
|