vukosi committed on
Commit 6432040 · verified · 1 Parent(s): b923990

Update README.md

Files changed (1)
  1. README.md +151 -6
README.md CHANGED
@@ -1,12 +1,157 @@
  ---
- title: isiSwati - English Translate (Bi-directional)
- emoji: 🌸
- colorFrom: purple
- colorTo: pink
  sdk: gradio
  sdk_version: 5.33.2
  app_file: app.py
- pinned: true
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Siswati-English Linguistic Translation Tool
+ emoji: 🔬
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
  sdk_version: 5.33.2
  app_file: app.py
+ pinned: false
+ license: apache-2.0
+ tags:
+ - translation
+ - siswati
+ - linguistics
+ - african-languages
+ - nlp
+ - research
+ - corpus-analysis
+ - bantu-languages
+ - m2m100
+ - multilingual
  ---
 
+ # 🔬 Siswati-English Linguistic Translation Tool
+
+ An advanced AI-powered translation system with comprehensive linguistic analysis features, designed specifically for linguists, researchers, and language documentation projects working with Siswati and English.
+
+ ## 🌟 Features
+
+ ### 🔄 Translation Capabilities
+ - **Bidirectional Translation**: High-quality English ↔ Siswati translation
+ - **Advanced Model Architecture**: Built on M2M100 transformer models
+ - **Batch Processing**: Process multiple texts simultaneously for corpus analysis
+ - **Real-time Analysis**: Instant linguistic metrics and feature detection
+
+ ### 📊 Linguistic Analysis
+ - **Morphological Complexity**: Word length, sentence structure analysis
+ - **Lexical Diversity**: Vocabulary richness measurements
+ - **Language-Specific Features**: Siswati agglutination, click consonants, tone markers
+ - **Translation Ratios**: Comparative analysis between source and target languages
+ - **Statistical Metrics**: Character count, word count, sentence segmentation
+
+ ### 🔬 Research Tools
+ - **Translation History**: Track and analyze translation patterns over time
+ - **CSV Export**: Research-ready data export for statistical analysis
+ - **Corpus Management**: Batch processing for linguistic corpora
+ - **Performance Metrics**: Processing time and efficiency tracking
+
+ ## 🗣️ About Siswati
+
+ **Siswati** (also known as **Swati** or **Swazi**) is a Bantu language spoken by approximately 2.3 million people, primarily in:
+ - 🇸🇿 **Eswatini** (Kingdom of Eswatini) - Official language
+ - 🇿🇦 **South Africa** - One of 11 official languages
+
+ ### Linguistic Features
+ - **Language Family**: Niger-Congo → Bantu → Southeast Bantu
+ - **Script**: Latin alphabet
+ - **Characteristics**: Agglutinative morphology, click consonants, tonal
+ - **ISO Code**: ss (ISO 639-1), ssw (ISO 639-3)
+
+ ## 🤖 Model Information
+
+ This tool uses state-of-the-art transformer models developed by the **Data Science for Social Impact Research Group**:
+
+ - **English → Siswati**: `dsfsi/en-ss-m2m100-combo`
+ - **Siswati → English**: `dsfsi/ss-en-m2m100-combo`
+
+ Both models are based on Meta's M2M100 architecture, fine-tuned specifically for Siswati-English translation pairs.
+
+ ## 🎯 Use Cases
+
+ ### For Linguists & Researchers
+ - **Language Documentation**: Analyze translation patterns and linguistic features
+ - **Corpus Studies**: Process large text collections with batch translation
+ - **Comparative Analysis**: Study morphological and syntactic differences
+ - **Quality Assessment**: Evaluate translation adequacy and fluency
+
+ ### For Educators & Students
+ - **Language Learning**: Understand translation patterns and linguistic structures
+ - **Academic Research**: Export data for statistical analysis and publications
+ - **Computational Linguistics**: Study machine translation for low-resource languages
+
+ ### For Community & Cultural Projects
+ - **Language Preservation**: Support Siswati language documentation efforts
+ - **Cultural Exchange**: Facilitate communication between English and Siswati speakers
+ - **Content Translation**: Assist in translating educational and cultural materials
+
+ ## 🚀 Getting Started
+
+ 1. **Single Translation**: Enter text and select translation direction
+ 2. **Batch Processing**: Upload `.txt` files or paste multiple lines for corpus analysis
+ 3. **Analysis Export**: Use the research tools to export translation data as CSV
+ 4. **Linguistic Study**: Explore the real-time analysis features for detailed insights
+
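The batch-processing and CSV-export steps above can be sketched as follows. This is only an illustration under stated assumptions: `translate_batch`, `to_csv`, the column names, and the `fake_translate` stand-in are hypothetical and are not taken from `app.py`. In the real Space the callable would wrap one of the M2M100 models.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List

# Hypothetical column layout; the Space's actual export format may differ.
FIELDNAMES = ["direction", "source", "target"]


def translate_batch(
    lines: Iterable[str],
    translate: Callable[[str], str],
    direction: str,
) -> List[Dict[str, str]]:
    """Translate each non-empty line and collect one row per line."""
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines in pasted or uploaded text
        rows.append(
            {"direction": direction, "source": text, "target": translate(text)}
        )
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the collected rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for a real model call (assumption): upper-casing is NOT translation.
fake_translate = str.upper

rows = translate_batch("hello\n\nthank you\n".splitlines(), fake_translate, "en-ss")
csv_text = to_csv(rows)
print(csv_text)
```

Passing the translation function in as a parameter keeps the batching and export logic testable without loading a model.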
+ ## 📈 Linguistic Metrics Explained
+
+ ### Text Complexity
+ - **Word Count**: Total number of words in the text
+ - **Character Count**: Total characters including spaces and punctuation
+ - **Sentence Count**: Number of sentences detected
+ - **Average Word Length**: Mean character length per word
+ - **Lexical Diversity**: Ratio of unique words to total words (vocabulary richness)
+
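A minimal sketch of how the Text Complexity metrics listed above could be computed. The function name `text_complexity`, the regex-based tokenization, and the punctuation-based sentence split are assumptions made for illustration. The README does not say how `app.py` tokenizes text.

```python
import re


def text_complexity(text: str) -> dict:
    """Compute the five Text Complexity metrics for one text."""
    # Assumption: a word is a run of letters, optionally joined by ' or -.
    words = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    # Assumption: sentences end at ., ! or ? (no abbreviation handling).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(text),  # includes spaces and punctuation
        "sentence_count": len(sentences),
        "avg_word_length": (
            sum(len(w) for w in words) / word_count if word_count else 0.0
        ),
        # Unique words divided by total words (type-token ratio).
        "lexical_diversity": (
            len(set(words)) / word_count if word_count else 0.0
        ),
    }


metrics = text_complexity("Sawubona. Unjani? Sawubona!")
print(metrics)
```

In this example the repeated greeting gives two unique words out of three, so the lexical diversity is 2/3.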
+ ### Translation Analysis
+ - **Word Ratio**: Target word count / Source word count
+ - **Character Ratio**: Target character count / Source character count
+ - **Processing Time**: Time taken for model inference
+
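The two ratios and the timing metric above can be sketched as below. The helper names `translation_ratios` and `timed`, whitespace-based word splitting, and the sample sentence pair are illustrative assumptions, not the Space's actual code.

```python
import time
from typing import Callable, Tuple


def translation_ratios(source: str, target: str) -> dict:
    """Return target/source ratios for word and character counts."""
    src_words = len(source.split())  # assumption: whitespace tokenization
    tgt_words = len(target.split())
    return {
        "word_ratio": tgt_words / src_words if src_words else 0.0,
        "char_ratio": len(target) / len(source) if source else 0.0,
    }


def timed(translate: Callable[[str], str], text: str) -> Tuple[str, float]:
    """Run a translation callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = translate(text)
    return result, time.perf_counter() - start


# Illustrative pair only; not produced by the models.
ratios = translation_ratios("I am fine", "Ngiyaphila")
result, seconds = timed(str.upper, "hello")  # str.upper is a stand-in
print(ratios, result, seconds)
```

A word ratio well below 1 for English to Siswati is plausible because agglutination packs several English words into one Siswati word, while the character ratio stays closer to 1.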
+ ### Siswati-Specific Features
+ - **Agglutination Detection**: Identification of potentially agglutinated words (>10 characters)
+ - **Click Consonants**: Count of clicks (c, q, x sounds)
+ - **Tone Markers**: Detection of acute (́) and grave (̀) accent marks
+
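A rough sketch of the three Siswati-specific heuristics described above, assuming they work on orthography alone. The function name `siswati_features` and the punctuation stripping are hypothetical. Counting the letters c, q, and x is only a proxy for click sounds, and the length threshold follows the ">10 characters" rule stated in the list.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent
GRAVE = "\u0300"  # combining grave accent


def siswati_features(text: str) -> dict:
    """Apply the length, click-letter, and tone-mark heuristics to a text."""
    # NFD splits precomposed letters (e.g. U+00E1) into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: words are whitespace-separated, with edge punctuation removed.
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    long_words = [w for w in words if len(w) > 10]
    lowered = decomposed.lower()
    return {
        "agglutinated_candidates": long_words,
        "click_letters": sum(lowered.count(ch) for ch in "cqx"),
        "tone_marks": decomposed.count(ACUTE) + decomposed.count(GRAVE),
    }


features = siswati_features("Ngiyakwemukela, uqale! \u00e1")
print(features)
```

Normalizing to NFD first means both precomposed accented letters and separately typed combining marks are counted the same way.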
+ ## 📚 Academic Usage
+
+ If you use this tool in your research, please cite the original models:
+
+ ```bibtex
+ @misc{dsfsi-siswati-translation,
+ title={Siswati-English Translation Models},
+ author={Marivate, Vukosi and Lastrucci, Richard},
+ year={2024},
+ publisher={Data Science for Social Impact Research Group},
+ url={https://github.com/dsfsi/}
+ }
+ ```
+
+ ## 🔗 Related Resources
+
+ - **Model Repositories**: [En-Ss Model](https://github.com/dsfsi/en-ss-m2m100-combo) | [Ss-En Model](https://github.com/dsfsi/ss-en-m2m100-combo)
+ - **Research Group**: [DSFSI](https://dsfsi.github.io/)
+ - **Feedback**: [Research Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/viewform)
+
+ ## 🤝 Contributing
+
+ We welcome contributions from the linguistic and NLP communities! Areas of interest:
+ - Improving translation quality
+ - Adding more linguistic analysis features
+ - Expanding to other African languages
+ - Enhancing the user interface for research workflows
+
+ ## 📄 License
+
+ This project is licensed under the Apache 2.0 License. The underlying models may have their own licensing terms - please check the individual model repositories.
+
+ ## 🌍 Supporting African Languages
+
+ This tool is part of a broader effort to support African language technology and computational linguistics research. By providing advanced NLP tools for Siswati, we aim to:
+
+ - Preserve and promote African languages in the digital age
+ - Support linguistic research and documentation
+ - Enable better communication across language barriers
+ - Contribute to the development of multilingual AI systems
+
+ ---
+
+ **Built with ❤️ for the African NLP community**