started on Mar 9, 2023
[Data Science] [RUP] Language detection pipeline experiments
Milestone ID: 67
The current workflow is as follows:
Train a separate language detection model for each language group, e.g. "fi-en-ru", "et-en-ru", "lt-lv-ru". It would be more convenient to have a single model covering all the focus languages, with an option to "turn off" the languages that are not needed. E.g. train a single model for "et-fi-en-lt-lv-ru-hr". If the selected languages are "fi", "en" and "lv" and the predictions (from highest probability to lowest) are ["et", "fi", "en", "lv", "lt", "hr", "ru"], then output "fi", since it has the highest probability among the selected languages.
Run experiments to verify whether this approach works.
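
The selection step described above could be sketched roughly as follows. This is only an illustration, assuming the single model returns its predicted language codes already sorted from highest to lowest probability; the function name `pick_language` and its parameters are hypothetical and not part of any existing pipeline code.

```python
from typing import Collection, Optional, Sequence


def pick_language(ranked: Sequence[str], selected: Collection[str]) -> Optional[str]:
    """Return the highest-ranked predicted language that is also selected.

    `ranked` is assumed to be ordered from highest to lowest probability.
    Returns None when none of the predicted languages is selected.
    """
    for lang in ranked:
        if lang in selected:
            return lang
    return None


# Example from the description: the model covers "et-fi-en-lt-lv-ru-hr",
# but only "fi", "en" and "lv" are turned on.
ranked = ["et", "fi", "en", "lv", "lt", "hr", "ru"]
print(pick_language(ranked, {"fi", "en", "lv"}))  # fi
```

Because the filter only looks at the ranking, the model's probabilities are not renormalized over the selected languages. Whether that is good enough compared with the per-group models is one of the things the experiments would need to check.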