# texta-mlp-python issues

https://git.texta.ee/texta/texta-mlp-python/-/issues (feed updated 2023-11-29T12:04:56Z)

## [#59 Add support for retrieving detected language from the document](https://git.texta.ee/texta/texta-mlp-python/-/issues/59)

2023-11-29T12:04:56Z · Author: Marit Asula · Assignee: Marit Asula

If the passed document contains a language in the field `"{doc_path}_mlp.language.detected"`, use this language as `detected_language` instead of applying langdetect again.

## [#58 Add support for forcing analysis language](https://git.texta.ee/texta/texta-mlp-python/-/issues/58)

2023-11-29T12:04:42Z · Author: Marit Asula · Assignee: Marit Asula

Add support for forcing a specific language to be used for analysing the document with the Stanza analyzers.

## [#53 Always return the language detected with langdetect in process_docs](https://git.texta.ee/texta/texta-mlp-python/-/issues/53)

2023-11-29T12:04:24Z · Author: Marit Asula · Assignee: Marit Asula

**Current behaviour:**
The language detected with langdetect is only returned IF it is present in the supported languages (function `process_docs`).

Where it goes wrong:

1. It defaults to `default_lang` here, if it isn't present in the supported langs: https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/mlp.py#L351
2. The result of step 1 is then passed to the function `generate_document`: https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/mlp.py#L383
3. As the language was already passed in step 2, the language is not detected again: https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/mlp.py#L185
4. Finally, the result of step 1 is returned as `dominant_language` here: https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/mlp.py#L208

**Expected behaviour:**

The language detected with langdetect is ALWAYS returned (function `process_docs`).
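A minimal sketch of the difference; `resolve_langs`, `SUPPORTED_LANGS`, and `DEFAULT_LANG` are hypothetical stand-ins for the logic around `mlp.py#L351`, not actual texta-mlp API:

```python
SUPPORTED_LANGS = {"et", "en", "ru"}  # illustrative subset
DEFAULT_LANG = "et"

def resolve_langs(detected: str) -> dict:
    """Hypothetical helper mirroring the fallback logic in process_docs.

    `detected` is what langdetect returned for the text.
    """
    # Current behaviour: an unsupported detection is overwritten with the
    # default, so the caller never sees what langdetect actually said.
    analysis_lang = detected if detected in SUPPORTED_LANGS else DEFAULT_LANG
    # Expected behaviour: keep both values, so langdetect's answer is
    # always returned alongside the language used for analysis.
    return {"detected_lang": detected, "analysis_lang": analysis_lang}
```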
Removing rows https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/mlp.py#L351 and https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/mlp.py#L352 should fix this issue, unless it breaks something else...

## [#40 GPU selection](https://git.texta.ee/texta/texta-mlp-python/-/issues/40)

2022-06-03T07:36:29Z · Author: Wael Ramadan

**Problem:**
Currently, when trying to run MLP alongside another task that is using the GPU, a "CUDA out of memory" error is raised, yet the second GPU is not being used.

**Research:**

According to Stanza, you can control GPU selection with the environment variable `CUDA_VISIBLE_DEVICES`:
https://github.com/stanfordnlp/stanza/issues/390

**Solution:**

We can have both GPUs used by setting one worker to run using gpu-0 and another worker to run using gpu-1.
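As a sketch, each worker process could pin its GPU before torch/Stanza first initialise CUDA; the `WORKER_INDEX` variable is a hypothetical per-worker setting, not an existing config option:

```python
import os

# Must run before torch/stanza touch CUDA in this process, because
# CUDA_VISIBLE_DEVICES is only read at CUDA initialisation time.
# WORKER_INDEX is assumed: 0 for the first worker, 1 for the second,
# mapping directly to gpu-0 / gpu-1.
worker_index = int(os.getenv("WORKER_INDEX", "0"))
os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_index)
```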
This will at least give us the ability to run more tasks on the GPUs.

## [#54 Meta field for the document](https://git.texta.ee/texta/texta-mlp-python/-/issues/54)

2022-04-20T16:16:48Z · Author: Raul Sirel · Assignee: Marko Kollo

**Problem:** Currently the document contains no information on how to interpret its contents, e.g. whether the text in a certain field is tokenized as sentences.
**Idea:**

* Add a special field called `_meta` inside the document.
* Something like a mapping definition in Elasticsearch.
* Should be minimal and not puke out too much useless information.

**Open questions:**

* How is the meta for nested fields stored?
* What actually needs to be done, and where?

Tokenization & spans should always default to text!!!

Milestone: MLP & Annotator Spans

## [#52 Remove "=" from lemmas](https://git.texta.ee/texta/texta-mlp-python/-/issues/52)

2022-02-03T13:28:01Z · Author: Raul Sirel · Assignee: Raul Sirel

## [#47 Add use_gpu param via ENV](https://git.texta.ee/texta/texta-mlp-python/-/issues/47)

2021-10-18T09:52:07Z · Author: Raul Sirel · Assignee: Wael Ramadan

https://git.texta.ee/texta/texta-mlp-python/-/blob/master/worker/taskman.py#L31
This way we can override MLP trying to use CUDA each time it initiates.
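A sketch of reading such a flag from the environment; the variable name `TEXTA_MLP_USE_GPU` and the `use_gpu` parameter are assumptions for illustration, not confirmed names:

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a boolean flag from an environment variable.

    TEXTA_MLP_USE_GPU is a hypothetical variable name; the point is that
    the worker can be told not to initialise CUDA at all.
    """
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

use_gpu = env_flag("TEXTA_MLP_USE_GPU", default=True)
# The worker would then pass this through instead of always trying CUDA,
# e.g. MLP(..., use_gpu=use_gpu)  (parameter name assumed).
```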
Should also do this in TTK, as it might speed up MLP in CPU environments.

## [#24 MLP texta-toolkit-rest compatibility](https://git.texta.ee/texta/texta-mlp-python/-/issues/24)

2021-10-13T13:30:46Z · Author: Wael Ramadan

With MLP version 1.8.0 the lang structure has changed:
`"lang": {"detected_lang": "fr", "analysis_lang": "et"}`
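A sketch of how the consuming side could tolerate both shapes; the helper name is hypothetical, and the assumption that pre-1.8.0 stored a plain string under `"lang"` should be checked against texta-toolkit-rest:

```python
def get_detected_lang(mlp_result: dict):
    """Return the detected language from an MLP result.

    Handles the 1.8.0+ structure ({"detected_lang": ..., "analysis_lang": ...})
    and, as an assumption, an older plain value stored directly under "lang".
    """
    lang = mlp_result.get("lang")
    if isinstance(lang, dict):
        return lang.get("detected_lang")
    return lang  # assumed pre-1.8.0 shape

# Example with the new structure:
get_detected_lang({"lang": {"detected_lang": "fr", "analysis_lang": "et"}})  # "fr"
```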
Tests have been made in MLP, but this does not pass texta-toolkit-rest.

## [#36 Load resources on a need-be basis instead of pre-loading everything.](https://git.texta.ee/texta/texta-mlp-python/-/issues/36)

2021-10-13T13:29:03Z · Author: Marko Kollo

## [#46 Correct POS tag for proper names](https://git.texta.ee/texta/texta-mlp-python/-/issues/46)

2021-10-11T13:04:06Z · Author: Raul Sirel · Assignee: Raul Sirel

## [#38 Remove BOUNDED from default analyzers](https://git.texta.ee/texta/texta-mlp-python/-/issues/38)

2021-09-28T17:32:28Z · Author: Raul Sirel · Assignee: Wael Ramadan

The BOUNDED analyzer should not be used by default, as it creates lots of noise.

## [#45 Sentence splitting screws up NER spans because of newlines](https://git.texta.ee/texta/texta-mlp-python/-/issues/45)

2021-09-28T08:47:24Z · Author: Raul Sirel · Assignee: Raul Sirel

Happens after the first sentence.

## [#41 Profiling: Difference between Stanza and current MLP.](https://git.texta.ee/texta/texta-mlp-python/-/issues/41)

2021-09-07T16:32:55Z · Author: Marko Kollo · Assignee: Wael Ramadan

This should cover texts of different sizes: a small comment, several sentences' worth, and a whole article.

## [#37 Improve Error Handling](https://git.texta.ee/texta/texta-mlp-python/-/issues/37)

2021-08-24T07:51:44Z · Author: Raul Sirel · Assignee: Wael Ramadan

TTK log:
```
[2021-05-24 15:51:32,035: ERROR/ForkPoolWorker-1] Task apply_mlp_on_index[177ed543-581d-4193-832b-e5f36b760db2] raised unexpected: RuntimeError('stack expects a non-empty TensorList')
Traceback (most recent call last):
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/var/texta-rest/toolkit/mlp/tasks.py", line 99, in apply_mlp_on_index
    raise e
  File "/var/texta-rest/toolkit/mlp/tasks.py", line 92, in apply_mlp_on_index
    elastic_response = ed.bulk_update(actions=actions)
  File "/var/texta-rest/toolkit/elastic/decorators.py", line 18, in func_wrapper
    return func(*args, **kwargs)
  File "/var/texta-rest/toolkit/elastic/tools/document.py", line 143, in bulk_update
    return bulk(client=self.core.es, actions=actions, refresh=refresh, request_timeout=30, chunk_size=chunk_size)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 396, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 308, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 155, in _chunk_actions
    for action, data in actions:
  File "/var/texta-rest/toolkit/elastic/tools/document.py", line 161, in add_type_to_docs
    for action in actions:
  File "/var/texta-rest/toolkit/mlp/helpers.py", line 24, in process_mlp_actions
    mlp_processed = mlp_class.process_docs(document_sources, analyzers=analyzers, doc_paths=field_data)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/texta_mlp/mlp.py", line 357, in process_docs
    doc = self.generate_document(raw_text, analyzers, document, doc_paths=doc_path)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/texta_mlp/mlp.py", line 208, in generate_document
    sentences, entities = self._get_stanza_tokens(analysis_lang, processed_text) if processed_text else ([], [])
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/texta_mlp/mlp.py", line 233, in _get_stanza_tokens
    pipeline = self.stanza_pipelines[lang](raw_text)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/stanza/pipeline/core.py", line 210, in __call__
    doc = self.process(doc)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/stanza/pipeline/core.py", line 204, in process
    doc = process(doc)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/stanza/pipeline/tokenize_processor.py", line 92, in process
    no_ssplit=self.config.get('no_ssplit', False))
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/stanza/models/tokenization/utils.py", line 153, in output_predictions
    pred1 = np.argmax(trainer.predict(batch1), axis=2)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/stanza/models/tokenization/trainer.py", line 67, in predict
    pred = self.model(units, features)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/stanza/models/tokenization/model.py", line 49, in forward
    inp, _ = self.rnn(emb)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/texta-rest/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 570, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: stack expects a non-empty TensorList
```
I think we need some generic MLP error that can be logged on the TTK side; that way it wouldn't kill the worker. Some custom exception here: https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/exceptions.py
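A minimal sketch of such a wrapper; the exception name and the `safe_process` helper are hypothetical, not existing texta-mlp API:

```python
class MLPProcessingError(Exception):
    """Hypothetical generic error raised when an underlying analyzer fails."""


def safe_process(pipeline, raw_text):
    """Run a Stanza-style pipeline, converting low-level failures into one
    catchable MLP error so the caller (e.g. a TTK worker) can log the
    problem and skip the document instead of dying."""
    try:
        return pipeline(raw_text)
    except RuntimeError as e:  # e.g. "stack expects a non-empty TensorList"
        raise MLPProcessingError(f"MLP failed to process text: {e}") from e
```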
I guess that when depending on Stanza, anything could go wrong on their side, but we can't have that kill our pipelines.

## [#34 Tensor error on Finnish text.](https://git.texta.ee/texta/texta-mlp-python/-/issues/34)

2021-05-14T08:39:08Z · Author: Marko Kollo · Assignee: Marko Kollo

The attached Finnish text [finnish_text.txt](/uploads/b11ddba0a1f17ed7df150da511444f69/finnish_text.txt) throws the following exception when processing lemmas:
`RuntimeError: The expanded size of the tensor (24) must match the existing size (0) at non-singleton dimension 1. Target sizes: [0, 24]. Tensor sizes: [0]`

## [#35 1.2 model changes the way punctuation is tokenized, thus creating errors for address and phone parsing.](https://git.texta.ee/texta/texta-mlp-python/-/issues/35)

2021-05-14T08:20:23Z · Author: Marko Kollo · Assignee: Raul Sirel

To reproduce:

1. Delete the data folder with Stanza models.
1. Update Stanza to 1.2.*.
1. Run `pytest -v tests` inside the repo; the models will take a while to be re-downloaded.

Some examples:
```
> AssertionError: assert 'ул. Матросская Тишина , д. 14А' in ['ул. Матросская Тишина ,д.14А']
> AssertionError: assert 'ул. Курчатова 10а' in []
> AssertionError: assert 'vana-lõuna 39' in []
> AssertionError: assert [] == ['74956456601']
> ['4310373'] != ['89104310373']
```

## [#31 'int' object is not iterable](https://git.texta.ee/texta/texta-mlp-python/-/issues/31)

2021-03-23T18:11:03Z · Author: Ghost User · Assignee: Marko Kollo

![2021-03-23-122627_957x327_scrot](/uploads/fe8937eec6873533a645e14789abc60c/2021-03-23-122627_957x327_scrot.png)

## [#28 Email extractor too greedy](https://git.texta.ee/texta/texta-mlp-python/-/issues/28)

2021-03-17T10:28:31Z · Author: Raul Sirel · Assignee: Linda Freienthal

In files containing HTML, MLP matches "mailto:my@email.com" where it should only match "my@email.com".

## [#27 Parallel running of downstream jobs](https://git.texta.ee/texta/texta-mlp-python/-/issues/27)

2021-03-15T11:10:23Z · Author: Raul Sirel · Assignee: Raul Sirel

## [#23 Add sentence tokenization support](https://git.texta.ee/texta/texta-mlp-python/-/issues/23)

2021-03-10T10:00:18Z · Author: Ghost User · Assignee: Marko Kollo

Soon we are going to add a text summarization tool to our toolkit, and it requires sentence-tokenized text as input.
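For illustration only, a naive stdlib-only sentence splitter; in MLP itself this would presumably come from Stanza's `tokenize` processor, which handles abbreviations and multiple languages far better than a regex:

```python
import re

def naive_sentence_split(text: str) -> list:
    """Naive sentence tokenization sketch using only the standard library.

    Splits after sentence-final punctuation followed by whitespace.
    Not suitable for production: abbreviations like "ул." or "e.g."
    will be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```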