texta-mlp-python issues
https://git.texta.ee/texta/texta-mlp-python/-/issues

https://git.texta.ee/texta/texta-mlp-python/-/issues/57
Mapping NER entity names (Marit Asula, 2023-05-22)

Read more: https://github.com/stanfordnlp/stanza/issues/904

The names and the number of entities extracted by NER models vary with the language / NER model. It would be more convenient if entities conveying the same information were mapped together and added under the same fact name (e.g. "PER", "PERS", "PERSON" -> "PER"). It would also be nice to add a selection menu letting the user decide which entities to extract, though this might be difficult to implement: the language of the documents is often unknown beforehand, so the available options would not be known either.
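A small normalization table, applied before facts are stored, would collapse equivalent labels. The sketch below is illustrative: the `ENTITY_MAP` name and the choice of canonical labels are assumptions (whether e.g. "GPE", a geopolitical entity, should fold into "LOC" is a per-project decision); the label names come from the per-language inventories listed below.

```python
# Illustrative normalization table (an assumption, not the project's final
# mapping): model-specific NER labels -> one canonical fact name.
# Folding "GPE" into "LOC" is a judgment call and may not suit every project.
ENTITY_MAP = {
    "PER": "PER", "PERS": "PER", "PERSON": "PER",
    "LOC": "LOC", "LOCATION": "LOC", "GPE": "LOC",
    "ORG": "ORG", "ORGANIZATION": "ORG",
    "MISC": "MISC", "MISCELLANEOUS": "MISC",
    "EVT": "EVENT", "EVENT": "EVENT",
}

def normalize_entity(label: str) -> str:
    """Return the canonical fact name for a model-specific NER label;
    labels without a mapping pass through unchanged."""
    return ENTITY_MAP.get(label, label)
```

With this, `normalize_entity("PERSON")` and `normalize_entity("PERS")` both yield `"PER"`, while labels without an equivalent, such as `"DATE"`, pass through unchanged.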
```
{
"fr": [
"LOC",
"MISC",
"ORG",
"PER"
],
"en": [
"CARDINAL",
"DATE",
"EVENT",
"FAC",
"GPE",
"LANGUAGE",
"LAW",
"LOC",
"MONEY",
"NORP",
"ORDINAL",
"ORG",
"PERCENT",
"PERSON",
"PRODUCT",
"QUANTITY",
"TIME",
"WORK_OF_ART"
],
"zh-hans": [
"CARDINAL",
"DATE",
"EVENT",
"FAC",
"GPE",
"LANGUAGE",
"LAW",
"LOC",
"MONEY",
"NORP",
"ORDINAL",
"ORG",
"PERCENT",
"PERSON",
"PRODUCT",
"QUANTITY",
"TIME",
"WORK_OF_ART"
],
"ru": [
"LOC",
"MISC",
"ORG",
"PER"
],
"uk": [
"LOC",
"MISC",
"ORG",
"PERS"
],
"ar": [
"LOC",
"MISC",
"ORG",
"PER"
],
"hu": [
"LOC",
"MISC",
"ORG",
"PER"
],
"af": [
"LOC",
"MISC",
"ORG",
"PERS"
],
"bg": [
"EVT",
"LOC",
"ORG",
"PER",
"PRO"
],
"fi": [
"DATE",
"EVENT",
"LOC",
"ORG",
"PER",
"PRO"
],
"my": [
"LOC",
"NE",
"NUM",
"ORG",
"PNAME",
"RACE",
"TIME"
],
"it": [
"LOC",
"ORG",
"PER"
],
"de": [
"LOC",
"MISC",
"ORG",
"PER"
],
"nl": [
"LOC",
"MISC",
"ORG",
"PER"
],
"vi": [
"LOCATION",
"MISCELLANEOUS",
"ORGANIZATION",
"PERSON"
],
"es": [
"LOC",
"MISC",
"ORG",
"PER"
]
}
```

https://git.texta.ee/texta/texta-mlp-python/-/issues/56
Memory issues with documents containing long words (Marit Asula, 2023-05-15)

Running MLP on a (6 GB) GPU leads to "CUDA out of memory" errors (even for a monolingual dataset) if a document contains a word that is too long, such as an encrypted message or a bad OCR result producing a long run of symbols like "HyAHS473HSgsFhNSMIAUDIMUAIMDIAMUDOIAUMDOIUAMOIDUHGAGHJAAHJHJHDJHAKAD...". To avoid this, set a configurable limit for maximum word length and skip documents that exceed it (but log the skipped documents!). If the long character sequence sits at the end of the document, it could simply be stripped off before passing the document to the Stanza pipeline... or maybe there is a better solution.

https://git.texta.ee/texta/texta-mlp-python/-/issues/55
Update or replace Pelicanus to be Python 3.10 compatible. (Marko Kollo, 2022-08-10)

When installing texta-mlp into a conda environment with Python 3.10, importing the MLP class throws an error due to a deprecation in the collections module that Pelecanus uses.
```
File "/home/mkollo/.conda/envs/py10-mlp/lib/python3.10/site-packages/pelecanus/pelicanjson.py", line 27, in <module>
class PelicanJson(collections.MutableMapping):
AttributeError: module 'collections' has no attribute 'MutableMapping'
```
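The root cause is the deprecation quoted below: the ABC aliases were removed from `collections` in Python 3.10. The proper fix is to use `collections.abc.MutableMapping` inside pelicanjson.py; until Pelecanus is patched or replaced, one possible stopgap (an assumption, not a chosen solution) is to restore the alias before anything imports Pelecanus:

```python
import collections
import collections.abc

# Stopgap: Python 3.10 removed the ABC aliases from `collections`.
# Re-expose the one Pelecanus needs so the legacy reference
# `collections.MutableMapping` keeps working. This must run before
# `pelecanus` / `texta_mlp` are imported; on Python <= 3.9 it is a no-op.
if not hasattr(collections, "MutableMapping"):
    collections.MutableMapping = collections.abc.MutableMapping
```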
```
https://docs.python.org/3.8/library/collections.html
Deprecated since version 3.3, will be removed in version 3.10: Moved Collections Abstract Base Classes to the collections.abc module. For backwards compatibility, they continue to be visible in this module through Python 3.9.
```

Assignee: Marko Kollo

https://git.texta.ee/texta/texta-mlp-python/-/issues/51
Add postprocessing to lemmatized output (Marit Asula, 2021-12-01)

Remove "=" etc.

https://git.texta.ee/texta/texta-mlp-python/-/issues/50
NER output lemmatisation (Raul Sirel, 2021-11-11)

https://git.texta.ee/texta/texta-mlp-python/-/issues/49
stanza multilingual pipeline for mlp (Raul Sirel, 2021-10-30)

**Description:**
Update to Stanza 1.3 multilingual pipeline for MLP.
**Actions:**
- [ ] - Update Stanza package to 1.3
- [ ] - Use langid processor instead of langdetect to detect input language
- [ ] - Use MultilingualPipeline to maintain a cache of pipelines for each language
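The per-language cache in the last action item can be sketched independently of Stanza. `PipelineCache` and the stub loader below are hypothetical stand-ins (a real setup would pass something like `lambda lang: stanza.Pipeline(lang)` as the loader), mirroring the LRU cap that MultilingualPipeline is meant to provide:

```python
from collections import OrderedDict
from typing import Any, Callable

class PipelineCache:
    """Keep at most `max_size` per-language pipelines alive,
    evicting the least recently used one when the cap is hit."""

    def __init__(self, loader: Callable[[str], Any], max_size: int = 10):
        self._loader = loader
        self._max_size = max_size
        self._cache: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, lang: str) -> Any:
        if lang in self._cache:
            self._cache.move_to_end(lang)        # mark as recently used
        else:
            if len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)  # evict the LRU pipeline
            self._cache[lang] = self._loader(lang)
        return self._cache[lang]

# Usage with a stub loader; each language's pipeline is built once, then reused:
cache = PipelineCache(loader=lambda lang: f"<pipeline:{lang}>", max_size=2)
cache.get("en"), cache.get("et"), cache.get("ru")  # "en" gets evicted
```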
**Tests:**
- [ ] - Check that Stanza package 1.3 is stable in MLP
- [ ] - Check langid processor is detecting correct language
- [ ] - Tests for MultilingualPipeline
**Documentation:**
https://stanfordnlp.github.io/stanza/langid.html

Assignee: Wael Ramadan

https://git.texta.ee/texta/texta-mlp-python/-/issues/48
apply_to_text currency span misaligned (Gert Paimla, 2022-12-12)

Example input (Estonian, kept verbatim as the reproduction document; it contains sums such as "enam kui 14 miljonit dollarit", i.e. "more than 14 million dollars"):

Pandora dokumendid kinnitavad Tartust pärit maailmakuulsa küberpetturi Vladimir Tšaštšini investeeringuid FSB sidemetega Venemaa miljonäri ärisse ning paljastavad mitu tema partnerit idapiiri taga. Nii nagu lekkinud dokumendid, viitab ka endine FBI küberdivisjoni tehnoloogiajuht Milan Patel, et miljonid arvutid viirustega nakatanud ning seeläbi enam kui 14 miljonit dollarit varastanud küberkurjategija omas olulisi sidemeid Venemaal. See näitab, et tema kuritegude taga võisid olla veelgi olulisemad niiditõmbajad. Samuti omasid Tšaštšini salajases Seišellide firmas osalust tema endised alluvad Eestis, kellest on nüüd saanud edukad kiirlaenu ja IT-ettevõtjad.
(using the currency sum analyzer)

https://git.texta.ee/texta/texta-mlp-python/-/issues/44
Investigate possible integrations with MLPLite to support quick processing of untokenized and unlemmatized content. (Marko Kollo, 2021-08-18)

https://git.texta.ee/texta/texta-mlp-python/-/issues/43
Implement changes for MLP to handle documents in bulk. (Marko Kollo, 2021-08-18)

https://git.texta.ee/texta/texta-mlp-python/-/issues/42
Profiling: How much does Stanza's supported bulk processing help with speed compared to single processing? (Marko Kollo, 2021-08-18)

The profiling should cover texts of different sizes: from a small comment, to several sentences, to a whole article.

https://git.texta.ee/texta/texta-mlp-python/-/issues/39
Optimize batch processing (Raul Sirel, 2021-06-10)

MLP is very slow with large documents and document lists because of Stanza.
Possible solution: https://github.com/apmoore1/stanza-batch

Assignee: Marko Kollo

https://git.texta.ee/texta/texta-mlp-python/-/issues/33
Idea: table of content extractor (Linda Freienthal, 2021-04-27)

- Is it possible?
- Can be used for extracting chapters.

https://git.texta.ee/texta/texta-mlp-python/-/issues/32
Geoanalysis (Raul Sirel, 2021-07-12)

* Extract coordinates from the text and from arbitrary data fields.
* Link addresses and locations to coordinates.
* Display the detected coordinates on a map in TK.

Assignee: Krister Kruusmaa

https://git.texta.ee/texta/texta-mlp-python/-/issues/30
Improve EmailParser (first collect bad examples). (Linda Freienthal, 2021-04-27)

Collect here examples of bad parsing and improve the parser.
" vanema toetuse väljamakset?annika.tammemäe@gmail.com" ==> väljamakset?annika.tammemäe@gmail.com
". e-mail. (xxxxxxx@suhtlus.ee)isikukod(xxxxx).jään ootama vastust." ==> "e-mail...Collect here examples of bad parsing and improve parser.
" vanema toetuse väljamakset?annika.tammemäe@gmail.com" ==> väljamakset?annika.tammemäe@gmail.com
". e-mail. (xxxxxxx@suhtlus.ee)isikukod(xxxxx).jään ootama vastust." ==> "e-mail.(xxxxxxx@suhtlus.ee)isikukod(xxxxx"
"minu meiliaadressile .....@gmail.com_x000D_
Palun teil samuti " ==> "meiliaadressile.....@gmail.com_x000D_" (manually redacted text)

Assignee: Linda Freienthal

https://git.texta.ee/texta/texta-mlp-python/-/issues/29
Fact structure has "lemma" field in it (Linda Freienthal, 2021-03-22)

Also, we don't load it from JSON: https://git.texta.ee/texta/texta-mlp-python/-/blob/master/texta_mlp/fact.py#L23. Find out why and what to do.

Assignee: Linda Freienthal

https://git.texta.ee/texta/texta-mlp-python/-/issues/25
Add parsers for coordinates (Raul Sirel, 2021-03-04)

Example text:
```
BOG MARRUEC
OS - BOG 4.553891 74.11848
```

https://git.texta.ee/texta/texta-mlp-python/-/issues/22
Tests for MLP worker (Raul Sirel, 2021-10-13)

Assignee: Wael Ramadan

https://git.texta.ee/texta/texta-mlp-python/-/issues/17
Change name of the fact "BOUNDED" (Linda Freienthal, 2020-11-13)

Because nobody knows that it means "piiritletud, ühendatud, seotud" ("bounded, connected, linked"). I refuse to use "linked" or something similar, because entity linking means something else in NLP.

Assignee: Linda Freienthal

https://git.texta.ee/texta/texta-mlp-python/-/issues/14
Add texta_facts helper analyzer for all entity based analyzers. (Marko Kollo, 2020-06-30)

Since the analyzers that make up texta_facts are bound to change, referencing them by name inside the Toolkit becomes difficult. Add a separate analyzer named texta_facts which makes MLP automatically select the necessary analyzers that it consists of.

https://git.texta.ee/texta/texta-mlp-python/-/issues/4
Split input document into paragraphs (Kristiina Vaik, 2020-04-16)
1. Document should be split on "\n\n", result: n paragraphs per document
1. Detect the language of each paragraph
1. Apply the correct MLP pipeline according to the detected language of each paragraph
1. Put the entire document back together
1. Adjust the initial fact spans

Assignee: Marko Kollo
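The first four steps can be sketched as follows. This is a minimal sketch: `detect_lang` and `pipelines` are hypothetical stand-ins for the language detector and the per-language MLP/Stanza pipelines, and step 5 (span adjustment) would additionally need to track each paragraph's character offset in the rebuilt document:

```python
from typing import Callable, Dict

def process_by_paragraph(doc: str,
                         detect_lang: Callable[[str], str],
                         pipelines: Dict[str, Callable[[str], str]]) -> str:
    """Split `doc` into paragraphs, run each through the pipeline for its
    detected language, and join the results back together."""
    paragraphs = doc.split("\n\n")                         # step 1
    processed = []
    for para in paragraphs:
        lang = detect_lang(para)                           # step 2
        pipeline = pipelines.get(lang, lambda text: text)  # unknown language: pass through
        processed.append(pipeline(para))                   # step 3
    return "\n\n".join(processed)                          # step 4

# Usage with stub components (a real setup would plug in langdetect/langid
# and per-language pipelines):
out = process_by_paragraph(
    "hello world\n\ntere maailm",
    detect_lang=lambda p: "et" if "tere" in p else "en",
    pipelines={"en": str.upper, "et": str.title},
)
```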