Commit cfe0f827 authored by Raul Sirel

update readme

parent 5433e9f5
Pipeline #5015 canceled with stage in 30 seconds
@@ -16,14 +16,35 @@ http://pypi.texta.ee/texta-mlp/
`python3 -m pytest -v tests`
## Entities
MLP extracts the following entities:
### Model-based Entities
MLP uses Stanza to extract:
* Persons (missing Estonian model)
* Organizations (missing Estonian model)
* Geopolitical entities (missing Estonian model)
* Phone numbers
* Email addresses
### Regex-based Entities
MLP uses regular expressions to extract:
* Phone numbers (regex)
* Email addresses (regex)
### List-based Entities
MLP also supports entity extraction using lists of predefined entities. These lists come with MLP:
* Companies (Estonian)
* Addresses (Estonian & Russian)
* Addresses (Estonian and Russian)
* Currencies (Estonian, Russian, and English)
## Custom List-based Entities
MLP also supports defining custom entity lists. Custom lists must be placed in the **entity_mapper** directory, which resides inside the **data** directory.
Entities are defined as JSON files:
```
{
    "MY_ENTITY": [
        "foo",
        "bar"
    ]
}
```
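Such a file can also be generated from Python. The following is a minimal sketch using only the standard library. The helper name `write_entity_list` and the file name `my_entity.json` are illustrative assumptions, and the demo writes to a temporary directory because the location of MLP's **data** directory is not stated here.

```python
import json
import tempfile
from pathlib import Path


def write_entity_list(data_dir, file_name, entities):
    """Write {entity_type: [values]} as JSON into <data_dir>/entity_mapper."""
    mapper_dir = Path(data_dir) / "entity_mapper"
    mapper_dir.mkdir(parents=True, exist_ok=True)
    target = mapper_dir / file_name
    # ensure_ascii=False keeps Estonian and Russian characters readable.
    with open(target, "w", encoding="utf-8") as fh:
        json.dump(entities, fh, ensure_ascii=False, indent=2)
    return target


# Demo against a throwaway directory; in real use, pass the path of
# MLP's data directory instead (hypothetical here).
demo_data_dir = tempfile.mkdtemp()
path = write_entity_list(demo_data_dir, "my_entity.json", {"MY_ENTITY": ["foo", "bar"]})
print(path.read_text(encoding="utf-8"))
```

Keeping one entity type per file is only a convention of this sketch; the JSON format above allows several keys in one file.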
## Usage
@@ -126,142 +147,3 @@ You can choose the parsers like so:
```
>>> mlp.process(analyzers=["lemmas", "phone_high_precision"], raw_text= "My phone number is 12 34 56 77.")
```
### Concatenate close entities
Let's test MLP() and Concatenator() on the following three letters.
Letter 1:
```
Dear all,
Let`s not forget that I intend to concure the whole of Persian Empire!
Best wishes,
Alexander Great
aleksandersuur356eKr@mail.ee
phone: 76883266
```
Letter 2:
```
От: Terry Pratchett < tpratchett@gmail.com >
Кому: Joe Abercrombie < jabercrombie@gmail.com >
Название: Разъяснение
Дорогой Joe,
Как вы? Надеюсь, у тебя все хорошо. Последний месяц я писал свой новый роман,
который обещал представить в начале лета. Я тоже немного почитал и обожаю твою
новую книгу!
Я просто хотел уточнить, что Alexander Great жил в Македонии.
Лучший,
Terry
```
Letter 3:
```
Dear Terry!
Terry Pratchett already created Discworld. This name is taken. Other than that I found
the piece fascanating and see great potential in you! I strongly encourage you to take
action in publishing your works. Btw, if you would like to show your works to Pratchett
as well, he`s interested. I talked about you to him. His email is tpratchett@gmail.com.
Feel free to write him!
Joe
From: Terry Berry < bigfan@gmail.com >
To: Joe Abercrombie < jabercrombie@gmail.com >
Title: Question
Hi Joe,
I finally finished my draft and I`m sending it to you. The hardest part
was creating new places. What do you think of the names of the places I created?
Terry Berry
```
Let's read all three letters into a list called `mailbox`. We will process the letters as described above and save the results to a jsonlines file.
```
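import jsonlines
from texta_mlp.mlp import MLP

# Initialize MLP for Estonian, English, and Russian.
mlp = MLP(language_codes=["et", "en", "ru"])

# mailbox is a list of letter texts (strings).
processed_letters = []
for letter in mailbox:
    processed_letters.append(mlp.process(letter))

# Save one processed letter per line.
with jsonlines.open("letters.jsonl", mode="w") as writer:
    writer.write_all(processed_letters)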
```
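The snippet above assumes that a `mailbox` list already exists. One possible way to build it, assuming each letter is stored as a separate UTF-8 `.txt` file in one directory, is sketched below. The helper `load_mailbox` and the file names are illustrative and not part of texta-mlp.

```python
import tempfile
from pathlib import Path


def load_mailbox(directory):
    """Return the text of every .txt file in `directory`, sorted by file name."""
    return [
        path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    ]


# Demo with two throwaway files standing in for the real letters.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "letter_1.txt").write_text(
    "Dear all,\nBest wishes,\nAlexander Great", encoding="utf-8"
)
(demo_dir / "letter_2.txt").write_text("Дорогой Joe,\nTerry", encoding="utf-8")

mailbox = load_mailbox(demo_dir)
print(len(mailbox))
```

Reading with an explicit `encoding="utf-8"` matters here because the mailbox mixes Latin and Cyrillic text.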
MLP() already creates a BOUNDED fact, which binds together the entities that appear closest to each other within a letter. To organize the information across the whole mailbox, we have to concatenate the BOUNDED facts, that is, build a database of personal information gathered from different letters. For that we use Concatenator(), which takes the processed letters as input.
```
from texta_mlp.concatenator import Concatenator
cn = Concatenator()
cn.load_bounded_from_jsonl(path="letters.jsonl")
# cn.load_bounded_from_jsonl() without arguments uses the default path "mlpanalyzed.jsonl"
```
Next, we concatenate the BOUNDED facts. Note that for large mailboxes this step can take a long time, on the order of 2 hours.
```
cn.concatenate()
```
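To illustrate the idea of concatenation, here is a simplified sketch that merges groups of entities sharing the same person name. It is not the Concatenator's actual algorithm, which is not described here. The input format (a dict with a `PER` string and lists of other entities) and the function `merge_by_person` are assumptions made for this example only.

```python
from collections import defaultdict


def merge_by_person(bounded_groups):
    """Merge groups that share a person, keeping values unique in first-seen order."""
    merged = defaultdict(lambda: defaultdict(list))
    for group in bounded_groups:
        facts = merged[group["PER"]]
        for fact_type, values in group.items():
            if fact_type == "PER":
                continue
            for value in values:
                if value not in facts[fact_type]:
                    facts[fact_type].append(value)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {person: dict(facts) for person, facts in merged.items()}


# Hypothetical groups, loosely modelled on the example letters.
groups = [
    {"PER": "Alexander Great", "LOC": ["Persian Empire"], "PHONE": ["76883266"]},
    {"PER": "Alexander Great", "LOC": ["Македония"]},
    {"PER": "Terry Pratchett", "EMAIL": ["tpratchett@gmail.com"]},
    {"PER": "Terry Pratchett", "EMAIL": ["tpratchett@gmail.com"]},
]
people = merge_by_person(groups)
print(people["Alexander Great"]["LOC"])
```

The sketch matches persons by exact name only; it makes no attempt to handle the ambiguous cases that the Concatenator reports separately.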
We can inspect the resulting lists and their lengths with the following methods:
- `cn._just_pers_infos()`: type "close_persons", persons that appear close to each other in the letter(s);
- `cn._bounded()`: the original, unconcatenated BOUNDED facts;
- `cn._unsure_infos()`: type "unsure_whose_entities", entities that have two or more candidate persons, so it is unclear to whom they belong;
- `cn._no_personas_infos()`: type "no_per_close_entities", entities that appear close to each other in the letter(s) without any person nearby;
- `cn._persona_infos()`: type "person_info", the main result, entities together with the person they belong to.

All of these can be saved to a .jsonl file.
```
cn.save_to_jsonlines(path="concatenated_bounds_from_mailbox.jsonl")
# cn.save_to_jsonlines() without arguments uses the default path "concatenated_bounds.jsonl"
```
Output of "concatenated_bounds_from_mailbox.jsonl":
```
{"type": "person_info", "PER": "Alexander Great", "LOC": ["Македония", "Persian Empire"], "EMAIL": ["aleksandersuur356eKr@mail.ee"], "PHONE": ["76883266"]}
{"type": "person_info", "PER": "Joe Abercrombie", "EMAIL": ["jabercrombie@gmail.com"]}
{"type": "person_info", "PER": "Terry Berry", "EMAIL": ["bigfan@gmail.com"]}
{"type": "person_info", "PER": "Terry Pratchett", "EMAIL": ["tpratchett@gmail.com"]}
```
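Because each line of the output is a standalone JSON object, it can be read back with the standard `json` module. The sketch below indexes `person_info` records by person name. The function `index_person_info` is illustrative, and the sample lines are copied from the output above.

```python
import json

# Two records copied from the example output.
sample_output = """\
{"type": "person_info", "PER": "Alexander Great", "LOC": ["Македония", "Persian Empire"], "EMAIL": ["aleksandersuur356eKr@mail.ee"], "PHONE": ["76883266"]}
{"type": "person_info", "PER": "Joe Abercrombie", "EMAIL": ["jabercrombie@gmail.com"]}
"""


def index_person_info(lines):
    """Map each person's name to their record, skipping other record types."""
    people = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if record.get("type") == "person_info":
            people[record["PER"]] = record
    return people


people = index_person_info(sample_output.splitlines())
print(people["Joe Abercrombie"]["EMAIL"])
```

To read the real file, pass an open file handle instead of the sample lines, for example `index_person_info(open("concatenated_bounds_from_mailbox.jsonl", encoding="utf-8"))`.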
### Dealing with Elasticsearch
We can also use Elasticsearch with Concatenator(). Here's a snippet that fetches documents already processed by MLP() from Elasticsearch, concatenates them, and uploads the result to a new index.
```
from texta_mlp.concatenator import Concatenator
cn = Concatenator()
cn.load_bounded_from_elastic(es_url="http://localhost:8888", index_name="mlp_processed_mails")
cn.concatenate()
# es_url is the Elasticsearch address, index_name is the target index.
cn.save_to_elasticsearch(es_url="http://localhost:8888", index_name="mails_concatenated_bounded")
```
Calling cn.load_bounded_from_elastic() without arguments uses the default settings, which is equivalent to:
```
cn.load_bounded_from_elastic(es_url="http://elastic-dev.texta.ee:9200", index_name="mlp_processed_mails")
```
Calling cn.save_to_elasticsearch() without arguments uses the default settings, which is equivalent to:
```
cn.save_to_elasticsearch(es_url="http://elastic-dev.texta.ee:9200", index_name="concatenated_BOUNDED")
```