Memory issues with documents containing long words
Running MLP on a 6 GB GPU leads to "CUDA out of memory" errors (even for a monolingual dataset) if some document contains a word that is too long, e.g. encrypted messages or bad OCR results producing a long run of symbols like "HyAHS473HSgsFhNSMIAUDIMUAIMDIAMUDOIAUMDOIUAMOIDUHGAGHJAAHJHJHDJHAKAD...". To avoid this, add a configurable limit on maximum word length and skip documents that exceed it (but log the skipped documents!). If the long character sequence sits at the end of the document, it could instead simply be stripped before the document is passed to the Stanza pipeline. ... or maybe there's a better solution; a sketch of the filtering idea is below.
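
A minimal sketch of this guard, run before documents reach the Stanza pipeline. All names here (`has_overlong_word`, `filter_documents`, `strip_trailing_overlong_word`) and the default limit of 100 characters are placeholders for illustration, not part of any existing API:

```python
import logging

logger = logging.getLogger(__name__)

# Configurable limit; 100 is an arbitrary default, not taken from the original report.
MAX_WORD_LENGTH = 100


def has_overlong_word(text: str, max_len: int = MAX_WORD_LENGTH) -> bool:
    """Return True if any whitespace-delimited token exceeds max_len characters."""
    return any(len(token) > max_len for token in text.split())


def strip_trailing_overlong_word(text: str, max_len: int = MAX_WORD_LENGTH) -> str:
    """If the document merely *ends* with an overlong token, drop that token
    instead of discarding the whole document."""
    parts = text.rsplit(maxsplit=1)
    if len(parts) == 2 and len(parts[1]) > max_len:
        return parts[0]
    return text


def filter_documents(documents, max_len: int = MAX_WORD_LENGTH):
    """Yield (doc_id, text) pairs safe to pass to the pipeline; skip and log the rest."""
    for doc_id, text in documents:
        text = strip_trailing_overlong_word(text, max_len)
        if has_overlong_word(text, max_len):
            logger.warning(
                "Skipping document %s: contains a word longer than %d characters",
                doc_id, max_len,
            )
            continue
        yield doc_id, text
```

Splitting on whitespace is deliberately crude but cheap; since the goal is only to catch pathological byte runs before the GPU sees them, a full tokenizer pass isn't needed at this stage.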