About The Project
MILES is a multilingual text simplifier inspired by LSBert, a BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model together with simple language-agnostic approaches to complex word identification (CWI) and candidate ranking. MILES currently supports 22 languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Indonesian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Ukrainian.
As a result of not using any language-specific resources (WordNets, POS taggers, parallel corpora, etc.), MILES does not always offer synonymous substitutions for complex words. Although almost always simpler than the original, selected substitutions may alter the meaning of the text. Please keep this in mind, and feel free to download and tailor MILES to a language of your choosing!
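The commands in this README take an ISO 639-1 code rather than a language name. As a convenience, the sketch below maps the 22 supported languages to their two-letter codes and rejects anything else. The names SUPPORTED_LANGUAGES and resolve_language are hypothetical and are not taken from the MILES source, which may validate codes differently (for example, Norwegian is assumed here to use the general code "no").

```python
# Hypothetical helper: maps ISO 639-1 codes to the languages listed in this README.
# Not part of MILES itself; shown only to illustrate which codes the commands expect.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "bg": "Bulgarian",
    "ca": "Catalan",
    "cs": "Czech",
    "da": "Danish",
    "nl": "Dutch",
    "en": "English",
    "fi": "Finnish",
    "fr": "French",
    "de": "German",
    "hu": "Hungarian",
    "id": "Indonesian",
    "it": "Italian",
    "no": "Norwegian",  # assumption: the macrolanguage code rather than "nb"/"nn"
    "pl": "Polish",
    "pt": "Portuguese",
    "ro": "Romanian",
    "ru": "Russian",
    "es": "Spanish",
    "sv": "Swedish",
    "tr": "Turkish",
    "uk": "Ukrainian",
}


def resolve_language(code):
    """Return the language name for an ISO 639-1 code, or raise ValueError."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[normalized]


if __name__ == "__main__":
    print(resolve_language("fr"))
```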
It is recommended that you download fastText embeddings for your target language(s); MILES uses them to make notably more accurate simplifications. To install fastText embeddings for MILES, download the .vec embeddings for your target language. Once done, place the .vec file in simplifier/embeddings/, then run the key vector generation script with the ISO 639-1 code for the selected language:
python simplifier/embeddings/gen_keyed_vectors.py <ISO 639-1 code>
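For orientation, a .vec file is plain text: a header line with the vocabulary size and vector dimension, followed by one line per word holding the word and its vector components. The sketch below reads that format with the standard library only and compares words by cosine similarity. It is an illustration of the file format, not a description of what gen_keyed_vectors.py does internally; read_vec, cosine, and the tiny SAMPLE data are made up for this example.

```python
import io
import math


def read_vec(stream, limit=None):
    """Parse the text embedding format: 'count dim' header, then 'word v1 ... vd'."""
    header = stream.readline().split()
    _count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # real files are large, so allow reading only the first N words
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up three-word, two-dimensional sample standing in for a real .vec file.
SAMPLE = "3 2\ncat 1.0 0.0\nkitten 0.9 0.1\ncar 0.0 1.0\n"
dim, vecs = read_vec(io.StringIO(SAMPLE))

if __name__ == "__main__":
    print(cosine(vecs["cat"], vecs["kitten"]) > cosine(vecs["cat"], vecs["car"]))
```

With a real download, the same function could be called on an opened file, for example `read_vec(open(path, encoding="utf-8"), limit=50000)`, where `path` points at the .vec file you placed in simplifier/embeddings/.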
You can run MILES either through the provided Flask app or from the command line. To start the Flask app, run app.py with the ISO 639-1 code of your language:
python app.py -l <ISO 639-1 code>
Once running, open 127.0.0.1 in your browser and start simplifying!
If you would prefer to use the command line, there are two options available. Simplifying a sentence:
python simplify.py -t <sentence> -l <ISO 639-1 code>
Simplifying text files:
python simplify.py -f <text_file> -l <ISO 639-1 code>
Note: If no language code is provided, the text is assumed to be English. The default language can be changed in simplifier/config.py.
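The flag behaviour described above (-t for a sentence, -f for a file, -l for the language with an English fallback) can be sketched with argparse. This is a minimal illustration and not the actual simplify.py: DEFAULT_LANGUAGE stands in for whatever simplifier/config.py defines, build_parser is a hypothetical name, and making -t and -f mutually exclusive is an assumption.

```python
import argparse

# Assumed stand-in for the default language configured in simplifier/config.py.
DEFAULT_LANGUAGE = "en"


def build_parser():
    """Build a parser mirroring the -t / -f / -l flags shown in this README."""
    parser = argparse.ArgumentParser(description="Simplify a sentence or a text file.")
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-t", dest="text", help="sentence to simplify")
    source.add_argument("-f", dest="file", help="path to a text file to simplify")
    parser.add_argument(
        "-l",
        dest="lang",
        default=DEFAULT_LANGUAGE,
        help="ISO 639-1 language code (falls back to the default when omitted)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-t", "An example sentence."])
    print(args.lang)
```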
See the open issues for a list of proposed features (and known issues).