Create a proper test dataset for the tests
Currently, test data is added inside the tests themselves, which is ugly as hell. We should pick a few dozen samples from an existing index, vectorize them, and save them to a file. The tests can then import that file once and be adjusted to use those samples.
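A rough sketch of how the "save to a file, import once" part could look, using only the Python standard library. Everything here is an assumption for illustration: the file name, the JSON format, the `export_samples` / `load_samples` helpers, and the two placeholder samples (real ones would come from the existing index). The temp-directory path just keeps the sketch runnable; in the repo the file would more likely live next to the tests.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical fixture location; in the repo this would probably sit in the tests directory.
FIXTURE_PATH = Path(tempfile.gettempdir()) / "test_samples.json"


def export_samples(samples, path=FIXTURE_PATH):
    """One-off step: write already-vectorized samples to the fixture file."""
    Path(path).write_text(json.dumps(samples), encoding="utf-8")


@lru_cache(maxsize=None)
def load_samples(path=FIXTURE_PATH):
    """Read the fixture file; repeated calls with the same path reuse the cached result."""
    return tuple(json.loads(Path(path).read_text(encoding="utf-8")))


# Placeholder data, made up for this sketch (not taken from any real index).
export_samples([
    {"id": "doc-1", "text": "first sample", "vector": [0.1, 0.2, 0.3]},
    {"id": "doc-2", "text": "second sample", "vector": [0.4, 0.5, 0.6]},
])

samples = load_samples()
print(len(samples), "samples loaded")
```

If the test suite uses pytest, the same idea could be expressed as a session-scoped fixture that calls `load_samples()`, so each test receives the shared dataset instead of building its own.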