Change the default optimizer or allow specifying the optimization method
Currently our TorchTagger uses stochastic gradient descent (SGD) as its optimization method. While SGD is widely used, it can take a long time to converge.
Below is an image of a fastText classifier trained for 100 epochs with SGD. It can be seen that the model has not converged yet.
Below is an image of a fastText classifier trained on the same data for 100 epochs with the Adam optimizer. It can be seen that the model converges much faster and therefore requires less training time.
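As a rough sketch of how the optimizer could be made configurable (the `build_optimizer` helper and the `name`/`lr` parameters below are only illustrative, not TorchTagger's existing API), the optimizer could be chosen from a config string and passed into the training loop:

```python
import torch
import torch.nn as nn

# Hypothetical helper: map a config string to a torch.optim optimizer.
# Names and defaults are illustrative, not the actual TorchTagger interface.
def build_optimizer(model: nn.Module, name: str = "adam", lr: float = 1e-3):
    optimizers = {
        "sgd": lambda params: torch.optim.SGD(params, lr=lr),
        "adam": lambda params: torch.optim.Adam(params, lr=lr),
        "adagrad": lambda params: torch.optim.Adagrad(params, lr=lr),
    }
    key = name.lower()
    if key not in optimizers:
        raise ValueError(f"Unsupported optimizer: {name}")
    return optimizers[key](model.parameters())

# Usage sketch inside a training setup:
# model = ...  # the tagger's underlying nn.Module
# optimizer = build_optimizer(model, name="adam", lr=1e-3)
# for epoch in range(num_epochs):
#     optimizer.zero_grad()
#     loss = criterion(model(inputs), targets)
#     loss.backward()
#     optimizer.step()
```

Defaulting to Adam while still accepting `"sgd"` would keep backward compatibility for anyone who wants the current behavior.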