SloVlo V1: Embeddings for the Slovenian Language

June 22nd, 2024

The SloVlo V1 (Slovenske Vložitve) project brings purposefully built embeddings and semantic search capabilities to the Slovenian language.

If you've never heard of embeddings, you can think of them as arrays of numbers that capture the meaning of a piece of text. More importantly, we can take two embeddings and calculate how similar they are. That allows us to take a user query, embed it, and search for the most similar documents in our pre-embedded corpus. If you want to learn more, What are embeddings by Vicki Boykis is a great resource.
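To make this concrete, here is a toy sketch with made-up three-dimensional vectors (real embedding models produce vectors with hundreds of dimensions):

import torch

# Toy, made-up "embeddings" for a query and a document.
query_embedding = torch.tensor([0.9, 0.1, 0.3])
document_embedding = torch.tensor([0.8, 0.2, 0.4])

# Cosine similarity: values close to 1 mean the two texts are semantically similar.
similarity = torch.nn.functional.cosine_similarity(query_embedding, document_embedding, dim=0)
print(similarity.item())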

Multilinguality is an important topic in embeddings. It allows developers to build semantic search applications using languages other than English (apparently there is a world outside of San Francisco). The Slovenian language usually makes up a tiny fraction of multilingual datasets due to the relative lack of Slovenian content on the internet. The SloVlo project attempts to bridge that gap by purposefully gathering data and training models for the Slovenian language. We are releasing:

  1. slovlo-dataset-v1: 1.5M+ Slovenian query and document pairs,
  2. slovlo-v1: a fine-tuned embedding model for the Slovenian language, and
  3. slovlo: the code used to gather the data, and train the model.

Using SloVlo Embeddings

Here is an example of how you can quickly build a semantic search application over Slovenian documents.

We are going to build a CLI application to help with the most Slovenian hobby possible: discovering new hills and mountains to visit.

# examples/find_trip.py

import sys

import torch
from transformers import AutoModel, AutoTokenizer

# Run on a GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# We will read the query from the command line arguments.
query = sys.argv[1]

# First, we define the documents we want to search over.
# In our case, that is a list of destination descriptions.
documents = [
  "Triglav je najvišja gora v Sloveniji (2864 m) in simbol slovenske narodne identitete. Pohod je zahteven in običajno traja dva dni. Potrebna je dobra fizična pripravljenost in osnovno znanje plezanja. Priporočena je tudi uporaba vodnika za manj izkušene pohodnike.",
  "Velika Planina je zelo priljubljena pohodniška destinacija z značilnimi pastirskimi kočami. Pohod je primeren za vse starosti in ponuja čudovite razglede na okoliške gore. Na vrh se lahko povzpnete peš ali z nihalko iz Kamniške Bistrice.",
  "Bled je znan po kremnih rezinah. Če vas zanima pohod, so pa zraven še Ojstrica, ter Mala in Velika Osojnica.",
  "Golica je znana po neskončnih poljih narcis v maju. Pohod se začne iz vasi Planina pod Golico in traja približno 2-3 ure. Pot je primerna za vse pohodnike in ponuja lepe razglede na Julijske Alpe in Avstrijo.",
  "Šmarna Gora je najbolj priljubljena pohodniška destinacija v bližini Ljubljane. Pohod traja približno 1 uro iz Tacna. Na vrhu je koča, kjer lahko uživate v tradicionalni slovenski hrani in lepih razgledih na Ljubljansko kotlino.",
  "Pohorje je pohodniško območje z različnimi potmi, primernimi za vse starosti in pripravljenosti. Posebej priljubljena je pot do Črnega jezera in Slivniškega jezera. Pozimi je Pohorje tudi priljubljena smučarska destinacija.",
]

# Load the model and the tokenizer.
slovlo_model = AutoModel.from_pretrained("rokn/slovlo-v1").eval().to(device)
slovlo_tokenizer = AutoTokenizer.from_pretrained("rokn/slovlo-v1")

# Embed the documents (destinations).
document_embeddings = get_embeddings(documents)

# Embed the user query.
query_embedding = get_embeddings([query])

# Compute dot product similarity between the query and each document.
similarities = torch.matmul(document_embeddings, query_embedding.T).squeeze()

# Find the nearest neighbor.
nearest_index = torch.argmax(similarities).item()

print("Predlog za tvojo naslednjo avanturo:")
print(documents[nearest_index])

I omitted the get_embeddings helper function for brevity. You can find the full script here.
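For illustration, here is a minimal sketch of what such a helper could look like, assuming mean pooling over token embeddings followed by L2 normalization; the helper in the repository may differ (for example, it may prepend E5-style "query: " and "passage: " prefixes):

def get_embeddings(texts: list[str]) -> torch.Tensor:
    # Tokenize the batch of texts with padding and truncation.
    batch = slovlo_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(device)

    # Forward pass without gradients; we only need the token embeddings.
    with torch.no_grad():
        outputs = slovlo_model(**batch)

    # Mean-pool the token embeddings, ignoring padding tokens.
    token_embeddings = outputs.last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

    # L2-normalize so that the dot product equals cosine similarity.
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)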

Ok, let's try a couple of queries.

$ python examples/find_trip.py "Rad bi se malo sprehodil, potem pa še kakšno kremšnito pojedel."  # Translated: I want to take a walk, and then eat a kremšnita.

Bled je znan po kremnih rezinah. Če vas zanima pohod, ...

That's pretty good! The embedding model was able to connect "kremšnita" (the common way to refer to the traditional Slovenian dessert) with "kremna rezina" even though the two terms are lexically different. And through that connection, it correctly suggested Bled, where you can find the best (if overpriced) "kremšnita".

$ python examples/find_trip.py "Kam na pohod iz glavnega mesta Slovenije?"  # Translated: Where to go for a hike from the capital of Slovenia?

Šmarna Gora je najbolj priljubljena pohodniška destinacija v bližini Ljubljane. Pohod traja ...

In the second example, the model suggested Šmarna Gora as the most appropriate trip from the capital of Slovenia. Even though Ljubljana is not mentioned in the query, the model learned during training that Ljubljana is the capital of Slovenia. Consequently, when a user looks for documents related to the capital of Slovenia, the model surfaces documents mentioning Ljubljana.

Data

To train embedding models, we need pairs of texts: ideally, pairs of queries and their corresponding documents, so the model learns to associate queries with relevant documents.

An example pair:

{
  "query": "Kaj je glavno mesto Slovenije?",  // Translated: What's the capital of Slovenia?
  "document": "Ljubljana"
}

Unfortunately, no such dataset exists for Slovenian. Instead, we approximate queries and documents using other sources: we can use news headlines as the queries and the articles as the documents, and similarly Wikipedia headings and sections, or Reddit post titles and comments.
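As a rough, hypothetical sketch of the pairing idea (the article below is made up):

# Turn scraped title/body items into query/document pairs.
scraped_articles = [
    {"title": "Ljubljana", "body": "Ljubljana je glavno mesto Slovenije ..."},
]

pairs = [
    {"query": article["title"], "document": article["body"]}
    for article in scraped_articles
]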

We used three sources for data collection: the Slovenian Wikipedia, rtvslo.si, and the r/Slovenia subreddit. All three offer easily accessible, structured Slovenian-language data, covering a mix of formal Slovene and a contemporary register with slang, English, and other languages sprinkled in.

Examples from the dataset:

[
  {
    "query": "Ionska spojina",
    "document": "Ionska spojina je snov, ki je sestavljena iz pozitivnih delcev (kationov) in negativnih delcev (anionov). Delci so med sabo povezani z močnimi elektrostatskimi vezmi, ki jih pri kemiji označujemo pod pojmom ionska vez. Najpreprostejša ionska spojina je natrijev klorid (NaCl), ki je sestavljen iz natrijevih (Na) in kloridnih (Cl) ionov."
  },
  {
    "query": "Ob glasni podpori navijačev lahko pade tudi Flensburg",
    "document": "Rokometaši Celja so si prejšnji teden z zmago na Poljskem povrnili upanje na preboj v osmino finala Lige prvakov. Prvi izmed treh preostalih nasprotnikov v rednem delu je nemški velikan Flensburg."
  }
]

The code used to gather the datasets is available in the slovlo/scrape directory.

Training

We used contrastive learning to train the embedding model. Specifically, we used the InfoNCE contrastive loss with in-batch negatives, as detailed in the E5 paper. The InfoNCE loss maximizes the similarity between a query and its positive document while minimizing the similarity between the query and the other documents in the same batch, which act as negatives.
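To make that concrete, here is a minimal sketch of the loss: with L2-normalized query and document embeddings, the in-batch InfoNCE loss is a cross-entropy over the query-document similarity matrix, where the diagonal entries are the positive pairs (the temperature value below is illustrative, not necessarily the one used in training):

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # query_emb and doc_emb have shape (batch_size, dim) and are L2-normalized.
    # scores[i][j] is the similarity between query i and document j.
    scores = query_emb @ doc_emb.T / temperature

    # For query i, document i is the positive; every other document in the
    # batch acts as an in-batch negative.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)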

We made an additional tweak by using Gradient Caching to increase the effective batch size. A larger batch improves the quality of the embeddings when using in-batch negatives, because each query sees more negatives per update.
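Roughly, gradient caching works in two passes: first embed the whole batch in small chunks without tracking gradients, compute the loss and the gradients with respect to the embeddings once, and then re-encode each chunk with gradients enabled while backpropagating the cached embedding gradients. A simplified, hypothetical sketch (the encode function, which maps a tokenized chunk to embeddings, is assumed; this is not the exact training code in the repository):

import torch

def grad_cached_step(encode, optimizer, query_chunks, doc_chunks, loss_fn):
    optimizer.zero_grad()

    # Pass 1: embed every chunk without storing activations.
    with torch.no_grad():
        q_parts = [encode(chunk) for chunk in query_chunks]
        d_parts = [encode(chunk) for chunk in doc_chunks]
    q_emb = torch.cat(q_parts).requires_grad_(True)
    d_emb = torch.cat(d_parts).requires_grad_(True)

    # Compute the loss over the full batch and cache d(loss)/d(embeddings).
    loss = loss_fn(q_emb, d_emb)
    loss.backward()
    q_grads = q_emb.grad.split([p.size(0) for p in q_parts])
    d_grads = d_emb.grad.split([p.size(0) for p in d_parts])

    # Pass 2: re-encode each chunk with autograd enabled and push the cached
    # embedding gradients through the encoder, accumulating parameter gradients.
    for chunk, cached_grad in zip(query_chunks, q_grads):
        encode(chunk).backward(gradient=cached_grad)
    for chunk, cached_grad in zip(doc_chunks, d_grads):
        encode(chunk).backward(gradient=cached_grad)

    optimizer.step()
    return loss.item()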

I used the multilingual-e5-base model as the base model for fine-tuning. It is a strong multilingual model, initialized from xlm-roberta, that supports 100 languages (including Slovene). I briefly experimented with the sloberta model as the base, but the improvement in quality was not large enough to justify it.

The code used to train the models is available in the slovlo/embedding_model directory.

Evaluation

To evaluate the models, we used the test split of the SloVlo dataset. We report three metrics based on the Mean Reciprocal Rank (MRR): MRR@1, MRR@5, and MRR@10.
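For reference, here is a small sketch of how MRR@k can be computed, assuming that for each query we know the 1-based rank at which its relevant document was retrieved:

def mrr_at_k(ranks: list[int], k: int) -> float:
    # ranks[i] is the 1-based position of the relevant document for query i;
    # a rank beyond k contributes 0 to the metric.
    reciprocal_ranks = [1.0 / rank if rank <= k else 0.0 for rank in ranks]
    return sum(reciprocal_ranks) / len(ranks)

# Example: the relevant document is ranked 1st, 3rd, and 12th for three queries.
print(mrr_at_k([1, 3, 12], k=10))  # (1 + 1/3 + 0) / 3 ≈ 0.44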

We evaluated the slovlo-v1 model against other open-source models and Elasticsearch (BM25). We include lexical search with BM25 because it provides a strong baseline for comparison. For the Elasticsearch index, we used the Serbian language analyzer as the closest available approximation to Slovenian.

Model                  MRR@1   MRR@5   MRR@10
Elasticsearch (BM25)   31.7    45.2    45.8
e5-base-v2             25.1    36.5    37.2
multilingual-e5-base   37.2    53.9    54.5
bge-m3                 38.1    54.1    54.7
slovlo-v1              43.6    60.4    61.0

We see a significant improvement with the slovlo-v1 model compared to the best open-source models and the BM25 baseline.

The code used to evaluate the models is available in the slovlo/embedding_evaluation directory.

Next Steps

This is only the first version of the model, and I pushed it as far as time allowed. There is plenty of low-hanging fruit in terms of improving the dataset and the models. Here are a few possible improvements, in no particular order:

  • More data: There are plenty of Slovenian data sources from which we can extract pairs (if we get creative).
  • Better data: The V1 SloVlo dataset was only minimally cleaned. Investigate whether consistency-based filtering (from the E5 paper) improves the quality of the model.
  • Hard negative mining: Investigate whether we can use slovlo-v1 to mine hard negatives instead of using in-batch negatives during training.
  • Better evaluation: I did not find any labeled Slovenian retrieval datasets I could use to evaluate the SloVlo model. I had to rely on my test split to gauge performance improvements. While there is nothing wrong with that, I would like to have a more comprehensive, annotated dataset to evaluate retrieval performance across different domains.