NeoBERT: A Next-Generation BERT

My study notes on the ‘NeoBERT: A Next-Generation BERT’ paper and explorations with the model
Published: June 25, 2025

The paper “NeoBERT: A Next-Generation BERT” (2025) by Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, and Sarath Chandar introduces a new encoder model with an updated architecture, training data, and pre-training methods, intended as a strong backbone model.

[W]e introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies.

These are the most important highlights at a first glance:

And here are the relevant links:

Motivation

The motivation behind this paper is that encoders have not received as much love as LLMs in recent years, although they are equally important for downstream applications like RAG systems.

Today’s LLMs are capable of in-context learning and reasoning because of advancements in architecture, training data, pre-training, and fine-tuning. While there has been research on fine-tuning methods for pre-trained encoders (e.g., GTE or jina-embeddings), they are applied to older base models, like BERT from 2019. Therefore, the authors see a lack of updated open-source base models to apply these new fine-tuning techniques to.

As a result, there is a dire need for a new generation of BERT-like pre-trained models that incorporate up-to-date knowledge and leverage both architectural and training innovations, forming stronger backbones for these more advanced fine-tuning procedures.

Recent work on modernizing these base models includes NomicBERT and ModernBERT, which this paper takes inspiration from.

Key insights

The paper covers a lot of interesting nitty-gritty details on recent advancements in architecture choice, training data selection, and pre-training methods. But if you step back, I think the key insights confirm what we already know from LLMs:

  1. Training on a lot of good data = Better models
  2. Increasing model size = Better models (even at small scale)

Training on a lot of good data = Better models

According to the paper, the modification with the biggest improvement was changing the training data:

[…] replacing Wikitext and BookCorpus with the significantly larger and more diverse RefinedWeb dataset improved the score by +3.6% […]

They trained NeoBERT on RefinedWeb, a 2.8 TB dataset containing 600B tokens, nearly 18 times larger than RoBERTa’s training dataset.

Following the same trend, we pre-trained NeoBERT on RefinedWeb (Penedo et al., 2023), a massive dataset containing 600B tokens, nearly 18 times larger than RoBERTa’s.
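For anyone who wants to poke at the data itself, here is a minimal sketch of streaming RefinedWeb with the Hugging Face datasets library. The tiiuae/falcon-refinedweb dataset ID and the content column are my assumptions about the public release, not details taken from the paper.

```python
from datasets import load_dataset

# Stream RefinedWeb instead of downloading all ~2.8 TB up front.
# Dataset ID and column name are assumptions about the public release.
dataset = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, example in enumerate(dataset):
    print(example["content"][:200])  # raw web text of one document
    if i >= 2:
        break
```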

I think it’s interesting that the newer NomicBERT was apparently trained on the same 13 GB dataset as BERT, while RoBERTa was trained on an extended 160 GB dataset. Scaling up the training data like this has been standard for generative models for a while, but it is only now being applied to encoders.

Recent generative models like the LLaMA family (Touvron et al., 2023; Dubey et al., 2024) have demonstrated that language models benefit from being trained on significantly more tokens than was previously standard. Recently, LLaMA-3.2 1B was successfully trained on up to 9T tokens without showing signs of saturation. Moreover, encoders are less sample-efficient than decoders since they only make predictions for masked tokens. Therefore, it is reasonable to believe that encoders of similar sizes can be trained on an equal or even greater number of tokens without saturating.
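To make the sample-efficiency point concrete, here is a quick back-of-the-envelope comparison of my own (not from the paper): a decoder gets a next-token training signal at every position, while a masked-language-model encoder only gets a signal at the masked positions.

```python
# Rough comparison of training signal per 1,024-token sequence.
seq_len = 1024

decoder_predictions = seq_len                  # next-token loss at every position
encoder_predictions_15 = int(0.15 * seq_len)   # classic BERT masking rate
encoder_predictions_30 = int(0.30 * seq_len)   # within the 20-40% range from Wettig et al.

print(decoder_predictions)     # 1024
print(encoder_predictions_15)  # 153
print(encoder_predictions_30)  # 307
```

With only 15-30% of positions contributing to the loss, an encoder extracts several times less signal per sequence, which is why training it on as many (or more) tokens as a decoder seems reasonable.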

Increasing model size = Better models (even at small scale)

The second most impactful modification was increasing the model size and finding an optimal depth-to-width ratio for the Transformer architecture:

[…] while increasing the model size from 120M to 250M in M7 led to a +2.9% relative improvement.

So, NomicBERT and ModernBERT base both have around 150M parameters and are considered small-sized. NeoBERT with 250M parameters can be considered medium-sized, so it makes sense that it performs better than smaller models.

But what’s interesting is that they took the depth-to-width ratio into consideration when increasing the model size:

In contrast, small language models like BERT, RoBERTa, and NomicBERT are instead in a width-inefficiency regime. To maximize NeoBERT’s parameter efficiency while ensuring it remains a seamless plug-and-play replacement, we retain the original BERT base width of 768 and instead increase its depth to achieve this optimal ratio.

So, they first increased the number of parameters to 250M while keeping a BERT-like depth-to-width ratio of 16 layers × 1056 dimensions (too wide), and then adjusted the ratio to 28 layers × 768 dimensions (deeper and more width-efficient).

Note that to assess the impact of the depth-to-width ratio, we first scale the number of parameters in M7 to 250M while maintaining a similar ratio to BERT base, resulting in 16 layers of dimension 1056. In M8, the ratio is then adjusted to 28 layers of dimension 768.
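To see roughly why both configurations end up in the same parameter class, here is a back-of-the-envelope estimate of my own. It assumes a plain Transformer encoder layer (4·d² attention projections plus an 8·d² feed-forward block) and ignores embeddings, SwiGLU, and biases, so the absolute numbers are only approximate.

```python
def approx_encoder_params(num_layers: int, hidden: int) -> int:
    """Rough estimate: 4*d^2 for the attention projections plus
    8*d^2 for a 4x feed-forward block, per layer (no embeddings)."""
    per_layer = 4 * hidden**2 + 8 * hidden**2
    return num_layers * per_layer

# M7: BERT-base-like ratio scaled up, 16 layers of width 1056
print(approx_encoder_params(16, 1056) / 1e6)  # ~214M before embeddings

# M8: deeper but at the original BERT base width, 28 layers of width 768
print(approx_encoder_params(28, 768) / 1e6)   # ~198M before embeddings
```

Both land in the same ballpark once embeddings and the actual feed-forward design are added; what changes between M7 and M8 is how those parameters are traded between depth and width.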

This is nice because, by keeping the hidden size at 768, NeoBERT can easily be swapped in for other base models:

NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models
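As a sketch of what plug-and-play means in practice, here is how one might load NeoBERT through Hugging Face transformers in place of a BERT checkpoint. The chandar-lab/NeoBERT model ID and the trust_remote_code flag are my assumptions about the released checkpoint, and its custom code may pull in extra dependencies.

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face ID of the released checkpoint.
model_id = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("NeoBERT is a next-generation encoder.", return_tensors="pt")
outputs = model(**inputs)

# Hidden states keep BERT base's width of 768, which is what makes
# swapping it into existing pipelines straightforward.
print(outputs.last_hidden_state.shape)  # [batch, seq_len, 768]
```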

What I also find remarkable is that it has much faster inference than both ModernBERT models despite its size. There’s a nice figure in the paper showing the throughput at different sequence lengths. The figure is missing NomicBERT though, and I don’t know why.

For extended sequences, NeoBERT significantly outperforms ModernBERT base, despite having 100M more parameters, achieving a 46.7% speedup on sequences of 4,096 tokens.
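Out of curiosity, here is a rough sketch of how one could measure throughput at a fixed sequence length locally. The model IDs are assumptions, and a real benchmark would need warmup runs, a GPU, and the efficient attention kernels the paper relies on, so the absolute numbers will not match the figure.

```python
import time
import torch
from transformers import AutoModel

def throughput(model_id: str, seq_len: int, batch_size: int = 8, steps: int = 10) -> float:
    """Tokens per second on random inputs of a fixed length (rough estimate)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
    input_ids = torch.randint(1000, 2000, (batch_size, seq_len), device=device)
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(steps):
            model(input_ids=input_ids, attention_mask=attention_mask)
    elapsed = time.perf_counter() - start
    return batch_size * seq_len * steps / elapsed

# Assumed model IDs for the two models compared above.
for model_id in ["chandar-lab/NeoBERT", "answerdotai/ModernBERT-base"]:
    print(model_id, f"{throughput(model_id, seq_len=4096):,.0f} tokens/s")
```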

Old Encoders vs. Modern Encoders

The paper features a nice overview table comparing characteristics of older encoders, like BERT (2019) and RoBERTa (2019), with newer encoders, like NomicBERT (2024) and ModernBERT (2024). Here I’m summarizing the key differences that stood out to me:

| Configuration | Older Encoders | Newer Encoders |
|---|---|---|
| Position encoding and sequence length | Absolute positional embeddings with a sequence length of 512 | RoPE, handling longer sequences of 2,048 to 8,192 |
| Masking rate | 15% | 20-40% (the optimal range found by Wettig et al.) |
| Optimizer | Adam | AdamW |
| Training optimizations | DDP | FlashAttention and other efficiency optimizations |
| Normalization | Post-Layer Normalization | Pre-Layer Normalization (the normalization layer is moved inside the residual connection of each feed-forward and attention block; sketched below) |
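To illustrate the normalization row, here is a minimal PyTorch sketch of the difference between post-LN and pre-LN blocks. These are simplified blocks of my own, not the exact NeoBERT implementation.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Older style (BERT, RoBERTa): normalize after the residual addition."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Newer style: normalize inside the residual branch, before the sublayer."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

# Either block wraps an attention or feed-forward sublayer, for example:
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.randn(2, 128, 768)
print(PostLNBlock(768, ffn)(x).shape)  # torch.Size([2, 128, 768])
print(PreLNBlock(768, ffn)(x).shape)   # torch.Size([2, 128, 768])
```

The usual motivation for the pre-LN variant is more stable training of deeper stacks, which matters for a 28-layer model like NeoBERT.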