NeoBERT: A Next-Generation BERT

My study notes on the ‘NeoBERT: A Next-Generation BERT’ paper and explorations with the model
Published: June 25, 2025

The paper “NeoBERT: A Next-Generation BERT” (2025) by Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, and Sarath Chandar introduces a new encoder model with an updated architecture, training data, and pre-training methods, intended as a strong backbone model.

[W]e introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies.

These are the most important highlights at a first glance:

And here are the relevant links:

Motivation

The motivation behind this paper is that encoders have not received as much love as LLMs in recent years, although they are equally important for downstream applications like RAG systems.

Today’s LLMs are capable of in-context learning and reasoning because of advancements in architecture, training data, pre-training, and fine-tuning. While there has been research on fine-tuning methods for pre-trained encoders (e.g., GTE or jina-embeddings), they are applied to older base models, like BERT from 2019. Therefore, the authors see a lack of updated open-source base models to apply these new fine-tuning techniques to.

As a result, there is a dire need for a new generation of BERT-like pre-trained models that incorporate up-to-date knowledge and leverage both architectural and training innovations, forming stronger backbones for these more advanced fine-tuning procedures.

Recent work on modernizing these base models includes NomicBERT and ModernBERT, which this paper takes inspiration from.

Key insights

The paper covers a lot of interesting nitty-gritty details on recent advancements in architecture choice, training data selection, and pre-training methods. But if you step back, I think the key insights confirm what we already know from LLMs:

  1. Training on a lot of good data = Better models
  2. Increasing model size = Better models (even at small scale)

Training on a lot of good data = Better models

According to the paper, the modification with the biggest improvement was changing the training data:

[…] replacing Wikitext and BookCorpus with the significantly larger and more diverse RefinedWeb dataset improved the score by +3.6% […]

They trained NeoBERT on RefinedWeb, a 2.8 TB dataset containing 600B tokens, nearly 18 times larger than RoBERTa’s training dataset.

Following the same trend, we pre-trained NeoBERT on RefinedWeb (Penedo et al., 2023), a massive dataset containing 600B tokens, nearly 18 times larger than RoBERTa’s.
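For anyone who wants to poke at the data itself, here is a minimal sketch of streaming RefinedWeb with the Hugging Face datasets library. The tiiuae/falcon-refinedweb dataset ID and the content column are my assumptions about the public release, not details taken from the paper.

```python
from datasets import load_dataset

# Stream RefinedWeb instead of downloading all ~2.8 TB up front.
# Dataset ID and column name are assumptions about the public release.
dataset = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, example in enumerate(dataset):
    print(example["content"][:200])  # raw web text of one document
    if i >= 2:
        break
```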

I think it’s interesting that the newer NomicBERT was apparently trained on the same 13 GB dataset as BERT, while RoBERTa was trained on an extended 160 GB dataset. Scaling up the training data like this has been standard for generative models for a while, but it is only now being applied to encoders.

Recent generative models like the LLaMA family (Touvron et al., 2023; Dubey et al., 2024) have demonstrated that language models benefit from being trained on significantly more tokens than was previously standard. Recently, LLaMA-3.2 1B was successfully trained on up to 9T tokens without showing signs of saturation. Moreover, encoders are less sample-efficient than decoders since they only make predictions for masked tokens. Therefore, it is reasonable to believe that encoders of similar sizes can be trained on an equal or even greater number of tokens without saturating.
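To make the sample-efficiency point concrete, here is a quick back-of-the-envelope comparison of my own (not from the paper): a decoder gets a next-token training signal at every position, while a masked-language-model encoder only gets a signal at the masked positions.

```python
# Rough comparison of training signal per 1,024-token sequence.
seq_len = 1024

decoder_predictions = seq_len                  # next-token loss at every position
encoder_predictions_15 = int(0.15 * seq_len)   # classic BERT masking rate
encoder_predictions_30 = int(0.30 * seq_len)   # within the 20-40% range from Wettig et al.

print(decoder_predictions)     # 1024
print(encoder_predictions_15)  # 153
print(encoder_predictions_30)  # 307
```

With only 15-30% of positions contributing to the loss, an encoder extracts several times less signal per sequence, which is why training it on as many (or more) tokens as a decoder seems reasonable.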

Increasing model size = Better models (even at small scale)

The second most impactful modification was increasing the model size and finding an optimal depth-to-width ratio for the Transformer architecture:

[…] while increasing the model size from 120M to 250M in M7 led to a +2.9% relative improvement.

So, NomicBERT and ModernBERT base both have around 150M parameters and are considered small-sized. NeoBERT with 250M parameters can be considered medium-sized, so it makes sense that it performs better than smaller models.

But what’s interesting is that they took the depth-to-width ratio into consideration when increasing the model size:

In contrast, small language models like BERT, RoBERTa, and NomicBERT are instead in a width-inefficiency regime. To maximize NeoBERT’s parameter efficiency while ensuring it remains a seamless plug-and-play replacement, we retain the original BERT base width of 768 and instead increase its depth to achieve this optimal ratio.

So, they first increased the number of parameters to 250M while keeping a BERT-like depth-to-width ratio of 16 layers × 1056 dimensions (too wide), and then adjusted the ratio to 28 layers × 768 dimensions (deeper and more width-efficient).

Note that to assess the impact of the depth-to-width ratio, we first scale the number of parameters in M7 to 250M while maintaining a similar ratio to BERT base, resulting in 16 layers of dimension 1056. In M8, the ratio is then adjusted to 28 layers of dimension 768.
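To see roughly why both configurations end up in the same parameter class, here is a back-of-the-envelope estimate of my own. It assumes a plain Transformer encoder layer (4·d² attention projections plus an 8·d² feed-forward block) and ignores embeddings, SwiGLU, and biases, so the absolute numbers are only approximate.

```python
def approx_encoder_params(num_layers: int, hidden: int) -> int:
    """Rough estimate: 4*d^2 for the attention projections plus
    8*d^2 for a 4x feed-forward block, per layer (no embeddings)."""
    per_layer = 4 * hidden**2 + 8 * hidden**2
    return num_layers * per_layer

# M7: BERT-base-like ratio scaled up, 16 layers of width 1056
print(approx_encoder_params(16, 1056) / 1e6)  # ~214M before embeddings

# M8: deeper but at the original BERT base width, 28 layers of width 768
print(approx_encoder_params(28, 768) / 1e6)   # ~198M before embeddings
```

Both land in the same ballpark once embeddings and the actual feed-forward design are added; what changes between M7 and M8 is how those parameters are traded between depth and width.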

This is nice because, by keeping the hidden size at 768, NeoBERT can easily be swapped in for other base models:

NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models
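As a sketch of what plug-and-play means in practice, here is how one might load NeoBERT through Hugging Face transformers in place of a BERT checkpoint. The chandar-lab/NeoBERT model ID and the trust_remote_code flag are my assumptions about the released checkpoint, and its custom code may pull in extra dependencies.

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face ID of the released checkpoint.
model_id = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("NeoBERT is a next-generation encoder.", return_tensors="pt")
outputs = model(**inputs)

# Hidden states keep BERT base's width of 768, which is what makes
# swapping it into existing pipelines straightforward.
print(outputs.last_hidden_state.shape)  # [batch, seq_len, 768]
```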

What I also find remarkable is that it has much faster inference than both ModernBERT models despite its size. There’s a nice figure in the paper showing the throughput at different sequence lengths. The figure is missing NomicBERT though, and I don’t know why.

For extended sequences, NeoBERT significantly outperforms ModernBERT base, despite having 100M more parameters, achieving a 46.7% speedup on sequences of 4,096 tokens.
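Out of curiosity, here is a rough sketch of how one could measure throughput at a fixed sequence length locally. The model IDs are assumptions, and a real benchmark would need warmup runs, a GPU, and the efficient attention kernels the paper relies on, so the absolute numbers will not match the figure.

```python
import time
import torch
from transformers import AutoModel

def throughput(model_id: str, seq_len: int, batch_size: int = 8, steps: int = 10) -> float:
    """Tokens per second on random inputs of a fixed length (rough estimate)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
    input_ids = torch.randint(1000, 2000, (batch_size, seq_len), device=device)
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(steps):
            model(input_ids=input_ids, attention_mask=attention_mask)
    elapsed = time.perf_counter() - start
    return batch_size * seq_len * steps / elapsed

# Assumed model IDs for the two models compared above.
for model_id in ["chandar-lab/NeoBERT", "answerdotai/ModernBERT-base"]:
    print(model_id, f"{throughput(model_id, seq_len=4096):,.0f} tokens/s")
```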

Old Encoders vs. Modern Encoders

The paper features a nice overview table comparing characteristics of older encoders, like BERT (2019) and RoBERTa (2019), with newer encoders, like NomicBERT (2024) and ModernBERT (2024). Here I’m summarizing the key differences that stood out to me:

| Configuration | Older Encoders | Newer Encoders |
|---|---|---|
| Position encoding and sequence length | Absolute positional embeddings with a sequence length of 512 | RoPE, handling longer sequences of 2,048 to 8,192 |
| Masking rate | 15% | 20-40% (the optimal range found by Wettig et al.) |
| Optimizer | Adam | AdamW |
| Training optimizations | DDP | FlashAttention and other efficiency optimizations |
| Normalization | Post-Layer Normalization | Pre-Layer Normalization (the normalization layer is moved inside the residual connection of each feed-forward and attention block; sketched below) |
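To illustrate the normalization row, here is a minimal PyTorch sketch of the difference between post-LN and pre-LN blocks. These are simplified blocks of my own, not the exact NeoBERT implementation.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Older style (BERT, RoBERTa): normalize after the residual addition."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Newer style: normalize inside the residual branch, before the sublayer."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

# Either block wraps an attention or feed-forward sublayer, for example:
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.randn(2, 128, 768)
print(PostLNBlock(768, ffn)(x).shape)  # torch.Size([2, 128, 768])
print(PreLNBlock(768, ffn)(x).shape)   # torch.Size([2, 128, 768])
```

The usual motivation for the pre-LN variant is more stable training of deeper stacks, which matters for a 28-layer model like NeoBERT.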