This thread is dedicated to the possibility of integrating BERT models into Orion and implementing an LLM inference engine.
BERT models are advancing rapidly, and I would like them to serve Orion in whatever way the user chooses.
With BERT models it is possible to personalize the user's experience without sacrificing privacy.
Implementation
- A dropdown menu in Settings where the user can choose to download a BERT model (a rough sketch of such a settings model follows this list)
- An option to use multiple BERT models
- Letting the user specify when each BERT model is used, based on given criteria
- Kagi Search working together with Orion for a better UX
- BERT models adjusting site content based on collected data that is kept private in the browser
- A BERT benchmark for Orion would be required
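To make the dropdown idea concrete, here is a minimal TypeScript sketch of what a per-task model configuration could look like. Every type and field name is hypothetical, not Orion's actual API:

```typescript
// Hypothetical settings model for per-task BERT selection.
// None of these names are Orion's actual API; this is a sketch only.

type BertTask = "classification" | "semantic-search" | "ner" | "personalization";

interface BertModelChoice {
  modelId: string;   // e.g. a Hugging Face checkpoint id
  task: BertTask;    // which browser feature this model serves
  enabled: boolean;  // per-task user opt-in
  maxRamMb: number;  // skip loading if the memory budget would be exceeded
}

interface BertSettings {
  downloadOnWifiOnly: boolean;
  models: BertModelChoice[];
}

// Example: the user enables one small model for bookmark classification only.
const settings: BertSettings = {
  downloadOnWifiOnly: true,
  models: [
    { modelId: "Xenova/distilbert-base-uncased", task: "classification", enabled: true, maxRamMb: 400 },
  ],
};
```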
Possibilities
| Category | Use Case / Capability | Description / Examples |
|---|---|---|
| 1. Text Analysis & Classification | General Text Classification | Automatically categorizing texts by topic, type, or relevance (e.g., news articles, support tickets, legal documents). This may be used for bookmarks. |
| | Specific Classifications | Identifying sentiment (product reviews, social media), spam (emails, comments), or user intent (chatbots). |
| | Named Entity Recognition (NER) | Extracting and classifying specific information such as names, locations, and dates from texts. |
| 2. Information Retrieval & Recommendation | Semantic Search & Question Answering (Q&A) | Understanding search queries or questions to extract relevant information from documents, web pages, or browser history (contextual understanding); see the sketch after this table. |
| | Recommendations & Ranking | Suggesting similar websites, products, or content; ranking results or visited sites based on relevance, preference, or user behavior. |
| | Contextual Autocompletion | Intelligent suggestions for completing inputs (e.g., search bar, forms) based on context and past interactions, beyond simple keyword matching. |
| 3. User Experience Enhancement | Pattern & Anomaly Detection (e.g., "annoyances") | Identifying unusual or potentially disruptive website behavior patterns (e.g., frequent title changes, pop-ups) by analyzing sequences/logs. |
| | On-Device Personalization | Adjusting site content, layout, or recommendations based on locally stored user data (history, preferences) processed privately within the browser. |
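To illustrate the semantic search use case, here is a minimal sketch of on-device ranking of history titles, assuming Transformers.js (`@xenova/transformers`) and its published MiniLM checkpoint; the helper names are illustrative:

```typescript
// Sketch: semantic search over history titles with MiniLM embeddings,
// using the documented Transformers.js feature-extraction pipeline.
import { pipeline } from "@xenova/transformers";

const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embed(text: string): Promise<Float32Array> {
  // Mean-pooled, L2-normalized sentence embedding.
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return output.data as Float32Array;
}

function cosine(a: Float32Array, b: Float32Array): number {
  // Vectors are already normalized, so the dot product is the cosine similarity.
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

// Rank history entries against a query, entirely on-device.
async function rankHistory(query: string, titles: string[]) {
  const q = await embed(query);
  const scored = await Promise.all(
    titles.map(async (t) => ({ title: t, score: cosine(q, await embed(t)) }))
  );
  return scored.sort((a, b) => b.score - a.score);
}
```

In practice the title embeddings would be computed once and cached rather than recomputed per query; nothing here leaves the browser.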
BERT and SLM Models for Browser Use
| Model Name | Params / Model Size | Typical RAM (Inference) | CPU Usage (Browser) | Max Seq. Length | Browser Suitability | Use Cases | Link(s) |
|---|---|---|---|---|---|---|---|
| BERT (base) | 110M / 420 MB | 400–800 MB | High for long seq., moderate | 512 | Prototyping, not optimal | Text classification, NER, feature extraction | Hugging Face, GitHub Demo |
| BERT (large) | 340M / 1.6 GB | 1.2–2.5 GB | Very high | 512 | Not practical for browser | Advanced NLU, classification, QA | Restackio |
| DistilBERT | 66M / 250 MB | 200–400 MB | Lower than BERT-base | 512 | Excellent, browser-optimized | Fast text classification, Q&A, sentiment analysis | Hugging Face |
| MobileBERT | 25M / 100 MB | 100–200 MB | Low, efficient | 512 | Best for browser/mobile | Lightweight Q&A, classification | Hugging Face, Demo |
| ALBERT (base) | 12M / 45–60 MB | 80–160 MB | Very efficient | 512 | Very good, small footprint | Efficient text classification, intent detection | Hugging Face |
| MiniLM | 33M / 120 MB | 150–250 MB | Low | 512 | Excellent for search/ranking | Semantic search, ranking, recommendations | Hugging Face |
| TinyBERT | 14M / 55 MB | 60–120 MB | Very low | 512 | Excellent | Fast, efficient, competitive on small tasks | Hugging Face |
| ModernBERT | 110M / varies | 400–700 MB | Moderate (browser-optimized) | 4096+ | State of the art, large context | Retrieval, classification, code/doc search | Hugging Face, Blog |
| Longformer | 149M / 575 MB | 600–1000 MB | High for long documents | 4096 | Research, long docs | Long-doc Q&A, summarization | GitHub |
| BioBERT | 110M / 420 MB | 400–800 MB | As BERT-base | 512 | Research, domain-specific | Biomedical QA, extraction | Hugging Face |
| Entity-BERT | 110M / 420 MB | 400–800 MB | As BERT-base | 512 | Research, specialized | Entity extraction in domain tasks | Frontiers Paper |
| NeoBERT | 80–120M / TBD | Expected lower | Expected lower | 512+ | Research, next-gen efficient | Classification, retrieval, efficient inference | arXiv |
| Qwen2 (1B) | 1B / 4 GB | 2–4 GB | Moderate | 2048+ | Good (quantized) | Multitask, edge, fast | Hugging Face |
| Phi-3 (mini) | 1.8B / 7 GB | 4–8 GB | Moderate | 128k | Good (quantized) | Code, reasoning, dialogue | Hugging Face |
| TinyLlama | 1.1B / 4 GB | 2–4 GB | Moderate | 2048+ | Good (quantized) | General SLM, tiny, fast | Hugging Face |
| Llama 2 (7B) | 7B / 28 GB | 8–16 GB | High, edge device | 4096 | Feasible (quantized/edge) | General, strong SLM | Meta |
| Mistral (7B) | 7B / 28 GB | 8–16 GB | High, edge device | 8192 | Feasible (quantized/edge) | Fast, efficient, browser/edge | Hugging Face |
| Gemma 2 (2B) | 2B / 8 GB | 4–8 GB | Moderate | 8192 | Good (quantized) | Efficient, edge, general SLM | |
| Gemma 3 1B | 1B / 529 MB | 1–2 GB | Excellent | 32k | Excellent | Text gen, QA, summarization, chat | Google, HF |
| Gemma 3 4B | 4B / 2 GB | 3–5 GB | Edge/browser (quantized) | 128k | Excellent (quantized) | Text/image gen, multilingual, summarization | Google, HF |
| StableLM 2 1.6B | 1.6B / 6 GB | 2–4 GB | Excellent | 8192 | Excellent | NLU, chat, multilingual tasks | HF |
| MiniCPM | 1.2B / 5 GB | 2–4 GB | Excellent | 2048 | Excellent | General SLM, chat, summarization | HF |
| BLOOMZ 560M | 560M / 2 GB | 1–2 GB | Excellent | 2048 | Excellent | Translation, chat, summarization, instruct | HF |
| Unreleased: Qwen3 SLM | 1–4B (expected) | ≤4 GB (quantized, est.) | Excellent (design goal) | 32–128k (rumored) | Excellent | Multilingual, multimodal, chat, QA | GitHub |
| Unreleased: Llama 4 Mini | 1–4B (expected) | ≤4 GB (quantized, est.) | Excellent (design goal) | 32–128k (rumored) | Excellent | NLU, chat, summarization | Meta |
| Unreleased: DeepSeek SLM v2 | 1–4B (expected) | ≤4 GB (est.) | Excellent | 32–128k (rumored) | Excellent | General SLM, chat | GitHub |
| Unreleased: Falcon Lite 1B | 1B (expected) | ≤2 GB (est.) | Excellent | 8k+ | Excellent | Multilingual NLU, chat | Falcon HF |
Modern SLMs (Qwen2, Phi-3, TinyLlama, Mistral) are designed to be both more efficient and more capable than classic BERT, especially for generation and multitask use.
For pure browser deployment, models of roughly 100–250 MB or smaller are strongly preferred. Quantization (4–8 bit) can reduce the RAM and storage needs of larger LLMs.
RAM and CPU usage estimates are for inference (not training) at float32 precision; quantization to float16 or int8 reduces requirements further (see https://huggingface.co/docs/accelerate/en/usage_guides/model_size_estimator and https://krishansubudhi.github.io/deeplearning/2019/09/20/BertMemoryAnalysis.html).
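The table's RAM figures follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, with activations and runtime overhead on top. A minimal sketch of that arithmetic:

```typescript
// Back-of-the-envelope weight memory: parameter count × bytes per parameter.
// Activations and runtime overhead come on top, so real usage is higher,
// which is roughly what the table's RAM ranges reflect.

const BYTES_PER_PARAM = { float32: 4, float16: 2, int8: 1, int4: 0.5 } as const;

function weightMemoryMb(params: number, dtype: keyof typeof BYTES_PER_PARAM): number {
  return (params * BYTES_PER_PARAM[dtype]) / (1024 * 1024);
}

// BERT-base (110M params): ~420 MB at float32, ~105 MB at int8.
console.log(weightMemoryMb(110e6, "float32").toFixed(0)); // ≈ 420
console.log(weightMemoryMb(110e6, "int8").toFixed(0));    // ≈ 105
```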
Frameworks for Browser BERT Inference
| Framework / Tool | Stage | Supported Models / Formats | Browser Suitability | Key Features / Notes | Link(s) |
|---|---|---|---|---|---|
| TensorFlow.js | Mature | TF.js, BERT, MobileBERT, etc. | Excellent (all major browsers) | Open source, run/convert models, large JS community | TensorFlow.js |
| Transformers.js | Mature | Hugging Face Transformers | Excellent | Open source, easy API for Hugging Face models in browser (usage sketch below) | Transformers.js |
| ONNX.js | Mature | ONNX (e.g., MiniLM, DistilBERT) | Very good | Open source, runs ONNX models, WASM/WebGL acceleration | ONNX.js GitHub |
| ONNX Runtime Web | Mature | ONNX | Excellent | High performance, WASM, WebGL, quantization support | ONNX Runtime Web |
| WebLLM | Research / active | LLMs, Transformer variants | Excellent (WebGPU/WebGL browsers) | In-browser LLM inference engine, supports quantized models | WebLLM |
| Browser-BERT | Demo / research | BERT (converted), TF.js | Excellent for demos | Example project, educational, shows browser deployment | Browser-BERT |
| browser-use | Mature | Integrates with various LLMs / APIs | Excellent (controls any browser) | Open source automation/orchestration tool, lets AI control browser actions | GitHub |
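As a concrete example of the first group, here is a minimal in-browser sentiment classification sketch using the documented Transformers.js `pipeline` API; the checkpoint is one of the Xenova ONNX conversions on Hugging Face:

```typescript
// Sketch: in-browser sentiment classification with Transformers.js.
// Swap in any compatible DistilBERT checkpoint.
import { pipeline } from "@xenova/transformers";

const classifier = await pipeline(
  "sentiment-analysis",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english"
);

const result = await classifier("Orion with on-device BERT would be great.");
console.log(result); // e.g. [{ label: "POSITIVE", score: 0.99 }]
```

On first use the model files are downloaded and cached; every run after that is fully local.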
Key Distinction:
- Frameworks like TensorFlow.js, Transformers.js, ONNX.js, ONNX Runtime Web, and WebLLM run machine learning models (such as BERT) directly within the browser's JavaScript environment, i.e., they perform the inference itself.
- browser-use is an automation and orchestration tool that lets an AI (often an LLM) control the browser itself (navigate, click, type, scrape). It can use the other frameworks to analyze content gathered during automation, but it does not run the models itself.
Notes:
- Model sizes and RAM/CPU usage can often be reduced by quantization (e.g., float16, int8, int4); expected memory needs can be estimated ahead of time with tools such as `accelerate estimate-memory`.
- Research frameworks like WebLLM are rapidly evolving and are likely to support more BERT/Transformer variants in the near future.
- For production browser projects, MobileBERT, DistilBERT, ALBERT, and MiniLM are currently the most practical choices (see the ONNX Runtime Web sketch after these notes).
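Since those practical choices ship as quantized ONNX files, here is a skeleton of loading and running one with ONNX Runtime Web; the model path is hypothetical and the token ids are hard-coded stand-ins for real tokenizer output:

```typescript
// Sketch: running a quantized DistilBERT with ONNX Runtime Web.
import * as ort from "onnxruntime-web";

// Load the model (path is hypothetical) using the WASM backend.
const session = await ort.InferenceSession.create("/models/distilbert-int8.onnx", {
  executionProviders: ["wasm"],
});

// Token ids would come from a real tokenizer; fixed values here for illustration.
const inputIds = new ort.Tensor("int64", BigInt64Array.from([101n, 2023n, 102n]), [1, 3]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from([1n, 1n, 1n]), [1, 3]);

// Input names follow the Hugging Face ONNX export convention.
const outputs = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
console.log(Object.keys(outputs)); // e.g. ["logits"]
```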