AI Chatbot Features: How It Works Under the Hood

A technical deep-dive into embeddings, vector search, RAG architecture, and how we prevent hallucinations while delivering accurate answers from your content.

Vector database
GPT-5 & Claude
Zero hallucinations

How AI Learns From Your Website

AI chatbot features like automatic learning are what set modern chatbots apart from traditional rule-based bots. When you connect your website to Boei, a sophisticated process transforms your content into something an AI can understand and search through. These features are perfect for ecommerce websites looking to automate support. Here's what happens under the hood:

1. Content Extraction & Cleaning

Our crawler visits every page of your website and extracts the meaningful content. This isn't just copying HTML — we intelligently remove:

  • Navigation menus, footers, and sidebars
  • Advertisements, forms, scripts, and styles
  • Comments, pagination, and breadcrumbs
  • Duplicate content and boilerplate text

What remains is the actual content your visitors care about: product descriptions, FAQs, policies, articles, and documentation. We also extract metadata like page titles, H1 headings, and descriptions.

2. Intelligent Chunking

Long pages are split into smaller, semantic chunks that preserve meaning. Our chunking algorithm:

  • Preserves code blocks and tables as complete units
  • Respects paragraph and section boundaries
  • Keeps related information together
  • Optimizes chunk size for retrieval accuracy

3. Creating Embeddings

Embeddings are numerical representations of text that capture semantic meaning. Think of them as coordinates in a multi-dimensional space where similar concepts are close together. The sentence "What's your return policy?" and "Can I send items back?" have different words but nearly identical embeddings because they mean the same thing.

We use state-of-the-art embedding models to convert each chunk of your content into these numerical vectors — typically 1,536 dimensions that capture nuance, context, and meaning.

4. Vector Storage

These embeddings are stored in a specialized vector database optimized for similarity search. Unlike traditional databases that match exact keywords, vector databases find content based on meaning. This is why the chatbot understands questions even when visitors don't use the exact words from your website. See these features in action: 18 chatbot use cases. All features included in our simple pricing. Easy setup on WordPress.

The Training Pipeline

From raw content to searchable knowledge base

1

Scrape

Crawl your website via sitemap or domain discovery. Support for JavaScript-rendered pages using our custom scraper.

2

Process

Clean HTML, remove navigation/ads/scripts, extract meaningful content and metadata.

3

Chunk

Split content into semantic units while preserving code blocks, tables, and context.

4

Embed

Convert text chunks into 1,536-dimensional vectors that capture semantic meaning.

5

Store

Save embeddings in Weaviate vector database with hybrid BM25 + vector search.

6

Search

Hybrid BM25 + vector search finds the most relevant content for any question.

What Happens When Someone Asks a Question

One of the most important AI chatbot features is intelligent question handling. When a visitor types a question, a multi-stage process ensures they get the most accurate answer possible:

Step 1: Query Pre-Processing

The raw question is analyzed and enhanced before searching. This includes:

  • Identifying the intent behind the question
  • Expanding abbreviations and fixing typos
  • Generating alternative phrasings for better matching
  • Extracting key entities (product names, features, etc.)

Step 2: Hybrid Search

We don't rely on just one search method. Instead, we combine:

  • Vector search — Find content with similar meaning using embeddings
  • BM25 keyword search — Traditional text matching for exact terms

This hybrid approach catches both semantic matches ("refund" matches "return policy") and exact matches (specific product names, model numbers).

Step 3: Re-Ranking

Results from both search methods are combined and re-ranked based on:

  • Content type relevance (FAQs weighted higher for questions)
  • Page importance and freshness
  • Match quality across both search methods
  • Custom rules (e.g., pricing pages for price questions)

Step 4: Answer Generation

The top-ranked content chunks are sent to the LLM (GPT-5 or Claude) along with the original question. The AI synthesizes an answer using only the provided content — never its general training data. This is called Retrieval-Augmented Generation (RAG).

Step 5: Source Attribution

Every answer includes links to the source pages used. Visitors can verify information themselves, and you can see exactly what content informed each response.

How We Prevent Hallucinations

AI hallucinations happen when models generate plausible-sounding but incorrect information. This is the #1 concern businesses have about AI chatbots. Here's how Boei solves it:

  • RAG architecture: The AI can ONLY use content from your knowledge base, not its general training
  • Source citations: Every answer shows exactly which pages were used, so visitors can verify
  • Confidence thresholds: If the AI can't find relevant content, it says "I don't know" instead of guessing
  • Fallback to humans: When uncertain, the bot offers to connect visitors with your team
  • Custom instructions: Set explicit boundaries on topics the bot should and shouldn't discuss

How We Prevent Hallucinations

Training Process Step-by-Step

How to create your AI chatbot from scratch

1

Create Your Bot

Start by entering a simple prompt describing your business, or just paste your domain URL. Boei's AI analyzes your site and automatically configures the chatbot's personality, tone, and focus areas. One click from your domain gets you a working bot.

2

Add Knowledge Sources

Choose how the bot should learn: upload your sitemap for automatic crawling, let our crawler discover pages, or manually add URLs. You can also upload documents (PDF, Word, Excel, PowerPoint), paste FAQ content, or add custom text blocks.

3

Review & Refine Content

See exactly what content was extracted from each source. Remove irrelevant pages, adjust what content types to prioritize, and add custom rules for special content like pricing tables or technical specs.

4

Configure Behavior

Set custom instructions for how the bot should respond. Define its personality, specify topics to avoid, configure when to escalate to humans, and set up lead capture fields (email, name, phone, etc.).

5

Test with Automated Cases

Create test questions and expected answers. Run automated test suites to verify the bot handles common scenarios correctly. Review transcripts and refine instructions based on real performance.

6

Deploy & Monitor

Install on your website with one line of code or use our WordPress/Shopify plugins. Monitor conversations in real-time, review analytics on visitor engagement, and continuously improve based on actual usage patterns.

Complete AI Chatbot Features List

All the AI chatbot features included with Boei — no hidden costs or add-ons

AI Bot Creation

Create a bot using AI from a prompt or one-click from your domain

Website Learning

Upload sitemap or use our crawler to learn your entire site

Document Training

Learn from FAQ, text, Excel, PDF, PPT, and other documents

Source Display

Show sources to customers — no hallucinations, fully verifiable

Lead Delivery

Get leads via email, webhook, or Boei inbox with full transcripts

Analytics Dashboard

Track page visits, bot opens, interactions, and conversion to leads

Lead Fields

Configure which fields to collect: email, name, phone, custom fields

Conversation History

Full searchable history of all conversations with AI summaries

Quick Buttons

Suggested responses for visitors to click instead of typing

Auto Translation

Interface automatically translates to visitor's language (95+ languages)

Flexible Installation

Widget on your site or standalone page/landing page

Custom Instructions

Fully adjust bot behavior to match your exact use case

Latest AI Models

GPT-5, Claude 4 Sonnet, GPT-4o, and o3-mini available

Design Customization

Match your brand colors, fonts, and styling

Custom Texts

Adjust all interface text and messages

Live Chat Escalation

Hand off conversations to human agents when needed

Advanced Prompts

Custom system prompts for power users

Automated Testing

Set up test cases to review bot performance automatically

Technical Specifications

The AI models and infrastructure powering your chatbot

Supported LLMs

GPT-5 (Latest, fastest) • Claude 4 Sonnet (Most human-like responses) • GPT-4o (Reliable workhorse) • o3-mini (Budget-friendly option). Choose based on your needs — switch anytime.

Vector Database

Powered by Weaviate — an enterprise-grade vector database that handles millions of documents. Supports hybrid BM25 + vector search for optimal retrieval accuracy.

Processing Pipeline

Scrape → Process → Chunk → Embed → Store → Search. Intelligent chunking preserves semantic boundaries. Custom content rules for pricing, tables, and code blocks.

Content Cleaning

Automatic removal of nav, footer, sidebar, ads, forms, scripts, styles, comments, pagination, and breadcrumbs. Metadata extraction for title, H1, and description.

Supported Knowledge Sources

Flexible knowledge sources are among the most powerful AI chatbot features available. Your AI chatbot can learn from multiple types of content, all processed through the same embedding pipeline:

Website Content

  • Sitemap scraping — Parse XML sitemaps to discover all URLs automatically
  • Domain crawling — Intelligent crawler discovers pages even without a sitemap
  • JavaScript support — Fetch dynamic content from SPAs and JS-rendered pages using our custom scraper

Uploaded Documents

  • PDF files — Product manuals, guides, policies, brochures
  • Word documents — Internal documentation, procedures
  • Excel spreadsheets — Product catalogs, specifications, pricing
  • PowerPoint presentations — Training materials, sales decks
  • Text files — Any plain text content

Manual Content

  • FAQ management — Add question/answer pairs directly
  • Custom text blocks — Paste any content you want the bot to know
  • Custom rules — Define special handling for specific content types

All sources are combined into a unified knowledge base. The bot seamlessly searches across everything when answering questions.

Search Capabilities

🔍

Hybrid Search

Combines BM25 keyword matching with vector similarity search for best results

Query Pre-Processing

Enhances queries before search for better retrieval accuracy

📊

Re-Ranking

Results re-ranked by content type, page importance, and relevance

🔗

Source Attribution

Every answer includes clickable links to original source pages

Technical FAQ

What's the difference between GPT-5 and Claude models?

GPT-5 is OpenAI's latest model — fast and capable for most use cases. Claude 4 Sonnet from Anthropic tends to produce more human-like, nuanced responses and is better at following complex instructions. GPT-4o is the reliable middle-ground. o3-mini is the budget option for high-volume, simpler queries. You can switch models anytime based on your needs.

How does vector search differ from keyword search?

Keyword search (BM25) finds exact word matches — great for specific terms like product names. Vector search uses embeddings to find semantically similar content — so 'return policy' matches 'how to send items back' even without shared words. Boei uses both together (hybrid search) to get the benefits of each approach.

What are embeddings and why do they matter?

Embeddings are numerical representations of text (typically 1,536 numbers) that capture meaning. Similar concepts have similar embeddings, regardless of the exact words used. This lets the chatbot understand questions even when visitors phrase things differently than your content. It's the core technology that makes modern AI search actually useful.

How does RAG prevent hallucinations?

RAG (Retrieval-Augmented Generation) means the AI only uses content retrieved from your knowledge base to answer questions. Unlike ChatGPT which uses its general training, your Boei chatbot is constrained to YOUR content. If relevant content isn't found, the bot says 'I don't know' rather than making something up. Every answer includes source links for verification.

Can the bot handle JavaScript-rendered websites?

Yes. Our custom scraper can fetch content from single-page applications (SPAs) and JavaScript-rendered pages. When standard crawling doesn't capture content, enable JS rendering in your source settings. This ensures React, Vue, Angular, and similar sites are fully indexed.

What happens to my data during training?

Your content is processed on EU servers (Amsterdam) and stored in Weaviate, our vector database. We never train AI models on your private data or conversations. Content is used solely to answer questions from YOUR chatbot visitors. You can delete sources anytime and they're immediately removed from the knowledge base.

How do I know if the bot is answering correctly?

Several ways: 1) Every answer shows source links so you can verify accuracy. 2) Set up automated test cases with expected answers and run regular checks. 3) Review conversation history and flag incorrect responses. 4) Use the analytics dashboard to spot topics where visitors aren't getting good answers.

What's the re-ranking step in the pipeline?

After initial search returns results, re-ranking improves ordering based on: content type (FAQs rank higher for questions), page variants (canonical vs. paginated), freshness, and custom rules you define. This ensures the most relevant chunks are sent to the LLM for answer generation, improving response quality.

See It In Action

Try the demo chatbot or start your free trial — setup takes 5 minutes.

7-day free trial • No credit card required • Cancel anytime
Andrew Lee David S. Vance W. Grant Nitesh Manav
from 159 reviews

Related Resources

Learn more about AI chatbots

Use Cases

Real examples across industries

Pricing

Simple, transparent pricing

WordPress

Our most popular platform