January 1, 2024

Chat with Your Website Using LlamaIndex & GPT-4

We've been experimenting with building a chatbot for our website to answer questions about Logram using LlamaIndex and GPT-4. To make this as simple as possible, we decided the chatbot should scrape every page from this site to build its index. That way, there's no need to maintain two separate knowledge bases. This also allows us to have the bot link to its sources.

This isn't an original idea. ChatBot does this well while offering a clean UI on top of it. It's a bit bloated for our purposes, though.

Quick Start

Here's the GitHub repo: https://github.com/logram-llc/llmsitechatbot. Feel free to fork or contribute.

# Install the CLI
git clone https://github.com/logram-llc/llmsitechatbot
cd llmsitechatbot
pip install -e .
# Copy and then configure config.json
cp config.sample.json config.json
# Build the index
llmchatbot --config config.json build https://logram.io/sitemap.xml
# Interact with the LLM through an API
llmchatbot --config config.json serve
curl -X POST http://localhost:8000 \
-H "Content-Type: application/json" \
-d '{"message": "Who is behind this company?"}'


During llmchatbot build, the app essentially:

  1. Accepts a sitemap or series of URLs
  2. Visits each page with Playwright (to handle any JavaScript)
  3. Downloads that web page's content
  4. Converts it to markdown
  5. And inserts it into a vector index
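Step 1 of this pipeline can be sketched with nothing but the standard library. This is an illustrative version (not the repo's actual implementation): it pulls every `<loc>` entry out of a sitemap document, which is what the build command would then feed to Playwright.

```python
import xml.etree.ElementTree as ET

# Sitemaps use this XML namespace (per the sitemaps.org protocol)
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract page URLs from a sitemap's <loc> entries (step 1 above)."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Each returned URL is then visited, converted to markdown, and inserted into the vector index as described above.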

During llmchatbot serve, it exposes a POST-able API endpoint that accepts a user's query. Using retrieval-augmented generation (RAG), the information relevant to that user's query is pulled from the index and handed to ChatGPT along with the query itself.

LlamaIndex makes this process dead-simple. Here's a quick snippet demonstrating this:

from os import environ
from llama_index import VectorStoreIndex, ServiceContext, Document
from llama_index.llms import OpenAI

SOURCE_SITEMAP = 'https://logram.io/sitemap.xml'

# Recursively scrape the website using `SOURCE_SITEMAP`
documents = [Document(text="Website contents", metadata={})]

llm = OpenAI(model="gpt-3.5-turbo", api_key=environ["OPENAI_API_KEY"])
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

query_engine = index.as_query_engine(
    system_prompt="Your job is to answer questions about Logram, a small web agency.",
    streaming=True,
)

while True:
    streaming_response = query_engine.query(input("\n> "))
    streaming_response.print_response_stream()

If you're interested in better understanding these concepts, LlamaIndex has an excellent high-level overview.

I doubt we'll develop this much further, but it'd be sweet to have a simple frontend on top of the CLI to manage data sources for the index. Having its API return references in the response body is another nice-to-have.