We've been experimenting with building a chatbot for our website to answer questions about Logram using LlamaIndex and GPT-4. To make this as simple as possible, we decided the chatbot should scrape every page from this site to build its index. That way, there's no need to maintain two separate knowledge bases. This also allows us to have the bot link to its sources.
This isn't an original idea. ChatBot does this well while offering a clean UI on top of it. It's a bit bloated for our purposes, though.
Quick Start
Here's the GitHub repo: https://github.com/logram-llc/llmsitechatbot. Feel free to fork or contribute.
```bash
# Install the CLI
git clone https://github.com/logram-llc/llmsitechatbot
cd llmsitechatbot
pip install -e .

# Copy and then configure config.json
cp config.sample.json config.json

# Build the index
llmchatbot --config config.json build https://logram.io/sitemap.xml

# Interact with the LLM through an API
llmchatbot --config config.json serve
curl -X POST http://localhost:8000 \
  -H "Content-Type: application/json" \
  -d '{"message": "Who is behind this company?"}'
```
Overview
During `llmchatbot build`, the app essentially:
- Accepts a sitemap or series of URLs
- Visits each page with Playwright (to handle any JavaScript)
- Downloads that web page's content
- Converts it to markdown
- And inserts it into a vector index (a sketch of this pipeline follows)
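That pipeline is simple to reproduce. Here's a minimal sketch of it, assuming `requests` for fetching the sitemap and `html2text` for the HTML-to-markdown step (the repo's actual converter and error handling may differ):

```python
from xml.etree import ElementTree

import html2text
import requests
from llama_index import Document
from playwright.sync_api import sync_playwright

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_url: str) -> list[str]:
    # Pull every <loc> entry out of the sitemap
    tree = ElementTree.fromstring(requests.get(sitemap_url).content)
    return [loc.text for loc in tree.iter(f"{SITEMAP_NS}loc")]


def scrape_to_documents(urls: list[str]) -> list[Document]:
    documents = []
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        for url in urls:
            # Render the page (JavaScript included) and grab the final HTML
            page.goto(url)
            markdown = html2text.html2text(page.content())
            documents.append(Document(text=markdown, metadata={"url": url}))
        browser.close()
    return documents


documents = scrape_to_documents(urls_from_sitemap("https://logram.io/sitemap.xml"))
```

Storing the source URL in each document's metadata is what makes it possible for the bot to link back to its sources.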
During `llmchatbot serve`, it exposes a POST-able API endpoint that accepts a user's query. Using RAG, the information relevant to that user's query is pulled from the index and handed to ChatGPT along with the query itself.
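Here's roughly what that looks like, as a minimal sketch using FastAPI (an assumption on our part, as is the `./storage` persist directory; the repo's actual server may be structured differently):

```python
from fastapi import FastAPI
from llama_index import StorageContext, load_index_from_storage
from pydantic import BaseModel

# Load the vector index that the build step persisted to disk
# (if it was built with local embeddings, pass the same service_context here)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

app = FastAPI()


class ChatRequest(BaseModel):
    message: str


@app.post("/")
def chat(request: ChatRequest) -> dict:
    # Retrieve the chunks relevant to the query and hand them to the LLM
    response = query_engine.query(request.message)
    return {"response": str(response)}
```

Run it with something like `uvicorn server:app --port 8000` and the `curl` example above should work as-is.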
LlamaIndex makes this process dead-simple. Here's a quick snippet demonstrating this:
```python
from os import environ

from llama_index import VectorStoreIndex, ServiceContext, Document
from llama_index.llms import OpenAI

OPENAI_API_KEY = environ['OPENAI_API_KEY']
SOURCE_SITEMAP = 'https://logram.io/sitemap.xml'

# Recursively scrape the website using `SOURCE_SITEMAP`
documents = [Document(text="Website contents", metadata={})]

llm = OpenAI(model="gpt-3.5-turbo", api_key=OPENAI_API_KEY)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    show_progress=True,
)

# chat_mode and system_prompt are chat-engine options, so build a chat engine
chat_engine = index.as_chat_engine(
    chat_mode="context",
    verbose=True,
    system_prompt="Your job is to answer questions about Logram, a small web agency.",
)

while True:
    streaming_response = chat_engine.stream_chat(input("\n> "))
    streaming_response.print_response_stream()
```
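Note that `embed_model="local"` has LlamaIndex compute embeddings with a local HuggingFace model rather than OpenAI's embedding API, so the only billed calls are the chat completions.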
If you're interested in better understanding these concepts, LlamaIndex has an excellent high-level overview.
I doubt we'll develop this much further, but it'd be sweet to have a simple frontend on top of the CLI to manage data sources for the index. Having its API return references in the response body is another nice-to-have.