Scaling Gleen's AI Chat Playground
We recently launched a self-serve product that transforms any valid website into a bot. Getting it production-ready was challenging because it requires multiple system services to communicate with each other. In this post, we'll discuss how we overcame those technical hurdles.
Product Experience
Our user experience is simple. A user submits a request, which we store in RequestStore, and we send an approval acknowledgment to their email to prevent spam. After the email is validated, we start scraping and vectorizing the content, which involves visiting every link in the domain. This can take some time, so once it's complete, we invite the user by email to try out their new bot.
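Roughly, each request moves through a few states along the way. Here's a minimal sketch of that lifecycle in Python; the class and state names are illustrative, not our actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
import uuid

class RequestState(Enum):
    SUBMITTED = auto()       # stored in RequestStore, confirmation email sent
    EMAIL_VERIFIED = auto()  # user clicked the confirmation link
    SCRAPING = auto()        # crawling and vectorizing the domain
    READY = auto()           # bot is live; invite email sent

@dataclass
class BotRequest:
    url: str
    email: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: RequestState = RequestState.SUBMITTED
```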
In the following sections, we'll walk through the technical architecture we've deployed in production.
Scraping & Vectorization
The scraping and vectorization process is the most time-consuming part of our operation. We use different scrapers tailored to different content types, such as PDFs, docs, GitHub repositories, and GitHub issues. Some scrapers rely on standard Selenium services, while others use dedicated API calls, as with GitHub issues. After scraping, we vectorize the content and store it using self-hosted FAISS on EC2 Ubuntu instances. This approach keeps costs down and gives us greater control over customization.
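To illustrate, here's a simplified sketch of the vectorize-and-index step with FAISS. The embed() stub stands in for whatever embedding model you'd use in production, and the chunking step is elided:

```python
import faiss
import numpy as np

DIM = 384  # embedding dimensionality; depends on the embedding model

def embed(chunks: list[str]) -> np.ndarray:
    """Placeholder: replace with a real embedding model call."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(chunks), DIM)).astype("float32")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    vectors = embed(chunks)
    faiss.normalize_L2(vectors)         # cosine similarity via inner product
    index = faiss.IndexFlatIP(DIM)
    index.add(vectors)
    return index

index = build_index(["first scraped chunk...", "second scraped chunk..."])
faiss.write_index(index, "site.index")  # persisted to disk, then synced to S3
```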
Storing Vectorized Data
We store our vectorized data in S3, which lets us easily pull the data whenever we need to launch a new server for the query processor.
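The S3 round trip itself is straightforward with boto3. Here's a minimal sketch; the bucket name and key layout are illustrative:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-vector-store"  # hypothetical bucket name

def upload_index(bot_id: str, path: str = "site.index") -> None:
    # push a freshly built FAISS index file to durable storage
    s3.upload_file(path, BUCKET, f"indexes/{bot_id}.index")

def download_index(bot_id: str, path: str = "site.index") -> None:
    # pull it back down, e.g. when a new query-processor server boots
    s3.download_file(BUCKET, f"indexes/{bot_id}.index", path)
```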
Query Processor
Our query processor is the central part of our architecture. It retrieves relevant answers given a query and chat history. This processor contains several components:
- Classifier: This ensures the user's question is contextually relevant. For example, it can suppress off-topic queries, like a user asking about the weather or songs while chatting with a Microsoft support center. Validating the relevance of questions required custom work.
- Image Extractor: Our bot can read screenshots and answer questions based on them. This is useful for customer support, such as when a customer encounters a product error and shares a screenshot. The bot can convert the image to text and use it as input.
- VectorStore Retrieval: When the query processor restarts, it pulls all relevant VectorStore databases from S3 and stores them in a Redis cache. Because our traffic is read-heavy, S3 acts as permanent storage and the single source of truth for these vector databases.
- FAISS Similarity Search: After identifying a query as a valid support query, we perform a vector search in the corresponding vector database. We've experimented with several search strategies here (see the first sketch after this list).
- LLM Calls: Here we call OpenAI or similar transformer models to generate the response. Rather than relying on external SDKs, we've implemented this layer in-house, which gives us more control over customization.
- Streaming: Response generation with the GPT-4 model averages around 30 seconds. To avoid making users wait, we've implemented streaming: whenever a user connects to our service, a WebSocket is attached and its ID is passed to the query processor, which pushes partial text to that WebSocket so users start receiving the answer while it's still being generated (see the second sketch after this list).
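Here's a simplified sketch of the Redis caching and FAISS lookup described above. FAISS indexes serialize to raw bytes, which makes Redis a convenient cache; the key scheme and helper names are illustrative:

```python
import faiss
import numpy as np
import redis

r = redis.Redis()  # connection defaults are illustrative

def cache_index(bot_id: str, index: faiss.Index) -> None:
    # faiss.serialize_index returns a uint8 numpy array we can store as bytes
    r.set(f"index:{bot_id}", faiss.serialize_index(index).tobytes())

def load_index(bot_id: str) -> faiss.Index:
    blob = r.get(f"index:{bot_id}")
    return faiss.deserialize_index(np.frombuffer(blob, dtype="uint8"))

def search(bot_id: str, query_vec: np.ndarray, k: int = 4):
    # query_vec: a single embedding with the same dimensionality as the index
    index = load_index(bot_id)
    scores, ids = index.search(query_vec.reshape(1, -1).astype("float32"), k)
    return list(zip(ids[0], scores[0]))
```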
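And a minimal sketch of the streaming flow, using the websockets library with a stubbed token generator standing in for the actual LLM call (e.g., OpenAI with streaming enabled):

```python
import asyncio
import websockets

async def generate_tokens(prompt: str):
    # Stand-in for a streamed LLM response
    for token in ["Sure", ",", " here", " is", " your", " answer", "."]:
        await asyncio.sleep(0.1)  # simulate generation latency
        yield token

async def handle(ws):
    prompt = await ws.recv()                  # user's question arrives
    async for token in generate_tokens(prompt):
        await ws.send(token)                  # push partial text immediately

async def main():
    async with websockets.serve(handle, "localhost", 8765):
        await asyncio.Future()                # run forever

if __name__ == "__main__":
    asyncio.run(main())
```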
Conclusion
Users are encouraged to test the bot before setting up a demo for the complete product suite. While our initial version is straightforward, we're planning to add more features to the playground, such as scraping progress, limited auto-integrations with websites and Discord channels, bot personas, and more. So if you want to build an AI chatbot for your own website, please try out our playground and share any feedback you have. Stay tuned for updates!