Rebuilding a Chatbot Using FastAPI

It sounds pretty straightforward on the surface, right? I decided to go in another direction with my chatbot - CaptAInsLog - after running into issues with the Flask version. I had spent months refining the system and adding capabilities, but I had started with the Assistants API. At the time of its release it sounded good to me, because I had just finished writing a RAG system using LangChain and decided I should do it all from scratch given the limitations I kept running into with LangChain implementations. I also did not like that I needed so many libraries and that so much was abstracted away.

After spending time understanding the Assistants API and building out a full chat setup with lots of capabilities, I realized I had dug myself into a hole with it. The Assistants API works well, but the expense can grow quickly, and the RAG capabilities are especially expensive. You pay for having the Assistants API do things you could do yourself. Implementing your own cosine similarity search and using the embeddings endpoint to create the embeddings is easy enough: embedding generation is cheap, and storing the results in your own database is straightforward.
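To make that concrete, here is a minimal sketch of that kind of search using the current OpenAI Python client and NumPy. The model name and helper functions are placeholders of my own; in the real app the vectors would be stored in the database rather than recomputed on every query.

```python
# Minimal sketch: cosine similarity over OpenAI embeddings (names are illustrative).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query: str, documents: list[str], k: int = 3) -> list[tuple[float, str]]:
    # In practice the document vectors would already sit in the database.
    q = embed(query)
    scored = [(cosine_similarity(q, embed(doc)), doc) for doc in documents]
    return sorted(scored, reverse=True)[:k]
```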

The Assistants API, with its threads, is nice if you want it to handle everything. What it is not good at is changing the history. I assume OpenAI keeps track of the context window and has some sort of memory mechanism to reduce the context size while keeping it relevant. I am not sure how they do this, but they do not allow you to modify previous messages because the thread holds onto the context; you cannot change it. Once it reaches a certain size you could generate a new thread with a summary, but why not do that yourself with the completions endpoint? If you hold onto the context messages yourself, you can always modify them. Streaming also is not working yet, so you have to wait for the full message to arrive, which is a problem for a UI that needs a real-time feel.
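Here is a rough sketch of what owning the thread yourself looks like with the completions endpoint, assuming the current OpenAI Python client; the model name, cutoff, and summarization prompt are placeholders, not the values the app actually uses.

```python
# Sketch: keep the context messages yourself and fold old ones into a summary
# once the history grows past an arbitrary cutoff.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name
history = [{"role": "system", "content": "You are CaptAInsLog."}]


def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    resp = client.chat.completions.create(model=MODEL, messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    if len(history) > 40:  # arbitrary cutoff for illustration
        summarize_oldest()
    return reply


def summarize_oldest() -> None:
    # Summarize everything except the system prompt and the last ten messages,
    # then replace that span with a single summary message.
    old, recent = history[1:-10], history[-10:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": "Summarize this conversation briefly:\n" + transcript}],
    )
    summary = {"role": "system",
               "content": "Earlier conversation summary: " + resp.choices[0].message.content}
    history[:] = [history[0], summary] + recent
```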

The limitations the Assistants API imposes on you as a developer are frustrating at the least and opaque at the most. So, after I started running into issues with the Flask setup, I decided to rebuild the entire thing using the normal OpenAI API. After thinking through what I wanted to do and outlining the steps to get there, I started working on the basics.

I started researching how to get streaming responses and thought through how I could recreate the feel of the Assistants API's threads myself. I quickly realized this was going to be a much more complex project. First, I needed to create a class capable of doing all of the work against the OpenAI API - everything the API offers had to exist in this class: chat, image generation, vision, voice input and output, embeddings, and cosine similarity search.
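A minimal sketch of the shape such a class might take, using the asynchronous OpenAI client; the class name, method names, and defaults are illustrative rather than the actual CaptAInsLog code.

```python
# Hypothetical shape of a single wrapper class around the async OpenAI client.
from openai import AsyncOpenAI


class OpenAIClient:
    def __init__(self, model: str = "gpt-4o"):  # placeholder default model
        self.client = AsyncOpenAI()
        self.model = model

    async def chat_stream(self, messages: list[dict]):
        # Async generator: yields text chunks as they arrive from the API.
        stream = await self.client.chat.completions.create(
            model=self.model, messages=messages, stream=True
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    async def embed(self, text: str) -> list[float]:
        resp = await self.client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding

    async def generate_image(self, prompt: str) -> str:
        resp = await self.client.images.generate(model="dall-e-3", prompt=prompt)
        return resp.data[0].url
```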

With the class built on the asynchronous OpenAI client, and everything else within the class made async as well, I could start on interfacing. The class is usable from the terminal for testing, and that can be turned on and off by calling the chat method from other modules. Once I had streaming chat working, I had to start on the front end.

I decided to move away from Flask and chose FastAPI as the replacement. Running under Uvicorn also lets me package this in a Docker container if I ever get to that point. Having the Swagger UI for testing is a great feature, and building out a proper API for the backend allows for other interfaces in the future besides a web application.
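The entry point itself is tiny. A sketch, assuming the app lives in a module called main.py; the real app has far more routes than this.

```python
# Minimal FastAPI entry point.
from fastapi import FastAPI

app = FastAPI(title="CaptAInsLog")


@app.get("/health")
async def health():
    return {"status": "ok"}

# Run with: uvicorn main:app --reload
# The Swagger UI is then available at http://localhost:8000/docs
```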

I had zero experience with FastAPI, though, and it took a while to understand building Pydantic models for communication. Once I could submit a message and get a response, I found that rendering the streaming chunks through a normal route was not possible, so I had to add a websocket setup to the backend. With websockets, I could establish a session and stream each chunk into the UI as it arrives. With a little bit of JavaScript, I could then replace the last partial message with the updated one, which simulates the response typing itself out as it arrives.
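Here is a hedged sketch of that setup: a Pydantic model for the normal routes and a websocket route that streams chunks to the browser. The route paths, the model fields, and the openai_client module holding the wrapper class are assumptions for illustration.

```python
# Sketch: Pydantic model for regular routes plus a websocket that streams chunks.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

from openai_client import OpenAIClient  # hypothetical module holding the wrapper sketched earlier

app = FastAPI()
ai = OpenAIClient()


class ChatRequest(BaseModel):
    # Used by the plain HTTP routes; field names are illustrative.
    conversation_id: int
    message: str


@app.websocket("/ws/chat")
async def chat_ws(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            user_message = await websocket.receive_text()
            messages = [{"role": "user", "content": user_message}]
            full_reply = ""
            async for chunk in ai.chat_stream(messages):
                full_reply += chunk
                # The browser-side JS replaces the last partial message with this
                # updated one, which gives the typing-out effect.
                await websocket.send_text(full_reply)
    except WebSocketDisconnect:
        pass
```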

My class keeps a local history that is not persisted between sessions, and that was the next problem to solve. I added a flag that switches off the local history whenever websockets are used, and then had to come up with a new history setup. I decided on PostgreSQL as the database. I had no idea how to use it; I just knew from reading that it was a good choice.

Now I had to set up a PostgreSQL server and learn how to interact with it. SQLAlchemy is a Python library that helps with this in many ways, especially when it comes to building out the database, and with Alembic it has a Git-like commit capability for upgrading and downgrading your tables. It can also rebuild the entire database with one script when I need to shift to another server setup. Migrations and everything else can be done pretty easily once you get the hang of the interactions. After getting websockets set up and the basic database models working, I was able to store and restore messages and link them to a conversation. That takes care of the concept of threads at this point.
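A rough sketch of what the user and message models might look like in SQLAlchemy's declarative style; the column names and connection string are guesses at the idea rather than the real schema.

```python
# Sketch of the ORM models; Alembic handles the upgrade/downgrade migrations.
from datetime import datetime

from sqlalchemy import ForeignKey, String, Text, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

engine = create_engine("postgresql+psycopg2://user:pass@localhost/captainslog")  # placeholder DSN


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String(255), unique=True)
    username: Mapped[str | None] = mapped_column(String(64))


class Message(Base):
    __tablename__ = "messages"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
    conversation_id: Mapped[int] = mapped_column(index=True)
    role: Mapped[str] = mapped_column(String(16))
    content: Mapped[str] = mapped_column(Text)
    created_at: Mapped[datetime] = mapped_column(default=datetime.utcnow)


# Base.metadata.create_all(engine) rebuilds the whole schema on a fresh server.
```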

Now I had to separate each user's data. I use Cloudflare for OAuth in front of my server setup. The tunnel itself uses it, and I can set users up within Cloudflare so that only certain email addresses have any access to the app. I also have Google, GitHub, and Microsoft OAuth flows set up within it, so users in Cloudflare can choose how to authenticate themselves. Once authenticated, the Cloudflare JWT is available to decode and the authenticated user's email address can be pulled out of it. Now I have a way of verifying that a user is who they say they are, and a way to assign a conversation ID to that user in the database. When a user accesses the site, the email is extracted on each call to the system and to the database, and their email is what matches them to their conversations. The functions that fetch the certificates, find the kid, match it to the public key, and extract the user's email all live in one authentication module.
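A sketch of that verification flow using PyJWT and requests; the team domain and audience tag are placeholders, and error handling is trimmed for brevity.

```python
# Sketch: verify the Cloudflare Access JWT and pull the user's email out of it.
import json

import jwt
import requests

TEAM_DOMAIN = "https://your-team.cloudflareaccess.com"  # placeholder
POLICY_AUD = "your-application-audience-tag"             # placeholder
CERTS_URL = f"{TEAM_DOMAIN}/cdn-cgi/access/certs"


def get_user_email(token: str) -> str:
    """Fetch Cloudflare's current public keys, match the token's kid, verify, return email."""
    keys = requests.get(CERTS_URL, timeout=5).json()["keys"]
    kid = jwt.get_unverified_header(token)["kid"]
    key_data = next(k for k in keys if k["kid"] == kid)
    public_key = jwt.algorithms.RSAAlgorithm.from_jwk(json.dumps(key_data))
    claims = jwt.decode(token, key=public_key, algorithms=["RS256"], audience=POLICY_AUD)
    return claims["email"]

# Cloudflare Access sends the token in the Cf-Access-Jwt-Assertion header on each request.
```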

Since a user might want a username, I added a username column to the users table. It can be displayed in the logged-in section of the UI instead of the email, and it is matched the same way the conversations are. Cloudflare always sends the token in the headers on each access, and my function looks up the Cloudflare public keys every time so that it always picks up the newest one when the keys roll over. This makes many more interactions with the system secure, because I can always check against the user's token before acting.

I set up some users and interacted with the system simultaneously to see how it reacted. Since everything throughout the interactions needs to be async so that database sessions happen in order and close out when needed, I spent a couple of days refining this process.

The websocket connection is long-lived and establishes a database session on connection, which makes writing to the database the touchy point within the websocket. I can establish new, short-lived sessions in other routes and still read and write without breaking the websocket. This is actually why I chose PostgreSQL: it should scale better and support interactions like this up to a pretty high user count, far more than my system will ever see. Even so, planning out this flow was probably where I spent the most time on the system.
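A sketch of that session pattern with SQLAlchemy's async engine: short-lived sessions per HTTP request through a dependency, and a separate session that the websocket holds for its whole lifetime. The DSN and queries are placeholders.

```python
# Sketch: per-request sessions for normal routes, one held session for the websocket.
from fastapi import Depends, FastAPI, WebSocket
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/captainslog")  # placeholder DSN
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)

app = FastAPI()


async def get_session():
    # One session per request, always closed when the response is done.
    async with SessionLocal() as session:
        yield session


@app.get("/conversations")
async def list_conversations(session: AsyncSession = Depends(get_session)):
    # Short-lived session: reads here never touch the websocket's session.
    result = await session.execute(text("SELECT 1"))  # placeholder query
    return {"ok": result.scalar() == 1}


@app.websocket("/ws/chat")
async def chat_ws(websocket: WebSocket):
    await websocket.accept()
    async with SessionLocal() as session:  # long-lived session, held for the whole connection
        while True:
            incoming = await websocket.receive_text()
            # ...generate the reply and write Message rows through `session`...
            await websocket.send_text(incoming)  # echo placeholder
```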

Ultimately I created three tables in my database: users, messages, and OAuth states. I use Google APIs for some things as well. At the end of each day, the entire conversation is summarized and the summary is stored in a summary field in the users table. The summary is added to the top of a document containing the entire context for the day, which is converted from Markdown to HTML, then from HTML to PDF, and uploaded to Google Drive in a location you choose from the interface. The OAuth table holds the Google refresh tokens so that access can refresh on its own when needed; if it cannot do so automatically, it prompts the end user to re-authorize in the browser, return to the site, and finish the upload.
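One way that pipeline could look, assuming the markdown and WeasyPrint libraries for the Markdown-to-HTML-to-PDF steps and the Google API client for the upload; these library choices, names, and parameters are my assumptions, not necessarily what the project uses.

```python
# Sketch of a daily-log export: markdown -> HTML -> PDF -> Google Drive.
import markdown
from weasyprint import HTML
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload


def export_daily_log(md_text: str, pdf_path: str, creds, folder_id: str) -> str:
    """Convert the day's Markdown log to a PDF and upload it to a Drive folder.

    `creds` are Google OAuth credentials obtained via the stored refresh token;
    `folder_id` is the Drive folder chosen from the interface.
    """
    html = markdown.markdown(md_text)
    HTML(string=html).write_pdf(pdf_path)

    drive = build("drive", "v3", credentials=creds)
    metadata = {"name": pdf_path, "parents": [folder_id]}
    media = MediaFileUpload(pdf_path, mimetype="application/pdf")
    uploaded = drive.files().create(body=metadata, media_body=media, fields="id").execute()
    return uploaded["id"]
```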

So this is the point I am at now. I will be adding function calls next so the bot can schedule things in my calendar when asked. Then I will build out a more advanced memory system. As the context grows, I would like to summarize portions of it and link each summary in the database to the section it summarizes, so the UI always shows the full context without the summaries while the model is sent the summaries in their place to keep the context smaller. Then comes the embedding setup, so long-term memories are stored in the database and the AI can use function calls to search for information related to the context when it does not recall something. This way the memory can build over time and serve as a long-term memory system the AI can use to "know" you. The end-of-day summary will be inserted into the next day's first message so the current context is aware of what occurred the day before. Then it will look up what is on the calendar and insert that to help with planning and keeping you accountable.

I also need to add timestamps to each message sent to the AI so that it is time-aware. That way it can give responses that make sense for the time of day and notice time passing between messages.

I also plan on adding more agents specialized for certain things. If I upload an image, it will be inserted into the end-of-day PDF, but a second agent will use the vision API, along with the message sent when the image was uploaded, to do things with the image like describe it or find something in it according to what the user asks. That agent will then hand the response back to the main chatbot with a flag telling the main bot what it just received, so it can respond to it correctly. So if I ask a question about an image, the other agent does that work, and the main one rephrases the explanation to match the tone of the ongoing conversation and keeps it in memory.

This will ultimately happen with multiple agents: adding a Wikipedia lookup, and then the perplexity.ai API in another model for searching the web and getting up-to-date results into the context window.

Too many things I want to do with this. It will take time since I am only one person. However, I will learn a lot by the end of the road.
