Disclaimer: This is one of my older writings and may be somewhat dated today.
In the future of NLP, conversational agents are going to be at the forefront, transforming human lives in numerous ways and making human-machine interaction ever more seamless. As we move towards higher levels of intelligence in conversational agents, one challenge is going to be the effective mass deployment of these agents. This paper proposes an architecture for deploying conversational AI agents, particularly those developed on the Rasa stack. We cover a simple yet robust framework capable of deployment at huge scale with little time and effort.
Inspired by movies such as Iron Man, conversational agents like Jarvis have long been a dream of people with a technological inclination. To achieve something as capable as Jarvis, we need advances on several technological fronts and, finally, a seamless integration of these technologies. In any case, the core conversational agent is a critical component of the complete system, and the progress we make on this front is going to be pivotal for future enhancements of such integrated systems. In this paper, we focus on the large-scale deployment of such conversational agents. In particular, we discuss deployment on the Rasa stack: we examine the challenges of such a deployment, elaborate upon the available options, and finally propose an elegant way to achieve it with little time and effort. The paper follows this route: starting with some basics of Rasa and its architecture, we discuss the available deployment options and their challenges, and finally propose our architectural setup for a highly available and extensively scalable system.
Rasa is an open-source platform that can be used to build conversational agents. It provides a simple abstraction layer over a complex integrated system, which users can understand and then use to build a conversational agent. Users can build anything from simple chit-chat bots to complex bots that perform business operations on behalf of the user. It provides a good number of options for easily incorporating complex use cases into the bot.
To explain the Rasa development phase in simple terms, we can think of it as a frontend and a backend. Contrary to the usual notion in the web development world, where the core computation happens in the backend and the frontend is mainly responsible for the UI and minor computations, here the frontend is in fact the core conversational brain of the bot, while the backend is the computation engine that interacts with other APIs or databases to accomplish business objectives on demand from the frontend.
Going deeper into the frontend, we find three main parts that form the crux of the conversational brain of the bot: ‘nlu.yml’, ‘rules.yml’ and ‘domain.yml’.
‘nlu.yml’ contains the intents and the corresponding examples for those intents. These are used to train a classification model, specifically the DIET (Dual Intent and Entity Transformer) classifier; Bunk et al. discuss DIET in depth in their research paper. We can also annotate entities within the training examples of intents. These can be extracted by the model and mapped to defined slots.
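For illustration, a minimal ‘nlu.yml’ might look like the sketch below; the intent name, examples and entity are hypothetical and not taken from any particular bot.

```yaml
version: "3.1"
nlu:
- intent: check_balance          # hypothetical intent
  examples: |
    - what is my account balance
    - show me the balance of my [savings](account_type) account
    - how much money is left in [checking](account_type)
```

Entity annotations such as `[savings](account_type)` let the DIET classifier learn to extract the entity alongside the intent.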
‘rules.yml’ is a more concrete and stringent way to define the conversation paths the bot may follow during a real-time conversation. ‘stories.yml’ also provides a way to define conversation paths, but it is not as stringent and concrete as ‘rules.yml’ and can lead to unexpected behavior at times. ‘domain.yml’ is the file that consolidates all the information defined anywhere in the frontend, that is, the names of all intents, slots, entities, etc., and in the backend (discussed later), mainly the actions, plus some additional information. This justifies the name ‘domain’. Slots are placeholders, similar to variables in modern programming languages, and can be used to pass user inputs to the backend. The additional information in this file consists of ‘responses’, session configuration parameters and ‘forms’. ‘forms’ are similar to forms in web development and are used in scenarios where we need multiple inputs from a user to accomplish an objective. Each form defines the slots required to fill it, along with any entity mappings.

Going deeper into the backend, we find it is mainly a collection of programming routines that may do some computation, call some external API, or do a combination of both.

The other files present are ‘config.yml’, ‘credentials.yml’ and ‘endpoints.yml’. ‘config.yml’ provides options to configure the NLP pipeline (e.g., the preprocessing steps, tunable parameters, thresholds, etc.). It also provides configurable parameters for the dialogue policies defined via ‘rules.yml’ or ‘stories.yml’. ‘credentials.yml’ provides REST endpoints and options to integrate the bot with platforms like Facebook, Slack, etc. ‘endpoints.yml’ is an important file when it comes to large-scale deployment, and we will discuss it in more detail later in this paper. It provides a way to configure the lock store, the tracker store, event brokers and, most importantly, the action endpoint. In fact, this action endpoint is what binds the frontend to the backend.
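As a small sketch of how these pieces fit together, the snippet below shows a rule that maps an intent to a custom action, and the action endpoint in ‘endpoints.yml’ through which the frontend reaches that action; the rule, intent and action names are illustrative.

```yaml
# rules.yml - a rule that hands a query over to a custom action (names are illustrative)
rules:
- rule: Answer a balance query
  steps:
  - intent: check_balance
  - action: action_fetch_balance

# endpoints.yml - the action endpoint that binds the frontend to the backend (the action server)
action_endpoint:
  url: "http://localhost:5055/webhook"
```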
The architecture of Rasa has been designed meticulously, with due consideration to large-scale design patterns while also limiting the complexity of the design. Moreover, the team has done a great job of refactoring the design from time to time based on user experience and feedback. We will not cover the design in full detail, but in enough depth to understand the deployment options and our proposed deployment design. Figure 1 shows the system combining the frontend (mainly Rasa Open Source) and the backend (mainly the Rasa SDK).

Now we shall discuss some important aspects of deploying a Rasa bot on a large scale. In any large-scale system there is a limit to vertical scaling: we cannot go beyond a particular hardware specification at a given point in time. Moore's law, roughly speaking, describes the cadence at which transistor density, and hence hardware capability, improves from one generation to the next; at any given generation of memory and compute devices we are stuck at a hardware limit, and it takes considerable time to jump to the next generation. So, irrespective of other factors, we have to resort to horizontal scaling in large-scale deployments. Coming to the large-scale deployment of our Rasa bot: the bot, imagined as a single entity (internally consisting of the complete Rasa architecture) deployed on a single machine, will be limited to a particular threshold of requests per second bound by the hardware specs of the underlying machine. We have multiple options; each option in some sense forms a distributed system which can be scaled by adding more machines. The questions are how we distribute the functionalities, whether we run a large number of bots, and how we ensure that the user experience remains seamless, that is, that we do not lose the relation built between the user and the bot throughout the session.

The situation is different from scaling a simple stateless API endpoint. When we have to scale an API that has some user state associated with it, we generally do not collocate the user state on the same server where the API runs; instead, we store that state on a separate server accessible by the API instances residing on different servers. The Rasa architecture is built to use the same technique. Here the information is captured in the combined state of the user data stored in the ‘Tracker Store’ and the ‘Lock Store’, as shown in the system architecture. Without going into too much internal detail: the Tracker Store keeps a record of the user conversation and slot values, while the Lock Store facilitates this process, avoiding pitfalls like missed updates and dirty reads, and also helps to implement time-saving algorithmic tweaks such as lazy updates in the Tracker Store. These stores can be specified in ‘endpoints.yml’, so we can make two different bots point to the same Tracker Store and Lock Store. By default, both are kept in the memory of the machine where the bot resides. A model path, if provided, tells the bot where to pick up the ML model from; by default it is the model present in the same project directory as the other files. We will now discuss the options for large-scale deployment, their challenges, and our proposed design.
RasaX is the way proposed by the Rasa team for robust, large-scale, conversation-driven development and deployment. Although they do discuss other options, such as containers, they suggest RasaX as the best way out there. To keep it simple, RasaX can be imagined as an additional layer on top of the existing Rasa architecture which provides a whole lot of other features, alongside methods for beta-testing among a limited set of users and deployment at scale using Kubernetes on cloud providers such as GCP (Google Cloud Platform) or AWS (Amazon Web Services). It provides insights on user conversations, suggesting new intents or cleaner NLU examples. Another important point is that it provides many of these features via a web UI, so even a stakeholder can easily get hold of the UI, understand the users' conversations and the bot's functioning, and improve the bot. This is called conversation-driven development. To have some form of conversation-driven development on the plain Rasa architecture discussed earlier, one would need a processing engine on top of the users' conversations. The minimum this engine would need to do is filter out user messages that fall below a threshold confidence value; at the more complex end, it could cluster messages for anomaly detection, or perform combinations of such operations.
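As a minimal sketch of such an engine, assuming the MongoDB tracker store discussed later (the database, collection and field names may differ across Rasa versions and setups), the script below pulls out user messages whose predicted intent confidence fell below a threshold so they can be reviewed and folded back into ‘nlu.yml’.

```python
# Minimal conversation-driven-development helper: list low-confidence user messages
# from a MongoDB tracker store. Database/collection layout is an assumption.
from pymongo import MongoClient

THRESHOLD = 0.7  # illustrative confidence cut-off

client = MongoClient("mongodb://localhost:27017")
conversations = client["rasa"]["conversations"]

low_confidence = []
for tracker in conversations.find():
    for event in tracker.get("events", []):
        if event.get("event") != "user":
            continue  # only user messages carry NLU parse data
        intent = event.get("parse_data", {}).get("intent", {})
        if intent.get("confidence", 1.0) < THRESHOLD:
            low_confidence.append({
                "sender_id": tracker.get("sender_id"),
                "text": event.get("text"),
                "intent": intent.get("name"),
                "confidence": intent.get("confidence"),
            })

for item in low_confidence:
    print(item)
```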
As discussed, the RasaX route provided by the Rasa team has many positives for large-scale, robust deployment, but it is not all sweet. The suggested path to large-scale deployment of RasaX is via Kubernetes, which poses a couple of challenges. The foremost is that Kubernetes is not simple: even though it takes care of a number of things, such as load balancing, maintaining consistent system states, and bringing new pods (loosely, a single minimal identifiable live subsystem) up and down, it is not everyone's cup of tea, and it takes real effort to understand Kubernetes and the orchestration world. The second is that the way suggested by the Rasa documentation is to host the Kubernetes cluster on a cloud service provider such as GCP. The business requirement may itself demand that the deployment be on on-premise machines, in which case we cannot sidestep the dreaded Kubernetes learning curve by pushing the Kubernetes work onto some cloud service provider.
We need to distribute the work across multiple machines; in other words, we need to scale the system horizontally. One simple way is to have multiple bots that serve different users. Here there are a couple of possibilities. In the first, we keep the default in-memory Tracker Store and Lock Store and ensure that all requests of a particular user always land on the same bot. This can be achieved by intercepting the requests with a load balancer and then using the ‘SenderId’ (the key attribute used by Rasa to differentiate between two users) to route a specific user's requests to the same bot every time. In the second, we host the Tracker Store and Lock Store on separate servers and configure the ‘endpoints.yml’ of each bot to point to those servers. We again intercept the requests with a load balancer, but here the load balancer only does load balancing, besides acting as a reverse proxy for the complete system. We need not ensure that requests of a particular user land on a particular bot every time: all bots are in sync with each other because of the shared Tracker Store and Lock Store configuration. Put differently, the user's conversation history is common to all bots and stored on a separate server rather than in in-memory storage, so the relation built between the system and the user is not bound to a specific bot. In fact, this solution has one more advantage over the former one: in the former, if a specific bot goes down, all users associated with that bot have to start from scratch, and all conversation history is lost. Such a situation does not arise in the second case, since another bot takes over because no particular bot is bound to a user. One small disadvantage of the second solution is that the complete system goes down if the Tracker Store or the Lock Store goes down. This can be mitigated by maintaining replica sets so that another set takes over in case of failure.

Taking a step further, when it comes to the actual deployment of such a system, it is cumbersome to set up environments on all machines and then launch Rasa bots on those machines. Containerization is the proper solution for such scenarios: it provides sandboxing at the application level and helps avoid setting up environments on each machine separately, leading to faster deployments.

Looking at a single instance of the Rasa bot, there is a natural division within it, namely the frontend and the backend discussed earlier. Each instance starts two processes, one for the frontend and one for the backend. A pitfall can arise where the frontend is up but the backend is down for a particular instance, leading to anomalies. Such a situation would not be caught in our present design, since the load balancer would only query the frontend. There are a couple of solutions to this. First, we can change the health-check query of the load balancer so that it reaches the backend rather than stopping at the frontend. Second, we can bring down the corresponding frontend whenever its backend is down; for this, we would need to monitor the backend of each instance using some other service such as Monit or Prometheus. The third is an elegant solution to this problem: we separate the frontend from the backend entirely and place a load balancer between them. Since the backend is only a compute engine without any state, any frontend can call any backend as and when required, as shown in the configuration sketch below.
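A sketch of what this looks like from the frontend side: every frontend instance's ‘endpoints.yml’ points its action endpoint at a load balancer fronting the pool of stateless action servers. The hostname is a placeholder.

```yaml
# endpoints.yml on every frontend instance (hostname is illustrative):
# the action endpoint points to a load balancer fronting a pool of
# stateless backend (action server) instances instead of a local one.
action_endpoint:
  url: "http://actions-lb.internal:5055/webhook"
```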
Figure 2 shows the main architecture of sYrCaD. It is a simple yet robust conversational agent deployment. It is simple because it works around the complexity of understanding a Kubernetes system and only uses very fundamental aspects of system design on top of the Rasa architecture; we do not need to maintain complicated machinery or care about binding one user to a fixed instance. It is robust because we have greatly reduced the points of failure: given that our load balancers are robust enough, we have a stable large-scale system in place. When we discuss load testing, we will see ways to accurately test the load capacity of the system and gather more data points on its robustness. Both the frontend and the backend in sYrCaD run in Docker containers, which helps us avoid the trouble of setting up environments on individual machines. The complete code for the frontend, the backend, the NLU model, and the Docker-related files is kept on a central server which is mounted on all the machines that run instance(s) of the Rasa bot. This ensures that all Rasa bot instances have the same NLU knowledge and configuration at all times, since they spin up from the same source code, as sketched below.
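A minimal Compose-style sketch of one such machine, assuming the shared code is mounted at /mnt/central/bot; the image tags, ports and paths are assumptions rather than a prescribed layout.

```yaml
# docker-compose.yml - illustrative sketch of one host running a frontend and a backend container
version: "3"
services:
  rasa-frontend:
    image: rasa/rasa:3.1.0           # Rasa Open Source (the conversational "frontend")
    ports:
      - "5005:5005"
    volumes:
      - /mnt/central/bot:/app        # code + trained model mounted from the central server
    command: run --enable-api --endpoints endpoints.yml

  rasa-backend:
    image: rasa/rasa-sdk:3.1.0       # Rasa SDK action server (the compute "backend")
    ports:
      - "5055:5055"
    volumes:
      - /mnt/central/bot/actions:/app/actions
```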
For the load balancer, there are a number of options available in the industry: nginx, Kemp and HAProxy, to name a few. During our experiments we used HAProxy, which is fully open source and has good documentation and active community support. Rasa provides a number of options for the Tracker Store and the Lock Store; we used MongoDB for the Tracker Store and Redis for the Lock Store. Each option has different pros and cons depending on the business use case. For example, if we want to run analytic queries on top of user conversations, we might prefer a relational database such as Postgres over MongoDB because of the powerful SQL query language.
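For illustration, the shared ‘endpoints.yml’ excerpt used by every frontend instance might look like the sketch below; the hostnames, database name and credentials are placeholders.

```yaml
# endpoints.yml excerpt shared by all frontend instances (hosts and credentials are placeholders)
tracker_store:
  type: mongod                      # MongoDB-backed tracker store
  url: mongodb://tracker-db.internal:27017
  db: rasa
  username: rasa_user
  password: rasa_password

lock_store:
  type: redis                       # Redis-backed lock store
  url: lock-store.internal
  port: 6379
  db: 0
```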
Rasa provides some other options to enhance such a system depending on the business use case. We can stream user conversations, in the form of events, to a separate database such as Postgres or to message brokers such as Kafka or RabbitMQ. This can be integrated into our system by setting the appropriate configuration in ‘endpoints.yml’. We can then apply fan-out, worker-queue and other such paradigms at the exchange, as the business requires.
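A hedged sketch of such a configuration, using the pika event broker to publish conversation events to RabbitMQ; the host, credentials and queue name are placeholders.

```yaml
# endpoints.yml excerpt - stream conversation events to RabbitMQ (placeholders throughout)
event_broker:
  type: pika
  url: rabbitmq.internal
  username: rasa_events_user
  password: rasa_events_password
  queues:
    - rasa_production_events
```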
Load testing a static API is quite different from load testing a contextual system. To see why, suppose the agent spawns a thread for each new user: it is plausible that 100 different users send 100 requests per second to the bot, so 100 different threads handle those 100 requests respectively. But it is unlikely that the same user blasts 100 requests per second, forcing a single thread to handle all 100. These are two completely different scenarios, so a realistic load test must simulate many distinct users rather than one very chatty user.
There are many platforms for large-scale load testing, Locust and Apache JMeter to name a few. We used Locust for testing sYrCaD. Python provides UUIDs (universally unique identifiers) which can be used to generate a unique SenderId and thus simulate different users against the sYrCaD system. Locust provides a nice web UI to plot system behavior on different charts, watch system performance under increasing load, change the number of users and the user growth pattern, and much more.
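A minimal Locust sketch along these lines, assuming the Rasa REST channel is exposed behind the load balancer; the message and wait times are illustrative.

```python
# locustfile.py - each simulated user gets a unique SenderId via uuid4, so the bot
# treats every Locust user as a distinct conversation, mimicking real-world load.
import uuid

from locust import HttpUser, task, between


class RasaChatUser(HttpUser):
    wait_time = between(1, 3)  # think time between messages of one simulated user

    def on_start(self):
        # unique SenderId per simulated user
        self.sender_id = str(uuid.uuid4())

    @task
    def send_message(self):
        # Rasa REST channel webhook exposed behind the load balancer
        self.client.post(
            "/webhooks/rest/webhook",
            json={"sender": self.sender_id, "message": "hello"},
        )
```

It can then be run with something like `locust -f locustfile.py --host http://<load-balancer-address>` and driven from the Locust web UI.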
This paper serves as a good dive into deployment strategies for conversational agents. Even though we discussed large-scale deployment specific to the Rasa stack, the concepts and fundamentals covered are generic and will come in handy in any large-scale deployment of conversational agents.
Tanja Bunk, Daksh Varshneya, Vladimir Vlasov, and Alan Nichol. 2020. DIET: Lightweight Language Understanding for Dialogue Systems. arXiv:2004.09936, version 3.