ZieDit Wednesday 2 February 2022
AI is booming, and you can try out state-of-the-art AI models at the click of a mouse button (for instance ChatGPT). These applications are a technological marvel! However, this technology is only possible thanks to the huge amounts of data available on the internet (we have language models because the internet is full of written text, and we have models that generate amazing images because the internet is full of photos with descriptions!). But text and images are not the only possible applications for AI. Many other domains exist where AI models could help alleviate real-world problems. These domains can be quite niche, and data might not be available at the scale needed to train AI models.
At the Data Intelligence Research Centre we recognise this problem and have recently started researching possible solutions. The main problem is that modern AI systems (deep neural networks, DNNs) require large amounts of data to become accurate enough to work in real-world scenarios. One way to approach this is to spend a lot of resources on gathering large amounts of training data before developing and deploying the model. However, there is generally no guarantee that your AI application will function satisfactorily in a real-world environment. Hence, it is often necessary to repeat the data gathering process many times to keep the AI application up to date.
The main conclusion from our preliminary research is that it is not efficient to do this process “off-line”. Here, “off-line” means that you, or your data or machine learning engineer, go out to gather more data from already available resources. Instead, we find that it is more efficient to gather the data continuously and directly from the people who will use the application. That is: to incorporate the use case and the users in the AI system from the start.
This might seem a bit abstract, so let’s go to an example! Say you want to develop an AI application that somehow responds to the emotion or mood of the user (there are many real ideas here: https://research.aimultiple.com/emotional-ai-examples/; for us, however, it is more of a toy example). For this application it would make sense to develop an AI model that is very good at recognizing emotions from faces. In fact, this might seem a plausible undertaking: the internet is full of face images and there are even a number of large curated datasets on facial emotion expression. But can you guarantee that the AI model will be accurate enough for all the potential users of the application? What if someone with a preference for designer glasses wants to use the application? Are you sure that the training data covers all edge cases of people wearing designer glasses? Of course, for humans it does not matter what kind of glasses someone wears: we can judge the facial emotion in any case (Fig. 2). Unfortunately, this does not hold for DNNs: these models can potentially be very good at making predictions on new data, but only if the new data are similar enough to the training data (see, e.g., the much-debated opinion article by scientist Gary Marcus: Deep Learning: A Critical Appraisal [1]).
We think it would probably be a better strategy to gather feedback from the users of the AI application and update the AI model with this feedback. In the case of emotion recognition, this would entail that the users “teach” the AI model what their personal emotions look like.
For this to work, we need a system that is not only capable of training an AI model and deploying it to a user interface, but also of receiving, gathering and processing user feedback in an automated manner. Even though there are many open source tools that greatly simplify the development of machine learning and AI models (e.g. PyTorch, TensorFlow and scikit-learn), when we started this research project there were no tools that provided all the functionality needed for our idea. So, of course, we started developing our own tool! We wanted the system to be context-free and easily adaptable to different use cases, but to be able to test and validate the system we implemented the emotion recognition use-case mentioned above (with the sole purpose of demonstrating the functioning of the system).
Emotion Recognition Demonstrator
As mentioned above, to develop and test the core system for human feedback AI / continual learning we implemented a use-case centred around facial emotion recognition. The system is able to record user feedback data: additional data on emotion expression from a specific person, see Fig. 3. This data can then be used to calibrate the basic AI model for emotion expression (this basic model was trained on FER2013, a publicly available dataset containing faces of many people). This is achieved by calling on the automated pipelines implemented in the system, which handle the data transfer and the model training and deployment procedures. The aim of the use-case is for the personalized AI model to be more accurate in recognizing the emotions of the specific user.
In addition to testing the basic functionality of the software system (data and error handling, message passing, etc.), we also performed some preliminary tests on the AI model performance. Many open questions remain on how to actually obtain an accurate enough model: how should the basic (non-personalized) model be built and trained, how much user feedback data is needed, and can too much user feedback also degrade the system? In our test we simply collected user feedback (emotion expressions from 5 people recorded by the system) and evaluated the accuracy of emotion recognition in a number of different scenarios, see Fig. 4 for details.
The first conclusion is that simply calibrating a base model on a small and not very diverse set of user data is not sufficient to get good performance in general. But the important result is that we now have a system that enables us to answer questions about, for instance, how much data is actually required, or what a good method for retraining the models is. In the next phase of the project we will try to answer these kinds of questions, not only for emotion recognition, but also for other use-cases. We intend to implement and test use-cases such as human pose estimation, plastic waste recycling and drone security.
The CHIMP system architecture
So what does the system that enables these features actually look like? The system consists of a number of network-connected components that have a certain functional responsibility. The biggest division in functional responsibility is between the front-end application and the back-end.
In our architectural design, the front-end application is the part that the user interacts with and where specific requirements for the use-case are implemented. For instance, for the emotion recognition use-case this consists of a web interface with a webcam feed, where the model's emotion predictions are displayed and the feedback functionality is implemented. For other use-cases, this can be something different.
The system back-end is designed to implement the features of the system that are more or less use-case agnostic: all applications need data storage and versioning, model training and deployment. The components are the training service (including one or more context-specific plugins), the data store, the tracking service, and the serving service. The entire CHIMP system (except for off-the-shelf components such as MLflow and MinIO) is built using Python and Flask. Fig. 5 gives an overview of how the components connect. A description of each service follows.
The Training Service
The training service provides a way to run training tasks triggered by API calls. These tasks can be the training of an initial model or the fine-tuning of an existing model (for example, for a specific user). The training tasks run separately from the API. This is done using Celery, combined with a RabbitMQ-based task queue. Background workers (event workers) monitor this task queue and start a training task when one is added to the queue. Because the API returns a response as soon as a task is queued instead of waiting for the task to finish, training tasks, which can take a long time to run, do not block the control flow of the application or the API. The application can poll the API for the status of the task. This flow is shown in Figure 6 below (note that the order of operation is shown by the numbers).
Separating the API and the execution of the training tasks also means that multiple workers can be used to run multiple tasks in parallel, as shown in Figure 7 below.
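The queue-and-poll pattern described above can be illustrated with a minimal sketch. Note that this is not CHIMP's implementation: to keep the example self-contained it uses Python's standard-library queue and threads as stand-ins for RabbitMQ and the Celery workers, and all function names, status strings and the "emotion_recognition" plugin name are assumptions made for illustration.

```python
import queue
import threading
import time
import uuid

# In-memory stand-ins (assumptions): the queue plays the role of the
# RabbitMQ broker, the dict plays the role of a task-status store.
task_queue = queue.Queue()
task_status = {}
status_lock = threading.Lock()


def submit_training_task(plugin_name):
    """Queue a task and return its id immediately, without waiting for it."""
    task_id = str(uuid.uuid4())
    with status_lock:
        task_status[task_id] = "QUEUED"
    task_queue.put((task_id, plugin_name))
    return task_id


def get_status(task_id):
    """What a status endpoint could return when the application polls."""
    with status_lock:
        return task_status.get(task_id, "UNKNOWN")


def worker():
    """Background worker: take tasks off the queue and run them one by one."""
    while True:
        item = task_queue.get()
        if item is None:  # sentinel value: shut this worker down
            task_queue.task_done()
            break
        task_id, _plugin_name = item
        with status_lock:
            task_status[task_id] = "RUNNING"
        time.sleep(0.05)  # placeholder for the actual long-running training
        with status_lock:
            task_status[task_id] = "FINISHED"
        task_queue.task_done()


def run_demo(n_tasks=4, n_workers=2):
    """Start several workers, queue several tasks, and wait for completion."""
    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    ids = [submit_training_task("emotion_recognition") for _ in range(n_tasks)]
    task_queue.join()  # block until every queued task has been processed
    for _ in workers:
        task_queue.put(None)  # one shutdown sentinel per worker
    for w in workers:
        w.join()
    return ids, [get_status(task_id) for task_id in ids]


if __name__ == "__main__":
    ids, final = run_demo()
    print(final)
```

Submitting a task only places it on the queue, so the caller gets its task id back straight away and can poll get_status later. Adding workers increases how many tasks run at the same time without changing the submitting side.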
The Plugin System
To enable CHIMP to support many different use-cases in a maintainable and modular manner, we designed and implemented a plugin system. Plugins define and implement the requirements of a specific use case, and are therefore conceptually coupled to the front-end application. For this system to function we need to ensure that all plugins have a single consistent interface, so that they can be called by general routines defined in the API. This abstract interface is defined in the BasePlugin class, which must be used by all plugin implementations. The plugin system automatically reads and loads plugins from a local folder on the machine where the training service runs. In addition to the plugin implementation itself, each plugin also has a PluginInfo object. This object contains some general information about the plugin and describes the use-case that is coupled to it. This information consists of:
- Name: Name of the plugin (this is used to select the correct task to run in the API)
- Version: The version of the plugin for version management
- Description: A description of what the plugin does
- Arguments: A list of arguments and the properties of these arguments (name, type, description, and whether the argument is required)
- Datasets: The expected datasets and the properties of these datasets. Note that this does not specify a specific dataset, but a type of dataset (for example, a dataset of images categorized by object)
- Model return type: The type of object returned by the plugin (e.g. ONNX, TensorFlow, PyTorch).
If the arguments and datasets fields of the PluginInfo object are set, the API will check whether all the required arguments and datasets are included in the request. The arguments and datasets will be made available to the plugin, as well as an instance of the datastore class (for accessing datasets) and an instance of the connector class (for loading base models and for uploading new/fine-tuned models).
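As a rough illustration of the plugin interface and the request check described above, consider the following sketch. Only the names BasePlugin and PluginInfo come from the text; the field layout, the Argument class, the run method, the missing_fields helper and the example plugin are all hypothetical and may differ from the actual CHIMP code. The datastore and connector instances are left out for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Argument:
    """Hypothetical description of one plugin argument."""
    name: str
    type: str
    description: str = ""
    required: bool = False


@dataclass
class PluginInfo:
    """Plugin metadata; the fields mirror the list above (layout assumed)."""
    name: str
    version: str
    description: str
    arguments: list = field(default_factory=list)
    datasets: list = field(default_factory=list)  # names of expected dataset types
    model_return_type: str = "ONNX"


class BasePlugin(ABC):
    """Single consistent interface that every plugin must implement."""
    info: PluginInfo

    @abstractmethod
    def run(self, arguments, datasets):
        """Train or fine-tune a model and return it (signature assumed)."""


def missing_fields(info, request_args, request_datasets):
    """Return the required arguments and datasets absent from a request."""
    missing = [a.name for a in info.arguments
               if a.required and a.name not in request_args]
    missing += [d for d in info.datasets if d not in request_datasets]
    return missing


class EmotionPlugin(BasePlugin):
    """Toy plugin for illustration; it does not train anything."""
    info = PluginInfo(
        name="emotion_recognition",
        version="0.1",
        description="Fine-tunes an emotion recognition model",
        arguments=[
            Argument("epochs", "int", "Number of training epochs", required=True),
            Argument("learning_rate", "float", "Optimizer step size"),
        ],
        datasets=["train"],
    )

    def run(self, arguments, datasets):
        # Placeholder result standing in for a trained model object.
        return {"model_type": self.info.model_return_type,
                "epochs": arguments["epochs"]}


if __name__ == "__main__":
    plugin = EmotionPlugin()
    print(missing_fields(plugin.info, {}, {}))
```

Keeping the metadata in a separate object means the API can validate a request and select the right plugin by name before any plugin code runs.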
Several example plugin implementations can be found in the plugins folder of the training service, see Figure 8 for the components of the training service.
Serving Service
The serving service is responsible for providing an API for inference. It supports serving different models at the same time. When a model is requested for the first time, the serving API checks whether that model is available in the tracking service. If the model is available, it is loaded and cached by the serving service. Cached models are updated at regular intervals. Internally, the InferenceManager is responsible for this process, which is shown in the sequence diagram in Figure 9.
The serving service is designed to support many different model types. Currently, only ONNX model support has been implemented, but this can be extended to other types of models (e.g. PyTorch, TensorFlow) by creating more implementations of the BaseModel class.
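The load-and-cache behaviour and the role of the BaseModel abstraction can be sketched as follows. This is a simplified illustration under several assumptions, not the actual serving code: the method names, the fetch_model callable (standing in for a lookup in the tracking service) and the EchoModel class are hypothetical. The sketch also refreshes a cached model lazily, on the first request after the interval has elapsed, which may not be how CHIMP schedules its regular updates.

```python
import time
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Common interface for served models (method name assumed)."""

    @abstractmethod
    def predict(self, inputs):
        """Run inference on the given inputs."""


class EchoModel(BaseModel):
    """Toy stand-in for a real (e.g. ONNX-backed) model implementation."""

    def __init__(self, version):
        self.version = version

    def predict(self, inputs):
        return {"version": self.version, "inputs": inputs}


class InferenceManager:
    """Loads models on first request and caches them for a fixed interval."""

    def __init__(self, fetch_model, refresh_seconds=60.0, clock=time.monotonic):
        # fetch_model: callable mapping a model name to a BaseModel, or to
        # None when the model is unavailable (hypothetical stand-in for the
        # tracking service lookup).
        self._fetch_model = fetch_model
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._cache = {}  # model name -> (model, time it was loaded)

    def get_model(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[1] < self._refresh_seconds:
            return cached[0]  # cache hit that is still fresh
        model = self._fetch_model(name)
        if model is None:
            raise KeyError(f"model {name!r} is not available")
        self._cache[name] = (model, now)
        return model

    def infer(self, name, inputs):
        return self.get_model(name).predict(inputs)


if __name__ == "__main__":
    manager = InferenceManager(lambda name: EchoModel(version=1))
    print(manager.infer("emotion", [0.1, 0.2]))
```

Because callers only depend on predict, supporting another model type would amount to adding another BaseModel subclass, while the caching logic stays unchanged.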
A full overview of the components of the serving service is shown in Figure 10 below.
References
[1] Marcus, Gary. “Deep learning: A critical appraisal.” arXiv preprint arXiv:1801.00631 (2018).