Human Feedback AI and Continual Learning

Sep 18, 2024 | AI-Ops, Research

AI is booming, and you can try out state-of-the-art AI models at the click of a mouse button (for instance ChatGPT). These applications are wonderful and a technological marvel! However, this technology is only possible thanks to the huge amounts of data available on the internet: we have language models because the internet is full of written text, and we have models that generate amazing images because the internet is full of photos with descriptions. But text and images are not the only possible applications for AI. There are many other domains where AI models could alleviate real-world problems. These domains can be quite niche, and data might not be available at the scale needed to train AI models.

Maarten Vaessen & Eddy van den Aker

At the Data Intelligence Research Centre we recognise this problem and have recently started researching possible solutions to this challenge. The main problem is that modern AI systems (deep neural networks, DNNs) require large amounts of data to become accurate enough for real-world scenarios. One way to approach this is to spend a lot of resources on gathering large amounts of training data before developing and deploying the model. However, there is generally no guarantee that the AI application will then function satisfactorily in a real-world environment. Hence, it is often necessary to repeat the data-gathering process many times to keep the AI application up to date.

The main conclusion from our preliminary research is that it is not efficient to do this process “off-line”. “Off-line” here means that you, or your data or machine learning engineer, go out and gather more data from already available resources. Instead, we find that it is more efficient to gather the data directly, and continuously, from the people who will use the application. That is: to incorporate the use case and the users in the AI system from the start.

Figure 1. Traditional development process for Machine Learning (ML) Applications, versus the approach used in continual learning. Adapted from https://miatbiolab.csr.unibo.it/continual-learning/

This might seem a bit abstract, so let’s look at an example! Say you want to develop an AI application that somehow responds to the emotion or mood of the user. It would make sense to develop an AI model that is very good at recognizing emotions from faces. In fact, this seems a plausible undertaking: the internet is full of face images, and there are even a number of large curated datasets on facial emotion expression. But can you guarantee that the AI model will be accurate enough for all potential users of the application? What if someone with a preference for designer glasses wants to use the application: are you sure the training data covers all edge cases of people wearing designer glasses? Of course, for humans it does not matter what kind of glasses someone wears, we can judge the facial emotion in any case (Fig. 2). Unfortunately, the same does not hold for DNNs: these models can be very good at predicting new data, but only if the new data are similar enough to the training data (see, for example, the much-debated opinion article by scientist Gary Marcus: Deep Learning: A Critical Appraisal [1]).

Figure 2. Designer glasses, easy for humans, hard for AI

We think it would probably be a better strategy to gather feedback from the users of the AI application and update the AI model with this feedback. In the case of emotion recognition, this means that the user “teaches” the AI model what their personal emotion expressions look like.

For this to work, we need a system that is not only capable of training an AI model and deploying it to a user interface, but also of receiving, gathering and processing user feedback in an automated manner. Even though there are many open-source tools that greatly simplify the development of machine learning and AI models (e.g. PyTorch, Tensorflow and SKLearn), when we started this research project there were no tools that provided all the functionality our idea needed. So, of course, we started developing our own tool! We wanted the system to be context-free and easily adaptable to different use cases, but to be able to test and validate it we implemented the emotion recognition use case mentioned above (with the sole purpose of demonstrating that the system works).

Emotion recognition demonstrator

As mentioned above, to develop and test the core system for human feedback AI/continual learning we implemented a use case centred around facial emotion recognition. The system is able to record user feedback data: additional data on emotion expression from a specific person, see Figure 3. This data can then be used to calibrate the basic AI model for emotion expression (this basic model was trained on FER2013, a publicly available dataset containing faces of many people). Calibration is achieved by calling the automated pipelines implemented in the system, which handle the data trafficking and the model training and deployment procedures. The aim of the use case is that the personalized AI model is more accurate at recognizing the emotions of that specific user.
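To make the calibration step a bit more concrete, here is a minimal sketch of how a base emotion classifier could be fine-tuned on a single user’s recordings. This is illustrative only: the function name, the frozen/unfrozen split and the `classifier` head are assumptions, not the actual CHIMP pipeline code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


def calibrate(base_model: nn.Module, user_dataset, epochs: int = 5, lr: float = 1e-4) -> nn.Module:
    """Fine-tune a pre-trained emotion classifier on data recorded for one user.

    `user_dataset` is assumed to yield (image_tensor, emotion_label) pairs
    collected during a calibration session.
    """
    loader = DataLoader(user_dataset, batch_size=16, shuffle=True)

    # Keep the convolutional feature extractor frozen and only adapt the
    # classification head -- a common choice when the user data set is small.
    # (The `classifier` attribute is an assumption about the model structure.)
    for param in base_model.parameters():
        param.requires_grad = False
    for param in base_model.classifier.parameters():
        param.requires_grad = True

    optimiser = torch.optim.Adam(
        (p for p in base_model.parameters() if p.requires_grad), lr=lr
    )
    criterion = nn.CrossEntropyLoss()

    base_model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimiser.zero_grad()
            loss = criterion(base_model(images), labels)
            loss.backward()
            optimiser.step()
    return base_model
```

In CHIMP itself, a step like this would be wrapped in a training-service plugin and triggered automatically when a user saves a calibration recording.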

Figure 3. The emotion calibration front end: in a web interface the user can record emotion expressions from a webcam feed. The user is provided with instructions (top right) on how the calibration session takes place. The order and frequency of emotions and the duration of the recording are flexible and read from a configuration file. Before each emotion recording, the user sees a countdown and the emotion to be recorded overlaid on the webcam feed. After the countdown, the user expresses the requested emotion. After the recording session, the user can choose to save the recording. This action triggers the system to execute an automated pipeline in which the new data is recorded in the system database and a calibrated model is trained specifically for that user

Next to testing the basic functionality of the software system (data and error handling, message passing, etc.), we also performed some preliminary tests of the AI model performance. Many open questions remain on how to actually achieve an accurate enough model: how to build and train the basic (non-personalized) model, how much user feedback data is needed, and can too much user feedback also degrade the system? In our test we simply collected user feedback (emotion expressions from 5 people recorded by the system) and evaluated the accuracy of emotion recognition in a number of different scenarios, see Figure 4 for details.

Figure 4. Preliminary results from the emotion recognition model calibration sessions. The first two bars (affect->affect, fer->fer) show the accuracy of a class of convolutional neural networks (CNNs) trained for emotion classification on two publicly available datasets (Affect and FER2013): the accuracy is around 50% (with a random chance of 14% for 7 emotion classes), not very good, but much better than random predictions. The next two bars (affect->fer, fer->affect) show the performance of models trained on one of the two datasets and evaluated on the other: the performance plummets. This shows that these models have not truly learned to generalize to unseen faces (the two datasets contain different sets of faces). Next we see the performance of the personalized models (calib for short, bars 5 and 6). The calibrated models show lower performance when evaluated on the two datasets: when a model is trained on FER2013 and then calibrated on a specific user, its performance on the original FER2013 dataset is lower, but not as low as when it is evaluated on an entirely different dataset. In the next bar (calib -> same pers, same sess) we evaluated the performance of models calibrated on the emotions of a specific person and tested on data from that same person from the same recording session. Indeed, as expected, we see the highest accuracy here (> 70% on average). However, when these calibrated models are tested on data from other persons (calib -> other person) performance is quite low again. More importantly, when a calibrated model is tested on data from the same person but from a different recording session (for instance a different background and lighting conditions), the accuracy also drops

The first conclusion is that simply calibrating a base model on a small and not very diverse amount of user data is not sufficient to get good performance in general. But the important result is that we now have a system that enables us to answer questions about, for instance, how much data is actually required, or what a good method is to retrain the models. In the next phase of the project we will try to answer these kinds of questions, not only for emotion recognition, but also for other use cases. We intend to implement and test use cases such as human pose estimation, plastic waste recycling and drone security.

The CHIMP system architecture

So what does the system that enables these features actually look like? The system consists of a number of network-connected components, each with a specific functional responsibility. The biggest division in functional responsibility is between the front-end application and the back-end.

In our architectural design, the front-end application is the part that the user interacts with and where the specific requirements of the use case are implemented. For the emotion recognition use case, for instance, this consists of a web interface with a webcam feed, where the model’s emotion predictions are displayed and the feedback functionality is implemented. For other use cases, this can be something different.

The system back-end implements the features of the system that are more or less use-case agnostic: all applications need data storage and versioning, model training, and deployment. The components are the training service (including one or more context-specific plugins), the data store, the tracking service, and the serving service. The entire CHIMP system (except for off-the-shelf components such as MLFlow and MinIO) is built using Python and Flask. Fig. 5 gives an overview of how the components connect; below it, each service is described.

Figure 5: CHIMP component diagram. The training service, combined with the context-specific plugins, is responsible for managing data sets, training new models or uploading initial models (trained on a separate system), fine-tuning models, and storing models in the tracking service. The plugin system ensures that CHIMP can support many different use cases. The data store is used for storing data sets; in the current implementation of CHIMP this is a MinIO object store, which stores data sets in a file-system-like structure. This ensures that any folder structure, such as the folder structures commonly used to separate different classes of images, is preserved. The serving service loads the correct model from the tracking service, either for a general context or application or based on a session specifier (for example, a user ID for use cases that have user-specific models). Finally, the tracking service, which is MLFlow in CHIMP, tracks different models and versions of models, their hyperparameters, and optionally a number of metrics

The training service

The training service provides a way to run training tasks triggered by API calls. These tasks can be the training of an initial model or the fine-tuning of an existing model (for example, for a specific user). The training tasks run separately from the API, using Celery combined with a RabbitMQ-based task queue. Background workers (event workers) monitor this task queue and start training tasks when one is added to the queue. Because the API returns a response as soon as a task is queued instead of waiting for the task to finish, long-running training tasks do not block the control flow of the application or the API. The application can poll the API for the status of the task. This flow is shown in Figure 6 below (the order of operations is indicated by the numbers).

Figure 6: call stack for fine-tuning a model
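In outline, this queue-and-poll pattern looks roughly like the sketch below. The endpoint paths, task body and broker URL are assumptions for illustration; only the general Flask/Celery/RabbitMQ structure reflects the description above.

```python
from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
# RabbitMQ acts as the message broker for the task queue (URL is an assumption).
celery = Celery(__name__, broker="amqp://guest@rabbitmq//", backend="rpc://")


@celery.task
def run_training(plugin_name: str, arguments: dict) -> str:
    # A background worker picks this up from the queue; the long-running
    # training happens here, outside the API's request/response cycle.
    ...
    return "model stored in tracking service"


@app.post("/tasks")
def queue_training_task():
    payload = request.get_json()
    # Queue the task and answer immediately with its id instead of waiting.
    task = run_training.delay(payload["plugin"], payload.get("arguments", {}))
    return jsonify({"task_id": task.id}), 202


@app.get("/tasks/<task_id>")
def poll_task(task_id: str):
    # The front end polls this endpoint to follow the task's progress.
    result = celery.AsyncResult(task_id)
    return jsonify({"status": result.status})
```

Because the POST endpoint only enqueues the task, the API’s response time stays constant regardless of how long training actually takes.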

Separating the API and the execution of the training tasks also means that multiple workers can be used to run multiple tasks in parallel, as shown in Figure 7 below.

Figure 7: call stack for training multiple models in parallel

The plugin system

To enable CHIMP to support many different use cases in a maintainable and modular manner, we designed and implemented a plugin system. Plugins define and implement the requirements of a specific use case, and are therefore conceptually coupled to the front-end application. For this system to function, all plugins need a single consistent interface, so that they can be called upon by general routines defined in the API. This abstract interface is defined in the BasePlugin class, which must be used by all plugin implementations. The plugin system automatically reads and loads plugins from a local folder when the Training Service runs. In addition to the plugin implementation itself, each plugin also has a PluginInfo object. This object contains some general information about the plugin and describes the use case it is coupled to. This information consists of:

  • Name: Name of the plugin (this is used to select the correct task to run in the API)
  • Version: The version of the plugin for version management
  • Description: A description of what the plugin does
  • Arguments: A list of arguments and the properties of these arguments (name, type, description, and whether the argument is required)
  • Datasets: The expected datasets and the properties of these datasets. Note that this does not specify a specific dataset, but a type of dataset (for example, a dataset of images categorized by object)
  • Model return type: The type of object returned by the plugin (e.g. ONNX, Tensorflow, PyTorch, etc.).

If the argument and datasets fields of the PluginInfo object are set, the API will check whether all the required arguments and datasets are included in the request. The arguments and datasets are then made available to the plugin, as well as an instance of the datastore class (for accessing datasets) and an instance of the connector class (for loading base models and for uploading new or fine-tuned models). Several example plugin implementations can be found in the plugins folder of the training service; see Figure 8 for the components of the training service.

Figure 8: A full overview of the components of the training service
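As a rough illustration of such an interface (field and method names are assumptions and may differ from the actual CHIMP code), a BasePlugin/PluginInfo pair could look like this:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class PluginInfo:
    """Descriptive metadata the training service uses to validate requests."""
    name: str
    version: str
    description: str
    arguments: dict = field(default_factory=dict)  # name -> {type, description, required}
    datasets: dict = field(default_factory=dict)   # name -> expected dataset *type*
    model_return_type: str = "ONNX"


class BasePlugin(ABC):
    """Single, consistent interface that every use-case plugin must implement."""

    @property
    @abstractmethod
    def info(self) -> PluginInfo:
        """Metadata describing the plugin and its expected inputs."""

    @abstractmethod
    def run(self, datastore, connector, arguments: dict, datasets: dict):
        """Train or fine-tune a model and upload it via the connector."""
```

The point of the abstract base class is that the API can validate a request against PluginInfo and then call `run` without knowing anything about the specific use case.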

Serving service

The serving service is responsible for providing an API for inference. It supports serving different models at the same time. When a new model is requested, the serving API checks whether the model is available in the tracking service. If it is, the model is loaded and cached by the serving service. Cached models are updated at regular intervals. Internally, the InferenceManager is responsible for this process, which is shown in the sequence diagram in Figure 9.

Figure 9: sequence diagram for an inference call
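Conceptually, the caching behaviour described above amounts to something like the following sketch (class and method names such as `load_latest` are assumptions; the real InferenceManager differs in detail):

```python
import time
from typing import Optional


class InferenceManager:
    """Loads models from the tracking service on demand and caches them."""

    def __init__(self, tracking_client, refresh_interval: float = 300.0):
        self.tracking_client = tracking_client    # e.g. a wrapper around the MLFlow client
        self.refresh_interval = refresh_interval  # seconds before a cached model is refreshed
        self._cache = {}                          # key -> (model, timestamp)

    def get_model(self, model_name: str, session_id: Optional[str] = None):
        key = f"{model_name}:{session_id}" if session_id else model_name
        cached = self._cache.get(key)
        if cached is not None and time.time() - cached[1] < self.refresh_interval:
            return cached[0]
        # Not cached (or stale): ask the tracking service for the latest version.
        model = self.tracking_client.load_latest(model_name, session_id)
        self._cache[key] = (model, time.time())
        return model
```

The session identifier makes it possible to serve a user-specific (calibrated) model next to the general model, as described for the CHIMP architecture above.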

The serving service can support many different model types. Currently, only ONNX support has been implemented, but by creating more implementations of the BaseModel class, this can be extended to other types of models (e.g. PyTorch or Tensorflow).
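For example, an ONNX-backed implementation of such a BaseModel interface could look roughly like this sketch using onnxruntime (illustrative only; the actual class differs in detail):

```python
from abc import ABC, abstractmethod

import numpy as np
import onnxruntime as ort


class BaseModel(ABC):
    """Common interface the serving service expects from every model type."""

    @abstractmethod
    def predict(self, inputs: np.ndarray) -> np.ndarray:
        ...


class ONNXModel(BaseModel):
    def __init__(self, model_path: str):
        # onnxruntime runs the exported model independently of the framework
        # it was originally trained in (PyTorch, Tensorflow, ...).
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, inputs: np.ndarray) -> np.ndarray:
        outputs = self.session.run(None, {self.input_name: inputs})
        return outputs[0]
```

Adding support for another model type would then mean adding another subclass with the same `predict` signature, without touching the rest of the serving service.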

A full overview of the components of the serving service is shown in Figure 10 below.

Figure 10: component diagram of the Serving Service

References

[1] Marcus, Gary. “Deep Learning: A Critical Appraisal.” arXiv preprint arXiv:1801.00631 (2018).