Introduction to micro-models or: how I learned to stop worrying and love overfitting
The purpose of this post is to introduce the “micro-model” methodology we use at Cord to automate data annotation. We have deployed this approach on computer vision labelling tasks across a wide range of domains including medical imaging, agriculture, autonomous vehicles, and satellite imaging.
TL;DR: What: Low-bias models applied to a small domain of a data distribution. How: Overfitting deep learning models on a handful of examples of a narrowly defined task. Why: Saving hundreds of hours of hand labelling.
How much data do you need to build a model that detects Batman?
This of course depends on your goal. Maybe you want a general purpose model that can detect the Batmen of Adam West, Michael Keaton, and Batfleck all in one. Maybe you need it to include a Bruce Wayne detector that can also identify the man behind the mask.
But if you want a model that follows the Christian Bale Batman in one movie, in one scene, the answer is…five labelled images. The model used to produce the snippet of model inference results above was trained with the five labels below:
Now, does this answer the initial question? This model is only a partial Batman model. It doesn’t perform that well on the Val Kilmer or George Clooney Batmen, but it is still functional for a specific use case. We thus won’t call it a Batman model, but instead a Batman micro-model.
We started using micro-models in the early days of Cord when our focus was purely on video data. We stumbled upon the idea when trying out different modelling frameworks to automate the classification of gastroenterology videos (you can find more about that here). Our initial strategy was to try a “classical” data science approach of sampling frames from a wide distribution of videos, training a model, and then testing on out-of-sample images of a different set of videos. We would then use these models and measure our annotation efficiency improvement compared to human labelling. We realized in experiments, however, that a classification model trained on a small set of intelligently selected frames from only one video already produced strong results. We also noticed that as we turned the number of epochs up, our annotation efficiencies got higher.
This was contrary to what we knew about good data science; we were grossly overfitting a model to this video. But it worked, especially if we broke it up such that each video had its own model. We called them micro-models. While this was for video frame classification, we have since extended the practice to include tasks such as object detection, segmentation, and pose estimation.
What are micro-models exactly?
Most succinctly, micro-models are annotation-specific models that are overtrained on a particular task or a particular piece of data. They are purposefully overfit, so they don’t perform well on general problems, but they are very effective at automating one aspect of your data annotation. They are thus designed to be good at only one thing. To use them in practice, we ensemble many together to automate a comprehensive annotation process.
The distinction between a “traditional” model and a micro-model is not in their architecture or parameters, but in their domain of application, the counter-intuitive data science practices that are used to produce them, and their eventual end uses.
To cover how micro-models work, we will take a highly simplified toy model that gives clearer insight into their underpinnings. Machine learning at its core is curve-fitting, just in a very high dimensional space with many parameters. It is thus instructive to distill the essence of building a model to one of the simplest possible cases, one-dimensional labelling. The following is slightly more technical, so feel free to skip ahead.
Imagine you are an AI company with the following problem:
- You need to build a model that fits the hypothetical curve below
- You don’t have the x-y coordinates of the curve and can’t actually see the curve as a whole. You can only manually sample values of x, and for each one you have to look up the corresponding y value (the “label” for x).
- Due to this “labelling” operation, each sample comes with a cost
You want to fit the whole curve with one model, but it is too expensive to densely sample points to do so. What strategies can you use here?
One strategy is fitting a high degree polynomial to some initial set of sampled points across the domain of the curve, re-sampling at random, evaluating the error, and updating the polynomial as necessary. The issue is that you will have to re-fit the whole curve each time new sample points are checked. Every point will affect every other. Your model will also have to be quite complex to handle all the different variations in the curve.
Another strategy, which resolves these problems, is to sample in a local region, fit a model that approximates just that region, and then stitch together many of these local regions over the entire domain. We can try to fit a model to just this curved piece below for instance:
This is spline interpolation, a common technique for curve-fitting. Each spline is purposefully “overfit” to a local region. It will not extrapolate well beyond its domain, but it doesn’t have to. This is the conceptual basis for micro-models, manifested in a low dimensional space. These individual spline units are analogous to “micro-models” that we are using to automate our x-value labelling.
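As a rough illustration of the local-fitting idea, here is a minimal sketch in Python. It uses independent low-degree polynomial fits per region (a simplification of true spline interpolation, which would also enforce smoothness where the pieces join). The curve `true_curve`, the region boundaries, and every function name are assumptions made up for this example.

```python
import numpy as np

def true_curve(x):
    # Hypothetical stand-in for the unseen curve (an assumption for this sketch).
    return np.sin(2 * x) + 0.3 * np.cos(5 * x)

def fit_local_models(edges, samples_per_region=6, degree=3):
    """Fit one low-degree polynomial per region [edges[i], edges[i + 1]]."""
    models = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = np.linspace(lo, hi, samples_per_region)  # the x values we choose to "label"
        ys = true_curve(xs)                           # the costly label lookup
        models.append(np.polyfit(xs, ys, degree))
    return models

def predict(x, edges, models):
    """Send each x to the model that owns its region, then evaluate that model."""
    x = np.asarray(x, dtype=float)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(models) - 1)
    return np.array([np.polyval(models[i], xi) for i, xi in zip(idx, x)])

edges = np.linspace(0.0, 4.0, 9)      # 8 local regions over the domain [0, 4]
models = fit_local_models(edges)
grid = np.linspace(0.0, 4.0, 401)

# Error when every point is handled by the model that covers it.
max_err_inside = np.max(np.abs(predict(grid, edges, models) - true_curve(grid)))

# Error when the first local model is (mis)used over the whole domain.
err_outside = np.max(np.abs(np.polyval(models[0], grid) - true_curve(grid)))

print(f"stitched local models, max error: {max_err_inside:.4f}")
print(f"first model used everywhere, max error: {err_outside:.4f}")
```

Each local fit only needs a handful of samples and can be refit without touching its neighbours, at the price of being useless outside its own region.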
The more general case follows similar core logic with some additional subtleties (like utilizing transfer learning and optimizing sampling strategies). To automate a full computer vision annotation process, we also “stitch” micro-models together like an assembly line. Note that ensembling weak models together to achieve better inference results is an idea that has been around for a long time. This is slightly different. We are not averaging micro-models together for a single prediction; we let each one handle predictions on its own. Micro-models are also not just “weak learners.” They simply have limited coverage over a data distribution and exhibit very low bias over that coverage.
We are leveraging the fact that during an annotation process, we have access to some form of human supervision to “point” the models to the correct domain. This guidance over the domain of the micro-model allows us to get away with using very few human labels to start automating a process.
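One way to picture this "pointing" is a simple dispatcher: a human declares which slice of the data each micro-model owns, and every item is handled by exactly one model rather than an average of many. The sketch below is purely illustrative. The `MicroModel` class, its fields, and the lambda "models" are hypothetical and do not reflect how Cord implements this.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MicroModel:
    name: str
    video_id: str                  # domain assigned by a human annotator (hypothetical field)
    first_frame: int
    last_frame: int
    predict: Callable[[int], str]  # stand-in for real model inference

    def covers(self, video_id: str, frame: int) -> bool:
        return video_id == self.video_id and self.first_frame <= frame <= self.last_frame

def annotate(video_id: str, frame: int, models: List[MicroModel]) -> Optional[str]:
    """Return the prediction of the one model covering this frame, if any."""
    for model in models:
        if model.covers(video_id, frame):
            return model.predict(frame)
    return None  # no coverage: leave the frame for a human

models = [
    MicroModel("batman-scene-1", "movie_a", 0, 1499, lambda f: "batman"),
    MicroModel("polyp-video-7", "endoscopy_7", 0, 899, lambda f: "polyp"),
]

print(annotate("movie_a", 120, models))      # handled by the first micro-model
print(annotate("movie_a", 5000, models))     # outside every model's domain
```

Returning `None` outside a model's declared domain is a deliberate choice in this sketch: an overfit model should not be asked to extrapolate.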
Models can be defined based on form (a quantifiable representation approximating some phenomenon in the world) or on function (a tool that helps you do stuff). My view leans toward the latter. As the common quote goes:
All models are wrong, but some are useful.
Micro-models are no different. Their justification flows from their usefulness in applications across a variety of domains.
To look at the practical side of micro-models for annotation, let’s return to our Batman example. Taking fifteen hundred frames from the scene we trained our model on, we see that Batman is present in about half of them. Our micro-model in turn picks up about 70% of these instances. We thus get around five hundred Batman labels from only five manual annotations.
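The arithmetic behind that estimate can be checked in a few lines. The inputs are the rounded figures quoted above ("about half", "about 70%"), so the results are approximate; the variable names are just for illustration.

```python
total_frames = 1500
batman_fraction = 0.5      # "about half" of the frames contain Batman
recall = 0.7               # the micro-model picks up "about 70%" of those
manual_labels = 5

batman_frames = round(total_frames * batman_fraction)
detected = round(batman_frames * recall)
labels_per_manual_annotation = detected / manual_labels

print(batman_frames, detected, labels_per_manual_annotation)
```

Under these rounded inputs, each hand-drawn label yields on the order of a hundred model-generated ones, before any corrections are counted.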
There are of course issues of correction. We have false positives for instance. Consider the inference result of one of the “faux” Batmen from the scene that our model picks up on.
We also have bounding boxes that are not as tight as they could be. This is all, however, with just the first pass of our micro-model. Like normal models, micro-models will go through a few rounds of iteration. For this, active learning is the best solution.
We only started with five labels, but now with some minimal correction and smart sampling, we have more than five hundred labels to work with to train the next generation of our micro-model. We then use this more powerful version to improve on our original inference results and produce higher quality labels. After another loop of this process, when accounting for number of human actions, including manual corrections, our Batman label efficiency with our micro-model gets to over 95%.
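The post does not spell out how label efficiency is computed. One plausible reading, assumed here purely for illustration, is the share of final labels that did not require a human action. The function name and the example inputs below are hypothetical and are not measurements from the experiment.

```python
def label_efficiency(human_actions: int, total_labels: int) -> float:
    """Assumed definition: fraction of labels obtained without a human action."""
    if total_labels <= 0:
        raise ValueError("total_labels must be positive")
    return 1.0 - human_actions / total_labels

# Illustration only: 5 hand-drawn labels and no corrections against 525 labels.
first_pass = label_efficiency(human_actions=5, total_labels=525)
print(f"{first_pass:.1%}")
```

Under this definition, every manual correction counts as a human action, so the figure drops as corrections accumulate and rises as each new model generation needs fewer of them.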
There are some additional points to consider in evaluating micro-models:
- Time to get started: You can start using micro-models in inference within five minutes of a new project due to requiring so few labels to train.
- Iteration time: A corollary to the speed of getting started is the short iteration cycle. You can get into active learning loops that are minutes long rather than hours or days.
- Prototyping: Short iteration cycles facilitate rapid model experimentation. We have seen micro-models serve as very useful prototypes for the future production models people are building. They are quick checks if ideas are minimally feasible for an ML project.
While we have found success in using micro-models for data annotation, we think there is also a realm of possibility beyond just data pipeline applications. As mentioned before, AI is curve fitting. But even more fundamentally, it is getting a computer to do something you want it to do, which is simply programming. It is just a form of programming that is statistically rather than logically driven.
“Normal” programming works by establishing deterministic contingencies for the transformation of inputs to outputs via logical operations. Machine learning thrives in high complexity domains where the difficulty in logically capturing these contingencies is supplanted by learning from examples. This statistical programming paradigm is still in its infancy and hasn’t developed the conceptual frameworks around it for robust practical use yet.
For example, factorization of problems into smaller components is one of the key elements of most problem-solving frameworks. The object-oriented programming paradigm was an organizational shift in this direction that accelerated the development of software engineering and is still in practice today. We are still in the early days of AI, and perhaps instantiating a data-oriented programming paradigm is necessary for similar rapid progression.
In this context, micro-models might have a natural analogue in the object paradigm. One lump in a complex data distribution is an “object equivalent” with the micro-model being an instantiation of that object. While these ideas are still early, they coincide well with the new emphasis on data-centric AI. Developing the tools to orchestrate these “data objects” is the burden for the next generation of AI infrastructure. We are only getting started, and still have a long way to go.