I started academically in physics, but dropped out (sorry, took a leave of absence) of my PhD in the first year and did a long stint in quantitative finance. So out of all the possible topics for my first peer-reviewed published paper (portfolio optimisation, dark matter signatures, density functional theory), I ended up with the topic of… drawing rectangles on a colonoscopy video. I didn’t think it would come to this, but here we are. In reality though, drawing boxes on a colonoscopy video is one of the most interesting problems I have worked on.
The purpose of this post is to review the paper we (including my cofounder Ulrik at Cord) recently published on this topic: “Novel artificial intelligence-driven software significantly shortens the time required for annotation in computer vision projects”. The paper, in the journal Endoscopy International Open, can be found here. This was cowritten with the deft and patient assistance of our collaborators at King’s College London, Dr Bu Hayee and Dr Mehul Patel.
To convince you to keep reading after you have already seen the words “annotation” and “colonoscopy”, let me start by stating that the field of gastroenterology is immensely important for human wellbeing, both in terms of cancer incidence and everyday chronic ailments. From cancer.org:
In the United States, colorectal cancer is the third leading cause of cancer-related deaths in men and in women, and the second most common cause of cancer deaths when men and women are combined. It’s expected to cause about 52,980 deaths during 2021.
Even more prevalent is inflammatory bowel disease (IBD). In 2015 around 3 million people in the US had been diagnosed with IBD, a condition associated with higher likelihood of respiratory, liver, and cardiovascular disease among others.¹ Add this to the litany of other conditions and symptoms encapsulated under the GI umbrella (acid reflux, indigestion, haemorrhoids, etc.) and we find that the scope (no pun intended) of GI knows few bounds with its effect on the population.
But gastroenterology is also very important for the AI community. It is one of the early vanguards of commercial adoption of medical AI. Companies such as Pentax, FujiFilm, and Medtronic are among the crop of medical device companies rushing into the field to build their own AI-enabled scoping technology. These models can run live detection of polyps and act as a gastroenterologist’s assistant during a scoping procedure, sometimes even catching the doctor’s blind spots.
Progress in this field will be a beacon to the rest of a skeptical medical community that AI is not just a playground for mathematicians and computer scientists, but a practical tool that directly matters in people’s lives.
But, there is a problem.
Unlike a machine learning model that is serving up a binge-y Netflix show to an unsuspecting attention victim (where the stakes of a mistake are that you end up watching an episode of Emily in Paris), getting a polyp detection wrong or misdiagnosing ulcerative colitis has drastic implications for people’s health. The models that are developed thus need to be as foolproof as you can get in the machine learning world, which requires prodigious quantities of data.
Empirically, models tend to require ever-increasing amounts of data to stave off stagnating performance. Getting a model from 0% to 75% accuracy might require as much data as getting from 75% to 85%, which in turn requires as much as getting from 85% to 90%, and so on. To get over 99% accuracy with the current methods and models we have, you need to throw a lot of data at the problem.
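As a purely illustrative sketch (the frame counts below are made up, not from the paper), here is what that compounding requirement looks like if each accuracy milestone costs as much new data as everything gathered before it:

```python
# Hypothetical illustration of diminishing returns: each accuracy
# milestone needs as much *new* data as all data collected so far,
# so the total dataset doubles at every milestone.
total_frames = 10_000  # made-up cost of reaching the first milestone

for milestone in ["75%", "85%", "90%", "95%", "99%"]:
    print(f"{milestone} accuracy: ~{total_frames:,} frames")
    total_frames *= 2  # the next improvement costs as much again
```

Under these made-up numbers, the dataset needed for the 99% milestone is already sixteen times the one that got you to 75%.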
The issue is that for a model to train from this data, it needs to be annotated. These annotations can only effectively be completed by doctors themselves, who have the expertise to correctly identify and classify the patient video and images. This is a massive drain on doctor time.
A high-accuracy endoscopy model might require one million annotated frames. Assuming a conservative estimate of 20 seconds per frame, including review from one or two other doctors, that’s about 230 days of round-the-clock doctor time, roughly the number of working days in a year. That’s a doctor’s working year which is certainly better spent treating and caring for patients (and practising their handwriting).
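A quick back-of-envelope check of that estimate, using only the numbers quoted above:

```python
# Back-of-envelope: one million frames at 20 seconds each,
# expressed as continuous 24-hour days of annotation time.
frames = 1_000_000
seconds_per_frame = 20

total_seconds = frames * seconds_per_frame
total_days = total_seconds / (60 * 60 * 24)
print(f"~{total_days:.0f} days of annotation time")  # ~231 days
```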
This opportunity cost was the original motivation for starting Cord. We wanted to save valuable time for anyone that has to undergo the necessary evil of data annotation, with doctors being the most egregious case. And after building our platform, we wanted to see if it actually worked. So, we ran an experiment.
We decided to run a simple A/B test of our platform against the most widely used open-source video annotation tool, CVAT. Openly available video annotation tools are difficult to come by, but CVAT stands out as one of the platforms with the most active users and stars on GitHub.
We set up a sample of data from an open-source gastrointestinal dataset (the Hyper-Kvasir dataset) to perform the experiment. From the paper:
Using a subsample of polyp videos from the Hyper-Kvasir dataset, five independent annotators were asked to draw bounding boxes around polyps identified in videos from the dataset. A test set of 25,744 frames was used.
The experimental setup was:
- Each annotator would have two hours on Cord and two hours on CVAT
- The annotators would run through the data in the same order for both platforms and use any available feature from each platform
- Annotators could only submit frames that, at the end of the process, they had reviewed and were happy with
- At the end of the two hours, we would simply count the number of approved frames from each annotator on each platform
The power of the Cord platform (termed CdV in the paper) lies in its ability to quickly train and use annotation-specific models, but for the experiment no labels or models were seeded for the annotators. They could only use models that they trained themselves on the data they had annotated within the time limit of the experiment. Normally this would not be the case, of course: if you are tagging hundreds of thousands of frames, you will already have models and intelligence to pull from. But we wanted to stack the deck against us as much as we could and have the annotators start cold.
The results were not close. From the paper:
In the 120-minute project, a mean±SD of 2241±810 frames (less than 10% of the total) were labelled with CVAT compared to 10674±5388 with CdV (p=0.01). Average labelling speeds were 18.7/min and 121/min, respectively (a 6.4-fold increase; p=0.04) while labelling dynamics were also faster in CdV (p<0.0005; figure 2). The project dataset was exhausted by 3 of 5 annotators using CdV (in a mean time of 99.1±15.2 minutes), but left incomplete by all in CVAT.
With CVAT, most annotators did not make it past the third video, and every single annotator produced more labels with Cord than with CVAT. Most encouraging was that the most senior doctor among the annotators, the one with the least experience with any annotation software, got a 16x increase in efficiency from Cord. This was the exact target user we designed the platform for, so seeing these results pan out was a major win for our hypothesis.
Briefly, the reason Cord was more efficient was simply the automation of most of the labelling:
Labellers were allowed to adopt their own labelling strategies with any functionality offered in each platform. With CVAT, this consisted of tools to draw bounding boxes and propagate them across frames using linear interpolation of box coordinates. With CdV, labellers had access to both hand labelling annotation tools and CdV’s embedded intelligence features. This embedded intelligence was composed of object tracking algorithms and functionality to train and run convolutional neural networks (CNNs) to annotate the data.
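To make the CVAT side concrete, linear interpolation of box coordinates amounts to blending the corner coordinates of two keyframe boxes; every frame in between gets a box partway along. A minimal sketch (the function name and `(x_min, y_min, x_max, y_max)` convention are mine, not CVAT’s):

```python
# Sketch of keyframe-based linear interpolation for bounding boxes:
# the annotator draws a box on two keyframes and the frames in
# between are filled in by interpolating each coordinate linearly.

def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate each coordinate between two keyframes."""
    t = (frame - frame_a) / (frame_b - frame_a)  # 0 at frame_a, 1 at frame_b
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

# Keyframes at frame 0 and frame 10; query the midpoint.
box = interpolate_box((10, 10, 50, 50), (30, 20, 70, 60), 0, 10, 5)
print(box)  # (20.0, 15.0, 60.0, 55.0)
```

This works well while the polyp moves smoothly between keyframes, which is why the annotator still has to review and re-keyframe whenever the motion is erratic.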
Even with a completely cold start, Cord’s “embedded intelligence” automated over 96% of the labels produced during the experiment:
With CdV, only 3.44%±2.71% of labels produced were hand drawn by annotators. The remainder were generated through either models or tracking algorithms. Thus with CdV far more labels were produced with far less initial manual input (Figure 3). Automated labels still required manual time for review and/or adjustment. For model generated labels, a mean of 36.8±12.8 minutes of the allocated annotator time was spent looking over them frame by frame and making corrections.
The most interesting observation, in my opinion, was the acceleration of the labelling rate on the Cord platform. For CVAT, the label rate remained approximately constant for the duration of the experiment. With Cord, however, annotation speed increased by a median of 55% (!) for every twenty-minute interval on the platform. Every label marginally informed the next. The hope is that with more labels and even larger projects, this effect will lead to a precipitous drop in the temporal (and financial) cost of creating training data sets.
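To give a feel for what a 55% median increase per interval compounds to over a session, here is a sketch with an assumed starting rate (only the 55% figure comes from the experiment; the starting rate is hypothetical):

```python
# Compounding sketch: if annotation speed grows 55% every 20-minute
# interval, the rate in the last interval of a 120-minute session is
# roughly nine times the starting rate.
start_rate = 20.0   # labels/min in the first interval (hypothetical)
growth = 1.55       # +55% per 20-minute interval

rates = [start_rate * growth**i for i in range(6)]  # six intervals
print([round(r, 1) for r in rates])
print(f"final/initial: {rates[-1] / rates[0]:.1f}x")  # ~8.9x
```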
While the results were favourable, we recognise there is a lot more to do. Polyp detection is a relatively simple annotation task, so while it is a costly tax on doctors, we realise there are even costlier ones that we need to address. Our software is designed to deal with arbitrarily complex labelling structures, but designing automation around this complexity is a tricky but interesting problem that we are working on.
That being said, we showed that we could save doctors a bunch of time annotating their data. Give them intelligent but accessible tools, and they will save their own time. With that, the bottleneck to the next iteration of medical AI does not need to be lack of training data.
If you want to chat about annotation or AI feel free to reach out to me at firstname.lastname@example.org.