A deep image classifier trained on e621's crowdsourced annotations
Prelude
There's no live tech demo; use test.py and eval.py instead. A demo may happen in the future, but if it does I don't want to be on the hook for maintaining or deploying it, at least not right now. Besides, most of the general applicability of something like this lies in transfer learning, not in the narrow objective of assigning tags to images. If we can make guarantees about the usefulness of the features being extracted for this overarching task, we can safely draw conclusions about their usefulness for other downstream tasks. More on that below.
All told, I got 0.84 validation AUC-ROC and 96.7% accuracy. These are quite different metrics: accuracy is computed at a single decision threshold, while AUC-ROC summarizes the tradeoff between true-positive and false-positive rates across every threshold. You can find the noob-friendly article I used to learn about it, along with a bit of background on Type I and Type II error, here.
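For reference, those numbers come from an evaluation loop along these lines. The real logic lives in eval.py; this is only a minimal sketch, and the names `model`, `val_images`, and `val_labels` are placeholders rather than anything from the actual scripts:

```python
# Minimal sketch of multi-label evaluation; the real logic lives in eval.py.
# model, val_images, and val_labels are placeholder names.
import numpy as np
from sklearn.metrics import roc_auc_score

probs = model.predict(val_images)            # (N, num_labels) sigmoid outputs in [0, 1]
preds = (probs >= 0.5).astype(np.int32)      # hard decisions at one fixed threshold

# Accuracy scores every (image, label) cell at that threshold, so the many
# easy negatives dominate it; AUC-ROC ranks positives against negatives
# across all thresholds, per label, and is averaged over labels here.
accuracy = (preds == val_labels).mean()
auc = roc_auc_score(val_labels, probs, average="macro")
print(f"accuracy={accuracy:.3f}  macro AUC-ROC={auc:.3f}")
```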
Goals and Motives
This model is essentially just a very deep encoder that maps images to much more compact, fixed-length vectors of real values, culminating in the final outputs of the classification head after a sigmoid activation. If you truncate my model's head and attach a new one, these final outputs, along with the outputs of intermediate layers, can be used to quickly train new heads on related downstream classification objectives, or on regression objectives such as automatic judgment of illustration quality. It's also useful if you just want to be able to identify a certain label which doesn't appear in the top 1024. See this tutorial, or just shoot me a message on Discord at flenser#5287 if you want me to talk to you about that.
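In Keras terms, the head swap could look something like the sketch below. This is a hypothetical example, not a snippet from this repository: the file name, the penultimate-layer index, and the new head's shape are all placeholders you would adapt to the actual saved model.

```python
# Hypothetical sketch of reusing the trained network for a new objective:
# load the classifier, drop its sigmoid head, and attach a fresh one.
# "classifier.h5" and the penultimate-layer index are placeholders.
import tensorflow as tf

base = tf.keras.models.load_model("classifier.h5")
base.trainable = False                         # freeze the pretrained stem to start

features = base.layers[-2].output              # activations just before the old head
new_head = tf.keras.layers.Dense(1, activation="sigmoid", name="new_label")(features)
model = tf.keras.Model(inputs=base.input, outputs=new_head)

# For a regression objective (e.g. illustration quality) you would swap the
# sigmoid for a linear Dense(1) and train with mean squared error instead.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(new_images, new_labels, epochs=3)    # new_images / new_labels: your own data
```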
It could also be extended to assist human annotators in quickly mining for rare tags: by giving downstream algorithms some idea of which tags are associated with an image based on its contents, it would let simpler client routines use cooccurrence data or other features to search for rare labels that might apply. It isn't a replacement for human annotators: its data-hungry nature, and my tendency to adhere to the precedent set by DeepDanbooru, meant I opted first to train it only to retrieve the top-k most common labels. That isn't too helpful for the annotation task directly, as these labels also unfortunately tend to be less descriptive; the distributional properties that make their semantics more widely applicable also make them less useful as discriminatory tools. Rare labels, however, aren't going to be ones that this model can consistently and accurately apply without some outside help.
In a similar vein, if you are interested solely in the model's stem for use as a dense image-feature extractor that turns image blobs into fixed-length dense vectors of floats, it can also do that. One example application for mapping images to a fixed-dimensional vector space is rolling your own image-query database with annoy or another sublinear nearest-neighbor implementation by retrieving and evaluating preview images. I don't know how much value is added by using this DNN's stem as a feature extractor instead of a DCT à la phash or other feature-extraction methods; SotA inference is hardly a necessity for every application, and simpler forms of AI, without that overhead, can often still outperform baselines on their own.
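If you wanted to go that route, a rough sketch with annoy might look like this. It assumes `stem` is a truncated feature-extractor model along the lines of the previous sketch; the tree count, metric, and file names are arbitrary choices, not anything this project prescribes.

```python
# Rough sketch of an approximate nearest-neighbor index over stem embeddings.
# "stem" is assumed to be a truncated Keras model; names here are placeholders.
from annoy import AnnoyIndex

dim = stem.output_shape[-1]                       # length of the embedding vector
index = AnnoyIndex(dim, "angular")                # cosine-style distance

for i, image in enumerate(preview_images):        # preview_images: preprocessed arrays
    vector = stem.predict(image[None, ...])[0]    # one fixed-length float vector per image
    index.add_item(i, vector)

index.build(50)                                   # 50 trees; more trees = better recall
index.save("previews.ann")

# Query: indices of the ten stored images closest to a new one.
query_vec = stem.predict(query_image[None, ...])[0]
neighbours = index.get_nns_by_vector(query_vec, 10)
```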
One way to get around doing any of the above would be to retrain or fine-tune this model against an alternate dataset which omits common tags and trains its classification head only against rarer labels; I think there may be enough data to facilitate this. But hopefully, even without any such measure, the ability to assign common labels to arbitrary images or embed them as high-dimensional vectors, augmented with some additional tooling or subsequent fine-tuning on new datasets, could help ease the task of annotating and searching.
Lastly, I wanted to see if I could do something like this with a single commodity GPU (in my case, a GTX 970), so I conducted almost all my experiments with the same device I use for playing games and rendering my desktop. She's exhausted, but don't worry, I cleared the dust out from under her shroud and all of that. I'm happy to say that it turns out you totally can, even if you don't have any formal education in this subject matter or really know what you're doing. As long as you read enough theory and guidance, I think you can get results you're proud of, or find useful, in impressively little time.
Architecture
To begin, I cobbled together a new network topology from a ResNet (ImageNet SotA) based on the DeepDanbooruV4 architecture. By the time I was finished it had these topological changes:
- Attention-augmented global image-feature encoder, inspired by the residual decoder blocks found in masked-language transformer architectures such as BERT and GPTx
- Explicit representation of correlated input-dependent label noise à la current WebVision SotA (see heteroscedastic.py)
- Replaced ReLU with SiLU for all intermediate activations
The change in intermediate activation was prompted by some literature suggesting that image classification is a sparse-gradient problem. TensorFlow's default is also to use uniform initialization rather than Gaussian, which, you may note, differs from what the authors of the ResNet publication would have recommended, so it seems unwise to let the derivative collapse completely to zero for negative pre-activations, as it does under ReLU. SiLU takes a little more effort to saturate and doesn't fall into dead zones as easily; it is also more expensive to compute, but I still thought it was the right choice, so that's the choice I made.
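For concreteness, the kind of block this swap produces might look like the sketch below. It's a generic illustration, not the exact block from this repository; it just pairs SiLU activations with He-normal (Gaussian) kernel initialization, the combination discussed above.

```python
# Illustrative residual block: SiLU (swish) in place of ReLU, and Gaussian
# (He-normal) kernel initialization instead of TensorFlow's default Glorot
# uniform. A generic sketch, not the exact block used in this model.
import tensorflow as tf

def residual_block(x, filters):
    # assumes x already has `filters` channels so the residual sum is valid
    init = tf.keras.initializers.HeNormal()
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", kernel_initializer=init)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Activation(tf.nn.silu)(y)          # smooth, nonzero gradient for x < 0
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", kernel_initializer=init)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    return tf.keras.layers.Activation(tf.nn.silu)(y + x)   # residual sum, then activation
```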
No idea how much any of these changes contribute to its performance; I got stuck in integration HE-double-hockey-sticks and didn't finish hyperparameter search in a timely enough manner to really want to check. For example, SiLU seemed to train faster in early trials with adaptive optimizers, and my eye is usually pretty good for this type of thing, but if you really care, I'm sorry to say that I'm not carrying out this work with the rigor of a researcher, because I'm not one. So if you want to check what kind of effect each of these deltas had, you may just have to change it and test it yourself. Even if you're reluctant to do that but also reluctant to accept that justification, I commend your skepticism, because there are availability and confirmation biases acting on my assessment here. It's also definitely slower at inference time, but it did allow me to cut out some penultimate layer normalization in the classification head. There are definite tradeoffs with every single one of these changes, mostly to do with parameter and FLOP counts. This has been a learning experience for me, and my first attempts were all fairly naïve, and consequently likely full of unrelated bugs that muddied the waters during those initial stages of rapid prototyping, so not even I'm foolhardy or egotistical enough to fully trust my own judgment on any of these additions or their internal construction.
Mucking with the current topology of the model head now feels very precarious, so I'll be giving myself a break from it. That said, the following are some parting shots and notes to myself for future work on this model's architecture, particularly on reducing its size:
- Maybe I should attach my crazy-sophisticated classification head to an EfficientNet or a DenseNet, to make things more compact, and see whether that works as well as the ResNet configurations.
- Using fully-parametrized locally connected layers in the penultimate feature extraction pipeline as opposed to a form of simpler downsampling, e.g. shared-weight convolutional kernels or pooling, is more than excessive.
- In the same vein as DenseNet: perhaps it's worth using residual concatenation instead of summation to forward-propagate globally-attended image features from the attention heads to the final layer, or simply removing the intermediate attention mechanisms. It wouldn't be hard to find out with some experiments, and this part is definitely contributing a LOT to the overall bulk of the current topology.
- Scalability issues. This architecture probably wouldn't work for rarer tags if we decided to have it predict them: DeepDanbooru's ResNet topology extracts a number of intermediate features that is linear in the number of outputs, so the parameter count of the penultimate layers is quadratic in the number of outputs, which (as I learned during this project) makes those ultra-dense penultimate layers explode surprisingly quickly. See the back-of-the-envelope sketch below.
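To make that last point concrete, here is a back-of-the-envelope calculation under an assumed ratio of intermediate features to outputs; the exact numbers are made up, only the shape of the growth matters.

```python
# Back-of-the-envelope illustration of the quadratic blow-up described above.
# The 4-features-per-label ratio is an arbitrary assumption for illustration.
def penultimate_params(num_labels, features_per_label=4):
    width = features_per_label * num_labels    # intermediate width: linear in the outputs
    return width * num_labels                  # dense weight matrix: quadratic in the outputs

for n in (1024, 4096, 16384):
    print(f"{n:>6} labels -> {penultimate_params(n):>13,} weights")
#   1024 labels ->     4,194,304 weights
#   4096 labels ->    67,108,864 weights
#  16384 labels -> 1,073,741,824 weights
```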
Proceedings
As I mentioned in my discussion of motive, I carried out the final training on my workstation's video card. The dataset was sampled with the goal of ascribing a minimum of 512 positive training instances to each label, making for ~500,000 total instances, which I retrieved from ca. 2014 to present (see download.py). Most of the challenge was in getting good I/O throughput on the resulting data and in conducting hyperparameter search; model topologies for image classification are mostly solved, with most recent iteration on their architecture revolving around making them cheaper and more robust rather than more powerful, and I owe a lot of what I'm doing here to prior works. In fact, the field is now saturated enough that research from top labs is being performed with variations on model-stem topologies half a decade old, which succeeded in environments with less ambiguous ground truth and are now being applied to new tasks where human annotations are sparse or noisy, bringing us to new frontiers of self-supervised meta-learning. e621 presents few of those challenges, really, making my solution highly overengineered, but technically correct (the best kind of correct).
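The sampling criterion itself lives in download.py; conceptually it amounts to something like the loop below. The post iterator, tag fields, and label set here are placeholder names, not the script's actual interface.

```python
# Conceptual sketch of the sampling goal: keep selecting posts until every
# target label has at least MIN_POSITIVES positive examples. The real logic
# lives in download.py; iterate_posts() and post.tags are placeholder names.
from collections import Counter

MIN_POSITIVES = 512
positives = Counter()
selected = []

for post in iterate_posts():                                   # hypothetical metadata generator
    wanted = [t for t in post.tags if t in target_labels]      # target_labels: the top-1024 set
    if any(positives[t] < MIN_POSITIVES for t in wanted):
        selected.append(post)
        positives.update(wanted)
    if all(positives[t] >= MIN_POSITIVES for t in target_labels):
        break                                                  # every label has enough positives
```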
According to a 2017 Facebook Research publication, AIs outperforming humans in object detection is old hat. So I copied some of the objective outlined in this paper for my own training regime. Specifically, I set alpha to the same value selected by DeepDanbooru (0.25), and then wrapped it in a robust adaptive loss. I haven't played around with alpha very much; please try this for me if you end up doing your own experiments.
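For anyone experimenting with alpha, the focal term in question looks roughly like the following; this is a minimal sketch with the conventional gamma = 2, and it omits the robust adaptive wrapper entirely.

```python
# Minimal sketch of a binary focal loss with alpha = 0.25 and gamma = 2.0.
# The robust adaptive wrapper mentioned above is intentionally left out.
import tensorflow as tf

def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)    # probability of the true class
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)  # class re-balancing weight
    # (1 - p_t)^gamma down-weights easy examples so hard ones dominate the gradient
    return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
```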
Now we come to why exactly I am prioritizing model training speed and convergence stability so heavily instead of model serving or ease of deployment. I'm not just impatient; my optimization hyperparameters are as well. The (gratuitous) architecture was fit with SGD and momentum = 0.99 over a single superconvergence-cycle learning-rate schedule, with LR ~ 0.1-4.0 over 4 epochs, or approx. 45,000 steps, at batch size 12. Note that my architecture and training outline, while brief, omitted any mention of explicit regularization measures beyond this choice of learning rate; that is by design: with this training procedure, they are not to be included. Also note that the authors of the superconvergence publication would probably be puzzled that I used a high, constant momentum instead of another one-cycle schedule that dynamically computes a step-wise momentum ~ 0.85-0.95. But I got some pretty good mileage out of the learning-rate adjustments already, the model only sees each training instance once, and the Keras API's Model.fit doesn't support dynamic mutation of very many optimizer parameters. I didn't feel like plugging custom behavior into the optimizer by subclassing it, and I figure I can treat the high momentum as compensation for the regularization effects of my low batch size anyway, which should just about balance it out. I think. Don't quote me on that, because like so much else about this work, that is not even empirical conjecture, just my gut. I saw that it converged faster with high momentum than with anything modest, and without increased momentum, model performance plateaued very early, and so here we are.
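As a rough picture of what that schedule looks like in code, the sketch below ramps the learning rate linearly from 0.1 up to 4.0 and back down over ~45,000 steps while holding momentum fixed at 0.99. The callback, step counts, and data names are an approximation of the idea, not the training script's actual implementation.

```python
# Approximate sketch of one-cycle ("superconvergence") training with SGD:
# LR ramps 0.1 -> 4.0 -> 0.1 over ~45,000 steps, momentum stays at 0.99.
# Not the project's actual training code; names and shapes are illustrative.
import tensorflow as tf

TOTAL_STEPS = 45_000
LR_MIN, LR_MAX = 0.1, 4.0

class OneCycleLR(tf.keras.callbacks.Callback):
    def __init__(self, total_steps, lr_min, lr_max):
        super().__init__()
        self.total_steps, self.lr_min, self.lr_max = total_steps, lr_min, lr_max
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        half = self.total_steps / 2
        # linear ramp up for the first half of the cycle, linear ramp down after
        frac = self.step / half if self.step < half else 2.0 - self.step / half
        lr = self.lr_min + (self.lr_max - self.lr_min) * max(frac, 0.0)
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr)
        self.step += 1

optimizer = tf.keras.optimizers.SGD(learning_rate=LR_MIN, momentum=0.99)
model.compile(optimizer=optimizer, loss=binary_focal_loss)
model.fit(train_images, train_labels, epochs=4, batch_size=12,
          callbacks=[OneCycleLR(TOTAL_STEPS, LR_MIN, LR_MAX)])
```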