    Social Media Caption Generator Explained

    An application of a transformer model to the problem of image captioning and the process for training it

    Data Source and Preprocessing

      I trained my image captioning model on the InstaCities1Million dataset. I heavily preprocessed the data, using regular expressions to remove hashtags, URLs, irregular spacing, emojis, and other special characters. After repeated training runs produced models whose predictions were overly concise or cycled through the same few words, I also removed examples whose captions had too many or too few words. I used the Langdetect library to keep only English examples and the SpaCy library to detect and drop instances containing named entities, and I discarded examples whose images did not have three color channels. At the end of preprocessing, I was left with just under 200,000 entries for my dataset.
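
      As a concrete illustration of this filtering, the sketch below shows the kind of cleaning function described above. The regular expressions, the word-count thresholds (MIN_WORDS, MAX_WORDS), and the spaCy model name are illustrative assumptions, not the exact values used in the project.

        import re

        import langdetect   # pip install langdetect
        import spacy        # pip install spacy; python -m spacy download en_core_web_sm

        nlp = spacy.load("en_core_web_sm")

        # Illustrative thresholds; the project's actual cutoffs may differ.
        MIN_WORDS, MAX_WORDS = 3, 25

        HASHTAG = re.compile(r"#\w+")
        URL = re.compile(r"https?://\S+|www\.\S+")
        NON_ASCII = re.compile(r"[^\x00-\x7F]+")   # strips emojis and other special characters
        MULTISPACE = re.compile(r"\s{2,}")

        def clean_caption(text):
            """Return a cleaned caption, or None if the example should be dropped."""
            text = HASHTAG.sub("", text)
            text = URL.sub("", text)
            text = NON_ASCII.sub("", text)
            text = MULTISPACE.sub(" ", text).strip()

            words = text.split()
            if not (MIN_WORDS <= len(words) <= MAX_WORDS):
                return None                          # too short or too long
            try:
                if langdetect.detect(text) != "en":  # keep English captions only
                    return None
            except langdetect.LangDetectException:
                return None
            if nlp(text).ents:                       # drop captions containing named entities
                return None
            return text
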
      The vocabulary for the model covered 99% of the words in the dataset; all remaining words were mapped to a single out-of-vocabulary token. I split the data into a training and a validation set, with 10,000 examples held out for validation. For my Vision Transformer approach, I precomputed and stored features from an Inception model to use in place of the raw image data. For the Perceiver variant, I moved the concatenation of Fourier features outside of the model and into a custom Keras data generator. For all models, I augmented the image data with the Imgaug library before training.
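
      For the Perceiver variant, moving the Fourier features into the data pipeline might look roughly like the sketch below. The generator class, the number of Fourier bands, the maximum frequency, and the specific augmenters are assumptions for illustration; only the overall pattern (augment with Imgaug, then concatenate precomputed Fourier position features along the channel axis in a custom Keras generator) reflects what is described above.

        import numpy as np
        import tensorflow as tf
        import imgaug.augmenters as iaa

        # Illustrative augmenters; the project's actual augmentation pipeline may differ.
        AUGMENTER = iaa.Sequential([
            iaa.Fliplr(0.5),
            iaa.Affine(rotate=(-10, 10)),
        ])

        def fourier_position_features(height, width, num_bands=16, max_freq=10.0):
            """Simplified 2-D Fourier positional features in the style of the Perceiver."""
            ys = np.linspace(-1.0, 1.0, height)
            xs = np.linspace(-1.0, 1.0, width)
            pos = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)       # (H, W, 2)
            freqs = np.linspace(1.0, max_freq / 2.0, num_bands)               # (B,)
            scaled = pos[..., None] * freqs * np.pi                           # (H, W, 2, B)
            feats = np.concatenate(
                [np.sin(scaled), np.cos(scaled), pos[..., None]], axis=-1)    # (H, W, 2, 2B + 1)
            return feats.reshape(height, width, -1).astype("float32")

        class CaptionSequence(tf.keras.utils.Sequence):
            """Yields (image + Fourier features, caption tokens) batches."""

            def __init__(self, images, captions, batch_size=32):
                super().__init__()
                self.images, self.captions, self.batch_size = images, captions, batch_size
                height, width = images.shape[1:3]
                self.pos = fourier_position_features(height, width)           # computed once

            def __len__(self):
                return len(self.images) // self.batch_size

            def __getitem__(self, idx):
                batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
                imgs = AUGMENTER(images=self.images[batch])                   # uint8 (N, H, W, 3)
                imgs = imgs.astype("float32") / 255.0
                pos = np.broadcast_to(self.pos, imgs.shape[:1] + self.pos.shape)
                # Concatenate Fourier features outside the model, along the channel axis.
                return np.concatenate([imgs, pos], axis=-1), self.captions[batch]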

    Use Cases and Findings During Data Exploration

      While exploring the dataset, I found that many of the captions were very short, used the phrase “I don’t know what to say,” or were left empty. This suggests an enormous demand for inspiration when it comes to captioning a social media post, so I decided to build a machine learning model to provide exactly that. The model is not meant to be used in an automated process or for commercial purposes. I hope that anyone who chooses to use it enjoys what it has to offer, whether as a starting point for writing a caption of their own, as a source of ready-made captions, or simply for fun.

    Model Architecture

      My best-performing model used a Perceiver encoder and a standard transformer decoder. I followed a pre-normalization convention throughout and made the model fully residual, with no dropout layers; both the Perceiver and the standard transformer were my own implementations. I implemented all models in TensorFlow using subclassed Layer and Model objects. For activation functions, I used the Gaussian error linear unit (GELU) inside every layer and a softmax on the final layer. The best-performing model shared weights between repetitions of the Perceiver encoder blocks, except for the first block. Hyperparameters were chosen based on validation error and were ultimately constrained by my hardware; I would expect the same architecture to perform better with more model parameters. For the optimizer, I found that LAMB performed more reliably than RectifiedAdam. Captions are generated with beam search, with some randomness added for variety.
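
      To make the pre-normalization and residual structure concrete, here is a minimal sketch of one Perceiver-style cross-attention block as a subclassed Keras layer. The class name, head count, and MLP width are assumptions for illustration; the project's own layers are more involved (and XLA-compatible, as discussed below).

        import tensorflow as tf

        class PerceiverCrossAttentionBlock(tf.keras.layers.Layer):
            """Pre-norm, fully residual cross-attention block (illustrative sizes)."""

            def __init__(self, latent_dim, num_heads=1, mlp_ratio=4, **kwargs):
                super().__init__(**kwargs)
                self.norm_latents = tf.keras.layers.LayerNormalization()
                self.norm_inputs = tf.keras.layers.LayerNormalization()
                self.norm_mlp = tf.keras.layers.LayerNormalization()
                self.attention = tf.keras.layers.MultiHeadAttention(
                    num_heads=num_heads, key_dim=latent_dim)
                # GELU inside every layer, no dropout, as described above.
                self.mlp = tf.keras.Sequential([
                    tf.keras.layers.Dense(latent_dim * mlp_ratio, activation="gelu"),
                    tf.keras.layers.Dense(latent_dim),
                ])

            def call(self, latents, inputs):
                # Pre-normalization: normalize before attention, add the residual after.
                normed_inputs = self.norm_inputs(inputs)
                attended = self.attention(query=self.norm_latents(latents),
                                          value=normed_inputs, key=normed_inputs)
                latents = latents + attended
                # Pre-norm MLP, also residual.
                return latents + self.mlp(self.norm_mlp(latents))

      Sharing weights between repetitions of the encoder blocks then amounts to calling the same block instance repeatedly in the encoder's call method, rather than constructing a new block for each repetition.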

    Comparison to Baseline

      As a baseline for comparison, I used the previous gold standard for image captioning: a convolutional neural network encoder with a long short-term memory (LSTM) decoder. The CNN and LSTM combination achieved a maximum validation accuracy of only 12% on the same dataset, whereas my transformer model reached 38%. The transformer far exceeded my expectations and proved to be a much better choice for this task. Among the other architectures I tested, the Vision Transformer model reached 20% validation accuracy, still a significant improvement over the baseline, though I had expected it to perform better. It may have underperformed relative to the Perceiver because the pre-trained computer vision model providing its input features was not tuned to this dataset. I attempted to train a convolutional neural network of my own to replace the pre-trained model; however, it achieved a similar result.

    Optimizing Computation through TensorBoard

      Working on this project taught me a lot about optimizing for computational efficiency, largely through TensorBoard and Weights & Biases. I had to optimize the input pipeline feeding data into the model, using multiprocessing to remove a data-loading bottleneck. Kernel launch time accounted for a significant portion of training, so I created an XLA-compatible version of the model's layers. Together, these optimizations took the training loop from two minutes and thirty seconds per hundred batches down to twenty seconds per hundred batches, while mixed precision let me double the batch size. I was then able to train models to convergence in a much more reasonable amount of time.
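
      As an example of the kind of changes involved, the sketch below enables mixed precision and compiles a training step with XLA. The helper name and the assumption that the optimizer is wrapped in a LossScaleOptimizer are mine for illustration; the project's actual training loop is not reproduced here.

        import tensorflow as tf

        # Mixed precision stores activations in float16, freeing memory for a larger batch size.
        tf.keras.mixed_precision.set_global_policy("mixed_float16")

        def make_train_step(model, optimizer, loss_fn):
            """Build an XLA-compiled training step.

            Assumes `optimizer` is wrapped in tf.keras.mixed_precision.LossScaleOptimizer
            and that the model's layers are XLA-compatible.
            """

            @tf.function(jit_compile=True)      # XLA removes per-op kernel launch overhead
            def train_step(images, captions):
                with tf.GradientTape() as tape:
                    predictions = model(images, training=True)
                    loss = loss_fn(captions, predictions)
                    scaled_loss = optimizer.get_scaled_loss(loss)   # avoid float16 underflow
                scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
                grads = optimizer.get_unscaled_gradients(scaled_grads)
                optimizer.apply_gradients(zip(grads, model.trainable_variables))
                return loss

            return train_step

      When training through model.fit with the custom Keras generator instead, passing workers and use_multiprocessing=True (available in TF 2.x) is one way to relieve a data-loading bottleneck of the kind described above.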