    Advanced Machine Learning Specialization Review

    A review of topics from the Advanced Machine Learning Specialization by HSE on Coursera and additional commentary

    Overview

      The Advanced Machine Learning Specialization was a challenging but rewarding series of courses that helped me understand current approaches to applying machine learning and bridged the gap between theory and practice. I found the structure of the lessons to be well thought out: they built upon each other and reused ideas from previous courses in the series as I progressed. I recommend it to anyone looking for a rigorous explanation of advanced topics in machine learning and anyone who would like to understand the nuances of real applications of the field. In this article, I will highlight essential subjects from the series, give my thoughts on the future of each area, and attempt to explain the core concepts intuitively.

    Introduction to Deep Learning

      The first course in the series was essentially a comprehensive review of the topics in the deep learning specialization that I took through deeplearning.ai. It covered linear models, gradient descent, regularization, overfitting, backpropagation, convolutional neural networks, recurrent neural networks, and unsupervised learning.
      The novel aspect of this course for me was the introduction to autoencoders. Autoencoders are neural networks that compress the original data into a small learned representation and then aim to reconstruct the original data from that representation. They can be helpful for creating a compact representation of almost any kind of data for more general machine learning applications. I believe this technology will be used for data transmission and other compression use cases, as an input to graph-level neural networks for recommendation systems, or as a universal connector between neural network systems. They are fantastic because the data they are trained on does not need to be manually labeled, so their potential for future use in self-supervised machine learning is enormous.
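      As a rough illustration, here is a minimal autoencoder sketch in PyTorch; the layer sizes, the 784-dimensional input, and the single training step are illustrative assumptions rather than the course's exact model.

        import torch
        import torch.nn as nn

        # A minimal autoencoder sketch (illustrative sizes, not the course's exact model).
        class Autoencoder(nn.Module):
            def __init__(self, input_dim=784, code_dim=32):
                super().__init__()
                # Encoder compresses the input into a small code vector.
                self.encoder = nn.Sequential(
                    nn.Linear(input_dim, 128), nn.ReLU(),
                    nn.Linear(128, code_dim),
                )
                # Decoder attempts to reconstruct the original input from the code.
                self.decoder = nn.Sequential(
                    nn.Linear(code_dim, 128), nn.ReLU(),
                    nn.Linear(128, input_dim),
                )

            def forward(self, x):
                code = self.encoder(x)
                return self.decoder(code)

        model = Autoencoder()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()

        x = torch.rand(64, 784)            # a stand-in batch of unlabeled data
        reconstruction = model(x)
        loss = loss_fn(reconstruction, x)  # the target is the input itself, no labels needed
        loss.backward()
        optimizer.step()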

    How to Win a Data Science Competition

      Data science competitions, such as those hosted on kaggle.com, are quite different from practical applications of machine learning. In a competition setting there is generally only one metric that matters, a limited amount of data, and a predefined set of data on which submissions will be evaluated. This often leads to strategies specific to those constraints, such as choosing loss functions that resemble the competition metric as closely as possible. A clear example of the discrepancy between a model's loss and its metric that sticks out to me is from my image captioning project. During training, I noticed that the validation accuracy continued to increase even while the validation loss also increased after some time. This can mean that correct predictions were becoming less confident or that incorrect predictions were becoming more confident. In either case, the model with the lower validation loss would probably generate more realistic captions, even with a lower validation accuracy; in a competition setting, however, only the metric would matter.
      Another unfortunate consequence of contests is data leakage, where the target variable can be predicted easily from any number of flaws in the competition setup. A related shortcoming is that it pays to tailor the model specifically to the predefined test data, for example by encoding a categorical feature with the mean value of the target variable within each category. This strategy of mean encoding is not meaningful from a research or production viewpoint, though, because it is fit specifically to perform best on the competition data set. Other strategies for tailoring model performance to the test data are also common in competitions, such as attempts to make the distributions of the training and test sets match as closely as possible.
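      As a sketch of what mean encoding looks like in practice, with made-up column names and data, a categorical feature can be replaced by the average of the target within each category, computed on the training data:

        import pandas as pd

        # Hypothetical data: encode a categorical feature by the mean of the target.
        train = pd.DataFrame({
            "city": ["A", "A", "B", "B", "C"],
            "target": [1, 0, 1, 1, 0],
        })
        test = pd.DataFrame({"city": ["A", "B", "C", "D"]})

        # Mean of the target per category, computed on the training data only.
        means = train.groupby("city")["target"].mean()
        global_mean = train["target"].mean()

        train["city_mean_enc"] = train["city"].map(means)
        # Unseen categories in the test set fall back to the global mean.
        test["city_mean_enc"] = test["city"].map(means).fillna(global_mean)

      In practice, competitors usually compute this encoding out-of-fold or with smoothing so that the feature does not leak the training targets too directly.
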
      There are good skills that are learned through competitions too. Exploratory data analysis is vital to doing well in a contest. It builds the skills to understand a data set better so that you can apply those insights to the task through feature engineering and data preprocessing. For example, to the untrained eye, missing values might intuitively be filled in with the mean of the data. If the data contains outliers, though, the mean will be misleading and inaccurate to use; it would be better to fill in the missing values with the median or through a regression based on the other features. Exploratory data analysis produces summary statistics like the mean, median, mode, interquartile range, and covariance, along with visualizations, to support informed decisions about the rest of the process. It can even be used directly in a business setting to demonstrate key focus areas.
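      A tiny pandas example of the point about outliers; the numbers are hypothetical:

        import numpy as np
        import pandas as pd

        # Hypothetical feature with an outlier and a missing value.
        s = pd.Series([1.0, 2.0, 2.5, 3.0, np.nan, 250.0])

        print(s.mean())    # pulled far upward by the outlier (about 51.7)
        print(s.median())  # 2.5, a more representative central value

        filled = s.fillna(s.median())  # impute with the median instead of the mean
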
      The course did include some fascinating feature engineering information. In particular, the technique I want to draw attention to is K-nearest neighbor features. An optional programming assignment included a portion on multiprocessing in Python and a section on how to design features related to the K-nearest neighbors algorithm. The K-nearest neighbors algorithm, or KNN, classifies a point according to the categories of the points closest to it in feature space. In the same spirit, it is possible to include information about distances to nearby neighbors, characteristics of those neighbors, the number of data points within a given distance, and similar statistics as additional features before feeding the data into the model. This extra information can significantly increase performance for models that do not natively use neighborhood structure.
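      A hedged sketch of building such neighbor features with scikit-learn's NearestNeighbors; the specific features and radius are illustrative choices:

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        # Hypothetical feature matrix; in a competition this would be the training data.
        X = np.random.rand(1000, 10)

        k = 5
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)        # +1 because each point is its own nearest neighbor
        distances, indices = nn.kneighbors(X)
        distances, indices = distances[:, 1:], indices[:, 1:]  # drop the self-neighbor

        # Example neighbor-based features to append to the original matrix.
        mean_dist = distances.mean(axis=1, keepdims=True)       # average distance to the k nearest points
        nearest_dist = distances[:, [0]]                        # distance to the single closest point
        density = (distances < 0.5).sum(axis=1, keepdims=True)  # how many neighbors fall within a radius

        X_with_knn = np.hstack([X, mean_dist, nearest_dist, density])
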
      Tree-based models are especially prevalent in competitions. This branch of machine learning is different from deep learning; it attempts to predict the target variable by creating a set of weaker classifiers that each contribute to the final prediction. An individual classifier in a tree-based model typically has only a few considerations when giving its prediction, such as examining a small subset of the input features and splitting the data on a threshold to separate two classes. Examples of tree-based models include Random Forests and Extra Trees, where the explanation above holds directly. Other tree-based models focus on boosting. Gradient boosted decision tree models such as LightGBM (Light Gradient Boosting Machine) and XGBoost (Extreme Gradient Boosting) create the smaller classifiers in series, so that each new tree focuses on correcting the points the overall model currently predicts incorrectly.
      For better results, an ensemble of machine learning models can be combined in different ways, either through feature engineering or by using models' predictions as features for other models. A simple yet effective strategy is to combine multiple different types of machine learning models, such as tree-based models, deep learning models, generative models like the naive Bayes classifier, and unsupervised clustering models used for features, rather than relying on models that take similar approaches, because their combination will capture different spans of information. One ensemble approach, known as model stacking, trains a linear regression from all of the models' predictions to the target variable and uses the regression's output as the final prediction.
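      A minimal stacking sketch with scikit-learn, assuming a generic classification data set; the base models and meta-model here are just examples of mixing different model families:

        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.naive_bayes import GaussianNB

        X, y = make_classification(n_samples=500, n_features=20, random_state=0)

        # Base models from different families, combined by a linear meta-model
        # trained on their out-of-fold predictions.
        stack = StackingClassifier(
            estimators=[
                ("gbdt", GradientBoostingClassifier()),
                ("rf", RandomForestClassifier()),
                ("nb", GaussianNB()),
            ],
            final_estimator=LogisticRegression(),
            cv=5,
        )
        stack.fit(X, y)
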
      All in all, competitions can be a great way to learn about every aspect of the data science process, and they have introduced me to many ways of approaching problems that I had not previously considered. The field of data science and machine learning is so vast and fluid that it is paramount to keep learning new concepts, and there is no better way to do so than hands-on experimentation with a model while seeing what others have to say about the same data set.

    Bayesian Methods for Machine Learning

      Bayesian machine learning methods rely on Bayesian statistics, so to talk about this course, I should first explain what that means. Bayesian statistics is the branch of statistics that deals with the probabilities of conditional events. Through this line of thinking, it is possible to quantify the effect of the known unknowns, such as the variables being modeled, which is called epistemic uncertainty. Likewise, it is possible to quantify the unknown unknowns, the things that cannot be modeled due to missing information or inherent randomness, called aleatoric uncertainty. In essence, Bayesian statistics measures how much influence one event has on another versus how much can be attributed to unknown circumstances.
      The objective of training a Bayesian model is to start with initial assumptions and adjust them to be the most likely given reality. This is made possible by combining prior information, what we already know about the model or initially assume to be accurate, with the likelihood of the outcome given the assumptions, that is, how well the model explains the data. Then, given the data, we can measure the probabilities of the initial hypotheses and update them to fit the data, or evidence, more accurately by maximizing the posterior probability, the probability of the parameters given the evidence. The parameters of a Bayesian model are not fixed values fit to the data but probability distributions. Hence, it is possible to calculate the probability that the model's parameters lie within the region that explains the data. This distinction is vital because it lets Bayesian models measure the confidence, or expected range of values, that the target variable could take given the relevant information.
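      Written out, Bayes' rule ties these pieces together; here the parameters are denoted by theta and the observed evidence by X:

        P(\theta \mid X) \;=\; \frac{P(X \mid \theta)\, P(\theta)}{P(X)} \;\propto\; \underbrace{P(X \mid \theta)}_{\text{likelihood}} \; \underbrace{P(\theta)}_{\text{prior}}

      Maximizing the posterior over the parameters gives a point estimate, while keeping the full distribution over the parameters is what allows the confidence statements described above.
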
      Because it is difficult to know the actual form of the data distribution in advance, it is instead assumed to be a known probability distribution introduced through the prior. The prior needs to be chosen so that it is conjugate to the likelihood for the posterior to remain within the same family of probability distributions; that is, the prior multiplied by the likelihood must have the same form of distribution as the prior itself. For example, if a normal prior is conjugate to the likelihood, the posterior is also a normal distribution.
      Latent variable models are models which attempt to model unseen information. They accomplish that by supposing that there is an underlying variable that causes the other variables. An example of this would be a model which can generate images, where the characteristics of the image being produced may be latent variables. Latent variables help quantify uncertainty in a process. Latent variables can be applied to many other areas of machine learning. Still, the core concept fits most naturally in the context of Bayesian machine learning due to its likeness to models within that area of study.
      A running theme of Bayesian machine learning is the application of probability distributions to data. One interesting approach is to cluster data by placing several normal distributions onto the data, called a mixture of Gaussians model, and iterating on each normal distribution to move it closer to the center of a cluster of points with a covariance matrix that best fits the group. The number of normal distributions that best describes the data can be found by fitting mixtures with an increasing number of distributions, plotting the total likelihood of the data, and looking for the point at which that likelihood plateaus. This is very similar to choosing the optimal number of clusters for the K-means algorithm by plotting what is referred to as an elbow graph. Another possibility for selecting the number of Gaussian distributions is to split the data into a training and validation set, fit the distributions to the training data, and choose the number of distributions that maximizes the likelihood of the validation data. This approach to clustering may be used as a classifier on its own or to generate features for another model, similarly to the K-means and K-nearest neighbors algorithms.
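      A short sketch of the validation-likelihood approach using scikit-learn's GaussianMixture, which is fit with expectation-maximization under the hood; the data here is synthetic:

        import numpy as np
        from sklearn.mixture import GaussianMixture
        from sklearn.model_selection import train_test_split

        # Hypothetical unlabeled data: two blobs in two dimensions.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
        X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

        # Fit mixtures with an increasing number of Gaussians and keep the one
        # with the highest held-out log-likelihood.
        best_n, best_score = None, -np.inf
        for n in range(1, 7):
            gm = GaussianMixture(n_components=n, random_state=0).fit(X_train)
            score = gm.score(X_val)  # average log-likelihood of the validation data
            if score > best_score:
                best_n, best_score = n, score

        print(best_n)  # expected to settle around 2 for this toy data
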
      Iteratively maximizing the likelihood of the distributions is called the expectation-maximization algorithm. It has two steps. First is the expectation step, where each data point is assigned probabilities of belonging to each distribution. Then, the maximization step moves each distribution to the mean of the points assigned to it, weighted by those probabilities, and gives it a new covariance matrix that best describes the assigned points. In other words, the second step updates the distributions so that the likelihood of the assignments is maximized.
      A statistical model can be optimized using the Kullback-Leibler divergence for data that does not follow a known distribution. It is a kind of distance measure between two distributions; however, it is not a true distance metric because it is asymmetric: the K-L divergence from one distribution to another is generally not equal to the K-L divergence in the reverse direction. Minimizing the K-L divergence between a model and the target data is accomplished by maximizing a variational lower bound on the likelihood of the evidence through the expectation-maximization algorithm. Put more simply, the assumed form of the prior is fit to a point estimate of the evidence distribution, the slope at the point where the lower bound touches that estimate is determined, and the distribution moves in the direction of the slope. This runs into the problem of local maxima if the evidence distribution is multimodal.
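      A quick numerical illustration of the asymmetry with scipy, where entropy(p, q) computes the K-L divergence from p to q for discrete distributions; the distributions are arbitrary examples:

        import numpy as np
        from scipy.stats import entropy

        # Two discrete distributions over the same three outcomes.
        p = np.array([0.5, 0.4, 0.1])
        q = np.array([0.3, 0.3, 0.4])

        print(entropy(p, q))  # KL(p || q)
        print(entropy(q, p))  # KL(q || p); generally a different number, hence asymmetric
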
      A more direct method for estimating the distribution of a data set is Markov Chain Monte Carlo (MCMC) approximation. MCMC is most easily explained through Gibbs sampling. The idea is that, given a data set, we can build up point estimates of the original distribution the data came from by repeated sampling. To generate samples through Gibbs sampling, you start at a random point and update one coordinate at a time, sampling it from its conditional distribution given the current values of the other coordinates, then move the point estimate along the axis of that coordinate. This process is repeated until the error in estimating the target distribution is arbitrarily low. It requires many steps, but the collection of samples can be made as close to the actual distribution as desired, given enough computing resources. A better way to sample points for a Markov chain is the Metropolis-Hastings technique, which can be run in parallel instead of in series and produces samples that are less correlated with one another.
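      A minimal Gibbs sampling sketch for a standard bivariate normal with correlation rho, where each coordinate's conditional distribution is known in closed form; this is a textbook toy case rather than anything from the course assignments:

        import numpy as np

        # Gibbs sampling for a standard bivariate normal with correlation rho.
        # Each step resamples one coordinate from its conditional given the other.
        rng = np.random.default_rng(0)
        rho = 0.8
        n_samples = 5000

        x, y = 0.0, 0.0  # arbitrary starting point
        samples = []
        for _ in range(n_samples):
            # Conditional of x given y is N(rho * y, 1 - rho^2); same form for y given x.
            x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
            y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
            samples.append((x, y))

        samples = np.array(samples)[500:]  # discard an initial burn-in period
        print(np.corrcoef(samples.T))      # empirical correlation should approach rho
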
      Variational autoencoders are another fascinating application for Bayesian machine learning. They are similar to standard autoencoders except that the neural network representation of the data is a latent distribution instead of a single point. They can be used as a universal connector to a more extensive system of neural networks and have the advantage that they provide information about the uncertainty of latent variables. They also perform incredibly well as image generation models, even recently outperforming Generative Adversarial Networks in some cases. In terms of natural language processing tasks, they may be able to generate a variety of interpretations in translation tasks. They also inherit the applications of standard autoencoders in terms of compression algorithms and self-supervised learning potential.
      Deep learning can also benefit in a variety of ways from methods in Bayesian machine learning. A few examples are variational dropout regularization, hyperparameter tuning by Bayesian optimization, and the introduction of latent variables into deep learning models, as in a Bayesian neural network where all the parameters of the network are distributions over possible weights. Variational dropout works similarly to a standard dropout layer in that it forces the model to predict with incomplete information; however, instead of zeroing activations at random, a variational dropout layer injects noise sampled from a learned distribution.
      Bayesian optimization is a very nice improvement over grid search. By quantifying the uncertainty about combinations of hyperparameters, it can explore those with the highest expected value as well as those whose expected value has considerable variance. This makes it possible to search over a much larger volume of possible hyperparameter combinations with greater computational efficiency. The search is usually carried out with a Gaussian process. A Gaussian process is a joint distribution over points where each point is a random variable modeled by a normal distribution correlated with the other points. The covariance matrix of the joint distribution is determined by a kernel function, which measures the similarity between the points represented by the Gaussian distributions.
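      A hedged sketch of the surrogate-model idea with scikit-learn's GaussianProcessRegressor, here tuning a single hypothetical learning rate; the kernel, acquisition rule, and numbers are illustrative choices:

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        # Hypothetical: a few learning rates already evaluated with their validation scores.
        lrs = np.array([[0.001], [0.01], [0.1]])
        scores = np.array([0.71, 0.78, 0.65])

        # The Gaussian process acts as a surrogate model of score vs. hyperparameter,
        # with an RBF kernel defining the covariance between points.
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
        gp.fit(np.log10(lrs), scores)

        # Candidate learning rates on a log scale.
        candidates = np.linspace(-4, 0, 100).reshape(-1, 1)
        mean, std = gp.predict(candidates, return_std=True)

        # A simple acquisition rule: favor high expected score plus high uncertainty.
        ucb = mean + 1.96 * std
        next_lr = 10 ** candidates[np.argmax(ucb)][0]
        print(next_lr)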

    Practical Reinforcement Learning

      Reinforcement learning is an area of machine learning with a great deal of potential. I particularly enjoy the theory behind reinforcement learning because it connects profoundly to game theory and economics; both are, at their core, the study of decision making, and reinforcement learning borrows many ideas from economics. It could completely revolutionize the world through robotics and automation and is, in my opinion, the most underutilized branch of machine learning so far. In reinforcement learning, algorithms are designed to optimize a reward function, which can be, for example, the total score in a video game, the profits of a stock trading algorithm, or the time it takes for a machine to complete a task. The benefit of reinforcement learning is that it can exceed human abilities because it is not reliant on labeled data in the way that supervised learning is; it only requires a reward function in place of a large labeled data set. It can also continually improve because it generates new training data as it interacts with its environment.
      The best way to think about a reinforcement learning algorithm is to frame it as a Markov decision process. A Markov decision process is a series of actions that an agent can take in an environment to obtain a reward given the current state. In reinforcement learning, an agent is an actor in the environment, which is the space that an agent acts within. The agent can take actions in the environment to influence its state and obtain rewards. The goal of the agent is to maximize the rewards that it receives.
      There are several approaches to model and train agents to optimize their sum of rewards. For straightforward use cases where every state of the environment can easily be modeled individually, a policy function is enough to represent the whole space, and it can be optimized through policy iteration or the cross-entropy method. The cross-entropy method can find optimal actions in problems with a relatively small and limited number of possible actions and states. A history of the states, actions, and rewards can train a model through dynamic programming, where a program refers to previously stored information in a data structure. Dynamic programming is crucial in reinforcement learning because the expected reward at a given state for a given action depends on future states and actions. The outcomes of the training process are prone to many of the same problems with locally optimal points as deep learning: due to feedback loops based on the reward function, a model may settle in an area where it can continually receive rewards from a repeated series of actions.
      When the state space becomes too large to enumerate, it is possible to use value iteration. The value function is updated according to the maximum expected value of each action in the explored states from the training history, and this process repeats until the value function changes by an arbitrarily small amount from one step to the next. Value iteration is serviceable for large state and action spaces because it does not require explicitly storing an action for every state and instead lets the model decide its action in a state based on the value function.
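      A small value iteration sketch over a made-up tabular MDP; the transition table and rewards are arbitrary, and the loop stops when the value function changes by a negligible amount:

        import numpy as np

        # Value iteration for a tiny tabular MDP (hypothetical dynamics).
        # P[(s, a)] is a list of (probability, next_state, reward) outcomes.
        n_states, n_actions, gamma = 3, 2, 0.9
        P = {
            (0, 0): [(1.0, 1, 0.0)], (0, 1): [(1.0, 2, 0.0)],
            (1, 0): [(1.0, 0, 1.0)], (1, 1): [(1.0, 2, 0.0)],
            (2, 0): [(1.0, 2, 0.0)], (2, 1): [(1.0, 0, 5.0)],
        }

        V = np.zeros(n_states)
        while True:
            Q = np.zeros((n_states, n_actions))
            for (s, a), outcomes in P.items():
                # Expected immediate reward plus discounted value of the next state.
                Q[s, a] = sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes)
            V_new = Q.max(axis=1)                 # back up the best action in each state
            if np.max(np.abs(V_new - V)) < 1e-6:  # stop when updates become negligible
                break
            V = V_new

        policy = Q.argmax(axis=1)  # greedy policy read off from the converged values
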
      Complete information about the probabilities of transitioning from state to state given an action is not always available; this is the case in most real-world applications because of inherent randomness, since many factors cannot be controlled, like the wind or a mechanical failure. Therefore, the mapping from states and actions to values needs an arbitrary function approximator, typically a deep learning model. This approach of approximating the value function of an unknown state space and predicting optimal actions through deep learning is called model-free reinforcement learning, with Q-learning as a central example. It is called model-free, as opposed to its model-based counterpart, because the agent's environment is unknown with respect to the state transitions and their probabilities and must be learned. This comes with some interesting challenges, though, as a deep learning model will tend to choose what it thinks is the best action at every point and avoid areas of uncertainty. To counteract that, you can train the model by occasionally forcing it to choose a random action so that it explores more of the state space.
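      A tabular sketch of the epsilon-greedy exploration and Q-learning update described above; the environment itself is left as a hypothetical function, and the table sizes are arbitrary:

        import numpy as np

        # Tabular Q-learning sketch: epsilon-greedy action choice and the update rule.
        n_states, n_actions = 10, 4
        Q = np.zeros((n_states, n_actions))
        alpha, gamma, epsilon = 0.1, 0.99, 0.1
        rng = np.random.default_rng(0)

        def choose_action(state):
            # With probability epsilon, explore by picking a random action;
            # otherwise exploit the current best estimate.
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[state]))

        def q_update(state, action, reward, next_state):
            # Move Q(s, a) toward the reward plus the discounted value of the best next action.
            target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])

        # Usage inside an environment loop (the environment itself is not shown here):
        # action = choose_action(state)
        # next_state, reward = env_step(state, action)   # hypothetical environment function
        # q_update(state, action, reward, next_state)
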
      There are a few drawbacks to this approach, though. The sum of rewards could be potentially infinite given enough time spent in the environment, so that problem is avoided by multiplying future rewards by a discount factor between zero and one for each state transition. Because of this, Q-learning may learn to take the shortest path to its objective even if it is a more dangerous path. To mitigate potentially hazardous actions, the update rule can be changed from taking the maximum reward over the actions in a state to taking the expected value of the actions in that state. The keyword for this approach is SARSA, for state, action, reward, state, action, as it accounts for the value of the step actually taken next in the series. Another approach to avoid this behavior is to initially train the model on data collected by an expert performing the task, although training only on data from a human expert will not allow the model to perform above the expert's level.
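      For reference, the two update targets can be written side by side in standard notation, with alpha the learning rate and gamma the discount factor. The SARSA update uses the action actually taken at the next step, and the Expected SARSA variant replaces that term with the expectation over the policy's actions, which matches the expected-value description above:

        \text{Q-learning:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]
        \text{SARSA:}\quad\;\;\; Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]
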
      The last thing I will say about the exploration-exploitation trade-off in reinforcement learning, the problem of exploring the state space while learning the optimal decision-making process, is that it can be combined with Bayesian machine learning approaches. It is possible to quantify the uncertainty of the regions within a space by using methods from Bayesian machine learning. In a setting where it is imperative to perform this task efficiently, such as an advertising engine that serves ads on the internet, uncertainty quantification will likely be of the utmost importance.

    Deep Learning in Computer Vision

      The course on computer vision in the advanced machine learning specialization was challenging in comparison to the other classes, in my opinion. I also thought it was not as well thought out as the other courses; nonetheless, the challenges in approaching the assignments were a great introduction to working with real-world data. I had to apply many of the concepts that I learned in the other courses, and it did a phenomenal job of testing one’s intuition for the applications of various techniques. It started slowly with a gradual buildup towards the main project of the course, a face recognition algorithm, or rather a pipeline of multiple deep learning models.
      This course explained the process of edge detection in early approaches to computer vision thoroughly. It then described the fine details of convolutions, especially what an individual filter does and how filters evolved from edge detection algorithms. It also showed how the receptive field of filters grows larger as convolutional layers are stacked: early layers in a convolutional neural network learn to detect edges, later layers respond to segments of an image, and eventually the network responds to the image as a whole. I specifically remember visualizations of the images that produced the highest activations for a given filter; they made it spectacularly clear what the network was learning at each layer, and it was particularly fascinating. I highly recommend that anyone curious about how these networks function look at that kind of demonstration.
      As for convolutional neural network architectures, a few key points seem to stand the test of time. Convolutional layers with different sizes of filters capture different information, so it is common to see multiple layers with varying filter sizes used together. It is also a long-standing trend that convolutional neural networks are increasingly scaled up in the number of layers, that is, the depth of the network.
      Greater computational efficiency can be achieved by decomposing larger filter sizes into multiple one-dimensional convolutions, allowing for more complex functions to be learned due to the increased number of nonlinearities between layers. Various scales of filters can also act as a single convolutional block by stacking their outputs and padding according to their difference in output volume.
      Further, residual connections are commonly used in deep convolutional neural networks by adding the output of earlier layers to layers further along in the model. Residual connections, also called skip-connections, increase the network's ability to learn by sharing information. They allow collections of layers to be treated as one block capable of extracting much more information than its parts. A similar idea is to connect each block of convolutional layers to all later blocks in the same way, which is known as a densely connected convolutional network.
      One part of convolutional neural networks which has come into question is the pooling layer. In essence, a pooling layer is a maximum or average of the activations within a region that decreases the volume of the inputs. The stride of a convolutional layer is the amount that a filter moves when it is used as a sliding window across the convolutional volume. Experimentally, it has been found that increasing the stride of convolutional layers can accomplish essentially the same result as using a pooling layer.
      Region-based convolutional neural networks, R-CNN, for object localization problems were also discussed in detail in the course and function by hierarchical segmentation of the inputs. In essence, the input is divided into a grid of subsections. A classifying CNN can identify regions containing each data class within the training set. They achieve a much greater computational efficiency by running the segments in parallel through the network to attain real-time object localization.
      For landmark detection, such as facial key-point tasks, a standard convolutional neural network architecture can be used with the final layers of the network being linear; the task is treated as a regression on the features of the convolutional layers. The convolutional volume is simply flattened to connect the output of a convolutional layer to the one-dimensional form that a linear layer accepts. Activation functions between layers in this setting are usually a ReLU or a variation of it. An appropriate loss function for this type of problem is the mean squared error between the predicted coordinates of the landmarks and the labeled data.
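      A rough sketch of such a landmark-regression network in PyTorch; the layer sizes, image size, and number of landmarks are illustrative assumptions rather than the course's exact model:

        import torch
        import torch.nn as nn

        # Convolutional features flattened into linear layers that output
        # (x, y) coordinates for each key point.
        n_landmarks = 5

        model = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),                      # connect the convolutional volume to linear layers
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, n_landmarks * 2),   # two coordinates per landmark, no final activation
        )

        images = torch.rand(8, 3, 64, 64)            # a stand-in batch of 64x64 RGB images
        targets = torch.rand(8, n_landmarks * 2)     # labeled (x, y) coordinates, scaled to [0, 1]
        loss = nn.MSELoss()(model(images), targets)  # mean squared error on the coordinates
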
      For image classification, the same type of architecture may be used as in the landmark regression problem but with a final activation of either a sigmoid or a softmax function, paired with binary or categorical cross-entropy loss, depending on whether the problem is binary or has a larger number of possible classes. Once the network for classification is trained, it can easily be converted into a fully convolutional neural network. This is accomplished by removing the flattening layer, replacing the linear layers with convolutional layers that have a filter size of one and the same number of parameters, and transferring the weights of the classifier into those convolutional layers; the result is a heat map of activations for each part of the image. Another unusual characteristic of fully convolutional neural networks is that they can accept inputs that vary in size and shape. When the original network is trained on video data instead of image data, it can be used for action recognition, object tracking, and pose estimation.
      The face recognition project was a particularly challenging part of the course. In it, I was tasked with a multi-class classification problem with thousands of individual people for the pipeline to identify. It was a multi-step process, starting with localization of the person through a fully convolutional neural network. My approach was to augment the original training data through techniques such as Gaussian blurring, random cropping, rotations, other slight distortions, and their combinations. I then used the much-improved model's activation heat map to crop the input image to the region of interest, finding the mean of the activations along both dimensions of the image to center the crop and their standard deviations to decide its size. The next step was to normalize the images through landmark detection. I again augmented the original training data before retraining the landmark detection model, and I rotated the cropped image so that the landmarks for both eyes were parallel to the x-axis. Finally, the assignment asked us to perform the last classification step with a K-nearest neighbors classifier, matching test images to training images using activations from inside the neural network as features.
      There was also a section about generative adversarial networks in this course. A GAN consists of a generator and a discriminator. The generator's objective is to create images realistic enough that the discriminator cannot tell the generated images apart from the training data, despite never seeing the training data itself. The discriminator's objective is to correctly classify which images are authentic and which are computer-generated. These models are notoriously difficult to train and have many design choices specific to them. For this course, the kind of GAN we were tasked with creating was a deep convolutional GAN, or DCGAN.

    Natural Language Processing

      Natural language processing deals with text data. For a computer to digest text data, it must be formatted in a particular way. First, the text is preprocessed into a sequence of tokens, a mapping from words or subwords into numbers, so that a natural language processing model sees sentences as a series of numbers. Each token is then given an embedding, a fixed-size vector that acts as a learned representation of the word; for example, similar terms are given embeddings that lie close to each other in the embedding space. These vectors are found by training a neural network to predict which word will appear in a sentence given its context, a process called word2vec, and the embedding is taken from the activations of that network. Usually, this is done with the help of n-grams, which are sequences of words that appear near each other, so the word being predicted is found through the tokens and n-grams most highly correlated with its context.
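      A tiny word2vec sketch with the gensim library on a toy corpus; the parameter names follow gensim 4.x, and the corpus and settings are purely illustrative:

        from gensim.models import Word2Vec

        # A toy corpus of already-tokenized sentences.
        sentences = [
            ["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "sat", "on", "the", "rug"],
            ["a", "cat", "and", "a", "dog", "played"],
        ]

        model = Word2Vec(
            sentences,
            vector_size=50,  # size of each word embedding
            window=3,        # context window used to relate a word to its neighbors
            min_count=1,     # keep even rare words for this toy corpus
            sg=1,            # skip-gram; sg=0 would use continuous bag-of-words
        )

        vector = model.wv["cat"]                     # the learned embedding for "cat"
        print(model.wv.most_similar("cat", topn=3))  # nearby words in the embedding space
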
      For some tasks, such as email filtering, these embeddings can be combined with simpler machine learning models by creating a bag-of-words. It is a simple yet effective summary of a text document, obtained by averaging the embeddings of the words in the email or other document. These averages can then be separated into their appropriate categories by finding a hyperplane between them.
      Text documents belonging to an overall category can also be classified from word embeddings, using the bag-of-words approach and term frequencies. For example, a tag on a social media website might be suggested by a machine learning model based on the contents of a post. Term frequency-inverse document frequency, TF-IDF, weights the number of times a term is used in a document against how common the term is across the whole collection of documents.
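      A minimal sketch of TF-IDF features feeding a simple classifier with scikit-learn; the posts and tags here are made up:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Hypothetical posts and tags; TF-IDF turns each document into a sparse vector
        # weighting terms that are frequent in the document but rare across the corpus.
        posts = [
            "how to merge two dataframes in pandas",
            "segmentation fault when dereferencing a pointer",
            "plotting a histogram of a pandas column",
            "undefined reference error when linking a c library",
        ]
        tags = ["python", "c", "python", "c"]

        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(posts)

        clf = LogisticRegression().fit(X, tags)
        print(clf.predict(vectorizer.transform(["pandas groupby and aggregate"])))
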
      Named entity recognition uses the embeddings by training a deep learning model to predict whether a given word in the input is a named entity based on the other words in the input and, if the training data is available, to classify the named entity into subcategories. LSTM models have historically been used for this application, but they have somewhat recently been overtaken by transformer-type models. The objective of a named entity recognition model is to classify each word according to whether it is a named entity and, if so, which category of named entity it is.
      There are a few main themes in neural network architectures for natural language processing. In many cases, models will be in the form of an encoder and decoder. The encoder part of the model creates a neural network representation of the inputs, and the decoder can take the encoder’s output to perform several tasks, like machine translation.
      Attention mechanisms are also an essential part of natural language processing tasks. An attention mechanism allows the model to figure out which inputs are most relevant to each other. This can take the form of a learned weighted average, a dot product between layers of a neural network, additive attention in a similar fashion, or matrix multiplication of latent parameters within the model. The seq2seq model can measure relevance between inputs with attention as a learned weighted average of the inputs, or it can keep track of a hidden state that acts as a memory of previous information for future passes through the network. It is made of an encoder and a decoder and can be used for applications where both the input and output sequences have variable length; the encoder and decoder can each be a recurrent model, such as an LSTM or GRU.
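      A small NumPy sketch of dot-product attention, the weighted-average idea described above; the shapes and data are arbitrary:

        import numpy as np

        def softmax(x, axis=-1):
            x = x - x.max(axis=axis, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=axis, keepdims=True)

        def dot_product_attention(queries, keys, values):
            # Relevance scores between every query and every key, scaled by sqrt(d).
            d = queries.shape[-1]
            scores = queries @ keys.T / np.sqrt(d)
            weights = softmax(scores, axis=-1)  # each query's weights over the inputs sum to 1
            return weights @ values             # weighted average of the values

        # Toy example: 3 encoder states of dimension 4 attended to by 2 decoder queries.
        rng = np.random.default_rng(0)
        encoder_states = rng.normal(size=(3, 4))
        decoder_queries = rng.normal(size=(2, 4))
        context = dot_product_attention(decoder_queries, encoder_states, encoder_states)
        print(context.shape)  # (2, 4)
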
      For the course's final project, I used a similarity measure between the embeddings of questions on StackOverflow and a topic classifier to find posts on StackOverflow that answered a given question. I then wrapped this in another model through a Python library to create a chatbot capable of locating answers to questions on stackoverflow.com.

    Addressing Large Hadron Collider Challenges

      The final course in the Advanced Machine Learning Specialization revolved around applying machine learning to data from the Large Hadron Collider. It tested the student's ability to apply machine learning and data science in an unfamiliar domain, and I had to use many of the techniques learned in previous courses to complete the assignments.
      The first task was pretty straightforward. I used an optimizer from Scikit-Learn to fit a distribution to a data set to find its mean value. The first week of content mainly explained some background information about particle physics so that it would be possible to choose appropriate methods to create models for future weeks. The second week was pretty easy as I used a gradient boosted machine and a neural network to classify particles.
      Subsequent assignments were more involved in feature engineering, choice of machine learning models, and evaluation metrics. I used an AdaBoost classifier and decision tree classifier with hyperparameters found through grid search to track particle decays. It also involved using statistical tests, such as the Kolmogorov-Smirnov distance, to measure the similarities between the distributions of the predictions and the actual data.
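      A short example of the two-sample Kolmogorov-Smirnov test with scipy; the arrays here are random stand-ins for the real data and model predictions:

        import numpy as np
        from scipy.stats import ks_2samp

        # Compare the distribution of model predictions against the real data
        # with the two-sample Kolmogorov-Smirnov test.
        rng = np.random.default_rng(0)
        real = rng.normal(0.0, 1.0, 1000)
        predicted = rng.normal(0.1, 1.1, 1000)

        result = ks_2samp(real, predicted)
        print(result.statistic)  # the KS distance: maximum gap between the two empirical CDFs
        print(result.pvalue)     # small values suggest the samples come from different distributions
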
      My favorite assignments were in the final two weeks of the course. I applied machine learning with feature engineering similar to the K-nearest neighbor features, but using the ball-tree algorithm, to isolate individual particle trajectories as the Large Hadron Collider separated them in a search for evidence of dark matter. I also used the distances between nearby points, polynomial features, and dot products between features to enhance the model's predictive power. For the model, I used XGBoost and achieved a significant improvement over other models through feature engineering.
      The final assignment for the course involved Bayesian optimization applied to a planned system to find the optimal arrangement of parts in a proposed design. I found it interesting that it was possible to use machine learning in a simulated environment to optimize infrastructure before it is created. The specific task was to find a combination of spacing between barriers for particles such that the particles are stopped at the highest rate possible.