
    Advanced Data Science Professional Certificate Review

    Review of topics from the certificate program by IBM on Coursera

    Overview

      The Advanced Data Science Professional Certificate program from IBM was a great introduction to the use of big data technologies. It focused on Apache Spark as the technology framework and allowed me to learn how to use Python to interface with Spark. In addition, it showed me how to use cloud infrastructures for data science and machine learning through IBM Watson Studio and gave me a better understanding of the intricacies of distributed computing.

    Fundamentals of Scalable Data Science

      The Fundamentals of Scalable Data Science course focused on Apache Spark, its file systems, terminology related to Spark, and interfacing with a database through Spark with Python. Apache Spark is a big data technology that allows many other technologies to work together across a network of computers at once. Spark can be seen as a universal connector of many different technology stacks, from programming languages to file storage technologies and even web technologies. Spark follows functional programming paradigms as opposed to Python’s object-oriented programming. This means that Spark functions must be written to be executed on an individual row within a database, and Spark does not attempt to store the values of the database during the call to the function. In this way, it is possible to run the same process across multiple machines in parallel, because the functions do not depend on data that is inaccessible from other devices in the computing cluster.
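      Below is a minimal sketch of this row-wise, functional style in PySpark. The SparkSession setup and the temperature values are made up for illustration; the point is that the function passed to map only ever sees one record at a time, so it can run on any worker in parallel.

        # Minimal sketch: a per-record function applied in parallel (PySpark assumed)
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[*]").appName("row-wise").getOrCreate()
        sc = spark.sparkContext

        # Hypothetical temperature readings in Celsius
        readings = sc.parallelize([20.5, 31.0, 18.2, 25.7])

        # The lambda only touches the single value it is given -- no shared state
        fahrenheit = readings.map(lambda c: c * 9.0 / 5.0 + 32.0)

        print(fahrenheit.collect())  # approximately [68.9, 87.8, 64.76, 78.26]
        spark.stop()
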
      There are two types of nodes in a computing cluster managed by Spark. A worker node is a computer used to run the programs given by the driver node. The driver node is responsible for communicating with all of its worker nodes; it acts like a network controller that distributes instructions to the workers and collects their results. A single computer may host both a driver node and a worker node.
      The data storage systems that Spark can interact with include SQL databases, NoSQL databases, and any file stored on a computer, such as a spreadsheet, through ObjectStorage. Spark distributes the files of a database across a network. All data used during computations is kept in the random access memory of the worker nodes when there is enough memory available, and Spark can fall back on disk space when needed. The datasets Spark uses to maximize the speed of computations are called Resilient Distributed Datasets, or RDDs. Spark programs can be written in multiple programming languages, but they are all converted to run inside a Java virtual machine to maximize computation speed and memory efficiency. This is a consequence of the application programming interface of RDDs in Spark.
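      As a small illustration of how worker memory and disk interact, the sketch below (PySpark assumed, with made-up numbers) asks Spark to keep an RDD in memory and spill partitions to disk only when memory runs short.

        # Minimal sketch: caching an RDD in worker memory with disk spill as a fallback
        from pyspark import StorageLevel
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[*]").appName("rdd-cache").getOrCreate()
        sc = spark.sparkContext

        numbers = sc.parallelize(range(1_000_000), numSlices=8)

        # MEMORY_AND_DISK: partitions stay in RAM when possible, otherwise on disk
        numbers.persist(StorageLevel.MEMORY_AND_DISK)

        print(numbers.sum())  # 499999500000
        spark.stop()
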
      Spark is commonly used with the Hadoop Distributed File System (HDFS) when each worker node in a computing cluster has its own storage media. Hadoop takes a file and breaks it into pieces that can fit on the worker nodes, and HDFS keeps track of which computers hold which sections of data so that the individual parts can act as a single file. It is often ideal to store data in the Apache Parquet file format because it compresses data while still allowing it to be read from disk quickly.
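      The sketch below shows what writing and reading Parquet looks like from PySpark; the file path and column names are placeholders I chose for illustration.

        # Minimal sketch: round-tripping a DataFrame through the Parquet format
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[*]").appName("parquet-io").getOrCreate()

        df = spark.createDataFrame(
            [("sensor_a", 20.5), ("sensor_b", 31.0)],
            ["sensor", "temperature"],
        )

        # Parquet stores the data column by column, compressed on disk
        df.write.mode("overwrite").parquet("sensor.parquet")

        restored = spark.read.parquet("sensor.parquet")
        restored.show()
        spark.stop()
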
      Alternatively, Spark can use existing SQL databases by creating a DataFrame object out of them. DataFrames run on the same application programming interface as an RDD, and their instructions are optimized internally by Spark. When interacting with a DataFrame object in Spark, it is also possible to use the SparkSQL programming interface. SparkSQL is the preferred way to interact with DataFrames because its queries are automatically optimized and converted into RDD functions.
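      A minimal sketch of the SparkSQL interface is shown below; the table name and data are hypothetical, but they illustrate the pattern of registering a DataFrame as a temporary view and querying it with SQL.

        # Minimal sketch: querying a DataFrame through SparkSQL
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[*]").appName("sparksql").getOrCreate()

        df = spark.createDataFrame(
            [("sensor_a", 20.5), ("sensor_a", 22.1), ("sensor_b", 31.0)],
            ["sensor", "temperature"],
        )
        df.createOrReplaceTempView("readings")

        # The SQL query is optimized by Spark and lowered to RDD operations
        spark.sql("""
            SELECT sensor, AVG(temperature) AS avg_temp
            FROM readings
            GROUP BY sensor
        """).show()
        spark.stop()
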
      The course also covered how to use the SparkSQL interface to generate summary statistics for large datasets, such as the mean, variance, skewness, and kurtosis. In addition, there was an exercise on plotting data from Spark in Python through the Matplotlib library and on principal component analysis in Spark.
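      For the summary statistics mentioned above, SparkSQL exposes aggregate functions directly; the sketch below (with made-up values) computes all four in one pass.

        # Minimal sketch: mean, variance, skewness, and kurtosis via SparkSQL functions
        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.master("local[*]").appName("stats").getOrCreate()

        df = spark.createDataFrame(
            [(x,) for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]], ["value"]
        )

        df.select(
            F.mean("value").alias("mean"),
            F.variance("value").alias("variance"),
            F.skewness("value").alias("skewness"),
            F.kurtosis("value").alias("kurtosis"),
        ).show()
        spark.stop()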

    Advanced Machine Learning and Signal Processing

      Machine learning done through Spark is similar to machine learning on a single machine. First, data needs to be loaded and preprocessed. Then, a model is fitted to the data and evaluated. In Spark, this process is done by creating a pipeline, which takes a list of functions and combines them into one program. Machine learning algorithms in Spark can be executed through the SystemML API. SystemML is a machine learning framework that provides its own implementations of algorithms and also supports Keras models. The focus of this course was to build pipelines in Spark that composed all of the major steps of training a machine learning model from smaller functions: extracting, transforming, and loading data, then training and evaluating models. The data for the programming assignments was in the form of SQL databases, which allowed me to learn the specific details of SparkSQL. I did all of the programming assignments in notebooks hosted on IBM Watson Studio, a cloud ecosystem developed by IBM similar to other cloud infrastructures.
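      The sketch below shows the general pipeline pattern using pyspark.ml rather than the SystemML API the course emphasized; the column names and toy data are my own and only illustrate how the stages chain together.

        # Minimal sketch: a Spark ML pipeline of extract/transform/train stages
        from pyspark.sql import SparkSession
        from pyspark.ml import Pipeline
        from pyspark.ml.feature import VectorAssembler, StandardScaler
        from pyspark.ml.classification import LogisticRegression

        spark = SparkSession.builder.master("local[*]").appName("pipeline").getOrCreate()

        train = spark.createDataFrame(
            [(1.0, 2.0, 0.0), (2.0, 1.5, 0.0), (8.0, 9.0, 1.0), (9.0, 8.5, 1.0)],
            ["x1", "x2", "label"],
        )

        # Each stage is a small function; the Pipeline chains them into one program
        assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="raw_features")
        scaler = StandardScaler(inputCol="raw_features", outputCol="features")
        lr = LogisticRegression(featuresCol="features", labelCol="label")

        model = Pipeline(stages=[assembler, scaler, lr]).fit(train)
        model.transform(train).select("label", "prediction").show()
        spark.stop()
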
      Models covered by the course content included linear and logistic regression, the naive Bayes classifier, support vector machines, ensemble models such as tree-based and boosting models, and k-means clustering. Naive Bayes is a generative classifier: it attempts to learn the distribution over features instead of finding the optimal boundary between classes, and its predictions come from the joint probability of the individual features given the class. The naive Bayes classifier is naive because it assumes that the features are independent of one another. That is not always the case in practice, but the classifier can still perform reasonably well under those circumstances. Support vector machines are discriminative models that find the hyperplane separating the classes by maximizing the margin between them.
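      As a brief illustration of these two classifiers, the sketch below fits both with pyspark.ml on made-up data; the features are kept non-negative because Spark’s multinomial naive Bayes requires non-negative feature values.

        # Minimal sketch: naive Bayes (generative) and a linear SVM (discriminative)
        from pyspark.sql import SparkSession
        from pyspark.ml.feature import VectorAssembler
        from pyspark.ml.classification import NaiveBayes, LinearSVC

        spark = SparkSession.builder.master("local[*]").appName("classifiers").getOrCreate()

        data = spark.createDataFrame(
            [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (4.0, 5.0, 1.0), (5.0, 4.0, 1.0)],
            ["x1", "x2", "label"],
        )
        features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(data)

        nb_model = NaiveBayes(featuresCol="features", labelCol="label").fit(features)
        svm_model = LinearSVC(featuresCol="features", labelCol="label").fit(features)

        nb_model.transform(features).select("label", "prediction").show()
        svm_model.transform(features).select("label", "prediction").show()
        spark.stop()
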
      The course also covered topics related to signal processing, such as the Fourier and wavelet transforms. The Fourier transform is a function that takes an input in the time domain and outputs that same function’s equivalent in the frequency domain. The idea is to decompose the original signal into a sum of sine and cosine functions by measuring their magnitudes and the lengths of their cycles. A signal in the time domain appears as a function with cycles, while a signal in the frequency domain appears as a discrete set of components, each with a magnitude indicating how strongly that frequency contributes to the signal; the length of a cycle determines where its component sits along the frequency axis. The wavelet transform is similar to the Fourier transform except that it can find frequencies with a starting and stopping point in the signal. It separates the frequencies by looking at one section of the signal at a time and repeats this process for multiple sizes of the region it examines. The Fourier and wavelet transforms can generate features for machine learning models by taking the resulting frequencies as inputs.
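      The sketch below illustrates the idea of the Fourier transform with NumPy, which was not part of the course material but keeps the example short: a 5 Hz sine wave in the time domain shows up as a single dominant component at 5 Hz in the frequency domain.

        # Minimal sketch: a pure sine wave and its peak in the frequency domain
        import numpy as np

        sample_rate = 100                      # samples per second
        t = np.arange(0, 1, 1 / sample_rate)   # one second of samples
        signal = np.sin(2 * np.pi * 5 * t)     # 5 Hz sine wave

        spectrum = np.abs(np.fft.rfft(signal))
        freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

        print(freqs[np.argmax(spectrum)])      # 5.0 -- the dominant frequency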

    Applied AI with Deep Learning

      The third course in the Advanced Data Science specialization by IBM focused on neural networks and their implementation in a distributed environment. It was a quick review of the deep learning models that I had learned in great detail while taking the Advanced Machine Learning specialization. At the same time, I gained experience applying the same machine learning frameworks on the cloud that I would use on a single computer. In addition, the course showed how to use the cloud to deploy machine learning programs as a REST API.
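      As a rough sketch of the REST idea, the example below uses Flask rather than the IBM Watson deployment flow from the course; the endpoint name and the placeholder predict function are hypothetical.

        # Hypothetical sketch: serving a model's predictions over a REST endpoint
        from flask import Flask, request, jsonify

        app = Flask(__name__)

        def predict(features):
            # Placeholder for a trained model's prediction logic
            return sum(features) > 10.0

        @app.route("/predict", methods=["POST"])
        def predict_endpoint():
            features = request.get_json()["features"]
            return jsonify({"prediction": int(predict(features))})

        if __name__ == "__main__":
            app.run(port=5000)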