    Learn SQL Basics for Data Science by UCD

    A review of topics from the professional certificate program by UCD on Coursera

    Overview

      The Learn SQL Basics for Data Science professional certificate by the University of California, Davis was a great introduction to SQL. It covers specific terminology and standard SQL conventions, and it teaches SQL by working through real-world data. I was impressed by the course content because it is thorough and efficient in its delivery. In addition, the program teaches how to approach data analysis in a natural and generalizable way.

    SQL for Data Science

      Structured Query Language (SQL) is a computer language that allows a user or program to interface with a database. It can retrieve information from a database, as in a machine learning pipeline, and it can also insert new data or modify existing data. Retrieving data is straightforward: the keyword SELECT, followed by the columns you are requesting, and then FROM with the table that holds them, is enough to pull data out of a database. A temporary table can be made using CREATE TEMPORARY TABLE followed by the name of the new table and a query on an existing table. Temporary tables are a fast and easy way to simplify more complex queries.
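      As a minimal sketch, assuming a hypothetical products table with name and price columns (exact temporary-table syntax varies slightly between databases):

        SELECT name, price
        FROM products;

        CREATE TEMPORARY TABLE product_snapshot AS
        SELECT name, price
        FROM products;
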
      Queries can be filtered by the values in a table. If a column should only contain a certain set of values, that can be specified with the WHERE keyword. In queries that use aggregate functions, which compute a result over a group of rows according to their values in a specified column, the HAVING keyword is used instead. Aggregate functions such as COUNT, SUM, MIN, and MAX appear in the SELECT clause, followed by GROUP BY and a column name or set of columns, with a HAVING clause if required. It is also possible to sort results by the values in one or more columns using the ORDER BY clause at the end of a query, followed by the column name or names and then ASC or DESC.
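      A sketch of these clauses together, again using the hypothetical products table; WHERE filters rows before grouping, while HAVING filters the groups produced by GROUP BY:

        SELECT category,
               COUNT(*)   AS n_products,
               AVG(price) AS avg_price
        FROM products
        WHERE price > 0
        GROUP BY category
        HAVING COUNT(*) >= 5
        ORDER BY avg_price DESC;
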
      Standard math operations can be applied to columns in a SQL query using their conventional symbols, with the AS keyword creating an alias for the output column in the SELECT clause. Conditional logic, such as string matching or checking whether a value falls in a range, can be expressed with the CASE WHEN keywords.
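      For instance, arithmetic, aliasing, and CASE WHEN might combine as follows (the column names and price cutoffs are made up for illustration):

        SELECT name,
               price * 0.9 AS discounted_price,
               CASE WHEN price < 10 THEN 'budget'
                    WHEN price < 50 THEN 'midrange'
                    ELSE 'premium'
               END AS price_tier
        FROM products;
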
      Each row of a database table should have a unique identifier, referred to as the table’s primary key. A database may have more than one table, and a table may have a column that references another table’s primary key; this kind of column is called a foreign key. A column can even act as both a primary key in its own table and a foreign key referencing another table. These relations between tables are usually shown in an entity-relationship diagram, a graph of the structure of a database that follows a set of conventions.
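      A two-table schema along these lines might be declared as follows; the customers and orders tables are hypothetical, with customer_id in orders acting as a foreign key:

        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT
        );

        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER,
            FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        );
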
      When gathering information from multiple tables in one query, it is necessary to use the foreign key to match up rows. This is possible through subqueries and joins. A subquery is a query nested inside another, either as the source of a FROM clause or inside a WHERE ... IN condition; heavily nested subqueries can lead to poor performance, however. A join uses the primary key from one table and a foreign key from another to retrieve information from both. A CROSS JOIN returns every combination of rows from the two tables. An INNER JOIN returns only the rows whose values match in both of the selected columns. LEFT, RIGHT, and FULL OUTER JOINs return all rows from the first table, all rows from the second table, or all rows from both tables, respectively, filling in NULLs where there is no match. Self-joins are used when two columns in one table share values, such as an employee ID and that employee’s supervisor’s ID, and can be combined with a subquery.
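      Using the hypothetical customers and orders tables above, a join and an equivalent subquery-based filter might look like this:

        -- Inner join: only customers that placed at least one order
        SELECT c.name, o.order_id
        FROM customers c
        INNER JOIN orders o
            ON c.customer_id = o.customer_id;

        -- A subquery that keeps only customers appearing in orders
        SELECT name
        FROM customers
        WHERE customer_id IN (SELECT customer_id FROM orders);
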

    Data Wrangling, Analysis and AB Testing with SQL

      Databases may not always contain a complete set of records for each column in each table. It is best practice to assume this is the case when first examining a database, checking both for missing values and for data that is unreliable because it is outdated or inaccurate. If missing values are a problem during a query, the COALESCE keyword can give the columns being joined a stand-in value for NULLs; the stand-in should be chosen carefully so that every NULL row in one table does not end up matching every NULL row in the other.
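      COALESCE returns the first non-NULL value among its arguments. A minimal sketch with hypothetical columns, falling back to the order date when the shipped date is missing:

        SELECT order_id,
               COALESCE(shipped_date, ordered_date) AS effective_date
        FROM orders;
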
      The data types of columns are critical to examine to get a better sense of a table’s contents. When creating a table or temporary table, using the appropriate data type for each column is important because data types restrict which operations can be used. Since every row in a column must use the same data type, it is also essential to choose the right size of data type: it needs to fit all possible values for the data while not taking excess space. Creating tables is helpful when the same data will need to be retrieved multiple times, used in other queries, or standardized.
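      An illustrative table definition choosing sized types; the specific type names and size limits vary by database:

        CREATE TABLE readings (
            sensor_id   SMALLINT,     -- small integer range suffices for a few thousand sensors
            value       REAL,         -- floating-point measurement
            recorded_at TIMESTAMP,    -- date and time of the reading
            status      VARCHAR(16)   -- short status string with a bounded length
        );
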
      Date and time information stored in SQL tables has its own set of functions, which can be used to order data or to build values such as rolling averages over other columns. In SQL, a function computed over a set of rows related to the current row is called a window function, and a summary table built this way is often called a rollup table. Where a plain aggregation would use GROUP BY, window functions use the RANK, OVER, and PARTITION BY clauses. They are useful for aggregate reporting, such as quarterly summaries or tracking the change in a metric from one month to the next.
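      A sketch of these clauses over a hypothetical monthly_sales table, computing a three-month rolling average and a within-region rank:

        SELECT region,
               month,
               revenue,
               AVG(revenue) OVER (
                   PARTITION BY region
                   ORDER BY month
                   ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
               ) AS rolling_3_month_avg,
               RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
        FROM monthly_sales;
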
      This course also went over the process of AB testing. AB testing is the process of designing an experiment to decide which of two choices is better by assigning a control group and a test group and measuring a specific metric for both. Experimental design choices are needed to create a fair testing environment so that the test results do not suffer from biases. Then, a statistical test is used to decide whether the test group differed from the control group in the metric by a statistically significant amount, based on a confidence interval.
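      The per-group metrics that feed such a test can be computed in SQL. In this hypothetical experiment_results table, converted is a 0/1 flag, and the statistical comparison itself happens afterward:

        SELECT test_group,
               COUNT(*)       AS n_users,
               AVG(converted) AS conversion_rate
        FROM experiment_results
        GROUP BY test_group;
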

    Distributed Computing with Spark SQL

      The final course in this professional certificate focused on big data technologies with SQL, namely working with Spark DataFrames through Spark SQL. Spark DataFrames are essentially the big data equivalent of a table in a SQL database. A DataFrame can be stored across several machines in a network and loaded into the random access memory of those machines to maximize throughput. Spark SQL queries it with syntax much like that of a relational database management system, or RDBMS.
      A Spark DataFrame is distributed across a network but acts as one file. It is partitioned into smaller chunks that an individual machine can hold. Spark is lazy about loading data into memory and keeps data there only if it is cached. Caching is done simply by calling CACHE TABLE followed by the table’s name and can speed up queries by a factor of one hundred or more. A table can be removed from memory with UNCACHE TABLE, or all cached tables can be cleared with CLEAR CACHE. Another option is the CACHE LAZY TABLE command, which caches the data only when it is first needed.
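      The caching commands look like the following, with sales and customers as hypothetical table names:

        CACHE TABLE sales;            -- load the table into cluster memory now
        CACHE LAZY TABLE customers;   -- cache only when the table is first queried
        UNCACHE TABLE sales;          -- release a single table from memory
        CLEAR CACHE;                  -- release every cached table
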
      Because data is stored locally on individual machines, shuffling data between them can take a long time. In a narrow operation, data does not need to be transferred between devices in the network. Wide operations, such as the grouping and joining common in machine learning workloads, require a shuffle: the data is repartitioned and written and read across machines rather than processed in place. Broadcast joins can minimize the number of wide operations on the data. A broadcast join copies the smaller table in a join to every machine in the network so that the larger table does not need to be shuffled.
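      In Spark SQL, a broadcast join can be requested with a query hint; here regions is assumed to be the small table:

        SELECT /*+ BROADCAST(r) */
               s.sale_id,
               r.region_name
        FROM sales s
        JOIN regions r
            ON s.region_id = r.region_id;
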
      Data in Spark can live in a managed or an unmanaged table. A managed table is one whose data and metadata Spark itself controls; because the cluster does not persist its local storage, a managed table is erased once the network is shut down. An unmanaged, or external, table only registers metadata in Spark, and the data itself comes from another source, such as a SQL database or a file path.
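      A sketch of both kinds of table in Spark SQL; the table names and CSV path are hypothetical:

        -- Managed table: Spark controls both the data and the metadata
        CREATE TABLE managed_sales (sale_id INT, amount DOUBLE);

        -- Unmanaged (external) table: Spark tracks only metadata,
        -- and the data stays at the external path
        CREATE TABLE external_sales (sale_id INT, amount DOUBLE)
        USING CSV
        OPTIONS (path '/mnt/data/sales.csv', header 'true');
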
      Machine learning models or other functions written in another framework can be used in Spark through a User Defined Function, or UDF. Once registered, a UDF can be used in any call to a Spark function or Spark SQL query. This is useful for obtaining predictions from a machine learning algorithm over an extensive dataset through a distributed workload, allowing machine learning at scale.
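      Calling a UDF from Spark SQL looks like calling any built-in function. Here predict_score is a hypothetical UDF assumed to be registered beforehand from the host language:

        -- predict_score is a hypothetical UDF, assumed registered
        -- from the host language, e.g. spark.udf.register in Python
        SELECT customer_id,
               predict_score(age, income, tenure) AS churn_score
        FROM customers;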