Statistic Descriptive in SQL

Setup

  1. Start by installing MySQL from here: https://dev.mysql.com/downloads/

  2. Follow the instructions from here to setup MySQL on your local machine.

  3. Install MySQL Workbench from here: https://dev.mysql.com/downloads/workbench/

  4. Download dataset used in this lecture from here: https://github.com/ardhiraka/PFDS_sources/blob/master/aapl_historical_stock_price.csv

The Basics

SQL COUNT

Counting all rows COUNT is a SQL aggregate function for counting the number of rows in a particular column. COUNT is the easiest aggregate function to begin with because verifying your results is extremely simple. Let’s begin by using * to select all rows from the Apple stock prices dataset:

SELECT COUNT(*)
  FROM hacktiv8.aapl_historical_stock_price

Counting individual columns Things start to get a little bit tricky when you want to count individual columns. The following code will provide a count of all of rows in which the high column is not null.

SELECT COUNT(high) AS count_high
  FROM hacktiv8.aapl_historical_stock_price

SQL SUM

SUM is a SQL aggregate function. that totals the values in a given column. Unlike COUNT, you can only use SUM on columns containing numerical values.

The query below selects the sum of the volume column from the Apple stock prices dataset:

SELECT SUM(volume)
  FROM hacktiv8.aapl_historical_stock_price

SQL MIN/MAX

MIN and MAX are SQL aggregation functions that return the lowest and highest values in a particular column.

They’re similar to COUNT in that they can be used on non-numerical columns. Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as close alphabetically to “A” as possible. As you might suspect, MAX does the opposite—it returns the highest number, the latest date, or the non-numerical value closest alphabetically to “Z.”

For example, the following query selects the MIN and the MAX from the numerical volume column in the Apple stock prices dataset.

SELECT MIN(volume) AS min_volume,
       MAX(volume) AS max_volume
  FROM hacktiv8.aapl_historical_stock_price

SQL AVG

AVG is a SQL aggregate function that calculates the average of a selected group of values. It’s very useful, but has some limitations. First, it can only be used on numerical columns. Second, it ignores nulls completely. You can see this by comparing these two queries of the Apple stock prices dataset:

SELECT AVG(high)
  FROM hacktiv8.aapl_historical_stock_price
 WHERE high IS NOT NULL

The above query produces the same result as the following query:

SELECT AVG(high)
  FROM hacktiv8.aapl_historical_stock_price

All queries can be combined like this:

SELECT
    count(*) AS Volume_Count,
    avg(volume) AS Volume_Average,
    stddev(volume) AS Volume_Stddev,
    min(volume) AS Volume_Min,
    max(volume) AS Volume_Max
FROM
    hacktiv8.aapl_historical_stock_price

SQL GROUP BY

SQL aggregate function like COUNT, AVG, and SUM have something in common: they all aggregate across the entire table. But what if you want to aggregate only part of a table? For example, you might want to count the number of entries for each year.

In situations like this, you’d need to use the GROUP BY clause. GROUP BY allows you to separate data into groups, which can be aggregated independently of one another. Here’s an example using the Apple stock prices dataset:

SELECT year,
       COUNT(*) AS count
  FROM hacktiv8.aapl_historical_stock_price
 GROUP BY year

You can group by multiple columns, but you have to separate column names with commas—just as with ORDER BY):

SELECT year,
       month,
       COUNT(*) AS count
  FROM hacktiv8.aapl_historical_stock_price
 GROUP BY year, month

The order of column names in your GROUP BY clause doesn’t matter—the results will be the same regardless. If you want to control how the aggregations are grouped together, use ORDER BY. Try running the query below, then reverse the column names in the ORDER BY statement and see how it looks:

SELECT year,
       month,
       COUNT(*) AS count
  FROM hacktiv8.aapl_historical_stock_price
 GROUP BY year, month
 ORDER BY month, year

SQL HAVING

However, you’ll often encounter datasets where GROUP BY isn’t enough to get what you’re looking for. Let’s say that it’s not enough just to know aggregated stats by month. After all, there are a lot of months in this dataset. Instead, you might want to find every month during which AAPL stock worked its way over $400/share. The WHERE clause won’t work for this because it doesn’t allow you to filter on aggregate columns—that’s where the HAVING clause comes in:

SELECT year,
       month,
       MAX(high) AS month_high
  FROM hacktiv8.aapl_historical_stock_price
 GROUP BY year, month
HAVING MAX(high) > 400
 ORDER BY year, month

SQL CASE

The CASE statement is SQL’s way of handling if/then logic. The CASE statement is followed by at least one pair of WHEN and THEN statements—SQL’s equivalent of IF/THEN in Excel. Because of this pairing, you might be tempted to call this SQL CASE WHEN, but CASE is the accepted term.

Every CASE statement must end with the END statement. The ELSE statement is optional, and provides a way to capture values not specified in the WHEN/THEN statements. CASE is easiest to understand in the context of an example:

SELECT date,
       volume,
       CASE WHEN close > 500 THEN 'yes'
            ELSE 'no' END AS big_number
  FROM hacktiv8.aapl_historical_stock_price

You can also define a number of outcomes in a CASE statement by including as many WHEN/THEN statements as you’d like:

SELECT date,
       volume,
       CASE WHEN close > 500 THEN 'yes'
            WHEN close > 300 THEN 'maybe'
            ELSE 'no' END AS big_number
  FROM hacktiv8.aapl_historical_stock_price

SQL DISTINCT

ou’ll occasionally want to look at only the unique values in a particular column. You can do this using SELECT DISTINCT syntax. To select unique values from the month column in the Apple stock prices dataset, you’d use the following query:

SELECT DISTINCT month
  FROM hacktiv8.aapl_historical_stock_price

If you include two (or more) columns in a SELECT DISTINCT clause, your results will contain all of the unique pairs of those two columns:

SELECT DISTINCT year, month
  FROM hacktiv8.aapl_historical_stock_price

You can use DISTINCT when performing an aggregation. You’ll probably use it most commonly with the COUNT function.

In this case, you should run the query below that counts the unique values in the month column.

SELECT COUNT(DISTINCT month) AS unique_months
  FROM hacktiv8.aapl_historical_stock_price

Subqueries in SQL

Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you’d need to do each aggregation in a distinct step.

Subqueries can be used in several places within a query, but it’s easiest to start with the FROM statement. Here’s an example of a basic subquery:

SELECT *
FROM (
	SELECT *
	FROM hacktiv8.aapl_historical_stock_price
    WHERE year = 2014
    ) aapl_historical_stock_price
WHERE close > 400

Let’s break down what happens when you run the above query:

First, the database runs the “inner query”—the part between the parentheses:

SELECT *
	FROM hacktiv8.aapl_historical_stock_price
    WHERE year = 2014

If you were to run this on its own, it would produce a result set like any other query. It might sound like a no-brainer, but it’s important: your inner query must actually run on its own, as the database will treat it as an independent query. Once the inner query runs, the outer query will run using the results from the inner query as its underlying table:

SELECT *
FROM (
	<<results from inner query go here>>
    ) aapl_historical_stock_price
WHERE close > 400

Subqueries are required to have names, which are added after parentheses the same way you would add an alias to a normal table. In this case, we’ve used the name “aapl_historical_stock_price.”

SQL Window Functions

For windows function, please download data from here https://github.com/ardhiraka/PFDS_sources/blob/master/dc_bikeshare_q1_2012.csv

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

The most practical example of this is a running total:

SELECT duration_seconds,
       SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
  FROM hacktiv8.dc_bikeshare_q1_2012

When using window functions, you can apply the same aggregates that you would under normal circumstances—SUM, COUNT, and AVG. The easiest way to understand these is to re-run the previous example with some additional functions. Make

SELECT start_terminal,
       duration_seconds,
       SUM(duration_seconds) OVER
         (PARTITION BY start_terminal) AS running_total,
       COUNT(duration_seconds) OVER
         (PARTITION BY start_terminal) AS running_count,
       AVG(duration_seconds) OVER
         (PARTITION BY start_terminal) AS running_avg
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'

ROW_NUMBER() does just what it sounds like—displays the number of a given row. It starts are 1 and numbers the rows according to the ORDER BY part of the window statement. ROW_NUMBER() does not require you to specify a variable within the parentheses:

SELECT start_terminal,
       start_time,
       duration_seconds,
       ROW_NUMBER() OVER (ORDER BY start_time)
                    AS rn
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'

Using the PARTITION BY clause will allow you to begin counting 1 again in each partition. The following query starts the count over again for each terminal:

SELECT start_terminal,
       start_time,
       duration_seconds,
       ROW_NUMBER() OVER (PARTITION BY start_terminal
                          ORDER BY start_time)
                    AS rn
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'

RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers. In the following query, you notice the 4th and 5th observations for start_terminal 31000—they are both given a rank of 4, and the following result receives a rank of 6:

SELECT start_terminal,
       duration_seconds,
       RANK() OVER (PARTITION BY start_terminal
                    ORDER BY start_time)
              AS ranking
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'

You can also use DENSE_RANK() instead of RANK() depending on your application. Imagine a situation in which three entries have the same value. Using either command, they will all get the same rank. For the sake of this example, let’s say it’s “2.” Here’s how the two commands would evaluate the next results differently:

  • RANK() would give the identical rows a rank of 2, then skip ranks 3 and 4, so the next result would be 5

  • DENSE_RANK() would still give all the identical rows a rank of 2, but the following row would be 3—no ranks would be skipped.

You can use window functions to identify what percentile (or quartile, or any other subdivision) a given row falls into. The syntax is NTILE(# of buckets). In this case, ORDER BY determines which column to use to determine the quartiles (or whatever number of ‘tiles you specify). For example:

SELECT start_terminal,
       duration_seconds,
       NTILE(4) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
          AS quartile,
       NTILE(5) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS quintile,
       NTILE(100) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS percentile
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'
 ORDER BY start_terminal, duration_seconds

It can often be useful to compare rows to preceding or following rows, especially if you’ve got the data in an order that makes sense. You can use LAG or LEAD to create columns that pull values from other rows—all you need to do is enter which column to pull from and how many rows away you’d like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows:

SELECT start_terminal,
       duration_seconds,
       LAG(duration_seconds, 1) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
       LEAD(duration_seconds, 1) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'
 ORDER BY start_terminal, duration_seconds

This is especially useful if you want to calculate differences between rows:

SELECT start_terminal,
       duration_seconds,
       duration_seconds -LAG(duration_seconds, 1) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS difference
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'
 ORDER BY start_terminal, duration_seconds

The first row of the difference column is null because there is no previous row from which to pull. Similarly, using LEAD will create nulls at the end of the dataset. If you’d like to make the results a bit cleaner, you can wrap it in an outer query to remove nulls:

SELECT *
  FROM (
    SELECT start_terminal,
           duration_seconds,
           duration_seconds -LAG(duration_seconds, 1) OVER
             (PARTITION BY start_terminal ORDER BY duration_seconds)
             AS difference
      FROM hacktiv8.dc_bikeshare_q1_2012
     WHERE start_time < '2012-01-08'
     ORDER BY start_terminal, duration_seconds
       ) sub
 WHERE sub.difference IS NOT NULL

If you’re planning to write several window functions in to the same query, using the same window, you can create an alias. Take the NTILE example above:

SELECT start_terminal,
       duration_seconds,
       NTILE(4) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS quartile,
       NTILE(5) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS quintile,
       NTILE(100) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS percentile
  FROM hacktiv8.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'
 ORDER BY start_terminal, duration_seconds