SQL study notes#

Basic data manipulation with SQL#

  • SQL clause order: FROM --> WHERE --> GROUP BY --> HAVING --> ORDER BY --> SELECT

    • SQL evaluates these clauses in the order: FROM, WHERE, GROUP BY, HAVING, and finally, SELECT. Therefore, each clause receives the filtered results of the previous filter. It would look like this: SELECT(HAVING(GROUP BY(WHERE(FROM...))))

  • Create new/modify existing variables

    • String functions

      • UPPER() & LOWER()

      • REPLACE(var, pattern, replacement)

      • CONCAT(var1, connection symbol, var2)

      • SUBSTR(var, start_index, length): Oracle

      • SUBSTRING(var, start_index, length): MySql

      • LEN()

      • LTRIM(), RTRIM(), & TRIM(): removing empty space in leading, trailing or both

      • TRIM(leading/trailing/both, char to be trimed FROM var): to remove specific char from a variable

      • LEFT(str, length), RIGHT(): selecte certain length of string from left/right

      • POSITION(substring IN string): return a numeric value, which is the index counted from left where the substring appears first in the string

      • Conditioning

        CASE WHEN ... THEN ... 
        ELSE ... 
        END AS new_var_name
        
    • Date functions

      • DATEDD: add one year to an existing date

      • TO_DATE: convertrs a string into date

      • DATEDIFF: find the difference b/t two given dates

      • DATEPART: get year, month, or date from the date variable

      • DAY: get day of the month for the given date

      • CURRENT_TIMESTAMP: get the date and time (time stamp)

  • Using aggregate functions (COUNT, AVG, SUM, MIN, MAX)

    • usually used with GROUP BY

    • NULL value is elimiated in all thse functions, except for COUNT(*) (COUNT(var) still excludes NULL records)

  • Apply conditions with WHERE

    • use BETWEEN ... AND... to select obs with value with a range of a variable (inclusive)

    • Using regular expression with LIKE

      • %: any string of 0 or more character

      • _(under score): any single character;

      • []: any single character with the specificed range e.g. [a-f] <=> [abcdef]

      • [^]: any single character not within the specified range

  • Apply conditions using aggregated varaible/value with HAVING: be clear about the differnce vs WHERE

  • Converting data types

    • CAST(column_name AS integer) or column_name::integer: convert to integer

  • Update SQL table

      UPDATE table_name
      SET column1 = value1, column2 = value2, ...
      WHERE condition;
    

Advanced SQL tips#

Joining tables#

c5831be5.png

  • Filtering with “join” operations:

    • using AND after ON clause: filtering happens before joining

    • using WHERE after ON caluse: filtering happens after joining

  • Another way to visualize SQL joins: You Should Use This to Visualize SQL Joins Instead of Venn Diagrams

    NtZh3N_2022_04_24

  • “Cross join”: performs cross product b/t 2 tables: it connects each row in the left table with each row in the right table

Stacking tables#

  • UNION operator: to stack one dataset on top of the other

    SELECT var1, var2, var3 as var3_new
    FROM table1
    UNION
    SELECT var1, var2, var4 as var3_new
    FROM table2
    
    • If there are same rows from two tables, only one unique row will be shown; or to use UNION ALL to keep duplicates; The opposite is UNION DISTINCT

    • Two tables must have same # of cols

    • Columns must have same data types in the same order

Create a view of the data table[1]:#

  • security of the data, giving access to only the variables included in the view table

    CREATE VIEW table_name AS
    SELECT var1, var2, var3
    FROM table;
    

Subquery#

  • innter query/nested query: to perform operation in several steps

    • subquery need to have an alias

    • EXAMPLE CODE NEEDED

Window functions#

  • performs a calculation across a set of table rows that are somehow related to the current row

    • Example

      Aggregate_fun(var) OVER
        (PARTITION BY var1 ORDER BY var2
         ORDER BY var3
         ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS new_Var
      
  • This doesn’t cause rows to be grouped into a single output row as done in the aggreated functions

  • Three (optional) components:

    • PARTITION BY: divides the rows of the table into different groups

    • ORDER BY: defines an ordering with each partition.

    • The final (window) clause: defining the window frame: defines the set of rows used in each calculation. e.g.

      • ROWS BETWEEN 1 PRECEDING AND CURRENT ROW - the previous row and the current row.

      • ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING - the 3 previous rows, the current row, and the following row.

      • ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING - all rows in the partition.

  • Three types of analytic functions[2]

    1. Analytic aggretate functions

    2. Analytic navigation functions

      • FIRST_VALUE()/LAST_VALUE(): returns the first/last value in the input

      • NTILE(): determine the percentiles of a give value in a variable

      • LAG() & LEAD(): LAG pulls record(s) from previous and LEAD pulls from following rows

    3. Analytic numbering functions

      • ROW_NUMBER(): show row # across the OVER & ORDER BY var

      • RANK(): similar to ROW_NUMBER() but will give same rank if value for ORDER BY variable is the same

      • DENSE_RANK(): differently from RANK(), it won’t skip rank number if there are more than one obs. records with same value share the same rank

  • Defining a window alias: a convenient way to use several window functions that use the same window:

    WINDOW window_name AS
      (PARTITION BY var1 ORDER BY var2)
    

    this should come after WHERE clause

  • Can’t include a window function in a GROUP BY clause

Nested data#

  • SQL data can also include a column with multiple fields in it; those fields are nested inside of this column and the nested column has type STRUCT (or type RECORD)

  • To query a column with nested data, we need to identify each field in the context of the column that contains it:

    • C.Name: referes to the Name field in the C column.

Repeated data#

  • When an ID contains multiple records in another table, and we want to put the multiple records into one column and we say this conlumn contains repeated data.

  • Each entry in a repeated field is an ARRAY, or an ordered list of (zero or more) values with the same datatype.

  • When querying repeated data, we need to put the name of the column containing the repeated data inside an UNNEST() function. UNNEST(Column) AS var_name; This essentially flattens the repeated data (which is then appended to the right side of the table) so that we have one element on each row.

Nested and repeated data#

  • A combination of nest and repeat in a column

  • When reading the values, we need UNNEST and use C.Name to read individual field

Pivoting data in SQL (TBA)#


Miscellaneous tips#

  • IS NULL: check if it’s missing value

  • JOIN is equivalent to INNTER JOIN

  • COALESCE(): select teh first non NULL value in a list

  • Use EXPLAIN at the beginning of the program, which will roughly show the complex of the program

  • Performance tuning SQL queries

    • Table size: try to use LIMIT XX to limit the lines to show

    • Joins: make joins less complicated

  • We refer to the structure of a table as its schema (something like str() in R).


Resrouces to learn and practice SQL#