Sql Syntax

Having syntax support

Generally, there are two cases in having. One is non-agg condition(See Case1 below) which will be regarded as filter operator and pushed down before group by partial agg.

Case1:

SELECT col_a, SUM(col_a) AS sum_a FROM test_table GROUP BY col_a HAVING col_a > 2;

The other one is agg condition(See Case2 below) which should be handled after group by final agg.

Case2:

SELECT col_a, SUM(col_a) AS sum_a FROM test_table GROUP BY col_a HAVING SUM(col_a) > 2;

For Case1 in Cider, we will get substrait plan in which having clause is transfered to filter operator already. When it comes to Case3 that contains multiple conditions, we will receive a substrait plan with multiple conditions. Then Cider will merge all those conditions and push them down before group by partial agg.

Case3:

SELECT col_a, SUM(col_a) AS sum_a FROM test_table WHERE col_a < 10 GROUP BY col_a HAVING col_a > 2;

For Case2 in Cider, we expect to get two plans. One is table scan and partial agg, the other is final agg, filter and project. So when it comes to Case4, two different conditions from where and having won’t appear in a same substrait plan and be merged into single EU.

Case4:

SELECT col_a, SUM(col_a) AS sum_a FROM test_table WHERE col_a < 10 GROUP BY col_a HAVING SUM(col_a) > 2;

In addition to those above, if we get an unexpected substrait plan like putting having agg condition together with partial agg plan, we will get wrong result batch without throwing exception.

In syntax support

The IN clause allows multi values definition in WHERE conditions. For example:

SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1,value2,...);

Under this scenario, user translates IN expression to a substrait ScalaFunction with List as its second arg. Then Cider translates it into Analyzer::InValues for further codegen and computation.

IN can also be used together with a subquery:

SELECT eno
FROM employee
WHERE dno IN
      (SELECT dno
      FROM dept
      WHERE floor = 3);

In this case, plan parser in frontend framework will parse it either “IN (value1, value2, …)” or a JoinNode when ‘eno’ col is known as a primary key or an unique index, like following:

SELECT eno
FROM employee join dept
WHERE employee.dno = dept.dno and dept.floor = 3;

Thus this IN clause is handled through join op in Cider.

SELECT DISTINCT

Mainstream databases such as Spark and Presto will transform ‘SELECT DISTINCT’ sql to ‘GROUP BY’ sql when do optimization on logical plan.

Spark:

/**
* Replaces logical [[Distinct]] operator with an [[Aggregate]] operator.
* {{{
*   SELECT DISTINCT f1, f2 FROM t  ==>  SELECT f1, f2 FROM t GROUP BY f1, f2
* }}}
*/
object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
                _.containsPattern(DISTINCT_LIKE), ruleId) {
                case Distinct(child) => Aggregate(child.output, child.output, child)
        }
}

Presto:

When execute sql select distinct nationkey from customer, part of the json generated by Presto is:

{
        "id":"2",
        "root":{
                "@type":".AggregationNode",
        "groupingSets":{
                "groupingKeys":[
                        {
                        "@type":"variable",
                        "sourceLocation":{
                                "line":1,
                                "column":17
                        },
                        "name":"nationkey",
                        "type":"bigint"
                        }
                ],
                "groupingSetCount":1,
                "globalGroupingSets":[
                ]
        }
}

In above cases, the original ‘SELECT DISTINCT’ sql is converted to an Aggregation type, and the columns shoule be distinct will become ‘GROUP BY’ keys.

ALL/ANY

In SQL, ‘ALL’ and ‘ANY’ are used to decorate compare operators(<, <=, =, !=, >, >=) between column values and a subquery result.

‘ALL’ will return TRUE if the value matches all corresponding values in the subquery, while ‘ANY’ returns TRUE if it matches any single one.

Example:

Given test.col_i8 is

 col_i8
--------
5
3
3
(3 rows)

then execute the following sql:

SELECT col_i8 < ALL(VALUES 4,5,6) from test;

will return:

 _col0
-------
false
true
true
(3 rows)

because 5 is not less than 4 while 3 is less than all the right values.

For above case, the logical plan generated by Presto is:

  1. use the aggregate function MIN to get the min value of right rows.
  2. use a cross join to generate the boolean results, whose left and right arguments are left rows and the min value in the first step.

The logical plan tree:

- Output[_col0] => [expr_3:boolean]                                                                                                                                  >
        _col0 := expr_3 (1:23)                                                                                                                                       >
    - RemoteStreamingExchange[GATHER] => [expr_3:boolean]                                                                                                            >
        - Project[projectLocality = LOCAL] => [expr_3:boolean]                                                                                                       >
                expr_3 := SWITCH(count_all, WHEN(BIGINT'0', BOOLEAN'true'), ((col_i8_0) < (min)) AND (SWITCH(BOOLEAN'true', WHEN((count_all) <> (count_non_null), nul>
            - CrossJoin => [col_i8_0:integer, min:integer, count_all:bigint, count_non_null:bigint]                                                                  >
                        Distribution: REPLICATED                                                                                                                     >
                - ScanProject[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=test, analyzePartitionValues=Optio>
                        col_i8_0 := CAST(col_i8 AS integer) (1:49)                                                                                                   >
                        LAYOUT: tpch.test{}                                                                                                                          >
                        col_i8 := col_i8:tinyint:0:REGULAR (1:48)                                                                                                    >
                - LocalExchange[SINGLE] () => [min:integer, count_all:bigint, count_non_null:bigint]                                                                 >
                    - RemoteStreamingExchange[REPLICATE] => [min:integer, count_all:bigint, count_non_null:bigint]                                                   >
                        - Aggregate => [min:integer, count_all:bigint, count_non_null:bigint]                                                                        >
                                min := "presto.default.min"((field)) (1:17)                                                                                          >
                                count_all := "presto.default.count"(*) (1:17)                                                                                        >
                                count_non_null := "presto.default.count"((field)) (1:17)                                                                             >
                            - Values => [field:integer]                                                                                                              >
                                    (INTEGER'4')                                                                                                                     >
                                    (INTEGER'5')                                                                                                                     >
                                    (INTEGER'6')

For <, <=, >, >=

The only two changes in plan are:

  1. the aggregate function to get the max or min value.
  2. the compare operator in the project step.
ALL/ANY operator function
ALL </<= MIN
ALL >/>= MAX
ANY </<= MAX
ANY >/>= MIN

For =, !=

For ‘=’ in ‘ALL’ cases, there will be two aggregate functions MIN and MAX, and the project expression will become (min) = (max)) AND ((expr) = (max)).

While for ‘!=’, there will be only a semi join between left rows(expr) and right rows to get boolean results, then a NOT operation will be implemented to get final results.

For ‘ANY’ cases, the plans for ‘=’ and ‘!=’ are exactly the same as those of ‘!=’ and ‘=’ in ‘ALL’ cases.