所有操作都為 ANSI SQL compliant 的語法。ANSI SQL 相關介紹可以參考 W3School SQL Tutorial。
為了企業數據安全，我們在 CannerFlow 虛擬數據中台只能讓使用者 READ-only 也就是目前僅支援 SELECT statement，不支援 Update, Delete 等操作。
以下為 CannerFlow 所支援的 SELECT statement 相關用法。
from_item is one of
join_type is one of
grouping_element is one of
Retrieve rows from zero or more tables.
WITH clause defines named relations for use within a query. It
allows flattening nested queries or simplifying subqueries. For example,
the following queries are equivalent:
This also works with multiple subqueries:
Additionally, the relations within a
WITH clause can chain:
Currently, the SQL for the
WITH clause will be inlined
anywhere the named relation is used. This means that if the relation is
used more than once and the query is non-deterministic, the results may
be different each time.
SELECT clause specifies the output of the query. Each
select_expression defines a column or columns to be included in the
DISTINCT quantifiers determine whether duplicate rows
are included in the result set. If the argument
ALL is specified, all
rows are included. If the argument
DISTINCT is specified, only unique
rows are included in the result set. In this case, each output column
must be of a type that allows comparison. If neither argument is
specified, the behavior defaults to
select_expression must be in one of the following forms:
In the case of
expression [ [ AS ] column_alias ], a single output
column is defined.
In the case of
row_expression.* [ AS ( column_alias [, ...] ) ], the
row_expression is an arbitrary expression of type
ROW. All fields of
the row define output columns to be included in the result set.
In the case of
relation.*, all columns of
relation are included in
the result set. In this case column aliases are not allowed.
In the case of
*, all columns of the relation defined by the query are
included in the result set.
In the result set, the order of columns is the same as the order of their specification by the select expressions. If a select expression returns multiple columns, they are ordered the same way they were ordered in the source relation or row type expression.
If column aliases are specified, they override any preexisting column or row field names:
Otherwise, the existing names are used:
and in their absence, anonymous columns are produced:
GROUP BY Clause
GROUP BY clause divides the output of a
SELECT statement into
groups of rows containing matching values. A simple
GROUP BY clause
may contain any expression composed of input columns or it may be an
ordinal number selecting an output column by position (starting at one).
The following queries are equivalent. They both group the output by the
nationkey input column with the first query using the ordinal position
of the output column and the second query using the input column name:
GROUP BY clauses can group output by input column names not appearing
in the output of a select statement. For example, the following query
generates row counts for the
customer table using the input column
GROUP BY clause is used in a
SELECT statement all output
expressions must be either aggregate functions or columns present in the
GROUP BY clause.
Complex Grouping Operations
Presto also supports complex aggregations using the
ROLLUP syntax. This syntax allows users to perform analysis
that requires aggregation on multiple sets of columns in a single query.
Complex grouping operations do not support grouping on expressions
composed of input columns. Only column names or ordinals are allowed.
Complex grouping operations are often equivalent to a
UNION ALL of
GROUP BY expressions, as shown in the following examples. This
equivalence does not apply, however, when the source of data for the
aggregation is non-deterministic.
Grouping sets allow users to specify multiple lists of columns to group
on. The columns not part of a given sublist of grouping columns are set
GROUPING SETS semantics are demonstrated by this example query:
The preceding query may be considered logically equivalent to a
UNION ALL of multiple
GROUP BY queries:
However, the query with the complex grouping syntax (
ROLLUP) will only read from the underlying data source once,
while the query with the
UNION ALL reads the underlying data three
times. This is why queries with a
UNION ALL may produce inconsistent
results when the data source is not deterministic.
CUBE operator generates all possible grouping sets (i.e. a power
set) for a given set of columns. For example, the query:
is equivalent to:
ROLLUP operator generates all possible subtotals for a given set
of columns. For example, the query:
is equivalent to:
Combining multiple grouping expressions
Multiple grouping expressions in the same query are interpreted as having cross-product semantics. For example, the following query:
which can be rewritten as:
is logically equivalent to:
DISTINCT quantifiers determine whether duplicate
grouping sets each produce distinct output rows. This is particularly
useful when multiple complex grouping sets are combined in the same
query. For example, the following query:
is equivalent to:
However, if the query uses the
DISTINCT quantifier for the
only unique grouping sets are generated:
The default set quantifier is
grouping(col1, ..., colN) -> bigint
The grouping operation returns a bit set converted to decimal,
indicating which columns are present in a grouping. It must be used in
GROUP BY and its
arguments must match exactly the columns referenced in the corresponding
GROUP BY clause.
To compute the resulting bit set for a particular row, bits are assigned to the argument columns with the rightmost column being the least significant bit. For a given grouping, a bit is set to 0 if the corresponding column is included in the grouping and to 1 otherwise. For example, consider the query below:
The first grouping in the above result only includes the
column and excludes the
The bit set constructed for that grouping is
011 where the most
significant bit represents
HAVING clause is used in conjunction with aggregate functions and
GROUP BY clause to control which groups are selected. A
clause eliminates groups that do not satisfy the given conditions.
HAVING filters groups after groups and aggregates are computed.
The following example queries the
customer table and selects groups
with an account balance greater than the specified value:
UNION | INTERSECT | EXCEPT Clause
EXCEPT are all set operations. These clauses
are used to combine the results of more than one select statement into a
single result set:
DISTINCT controls which rows are included in the
final result set. If the argument
ALL is specified all rows are
included even if the rows are identical. If the argument
specified only unique rows are included in the combined result set. If
neither is specified, the behavior defaults to
argument is not supported for
Multiple set operations are processed left to right, unless the order is
explicitly specified via parentheses. Additionally,
more tightly than
UNION. That means
A UNION B INTERSECT C EXCEPT D is the same as
A UNION (B INTERSECT C) EXCEPT D.
UNION combines all the rows that are in the result set from the first
query with those that are in the result set for the second query. The
following is an example of one of the simplest possible
It selects the value
13 and combines this result set with a second
query that selects the value
The following query demonstrates the difference between
UNION ALL. It selects the value
13 and combines this result set with
a second query that selects the values
INTERSECT returns only the rows that are in the result sets of both
the first and the second queries. The following is an example of one of
the simplest possible
INTERSECT clauses. It selects the values
42 and combines this result set with a second query that selects
42 is only in the result set of the first query,
it is not included in the final results.:
EXCEPT returns the rows that are in the result set of the first query,
but not the second. The following is an example of one of the simplest
EXCEPT clauses. It selects the values
combines this result set with a second query that selects the value
13 is also in the result set of the second query, it is
not included in the final result.:
ORDER BY Clause
ORDER BY clause is used to sort a result set by one or more output
Each expression may be composed of output columns, or it may be an
ordinal number selecting an output column by position, starting at one.
ORDER BY clause is evaluated after any
GROUP BY or
clause, and before any
FETCH FIRST clause. The
default null ordering is
NULLS LAST, regardless of the ordering
Note that, following the SQL specification, an
ORDER BY clause only
affects the order of rows for queries that immediately contain the
clause. Presto follows that specification, and drops redundant usage of
the clause to avoid negative performance impacts.
In the following example, the clause only applies to the select statement.
Since tables in SQL are inherently unordered, and the
ORDER BY clause
in this case does not result in any difference, but negatively impacts
performance of running the overall insert statement, Presto skips the
Another example where the
ORDER BY clause is redundant, and does not
affect the outcome of the overall statement, is a nested query:
OFFSET clause is used to discard a number of leading rows from the
ORDER BY clause is present, the
OFFSET clause is evaluated
over a sorted result set, and the set remains sorted after the leading
rows are discarded:
Otherwise, it is arbitrary which rows are discarded. If the count
specified in the
OFFSET clause equals or exceeds the size of the
result set, the final result is empty.
LIMIT or FETCH FIRST Clauses
FETCH FIRST clause restricts the number of rows in the
The following example queries a large table, but the
restricts the output to only have five rows (because the query lacks an
ORDER BY, exactly which rows are returned is arbitrary):
LIMIT ALL is the same as omitting the
FETCH FIRST clause supports either the
ROWS keywords. These keywords are equivalent and the
choice of keyword has no effect on query execution.
If the count is not specified in the
FETCH FIRST clause, it defaults
OFFSET clause is present, the
FETCH FIRST clause
is evaluated after the
FETCH FIRST clause, the argument
controls which rows are included in the result set.
If the argument
ONLY is specified, the result set is limited to the
exact number of leading rows determined by the count.
If the argument
WITH TIES is specified, it is required that the
ORDER BY clause be present. The result set consists of the same set of
leading rows and all of the rows in the same peer group as the last of
them ('ties') as established by the ordering in the
ORDER BY clause.
The result set is sorted:
There are multiple sample methods:
BERNOULLI Each row is selected to be in the table sample with a
probability of the sample percentage. When a table is sampled using the
Bernoulli method, all physical blocks of the table are scanned and
certain rows are skipped (based on a comparison between the sample
percentage and a random value calculated at runtime).
The probability of a row being included in the result is independent from any other row. This does not reduce the time required to read the sampled table from disk. It may have an impact on the total query time if the sampled output is processed further.
SYSTEM This sampling method divides the table into logical segments of
data and samples the table at this granularity. This sampling method
either selects all the rows from a particular segment of data or skips
it (based on a comparison between the sample percentage and a random
value calculated at runtime).
The rows selected in a system sampling will be dependent on which connector is used. For example, when used with Hive, it is dependent on how the data is laid out on HDFS. This method does not guarantee independent sampling probabilities.
Neither of the two methods allow deterministic bounds on the number of rows returned.
Using sampling with joins:
UNNEST can be used to expand an :ref:
array_type or :ref:
into a relation. Arrays are expanded into a single column, and maps are
expanded into two columns (key, value).
UNNEST can also be used with
multiple arguments, in which case they are expanded into multiple
columns, with as many rows as the highest cardinality argument (the
other columns are padded with nulls).
UNNEST can optionally have a
WITH ORDINALITY clause, in which case an additional ordinality column
is added to the end.
UNNEST is normally used with a
JOIN and can
reference columns from relations on the left side of the join.
Using a single column:
Using multiple columns:
WITH ORDINALITY clause:
Joins allow you to combine data from multiple relations.
A cross join returns the Cartesian product (all combinations) of two
relations. Cross joins can either be specified using the explit
CROSS JOIN syntax or by specifying multiple relations in the
Both of the following queries are equivalent:
nation table contains 25 rows and the
region table contains 5
rows, so a cross join between the two tables produces 125 rows:
Subqueries appearing in the
FROM clause can be preceded by the keyword
LATERAL. This allows them to reference columns provided by preceding
LATERAL join can appear at the top level in the
FROM list, or
anywhere within a parenthesized join tree. In the latter case, it can
also refer to any items that are on the left-hand side of a
which it is on the right-hand side.
FROM item contains
LATERAL cross-references, evaluation
proceeds as follows: for each row of the
FROM item providing the
cross-referenced columns, the
LATERAL item is evaluated using that row
set's values of the columns. The resulting rows are joined as usual with
the rows they were computed from. This is repeated for set of rows from
the column source tables.
LATERAL is primarily useful when the cross-referenced column is
necessary for computing the rows to be joined:
Qualifying Column Names
When two relations in a join have columns with the same name, the column references must be qualified using the relation alias (if the relation has an alias), or with the relation name:
The following query will fail with the error
Column 'name' is ambiguous:
A subquery is an expression which is composed of a query. The subquery is correlated when it refers to columns outside of the subquery. Logically, the subquery will be evaluated for each row in the surrounding query. The referenced columns will thus be constant during any single evaluation of the subquery.
Support for correlated subqueries is limited. Not every standard form is supported.
EXISTS predicate determines if a subquery returns any rows:
IN predicate determines if any values produced by the subquery are
equal to the provided expression. The result of
IN follows the
standard rules for nulls. The subquery must produce exactly one column:
A scalar subquery is a non-correlated subquery that returns zero or one
row. It is an error for the subquery to produce more than one row. The
returned value is
NULL if the subquery produces no rows:
Currently only single column can be returned from the scalar subquery.