During our previous Launch Week we announced the development of a Postgres Language Server. The response was more enthusiastic than we imagined and we received some excellent ideas.
This is an update on the Parser, which is the fundamental primitive of a Language Server. We’ve been through several iterations of the parser and we want to detail our explorations.
Want to learn some more acronyms?
There will be a few acronyms in this post. We don’t want this post to be “inside baseball” so here are a few concepts that you need to know:
- LSP / Language Server Protocol: Something that helps code editors understand code better. It provides functionality like linting and error detection.
- Postgres Language Server: a server that will help with Postgres programming - writing SQL, type inference, etc.
- Parser: A parser converts written code into a form that tools can work with. For example, it identifies variables in code. and then the tools can easily access those variables. We’ll describe this more below.
Background: Why build a Language Server?
Postgres is gaining popularity, and yet the tooling around it still needs a lot of work. Writing SQL in editors like VSCode is a pain. One of the unique features of Supabase is the ability to access your Postgres database from a browser or mobile app through our Data APIs. This means that developers are writing more PL/pgSQL.
While code editors have great support for most programming languages, SQL support is underwhelming. We want to make Postgres as simple as Python.
Language Server's Core: The Role of the Parser
On the highest level, a language server is a thing which
- accepts input source code from the client
- turns it into a semantic model of the code
- and provides language-specific smarts back to the client.
The parser is the core of every language server. It takes the raw string, turns it into a stream of tokens, and builds a syntax tree. This syntax tree can then be used to extract structural information.
Usually, the parser builds a concrete syntax tree (CST) before turning it into an abstract syntax tree (AST). A CST preserves all syntax elements present in the source code with the goal of being able to re-create the exact original source, while an AST contains only the meaning of the source. For example, take a simple expression 2 * (7 + 3)
:
_15 CST AST_15 ----- -----_15 expr *_15 / | \ / \_15 term * term 2 +_15 | | / \_15factor factor 7 3 _15 | / | \_15 2 ( expr ) _15 / | \_15 term + term_15 | |_15 factor factor_15 | |_15 7 3
Implementing a Parser for Postgres
Implementing a parser for Postgres is challenging because of the ever-evolving and complex syntax of Postgres. Even select
statements are very complex to parse. Then there are common table expressions, sub-queries and the like. This is one of the reasons why existing Postgres tooling is scarce, badly maintained, and often does not work well.
We decided to not create a custom parser. Instead, we leverage libpg_query to parse SQL code reliably. The pganalyze team has a published a great blog post on why this approach is preferable.
However, libpgquery is designed to parse _executable SQL — not to provide language intelligence. Using it for a language server means adapting it to our specific use case. Let’s explore how we adapted it:
Tokenization
Before any syntax tree can be built, the input string needs to be converted into a stream of tokens. libpg_query exposes a scan
API that returns all non-whitespace tokens of the source, even for invalid SQL. For example, a simple statement select '1';
returns
_18ScanToken {_18 start: 0,_18 end: 6,_18 token: Select,_18 keyword_kind: ReservedKeyword,_18},_18ScanToken {_18 start: 7,_18 end: 10,_18 token: Sconst,_18 keyword_kind: NoKeyword,_18},_18ScanToken {_18 start: 10,_18 end: 11,_18 token: Ascii59,_18 keyword_kind: NoKeyword,_18}
Every ScanToken
token contains its variant, a range, and a keyword kind. To simplify the implementation of the parser, we extract the text for each token using the range for now. We arrive at the following Token
struct.
_10pub struct Token {_10 /// The kind of the token_10 pub kind: SyntaxKind,_10 /// Text from the input_10 pub text: String,_10 /// Range within the input_10 pub span: TextRange,_10 /// Variants from `ScanToken.keyword_kind` + `Whitespace`_10 pub token_type: TokenType,_10}
To have a complete token stream, we merge the results of scan
with a list of all whitespace tokens in the source, where the latter are extracted by a simple regular expression. For a simple statement select '1';
, the following tokens are returned by the lexer.
_26[_26 Token {_26 kind: Select,_26 text: "select",_26 span: 0..6,_26 token_type: ReservedKeyword,_26 },_26 Token {_26 kind: Whitespace,_26 text: " ",_26 span: 6..7,_26 token_type: Whitespace,_26 },_26 Token {_26 kind: Sconst,_26 text: "'1'",_26 span: 7..10,_26 token_type: NoKeyword,_26 },_26 Token {_26 kind: Ascii59,_26 text: ";",_26 span: 10..11,_26 token_type: NoKeyword,_26 },_26]
Conversion to a Syntax Tree
To transform the stream of tokens into a syntax tree, we face the first challenges with libpg_query
. In a language server, it's important to handle incomplete or improperly formatted input gracefully. When an error occurs you don’t want the parser to “stop”. You want it to check for all errors in your code:
_10create table posts (_10 id serial primary key # <- ERR: missing comma, Parser should continue_10 content text_10);_10_10create table comments (_10 id serial primary key,_10 post_id int references posts # <- ERR: missing comma, second error returned_10 comment text_10);
Unfortunately, the parse
api from libpg_query
only parses the entire input — if any SQL statement contains a syntax error, an error is returned for the entire input.
To overcome this, we implemented a resilient source
parser. This parser breaks the input into individual SQL statements and parses them one by one, allowing us to handle syntax errors within each statement independently. It is implemented as a top-down LL parser. Specifically, the parser iterates the token stream left-to-right, and checks if the cursor currently is at the start of a new statement. Once a statement is entered, it walks the tokens until ;
or another statement start is encountered while skipping sub-statements.
Luckily, Postgres statements always start with distinct keywords. An update statement is identifiable with update
, a delete statement with delete from
. There are a few statements that need more than the first few tokens to be distinguishable, but we only care about whether there is a statement, and not what statement there is exactly, so ambiguity is very much acceptable.
For the implementation, we only need to provide the distinct keywords each statement starts with and compare it to the current tokens using a lookahead mechanism.
Reverse-Engineering the CST
The second limitation we encountered: libpg_query
only exposes an API for the AST, not for the CST. To provide language intelligence on the source code, both are required. Since we do not want to implement our own parser, we need to work with what we have to build the CST: the AST and a stream of tokens. The goal is to reverse-engineer the AST into the CST. This involves re-using the AST nodes as CST nodes, and figuring out what token belongs beneath what node. The exemplary statement select '1';
should be be parsed into
_10 [email protected] "select"_10_10 [email protected] "'1'"_10
To do that, we need to know the range of every node that is within the AST.
We made numerous iterations over the past few months to figure out how to accomplish this with minimal manual implementations. Before diving into details, lets take a closer look at the parse
API of libpg_query
. For the exemplary statement above, it returns (simplified for readability):
_23SelectStmt {_23 target_list: [_23 ResTarget {_23 val: Some(_23 Node {_23 node: Some(_23 AConst {_23 location: 7,_23 val: Some(_23 Sval(_23 String {_23 sval: "1",_23 },_23 ),_23 ),_23 },_23 ),_23 },_23 ),_23 location: 7,_23 },_23 ],_23}
There are a few limitations:
- Some nodes do have a
location
property that indicates where the node starts in the source, but not all. - There is no information on the range of a node within the source.
- Some locations are not what you would expect. For example the location of the expression node is the location of the operator, not the start of the left expression.
To summarise: we need a range for each node, and we only have a start position, but not always, and sometimes it is wrong.
Our first very iterations were naive. We explored what information we could extract from scan
and parse
, and if anything can help in reverse-engineering the CST.
It turns out, the most reliable way of determining the range of a node is by knowing all properties of that node, and its position in the tree.
A property is any text for which a Token can potentially be found in the source code. For example, a SelectStmt
node has the select
keyword as a property, and if there is a from_clause
, a from
keyword. For the exemplary statement above, the properties are
_10SelectStmt: [Select],_10RangeVar: [],_10AConst: ['1']
Note that we do not have an extra String
node, and instead add its properties to the parent AConst
node. This reason is that a String
node does not bring any value to the CST, since we already know that ‘1'
is a string from the kind of the respective Token
. The same is true for all nodes that just contain type information such as Integer
, Boolean
and Float
.
The position of any node is the same in AST and CST, and thereby can be reflected from it.
Implementation
Before the actual parsing begins, the AST returned by parse
is converted into an uniform tree structure where each node holds the kind of the node, a list of properties, the depth and, if available, the location.
For example, for select '1';
:
_39edges: (0, 1), (1, 2),_39nodes: {_39 0: Node {_39 kind: SelectStmt,_39 depth: 1,_39 properties: [_39 TokenProperty {_39 value: None,_39 kind: Some(_39 Select,_39 ),_39 },_39 ],_39 location: None,_39 },_39 1: Node {_39 kind: ResTarget,_39 depth: 2,_39 properties: [],_39 location: Some(_39 7,_39 ),_39 },_39 2: Node {_39 kind: AConst,_39 depth: 3,_39 properties: [_39 TokenProperty {_39 value: Some(_39 "1",_39 ),_39 kind: None,_39 },_39 ],_39 location: Some(_39 7,_39 ),_39 },_39},
To implement such a conversion we need a way to
- get an
Option<usize>
with the location for every node type, if any - compile a list of all properties that a node contains
- walk down the AST until the leaves are reached
Due to the strict type system of Rust, a manual implementation would be a significant and repetitive effort. With languages such as JavaScript, getting the location of a node would be as simple as node?.location
. In Rust, a large match statement covering all possible nodes is required to do the same. Luckily, libpg_query
exports a protobuf definition containing all AST nodes and their fields. For example, an AExpr
node is defined as
_10message A_Expr_10{_10 A_Expr_Kind kind = 1 [json_name="kind"];_10 repeated Node name = 2 [json_name="name"];_10 Node lexpr = 3 [json_name="lexpr"];_10 Node rexpr = 4 [json_name="rexpr"];_10 int32 location = 5 [json_name="location"];_10}
We introspect that definition to generate code at build time using procedural macros.
Leveraging the powerful repetition feature of the quote
crate, the match
statement of a get_location
function can be implemented with just a few lines of code.
_20pub fn get_location_mod(_item: proc_macro2::TokenStream) -> proc_macro2::TokenStream {_20 // Parse the proto file using a custom parser_20 let parser = ProtoParser::new("libpg_query/protobuf/pg_query.proto");_20 let proto_file = parser.parse();_20_20 // get all node identifiers for the left side of the match arm_20 let node_identifiers = node_identifiers(&proto_file.nodes);_20 // get a TokenStream for each node that returns the location_20 // if it is part of the node properties_20 let location_idents = location_idents(&proto_file.nodes);_20_20 quote! {_20 /// Returns the location of a node_20 pub fn get_location(node: &NodeEnum) -> Option<usize> {_20 match node {_20 #(NodeEnum::#node_identifiers(n) => #location_idents),*_20 }_20 }_20 }_20}
The location_idents
function iterates all nodes, searches for a location
property in the protobuf definition for each node, and returns a TokenStream
with either Some(n.location)
or None
for each.
_18fn location_idents(nodes: &[Node]) -> Vec<TokenStream> {_18 nodes_18 .iter()_18 .map(|node| {_18 if node_18 .fields_18 .iter()_18 // has location property?_18 .find(|n| n.name == "location" && n.field_type == FieldType::Int32)_18 .is_some()_18 {_18 quote! { Some(n.location) }_18 } else {_18 quote! { None }_18 }_18 })_18 .collect()_18}
Similarly, we can generate code to recursively walk down the FieldType == Node
and FieldType == Vec<Node>
properties of the AST nodes. No manual work required.
Even the function that returns all properties for a node can be generated, at least partly. All AST fields of FieldType == String
can always be added to the list of properties. In the example above, the sval: “1”
of the String
node makes up the properties of its parent, the AConst
node. What is remaining are mostly just the keywords that need to be defined for every node. A SelectStmt
node has the select
keyword as a property, and if there is a from_clause
, a from
keyword.
_11fn custom_handlers(node: &Node) -> TokenStream {_11 match node.name.as_str() {_11 "SelectStmt" => quote! {_11 tokens.push(TokenProperty::from(Token::Select));_11 if n.from_clause.len() > 0 {_11 tokens.push(TokenProperty::from(Token::From));_11 }_11 ..._11 },_11 ..._11}
Parsing a Statement
After the tree has been generated, the parser goes through the tokens and finds the node in whose properties the current token can be found. But not every node is a possible successor. Lets look how the parser builds the CST for the statement select '1' from contact
at a high level.
We start with the full tree, and all tokens:
_10Remaining Tokens: ["select", "'1'", "from", "contact"]_10_10Parse Tree:_10_10 SelectStmt (0, [Select, From])_10 / \_101 (ResTarget, []) 2 (RangeVar, ['contact'])_10 | _10 3 (AConst, ['1'])
Starting with the root node, the parser first searches the current node for the token. In this case, with success. Select
is removed from SelectStmt
.
In the next iteration, we search for '1'
. Since its not in the current node, a breadth-first search is used to find the property within children nodes. We arrive at AConst
, open all nodes we encountered along the way, and advance.
_17Remaining Tokens: ["from", "contact"]_17_17Parse Tree:_17_17 SelectStmt (0, [From])_17 / \_171 (ResTarget, []) 2 (RangeVar, ['contact'])_17 | _17 3 (AConst, []) _17_17CST:_17SelectStmt_17 [email protected] "select"_17_17 [email protected] "'1'"
Since we arrived at a leaf node with no properties left, the next token can not be part of this node or any child. It can be closed immediately after advancing the token. We remove it from the tree and set the current node to its parent. The same now applies to ResTarget
, so we arrive back at SelectStmt
:
_15Remaining Tokens: ["from", "contact"]_15_15Parse Tree:_15_15 SelectStmt (0, [From])_15 \_15 2 (RangeVar, ['contact'])_15_15CST:_15SelectStmt_15 [email protected] "select"_15_15 [email protected] "'1'"
The from
token can once again be found within the current node and we just advance the parser. Since SelectStmt
is not a leaf node, we stay where we are.
_17Remaining Tokens: ["contact"]_17_17Parse Tree:_17_17 SelectStmt (0, [])_17 \_17 2 (RangeVar, ['contact'])_17_17CST:_17SelectStmt_17 [email protected] "select"_17_17 [email protected] "'1'"_17_17 [email protected] "from"
From here, it repeats itself: contact
is found within RangeVar
using a breadth-first search. It becomes a leaf node that is closed after applying the token. Since no tokens are left, we finish parsing by closing SelectStmt
, resulting in:
_11 [email protected] "select"_11_11 [email protected] "'1'"_11_11 [email protected] "from"_11_11 [email protected] "contact"
Keep in mind, this illustration shows you only an overview of the process.If you are interested in the details, take a look at the source code here.
You may have noticed that neither location
nor depth
were mentioned. Both are only used to improve performance and safeguard. Among other things, branches with nodes behind the current position of the parser are skipped. Further, the parser panics when a node is opened and either its position or its depth does not match the current state of the parser. This means that the returned CST is guaranteed to be correct.
Limitations
If the SQL is invalid and parse
returns an error, the returned CST is just a flat list of tokens. Consequently, the statement parser is not resilient. This is not great, but we have intentionally implemented it so that custom and resilient implementations can be added statement by statement later.
Ultimately, we want the libpg_query
-based parser to just serve as a fallback. For now however, our goal is to provide a usable language server as fast as possible. And even if some intelligence features will only work on valid statements, we believe it is still better than what we have today: no intelligence at all.
Next Steps
There are some minor improvements remaining for the parser. But the largest part are the manual implementations missing in get_node_properties
. Its a time-consuming effort, and we would love to get help from the community here. Check out this issue if you like to support.
After that, we will move on to the semantic data model and the language server itself. Other parser features such as support for PL/pgSQL function body parsing will be added later. We want to get this project into a usable state as fast as possible.