跳到內容

解析器

我們將要建構的解析器稱為遞迴下降解析器,它是手動沿著語法向下走並建立 AST 的過程。

解析器從簡單開始,它持有原始碼、詞法分析器以及從詞法分析器消耗的當前標記。

rust
pub struct Parser<'a> {
    /// Source Code
    source: &'a str,

    lexer: Lexer<'a>,

    /// Current Token consumed from the lexer
    cur_token: Token,

    /// The end range of the previous token
    prev_token_end: usize,
}

impl<'a> Parser<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            lexer: Lexer::new(source),
            cur_token: Token::default(),
        }
    }

    pub fn parse(&mut self) -> Program<'a> {
        Ok(Program {
            node: Node {
                start: 0,
                end: self.source.len(),
            }
            body: vec![]
        })
    }
}

輔助函式

目前標記 cur_token: Token 持有從詞法分析器返回的當前標記。我們將新增一些輔助函式來瀏覽和檢查此標記,使解析器程式碼更簡潔。

rust
impl<'a> Parser<'a> {
    fn start_node(&self) -> Node {
        let token = self.cur_token();
        Node::new(token.start, 0)
    }

    fn finish_node(&self, node: Node) -> Node {
        Node::new(node.start, self.prev_token_end)
    }

    fn cur_token(&self) -> &Token {
        &self.cur_token
    }

    fn cur_kind(&self) -> Kind {
        self.cur_token.kind
    }

    /// Checks if the current index has token `Kind`
    fn at(&self, kind: Kind) -> bool {
        self.cur_kind() == kind
    }

    /// Advance if we are at `Kind`
    fn bump(&mut self, kind: Kind) {
        if self.at(kind) {
            self.advance();
        }
    }

    /// Advance any token
    fn bump_any(&mut self) {
        self.advance();
    }

    /// Advance and return true if we are at `Kind`, return false otherwise
    fn eat(&mut self, kind: Kind) -> bool {
        if self.at(kind) {
            self.advance();
            return true;
        }
        false
    }

    /// Move to the next token
    fn advance(&mut self) {
        let token = self.lexer.next_token();
        self.prev_token_end = self.cur_token.end;
        self.cur_token = token;
    }
}

解析函式

DebuggerStatement 是最簡單的陳述式,所以讓我們嘗試解析它並返回有效的程式

rust
impl<'a> Parser<'a> {
    pub fn parse(&mut self) -> Program {
        let stmt = self.parse_debugger_statement();
        let body = vec![stmt];
        Program {
            node: Node {
                start: 0,
                end: self.source.len(),
            }
            body,
        }
    }

    fn parse_debugger_statement(&mut self) -> Statement {
        let node = self.start_node();
        // NOTE: the token returned from the lexer is `Kind::Debugger`, we'll fix this later.
        self.bump_any();
        Statement::DebuggerStatement {
            node: self.finish_node(node),
        }
    }
}

所有其他解析函式都建立在這些基本輔助函式的基礎上,例如在 swc 中解析 while 陳述式

rust
// https://github.com/swc-project/swc/blob/554b459e26b24202f66c3c58a110b3f26bbd13cd/crates/swc_ecma_parser/src/parser/stmt.rs#L952-L970

fn parse_while_stmt(&mut self) -> PResult<Stmt> {
    let start = cur_pos!(self);

    assert_and_bump!(self, "while");

    expect!(self, '(');
    let test = self.include_in_expr(true).parse_expr()?;
    expect!(self, ')');

    let ctx = Context {
        is_break_allowed: true,
        is_continue_allowed: true,
        ..self.ctx()
    };
    let body = self.with_ctx(ctx).parse_stmt(false).map(Box::new)?;

    let span = span!(self, start);
    Ok(Stmt::While(WhileStmt { span, test, body }))
}

解析表達式

表達式的語法是深層嵌套且遞迴的,這可能會在長表達式上導致堆疊溢位(例如在這個 TypeScript 測試中),

為了避免遞迴,我們可以使用一種稱為「Pratt 解析」的技術。更深入的教學可以在這裡找到,由 Rust-Analyzer 的作者撰寫。還有一個 Rust 版本在 Rome 中。

列表

有很多地方我們需要解析以標點符號分隔的列表,例如 [a, b, c]{a, b, c}

解析列表的程式碼都很相似,我們可以透過使用 traits 來使用模板方法模式來避免重複。

rust
// https://github.com/rome/tools/blob/85ddb4b2c622cac9638d5230dcefb6cf571677f8/crates/rome_js_parser/src/parser/parse_lists.rs#L131-L157

fn parse_list(&mut self, p: &mut Parser) -> CompletedMarker {
    let elements = self.start_list(p);
    let mut progress = ParserProgress::default();
    let mut first = true;
    while !p.at(JsSyntaxKind::EOF) && !self.is_at_list_end(p) {
        if first {
            first = false;
        } else {
            self.expect_separator(p);

            if self.allow_trailing_separating_element() && self.is_at_list_end(p) {
                break;
            }
        }

        progress.assert_progressing(p);

        let parsed_element = self.parse_element(p);

        if parsed_element.is_absent() && p.at(self.separating_element_kind()) {
            // a missing element
            continue;
        } else if self.recover(p, parsed_element).is_err() {
            break;
        }
    }
    self.finish_list(p, elements)
}

此模式還可以防止我們陷入無限迴圈,特別是 progress.assert_progressing(p);

然後可以為不同的列表提供實作細節,例如

rust
// https://github.com/rome/tools/blob/85ddb4b2c622cac9638d5230dcefb6cf571677f8/crates/rome_js_parser/src/syntax/expr.rs#L1543-L1580

struct ArrayElementsList;

impl ParseSeparatedList for ArrayElementsList {
    fn parse_element(&mut self, p: &mut Parser) -> ParsedSyntax {
        match p.cur() {
            T![...] => parse_spread_element(p, ExpressionContext::default()),
            T![,] => Present(p.start().complete(p, JS_ARRAY_HOLE)),
            _ => parse_assignment_expression_or_higher(p, ExpressionContext::default()),
        }
    }

    fn is_at_list_end(&self, p: &mut Parser) -> bool {
        p.at(T![']'])
    }

    fn recover(&mut self, p: &mut Parser, parsed_element: ParsedSyntax) -> RecoveryResult {
        parsed_element.or_recover(
            p,
            &ParseRecovery::new(
                JS_UNKNOWN_EXPRESSION,
                EXPR_RECOVERY_SET.union(token_set!(T![']'])),
            ),
            js_parse_error::expected_array_element,
        )
    }

    fn list_kind() -> JsSyntaxKind {
        JS_ARRAY_ELEMENT_LIST
    }

    fn separating_element_kind(&mut self) -> JsSyntaxKind {
        T![,]
    }

    fn allow_trailing_separating_element(&self) -> bool {
        true
    }
}

Cover Grammar

cover grammar中詳細說明,有時我們需要將 Expression 轉換為 BindingIdentifier。動態語言(例如 JavaScript)可以簡單地重寫節點類型

javascript
https://github.com/acornjs/acorn/blob/11735729c4ebe590e406f952059813f250a4cbd1/acorn/src/lval.js#L11-L26

但是在 Rust 中,我們需要進行結構到結構的轉換。一個好而簡潔的方法是使用 trait。

rust
pub trait CoverGrammar<'a, T>: Sized {
    fn cover(value: T, p: &mut Parser<'a>) -> Result<Self>;
}

trait 接受 T 作為輸入類型,並接受 Self 和輸出類型,因此我們可以定義以下內容

rust
impl<'a> CoverGrammar<'a, Expression<'a>> for BindingPattern<'a> {
    fn cover(expr: Expression<'a>, p: &mut Parser<'a>) -> Result<Self> {
        match expr {
            Expression::Identifier(ident) => Self::cover(ident.unbox(), p),
            Expression::ObjectExpression(expr) => Self::cover(expr.unbox(), p),
            Expression::ArrayExpression(expr) => Self::cover(expr.unbox(), p),
            _ => Err(()),
        }
    }
}

impl<'a> CoverGrammar<'a, ObjectExpression<'a>> for BindingPattern<'a> {
    fn cover(obj_expr: ObjectExpression<'a>, p: &mut Parser<'a>) -> Result<Self> {
        ...
        BindingIdentifier::ObjectPattern(ObjectPattern { .. })
    }
}

impl<'a> CoverGrammar<'a, ArrayExpression<'a>> for BindingPattern<'a> {
    fn cover(expr: ArrayExpression<'a>, p: &mut Parser<'a>) -> Result<Self> {
        ...
        BindingIdentifier::ArrayPattern(ArrayPattern { .. })
    }
}

然後,對於任何需要將 Expression 轉換為 BindingPattern 的地方,呼叫 BindingPattern::cover(expression)


TypeScript

因此,您已經完成了 JavaScript,並且想要挑戰解析 TypeScript?壞消息是沒有規格,但好消息是 TypeScript 解析器位於單個檔案中 🙃。

JSX vs TSX

對於以下程式碼,

javascript
let foo = <string> bar;

如果這是 tsx(未終止的 JSX),則會發生語法錯誤,但如果這是帶有 TSTypeAssertion 的正確 VariableDeclaration

先行預讀

在某些情況下,解析器需要先行預讀並查看多個標記,以確定正確的語法。

TSIndexSignature

例如,要解析 TSIndexSignature,請考慮以下兩種情況

typescript
type A = { readonly [a: number]: string }
           ^__________________________^ TSIndexSignature

type B = { [a]: string }
           ^_________^ TSPropertySignature

對於第一個 { 上的 type A,我們需要查看 5 個標記 (readonly[a:number),以確保它是 TSIndexSignature 而不是 TSPropertySignature

為了使此操作可行且有效率,詞法分析器需要一個用於儲存多個標記的緩衝區。

箭頭表達式

cover grammar中討論過,當在 SequenceExpression 後找到 => 標記時,我們需要從 Expression 轉換為 BindingPattern

但是此方法不適用於 TypeScript,因為 () 內的每個項目都可能具有 TypeScript 語法,有太多情況需要涵蓋,例如

typescript
<x>a, b as c, d!;
(a?: b = {} as c!) => {};

建議研究 TypeScript 原始碼以瞭解此特定情況。相關程式碼為

typescript
function tryParseParenthesizedArrowFunctionExpression(allowReturnTypeInArrowFunction: boolean): Expression | undefined {
  const triState = isParenthesizedArrowFunctionExpression();
  if (triState === Tristate.False) {
    // It's definitely not a parenthesized arrow function expression.
    return undefined;
  }

  // If we definitely have an arrow function, then we can just parse one, not requiring a
  // following => or { token. Otherwise, we *might* have an arrow function.  Try to parse
  // it out, but don't allow any ambiguity, and return 'undefined' if this could be an
  // expression instead.
  return triState === Tristate.True
    ? parseParenthesizedArrowFunctionExpression(/*allowAmbiguity*/ true, /*allowReturnTypeInArrowFunction*/ true)
    : tryParse(() => parsePossibleParenthesizedArrowFunctionExpression(allowReturnTypeInArrowFunction));
}

//  True        -> We definitely expect a parenthesized arrow function here.
//  False       -> There *cannot* be a parenthesized arrow function here.
//  Unknown     -> There *might* be a parenthesized arrow function here.
//                 Speculatively look ahead to be sure, and rollback if not.
function isParenthesizedArrowFunctionExpression(): Tristate {
  if (
    token() === SyntaxKind.OpenParenToken ||
    token() === SyntaxKind.LessThanToken ||
    token() === SyntaxKind.AsyncKeyword
  ) {
    return lookAhead(isParenthesizedArrowFunctionExpressionWorker);
  }

  if (token() === SyntaxKind.EqualsGreaterThanToken) {
    // ERROR RECOVERY TWEAK:
    // If we see a standalone => try to parse it as an arrow function expression as that's
    // likely what the user intended to write.
    return Tristate.True;
  }
  // Definitely not a parenthesized arrow function.
  return Tristate.False;
}

總之,TypeScript 解析器使用先行預讀(快速路徑)和回溯的組合來解析箭頭函式。

在 MIT 授權下發布。