Tolerant PHP Parser

An early-stage PHP parser designed for IDE usage scenarios.

(The first version translated by vz on 2020.07.26)

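Key metrics

Overview

  • Name and owner: microsoft/tolerant-php-parser
  • Primary language: PHP
  • Languages: PHP (5 languages in total)
  • Platforms: BSD, Linux, Mac, Solaris, Windows
  • License: MIT License

Owner activity

  • Created: 2016-12-28 21:26:25
  • Last push: 2024-09-28 11:58:12
  • Releases: 26
  • Latest release: v0.1.2
  • First release: v0.0.1

User engagement

  • Stars: 886
  • Watchers: 46
  • Forks: 79
  • Commits: 893
  • Issues: 172
  • Open issues: 51
  • Pull requests: 211
  • Open pull requests: 8
  • Closed pull requests: 21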

Tolerant PHP Parser

This is an early-stage PHP parser designed, from the beginning, for IDE usage scenarios (see Design Goals for more details). There is
still a ton of work to be done, so at this point, this repo mostly serves as
an experiment and the start of a conversation.

Get Started

After you've configured your machine, you can use the parser to generate and work
with the Abstract Syntax Tree (AST) via a friendly API.

<?php
// Autoload required classes
require __DIR__ . "/vendor/autoload.php";

use Microsoft\PhpParser\{DiagnosticsProvider, Node, Parser, PositionUtilities};

// Instantiate new parser instance
$parser = new Parser();

// Return and print an AST from string contents
$astNode = $parser->parseSourceFile('<?php /* comment */ echo "hi!"');
var_dump($astNode);

// Gets and prints errors from AST Node. The parser handles errors gracefully,
// so it can be used in IDE usage scenarios (where code is often incomplete).
$errors = DiagnosticsProvider::getDiagnostics($astNode);
var_dump($errors);

// Traverse all Node descendants of $astNode
foreach ($astNode->getDescendantNodes() as $descendant) {
    if ($descendant instanceof Node\StringLiteral) {
        // Print the Node text (without whitespace or comments)
        var_dump($descendant->getText());

        // All Nodes link back to their parents, so it's easy to navigate the tree.
        $grandParent = $descendant->getParent()->getParent();
        var_dump($grandParent->getNodeKindName());
        
        // The AST is fully-representative, and round-trippable to the original source.
        // This enables consumers to build reliable formatting and refactoring tools.
        var_dump($grandParent->getLeadingCommentAndWhitespaceText());
    }
    
    // In addition to retrieving all children or descendants of a Node,
    // Nodes expose properties specific to the Node type.
    if ($descendant instanceof Node\Expression\EchoExpression) {
        $echoKeywordStartPosition = $descendant->echoKeyword->getStartPosition();
        // To cut down on memory consumption, positions are represented as a single integer 
        // index into the document, but their line and character positions are easily retrieved.
        $lineCharacterPosition = PositionUtilities::getLineCharacterPositionFromPosition(
            $echoKeywordStartPosition,
            $descendant->getFileContents()
        );
        echo "line: $lineCharacterPosition->line, character: $lineCharacterPosition->character";
    }
}

Note: the API is not yet finalized, so please file issues to let us know what functionality you want exposed,
and we'll see what we can do! Also, please file bugs for any unexpected behavior in the parse tree. We're still
in our early stages, and any feedback you have is much appreciated :smiley:.
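
The example above notes that positions are stored as a single integer offset and converted to line and character values on demand. As a rough illustration of that idea, here is a standalone sketch that does not use the library; the function name `offsetToLineCharacter` is hypothetical, and zero-based lines and characters are an assumption of this sketch. The conversion simply counts newlines before the offset:

```php
<?php
// Hypothetical helper for illustration only. In the parser itself, the README's
// example uses PositionUtilities::getLineCharacterPositionFromPosition() instead.
function offsetToLineCharacter(int $offset, string $text): array
{
    // Clamp the offset so out-of-range values still map to a valid position.
    $offset = max(0, min($offset, strlen($text)));

    // Only the text before the offset matters for computing the position.
    $before = (string) substr($text, 0, $offset);

    // Zero-based line index: the number of "\n" bytes before the offset.
    $line = substr_count($before, "\n");

    // Zero-based character index: bytes since the last "\n" (or since the start).
    $lastNewline = strrpos($before, "\n");
    $character = $lastNewline === false ? $offset : $offset - $lastNewline - 1;

    return ['line' => $line, 'character' => $character];
}

$source = "<?php\n/* comment */\necho \"hi!\";";
$echoOffset = strpos($source, 'echo');
$position = offsetToLineCharacter($echoOffset, $source);
echo "line: {$position['line']}, character: {$position['character']}\n";
```

Storing one integer per position and deriving the line and character only when a caller asks for them keeps each token small, at the cost of a scan over the text at lookup time.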

Design Goals

  • Error tolerant design - in IDE scenarios, code is, by definition, incomplete. When invalid code is entered, the
    parser should still be able to recover and produce a valid and complete tree, as well as relevant diagnostics.
  • Fast and lightweight (should be able to parse several MB of source code per second,
    to leave room for other features).
    • Memory-efficient data structures
    • Allow for incremental parsing in the future
  • Adheres to PHP language spec,
    supports both PHP5 and PHP7 grammars
  • Generated AST provides properties (fully representative, etc.) necessary for semantic and transformational
    operations, which also need to be performant.
    • Fully representative and round-trippable back to the text it was parsed from (all whitespace and comment "trivia" are included in the parse tree)
    • Possible to easily traverse the tree through parent/child nodes
    • < 100 ms UI response time,
      so each language server operation should be < 50 ms to leave room for all the
      other stuff going on in parallel.
  • Simple and maintainable over time - parsers have a tendency to get really
    confusing, really fast, so readability and debuggability are high priorities.
  • Testable - the parser should produce provably valid parse trees. We achieve this by defining and continuously testing
    a set of invariants about the tree.
  • Friendly and descriptive API to make it easy for others to build on.
  • Written in PHP - make it as easy as possible for the PHP community to consume and contribute.
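
To make the error-tolerance goal more concrete, the following is a minimal sketch of one common recovery strategy, not the parser's actual implementation. It assumes a hypothetical mini-grammar (`statement := 'echo' STRING ';'`) and pre-split tokens; the function `parseTolerantly` and the node kinds it emits are invented for illustration. A missing token becomes a zero-width placeholder plus a diagnostic, and an unexpected token is kept in the tree as skipped, so a complete tree is always produced:

```php
<?php
// Illustrative only: a tolerant parser for the hypothetical grammar
//   statement := 'echo' STRING ';'
// $tokens is a list of non-empty strings, e.g. ['echo', '"hi!"', ';'].
function parseTolerantly(array $tokens): array
{
    $statements = [];
    $diagnostics = [];
    $i = 0;
    $count = count($tokens);

    // Consume the next token if it matches. Otherwise record a diagnostic and
    // return a zero-width "missing" placeholder WITHOUT consuming any input.
    $eat = function (string $expected, callable $matches) use ($tokens, $count, &$i, &$diagnostics): array {
        if ($i < $count && $matches($tokens[$i])) {
            return ['kind' => $expected, 'text' => $tokens[$i++]];
        }
        $diagnostics[] = "'$expected' expected at token index $i";
        return ['kind' => 'Missing' . $expected, 'text' => ''];
    };

    while ($i < $count) {
        if ($tokens[$i] !== 'echo') {
            // Unexpected token: keep it in the tree as skipped so no input is lost.
            $diagnostics[] = "Unexpected '{$tokens[$i]}' at token index $i";
            $statements[] = ['kind' => 'SkippedToken', 'text' => $tokens[$i++]];
            continue;
        }
        $statements[] = [
            'kind' => 'EchoStatement',
            'echoKeyword' => $eat('EchoKeyword', fn ($t) => $t === 'echo'),
            'expression' => $eat('StringLiteral', fn ($t) => strpos($t, '"') === 0),
            'semicolon' => $eat('Semicolon', fn ($t) => $t === ';'),
        ];
    }

    return ['statements' => $statements, 'diagnostics' => $diagnostics];
}

// The first statement lacks its semicolon, yet both statements are parsed.
$result = parseTolerantly(['echo', '"hi!"', 'echo', '"bye"', ';']);
var_dump(count($result['statements']), $result['diagnostics']);
```

Because the placeholder consumes no input, the parser resynchronizes at the next `echo` keyword instead of abandoning the rest of the file.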

Current Status and Approach

To ensure a sufficient level of correctness at every step of the way, the
parser is being developed using the following incremental approach:

  • Phase 1: Write a lexer that does not support PHP grammar, but supports EOF
    and Unknown tokens. Write tests for all invariants.
  • Phase 2: Support PHP lexical grammar, lots of tests
  • Phase 3: Write a parser that does not support PHP grammar, but produces a tree of
    Error Nodes. Write tests for all invariants.
  • Phase 4: Support PHP syntactic grammar, lots of tests
  • Phase 5 (in progress :running:): Real-world validation and optimization
    • Correctness: validate that there are no errors produced on sample codebases, benchmark against other parsers (investigate any instance of disagreement), fuzz-testing
    • Performance: profile, benchmark against large PHP applications
  • Phase 6: Finalize API to make it as easy as possible for people to consume.
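
As an illustration of what Phase 1 might look like, here is a sketch under stated assumptions rather than the project's actual code. The functions `lexPhaseOne` and `checkInvariants`, the array-based token shape, and the specific invariants are all hypothetical. The lexer knows no PHP grammar at all, and the invariants hold for any input:

```php
<?php
// Hypothetical "phase 1" lexer: it knows no PHP grammar, so every byte becomes
// an Unknown token, followed by one zero-length EndOfFile token.
// Tokens store only integer positions, not copies of the text.
function lexPhaseOne(string $source): array
{
    $tokens = [];
    $length = strlen($source);
    for ($start = 0; $start < $length; $start++) {
        $tokens[] = ['kind' => 'Unknown', 'start' => $start, 'length' => 1];
    }
    $tokens[] = ['kind' => 'EndOfFile', 'start' => $length, 'length' => 0];
    return $tokens;
}

// Checks invariants that should hold for ANY input, valid PHP or not.
// Returns a list of violation messages; an empty list means all invariants hold.
function checkInvariants(string $source, array $tokens): array
{
    $violations = [];
    $expectedStart = 0;
    $rebuilt = '';
    $lastIndex = count($tokens) - 1;

    foreach ($tokens as $index => $token) {
        // Tokens must be contiguous: no gaps and no overlaps.
        if ($token['start'] !== $expectedStart) {
            $violations[] = "gap or overlap before token $index";
        }
        // EndOfFile may only appear as the final token.
        if ($token['kind'] === 'EndOfFile' && $index !== $lastIndex) {
            $violations[] = "EndOfFile token at index $index is not last";
        }
        $rebuilt .= substr($source, $token['start'], $token['length']);
        $expectedStart = $token['start'] + $token['length'];
    }

    if ($lastIndex < 0 || $tokens[$lastIndex]['kind'] !== 'EndOfFile') {
        $violations[] = 'token stream does not end with EndOfFile';
    }
    // Round trip: the token ranges must reproduce the source exactly.
    if ($rebuilt !== $source) {
        $violations[] = 'tokens do not reproduce the original source';
    }
    return $violations;
}

foreach (['', '<?php /* comment */ echo "hi!"', "echo \"unterminated"] as $sample) {
    var_dump(checkInvariants($sample, lexPhaseOne($sample)));
}
```

Once checks like these pass for a grammar-free lexer, the same checks can keep running unchanged as real lexical rules are added in later phases.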

Additional notes

A few PHP grammatical constructs (namely yield-expression and template strings)
are not yet supported, and there are other miscellaneous bugs. However, because the parser is error-tolerant,
these errors are handled gracefully, and the resulting tree is otherwise complete. To get a more holistic sense of
where we are, you can run the "validation" test suite (see Contributing Guidelines for more info
on running tests), or simply take a look at the current validation test results.

Even though we haven't yet begun the performance optimization stage, we have seen promising results so far,
and have plenty more room for improvement. See How It Works for details on our current
approach, and run the Performance Tests on your
own machine to see for yourself.
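
For a quick, informal check against the "several MB of source code per second" goal, throughput can be estimated with a small timing loop. This is only a sketch and not the project's Performance Tests: `measureThroughput` is a hypothetical helper, and the stand-in callable below merely counts newlines. To measure the parser itself, one would pass a callable that invokes it.

```php
<?php
// Hypothetical helper: runs $parse over $source repeatedly and reports MB/s.
function measureThroughput(callable $parse, string $source, int $iterations = 20): float
{
    $startedAt = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $parse($source);
    }
    // Guard against a zero elapsed time on very fast runs.
    $elapsedSeconds = max(microtime(true) - $startedAt, 1e-9);
    $megabytes = strlen($source) * $iterations / (1024 * 1024);
    return $megabytes / $elapsedSeconds;
}

// Synthetic input of roughly 0.9 MB; real codebases give more meaningful numbers.
$sample = str_repeat("<?php echo \"hi!\";\n", 50000);

// Stand-in workload (an assumption for this sketch): replace with a parser call.
$standIn = function (string $code): int {
    return substr_count($code, "\n");
};

printf("%.1f MB/s\n", measureThroughput($standIn, $sample));
```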

Learn more

:dart: Design Goals - learn about the design goals of the project (features, performance metrics, and more).

:book: Documentation - learn how to reference the parser from your project, and how to perform
operations on the AST to answer questions about your code.

:eyes: Syntax Visualizer Tool - get a more tangible feel for the AST. Get creative - see if you can break it!

:chart_with_upwards_trend: Current Status and Approach - how much of the grammar is supported? Performance? Memory? API stability?

:wrench: How it works - learn about the architecture, design decisions, and tradeoffs.

:sparkling_heart: Contribute! - learn how to get involved, check out some pointers to educational commits that'll
help you ramp up on the codebase (even if you've never worked on a parser before),
and recommended workflows that make it easier to iterate.


This project has adopted the Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or contact
opencode@microsoft.com with any additional questions or comments.