RegExp Tree

JavaScript 的正则表达式处理器。「Regular expressions processor in JavaScript」

  • 所有者: DmitrySoshnikov/regexp-tree
  • 平台: Linux, Mac, Windows
  • 許可證: MIT License
  • 分類:
  • 主題:
  • 喜歡:
    0
      比較:

Github星跟蹤圖

regexp-tree

Build Status npm version npm downloads

Regular expressions processor in JavaScript

TL;DR: RegExp Tree is a regular expressions processor, which includes parser, traversal, transformer, optimizer, and interpreter APIs.

You can get an overview of the tool in this article.

Table of Contents

Installation

The parser can be installed as an npm module:

npm install -g regexp-tree

You can also try it online using AST Explorer.

Development

  1. Fork https://github.com/DmitrySoshnikov/regexp-tree repo
  2. If there is an actual issue from the issues list you'd like to work on, feel free to assign it yourself, or comment on it to avoid collisions (open a new issue if needed)
  3. Make your changes
  4. Make sure npm test still passes (add new tests if needed)
  5. Submit a PR

The regexp-tree parser is implemented as an automatic LR parser using Syntax tool. The parser module is generated from the regexp grammar, which is based on the regular expressions grammar used in ECMAScript.

For development from the github repository, run build command to generate the parser module, and transpile JS code:

git clone https://github.com/<your-github-account>/regexp-tree.git
cd regexp-tree
npm install
npm run build

NOTE: JS code transpilation is used to support older versions of Node. For faster development cycle you can use npm run watch command, which continuously transpiles JS code.

Usage as a CLI

Note: the CLI is exposed as its own regexp-tree-cli module.

Check the options available from CLI:

regexp-tree-cli --help
Usage: regexp-tree-cli [options]

Options:
   -e, --expression   A regular expression to be parsed
   -l, --loc          Whether to capture AST node locations
   -o, --optimize     Applies optimizer on the passed expression
   -c, --compat       Applies compat-transpiler on the passed expression
   -t, --table        Print NFA/DFA transition tables (nfa/dfa/all)

To parse a regular expression, pass -e option:

regexp-tree-cli -e '/a|b/i'

Which produces an AST node corresponding to this regular expression:

{
  type: 'RegExp',
  body: {
    type: 'Disjunction',
    left: {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    right: {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    }
  },
  flags: 'i',
}

NOTE: the format of a regexp is / Body / OptionalFlags.

Usage from Node

The parser can also be used as a Node module:

const regexpTree = require('regexp-tree');

console.log(regexpTree.parse(/a|b/i)); // RegExp AST

Note, regexp-tree supports parsing regexes from strings, and also from actual RegExp objects (in general -- from any object which can be coerced to a string). If some feature is not implemented yet in an actual JavaScript RegExp, it should be passed as a string:

// Pass an actual JS RegExp object.
regexpTree.parse(/a|b/i);

// Pass a string, since `s` flag may not be supported in older versions.
regexpTree.parse('/./s');

Also note, that in string-mode, escaping is done using two slashes \\ per JavaScript:

// As an actual regexp.
regexpTree.parse(/\n/);

// As a string.
regexpTree.parse('/\\n/');

Capturing locations

For source code transformation tools it might be useful also to capture locations of the AST nodes. From the command line it's controlled via the -l option:

regexp-tree-cli -e '/ab/' -l

This attaches loc object to each AST node:

{
  type: 'RegExp',
  body: {
    type: 'Alternative',
    expressions: [
      {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97,
        loc: {
          start: {
            line: 1,
            column: 1,
            offset: 1,
          },
          end: {
            line: 1,
            column: 2,
            offset: 2,
          },
        }
      },
      {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98,
        loc: {
          start: {
            line: 1,
            column: 2,
            offset: 2,
          },
          end: {
            line: 1,
            column: 3,
            offset: 3,
          },
        }
      }
    ],
    loc: {
      start: {
        line: 1,
        column: 1,
        offset: 1,
      },
      end: {
        line: 1,
        column: 3,
        offset: 3,
      },
    }
  },
  flags: '',
  loc: {
    start: {
      line: 1,
      column: 0,
      offset: 0,
    },
    end: {
      line: 1,
      column: 4,
      offset: 4,
    },
  }
}

From Node it's controlled via setOptions method exposed on the parser:

const regexpTree = require('regexp-tree');

const parsed = regexpTree
  .parser
  .setOptions({captureLocations: true})
  .parse(/a|b/);

The setOptions method sets global options, which are preserved between calls. It is also possible to provide options per a single parse call, which might be more preferred:

const regexpTree = require('regexp-tree');

const parsed = regexpTree.parse(/a|b/, {
  captureLocations: true,
});

Parsing options

The parser supports several options which can be set globally via the setOptions method on the parser, or by passing them with each parse method invocation.

Example:

const regexpTree = require('regexp-tree');

const parsed = regexpTree.parse(/a|b/, {
  allowGroupNameDuplicates: true,
});

The following options are supported:

  • captureLocations: boolean -- whether to capture AST node locations (false by default)
  • allowGroupNameDuplicates: boolean -- whether to skip duplicates check of the named capturing groups

Set allowGroupNameDuplicates would make the following expression possible:

/
  # YYY-MM-DD date format:

  (?<year>  \d{4}) -
  (?<month> \d{2}) -
  (?<day>   \d{2})

  |

  # DD.MM.YYY date format

  (?<day>   \d{2}) .
  (?<month> \d{2}) .
  (?<year>  \d{4})

/x

Using traversal API

The traverse module allows handling needed AST nodes using the visitor pattern. In Node the module is exposed as the regexpTree.traverse method. Handlers receive an instance of the NodePath class, which encapsulates node itself, its parent node, property, and index (in case the node is part of a collection).

Visiting a node follows this algorithm:

  • call pre handler.
  • recurse into node's children.
  • call post handler.

For each node type of interest, you can provide either:

  • a function (pre).
  • an object with members pre and post.

You can also provide a * handler which will be executed on every node.

Example:

const regexpTree = require('regexp-tree');

// Get AST.
const ast = regexpTree.parse('/[a-z]{1,}/');

// Traverse AST nodes.
regexpTree.traverse(ast, {

  // Visit every node before any type-specific handlers.
  '*': function({node}) {
    ...
  },

  // Handle "Quantifier" node type.
  Quantifier({node}) {
    ...
  },

  // Handle "Char" node type, before and after.
  Char: {
    pre({node}) {
      ...
    },
    post({node}) {
      ...
    }
  }

});

// Generate the regexp.
const re = regexpTree.generate(ast);

console.log(re); // '/[a-z]+/'

Using transform API

NOTE: you can play with transformation APIs, and write actual transforms for quick tests in AST Explorer. See this example.

While traverse module provides basic traversal API, which can be used for any purposes of AST handling, transform module focuses mainly on transformation of regular expressions.

It accepts a regular expressions in different formats (string, an actual RegExp object, or an AST), applies a set of transformations, and retuns an instance of TransformResult. Handles receive as a parameter the same NodePath object used in traverse.

Example:

const regexpTree = require('regexp-tree');

// Handle nodes.
const re = regexpTree.transform('/[a-z]{1,}/i', {

  /**
   * Handle "Quantifier" node type,
   * transforming `{1,}` quantifier to `+`.
   */
  Quantifier(path) {
    const {node} = path;

    // {1,} -> +
    if (
      node.kind === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      path.replace({
        type: 'Quantifier',
        kind: '+',
        greedy: node.greedy,
      });
    }
  },
});

console.log(re.toString()); // '/[a-z]+/i'
console.log(re.toRegExp()); // /[a-z]+/i
console.log(re.getAST()); // AST for /[a-z]+/i

Transform plugins

A transformation plugin is a module which exports a transformation handler. We have seen above how we can pass a handler object directly to the regexpTree.transform method, here we extract it into a separate module, so it can be implemented and shared independently:

Example of a plugin:

// file: ./regexp-tree-a-to-b-transform.js


/**
 * This plugin replaces chars 'a' with chars 'b'.
 */
module.exports = {
  Char({node}) {
    if (node.kind === 'simple' && node.value === 'a') {
      node.value = 'b';
      node.symbol = 'b';
      node.codePoint = 98;
    }
  },
};

Once we have this plugin ready, we can require it, and pass to the transform function:

const regexpTree = require('regexp-tree');
const plugin = require('./regexp-tree-a-to-b-transform');

const re = regexpTree.transform(/(a|c)a+[a-z]/, plugin);

console.log(re.toRegExp()); // /(b|c)b+[b-z]/

NOTE: we can also pass a list of plugins to the regexpTree.transform. In this case the plugins are applied in one pass in order. Another approach is to run several sequential calls to transform, setting up a pipeline, when a transformed AST is passed further to another plugin, etc.

You can see other examples of transform plugins in the optimizer/transforms or in the compat-transpiler/transforms directories.

Using generator API

The generator module generates regular expressions from corresponding AST nodes. In Node the module is exposed as regexpTree.generate method.

Example:

const regexpTree = require('regexp-tree');

const re = regexpTree.generate({
  type: 'RegExp',
  body: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  flags: 'i',
});

console.log(re); // '/a/i'

Using optimizer API

Optimizer transforms your regexp into an optimized version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.

NOTE: the Optimizer is implemented as a set of regexp-tree plugins.

Example:

const regexpTree = require('regexp-tree');

const originalRe = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/;

const optimizedRe = regexpTree
  .optimize(originalRe)
  .toRegExp();

console.log(optimizedRe); // /\w+e+/

From CLI the optimizer is available via --optimize (-o) option:

regexp-tree-cli -e '/[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/' -o

Result:

Optimized: /\w+e+/

See the optimizer README for more details.

Optimizer ESLint plugin

The optimizer module is also available as an ESLint plugin, which can be installed at: eslint-plugin-optimize-regex.

Using compat-transpiler API

The compat-transpiler module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don't yet implement the new syntax.

NOTE: the compat-transpiler is implemented as a set of regexp-tree plugins.

Example, "dotAll" s flag:

/./s

Is translated into:

/[\0-\uFFFF]/

Or named capturing groups:

/(?<value>a)\k<value>\1/

Becomes:

/(a)\1\1/

To use the API from Node:

const regexpTree = require('regexp-tree');

// Using new syntax.
const originalRe = '/(?<all>.)\\k<all>/s';

// For legacy engines.
const compatTranspiledRe = regexpTree
  .compatTranspile(originalRe)
  .toRegExp();

console.log(compatTranspiledRe); // /([\0-\uFFFF])\1/

From CLI the compat-transpiler is available via --compat (-c) option:

regexp-tree-cli -e '/(?<all>.)\k<all>/s' -c

Result:

Compat: /([\0-\uFFFF])\1/

Compat-transpiler Babel plugin

The compat-transpiler module is also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.

Note, the plugin also includes extended regexp features.

RegExp extensions

Some of the non-standard feature are also supported by regexp-tree.

NOTE: "non-standard" means specifically ECMAScript standard, since in other regexp egnines, e.g. PCRE, Python, etc. these features are standard.

One of such features is the x flag, which enables extended mode of regular expressions. In this mode most of whitespaces are ignored, and expressions can use #-comments.

Example:

/
  # A regular expression for date.

  (?<year>\d{4})-    # year part of a date
  (?<month>\d{2})-   # month part of a date
  (?<day>\d{2})      # day part of a date

/x

This is normally parsed by the regexp-tree parser, and compat-transpiler has full support for it; it's translated into:

/(\d{4})-(\d{2})-(\d{2})/

RegExp extensions Babel plugin

The regexp extensions are also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.

Note, the plugin also includes compat-transpiler features.

Creating RegExp objects

To create an actual RegExp JavaScript object, we can use regexpTree.toRegExp method:

const regexpTree = require('regexp-tree');

const re = regexpTree.toRegExp('/[a-z]/i');

console.log(
  re.test('a'), // true
  re.test('Z'), // true
);

Executing regexes

It is also possible to execute regular expressions using exec API method, which has support for new syntax, and features, such as named capturing group, etc:

const regexpTree = require('regexp-tree');

const re = `/

  # A regular expression for date.

  (?<year>\\d{4})-    # year part of a date
  (?<month>\\d{2})-   # month part of a date
  (?<day>\\d{2})      # day part of a date

/x`;

const string = '2017-04-14';

const result = regexpTree.exec(re, string);

console.log(result.groups); // {year: '2017', month: '04', day: '14'}

Using interpreter API

NOTE: you can read more about implementation details of the interpreter in this series of articles.

In addition to executing regular expressions using JavaScript built-in RegExp engine, RegExp Tree also implements own interpreter based on classic NFA/DFA finite automaton engine.

Currently it aims educational purposes -- to trace the regexp matching process, transitioning in NFA/DFA states. It also allows building state transitioning table, which can be used for custom implementation. In API the module is exposed as fa (finite-automaton) object.

Example:

const {fa} = require('regexp-tree');

const re = /ab|c*/;

console.log(fa.test(re, 'ab')); // true
console.log(fa.test(re, '')); // true
console.log(fa.test(re, 'c')); // true

// NFA, and its transition table.
const nfa = fa.toNFA(re);
console.log(nfa.getTransitionTable());

// DFA, and its transition table.
const dfa = fa.toDFA(re);
console.log(dfa.getTransitionTable());

For more granular work with NFA and DFA, fa module also exposes convenient builders, so you can build NFA fragments directly:

const {fa} = require('regexp-tree');

const {
  alt,
  char,
  or,
  rep,
} = fa.builders;

// ab|c*
const re = or(
  alt(char('a'), char('b')),
  rep(char('c'))
);

console.log(re.matches('ab')); // true
console.log(re.matches('')); // true
console.log(re.matches('c')); // true

// Build DFA from NFA
const {DFA} = fa;

const reDFA = new DFA(re);

console.log(reDFA.matches('ab')); // true
console.log(reDFA.matches('')); // true
console.log(reDFA.matches('c')); // true

Printing NFA/DFA tables

The --table option allows displaying NFA/DFA transition tables. RegExp Tree also applies DFA minimization (using N-equivalence algorithm), and produces the minimal transition table as its final result.

In the example below for the /a|b|c/ regexp, we first obtain the NFA transition table, which is further converted to the original DFA transition table (down from the 10 non-deterministic states to 4 deterministic states), and eventually minimized to the final DFA table (from 4 to only 2 states).

./bin/regexp-tree-cli -e '/a|b|c/' --table all

Result:

> - starting
✓ - accepting

NFA transition table:

┌─────┬───┬───┬────┬─────────────┐
│     │ a │ b │ c  │ ε*          │
├─────┼───┼───┼────┼─────────────┤
│ 1 > │   │   │    │ {1,2,3,7,9} │
├─────┼───┼───┼────┼─────────────┤
│ 2   │   │   │    │ {2,3,7}     │
├─────┼───┼───┼────┼─────────────┤
│ 3   │ 4 │   │    │ 3           │
├─────┼───┼───┼────┼─────────────┤
│ 4   │   │   │    │ {4,5,6}     │
├─────┼───┼───┼────┼─────────────┤
│ 5   │   │   │    │ {5,6}       │
├─────┼───┼───┼────┼─────────────┤
│ 6 ✓ │   │   │    │ 6           │
├─────┼───┼───┼────┼─────────────┤
│ 7   │   │ 8 │    │ 7           │
├─────┼───┼───┼────┼─────────────┤
│ 8   │   │   │    │ {8,5,6}     │
├─────┼───┼───┼────┼─────────────┤
│ 9   │   │   │ 10 │ 9           │
├─────┼───┼───┼────┼─────────────┤
│ 10  │   │   │    │ {10,6}      │
└─────┴───┴───┴────┴─────────────┘


DFA: Original transition table:

┌─────┬───┬───┬───┐
│     │ a │ b │ c │
├─────┼───┼───┼───┤
│ 1 > │ 4 │ 3 │ 2 │
├─────┼───┼───┼───┤
│ 2 ✓ │   │   │   │
├─────┼───┼───┼───┤
│ 3 ✓ │   │   │   │
├─────┼───┼───┼───┤
│ 4 ✓ │   │   │   │
└─────┴───┴───┴───┘


DFA: Minimized transition table:

┌─────┬───┬───┬───┐
│     │ a │ b │ c │
├─────┼───┼───┼───┤
│ 1 > │ 2 │ 2 │ 2 │
├─────┼───┼───┼───┤
│ 2 ✓ │   │   │   │
└─────┴───┴───┴───┘

AST nodes specification

Below are the AST node types for different regular expressions patterns:

Char

A basic building block, single character. Can be escaped, and be of different kinds.

Simple char

Basic non-escaped char in a regexp:

z

Node:

{
  type: 'Char',
  value: 'z',
  symbol: 'z',
  kind: 'simple',
  codePoint: 122
}

NOTE: to test this from CLI, the char should be in an actual regexp -- /z/.

Escaped char
\z

The same value, escaped flag is added:

{
  type: 'Char',
  value: 'z',
  symbol: 'z',
  kind: 'simple',
  codePoint: 122,
  escaped: true
}

Escaping is mostly used with meta symbols:

// Syntax error
*
\*

OK, node:

{
  type: 'Char',
  value: '*',
  symbol: '*',
  kind: 'simple',
  codePoint: 42,
  escaped: true
}
Meta char

A meta character should not be confused with an escaped char.

Example:

\n

Node:

{
  type: 'Char',
  value: '\\n',
  symbol: '\n',
  kind: 'meta',
  codePoint: 10
}

Among other meta character are: ., \f, \r, \n, \t, \v, \0, [\b] (backspace char), \s, \S, \w, \W, \d, \D.

NOTE: Meta characters representing ranges (like ., \s, etc.) have undefined value for symbol and NaN for codePoint.

NOTE: \b and \B are parsed as Assertion node type, not Char.

Control char

A char preceded with \c, e.g. \cx, which stands for CTRL+x:

\cx

Node:

{
  type: 'Char',
  value: '\\cx',
  symbol: undefined,
  kind: 'control',
  codePoint: NaN
}
HEX char-code

A char preceded with \x, followed by a HEX-code, e.g. \x3B (symbol ;):

\x3B

Node:

{
  type: 'Char',
  value: '\\x3B',
  symbol: ';',
  kind: 'hex',
  codePoint: 59
}
Decimal char-code

Char-code:

\42

Node:

{
  type: 'Char',
  value: '\\42',
  symbol: '*',
  kind: 'decimal',
  codePoint: 42
}
Octal char-code

Char-code started with \0, followed by an octal number:

\073

Node:

{
  type: 'Char',
  value: '\\073',
  symbol: ';',
  kind: 'oct',
  codePoint: 59
}
Unicode

Unicode char started with \u, followed by a hex number:

\u003B

Node:

{
  type: 'Char',
  value: '\\u003B',
  symbol: ';',
  kind: 'unicode',
  codePoint: 59
}

When using the u flag, unicode chars can also be represented using \u followed by a hex number between curly braces:

\u{1F680}

Node:

{
  type: 'Char',
  value: '\\u{1F680}',
  symbol: '🚀',
  kind: 'unicode',
  codePoint: 128640
}

When using the u flag, unicode chars can also be represented using a surrogate pair:

\ud83d\ude80

Node:

{
  type: 'Char',
  value: '\\ud83d\\ude80',
  symbol: '🚀',
  kind: 'unicode',
  codePoint: 128640,
  isSurrogatePair: true
}

Character class

Character classes define a set of characters. A set may include as simple characters, as well as character ranges. A class can be positive (any from the characters in the class match), or negative (any but the characters from the class match).

Positive character class

A positive character class is defined between [ and ] brackets:

[a*]

A node:

{
  type: 'CharacterClass',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: '*',
      symbol: '*',
      kind: 'simple',
      codePoint: 42
    }
  ]
}

NOTE: some meta symbols are treated as normal characters in a character class. E.g. * is not a repetition quantifier, but a simple char.

Negative character class

A negative character class is defined between [^ and ] brackets:

[^ab]

An AST node is the same, just negative property is added:

{
  type: 'CharacterClass',
  negative: true,
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    }
  ]
}
Character class ranges

As mentioned, a character class may also contain ranges of symbols:

主要指標

概覽
名稱與所有者DmitrySoshnikov/regexp-tree
主編程語言JavaScript
編程語言JavaScript (語言數: 2)
平台Linux, Mac, Windows
許可證MIT License
所有者活动
創建於2017-03-19 07:47:16
推送於2024-08-13 23:56:40
最后一次提交2024-08-13 16:56:40
發布數111
最新版本名稱v0.1.26 (發布於 2023-04-29 18:56:43)
第一版名稱v0.0.2 (發布於 2017-03-19 21:21:35)
用户参与
星數410
關注者數11
派生數45
提交數620
已啟用問題?
問題數136
打開的問題數46
拉請求數119
打開的拉請求數0
關閉的拉請求數13
项目设置
已啟用Wiki?
已存檔?
是復刻?
已鎖定?
是鏡像?
是私有?