floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.

Usage

Take this HTML as an example:

<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <span class="headline">Enables search using CSS selectors</span>
    <a href="https://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>

Here are some queries that you can perform (with return examples):

{:ok, document} = Floki.parse_document(html)

Floki.find(document, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]

document
|> Floki.find("p.headline")
|> Floki.raw_html
# => <p class="headline">Floki</p>

Each HTML node is represented by a tuple like:

{tag_name, attributes, children_nodes}

Example of node:

{"p", [{"class", "headline"}], ["Floki"]}

So even if the only child node is the element text, it is represented inside a list.
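Because nodes are plain tuples, they can be taken apart with ordinary Elixir pattern matching. The following is a minimal sketch using only the standard library and the example node shown above; the variable names are illustrative and not part of Floki.

```elixir
# A node in the shape Floki returns: {tag_name, attributes, children_nodes}
node = {"p", [{"class", "headline"}], ["Floki"]}

# Destructure the tuple with a regular pattern match.
{tag_name, attributes, children} = node

# Attributes are a list of {name, value} tuples, so List.keyfind/3 works on them.
{"class", class} = List.keyfind(attributes, "class", 0)

# Children can mix binaries (text) and nested node tuples; keep only the text.
text_children = Enum.filter(children, &is_binary/1)

IO.inspect({tag_name, class, text_children})
# => {"p", "headline", ["Floki"]}
```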

Installation

Add Floki to your mix.exs:

defp deps do
  [
    {:floki, "~> 0.25.0"}
  ]
end

After that, run mix deps.get.

Dependencies

Floki needs the leex module in order to compile.
Normally this module is installed with Erlang in a complete installation.

If compilation fails because the leex module is missing,
you need to install the erlang-dev and erlang-parsetools packages in order to get the leex module.
The package names may be different depending on your OS.

Alternative HTML parsers

By default Floki uses a patched version of mochiweb_html for parsing fragments
due to its ease of installation (it's written in Erlang and has no outside dependencies).

However one might want to use an alternative parser due to the following
concerns:

Performance - It can be up to 20 times slower than the alternatives on big HTML
documents.
Correctness - In some cases mochiweb_html will produce different results
from what is specified in the HTML5 specification (https://html.spec.whatwg.org/).
For example, a correct parser would parse <title> <b> bold </b> text </title>
as {"title", [], [" <b> bold </b> text "]} since content inside <title> is
to be treated as plaintext.
However, mochiweb_html would parse it as {"title", [], [{"b", [], [" bold "]}, " text "]}.

Floki supports the following alternative parsers:

fast_html - A wrapper for lexborisov's myhtml, a pure C HTML parser.
html5ever - A wrapper for html5ever written in Rust, developed as a part of the Servo project.

fast_html is generally faster, according to the
benchmarks conducted by
its developers, though html5ever does have an advantage on really small
(~4kb) fragments because it is implemented as a NIF.

Using `html5ever` as the HTML parser

Rust needs to be installed on the system in order to compile html5ever. To do that, please
follow the instructions on the official Rust page.

After Rust is set up, you need to add html5ever NIF to your dependency list:

defp deps do
  [
    {:floki, "~> 0.25.0"},
    {:html5ever, "~> 0.7.0"}
  ]
end

Run mix deps.get and compile the project with mix compile to make sure it works.

Then you need to configure your app to use html5ever:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.Html5ever

For more info, check the article Rustler - Safe Erlang and Elixir NIFs in Rust.

Using `fast_html` as the HTML parser

A C compiler and GNU Make need to be installed on the system in order to
compile myhtml. It's likely that your machine has them already.

Note that epmd must also be running, or available to start, because fast_html relies on a
C-Node worker. Usually it is started automatically, but some distributions
(e.g. Gentoo Linux) only allow it to be started as a service.

First, add fast_html to your dependencies:

defp deps do
  [
    {:floki, "~> 0.25.0"},
    {:fast_html, "~> 1.0"}
  ]
end

Run mix deps.get and compile the project with mix compile to make sure it works.

Then you need to configure your app to use fast_html:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.FastHtml

More about Floki API

To parse an HTML document, try:

html = """
  <html>
  <body>
    <div class="example"></div>
  </body>
  </html>
"""

{:ok, document} = Floki.parse_document(html)
# => {:ok, [{"html", [], [{"body", [], [{"div", [{"class", "example"}], []}]}]}]}

To find elements with the class example, try:

Floki.find(document, ".example")
# => [{"div", [{"class", "example"}], []}]

To convert your node tree back to raw HTML (spaces are ignored):

document
|> Floki.find(".example")
|> Floki.raw_html
# => <div class="example"></div>

To fetch some attribute from elements, try:

Floki.attribute(document, ".example", "class")
# => ["example"]

You can get attributes from elements that you already have:

document
|> Floki.find(".example")
|> Floki.attribute("class")
# => ["example"]
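The same call works for any attribute, not only class. As a sketch, the script below collects every link target from a trimmed copy of the HTML in the Usage section. It assumes Elixir 1.12 or later so that Mix.install/1 is available in a standalone script, and the version requirement is an illustrative choice rather than a recommendation from the Floki authors.

```elixir
# Assumption: run as a script (elixir links.exs) with network access,
# so Mix.install/1 can fetch Floki.
Mix.install([{:floki, "~> 0.25"}])

html = """
<html>
<body>
  <section id="content">
    <a href="https://github.com/philss/floki">Github page</a>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>
"""

{:ok, document} = Floki.parse_document(html)

# Select every <a> element and read its href attribute.
hrefs = Floki.attribute(document, "a", "href")

IO.inspect(hrefs)
# => ["https://github.com/philss/floki", "https://hex.pm/packages/floki"]
```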

If you want to get the text from an element, try (using the document parsed from the HTML in the Usage section):

document
|> Floki.find("p.headline")
|> Floki.text

# => "Floki"
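Selectors can be more specific than a tag or a class. The sketch below runs an attribute selector and a child combinator against the HTML from the Usage section, using only Floki.find, Floki.text and Floki.attribute. As before, Mix.install/1 (Elixir 1.12 or later) and the version requirement are assumptions made to keep the script self-contained.

```elixir
# Assumption: run as a standalone script with network access.
Mix.install([{:floki, "~> 0.25"}])

html = """
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <span class="headline">Enables search using CSS selectors</span>
    <a href="https://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>
"""

{:ok, document} = Floki.parse_document(html)

# Attribute selector: the span whose data-model attribute equals "user".
user =
  document
  |> Floki.find(~s(span[data-model="user"]))
  |> Floki.text()

# Child combinator: only the link that is a direct child of #content.
inner_links = Floki.attribute(document, "#content > a", "href")

IO.inspect(user)
# => "philss"
IO.inspect(inner_links)
# => ["https://github.com/philss/floki"]
```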

Supported selectors

Here you find all the CSS selectors supported in the current version:

| Pattern | Description |
|-----------------|------------------------------|
| * | any element |
| E | an element of type E |
| E[foo] | an E element with a "foo" attribute |
| E[foo="bar"] | an E element whose "foo" attribute value is exactly equal to "bar" |
| E[foo~="bar"] | an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar" |
| E[foo^="bar"] | an E element whose "foo" attribute value begins exactly with the string "bar" |
| E[foo$="bar"] | an E element whose "foo" attribute value ends exactly with the string "bar" |
| E[foo*="bar"] | an E element whose "foo" attribute value contains the substring "bar" |
| E[foo\|="en"] | an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en" |
| E:nth-child(n) | an E element, the n-th child of its parent |
| E:first-child | an E element, first child of its parent |
| E:last-child | an E element, last child of its parent |
| E:nth-of-type(n) | an E element, the n-th child of its type among its siblings |
| E:first-of-type | an E element, first child of its type among its siblings |
| E:last-of-type | an E element, last child of its type among its siblings |
| E.warning | an E element whose class is "warning" |
| E#myid | an E element with ID equal to "myid" |
| E:not(s) | an E element that does not match simple selector s |
| E F | an F element descendant of an E element |
| E > F | an F element child of an E element |
| E + F | an F element immediately preceded by an E element |
| E ~ F | an F element preceded by an E element |

There are also some selectors based on non-standard specifications. They are:

| Pattern | Description |
|----------------------|-----------------------------------------------------|
| E:fl-contains('foo') | an E element that contains "foo" inside a text node |

Special thanks

@arasatasaygin for Floki's logo from the Open Logos project.

License

Floki is released under the MIT license. Check the LICENSE file for more details.

Name and owner	philss/floki
Primary language	Elixir
Languages	Elixir (3 languages)
License	MIT License

Created	2014-11-03 04:49:15
Last pushed	2025-10-30 16:02:08
Releases	79
Latest release	v0.38.0
First release	v0.0.3 (released 2014-11-09 13:32:42)

Stars	2.1k
Watchers	22
Forks	161
Commits	826
Issues	182
Open issues	18
Pull requests	412
Open pull requests	6
Closed pull requests	42
