MyHTML

Fast C/C++ HTML 5 parser. Uses threads.

  • Owner: lexborisov/myhtml
  • Platforms: Linux, Mac, Windows
  • License: GNU Lesser General Public License v2.1

MyHTML — a pure C HTML parser

MyHTML is a fast, multithreaded HTML parser implemented as a pure C99 library with no external dependencies.

Now

Important announcement!

Please use the HTML parser from the Lexbor project instead. It is stable, has more features, and is very fast.

This repository will go into read-only mode on 2020-05-01.

Features

  • Asynchronous parsing, tree building, and indexing
  • Fully conformant with the HTML5 specification
  • Two APIs: high-level and low-level
  • Manipulation of elements: add, change, delete, and more
  • Manipulation of element attributes: add, change, delete, and more
  • Supports 39 character encodings from the encoding.spec.whatwg.org specification
  • Supports character encoding detection
  • Supports Single Mode parsing
  • Can be built without POSIX Threads
  • Supports fragment parsing
  • Supports parsing by chunks
  • No outside dependencies
  • C99 support
  • Passes all tree construction tests from html5lib-tests
  • Tested on 1 billion HTML pages (from commoncrawl.org)

Changes

See the CHANGELOG.md file.

Further developments

  • Modest: a fast HTML renderer implemented as a pure C99 library with no outside dependencies
  • MyCSS: a fast C/C++ CSS parser (Cascading Style Sheets parser)

Supported encodings for InputStream

X_USER_DEFINED, UTF_8, UTF_16LE, UTF_16BE, BIG5, EUC_KR, GB18030,
IBM866, ISO_8859_10, ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16, ISO_8859_2, ISO_8859_3,
ISO_8859_4, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, KOI8_R, KOI8_U, MACINTOSH,
WINDOWS_1250, WINDOWS_1251, WINDOWS_1252, WINDOWS_1253, WINDOWS_1254, WINDOWS_1255, WINDOWS_1256,
WINDOWS_1257, WINDOWS_1258, WINDOWS_874, X_MAC_CYRILLIC, ISO_2022_JP, GBK, SHIFT_JIS, EUC_JP, ISO_8859_8_I

Supported encodings for output

The library works in UTF-8 and returns all output in UTF-8.

Detecting character encodings

Detection currently covers UTF-8, UTF-16LE, and UTF-16BE, plus the Russian encodings windows-1251, koi8-r, iso-8859-5, x-mac-cyrillic, and ibm866.
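The README does not describe how detection works. As a rough illustration of the simplest case only, the sketch below checks a buffer for a byte order mark (BOM), which can identify UTF-8, UTF-16LE, and UTF-16BE. It is not MyHTML's implementation: the names `sketch_encoding_t` and `sketch_detect_bom` are hypothetical, and a BOM check cannot identify the single-byte Cyrillic encodings in the list above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not MyHTML's own encoding type. */
typedef enum {
    SKETCH_ENC_UNKNOWN = 0,
    SKETCH_ENC_UTF_8,
    SKETCH_ENC_UTF_16LE,
    SKETCH_ENC_UTF_16BE
} sketch_encoding_t;

/*
 * Look for a byte order mark at the start of buf.
 * If bom_len is non-NULL it receives the number of BOM bytes found
 * (0 when there is none), so the caller can skip them before parsing.
 */
sketch_encoding_t sketch_detect_bom(const unsigned char *buf, size_t len,
                                    size_t *bom_len)
{
    sketch_encoding_t enc = SKETCH_ENC_UNKNOWN;
    size_t skip = 0;

    /* A NULL buffer is only acceptable when it is also empty. */
    assert(buf != NULL || len == 0);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        enc = SKETCH_ENC_UTF_8;      /* EF BB BF */
        skip = 3;
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        enc = SKETCH_ENC_UTF_16LE;   /* FF FE */
        skip = 2;
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        enc = SKETCH_ENC_UTF_16BE;   /* FE FF */
        skip = 2;
    }

    if (bom_len != NULL) {
        *bom_len = skip;
    }
    return enc;
}
```

A caller would pass the first bytes of the input, advance the pointer by the reported BOM length, and fall back to another detection method (or a default encoding) when the result is `SKETCH_ENC_UNKNOWN`.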

Installation

See INSTALL.md

Introduction

Benchmark

Dependencies

None

External Bindings and Wrappers

Examples

See the examples directory.

Simple example

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <myhtml/api.h>

int main(int argc, const char * argv[])
{
    char html[] = "<div><span>HTML</span></div>";
    
    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    
    // first tree init
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    
    // parse html
    myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));
    
    // print result
    // or see serialization function with callback: myhtml_serialization_tree_callback
    mycore_string_raw_t str = {0};
    myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str);
    printf("%s\n", str.data);
    
    // release resources
    mycore_string_raw_destroy(&str, false);
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
    
    return 0;
}

AUTHOR

Alexander Borisov lex.borisov@gmail.com

Copyright (C) 2015-2018 Alexander Borisov

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

See the LICENSE file.

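Key metrics

Overview
  • Name and owner: lexborisov/myhtml
  • Primary language: C
  • Languages: C (3 languages in total)
  • Platforms: Linux, Mac, Windows
  • License: GNU Lesser General Public License v2.1

Owner activity
  • Created: 2015-11-10 01:40:13
  • Last pushed: 2025-01-15 17:01:14
  • Last commit: 2025-01-15 20:01:14
  • Releases: 14
  • Latest release: v4.0.5
  • First release: v1.0.1

User engagement
  • Stars: 1.7k
  • Watchers: 91
  • Forks: 153
  • Commits: 394
  • Issues: 138
  • Open issues: 20
  • Pull requests: 51
  • Open pull requests: 0
  • Closed pull requests: 6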