HTML Table Extractor

HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.

Important links

Repository: https://github.com/yuanxu-li/html-table-extractor
Issues: https://github.com/yuanxu-li/html-table-extractor/issues

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

The last call returns the following (shown as displayed in a Python 2 session; Python 3 omits the u prefix):

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

The last call returns:

[[1, 2], [3, 4]]
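
Real tables rarely contain only clean integers, so a custom callable can be more forgiving than int. The following is a minimal sketch, assuming the transformer is called once per cell with that cell's text, as Example 2 suggests. The name to_number is hypothetical and not part of the library.

```python
def to_number(text):
    """Convert cell text to int or float; fall back to the stripped string."""
    # Remove surrounding whitespace and thousands separators before casting.
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    # Not numeric: return the text unchanged apart from stripping.
    return text.strip()


cells = ["1", " 2.5 ", "1,200", "n/a"]
print([to_number(c) for c in cells])  # [1, 2.5, 1200, 'n/a']
```

Under the same assumption, it could be passed as Extractor(table_doc, transformer=to_number) in place of int.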

Example 3 - Pass BS4 Tag

from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

The last call returns:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

The last call returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

The last call returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
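
The outputs of Examples 4 and 5 are consistent with a simple rule: a spanning cell is copied into every grid position it covers, and a position that is already filled keeps its earlier value. The following is a minimal sketch of that rule, not the library's actual implementation. The function expand_spans and its input format (rows of (text, rowspan, colspan) tuples) are assumptions made for illustration.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a list of lists."""
    grid = {}  # maps (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # setdefault keeps an existing value, so earlier cells win.
                    grid.setdefault((r + dr, c + dc), text)
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are reported as None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# The table from Example 5, written as (text, rowspan, colspan) tuples.
conflicted = [
    [("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
    [("4", 1, 2)],
    [("5", 1, 2)],
]
print(expand_spans(conflicted))
# [['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
```

In the second row, the cell 4 asks for two columns, but the third column is already held by the rowspan of 3, so only one copy of 4 is placed. The same function reproduces the Example 4 result when given that table's cells.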

Example 6 - Write to file

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

This creates a new CSV file called output.csv in the given path, with the following contents:

1,2
3,4
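
When a different file name, delimiter, or destination is needed, the list returned by return_list() can also be written with the standard csv module. The following is a minimal sketch using only the standard library. The helper rows_to_csv is hypothetical and is not how write_to_csv is necessarily implemented.

```python
import csv
import io


def rows_to_csv(rows, stream):
    """Write a list of row lists to an open text stream as CSV."""
    # Use "\n" line endings so the output is the same on every platform.
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# An in-memory buffer stands in for a file here.
buffer = io.StringIO()
rows_to_csv([["1", "2"], ["3", "4"]], buffer)
print(buffer.getvalue())
# 1,2
# 3,4
```

To write to disk instead, open the target with open(filename, "w", newline="") and pass the file object as stream.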

Team

@yuanxu-li

Errors/Bugs

If something is not working correctly, or if you have suggestions for improvements, report it on the issue tracker: https://github.com/yuanxu-li/html-table-extractor/issues

Copyright

Third-party copyright in this distribution is noted where applicable.

Name With Owner	yuanxu-li/html-table-extractor
Primary Language	Python
Program language	Python (Language Count: 1)
License:	MIT License

Created At	2017-04-10 22:04:42
Pushed At	2020-05-01 18:40:12
Last Commit At	2020-05-01 00:00:09
Release Count	8
Last Release Name	v1.4.1
First Release Name	v1.0.0

Stargazers Count	87
Watchers Count	3
Fork Count	22
Commits Count	47
Issues Count	16
Issue Open Count	6
Pull Requests Count	6
Pull Requests Open Count	1
Pull Requests Close Count	0

