esbulk

Bulk indexing command line tool for elasticsearch.

Fast parallel command line bulk loading utility for elasticsearch. Data is read from a
newline delimited JSON file or stdin and indexed into elasticsearch in bulk
and in parallel. The shortest command would be:

$ esbulk -index my-index-name < file.ldj

Caveat: If indexing pressure on the bulk API is too high (dozens or hundreds of
parallel workers, large batch sizes, depending on your setup), esbulk will halt
and report an error:

$ esbulk -index my-index-name -w 100 file.ldj
2017/01/02 16:25:25 error during bulk operation, try less workers (lower -w value) or
                    increase thread_pool.bulk.queue_size in your nodes

Please note that, in such a case, some documents are indexed and some are not.
Your index will be in an inconsistent state, since there is no transactional
bracket around the indexing process.

However, using defaults (parallelism: number of cores) on a single node setup
will just work. For larger clusters, increase the number of workers until you
see full CPU utilization. After that, more workers won't buy any more speed.

Currently, esbulk is tested against elasticsearch versions 2, 5, 6, 7 and 8
using testcontainers. Originally written for Leipzig University Library,
project finc.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Installation

$ go install github.com/miku/esbulk/cmd/esbulk@latest

For deb or rpm packages, see: https://github.com/miku/esbulk/releases

Usage

$ esbulk -h
Usage of esbulk:
  -0    set the number of replicas to 0 during indexing
  -c string
        create index mappings, settings, aliases, https://is.gd/3zszeu
  -cpuprofile string
        write cpu profile to file
  -id string
        name of field to use as id field, by default ids are autogenerated
  -index string
        index name
  -k    skip insecure certificate verification
  -mapping string
        mapping string or filename to apply before indexing
  -memprofile string
        write heap profile to file
  -optype string
        optype (index - will replace existing data,
                create - will only create a new doc,
                update - create new or update existing data) (default "index")
  -p string
        pipeline to use to preprocess documents
  -purge
        purge any existing index before indexing
  -purge-pause duration
        pause after purge (default 1s)
  -r string
        Refresh interval after import (default "1s")
  -server value
        elasticsearch server, this works with https as well
  -size int
        bulk batch size (default 1000)
  -skipbroken
        skip broken json
  -type string
        elasticsearch doc type (deprecated since ES7)
  -u string
        http basic auth username:password, like curl -u
  -v    prints current program version
  -verbose
        output basic progress
  -w int
        number of workers to use (default 12)
  -z    unzip gz'd file on the fly

To index a JSON file that contains one document
per line, just run:

$ esbulk -index example file.ldj

Where file.ldj is line delimited JSON, like:

{"name": "esbulk", "version": "0.2.4"}
{"name": "estab", "version": "0.1.3"}
...

By default, esbulk will use as many parallel
workers as there are cores. To tweak the indexing
process, adjust the -size and -w parameters.

You can index from gzipped files as well, using
the -z flag:

$ esbulk -z -index example file.ldj.gz

Starting with 0.3.7, the preferred method to set a
non-default server host and port is via -server, e.g.

$ esbulk -server https://0.0.0.0:9201

This way, you can use https as well, which was not
possible before. Options -host and -port are
gone as of esbulk 0.5.0.

Reusing IDs

Since version 0.3.8: If you want to reuse IDs from your documents in elasticsearch, you
can specify the ID field via the -id flag:

$ cat file.json
{"x": "doc-1", "db": "mysql"}
{"x": "doc-2", "db": "mongo"}

Here, we would like to reuse the ID from field x.

$ esbulk -id x -index throwaway -verbose file.json
...

$ curl -s http://localhost:9200/throwaway/_search | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-2",
        "_score": 1,
        "_source": {
          "x": "doc-2",
          "db": "mongo"
        }
      },
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-1",
        "_score": 1,
        "_source": {
          "x": "doc-1",
          "db": "mysql"
        }
      }
    ]
  }
}

Nested ID fields

Version 0.4.3 adds support for nested ID fields:

$ cat fixtures/pr-8-1.json
{"a": {"b": 1}}
{"a": {"b": 2}}
{"a": {"b": 3}}

$ esbulk -index throwaway -id a.b < fixtures/pr-8-1.json
...

Concatenated ID

Version 0.4.3 adds support for IDs that are the concatenation of multiple fields:

$ cat fixtures/pr-8-2.json
{"a": {"b": 1}, "c": "a"}
{"a": {"b": 2}, "c": "b"}
{"a": {"b": 3}, "c": "c"}

$ esbulk -index throwaway -id a.b,c < fixtures/pr-8-2.json
...

      {
        "_index": "xxx",
        "_type": "default",
        "_id": "1a",
        "_score": 1,
        "_source": {
          "a": {
            "b": 1
          },
          "c": "a"
        }
      },

Using X-Pack

Since 0.4.2: support for secured elasticsearch nodes:

$ esbulk -u elastic:changeme -index myindex file.ldj

A similar project has been started for solr, called solrbulk.

Measurements

$ csvlook -I measurements.csv
| es    | esbulk | docs      | avg_b | nodes | cores | total_heap_gb | t_s   | docs_per_s | repl |
|-------|--------|-----------|-------|-------|-------|---------------|-------|------------|------|
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 32    |  64           |  6420 |  22100     | 1    |
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     |  8    |  30           | 27360 |   5100     | 1    |
| 6.1.2 | 0.4.8  |   1000000 | 2000  | 1     |  4    |   1           |   300 |   3300     | 1    |
| 6.1.2 | 0.4.8  |  10000000 |   26  | 1     |  4    |   8           |   122 |  81000     | 1    |
| 6.1.2 | 0.4.8  |  10000000 |   26  | 1     | 32    |  64           |    32 | 307000     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 26253 |   5444     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 11113 |  12831     | 0    |
| 6.2.3 | 0.4.13 |  15000000 | 6000  | 2     | 64    | 128           |  2460 |   6400     | 0    |

Why not add a row?

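Key metrics

Overview
Name and owner: miku/esbulk
Primary language: Go
Languages: Makefile (number of languages: 5)
License: GNU General Public License v3.0

Owner activity
Created: 2014-08-26 20:50:22
Pushed: 2025-09-07 21:33:43
Last commit: 2025-09-07 23:31:33
Releases: 57
Latest release: v0.7.25
First release: v0.2.0

User engagement
Stars: 282
Watchers: 13
Forks: 41
Commits: 338
Issues: 31
Open issues: 9
Pull requests: 13
Open pull requests: 0
Closed pull requests: 2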