Cheatsheets¶
Development URLs¶
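Assuming the default local hosts and ports used elsewhere on this page:
- Elasticsearch: https://localhost:9200 (e.g. https://localhost:9200/?pretty)
- Kibana: https://localhost:5601
- Sense: https://localhost:5601/app/sense
- Site plugins: https://localhost:9200/_plugin/<plugin name> (e.g. _plugin/head)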
INSTALL¶
- Install curl
- Install Java
- Download ElasticSearch
- Optionally change the cluster.name in the elasticsearch.yml configuration
cd elasticsearch-<version>
./bin/elasticsearch -d
# or on Windows
# bin\elasticsearch.bat
curl 'https://localhost:9200/?pretty'
- Install Kibana
- Open config/kibana.yml in an editor
- Set the elasticsearch.url to point at your Elasticsearch instance
- Run ./bin/kibana (or bin\kibana.bat on Windows)
- Point your browser at https://localhost:5601
- Install Sense as a Kibana plugin (see the sketch below), then go to https://localhost:5601/app/sense
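A minimal install sketch, assuming the Kibana-4-era plugin name elastic/sense and that it is run from the Kibana directory:
./bin/kibana plugin --install elastic/sense
# or on Windows
# bin\kibana.bat plugin --install elastic/sense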
CURL¶
The verb is GET, POST, PUT, HEAD, or DELETE.
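Every request follows the same template (placeholders in angle brackets):
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'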
Examples¶
curl -XGET <id>.us-west-2.es.amazonaws.com
curl -XGET 'https://<id>.us-west-2.es.amazonaws.com/_count?pretty' -d '{ "query": { "match_all": {} } }'
curl -XPUT https://<id>.us-west-2.es.amazonaws.com/movies/movie/tt0116996 -d '{"directors" : ["Tim Burton"],"genres" : ["Comedy","Sci-Fi"], "plot": "The Earth is invaded by Martians with irresistible weapons and a cruel sense of humor.", "title" : "Mars Attacks!", "actors" :["Jack Nicholson","Pierce Brosnan","Sarah Jessica Parker"], "year" : 1996}'
Sense¶
Sense syntax is similar to curl, minus the host, quoting, and -X flags. For example, index a document and retrieve it:
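(This sketch reuses the megacorp employee document from the BASICS section below.)

PUT /megacorp/employee/1
{ "first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}

GET /megacorp/employee/1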
PLUGINS¶
URL pattern¶
https://yournode:9200/_plugin/<plugin name>
On Debian, the script is in /usr/share/elasticsearch/bin/plugin.
Install various plugins¶
./bin/plugin --install mobz/elasticsearch-head
./bin/plugin --install lmenezes/elasticsearch-kopf/1.2
./bin/plugin --install elasticsearch/marvel/latest
Remove a plugin¶
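A sketch with the 1.x-era plugin script (the argument is the plugin's short name, e.g. head; assumed here):
./bin/plugin --remove head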
List installed plugins¶
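With the same script:
./bin/plugin --list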
Elasticsearch monitoring and management plugins
Head¶
elasticsearch/bin/plugin --install mobz/elasticsearch-head
- open https://localhost:9200/_plugin/head
BigDesk¶
Live charts and statistics for elasticsearch cluster: BigDesk
Kopf¶
Marvel¶
Integrations (CMS, import/export, hadoop...)¶
Aspire¶
Aspire is a framework and library of extensible components for building solutions that acquire data from one or more content repositories (such as file systems, relational databases, cloud storage, or content management systems), extract metadata and text from the documents, analyze, modify, and enhance the content and metadata as needed, and then publish each document, together with its metadata, to a search engine or other target application.
Integration with Hadoop¶
Bulk loading for Elasticsearch: https://infochimps.com
Integration with Spring¶
WordPress¶
TOOLS¶
BI platforms that can use ES as an analytics engine:
- Kibana
- Grafana
- BIRT
- Adminer (Adminer.org): database management in a single PHP file. Works with MySQL, PostgreSQL, SQLite, MS SQL, Oracle, SimpleDB, Elasticsearch, MongoDB. Needs a web server + PHP: WAMP
- Mongolastic: a tool that migrates data from MongoDB to Elasticsearch and vice versa
- Elasticsearch-exporter
Code Examples - developing a Web UI for ES¶
Java API¶
BASICS¶
An Elasticsearch cluster can contain multiple indices, which in turn contain multiple types. These types hold multiple documents, and each document has multiple fields.
Explore (using Sense)¶
GET _stats/
# List indices
# Get info about one index
The available features are _settings, _mappings, _warmers and _aliases
# cluster
# insert data
# search
# Data schema
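Sketches of the requests behind the comments above (the megacorp index and employee type come from the examples later on this page; treat them as placeholders):

# list indices
GET /_cat/indices?v

# get info about one index
GET /megacorp

# get only selected features of an index
GET /megacorp/_settings,_mappings

# cluster health
GET /_cluster/health

# insert data
PUT /megacorp/employee/1
{ "first_name" : "John", "last_name" : "Smith" }

# search
GET /megacorp/employee/_search?q=last_name:Smith

# data schema
GET /megacorp/_mapping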
INSERT DOCUMENTS¶
PUT /index/type/ID
PUT /megacorp/employee/1
{ "first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}
PUT /megacorp/employee/2
{ "first_name" : "Jane", "last_name" : "Smith", "age" : 32, "about" : "I like to collect rock albums", "interests": [ "music" ]}
GET /megacorp/employee/1
Field names can be any valid string, but may not include periods. Every document in Elasticsearch has a version number. Every time a change is made to a document (including deleting it), the _version number is incremented.
Optimistic concurrency control¶
PUT /website/blog/1?version=1 { "title": "My first blog entry", "text": "Starting to get the hang of this..."}
We want this update to succeed only if the current _version of this document in our index is 1.
External version:
PUT /website/blog/2?version=5&version_type=external { "title": "My first external blog entry", "text": "Starting to get the hang of this..."}
INSERT DOCUMENTS - AUTOGENERATED IDS¶
POST /website/blog/
{
"title": "My second blog entry",
"text": "Still trying this out...",
"date": "2014/01/01"
}
Response:
{
"_index": "website",
"_type": "blog",
"_id": "AVFgSgVHUP18jI2wRx0w",
"_version": 1,
"created": true
}
# creating an entirely new document and not overwriting an existing one
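Two equivalent sketches that create a document with an explicit ID and fail if it already exists:

PUT /website/blog/123?op_type=create
{ "title": "My first blog entry" }

PUT /website/blog/123/_create
{ "title": "My first blog entry" }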
RETRIEVE DOCUMENTS¶
{ "_index" : "website", "_type" : "blog", "_id" : "123", "_version" : 1, "found" : true, "_source" : { "title": "My first blog entry", "text": "Just trying this out...", "date": "2014/01/01" }}
# Contains just the fields that we requested
# Just get the original doc
# check if doc exists -- HTTP 200 or 404
# Note: HEAD/exists requests do not work in Sense
# because they only return HTTP headers, not a JSON body
# multiple docs at once
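Sketches for the snippets above (the document IDs and field selections are illustrative):

# the request behind the response shown above
GET /website/blog/123?pretty

# contains just the fields that we requested
GET /website/blog/123?_source=title,text

# just get the original doc, without metadata
GET /website/blog/123/_source

# check if doc exists -- HTTP 200 or 404 (run outside Sense)
curl -i -XHEAD 'https://localhost:9200/website/blog/123'

# multiple docs at once
GET /_mget
{ "docs": [
  { "_index": "website", "_type": "blog", "_id": 2 },
  { "_index": "website", "_type": "blog", "_id": 1, "_source": "title" }
]}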
UPDATE¶
Documents in Elasticsearch are immutable; we cannot change them. Instead, if we need to update an existing document, we reindex or replace it.
# Accepts a partial document as the doc parameter, which just gets merged with the existing document.
# Script
# script with parameters
POST /website/blog/1/_update
{ "script" : "ctx._source.tags+=new_tag", "params" : { "new_tag" : "search" }}
# upsert
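Sketches for the remaining cases above (field names like views and tags are illustrative):

# partial document merge
POST /website/blog/1/_update
{ "doc": { "tags": [ "testing" ], "views": 0 } }

# simple script
POST /website/blog/1/_update
{ "script": "ctx._source.views+=1" }

# upsert: run the script if the doc exists, otherwise index the upsert document
POST /website/blog/1/_update
{ "script": "ctx._source.views+=1", "upsert": { "views": 1 } }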
DELETE¶
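The basic form deletes a document by ID:
DELETE /website/blog/123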
# delete doc based on its contents
POST /website/blog/1/_update { "script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'", "params" : { "count": 1 }}
BULK¶
POST /_bulk
{"delete":{"_index":"website","_type":"blog","_id":"123"}}
{"create":{"_index":"website","_type":"blog","_id":"123"}} # Create a document only if the document does not already exist
{"title":"My first blog post"}
{"index":{"_index":"website","_type":"blog"}}
{"title":"My second blog post"}
{"update":{"_index":"website","_type":"blog","_id":"123","_retry_on_conflict":3}}
{"doc":{"title":"My updated blog post"}}
Bulk in the same index or index/type
POST /website/_bulk
{"index":{"_type":"log"}}
{"event":"User logged in"}
{"index":{"_type":"blog"}}
{"title":"My second blog post"}
Keep each bulk request to around 5-15 MB in size.
SEARCH¶
Every field in a document is indexed and can be queried.
# Search for all employees in the megacorp index:
# Search for all employees in the megacorp index
# who have "Smith" in the last_name field
# Same query as above, but using the Query DSL
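Sketches for the three searches above:

GET /megacorp/employee/_search

GET /megacorp/employee/_search?q=last_name:Smith

GET /megacorp/employee/_search
{
  "query": {
    "match": { "last_name": "Smith" }
  }
}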
# SEARCH QUERY STRING
Don't forget to URL-encode special characters, e.g. +name:john +tweet:mary.
The + prefix indicates conditions that must be satisfied for our query to match. Similarly, a - prefix would indicate conditions that must not match. All conditions without a + or - are optional.
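For example (%2B is the URL-encoded +):
GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary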
QUERY DSL¶
When used in filtering context, the query is said to be a "non-scoring" or "filtering" query. That is, the query simply asks the question: "Does this document match?". The answer is always a simple, binary yes|no. When used in a querying context, the query becomes a "scoring" query.
# Find all employees whose `last_name` is Smith
# and who are older than 30
GET /megacorp/employee/_search
{
"query" : {
"filtered" : {
"filter" : {
"range" : {
"age" : { "gt" : 30 }
}
},
"query" : {
"match" : {
"last_name" : "smith"
}
}
}
}
}
MATCH¶
# Find all employees who enjoy "rock" or "climbing"
The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field. If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search. If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value.
MATCH ON MULTIPLE FIELDS¶
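A sketch with multi_match (the title and body fields are placeholders, not from this page):

GET /_search
{
  "query": {
    "multi_match": {
      "query":  "full text search",
      "fields": [ "title", "body" ]
    }
  }
}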
EXACT SEARCH¶
# Find all employees who enjoy "rock climbing"
# EXACT VALUES
The term query is used to search by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields.
The terms query is the same as the term query, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches.
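Sketches of the two (field names and values are illustrative):

GET /_search
{ "query": { "term": { "age": 26 } } }

GET /_search
{ "query": { "terms": { "tag": [ "search", "full_text", "nosql" ] } } }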
# Compound Queries
{ "bool": { "must": { "match": { "tweet": "elasticsearch" }}, "must_not": { "match": { "name": "mary" }}, "should": { "match": { "tweet": "full text" }}, "filter": { "range": { "age" : { "gt" : 30 }} } } }
# VALIDATE A QUERY
GET /gb/tweet/_validate/query?explain { "query": { "tweet" : { "match" : "really powerful" } }}
# understand why one particular document matched or, more important, why it didn’t match
GET /us/tweet/12/_explain { "query" : { "bool" : { "filter" : { "term" : { "user_id" : 2 }}, "must" : { "match" : { "tweet" : "honeymoon" }} } }}
MULTIPLE INDICES OR TYPES¶
- /_search : all documents in all indices
- /gb,us/_search : all types in the gb and us indices
- /g*,u*/_search : all types in any indices beginning with g or beginning with u
- /gb/user/_search : type user in the gb index
- /gb,us/user,tweet/_search : types user and tweet in the gb and us indices
- /_all/user,tweet/_search : types user and tweet in all indices
PAGINATION¶
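A sketch showing pages of five results (from defaults to 0):

GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10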
SORTING¶
GET /_search { "query" : { "bool" : { "filter" : { "term" : { "user_id" : 1 }} } }, "sort": { "date": { "order": "desc" }}}
For string sorting, use multi-field mapping:
"tweet": { "type": "string", "analyzer": "english", "fields": { "raw": {"type": "string", "index": "not_analyzed" } }}
The main tweet field is just the same as before: an analyzed full-text field. The new tweet.raw subfield is not_analyzed.
Then sort on the new field:
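For example, query the analyzed tweet field but sort on the not_analyzed tweet.raw subfield:

GET /_search
{
  "query": { "match": { "tweet": "elasticsearch" }},
  "sort":  "tweet.raw"
}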
HIGHLIGHTS¶
Find all employees who enjoy "rock climbing" - and highlight the matches
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
ANALYSIS¶
An analyzer is really just a wrapper that combines three functions into a single package:
- Character filters
- Tokenizer
- Token filters
See how text is analyzed¶
test analyzer¶
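A sketch using the analyze API (1.x/2.x-era syntax, with the text in the request body; the sample text is illustrative):

GET /_analyze?analyzer=standard
Text to analyze

# test the analyzer configured for a specific field
GET /gb/_analyze?field=tweet
Black-cats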
MAPPINGS (schemas)¶
Every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype for each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.
You can control the dynamic nature of mappings.
Mapping (or schema definition) for the tweet type in the gb index:
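Retrieve it with the _mapping endpoint:
GET /gb/_mapping/tweet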
Elasticsearch supports the following simple field types:
- String: string
- Whole number: byte, short, integer, long
- Floating-point: float, double
- Boolean: boolean
- Date: date
Fields of type string are, by default, considered to contain full text. That is, their value will be passed through an analyzer before being indexed, and a full-text query on the field will pass the query string through an analyzer before searching. The two most important mapping attributes for string fields are index and analyzer.
The index attribute controls how the string will be indexed. It can contain one of three values:
- analyzed: First analyze the string and then index it. In other words, index this field as full text.
- not_analyzed: Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.
- no: Don't index this field at all. This field will not be searchable.
If we want to map the field as an exact value, we need to set it to not_analyzed:
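For example, in a properties block (the tag field is illustrative):

"tag": {
  "type":  "string",
  "index": "not_analyzed"
}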
For analyzed string fields, use the analyzer attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:
Create a new index, specifying that the tweet field should use the english analyzer¶
PUT /gb
{ "mappings":
{ "tweet" :
{ "properties" : {
"tweet" : { "type" : "string", "analyzer": "english" },
"date" : { "type" : "date" },
"name" : { "type" : "string" },
"user_id" : { "type" : "long" }
}}}}
null, arrays, objects: see complex core fields
Parent Child Relationships¶
DELETE /test_index
PUT /test_index
{
"mappings": {
"parent_type": {
"properties": {
"num_prop": {
"type": "integer"
},
"str_prop": {
"type": "string"
}
}
},
"child_type": {
"_parent": {
"type": "parent_type"
},
"properties": {
"child_num": {
"type": "integer"
},
"child_str": {
"type": "string"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_type":"parent_type","_id":1}}
{"num_prop":1,"str_prop":"hello"}
{"index":{"_type":"child_type","_id":1,"_parent":1}}
{"child_num":11,"child_str":"foo"}
{"index":{"_type":"child_type","_id":2,"_parent":1}}
{"child_num":12,"child_str":"bar"}
{"index":{"_type":"parent_type","_id":2}}
{"num_prop":2,"str_prop":"goodbye"}
{"index":{"_type":"child_type","_id":3,"_parent":2}}
{"child_num":21,"child_str":"baz"}
POST /test_index/child_type/_search
POST /test_index/child_type/2?parent=1
{
"child_num": 13,
"child_str": "bars"
}
POST /test_index/child_type/_search
POST /test_index/child_type/3/_update?parent=2
{
"script": "ctx._source.child_num+=1"
}
POST /test_index/child_type/_search
POST /test_index/child_type/_search
{
"query": {
"term": {
"child_str": {
"value": "foo"
}
}
}
}
POST /test_index/parent_type/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"has_child": {
"type": "child_type",
"filter": {
"term": {
"child_str": "foo"
}
}
}
}
}
}
}
AGGREGATES¶
Aggregations and searches can span multiple indices
Calculate the most popular interests for all employees¶
GET /megacorp/employee/_search
{
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
Calculate the most popular interests for employees named "Smith"¶
GET /megacorp/employee/_search
{
"query": {
"match": {
"last_name": "smith"
}
},
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
Calculate the average age of employee per interest - hierarchical aggregates¶
GET /megacorp/employee/_search
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
Scripted aggregations, such as the cardinality example below, require the following in config/elasticsearch.yml:
- script.inline: true
- script.indexed: true
GET /tlo/contacts/_search
{
"size" : 0,
"query": {
"constant_score": {
"filter": {
"terms": {
"version": [
"20160301",
"20160401"
]
}
}
}
},
"aggs": {
"counts": {
"cardinality": {
"script": "doc['first_name'].value + ' ' + doc['last_name'].value + ' ' + doc['company'].value",
"missing": "N/A"
}
}
}
}
INDEX MANAGEMENT¶
By default, indices are assigned five primary shards. The number of primary shards can be set only when an index is created and can never be changed afterwards.
- Add an index
PUT /blogs { "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 }}
PUT /blogs/_settings { "number_of_replicas" : 2}
- Elasticsearch shards should be 50 GB or less in size.
- Use aliases to shelter the underlying index (or indices) and allow index swapping (see the sketch below).
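A zero-downtime swap sketch (the index and alias names are assumptions):

PUT /my_index_v1                      # create the real index
PUT /my_index_v1/_alias/my_index      # point the alias my_index at it

# later, switch the alias to a new index atomically
POST /_aliases
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" }},
    { "add":    { "index": "my_index_v2", "alias": "my_index" }}
  ]
}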
CLUSTER MANAGEMENT¶
CONFIGURATION¶
- config directory
- yaml file (elasticsearch.yml; a minimal sketch follows this list)
- Set the JVM heap size to roughly half the machine's memory; the OS will use the rest for the file system cache
- Prefer not to allocate more than ~30 GB of heap -- beyond that the JVM falls back to uncompressed object pointers
- Never let the JVM swap: bootstrap.mlockall = true
- Keep the JVM defaults
- Do not use the G1GC alternative garbage collector
- All nodes in the cluster must have the same cluster name
- Settings can also be passed on the command line to override the configuration file
- HTTP port: 9200 (and successive ports if taken)
- Transport port: 9300 (internal node-to-node communication)
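A minimal elasticsearch.yml sketch for the points above (the values are illustrative assumptions):

cluster.name: my_cluster        # must be the same on every node
node.name: node-1
bootstrap.mlockall: true        # never let the JVM swap
http.port: 9200
transport.tcp.port: 9300
# Heap size is set outside the yaml file, via the ES_HEAP_SIZE environment
# variable (e.g. ES_HEAP_SIZE=8g), at roughly half the machine's memory.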
Discovery¶
- AWS plugin available --> also include integration with S3 (snapshot to S3)
- AWS: multi-AZ is OK but replication across far data centers is not recommended
- See: resiliency
Site plugins -- kopf / head / paramedic / bigdesk / kibana:
- contain static web content (JS, HTML, ...)
- install plugins on ALL machines of the cluster
- to install, use the plugin script described in the PLUGINS section above
One type per index is recommended, except for parent-child / nested indices.
Index size optimization:
- You can disable _source and _all (_all is the catch-all field that indexes every other field; it is not needed unless a free-text "search everything" box is used -- by default, Kibana searches _all). See the mapping sketch below.
Data types: string, number, bool, datetime, binary, array, object, geo_point, geo_shape, ip, multifield. Binary values should be base64-encoded before storage.
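A mapping sketch disabling both (the index and type names are placeholders; note that disabling _source also prevents partial updates and reindexing from the index itself):

PUT /my_index
{
  "mappings": {
    "my_type": {
      "_all":    { "enabled": false },
      "_source": { "enabled": false }
    }
  }
}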
MAINTENANCE¶
Steps to restore Elasticsearch data:
- Stop Elasticsearch
- Extract the zip file (dump file)
- Start Elasticsearch
- Reload Elasticsearch
The commands for the above steps are:
systemctl stop elasticsearch
- extract gz file to destination path
systemctl start elasticsearch
systemctl daemon-reload