Elasticsearch as a NoSQL Database

Published in

Analytics Vidhya

6 min read · Apr 29, 2021

Elasticsearch is a distributed, open-source search engine and analytics database written in Java on top of Apache Lucene. It lets you store, search, and analyze large volumes of data within seconds. It achieves fast search responses by searching an index instead of scanning the data directly. It exposes a RESTful API for creating, updating, and searching data, so at its core Elasticsearch accepts JSON requests and returns results in JSON format. Best of all, Elasticsearch is open source and free to use. The terminology of Elasticsearch differs from traditional database terminology: a database is referred to as an index, a table as a type, and a row as a document.
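To make the "JSON in, JSON out" idea concrete, here is a minimal sketch that builds (but does not send) an HTTP request to store one document. The node address http://localhost:9200, the index name "blog", and the document fields are assumptions for illustration only.

```python
import json
from urllib.request import Request

# Assumption: a local node at this address; "blog" is a hypothetical index name.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_id, document):
    """Build (but do not send) a PUT request that stores one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request("blog", "1", {"title": "Hello", "views": 10})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it to a running node, which answers with a JSON body of its own.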

Key features:

Elasticsearch works in a distributed environment. It provides enterprise-grade security and easy-to-understand APIs to work with.

Where Does Elasticsearch Shine?

Search Queries:

A search query matches results based on a relevance score. For each query, Elasticsearch returns a collection of results, each with a score assigned to it. The score indicates how well the result matches the parameters and conditions of the search query.

Filters:

Filters do not compute a score, so they are generally much faster than queries. The result is purely binary: a document either contains the term or it does not.
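The two ideas can be combined in one request. The sketch below builds a search body with a scored "must" clause and an unscored "filter" clause inside a bool query. The field names "title" and "status" are hypothetical; the body would be sent to an index's _search endpoint.

```python
import json


def build_search_body(text, status):
    """Combine a scored full-text clause with an unscored filter clause."""
    return {
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [{"term": {"status": status}}],
            }
        }
    }


body = build_search_body("elasticsearch", "published")
print(json.dumps(body, indent=2))
```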

Some of its other features are:

Scalability and resiliency: Clusters grow with your needs.
Clustering and high availability: The shard and replica architecture handles node failures. If a shard fails, its replica takes its place.
Automatic node recovery: When a node fails, the master node promotes replicas of the shards that node held and reassigns them to the remaining nodes.
Automatic data rebalancing: The master node decides which node each shard is allocated to and moves shards between nodes to rebalance the cluster.
Horizontal scalability: When usage increases, Elasticsearch scales out by adding nodes.
Support for Apache Spark, Pig, Hive, and Storm: Elasticsearch for Hadoop provides first-class support for Spark.
Business Intelligence: Supports Tableau Desktop, MS Power BI, MS Excel, SQL Workbench, and many more.
Plugins and integrations: A number of plugins and integrations are available for free.
Management APIs: Elasticsearch can be managed through a variety of management-related APIs.
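To illustrate the failover idea from the list above, here is a toy model of replica promotion. It is a simplified sketch and not Elasticsearch's actual allocation algorithm; the node names and the shard table layout are made up for the example.

```python
def promote_replicas(shards, failed_node):
    """Return a new shard table after `failed_node` is lost.

    `shards` maps a shard id to {"primary": node, "replicas": [nodes]}.
    """
    recovered = {}
    for shard_id, copies in shards.items():
        primary = copies["primary"]
        # Drop any replica copy that lived on the failed node.
        replicas = [node for node in copies["replicas"] if node != failed_node]
        if primary == failed_node:
            if not replicas:
                raise RuntimeError(f"shard {shard_id} has no surviving copy")
            # Promote the first surviving replica to primary.
            primary, replicas = replicas[0], replicas[1:]
        recovered[shard_id] = {"primary": primary, "replicas": replicas}
    return recovered


cluster = {
    0: {"primary": "node-a", "replicas": ["node-b"]},
    1: {"primary": "node-b", "replicas": ["node-c"]},
}
after = promote_replicas(cluster, "node-a")
print(after)
```

In this toy run, shard 0 loses its primary on node-a and its replica on node-b takes over, while shard 1 is untouched.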

These are only a few of the key points; Elasticsearch has many other features.

Basic concepts and terms.

Data Model

In a typical RDBMS, we store data in the form of a table, where one row represents one useful piece of information. In Elasticsearch, we store data in the form of a JSON string, and such a JSON string is called a document. In RDBMS terms, a JSON field corresponds to a column, and the field’s value corresponds to the value stored in that column.
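As a small illustration of this analogy, the sketch below turns one table row into a JSON document. The column names and row values are invented for the example.

```python
import json

# Hypothetical table definition and one row from it.
columns = ("id", "title", "author", "published")
row = (1, "Elasticsearch as a NoSQL Database", "jane", True)

# Each column name becomes a JSON field; each cell becomes that field's value.
document = dict(zip(columns, row))
print(json.dumps(document))
```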

Elasticsearch has several architectural and data modeling terms, and I will explain them briefly.

Index

An index in Elasticsearch is similar to a database in a relational database management system. It is the largest logical partition of data in Elasticsearch. In a blogging application, for example, you can store all the information related to blogs, authors, publications, and readers inside one index. An Elasticsearch cluster can hold any number of indices, and each must be given a unique name. An index holds types (tables), a type holds multiple documents (rows), and each document may have several properties (columns).

Mapping

Mappings define how a document and the fields it contains are stored and indexed. A mapping in Elasticsearch is similar to a schema in the RDBMS world. It describes the properties of the documents: the fields they hold, the data type of each field, and how each field should be indexed and stored by Lucene. It is very important to define mappings appropriately when creating an index, because an inappropriate mapping can lead to wrong search results. Mappings also cover metadata fields such as _index and _id. There are two kinds of mapping, dynamic and explicit, and each has its own benefits.
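Here is a minimal sketch of an explicit mapping, written as the JSON body you would send when creating an index. The field names are hypothetical, and the body uses the typeless format of recent Elasticsearch versions.

```python
import json

# Assumption: a blog index with these four hypothetical fields.
create_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},        # analyzed for full-text search
            "author": {"type": "keyword"},    # matched as an exact value
            "published_at": {"type": "date"},
            "views": {"type": "integer"},
        }
    }
}
print(json.dumps(create_index_body, indent=2))
```

With dynamic mapping you would skip this step and let Elasticsearch infer a type for each field from the first document that contains it.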

Mapping Type

Don’t confuse mapping types with data types. Elasticsearch uses types within an index to divide similar data into classes, where each class defines a distinct group of documents. Documents are stored under the type they belong to. A type consists of a name and a mapping, and it is recorded on each document in a type field, which is useful for filtering when you query a specific type. Note that since Elasticsearch 6.0 an index can contain only one mapping type, and mapping types were removed entirely in 8.0.

Documents

Formally, a document is the smallest unit of information that we store in Elasticsearch. You can think of a document as equivalent to a row in a relational database, representing one entity. In a blogging application, for example, you could have one document for each blog post. There is no limit on the number of documents inside a particular type (table). The data in a document is defined by fields made up of keys and values. The key is the name of the field, and the value is the data associated with it. The value can be a string, a number, a Boolean, or another structure such as an array or a nested object. Some reserved fields, such as index, type, and id, are identified as document metadata.
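The sketch below shows a hand-written example in the shape of a "get document" response and separates the metadata fields from the stored fields. The index name and all field values are hypothetical.

```python
# Hand-written example; not output captured from a real cluster.
response = {
    "_index": "blog",
    "_id": "1",
    "_source": {
        "title": "Hello",
        "views": 10,
        "published": True,
        "tags": ["search", "nosql"],
        "author": {"name": "jane"},
    },
}


def split_metadata(hit):
    """Separate metadata fields (leading underscore) from the stored document."""
    metadata = {
        key: value
        for key, value in hit.items()
        if key.startswith("_") and key != "_source"
    }
    return metadata, hit["_source"]


metadata, fields = split_metadata(response)
print(metadata)
for name, value in fields.items():
    # Show which Python type each JSON value was decoded into.
    print(name, type(value).__name__)
```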

Shards and Replicas

When an index grows beyond the disk capacity of the server hosting it, Elasticsearch may crash. One way to overcome this challenge is to use shards. When creating an index, you can define the number of shards and replicas you want for it. Sharding means dividing the index into multiple pieces, and each piece is called a shard. A shard works as a fully functional index and can be hosted on any node within the cluster. Sharding lets you split the data volume horizontally and run operations in parallel across multiple nodes, which increases performance. A replica is a copy of a shard and works as a backup: when a node crashes, the replicas of its shards keep serving requests. Replicas also serve read requests during normal operation, which helps search performance. A shard and its replica should not be placed on the same node, so that high availability is preserved. The number of replicas is defined when you create the index and can be changed later if needed.
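As a rough sketch, here are the settings body for an index with three shards and one replica, plus a toy version of how a document is routed to a shard (a hash of the routing value modulo the number of primary shards). The zlib.crc32 call is only a stand-in for illustration; Elasticsearch uses its own hash function internally.

```python
import zlib

# Body for creating an index with explicit shard and replica counts.
index_settings = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1}
}


def pick_shard(routing_value, number_of_shards):
    """Map a routing value (by default the document id) to a shard number.

    zlib.crc32 is a stand-in hash, used here for illustration only.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


shard_count = index_settings["settings"]["number_of_shards"]
placement = {doc_id: pick_shard(doc_id, shard_count) for doc_id in ("1", "2", "3", "4")}
print(placement)
```

Because the shard number depends on the shard count, changing the number of primary shards would move existing documents, which is why that number is normally fixed when the index is created.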

Nodes

A node holds our data and contributes to the cluster’s indexing and search capacity. A node can be thought of as a single server in the cluster. Each node is assigned a random universally unique identifier (UUID) at startup, and if needed we can give it a name of our own. There are several types of nodes, such as master nodes, data nodes, and client nodes. Each type has its own job to do and its own configuration.

Cluster

Elasticsearch runs as a distributed environment. A cluster is a collection of one or more nodes that together hold the entire data set and provide indexing and search capabilities. Any number of nodes can share the same cluster name. The cluster name is the cluster’s unique identifier, and nodes have to use it when joining the cluster. One node in the cluster acts as the master node, and it is chosen automatically by the cluster itself. The master node is responsible for the configuration and management of the cluster. If the master node fails, another node from the cluster is chosen as the new master. We can send a query to any node in the cluster, and that node forwards it to the nodes that hold the relevant data.

Advantages

Since it’s written in Java, it runs on most platforms. Another advantage is its speed and high search performance. Since it’s a near real-time search engine, data becomes searchable shortly after you store it. Elasticsearch has a distributed architecture, so you can scale it when you need to. It can index any kind of document that can be rendered as text. Finally, it’s open source and free to download.

Disadvantages

It is technically possible to use Elasticsearch as a central data store, but it gives no guarantee of exactness. Each document carries a version number that increases monotonically. When two write calls for the same document reach Elasticsearch at the same time, both are accepted, but only the latest version is kept. Technically speaking, Elasticsearch does not support ACID properties, and on rare occasions a split-brain problem can occur.
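The versioning behaviour can be pictured with a toy in-memory store. This is only a sketch of the idea and not Elasticsearch code: the ToyStore class and its expected_version parameter are invented for illustration, showing both the default "last write wins" behaviour and an optional version check that rejects a stale writer.

```python
class ToyStore:
    """In-memory stand-in for a document store with per-document versions."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def write(self, doc_id, document, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise ValueError(
                f"version conflict: expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version

    def read(self, doc_id):
        return self._docs[doc_id]


store = ToyStore()
store.write("1", {"views": 10})  # becomes version 1
store.write("1", {"views": 11})  # becomes version 2 and replaces version 1
print(store.read("1"))

try:
    # A writer that still believes the document is at version 1 is rejected.
    store.write("1", {"views": 99}, expected_version=1)
except ValueError as error:
    print(error)
```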

Anyone who wants to build a search engine, or to analyze data such as transactional data or log data and extract useful information from it, can use Elasticsearch. Elasticsearch is also useful for implementing a centralized logging system that captures logs from different servers, hosted in different locations, and stores and analyzes them in one place. The Elasticsearch documentation is detailed and available in many languages, and there are support forums, blogs, and YouTube videos as well. That makes it easy for people who are just getting started.

If you find anything wrong or if you have anything to add, please feel free to comment. Thank you.