Introduction to Elasticsearch and the ELK stack



In this article, we discuss Elasticsearch. We start with an introduction to Elasticsearch, then have a brief discussion of the so-called ELK stack. We then move on to the architecture of Elasticsearch and what exactly nodes, clusters, shards, indices, documents, replication and so on are. So let’s start.

Introduction to Elasticsearch:

Elasticsearch is an open-source analytics and full-text search engine. It’s often used to enable search functionality for different applications. For example, consider a blog whose users you want to be able to search for various kinds of data: blog posts, products, categories. You can actually build complex search functionality with Elasticsearch, like auto-completion, handling synonyms, adjusting relevance and so on.

Suppose you want to implement full-text search while taking a number of other factors, like rating or post size, into account, and hence adjust the relevance of the results. Technically, Elasticsearch can do anything and everything you want from a “powerful” search engine, and hence it doesn’t limit you to full-text searches. You can even write queries for structured data and use the results to make pie charts, thereby using Elasticsearch as an analytics platform.


Some of the common use cases where Elasticsearch or the ELK (Elasticsearch, Logstash, Kibana) stack can be used are:

  • Keep track of the number of errors in a web application.
  • Keep track of the memory usage of your server(s) and show the data on a line chart; this is popularly known as APM (Application Performance Management).
  • Send events to Elasticsearch, where events can be website clicks, sales, new subscribers, logins or anything like that under the sun.
  • The list is actually endless…
One thing you might wonder about is how Elasticsearch does all this. In Elasticsearch, data is stored in the form of documents, where a document is analogous to a row in a relational database like MySQL. A document then contains fields, which are similar to columns in a relational database. Technically, a document is nothing but a JSON object. For example, if you want to add a book object to Elasticsearch, your JSON object for that book may look something like this:

{
    "name" : "The power of subconscious mind",
    "price" : 12.99,
    "categories" : ["self help", "motivational"]
}

How will you be querying Elasticsearch? The answer is RESTful APIs:

The queries (which we will cover in detail later) are also written as JSON objects.
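As a quick, hedged taste, here is a sketch of what indexing the book above and then searching for it might look like, written in the Kibana Dev Tools console format (the index name books is just an assumption for illustration; the _doc and _search endpoints and the match query are standard Elasticsearch):

PUT /books/_doc/1
{
    "name" : "The power of subconscious mind",
    "price" : 12.99,
    "categories" : ["self help", "motivational"]
}

GET /books/_search
{
    "query" : {
        "match" : { "name" : "subconscious" }
    }
}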

Since Elasticsearch is distributed by nature, it scales very well in terms of increasing data volumes and query throughput. Hence, even if you have loads of data, the search is still going to be super fast. As a matter of fact, Elasticsearch is written in Java and is built on top of Apache Lucene.

Now, before moving on with our journey of exploring Elasticsearch, let us have a quick overview of the ELK stack, which is now popularly known as the Elastic Stack.

Understanding Elastic Stack:

As we have already covered the overview of Elasticsearch (which is really the heart of the Elastic Stack), we will now have a brief overview of the other technologies under the Elastic Stack.


The technologies we are going to discuss under the Elastic Stack generally interact with Elasticsearch in one way or another. Actually, for some of them, interacting with Elasticsearch is not mandatory, but there is a strong synergy between the technologies, so they are frequently used together for various purposes.
So, let us start with Kibana:

Kibana:

Kibana is basically an analytics and visualization platform, which lets you easily visualize data from Elasticsearch and analyze it to make sense of it. You can think of Kibana as an Elasticsearch dashboard where you can create visualizations such as pie charts, line charts and many others.

There is a virtually infinite number of use cases for Kibana. For example, you can plot your website’s visitors onto a map and show traffic in real time. Kibana is also where you configure change detection and forecasting. You can aggregate website traffic by browser and find out which browsers are important to support based on your particular audience. Kibana also provides an interface for managing authentication and authorization for Elasticsearch. You can literally think of Kibana as a web interface to the data stored in Elasticsearch.

It uses the data from Elasticsearch and basically just sends queries to Elasticsearch using the same REST API that you could otherwise use manually. It just provides an interface for building those queries and lets you configure how to display the results. In this sense, it saves you a hell of a lot of time, because you don’t have to implement all of this yourself. You can easily create dashboards for system administrators which show the performance of servers. You can create dashboards for developers which show application errors, API response times and what not.

As can be seen, you may store a hell of a lot of different kinds of data in Elasticsearch, apart from data that you want to search and present to your external users. In fact, you might not even use Elasticsearch for implementing search functionality at all; using Elasticsearch as an analytics platform together with Kibana is a pretty common use case. There is an official demo of what Kibana can do, and you should definitely check it out.

One thing worth mentioning here is that when you start up Kibana for the first time on your machine, it actually creates an index in the Elasticsearch cluster just to store various information related to itself.
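If you are curious, you can see this for yourself by listing the cluster’s indices with the _cat API (a hedged sketch, assuming a local cluster with default settings; the exact name of Kibana’s index, something like .kibana, varies between versions):

GET /_cat/indices?v

Now, let us move on to Logstash.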

Logstash:

Traditionally, Logstash has been used to process logs from applications and send them to Elasticsearch, hence the name. That’s still a popular use case, but Logstash has evolved into a more general-purpose tool, meaning that Logstash is a data processing pipeline. The data that Logstash receives is handled as events, which can be anything of your choice: log file entries, e-commerce orders, customers, chat messages, etc. These events are then processed by Logstash and shipped off to one or more destinations. A couple of examples could be Elasticsearch, a Kafka queue, an e-mail message, or an HTTP endpoint.

A Logstash pipeline consists of three stages: an Input stage, a Filter stage, and an Output stage. Each stage can make use of plugins to do its task.

  • Input Stage: The Input stage is how Logstash receives the data. An input plugin could be a file, so that Logstash reads events from a file; it could be an HTTP endpoint, a relational database, or even a Kafka queue that Logstash listens to.
  • Filter Stage: The Filter stage is all about how Logstash processes the events received from the input plugins. Here we can parse CSV, XML or JSON. We can also do data enrichment, such as looking up an IP address and resolving its geographical location, or looking up data in a relational database.
  • Output Stage: An output plugin is where we send the processed events to. Formally, those places are called stashes. They can be a database, a file, an Elasticsearch instance, a Kafka queue and so on.
So, Logstash receives events from one or more input plugins at the Input stage, processes them at the Filter stage, and sends them to one or more stashes at the Output stage. You can have multiple pipelines running within the same Logstash instance if you want to, and Logstash is horizontally scalable.

A Logstash pipeline is defined in a proprietary markup format that is somewhat similar to JSON. Technically, it’s not only a markup language, as we can also add conditional statements and make a Logstash pipeline dynamic.

A Sample Logstash Pipeline Configuration:

input {
    # Read events line by line from a log file
    file {
        path => "/path/to/your/logfile.log"
    }
}
filter {
    # Drop any event whose request field matches /robots.txt
    if [request] in ["/robots.txt"] {
        drop {}
    }
}
output {
    # Write processed events to files named after the event type and date
    file {
        path => "%{type}_%{+yyyy_MM_dd}.log"
    }
}

Let us consider a basic use case of Logstash before moving on to the other components of our ELK stack. Suppose we want to process access logs from a web server. We can configure Logstash to read the log file line by line and treat each line as a separate event. This can easily be done using the input plugin named “file,” but there is a handy tool named Beats that is far better suited for this task. We will discuss Beats a bit later.

Once Logstash receives a line, it can process it further. Technically, a line is just a string, and we need to parse this string so that we can extract valuable information out of it, like the status code, request path, IP address and so on. We can do so by writing a “grok” pattern, which is somewhat similar to a regular expression, to match pieces of information and save them into fields. Now, supposing our “stash” here is Elasticsearch, we can easily save the processed bits of information stored in those fields to Elasticsearch as JSON objects. A minimal sketch of such a filter follows; after that, let us discuss another component of the ELK stack, Beats.
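Logstash ships with a library of ready-made grok patterns, including COMBINEDAPACHELOG for standard Apache-style access logs, so a minimal filter for the scenario above could look something like this (a sketch, not a full pipeline; the field names it produces, such as clientip, verb, request and response, come from that bundled pattern):

filter {
    grok {
        # Split an access-log line stored in the message field into
        # named fields using the bundled COMBINEDAPACHELOG pattern
        match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
}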

Beats:

Beats is basically a collection of data shippers, which are lightweight agents, each with a particular purpose. You can install one or more of these data shippers on your servers as per your requirements. They then send data to Elasticsearch or Logstash. There are a number of data shippers, and each data shipper is called a beat. Each beat collects a different kind of data and hence serves a different purpose.

For example, there is a beat named Filebeat, which is used for collecting log files and sending the log entries off to either Logstash or Elasticsearch. Filebeat ships with modules for common log files, such as nginx, the Apache web server, or MySQL. This is very useful for collecting log files such as access logs or error logs.

You can read more about Beats on the official Elastic website. To summarize, we can look at the Elastic Stack as follows:

ELK Stack: The Summary


Now, let us put all of the pieces together. At the center of the Elastic Stack is Elasticsearch, which contains the data. Ingesting data into Elasticsearch can be done with Beats and/or Logstash, but also directly through Elasticsearch’s API. Kibana is a user interface that sits on top of Elasticsearch and lets you visualize the data it retrieves from Elasticsearch through the API. There is nothing Kibana does that you cannot build yourself, and all of the data that it retrieves is accessible through the Elasticsearch API. Still, Kibana is a wonderful tool that can save you a hell of a lot of time, as you do not have to build dashboards yourself.

The Architecture of Elasticsearch:

To start things off, we will discuss Nodes and Clusters, which are at the heart of the architecture of Elasticsearch.

Nodes and Clusters:

A Node is nothing but a server that stores data. A Cluster is simply a collection of Nodes, i.e., servers. Each Node contains a part of the Cluster’s data, the data we add to the Cluster. Simply put, the collection of Nodes together contains the entire data set of the Cluster.

Cluster and its Nodes

Each Node inside the Cluster participates in the searching and indexing capabilities of the Cluster. Hence, besides containing a part of the Cluster’s data, a Node supports indexing new data as well as modifying existing data. Every single Node within a Cluster is capable of handling HTTP requests from clients that want to insert or modify data, through a REST API exposed by the Cluster. One particular Node is then responsible for receiving a given HTTP request and coordinating the rest of the process.

It is worth mentioning that each Node inside a Cluster knows about every other Node within the same Cluster, and is hence able to forward a request to another Node within the Cluster using the transport layer. Also, every Node is capable of becoming the Master Node. The Master Node is responsible for coordinating changes to the Cluster, where changes can be adding or removing Nodes, creating or removing Indices, etc. Hence, the Master Node updates the state of the Cluster, and it is the only Node authorized to do so.

Now, how are these Nodes and Clusters uniquely identified? The answer is unique names for both Nodes and Clusters. The default name of a Cluster is elasticsearch (all in lowercase), and the default names for Nodes are Universally Unique Identifiers (UUIDs). Obviously, as per requirements, a developer can override these default names. Also, by default, when you start some Nodes, they automatically join the Cluster named elasticsearch, and if, at that moment, there is no Cluster with the name elasticsearch, a Cluster with the name “elasticsearch” is automatically formed. As a matter of fact, a Cluster with only one Node is a perfectly valid use case.
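As a small, hedged illustration, you can see the Cluster’s name and the number of Nodes in it through the cluster health API (again shown in the Kibana Dev Tools console format, assuming default settings):

GET /_cluster/health

Among other things, the response contains the cluster_name (elasticsearch by default), the number_of_nodes, and the overall status of the Cluster.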

Indices and Documents:

Each data item that you store on Elasticsearch Nodes is called a Document. A Document is the smallest unit that can be indexed. As a matter of fact, Documents are nothing but JSON objects and are analogous to rows in a relational database like MySQL.

An Index with its Documents

Now, each Document can have some properties, just like columns in a relational database. The question is, where are these Documents stored? As you already know, our data is stored across the Nodes within the Cluster; the Documents themselves are organized within Indices. Hence, an Index is simply a collection of logically related Documents and is analogous to a table in a relational database. Just as there is no restriction on the number of rows in a table, you can add any number of Documents to an Index. Each and every Document is uniquely identified by an ID, which is either assigned by Elasticsearch automatically or by the developer when adding the Document to the Index. Hence, each and every Document can be uniquely identified by its Index and its ID.

Similar to Nodes and Clusters, Indices are also uniquely identified by their names (again, all in lowercase letters). These Index names are then used when searching for Documents, in which case the developer specifies the Index name to search through for matching Documents. The same applies to adding, removing and updating Documents, i.e., the developer specifies the name of the Index, as sketched below.
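To make this concrete, here is a hedged sketch of retrieving and then removing the book Document from earlier by its Index and ID (the index name books remains an assumption; the _doc endpoint is standard Elasticsearch):

GET /books/_doc/1

DELETE /books/_doc/1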

Sharding:

Elasticsearch is superbly scalable, and all the credit goes to its distributed architecture, which is made possible by Sharding. Before moving further, let us consider a simple and very common use case. Suppose you have an Index which contains a hell of a lot of Documents and, for the sake of simplicity, assume that the size of that Index is 1 TB (i.e., the sizes of all the Documents in that Index sum to 1 TB). Also assume that you have two Nodes, each with 512 GB of space available for storing data. As can be seen, the entire Index cannot be stored on either of the two available Nodes, and hence we need to distribute our Index among them.

In cases like this, where the size of an Index exceeds the hardware limits of a single Node, Sharding comes to the rescue. Sharding solves this problem by dividing Indices into smaller pieces, and these pieces are called Shards.

An Index is sharded into 4 shards

As can be seen, a Shard contains a subset of an Index’s data. When an Index is sharded, a given Document within that Index will only be stored within one of the Shards. The amazing thing about Shards is that they can be hosted on any Node within the Cluster. Coming back to our example, we could divide the 1 TB Index into four Shards, each 256 GB in size. These four Shards can then easily be distributed across the two available Nodes. Hence, we have successfully managed to store our 1 TB Index on our two Nodes, each with a capacity of 512 GB, by harnessing the power of Sharding. Even with increasing volumes of data, one can simply tweak the number of Shards to manage it; in this sense, Sharding provides scalability.

Along with the scalability it provides, there is one more major advantage of using Sharding: operations (like querying the data) can easily be distributed among multiple Nodes, and hence we can parallelize our operations. This certainly enhances performance, as multiple machines can now work on the same query at the same time. Now, the question is, when and how do you specify the number of Shards an Index has? You can (optionally) specify this at the time of Index creation; if you don’t, it defaults to 5.

Technically, 5 Shards per Index suffice for even large volumes of data. But what if you really start dealing with even larger volumes of data and want to increase the number of Shards of an already created Index? The answer is, you can’t: you cannot change the number of Shards an Index has after that Index has been created. (Why? We will see that in upcoming articles.) So, if you cannot change the number of Shards, how do you handle the situation? The solution is to simply create a new Index with however many Shards you want and move your data over to the newly created Index, as sketched below.
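As a hedged sketch, creating a new Index with an explicit number of Shards and then moving the data over could look like this (the index names are assumptions for illustration; the number_of_shards setting and the _reindex API are standard Elasticsearch):

PUT /books_v2
{
    "settings" : {
        "number_of_shards" : 10
    }
}

POST /_reindex
{
    "source" : { "index" : "books" },
    "dest" : { "index" : "books_v2" }
}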

Replication:

Even in today’s world, hardware failure or some similar kind of failure is inevitable. So, our system must be prepared to face this by having some means of fault tolerance. That’s where Replication comes to the rescue. Elasticsearch natively supports Replication, meaning that Shards are copied. These copied Shards are referred to as Replica Shards, or just Replicas. The “original” Shards that have been copied are called Primary Shards.

How Replication works

As a matter of fact, a Primary Shard and its Replicas are referred to as a Replication Group. These terminologies may seem daunting, and if you are a developer, chances are that you won’t ever have to deal with Replicas directly. But it is always better to be aware of what is going on behind the scenes.

There are two major benefits we get from Replication:

  • Fault Tolerance and High Availability: To make Replication even more effective, Replicas are never allocated to the same Node as their respective Primary Shards. Hence, even if an entire Node fails, we still have at least one copy of every Shard that was present on the failed Node, either its Replica, or the Primary Shard itself in case the failed Node held Replicas.
  • Enhanced Performance: Replication increases search performance because searches can be executed on all Replicas in parallel, meaning that Replicas actually add to the search capabilities of the Cluster.

Now, as a matter of fact, the number of Replicas is defined at the time of Index creation (similar to the case of Shards), and by default each Shard gets 1 Replica. Hence, if at creation time we specify neither the number of Shards per Index nor the number of Replicas per Shard, a Cluster consisting of more than one Node will have 5 Shards per Index, each with one Replica, totaling 10 Shards (5 Primary Shards and 5 Replicas) per Index. The reason why Replication requires at least 2 Nodes is that a Replica Shard is never stored on the same Node as its Primary Shard.
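One hedged note worth adding here: unlike the number of Shards, the number of Replicas can be changed even after an Index has been created. A sketch of setting both at creation time and then bumping the Replica count later (the index name is again an assumption; number_of_replicas and the _settings API are standard Elasticsearch):

PUT /books
{
    "settings" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 1
    }
}

PUT /books/_settings
{
    "index" : { "number_of_replicas" : 2 }
}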

Keeping Replicas Synchronized:

Now, suppose a Shard is replicated five times; how are those Replicas updated whenever data gets changed or removed? Clearly, there is a definite need to keep those Replicas in sync with each other. How does Elasticsearch manage to do so? Elasticsearch keeps those Replicas in sync by using a model named primary-backup.

In this model, the Primary Shard of each Replication Group acts as the entry point for every operation that affects the Index, such as adding, updating or removing Documents. This means all such operations are sent to the Primary Shard. The Primary Shard is then responsible for performing some validations on incoming operations, like checking the structural validity of the request. Only once the operation has been accepted by the Primary Shard is it actually performed locally on the Primary Shard. Once the operation on the Primary Shard completes, it is forwarded to the other Replicas within the Replication Group (the operation is executed in parallel on all of the Replicas of that Primary Shard).

When the operation has executed successfully on each of the Replicas, the Primary Shard is informed, and the Primary Shard finally responds to the client about the successful execution of the operation.

End Notes:

This section briefly discusses how Elasticsearch knows which Shard to store a new Document on, and how it finds that Document again when retrieving it by ID. As a developer, you most probably won’t be dealing with this, but knowing a bit about it is not bad at all. Documents should, by default, be distributed evenly between Shards, so that we won’t have one Shard containing way more Documents than another (and Elasticsearch does this amazingly well out of the box). Determining which Shard a given Document should be stored in, or has been stored in, is called routing. By default, the routing value equals a given Document’s ID. This value is then passed through a hash function, which determines the exact Shard.
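For the curious, the default routing formula documented by Elasticsearch boils down to the following, with the routing value defaulting to the Document’s ID:

shard = hash(routing) % number_of_primary_shards

(Incidentally, this formula also hints at why the number of Primary Shards cannot be changed after Index creation: the result of the modulo would change, and existing Documents could no longer be located.)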

Going into the details of routing would not help us much in learning Elasticsearch at this point, and hence we will not dig into it further.




