Introduction to Information Retrieval (IR) in NLP

Overview

Information retrieval (IR) is the branch of computing that deals with models for finding material of an unstructured nature that satisfies an information need from within large collections of documents usually stored on the web or computers.

The term unstructured data refers to data that has no clear format or obvious semantic structure, so a computer program cannot easily interpret it with simple, obvious rules. Examples include text documents, images, video files and, sometimes, graphs.

Introduction

Computing power keeps increasing and the cost of storage keeps falling, and together these advances mean the amount of data we deal with daily is growing exponentially. This creates a huge need for ways to retrieve that information: to search it, query it, analyze it and collect insights from it.

An information retrieval system is a software system that provides access to books, journals and other documents, and stores and manages those documents. Web search engines are the most visible information retrieval applications.
Scale of Information Retrieval Problem: Hundreds of millions of people around the world engage in information retrieval every day when they use a web search engine or search their emails, personal computers or storage devices. Information retrieval is therefore fast becoming the dominant form of information access, and it has to scale with both the user base and the data size.
Curse of Information Load: Information overload is the excess of information available to a decision maker aiming to complete a task or come to a conclusion. It has been a problem throughout history, particularly during the Renaissance and industrial revolution periods.
- The problem became severe with the dawn of the information revolution, when powerful, low-cost, automated data collection gave humanity more information than was available at any other point in history.
- As a result, information overload is more relevant than ever before, and automated information retrieval systems are a key means of alleviating it.

Information retrieval systems are also very important to make sense of the data on a day-to-day basis over the web. It would be hard to imagine how we would find any piece of information on the internet without Google or any other search engines out there. All these search engines are examples of information retrieval systems.

Information is not knowledge without information retrieval systems.

What is Information Retrieval?

The general objective of an Information Retrieval System is to minimize the difficulty of a user locating the information they need from the system.

This difficulty or overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information. Examples of these steps include query generation, query execution, scanning the results of the query to select items to read, reading non-relevant items and further downstream tasks, if any.
The main problem statement in information retrieval is that a decision must be made for every document or information object about whether or not to show it to the person who is retrieving the information.
Information retrieval needs to determine the relevance of documents based on inputs such as keywords or example documents.

We use the word document as a general term that could also include non-textual information, such as multimedia objects.

Defining Relevance of Items in Information Retrieval System: The term relevant item is used in information retrieval to represent any item containing the information needed by the user making the query. From the perspective of the user, relevant and needed can be considered synonymous.

A relevant document contains the information that a person was looking for when they submitted the query to the information retrieval system.
Topical relevance: This dimension of relevance is about being on topic; it tells us whether the topic of the information retrieved matches the topic of the request.
User relevance: This dimension places the focus on the individual user's perception of the information and the information environment, not on the information as represented in a document or some other concrete form.

Classical Problem in Information Retrieval System

The main problem in information retrieval systems is to develop a model for retrieving information from repositories of documents. One of the classical problems here is known as the ad-hoc retrieval problem.

Ad-hoc retrieval problem in information retrieval systems: This is one of the basic problems and a standard retrieval task in IR systems. The user specifies their information need through a query, which initiates a search (executed by the information system) for documents that are likely to be relevant to the user.

Once such a query is made, the information retrieval system returns the documents that are related to the desired information.
The problem can be formulated like this: Given a query and a corpus, we need to find the relevant items.
- Query: Textual description of information need.
- Corpus: A collection of textual documents.
- Relevance: Satisfaction of the user’s information need.
- Ad-hoc because the number of possible queries is huge.
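
The formulation above can be sketched in a few lines of Python. This is only a minimal illustration, not a real retrieval method: the corpus contents, the function names tokenize and ad_hoc_retrieve, and the use of shared query terms as a stand-in for relevance are all assumptions made for this example.

```python
def tokenize(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


def ad_hoc_retrieve(query, corpus):
    """Return ids of documents that share at least one term with the query.

    Term overlap is only a crude proxy for relevance (an assumption of
    this sketch); real systems use richer scoring.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(query_terms & set(tokenize(text)))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical three-document corpus.
corpus = {
    "d1": "information retrieval finds relevant documents",
    "d2": "cooking recipes for chocolate cake",
    "d3": "search engines rank documents for a query",
}
results = ad_hoc_retrieve("relevant documents", corpus)
print(results)
```

Here d1 shares two terms with the query and d3 shares one, so the sketch returns d1 followed by d3 and leaves out d2.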

Basic Assumptions of Information Retrieval

The central goal of any information retrieval system is to extract information that is relevant to the user’s information need. Let us discuss the most important implications of query formulation, its reformulation and the assumptions ingrained in the information retrieval process.

Query Formulation: Query handling is one of the most important steps in Information Retrieval. Experiments on IR systems have shown that query formulation plays a vital role in generating relevant results for the user.
- A user's information need is put forward in the form of a query, and since most users of the web are not search experts, query formulation cannot be expected to be effective every time.
- Very short queries (typically two to five words) tend to produce worse results.
- One main assumption made to overcome this problem is that the query expresses a thought pattern of the user.
- Hence there is no standard way in which a query must be written; it is only assumed that the user is looking for some information. If the system fails to understand the need behind the query, the documents fetched from the web will not be of any use to the user.
Query Reformulation: The main aim of query reformulation is to find out new meaningful terms that can be added to the initial query. It adds more terms to the original query, which provides more information about the user's need.
- Query Reformulation is considered an effective technique to improve the performance of Information Retrieval systems.
- With query reformulation, the user is guided to rephrase queries so that more relevant results can be generated.
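
A minimal sketch of reformulation by adding terms to the initial query is shown here. The RELATED_TERMS table and the reformulate function are hypothetical and exist only for illustration; a real system might derive related terms from a thesaurus, query logs, or user feedback.

```python
# Hypothetical table of related terms (an assumption of this sketch).
RELATED_TERMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["affordable"],
}


def reformulate(query, related_terms=RELATED_TERMS):
    """Append related terms to the query, keeping the original terms first."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in related_terms.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return " ".join(expanded)


expanded_query = reformulate("cheap car")
print(expanded_query)
```

For the query "cheap car" this yields "cheap car affordable automobile vehicle", which carries more information about the user's need than the original two words.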

So for every query formulation and reformulation, it is assumed that the system will return some relevant data, and we always need a way to know whether the results are good.

Hence the central assumptions of the system are that the user wants information from a collection of documents and that there is a set of relevant documents for the given query.
We also need to assert that the results can be measured with a proxy metric for relevance based on the action or satisfaction of the user. We generally use precision and recall as the main measures but there are other variants also.

The flow of the information retrieval system is presented here:

Collection: A set of documents; in most systems we can assume it is a static collection.
Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.
Information Need: The user wants information from a collection of objects, that is, there is an information need.
Query: The user formulates the information need as a query, in their own language, to the information retrieval system.
Resulting documents: The system finds objects that satisfy the query and presents them to the user in a useful form.
Presentation: The user determines which of the presented objects are relevant, and the metrics are computed from these judgments.

Precision

Precision is the ratio of the number of relevant documents retrieved to the total number retrieved. Precision provides an indication of the quality of the answer set.

Precision does not consider the total number of relevant documents. A system might have good precision by retrieving ten documents and finding that nine are relevant (a 0.9 precision), but the total number of relevant documents also matters.
Example: If there were only nine relevant documents, the system would be a huge success. However, if millions of documents were relevant and desired, this would not be a good result set.
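
The ratio can be computed directly from the set of retrieved documents and the set of relevant ones. The sketch below is illustrative: the document ids are made up, and returning 0.0 when nothing is retrieved is a convention chosen for this example.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # convention chosen for this sketch when nothing is retrieved
    return len(retrieved & set(relevant)) / len(retrieved)


# Ten retrieved documents, nine of which are relevant (hypothetical ids).
retrieved = [f"d{i}" for i in range(1, 11)]  # d1 .. d10
relevant = [f"d{i}" for i in range(1, 10)]   # d1 .. d9
p = precision(retrieved, relevant)
print(p)
```

With nine relevant documents among ten retrieved, the function gives 9/10 = 0.9, matching the figure in the text.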

Recall

Recall considers the total number of relevant documents. It is the ratio of the number of relevant documents retrieved to the total number of documents in the collection that are believed to be relevant.

Computing the total number of relevant documents in the entire collection is non-trivial.
For example, suppose a system retrieves ten documents, nine of which are relevant, and the collection contains 20 relevant documents in total. Then 9 of the 20 relevant documents are retrieved (a 9/20 = 0.45 recall).
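
The same example can be worked through in code. As before, the document ids are hypothetical, and the sketch assumes the full set of relevant documents is known, which the text notes is hard to establish in practice.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    return len(set(retrieved) & relevant) / len(relevant)


# The system retrieves d1 .. d10; nine of them (d1 .. d9) are relevant.
retrieved = [f"d{i}" for i in range(1, 11)]
# Twenty relevant documents in total: d1 .. d9 plus eleven that were missed.
relevant = [f"d{i}" for i in range(1, 10)] + [f"d{i}" for i in range(11, 22)]
r = recall(retrieved, relevant)
print(len(relevant), r)
```

Nine of the twenty relevant documents are retrieved, so the recall is 9/20 = 0.45 even though nine of the ten retrieved documents are relevant.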

[Figure: precision and recall]

Information Retrieval Model

Models, in general, are used in many scientific and mathematical areas with the objective of understanding some phenomenon in the real world.

The information retrieval model predicts and explains what a user will find relevant to the given query. The information retrieval model is basically a pattern that defines the most important aspects of the retrieval procedure and consists of a set of entities:

A model for documents, a model for queries, and a matching function that compares queries to documents.
In the mathematical formulation, a retrieval model can be seen as consisting of:
- D − Representation for the documents
- Q − Representation for the queries
- F − The modeling framework for documents and queries along with the relationship between documents and queries
- R(q, di) − Similarity function which orders the documents with respect to the query; this ordering is also called ranking.
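
One possible instance of these components is sketched below, assuming a simple vector model: documents and queries are represented as term-count vectors, and R(q, di) is taken to be cosine similarity. The function names and sample documents are made up for this illustration; other choices of representation and similarity are equally valid.

```python
import math
from collections import Counter


def represent(text):
    """D and Q: represent a document or query as a term-count vector."""
    return Counter(text.lower().split())


def similarity(query_vec, doc_vec):
    """R(q, di): cosine similarity between query and document vectors."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    norm_q = math.sqrt(sum(c * c for c in query_vec.values()))
    norm_d = math.sqrt(sum(c * c for c in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query, documents):
    """Order document ids by decreasing similarity to the query."""
    q = represent(query)
    scores = {doc_id: similarity(q, represent(text))
              for doc_id, text in documents.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical documents.
documents = {
    "d1": "stemming reduces words to a root form",
    "d2": "an inverted index maps words to documents",
    "d3": "the inverted index is the primary index structure",
}
ranking = rank("inverted index", documents)
print(ranking)
```

For the query "inverted index", d3 ranks first, d2 second, and d1 last because it shares no term with the query.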

Constructing a model for an information retrieval system

We need to understand how the document and the query are represented and develop the framework based on these representations.
The framework also needs to provide a way to order the documents by a score that represents the level of relevance between the query and each document.
- In the traditional Boolean model, the framework consists of sets of documents and the basic set operations.
- In the vector model, the framework provides algebraic operations on vectors.
- In the probabilistic model, Bayes' theorem and probability operations are part of the framework.
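
To make the Boolean case concrete, the sketch below evaluates AND, OR and NOT queries with Python set operations. The postings table and document ids are hypothetical.

```python
# Hypothetical postings: each term maps to the set of documents containing it.
postings = {
    "information": {"d1", "d2", "d4"},
    "retrieval": {"d1", "d4"},
    "chocolate": {"d3", "d4"},
}
all_docs = {"d1", "d2", "d3", "d4"}

# information AND retrieval -> set intersection
both = postings["information"] & postings["retrieval"]

# information OR chocolate -> set union
either = postings["information"] | postings["chocolate"]

# information AND NOT chocolate -> intersection with the complement
without = postings["information"] & (all_docs - postings["chocolate"])

print(sorted(both), sorted(either), sorted(without))
```

A document either satisfies a Boolean query or it does not, so this framework on its own returns an unordered set with no relevance score.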

Types of Information Retrieval Models

Information retrieval models are classified by the type of interaction between documents and queries:

The way IR models represent the documents and query statements.
How the information retrieval system matches the query with the documents in the corpus to find the related documents.
How the ranking of the documents is implemented in the system.

The IR models are mainly categorized as Classical Information Retrieval models, Non-Classical Information Retrieval models and Alternative models. Let us learn about them further.

Classical IR model: Classical models are the simplest and easiest IR models to implement. They are based on mathematical knowledge that is easily recognized and understood.

The term classical denotes that these systems apply foundational techniques to documents without using any extra information about the structure or content of a document.
Boolean, Vector, and Probabilistic are the three main classical information retrieval models.

Non-classical IR model: Non-classical information retrieval models are based on principles of information logic model, situation theory model, and interaction model.

Non-classical models do not rely on concepts from the classical models such as similarity, probability, Boolean operations, etc.

Alternative IR model: Alternative IR models enhance the classical information retrieval models with specific techniques from other fields. Examples are the cluster model, the fuzzy model, and latent semantic indexing (LSI) models.

Design features of Information Retrieval (IR) Systems

Let us look at some design features of information retrieval systems.

Inverted Index

An inverted index is an index data structure storing a mapping from content (content can be words or numbers) to its locations in a document or a set of documents.

We can also say the inverted index is a hashmap-like data structure that directs the user from a word to a document or a web page. The inverted index is also the primary data structure of most information retrieval systems.
For every word, an inverted index lists all the documents that contain it and the frequency of its occurrences in each document, which makes it easy to find hits for a query word.
Types of inverted indexes: There are two main types, the record level inverted index and the word level inverted index.
- Record level inverted index contains a list of references to documents for each word.
- Word level inverted index additionally contains the positions of each word within a document. This form of the inverted index also offers more functionality but needs more processing power and space to be created.
Advantages of inverted index: The main utility of inverted index is that it allows fast full-text searches at the cost of increased processing when a document is added to the database.
- Inverted index is easy to develop.
- Inverted index is also the most popular data structure in large-scale document retrieval systems, for example in search engines.
Disadvantage of inverted index: Inverted index has a large storage overhead with high maintenance costs for the update, delete, and insert statements.
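
Both types of inverted index can be sketched with plain dictionaries. The function names and the two sample documents are assumptions made for illustration, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_record_level_index(documents):
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_word_level_index(documents):
    """Map each word to {doc_id: [positions]}, counting tokens from 0."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical two-document collection.
documents = {
    1: "the cat sat on the mat",
    2: "the dog sat",
}
record_index = build_record_level_index(documents)
word_index = build_word_level_index(documents)

print(record_index["sat"])  # documents containing "sat"
print(word_index["the"])    # positions of "the" in each document
```

In this sketch record_index["sat"] is [1, 2], while word_index["the"] is {1: [0, 4], 2: [0]}. The length of each position list gives the frequency of the word in that document, which is the extra information the word level index pays for in space.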

Stop Word Elimination

Stop words are high-frequency words that are deemed unlikely to be useful for searching inside the documents of the information retrieval system.

Words in the corpus that carry little semantic weight are kept in a list called a stop list.
Example: Articles like a, an, the, and prepositions like in, of, for, at, etc. are examples of stop words.
Size reduction of the inverted index using stop list: One main advantage of eliminating stop words is that the stop list can significantly reduce the size of the inverted index.
- As per Zipf’s law, a stop list covering a few dozen words reduces the size of the inverted index by almost half.
One disadvantage of stop word elimination is that it may sometimes eliminate a term that is useful for searching.
- Example: If we eliminate the letter A from Vitamin A, the term loses its significance.
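
Stop word elimination, including the Vitamin A pitfall, can be illustrated with a short sketch. The stop list here is a tiny illustrative one built from the examples above; real stop lists are larger.

```python
# A tiny illustrative stop list (an assumption of this sketch).
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "at"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercase tokens of text that are not on the stop list."""
    return [token for token in text.lower().split() if token not in stop_words]


kept = remove_stop_words("The benefits of Vitamin A in the diet")
print(kept)
```

The result is ['benefits', 'vitamin', 'diet']: five of the eight tokens are dropped, but the A of Vitamin A is dropped with them, so a search for that specific vitamin can no longer be told apart from a search for vitamins in general.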

Stemming

Stemming is the heuristic process of extracting the base form of words by chopping off the ends of words. It reduces the morphological variants of a word to a common root or base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Stemming is one of the important steps in information retrieval systems like search engines.

For example, the words laughing, laughs, and laughed would all be stemmed to the root word laugh.
Usage of stemming in information retrieval system with an example: If we search for the word chocolate in a collection of documents, we want to see all the documents that have information about chocolate.
- Many documents may contain the words chocolates, chocolatey, or choco instead of chocolate.
- To relate these variants, we can stem them to the root word chocolate so that we retrieve all the documents containing this base word, no matter how it is written across documents.
- There are many standard tools for performing this reduction to the root word, like the Porter stemmer, the Snowball stemmer, the Lancaster stemmer, etc.
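
The idea of chopping off word endings can be shown with a deliberately naive suffix stripper. This toy_stem function is a made-up illustration and is not the Porter, Snowball, or Lancaster algorithm; those apply many more rules and conditions.

```python
# Suffixes tried in this order; a deliberately small, illustrative list.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["laughing", "laughs", "laughed", "chocolates"]]
print(stems)
```

This maps laughing, laughs and laughed to laugh, and chocolates to chocolate. It leaves chocolatey and choco unchanged, which shows why practical stemmers need richer rules than plain suffix removal.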

Conclusion

Information retrieval is the task of ranking a list of documents or search results in response to a query.
The relevance of the results can be measured in terms of precision and recall either from the perspective of documents or users.
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, and recall (also called sensitivity) is the fraction of relevant instances that were retrieved.
Classical information retrieval models are based on mathematical knowledge that is easily recognized and understood.
Non-classical information retrieval models contrast with the classical IR models and are based on principles like the information logic model, situation theory model, and interaction models.
Information retrieval systems need to be based on effective design principles like inverted index (primary data structure), stop word removal and stemming for enhanced performance.
