Lexicalized and Probabilistic Parsing


Overview

Lexicalized and probabilistic parsing are two approaches to natural language processing used to analyze the syntactic structure of sentences.

Probabilistic parsing is a type of syntactic parsing that uses probability and statistics to analyze and generate the most likely syntactic structure for a given sentence.

Prerequisites

Before we move on to this article's advanced topics, let's briefly cover some basics required for this article.

Introduction to Grammar

Grammar is a system of rules that govern the structure of sentences in a language; this includes the rules for forming words, phrases, clauses, and sentences and specifies the proper order and arrangement of these elements.

  • Grammar is an important aspect of language because it helps us communicate effectively and understand each other when we speak or write. It also helps to ensure the sentences are clear, correct, and coherent.
  • Grammar can be complex, but understanding the basics can help us communicate better and make us more confident and effective language users.

Grammar allows NLP systems to accurately process and understand natural language input, which is essential for tasks such as machine translation, text summarization, and information extraction.

Elements of Semantic Analysis

Semantic analysis is concerned with how words and sentences are understood by speakers and listeners in a language and how the meaning of a sentence is derived from the meanings of its words and their arrangement in the sentence.

  • The different elements of semantic analysis include the study of word meanings, sentence structure, and the relationship between words and their context.
  • Semantic analysis also involves studying how words and sentences can have different meanings in different contexts and how speakers and listeners use context and background knowledge to understand the meaning of a sentence.

Semantic analysis is an important aspect of language understanding that helps us grasp a language's meaning and communicate effectively.

Introduction to Lexicalized Parsing

Lexicalized parsing is a type of parsing that incorporates information from a language's lexicon, which is the complete set of words and their meanings, to improve the accuracy of the parse.

  • The parser in lexicalized parsing uses the words in the sentence, as well as their meanings in the lexicon, to guide the parsing process. This allows the parser to disambiguate between multiple possible parse trees for a sentence and select the most likely parse based on the meanings of the words in the sentence.
  • Lexicalized parsing builds a tree structure from a set of tokens; a correctly parsed sentence can then be broken into sub-phrases so that the syntactic structure of the sentence is identified.

Introduction to Probabilistic Parsing

Probabilistic parsing, in contrast to lexicalized parsing, is a type of parsing that uses probabilities and statistical models to determine the most likely parse for a sentence.

  • The parser uses a set of rules and probabilities in probabilistic parsing to determine the most likely syntactic structure for a sentence.
  • This kind of structure allows the parser to take into account the context of the sentence and the likelihood of different syntactic structures and select the most probable parse given the context.

Both lexicalized and probabilistic parsing can improve the accuracy and efficiency of natural language processing systems and are commonly used in many applications in NLP.

Introduction

Now that we have briefly covered the prerequisites for this article - grammar, elements of semantic analysis, lexicalized parsing, and probabilistic parsing - we can move on to its more advanced topics while also covering parsing and its types in detail.

What is Parsing?

Parsing is the process of analyzing a string of text, like a sentence or a paragraph, to determine the grammatical structure and identify the relationships between the words that the text contains.

  • The parsing process involves identifying the parts of speech, such as nouns, verbs, and adjectives, and determining how they are related to each other through syntactic rules; a small part-of-speech tagging sketch follows this list.
  • Parsing can be used to extract meaning from text and help computers understand and process natural language input.
  • Parsing can also be viewed as a procedural interpretation of grammar: the parser searches through the space of possible trees and finds an optimal tree for the given sentence.
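To make the first of these steps concrete, here is a minimal part-of-speech tagging sketch using the NLTK library; it assumes NLTK is installed, downloads the tokenizer and tagger models it needs, and the exact tags may vary with the model version.

```python
import nltk

# One-time model downloads; these are no-ops if the models are already present.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The quick brown fox jumps over the lazy dog"

# Identify the part of speech of each word, the first step of parsing.
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)
# e.g., [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```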

Deep Vs. Shallow Parsing: Deep and shallow parsing are two approaches used in natural language processing (NLP) to analyze and understand the structure of sentences in a text.

  • Shallow parsing: Also known as light parsing or chunking, this approach divides the sentence into short chunks or phrases, such as noun phrases and verb phrases. It provides a limited analysis of the sentence structure without considering the dependencies between the words (a chunking sketch follows this list).
  • Deep parsing: Also known as full parsing or syntactic parsing, this approach analyzes the sentence to determine the dependencies between the words and the roles they play in the sentence. It provides a more detailed and accurate analysis of the sentence structure, including the grammatical relationships between the words.
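As an illustration of shallow parsing, here is a small chunking sketch using NLTK's RegexpParser; the chunk grammar below is a simple illustrative pattern, not a production-quality one.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# A toy chunk grammar: an NP is an optional determiner, any number of
# adjectives, and a noun. Shallow parsing stops at such flat chunks.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")

tagged = nltk.pos_tag(nltk.word_tokenize("The little dog saw a cat"))
print(chunker.parse(tagged))
# (S (NP The/DT little/JJ dog/NN) saw/VBD (NP a/DT cat/NN))
```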

Let us briefly go through some of the parsers commonly used by practitioners and in the industry to see how they work and to better understand the concepts later in the article; a runnable comparison of the first three parsers follows the list.

  • Shift-reduce parser: A shift-reduce parser is a bottom-up parser that analyzes the structure of sentences in a text.

    • It starts with the individual words in the sentence and combines them into more complex structures, such as phrases and clauses until it has a complete parse tree that represents the sentence's syntactic structure.
    • The parser performs this process by repeatedly applying two operations known as shifting and reducing to the input sentence.
      • Shifting operation: The parser moves a single word from the input to the parse tree in this operation.
      • Reducing operation: In this operation, the parser combines two or more words in the parse tree into a larger syntactic unit.
    • By alternating between the shifting and reducing operations, the shift-reduce parser builds up the parse tree for the sentence in a step-by-step manner.
  • Recursive descent parser: A recursive descent parser is a top-down parser that starts with the overall structure of the sentence and breaks it down into smaller and smaller pieces until it has identified the individual words in the sentence.

    • The parser performs this process by applying a set of recursive rules, or productions, that specify the possible sequences of words that can appear in the sentence.
    • Each production corresponds to a non-terminal symbol in the language's grammar, representing a part of the sentence structure, such as a noun phrase or a verb phrase.
    • To parse a sentence, the parser begins by matching the input against the grammar's start symbol and then recursively applies the appropriate production rules to identify the sentence's constituents and build up a parse tree that represents its structure.
  • Chart parser: Chart parsers instead use a tabular representation to analyze the syntactic structure of a sentence. A chart parser represents the sentence as a chart or table, with each word occupying a cell in the chart.

    • The parser uses a set of rules to analyze the chart and identify the dependencies between words, ultimately producing a parse tree or other representation of the sentence's syntactic structure.
    • Chart parsers are often used in other natural language processing systems to analyze the grammatical structure of sentences and extract useful information from them. They are especially useful for handling complex sentences and for dealing with ambiguities in the structure of a sentence.
  • Regex Parser: Regex parsing or regular expression parsing is a technique for analyzing text data using regular expressions.

    • Regular expressions are a powerful tool for matching patterns in text data, and they can be used for a wide range of tasks such as validating input, searching for substrings, and extracting data from text.
    • A regex pattern in regex parsing is used to search through a body of text, and any matches to the pattern are identified and returned.
    • Regex parsers are a very efficient way of processing text data, as they allow complex pattern matching using a compact and expressive syntax. Regex parsing is therefore also commonly used in many programming languages and text-processing tools.
  • Dependency Parser: Dependency parsing in NLP is a technique used to analyze a sentence's grammatical structure by identifying the dependencies between different words in a sentence and determining their relationships based on the roles they play in the sentence.

    • Dependency parsing is useful for many natural language processing tasks, such as information extraction, text summarization, and semantic role labeling (SRL), and it complements related techniques such as syntactic chunking and constituency parsing.
    • A sentence is broken into several components, with the underlying idea that there are direct links (termed dependencies) between the linguistic units of a sentence.
      • Each word in a sentence is represented as a node in a tree-like structure, and the dependencies between words are represented as edges between the nodes.
      • This kind of hierarchical structure of the sentence allows for a clear representation and provides useful information about the meaning and intent of the sentence.
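To see the first three parsers side by side, here is a runnable sketch using NLTK over a toy grammar; the grammar and sentence are invented for illustration, and the grammar is deliberately written without left recursion so that the top-down recursive descent parser terminates.

```python
import nltk

# A toy grammar, written without left recursion so the recursive
# descent parser cannot loop forever.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Det N PP
    VP -> V NP | V NP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat' | 'park'
    V -> 'saw'
    P -> 'in'
""")

sentence = "the dog saw a cat in the park".split()

# Top-down: starts from S and recursively expands productions.
rd_parser = nltk.RecursiveDescentParser(grammar)
# Bottom-up: shifts words and reduces them into larger constituents; being
# greedy, it returns at most one parse and can miss others.
sr_parser = nltk.ShiftReduceParser(grammar)
# Chart-based: stores partial results in a table, handling ambiguity well.
chart_parser = nltk.ChartParser(grammar)

for tree in chart_parser.parse(sentence):
    print(tree)  # both attachments of "in the park" are found
```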

Morphological Parsing

Morphological parsing in natural language processing is the process of analyzing a word to determine its component morphemes, the smallest units of meaning in a language. It is typically done to better understand the meaning and grammar of a sentence.

  • Morphological parsing is a challenging task, especially in morphologically rich languages; even English, whose morphology is comparatively simple, presents plenty of ambiguity.
  • Example: unbreakable can be morphologically parsed into three morphemes: un-, break, and -able.
    • The first morpheme, "un-" indicates negation, the second morpheme, "break," is the root of the word, and the third morpheme, "-able" indicates the ability to do something.
    • Together, these three morphemes give the word its overall meaning of "not able to be broken".
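Below is a deliberately simplified, rule-based sketch of this decomposition; real morphological parsers typically use finite-state transducers and large lexicons, and the affix and root tables here are hypothetical, chosen only to cover the example above.

```python
# Toy affix-stripping morphological parser; the tables are illustrative only.
PREFIXES = {"un": "negation", "re": "repetition"}
SUFFIXES = {"able": "ability", "ly": "adverbial", "ness": "state"}
ROOTS = {"break", "expect", "happy"}

def parse_morphemes(word):
    """Return (prefix, root, suffix) if the word decomposes cleanly, else None."""
    for pre in [""] + list(PREFIXES):
        for suf in [""] + list(SUFFIXES):
            end = len(word) - len(suf)
            root = word[len(pre):end]
            # The equality check guards against stripping a non-matching affix.
            if word == pre + root + suf and root in ROOTS:
                return pre, root, suf
    return None

print(parse_morphemes("unbreakable"))  # ('un', 'break', 'able')
```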

Major problems associated with Morphological Parsing

  • Ambiguity: Many words in English have multiple possible morphological analyses, making it difficult to determine the exact structure of the word.
    • Example: The word "unexpectedly" could be analyzed as "un-expect-ed-ly" or as "un-expected-ly".
  • Unknown words: Morphological parsers must be able to handle unknown words, which may not appear in their training data. This can be a very difficult task, as the parser must generalize from known words to unknown words to provide an accurate analysis.
  • Data sparsity: A large amount of annotated data is needed to train morphological parsers, but annotating text data for morphological parsing is a time-consuming and labor-intensive task, and it can be difficult to obtain enough data to train a high-quality parser.
  • Complexity of execution: Morphological parsing involves analyzing the structure of words at a fine-grained level, which can make the task complex and difficult to implement, making it challenging to develop accurate and efficient morphological parsers.

Probabilistic Context Free Grammar (PCFG)

Probabilistic Context-Free Grammar (PCFG) is a type of grammar used in natural language processing. It is a probabilistic version of a standard context-free grammar (CFG), a set of rules used to generate strings of words in a particular language, where the model parameters are the probabilities assigned to the grammar rules.

  • Probabilities of all productions rewriting a given non-terminal must add to 1, defining a distribution for each non-terminal.
  • String generation is probabilistic: production probabilities are used to non-deterministically select a production for rewriting a given non-terminal.
  • Each rule is associated with a probability which indicates the likelihood that the rule will be used to generate a given sentence. This allows the PCFG to model the uncertainty inherent in natural language and make more accurate predictions about the structure of sentences.
    • The independence assumption is one of the key properties of PCFGs: the probability of a node sequence depends only on its immediate mother node, not on any node above that or outside the current constituent.
    • Different rules may be used depending on the context allowing the PCFG to model the uncertainty and variability inherent in natural language.

PCFGs are a good way to resolve ambiguity problems in syntactic structure.

Probability Model

Similar to a context-free grammar, a probabilistic context-free grammar G can be defined by a quintuple G = (M, T, R, S, P) where:

  • M is the set of non-terminal symbols
  • T is the set of terminal symbols
  • R is the set of production rules
  • S is the start symbol
  • P is the set of probabilities on production rules
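As a minimal sketch, this quintuple can be instantiated directly in NLTK (assuming NLTK is installed); the toy grammar below supplies M (the non-terminals), T (the words), R with its attached probabilities P, and the start symbol S, and the probabilities of each non-terminal's expansions sum to 1.

```python
import nltk

# R and P together: for each non-terminal, expansion probabilities sum to 1.
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.7] | NP PP [0.3]
    VP -> V NP [0.6] | VP PP [0.4]
    PP -> P NP [1.0]
    Det -> 'the' [0.6] | 'a' [0.4]
    N -> 'dog' [0.5] | 'telescope' [0.5]
    V -> 'saw' [1.0]
    P -> 'with' [1.0]
""")

# The Viterbi parser returns the single most probable parse tree,
# with its probability attached.
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the dog saw a dog with a telescope".split()):
    print(tree)
    print(tree.prob())
```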

Problems with PCFG

  • PCFGs suffer from various problems, among which the two most crucial weaknesses are a lack of sensitivity to lexical information and a lack of sensitivity to structural preferences.
    • Due to these two problems, PCFGs cannot always capture the full range of syntactic variation and ambiguity that exists in natural languages, leading to errors and incorrect parses, particularly when working with sentences that are structurally complex or contain multiple possible interpretations.
    • Lexicalized PCFGs were developed to address these issues and perform better than plain PCFGs.
  • Complexity: One other major problem with PCFGs is that they can be very complex, making them difficult to understand and work with. The complexity makes it difficult to design a PCFG that accurately represents the structure of a given language and challenging to implement and use a PCFG in a practical application.
  • Data Availability: PCFGs also have an issue because they often require a large amount of training data to produce accurate results. This can be a problem when working with languages with limited amounts of annotated text or when trying to parse sentences containing novel or unusual constructions.

Probabilistic Lexicalized CFG

  • Probabilistic Lexicalized Context-Free Grammar (PLCFG) is also a type of grammar used in natural language processing to generate and analyze sentences in a given language.

  • It is a combination of a lexicalized context-free grammar, which uses lexical items (words) as the basic units for generating sentences, and probabilistic models, which assign probabilities to the different rules and structures in the grammar.

  • Probabilistic lexicalized context-free grammars address the somewhat separable weaknesses that stem from the independence assumptions of PCFGs, of which the most often remarked upon is their lack of lexicalization.

    • In a PCFG, the probability of a VP expanding as a verb followed by two noun phrases is independent of the choice of verb. This is a weakness, as that expansion is much more likely with ditransitive verbs like hand or tell than with other verbs (a numeric sketch follows below).

Combining lexicalization with probability models allows the probabilistic lexicalized CFG model to take into account the likelihood of different sentences and structures, making it useful for tasks such as language modeling and machine translation.
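The difference can be made concrete with a small numeric sketch; the probabilities below are hypothetical, chosen only to illustrate how conditioning the VP -> V NP NP expansion on the head verb changes its likelihood, whereas a plain PCFG assigns it a single probability regardless of the verb.

```python
# Hypothetical probabilities, for illustration only: an unlexicalized PCFG
# assigns one probability to VP -> V NP NP no matter the verb, while a
# lexicalized grammar conditions the expansion on the verb (head word).
P_UNLEXICALIZED = {"VP -> V NP NP": 0.05}

P_LEXICALIZED = {
    ("VP -> V NP NP", "hand"): 0.40,  # ditransitive: "hand someone something"
    ("VP -> V NP NP", "tell"): 0.30,  # ditransitive: "tell someone something"
    ("VP -> V NP NP", "eat"):  0.01,  # rarely takes two objects
}

for verb in ("hand", "tell", "eat"):
    p = P_LEXICALIZED[("VP -> V NP NP", verb)]
    print(f"P(VP -> V NP NP | head={verb}) = {p}")
```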

Human Parsing

Human parsing is a process typically carried out by humans who use their knowledge of grammar and syntax to understand the meaning and intent of a given sentence; this is carried out by analyzing and interpreting the structure of natural language sentences.

  • Human parsing can be difficult and time-consuming, particularly when working with sentences that are long or complex, but it is an essential skill for anyone who wants to work with natural language data, such as linguists, language researchers, and natural language processing experts.
  • There are different approaches to human parsing, and the specific method used can vary depending on the goals of the individual or group conducting the analysis.
    • Some common techniques for human parsing include manual analysis, where a person reads and interprets the sentence; computational methods, where a computer program is used to analyze the sentence; and a combination of manual and computational approaches.

The Chomsky Hierarchy

Chomsky Hierarchy is a classification of formal languages named after the linguist Noam Chomsky. It is a way of organizing and categorizing different types of languages based on their structure and the rules governing their generation.

The Chomsky Hierarchy consists of four levels:

  • Type 0 languages, generated by unrestricted grammars, are the most general and powerful type of language. They can be recognized by a Turing machine.
  • Type 1 languages, also known as context-sensitive languages, are a strict subset of Type 0 languages; they are recognized by a linear bounded automaton and can be described by a context-sensitive grammar.
  • Type 2 languages, also known as context-free languages, are a strict subset of Type 1 languages; they are recognized by a pushdown automaton and can be described by a context-free grammar.
  • Type 3 languages, also known as regular languages, are a strict subset of Type 2 languages. They are recognized by a finite state machine and can be described by a regular expression.

The Chomsky Hierarchy is important in theoretical computer science and linguistics as it provides a framework for studying the formal properties of languages and the algorithms that can be used to process them.
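A small sketch can contrast two adjacent levels (the patterns here are illustrative): the regular language (ab)* is matched by an ordinary regular expression, while the context-free language a^n b^n requires matched counts, which a finite automaton cannot enforce.

```python
import re

# Type 3 (regular): (ab)* is recognized by an ordinary regular expression.
print(bool(re.fullmatch(r"(ab)*", "ababab")))  # True

# Type 2 (context-free): a^n b^n requires equal counts. A regex like a*b*
# accepts the right shape but cannot enforce the counts...
print(bool(re.fullmatch(r"a*b*", "aaabb")))    # True, counts unchecked

# ...whereas a simple counter (standing in for a pushdown automaton) can.
def is_anbn(s: str) -> bool:
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print(is_anbn("aaabbb"))  # True
print(is_anbn("aaabb"))   # False
```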

What is Lexicalization?

Lexicalization is the process of making a word express a concept by turning a phrase or sentence into a single word or a sequence of words.

  • Lexicalization is done by combining the most important words in the phrase or sentence to maintain the meaning of the original phrase or sentence.
  • Example: to be honest could be lexicalized as candidly or frankly.
  • Research on various use cases concerning lexicalization shows that many NLP tasks and applications benefit from incorporating it.
  • Because there is a strict correlation between the meaning components of a word root and its syntactic properties, and because languages differ in how meaning components are conflated within word roots, data on lexicalization can be useful for Word Sense Disambiguation (WSD) and all connected applications.

Lexicalization is a common feature of many languages that makes communication more efficient and concise.

Example

  • If we look at a couple of sample sentences - I ate pizza with olives, I ate pizza with friends. Using PCFG in this context has some limitations that stem from being context-free.
    • The two example sentences have different parse trees: in the first sentence, with olives is a prepositional phrase attached to the noun phrase pizza, and in the second, with friends is a prepositional phrase attached to the verb phrase ate.
    • PCFG has no way to assign the correct structure for each sentence because, above the lexical level or the leaves, the sentences look the same and have the same POS sequence.
  • Lexicalizing the PCFG model adds words to the grammar rules. So for our example, we would have: VP/ate -> VP/ate PP/friends; NP/pizza -> NP/pizza PP/olives.
    • The use of these added words is more economical because they are shorter than the corresponding underlying sentences or paraphrases and can be more easily used as elements of sentences.
    • Example: Based on the lexicalization, one does not say someone who writes a book for someone else, who then often pretends it is their work; one says ghostwriter instead.
    • Lexicalization also adds these rules with their own probabilities, so the lexicalized PCFG can assign the correct structure to each sentence using the probabilities of the rules (see the sketch below).
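Here is a minimal sketch of that disambiguation step; the attachment probability table is hypothetical and would, in a real lexicalized PCFG, be estimated from a treebank.

```python
# Hypothetical lexicalized attachment probabilities: the preposition's object
# decides whether the PP attaches to the noun (pizza) or the verb (ate).
P_ATTACH = {
    ("ate", "pizza", "olives"):  {"noun": 0.9, "verb": 0.1},
    ("ate", "pizza", "friends"): {"noun": 0.1, "verb": 0.9},
}

def best_attachment(verb, noun, pp_object):
    probs = P_ATTACH[(verb, noun, pp_object)]
    return max(probs, key=probs.get)

print(best_attachment("ate", "pizza", "olives"))   # noun -> NP/pizza PP/olives
print(best_attachment("ate", "pizza", "friends"))  # verb -> VP/ate PP/friends
```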

Computing Lexicalized Rule Probabilities

Lexicalized rule probabilities are probabilities associated with a specific set of rules for a given language; these can be calculated using a corpus of text.

  • The first step is to identify the set of rules that you want to calculate probabilities for, including rules for grammar, syntax, and semantics.
  • Once we have identified the rules, we can use a statistical method such as maximum likelihood estimation to calculate the probabilities. This process involves counting the number of times each rule occurs in the corpus and dividing it by the number of times its left-hand-side non-terminal is expanded.
  • Example: If the rule VP -> V NP occurs 100 times in a corpus in which VP is expanded 1000 times in total, the probability of that rule would be 100/1000 = 0.1.

The quality of the probabilities will depend on the size and quality of the corpus used for calculation. A larger and more diverse corpus will generally provide more accurate probabilities.
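The whole estimation pipeline can be sketched with NLTK's induce_pcfg, which applies exactly this maximum likelihood estimate; the two toy trees below stand in for an annotated treebank.

```python
import nltk

# Two hand-written parse trees standing in for an annotated treebank.
treebank = [
    nltk.Tree.fromstring("(S (NP (Det the) (N dog)) (VP (V barked)))"),
    nltk.Tree.fromstring(
        "(S (NP (Det the) (N cat)) (VP (V saw) (NP (Det a) (N dog))))"),
]

# Collect every production (rule occurrence) used in the treebank.
productions = [p for tree in treebank for p in tree.productions()]

# Maximum likelihood estimation: P(A -> beta) = count(A -> beta) / count(A).
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
print(grammar)
# e.g., VP -> V [0.5] and VP -> V NP [0.5]: each occurs once
# out of two VP expansions.
```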

Conclusion

  • Parsing is the process of analyzing a string of text, like a sentence or a paragraph, to determine the grammatical structure and identify the relationships between the words that the text contains.
  • The lexicon of a language is the complete set of its words and their meanings; lexicalized parsing uses it to improve the accuracy of a parse.
  • Context-free grammar (CFG) is a set of production rules used to generate all possible strings in a language.
  • Lexicalized parsing incorporates lexical information - the words of a sentence and their meanings in the lexicon - to guide and disambiguate the parsing process.
  • Probabilistic parsing uses probabilities and statistical models, together with a set of rules, to determine a sentence's most likely syntactic structure.
  • Probabilistic Lexicalized Context-Free Grammars (PLCFGs) combine traditional CFGs with probability information for each production rule in the grammar, allowing them to capture the likelihood of different syntactic structures.