Code Review: How the AllenNLP Vocabulary indexes your text

This content originally appeared on Level Up Coding - Medium and was authored by Xinzhe Li

Although AllenNLP provides a good guide for almost all the modules in the library, I was still confused about several aspects of how the vocabulary is constructed and used. This post works through the following questions.

Construction: how are token/id pairs added to the Vocabulary?
Use in a text field: how is the vocabulary consumed for text indexing?
Label indexing: how is the vocabulary used for labels?
Are there any caveats?

Section 1: How are token/id pairs added to the Vocabulary?

From instances: Normally, the Vocabulary.from_instances method builds a counter of type Dict[str, Dict[str, int]], where each outer key is a namespace and each inner dictionary stores token/count pairs. The _extend method then turns this counter into the token/id mappings held in the self._token_to_index and self._index_to_token attributes.
Specifically, from_instances calls count_vocab_items on each Instance object, which in turn calls count_vocab_items on each Field object. For a TextField, this in turn (again) calls count_vocab_items on each of its TokenIndexer objects, once per token. The code that actually counts items therefore lives in the TokenIndexer. Its count_vocab_items uses the indexer's namespace as the outer key of counter and increments the count of the token, adding the token first if it has not been seen yet. Below is the code in SingleIdTokenIndexer.

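# Method of SingleIdTokenIndexer (Token and Dict are imported in the library source).
def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
    if self.namespace is not None:
        # Pick the feature to index, e.g. the token's text.
        text = self._get_feature_value(token)
        if self.lowercase_tokens:
            text = text.lower()
        # Increment the count of this token under the indexer's namespace.
        counter[self.namespace][text] += 1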
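
To make the count-then-extend flow concrete, here is a minimal pure-Python sketch that does not import AllenNLP. The functions count_tokens and build_index are hypothetical stand-ins for the count_vocab_items chain and for _extend. The sketch assumes a padded namespace in which ids 0 and 1 are reserved for the padding and unknown tokens, and the remaining tokens are ordered by descending count.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PADDING_TOKEN = "@@PADDING@@"  # AllenNLP's default padding token
OOV_TOKEN = "@@UNKNOWN@@"      # AllenNLP's default unknown token


def count_tokens(sentences: List[List[str]], namespace: str) -> Dict[str, Dict[str, int]]:
    """Hypothetical stand-in for the count_vocab_items chain: namespace -> token -> count."""
    counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for token in sentence:
            counter[namespace][token] += 1
    return counter


def build_index(
    counter: Dict[str, Dict[str, int]]
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, Dict[int, str]]]:
    """Hypothetical stand-in for _extend: turn counts into two lookup tables."""
    token_to_index: Dict[str, Dict[str, int]] = {}
    index_to_token: Dict[str, Dict[int, str]] = {}
    for namespace, counts in counter.items():
        # Assumption: the namespace is padded, so two ids are reserved first.
        vocab = [PADDING_TOKEN, OOV_TOKEN]
        # sorted() is stable, so tokens with equal counts keep first-seen order.
        vocab += [tok for tok, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]
        token_to_index[namespace] = {tok: i for i, tok in enumerate(vocab)}
        index_to_token[namespace] = dict(enumerate(vocab))
    return token_to_index, index_to_token


counter = count_tokens([["the", "cat", "sat"], ["the", "dog"]], namespace="tokens")
token_to_index, index_to_token = build_index(counter)
print(token_to_index["tokens"])
```

Here "the" is the most frequent token, so it receives the first id after the two reserved ones. The counter itself never stores ids; ids only appear once the counts are converted into the two lookup tables.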

Section 2: How is the vocabulary consumed for text indexing?

The Vocabulary works together with a TokenIndexer to index the tokens in Field objects. Specifically, the TokenIndexer.tokens_to_indices method takes a Vocabulary object as an argument and looks up each token in the indexer's namespace.

As above, the code that does the work is in the TokenIndexer. Below is the code in SingleIdTokenIndexer.

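# Method of SingleIdTokenIndexer (itertools, List, Dict, Token and Vocabulary are imported in the library source).
def tokens_to_indices(
    self, tokens: List[Token], vocabulary: Vocabulary
) -> Dict[str, List[int]]:
    indices: List[int] = []
    # Optional start and end tokens are added around the original tokens.
    for token in itertools.chain(self._start_tokens, tokens, self._end_tokens):
        text = self._get_feature_value(token)
        if self.namespace is None:
            # No namespace: the feature value is used directly as the index.
            indices.append(text)  # type: ignore
        else:
            if self.lowercase_tokens:
                text = text.lower()
            indices.append(vocabulary.get_token_index(text, self.namespace))

    return {"tokens": indices}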
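
The lookup step can be sketched without AllenNLP as well. The following is a simplified illustration, not the library code: the token_to_index table is hypothetical, and the two functions only mimic the behaviour of Vocabulary.get_token_index and a single-id indexer, namely that a token missing from a padded namespace falls back to the id of the unknown token.

```python
from typing import Dict, List

OOV_TOKEN = "@@UNKNOWN@@"  # AllenNLP's default unknown token

# Hypothetical token-to-index table for one padded namespace.
token_to_index: Dict[str, Dict[str, int]] = {
    "tokens": {"@@PADDING@@": 0, OOV_TOKEN: 1, "the": 2, "cat": 3},
}


def get_token_index(token: str, namespace: str) -> int:
    """Sketch of the lookup: known tokens map to their id, others to the unknown id."""
    table = token_to_index[namespace]
    if token in table:
        return table[token]
    # Raises KeyError if the namespace has no unknown token.
    return table[OOV_TOKEN]


def tokens_to_indices(
    tokens: List[str], namespace: str, lowercase: bool = False
) -> Dict[str, List[int]]:
    """Sketch of a single-id indexer: one integer per token."""
    indices: List[int] = []
    for text in tokens:
        if lowercase:
            text = text.lower()
        indices.append(get_token_index(text, namespace))
    return {"tokens": indices}


indexed = tokens_to_indices(["The", "cat", "purred"], "tokens", lowercase=True)
print(indexed)  # {'tokens': [2, 3, 1]}
```

"purred" is not in the table, so it is mapped to id 1, the unknown token. Lowercasing happens in the indexer, before the vocabulary is consulted, which is why "The" still resolves to the id of "the".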

The tokens_to_indices method is called by the TextField.index method, as shown below. Notice the difference between TextField.index and LabelField.index, which is discussed in the next section.

def index(self, vocab: Vocabulary):
    self._indexed_tokens = {}
    for indexer_name, indexer in self.token_indexers.items():
        self._indexed_tokens[indexer_name] = indexer.tokens_to_indices(self.tokens, vocab)

Section 3: The vocabulary for label indexing

Label indexing differs from text indexing in both construction and consumption.

No IDs are reserved for the padding and unknown tokens during construction, because namespaces ending in "labels" or "tags" are non-padded by default.
No TokenIndexer: the LabelField itself holds the namespace ("labels" by default) and uses it to look up the label index in the token-to-index dictionary (i.e., Vocabulary._token_to_index["labels"]), as shown in the following code in LabelField.

def index(self, vocab: Vocabulary):
    if not self._skip_indexing:
        self._label_id = vocab.get_token_index(
            self.label, self._label_namespace  # type: ignore
        )
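
The difference between a padded and a non-padded namespace can be illustrated with a small library-free sketch. The helpers is_padded and build_namespace are hypothetical; the only detail taken from AllenNLP is the assumption that the default non-padded patterns are "*tags" and "*labels".

```python
from typing import Dict, List, Tuple

NON_PADDED_NAMESPACES: Tuple[str, ...] = ("*tags", "*labels")  # AllenNLP's defaults


def is_padded(namespace: str) -> bool:
    """Hypothetical helper: a namespace is non-padded if it matches a '*suffix' pattern."""
    for pattern in NON_PADDED_NAMESPACES:
        if pattern.startswith("*") and namespace.endswith(pattern[1:]):
            return False
        if pattern == namespace:
            return False
    return True


def build_namespace(tokens: List[str], namespace: str) -> Dict[str, int]:
    """Reserve padding/unknown ids only for padded namespaces."""
    vocab = ["@@PADDING@@", "@@UNKNOWN@@"] if is_padded(namespace) else []
    for token in tokens:
        if token not in vocab:
            vocab.append(token)
    return {tok: i for i, tok in enumerate(vocab)}


label_index = build_namespace(["positive", "negative"], "labels")
word_index = build_namespace(["good", "bad"], "tokens")
print(label_index)  # {'positive': 0, 'negative': 1}
print(word_index)   # {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'good': 2, 'bad': 3}
```

Label ids start at 0, so a classifier with two labels gets exactly two output classes. The flip side is that a label missing from the namespace has no unknown id to fall back on, so looking it up fails instead of returning a default.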

Section 4: Caveats

Forgetting to apply a TokenIndexer to a TextField: This is one of the most common mistakes when processing data with AllenNLP, because it seems that the token/id mapping in the Vocabulary should be enough to index the tokens in a TextField. However, there are several reasons why a TokenIndexer is needed.

One common reason is to add special tokens (start or end tokens).
Another reason is that the tokens in a TextField may not match the granularity of the token/id mapping in the Vocabulary. For example, we tokenize text into words, but the vocabulary contains a token/id mapping for characters. This sounds weird: why tokenize text into words rather than characters if we want to index at the character level? My guess is that it lets us combine word indexing (using SingleIdTokenIndexer) and character indexing (using TokenCharactersIndexer) over the same tokens.

So, remember to apply a TokenIndexer. Below is the code to use both word indexing and character indexing within one text field.

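from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer

# text_field is assumed to be an existing TextField.
text_field.token_indexers = {
    # Word-level ids, looked up in the "token_vocab" namespace.
    "tokens": SingleIdTokenIndexer(namespace="token_vocab"),
    # Character-level ids for each token, looked up in "character_vocab".
    "token_characters": TokenCharactersIndexer(namespace="character_vocab"),
}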
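
To see why word-level tokens can still be indexed at the character level, here is a rough library-free sketch of what the two indexers produce for the same tokens. The build_table helper and the example words are hypothetical, and the output is simplified: it assumes ids 0 and 1 are reserved for padding and unknown in both namespaces and ignores details such as padding characters to a common length.

```python
from typing import Dict, List


def build_table(items: List[str]) -> Dict[str, int]:
    """Hypothetical helper: ids 0 and 1 are reserved for padding and unknown."""
    vocab = ["@@PADDING@@", "@@UNKNOWN@@"] + sorted(set(items))
    return {item: i for i, item in enumerate(vocab)}


words = ["cat", "sat"]
vocab = {
    "token_vocab": build_table(words),
    "character_vocab": build_table([ch for word in words for ch in word]),
}

# One entry per indexer, mirroring the keys of token_indexers.
indexed = {
    # Word indexing: one id per token.
    "tokens": [vocab["token_vocab"].get(w, 1) for w in words],
    # Character indexing: one list of character ids per token.
    "token_characters": [
        [vocab["character_vocab"].get(ch, 1) for ch in w] for w in words
    ],
}
print(indexed)
# {'tokens': [2, 3], 'token_characters': [[3, 2, 5], [4, 2, 5]]}
```

Both entries are aligned on the same word boundaries, which is what makes it possible to combine a word embedding with a character-based encoding for each token.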

