Tokens vs Chunks

This content originally appeared on DEV Community and was authored by OdyAsh

When reading articles or documentation, you'll sometimes see "tokens" and "chunks" treated as synonyms, but they usually represent different levels of granularity. The definitions below show the difference:

 

Tokens

  • A token is the smallest unit of data that an NLP model processes, such as a sentence, a word, or a character (s1).
  • Tokenization is a way to break text down into manageable components for analysis (s2).
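To make the three granularities concrete, here is a minimal sketch that tokenizes the sample text used in this post as sentences, as words, and as characters. The function names and the regex-based splitting rules are assumptions made for illustration; real pipelines usually rely on a dedicated tokenizer library.

```python
import re

# Sample text from this post; the blank line marks the paragraph break.
TEXT = "Hello there! My name is OdyAsh\n\nI like astronomy!"


def sentence_tokens(text: str) -> list[str]:
    """Tokens = sentences (naive rule: split after . ! ? or at a blank line)."""
    parts = re.split(r"(?<=[.!?])\s+|\n\s*\n", text.strip())
    return [p for p in parts if p]


def word_tokens(text: str) -> list[str]:
    """Tokens = words (runs of word characters; punctuation is dropped)."""
    return re.findall(r"\w+", text)


def char_tokens(text: str) -> list[str]:
    """Tokens = individual characters, whitespace included."""
    return list(text)


if __name__ == "__main__":
    print(sentence_tokens(TEXT))
    print(word_tokens(TEXT))
    print(len(char_tokens(TEXT)), "character tokens")
```

The same text yields 3 sentence tokens, 9 word tokens, or one token per character, depending only on which unit you decide is "smallest".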

 

Chunks

  • A chunk is a group of tokens (s3).
  • For example, if we have this text: "Hello there! My name is OdyAsh (new paragraph) I like astronomy!", then depending on how you want to process this text, you might have one of these configurations:
    • tokens ⟺ sentences, chunks ⟺ paragraphs
    • tokens ⟺ words, chunks ⟺ sentences
    • tokens ⟺ words, chunks ⟺ noun phrases only (i.e., process the tokens so that they are grouped into chunks of noun phrases). Example: s4.
    • tokens ⟺ characters, chunks ⟺ words
    • tokens ⟺ characters, chunks ⟺ fixed-size windows of a given chunk size (e.g., 200 characters)
    • Examples: s5
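Three of the configurations above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names are hypothetical, sentences are split with a naive regex, and the noun-phrase configuration is left out because it needs a part-of-speech tagger.

```python
import re

# Sample text from the list above; the blank line marks the paragraph break.
TEXT = "Hello there! My name is OdyAsh\n\nI like astronomy!"


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence splitter: break after . ! ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paragraphs_of_sentences(text: str) -> list[list[str]]:
    """tokens = sentences, chunks = paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [split_sentences(p) for p in paragraphs]


def sentences_of_words(text: str) -> list[list[str]]:
    """tokens = words, chunks = sentences."""
    sentences = [s for para in paragraphs_of_sentences(text) for s in para]
    return [re.findall(r"\w+", s) for s in sentences]


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """tokens = characters, chunks = windows of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


if __name__ == "__main__":
    print(paragraphs_of_sentences(TEXT))
    print(sentences_of_words(TEXT))
    print(fixed_size_chunks(TEXT, 20))
```

In every case the chunk is just a list (or slice) of tokens, so joining the fixed-size chunks back together reproduces the original text.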

So, one might treat a chunk as a unit of data from which the NLP model gains useful information (s4). By "chunking down", we get to the details of each chunk, i.e., the tokens that form it (s6).

 

Summary

  • Usually:
    • A chunk: a coarser-grained unit of data (a group of tokens).
    • A token: a finer-grained unit of data (the smallest unit the model processes).
  • Occasionally:
    • They are treated as the same thing.

 

If you have any questions/suggestions...

Your participation is most welcome! 🔥🙌

 

And if I made a mistake

Then kindly correct me :] <3

 

Sources


APA

OdyAsh | Sciencx (2024-09-14T13:09:01+00:00) Tokens vs Chunks. Retrieved from https://www.scien.cx/2024/09/14/tokens-vs-chunks/
