Tokenization

Natural Language Processing, commonly abbreviated as NLP, is a field of computer science and artificial intelligence concerned with the interaction between computers and human language. It covers the problem of programming computers to convert human speech and text into a machine-readable form, and vice versa.

Tokenization is one of the first steps in an NLP pipeline. It is a technique for splitting a sentence, phrase, paragraph, or an entire document into smaller units. These smaller units are called tokens. Tokens are not always words; a token can be a word, a subword, or a character, as the sketch below illustrates.
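
As a rough illustration, here is a minimal Python sketch (not tied to any particular NLP library) showing the same sentence tokenized at the word level and at the character level. The regular expression and the sample sentence are just assumptions for the example; subword tokenization (for instance, byte-pair encoding) usually needs a trained vocabulary, so it is not shown here.

```python
import re

text = "Tokenization splits text into smaller units called tokens."

# Word-level tokens: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)
# ['Tokenization', 'splits', 'text', 'into', 'smaller', 'units', 'called', 'tokens', '.']

# Character-level tokens: every character (including spaces) becomes its own token.
char_tokens = list(text)
print(char_tokens[:10])
# ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i']
```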
