Count Unique Tokens in Python using Tiktoken Library

Working with text data often requires counting the unique tokens in a document or corpus. tiktoken is OpenAI's open-source byte pair encoding (BPE) tokenizer: it splits text into subword tokens, each represented by an integer ID, rather than into whole words or characters. In this article, we will learn how to use tiktoken to count unique tokens in Python.

Installing Tiktoken Library

To start, you need to install the Tiktoken library. You can do this using pip:

pip install tiktoken

Tokenizing Text with Tiktoken

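Before counting unique tokens, we need to tokenize the text. In tiktoken, you load an encoding with tiktoken.get_encoding() and call its encode() method, which returns a list of integer token IDs. The example below uses the cl100k_base encoding and decodes each ID back to text so the tokens are readable:

import tiktoken

# Load a built-in encoding (downloaded and cached on first use).
encoding = tiktoken.get_encoding("cl100k_base")
text = "This is a sample sentence."

# encode() returns a list of integer token IDs.
token_ids = encoding.encode(text)

# Decode each ID on its own to see the text it stands for.
tokens = [encoding.decode([token_id]) for token_id in token_ids]
print(tokens)

Output:

['This', ' is', ' a', ' sample', ' sentence', '.']

A leading space is part of the token, so ' is' and 'is' are different tokens.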

Counting Unique Tokens

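Now that you know how to tokenize text with tiktoken, let's move on to counting unique tokens. We will create a function that takes a text string as input and returns a Counter mapping each token to the number of times it occurs. The number of unique tokens is the length of that Counter.

import tiktoken
from collections import Counter

# Load the encoding once and reuse it across calls.
encoding = tiktoken.get_encoding("cl100k_base")

def count_unique_tokens(text):
    token_ids = encoding.encode(text)
    # Decode each ID so the counts are keyed by readable text.
    tokens = [encoding.decode([token_id]) for token_id in token_ids]
    return Counter(tokens)

text = "This is a sample sentence. This is another sample sentence."
unique_tokens = count_unique_tokens(text)
print(len(unique_tokens))
print(unique_tokens)

Output:

8
Counter({' is': 2, ' sample': 2, ' sentence': 2, '.': 2, 'This': 1, ' a': 1, ' This': 1, ' another': 1})

The count_unique_tokens function tokenizes the input text and uses Python's Counter class to count how often each token occurs. 'This' and ' This' are counted separately because the second one includes its leading space. For text with non-ASCII characters, counting the integer IDs directly is safer, since a single token can hold only part of a multi-byte character.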
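Once you have a Counter, a few summary numbers are often more useful than the raw mapping. The following is a minimal sketch using only the standard library; the helper name summarize_token_counts and the hand-written example counts are illustrative assumptions, not part of tiktoken.

```python
from collections import Counter

def summarize_token_counts(token_counts, top_n=3):
    """Return basic statistics for a Counter of tokens (hypothetical helper)."""
    total = sum(token_counts.values())   # all token occurrences
    distinct = len(token_counts)         # number of unique tokens
    return {
        "distinct": distinct,
        "total": total,
        # Share of occurrences that are unique tokens; 0.0 for empty input.
        "type_token_ratio": distinct / total if total else 0.0,
        "most_common": token_counts.most_common(top_n),
    }

# Hand-written counts for illustration only, not real tokenizer output.
example_counts = Counter(
    {"This": 2, "is": 2, "a": 1, "sample": 2, "sentence": 2, ".": 2, "another": 1}
)
summary = summarize_token_counts(example_counts)
print(summary["distinct"], summary["total"])
```

Any Counter works as input, including the one returned by count_unique_tokens.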

Counting Unique Tokens in a File

To count unique tokens in a file, you can read the file content and pass it to the count_unique_tokens function. Here's an example:

def count_unique_tokens_in_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return count_unique_tokens(text)

file_path = 'sample.txt'
unique_tokens = count_unique_tokens_in_file(file_path)
print(unique_tokens)

Replace 'sample.txt' with the path to your text file.
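For files too large to read into memory comfortably, one option is to update a Counter line by line instead of calling file.read(). The sketch below assumes encode is any callable that turns a string into a list of tokens; the function name count_tokens_streaming is hypothetical, and the demo uses str.split as a stand-in encoder so that it runs without tiktoken installed.

```python
import os
import tempfile
from collections import Counter

def count_tokens_streaming(file_path, encode):
    """Count tokens one line at a time; `encode` maps a string to a list of tokens."""
    counts = Counter()
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Only the current line is held in memory.
            counts.update(encode(line))
    return counts

# Demo with a stand-in encoder (whitespace splitting), not a real tokenizer.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("a b a\nb c\n")
    demo_counts = count_tokens_streaming(demo_path, str.split)

print(len(demo_counts))
```

With tiktoken installed, passing tiktoken.get_encoding("cl100k_base").encode as the encode argument counts token IDs instead. Because each line is encoded separately, the counts can differ slightly from encoding the whole file at once, for example where a token would otherwise span several consecutive newlines.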

Conclusion

In this article, we learned how to count unique tokens in strings and text files using the tiktoken library. tiktoken is a fast BPE tokenizer built for OpenAI models, so its counts reflect how those models split text into subword tokens rather than how many words the text contains. Combined with Python's Counter class, it lets you count tokens in large text corpora efficiently.
