Get the number of tokens
The encoding cl100k_base is used by gpt-4, gpt-3.5-turbo, and text-embedding-ada-002.
This is how you encode a string into tokens:
Python
import tiktoken
"""
| encoding name | models
| -- | -- |
| cl100k_base | gpt-4, gpt-3.5-turbo, text-embedding-ada-002 |
"""
encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("tiktoken is great!")
# [83, 1609, 5963, 374, 2294, 0]
len(encoding.encode("tiktoken is great!"))
# 6
You can abstract this into a separate function. Source: openai-cookbook (GitHub)
Python
def num_tokens_from_string(string: str, encoding_name: str) -> int:
"""Returns the number of tokens in a text string."""
encoding = tiktoken.get_encoding(encoding_name)
num_tokens = len(encoding.encode(string))
return num_tokens
num_tokens_from_string("tiktoken is great!", "cl100k_base")
# 6
OpenAI Pricing
As of November 22, 2023
Python
def get_pricing_for_tokens(number_of_tokens: int, per_1000_rate: float) -> float:
    return (number_of_tokens / 1000) * per_1000_rate
def get_pricing_from_string(string: str, encoding_name: str) -> float:
    num_tokens = num_tokens_from_string(string, encoding_name)
    # rate in dollars per 1,000 tokens; choose based on the model's pricing
    rate = 0.0010  # gpt-3.5-turbo-1106 input
    pricing = get_pricing_for_tokens(num_tokens, rate)
    return pricing
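As a quick sanity check of the arithmetic, the 6-token example string at the $0.0010-per-1K rate above (repeating the helper so the snippet runs standalone):

```python
def get_pricing_for_tokens(number_of_tokens: int, per_1000_rate: float) -> float:
    return (number_of_tokens / 1000) * per_1000_rate

# 6 tokens at $0.0010 per 1,000 tokens
cost = get_pricing_for_tokens(6, 0.0010)
# 6 / 1000 * 0.0010 = 6e-06 dollars
```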
Splitting into Chunks to Manage Context Window Limits
gpt-3.5-turbo-1106 has a context window of 16,385 tokens (OpenAI)
So what if you want to index a document larger than that? You need to create chunks.
A token is roughly 4 characters of English text on average, so 3,000 tokens ~= 12,000 characters
Python
def split_string(text: str, segment_length: int, overlap_length: int) -> list[str]:
    """Split text into overlapping segments of segment_length words."""
    # Validate the input before doing any work; segment_length must be
    # strictly greater than overlap_length, or the loop below never advances
    if segment_length <= 0 or overlap_length < 0 or segment_length <= overlap_length:
        raise ValueError("Invalid values for segment_length and overlap_length")
    # Split the text into words naively on a space
    words = text.split(" ")
    # Split the words into segments
    segments = []
    i = 0
    while i < len(words):
        segments.append(' '.join(words[i:i + segment_length]))
        # Move to the next segment, stepping back by the overlap
        i += segment_length - overlap_length
    return segments