Get the number of tokens
The encoding cl100k_base is used by gpt-4, gpt-3.5-turbo, and text-embedding-ada-002.
This is how you encode a string into tokens:
Python
import tiktoken
"""
| encoding name | models
| -- | -- |
| cl100k_base | gpt-4, gpt-3.5-turbo, text-embedding-ada-002 |
"""
encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("tiktoken is great!")
# [83, 1609, 5963, 374, 2294, 0]
len(encoding.encode("tiktoken is great!"))
# 6
You can abstract this into a separate function. Source: openai-cookbook (GitHub)
Python
def num_tokens_from_string(string: str, encoding_name: str) -> int:
"""Returns the number of tokens in a text string."""
encoding = tiktoken.get_encoding(encoding_name)
num_tokens = len(encoding.encode(string))
return num_tokens
num_tokens_from_string("tiktoken is great!", "cl100k_base")
# 6
OpenAI Pricing
As of November 22, 2023
Python
def get_pricing_for_tokens(number_of_tokens: int, per_1000_rate: float) -> float:
    return (number_of_tokens / 1000) * per_1000_rate
def get_pricing_from_string(string: str, encoding_name: str) -> float:
    num_tokens = num_tokens_from_string(string, encoding_name)
    # rate in dollars per 1,000 tokens; choose based on the model's pricing
    rate = 0.0010  # gpt-3.5-turbo-1106 input
    pricing = get_pricing_for_tokens(num_tokens, rate)
    return pricing
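As a quick sanity check of the arithmetic, the 6-token example string at the $0.0010-per-1K rate above (repeating the helper so the snippet runs standalone):

```python
def get_pricing_for_tokens(number_of_tokens: int, per_1000_rate: float) -> float:
    return (number_of_tokens / 1000) * per_1000_rate

# 6 tokens at $0.0010 per 1,000 tokens
cost = get_pricing_for_tokens(6, 0.0010)
# 6 / 1000 * 0.0010 = 6e-06 dollars
```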
Splitting into Chunks to Manage Context Window Limits
gpt-3.5-turbo-1106 has a context window of 16,385 tokens (OpenAI)
So what if you want to index a document larger than that? You need to create chunks.
A token is roughly 4 characters of English text on average, so 3,000 tokens ~= 12,000 characters
Python
def split_string(text: str, segment_length: int, overlap_length: int) -> list[str]:
    """Split text into overlapping segments of segment_length words."""
    # Validate the input before doing any work; segment_length must be
    # strictly greater than overlap_length, or the loop below never advances
    if segment_length <= 0 or overlap_length < 0 or segment_length <= overlap_length:
        raise ValueError("Invalid values for segment_length and overlap_length")
    # Split the text into words naively on a space
    words = text.split(" ")
    # Split the words into segments
    segments = []
    i = 0
    while i < len(words):
        segments.append(' '.join(words[i:i + segment_length]))
        # Move to the next segment, stepping back by the overlap
        i += segment_length - overlap_length
    return segments