Model File Formats: GGUF vs HF Metadata and Token Dictionaries

Introduction

Machine learning models can be saved in different formats depending on the use case. Hugging Face models are typically saved as a set of files (e.g. pytorch_model.bin or .safetensors for weights, plus config.json and tokenizer files) that together define the model’s architecture and vocabulary (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev). In contrast, GGUF (GPT-Generated Unified Format) is a single-file binary format introduced in 2023 as an evolution of GGML. GGUF files bundle model weights and metadata (like architecture details and vocabulary) into one container. This research compares the two formats with a focus on extracting key information such as metadata, token dictionaries, model architecture details, and training parameters. We provide Python code examples using common libraries to demonstrate how to read these details from both GGUF and Hugging Face model files.

Comparison at a Glance: The table below summarizes key differences between GGUF and the standard Hugging Face format:

Feature | GGUF (Unified File) | Hugging Face Format
File Structure | Single binary file (.gguf) containing all data (Understanding Hugging Face Model File Formats, GGML, and GGUF!) | Multiple files: weights (pytorch_model.bin or .safetensors) plus config.json and tokenizer files
Metadata (Architecture, etc.) | Embedded key-value metadata (architecture, hyperparams, etc.) (Understanding Hugging Face Model File Formats, GGML, and GGUF!) | Stored in a separate config.json
Token Vocabulary | Typically embedded in-file (e.g. tokenizer.ggml.tokens list) (GGUF and interaction with Transformers) (gguf-parser · PyPI) | Provided via separate tokenizer files (e.g. tokenizer.json, vocab.txt, merges.txt) (Understanding Hugging Face Model File Formats, GGML, and GGUF!)
Number of Parameters | Not explicitly stored (can be derived from tensor shapes) | Not explicitly stored (compute from model state or use model.num_parameters())
Optimizer/Training Info | Not typically included (inference-focused format) | Usually saved separately if at all (e.g. training_args.bin, optimizer.pt in checkpoints)
Primary Use | Efficient inference (often with quantization, e.g. llama.cpp) (GGUF and interaction with Transformers) | Training and inference in frameworks (PyTorch/TF), flexible for fine-tuning (Understanding Hugging Face Model File Formats, GGML, and GGUF!)

Next, we delve into how to extract metadata, vocabularies, and model information from each format with code examples.

Reading Model Metadata

Both GGUF and Hugging Face formats contain metadata about the model, but they store it differently. GGUF files include a rich set of key-value metadata within the binary file itself (e.g. model architecture, context length, etc.) (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev). Hugging Face models store metadata in a separate JSON config. Below, we show how to read these metadata for each format.

GGUF: Extracting Metadata

GGUF’s internal metadata covers general information (architecture type, model name, etc.) and architecture-specific hyperparameters. For example, a LLaMA model converted to GGUF might have metadata keys like general.architecture: "llama", llama.block_count: 80 (number of transformer blocks), llama.attention.head_count: 64, etc., embedded in the file (gguf-parser · PyPI). To read this in Python, we can use a GGUF parsing library. One convenient library is gguf-parser (Python) which can load a GGUF file and expose its metadata.

Below is a code example using gguf-parser to load a GGUF file and access some metadata fields:

from gguf_parser import GGUFParser

# Initialize parser for the GGUF file
parser = GGUFParser("path/to/model.gguf")
parser.parse()  # Parse the file structure and metadata

# Retrieve general metadata fields
arch = None
model_name = None
if hasattr(parser, "metadata"):
    # In gguf-parser, parsed metadata may be available as an attribute or dict
    meta = parser.metadata  # Assume parser.metadata returns a dict of key-values
    arch = meta.get("general.architecture")
    model_name = meta.get("general.name")
else:
    # Fallback: use parser.print() to display or manually parse if needed
    parser.print()  # This will print all metadata and tensor info to console
    # (In practice, gguf-parser's parser.print() already shows metadata fields)
    
print(f"Model architecture: {arch}")        # e.g., "llama"
print(f"Model name (if any): {model_name}")  # e.g., "Llama-2-13B-chat-hf")

Comments:

  • We use GGUFParser.parse() to read the .gguf file structure. The parser then holds metadata in a structure (for example, parser.metadata might be a dictionary of all metadata fields).
  • We attempt to fetch general.architecture and general.name from the metadata, which indicate the model's architecture type and name respectively (gguf-parser · PyPI). For instance, general.architecture could be "llama" or "gptneox", etc., and general.name might be a human-friendly model name if provided.
  • If the library doesn’t directly expose a metadata dict, we could alternatively use parser.print() to output all metadata. (In this example, we assume parser.metadata exists for clarity. Depending on the library version, you might need to inspect parser internals or use provided methods to get metadata fields.)

After parsing, the GGUF parser provides a rich set of metadata. For example, printing the metadata of a LLaMA GGUF might show:

general.architecture: llama  
general.name: Llama2-13B-chat  
llama.context_length: 4096  
llama.embedding_length: 5120  
llama.block_count: 80  
llama.attention.head_count: 64  
... (other architecture-specific keys) ...  
tokenizer.ggml.tokens: [...]  (list of vocabulary tokens)  

This confirms that the GGUF file contains all necessary config info (context length, layer counts, etc.) inside the file (gguf-parser · PyPI). You can access these programmatically via the parser.
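
If you prefer the official gguf Python package (maintained alongside llama.cpp), a similar read is possible with its GGUFReader class. The sketch below assumes pip install gguf; the way string values are decoded from ReaderField parts may differ between package versions, so treat it as illustrative rather than definitive.

from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")

# List every metadata key stored in the file
for key in reader.fields:
    print(key)

# Decode one string-valued field, e.g. general.architecture
field = reader.get_field("general.architecture")
if field is not None:
    # field.data holds the index of the part containing the raw UTF-8 bytes
    raw = field.parts[field.data[0]]
    print("Architecture:", raw.tobytes().decode("utf-8"))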

Hugging Face: Extracting Metadata

Hugging Face model repositories save model configuration metadata in a separate JSON (commonly config.json). This config file defines the architecture parameters such as number of layers, hidden size, number of attention heads, activation function, etc. (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev). Instead of manually reading JSON, the Hugging Face Transformers library provides convenient classes. We can load the config or the model, and then inspect its config attribute.

Below is a code example using Transformers library to load a model’s metadata:

from transformers import AutoConfig, AutoModel

model_name = "bert-base-uncased"  # example model
# Load the configuration without weights
config = AutoConfig.from_pretrained(model_name)

# Access some key metadata fields from the config
architecture = config.model_type           # e.g., "bert"
num_layers = config.num_hidden_layers      # e.g., 12 for BERT-base
hidden_size = config.hidden_size           # e.g., 768 for BERT-base
activation = config.hidden_act             # e.g., "gelu"

print(f"Architecture: {architecture}")
print(f"Number of layers: {num_layers}")
print(f"Hidden size (embedding): {hidden_size}")
print(f"Activation function: {activation}")

Comments:

  • We use AutoConfig.from_pretrained to fetch the config.json for the given model. This returns a PretrainedConfig object (in this case a BertConfig since the model is BERT) populated with architecture details.
  • We then retrieve fields: model_type (architecture name), num_hidden_layers, hidden_size, and hidden_act (the activation function used in the layers). Most Transformer configs provide these or similarly named attributes. For example, BERT’s config might specify hidden_act="gelu" and num_hidden_layers=12 (HuggingFace Config Params Explained).
  • We could also load the model directly with AutoModel.from_pretrained and then do model.config to get the same config object. Loading just the config (without the large weight file) is more efficient when we only need metadata.

Running this code for bert-base-uncased would output something like:

Architecture: bert  
Number of layers: 12  
Hidden size (embedding): 768  
Activation function: gelu

These values match the entries in the model’s config.json (e.g., BERT-base’s config has 12 layers, hidden size 768, and uses GELU activation (HuggingFace Config Params Explained)). Hugging Face’s config covers similar info to GGUF metadata: e.g., vocab_size, num_attention_heads, dropout rates, etc., are all available via config attributes (HuggingFace Config Params Explained).

Note: The Hugging Face config does not directly store the number of parameters in the model – that must be computed from the weights. We will address that in a later section.

Extracting Token Dictionaries (Vocabulary)

A crucial part of any language model is the tokenizer vocabulary – the mapping of tokens (subword pieces or words) to token IDs. Both GGUF and Hugging Face formats include this information, but again in different ways:

  • GGUF: Can embed the entire token dictionary inside the .gguf file as part of the metadata (under keys like tokenizer.ggml.tokens, tokenizer.ggml.scores, etc.) (huggingface.co) (huggingface.co). This makes the GGUF self-contained for inference since it knows how to tokenize inputs.
  • Hugging Face: Provides vocabulary files separately (e.g. vocab.txt for WordPiece, or tokenizer.json and merges.txt for BPE). When using the Transformers library, the vocabulary is loaded via AutoTokenizer.

We will demonstrate how to retrieve the vocabulary from each format.

GGUF: Reading the Vocabulary

If a GGUF model has an embedded vocabulary, the parser will expose it. For instance, a GPT-2 style model in GGUF might have tokenizer.ggml.tokens as an array of strings (the tokens), and possibly tokenizer.ggml.merges for BPE merges if applicable (huggingface.co) (huggingface.co). Let’s extend our GGUF parsing example to extract the token list:

# Continuing from the earlier GGUFParser usage
# After parser.parse()
vocab = None
if hasattr(parser, "metadata"):
    meta = parser.metadata
    vocab = meta.get("tokenizer.ggml.tokens")
    
if vocab is not None:
    print(f"Vocabulary size (GGUF): {len(vocab)} tokens")
    # Print first 10 tokens as sample
    print("Sample tokens:", vocab[:10])

Comments:

  • We assume the parser.metadata dict contains the key tokenizer.ggml.tokens, which holds a list of all tokens in order of their token IDs (gguf-parser · PyPI). For a LLaMA-family model this list has one entry per vocabulary token (e.g., 32,000 for LLaMA 2 or 128,256 for Llama 3) and includes tokens like "!", "\"", etc.
  • We print the length of this list to get the vocabulary size. This should match the model’s vocab_size (which is also likely present as a metadata field, e.g., llama.vocab_size) (gguf-parser · PyPI).
  • We also print a sample of the first 10 tokens to verify content. (Typically the first few tokens might be special tokens or punctuation.)

If the tokens are stored differently (for example, some GGUF files might store a raw tokenizer JSON under tokenizer.huggingface.json (huggingface.co)), additional parsing would be needed. But commonly, for LLMs, the simple list of tokens is available. The gguf-parser library we use should have already decoded the byte arrays to Python strings for us. If using the official gguf library GGUFReader, one would retrieve the ReaderField for tokenizer.ggml.tokens and convert bytes to strings manually.
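
For completeness, here is a minimal sketch of that manual route using the official gguf package's GGUFReader (assuming pip install gguf; the exact layout of ReaderField.parts can vary across package versions):

from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")
field = reader.get_field("tokenizer.ggml.tokens")
if field is not None:
    # For a string-array field, each index in field.data points to the raw
    # UTF-8 bytes of one token stored in field.parts
    tokens = [field.parts[i].tobytes().decode("utf-8") for i in field.data]
    print(f"Vocabulary size (GGUF): {len(tokens)} tokens")
    print("Sample tokens:", tokens[:10])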

Hugging Face: Reading the Vocabulary

In the Hugging Face format, the vocabulary is handled by the tokenizer. To get the token dictionary (i.e., token-to-index mapping), we use the AutoTokenizer class. The tokenizer will load from the files (tokenizer.json, vocab.txt, etc.) provided in the model repository. Most tokenizers have a method get_vocab() that returns a dict of token -> ID mappings.

Here’s how we can load a tokenizer and extract its vocabulary in Python:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab_dict = tokenizer.get_vocab()

print(f"Vocabulary size (HF): {len(vocab_dict)} tokens")
# Print a few sample tokens and their IDs
for token, idx in list(vocab_dict.items())[:10]:
    print(f"{token!r}: {idx}")

Comments:

  • AutoTokenizer.from_pretrained(model_name) will automatically download or load the tokenizer files for the given model. For example, for GPT-2 it will load vocab.json and merges.txt; for BERT it will load vocab.txt; for LLaMA it might load a SentencePiece model file. The AutoTokenizer handles the details internally.
  • We then call tokenizer.get_vocab(). This returns a dictionary where keys are token strings and values are their integer IDs (Understanding the Llama2 Tokenizer: Working with the ... - Medium). (A few tokenizer classes may not implement get_vocab(); in that case you can fall back to tokenizer.vocab for WordPiece tokenizers or inspect the tokenizer files directly. But get_vocab() works for most standard cases.)
  • We print the total len(vocab_dict) to get the vocabulary size. This should correspond to the vocab_size field in the model config. For instance, bert-base-uncased has a vocab size of 30522 (BERT - Hugging Face), and indeed the tokenizer’s vocab dict will have 30522 entries.
  • Finally, we iterate over a slice of the vocabulary dictionary to print sample tokens. The !r in the print format ensures special characters are visible (for example, it might show tokens like '[CLS]' or whitespace tokens).

For a BPE-based tokenizer like GPT-2, the vocabulary might include byte-level tokens. For example, running the above on "gpt2" yields tokens like '!': 0, '"': 1, '#': 2, ... etc., mapping punctuation and letters to IDs. The merges (BPE rules) are not directly shown in get_vocab() but are used internally by the tokenizer. In contrast, the GGUF format might store those merges explicitly under a metadata key like tokenizer.ggml.merges (as a list of merge rule strings) (huggingface.co).
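
As a quick sanity check on the mapping, you can round-trip a string through the tokenizer and inspect the IDs and tokens it produces (GPT-2 shown here; the exact IDs depend on the model):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Hello world")
print(ids)                             # e.g., [15496, 995]
print(tok.convert_ids_to_tokens(ids))  # e.g., ['Hello', 'Ġworld'] – 'Ġ' marks a leading space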

In summary, obtaining the token dictionary in Hugging Face is straightforward via the tokenizer API, whereas in GGUF you extract it from the file’s metadata. In both cases, you can get the vocabulary size and contents for further analysis or verification.

Model Architecture Details and Number of Parameters

Understanding a model’s architecture (layer structure, activation functions, etc.) and its scale (number of parameters) is often necessary for research and deployment. We’ve seen how architecture hyperparameters (layers, heads, etc.) appear in metadata/config. Here we demonstrate retrieving a bit more detail, including calculating the number of parameters in the model and identifying layer structures, for both formats.

GGUF: Architecture and Parameters

The GGUF metadata provides the fundamental architecture parameters as keys (especially under the [architecture].* namespace, e.g. llama. for LLaMA models) (huggingface.co). This includes things like number of layers (block_count), hidden dimension (embedding_length), intermediate feed-forward size (feed_forward_length), number of attention heads (head_count), type of rotary embedding (rope.* keys), etc. Using these, one can infer the layer structure (e.g., an LLM with N blocks, each containing self-attention and feed-forward sublayers). Activation function information is sometimes implicit (for instance, many models use GELU or ReLU by convention; some configs might include an explicit key if non-standard).
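
Continuing with the parser.metadata dict assumed earlier, these architecture keys can be read directly (a sketch; key names follow the "<architecture>.<parameter>" convention described above):

meta = parser.metadata
arch = meta.get("general.architecture", "llama")

print("layers:         ", meta.get(f"{arch}.block_count"))
print("hidden size:    ", meta.get(f"{arch}.embedding_length"))
print("FFN size:       ", meta.get(f"{arch}.feed_forward_length"))
print("attention heads:", meta.get(f"{arch}.attention.head_count"))
print("context length: ", meta.get(f"{arch}.context_length"))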

To get the total number of parameters from a GGUF file, we usually need to sum up the sizes of all tensors. Each tensor’s shape is available via the GGUF parser’s tensor list. We can iterate through parser.tensors (or use the official GGUFReader.tensors) and sum up the product of each tensor’s dimensions. Below is a code example to compute the parameter count and list some layer weights:

# Assuming parser.parse() was done earlier
total_params = 0
for tensor in parser.tensors:  # each tensor has .shape attribute (e.g., [768, 3072])
    # Calculate number of elements in this tensor
    count = 1
    for dim in tensor.shape:
        count *= dim
    total_params += count

print(f"Total model parameters (approx): {total_params}")
# For example, this prints roughly 6.7 billion for a nominal "7B" LLaMA model (the advertised size is rounded)

# You can also examine specific tensor names to infer architecture details:
for t in parser.tensors[:5]:
    print(f"Tensor: {t.name}, shape: {tuple(t.shape)}, type: {t.tensor_type}")

Comments:

  • parser.tensors is assumed to be a list of tensor metadata objects, each with attributes like name (the layer weight name, e.g., "transformer.h.0.attn.q_proj.weight" in a GPT model or similar), shape (dimensions of the tensor), and possibly tensor_type (data type or quantization type, e.g., GGML_TYPE_Q4_K for a quantized 4-bit weight) (gguf-parser · PyPI). This is based on the gguf-parser which prints tensor info after metadata.
  • We loop through each tensor to accumulate total_params. The product of dimensions gives the number of elements (parameters) in that weight, and summing these yields the total parameter count. Note this is a raw count of every weight tensor – embeddings, attention projections, feed-forward matrices, and any output head – without distinguishing encoder from decoder components.
  • We print out a few tensor names and shapes to see the layer structure. The naming conventions in GGUF (inherited from the original model) often reveal the model architecture. For example, you might see names like layers.0.attention.wq.weight or decoder.block.0.ffn.weight which tell you the model has a layered structure and what each tensor represents.

Using this method on a known model should match the expected parameter count. For instance, if you sum parameters of a LLaMA-7B GGUF, you should get roughly 6.7 billion, aligning with the model’s advertised size. Keep in mind quantized GGUF models store weights in compressed form, but the count of logical parameters remains the same (just stored with fewer bits).

Also, if the GGUF is a sharded model (split into multiple files), you would need to parse each shard to get the full parameter count. GGUF supports sharding, indicated by filename and metadata, but each file would be parsed similarly (huggingface.co).
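
If you are using the official gguf package instead of gguf-parser, the same count can be obtained from GGUFReader, whose tensor entries expose an n_elements field (a sketch; for a sharded model, repeat per shard and sum the results):

from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")
total_params = sum(int(t.n_elements) for t in reader.tensors)
print(f"Total model parameters (approx): {total_params:,}")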

Hugging Face: Architecture and Parameters

For Hugging Face models, the architecture details are readily available via the config as shown before. To reiterate, typical fields include:

  • num_hidden_layers – number of transformer layers
  • hidden_size – dimensionality of the hidden states / embeddings
  • num_attention_heads – attention heads per layer
  • intermediate_size – inner dimension of the feed-forward network
  • hidden_act – activation function used in the feed-forward layers
  • vocab_size – size of the tokenizer vocabulary

These allow us to understand the layer structure. For example, a BERT config might say 12 layers, 12 heads, hidden_size 768, intermediate_size 3072, activation "gelu" (HuggingFace Config Params Explained), meaning each of the 12 layers has a self-attention block with 12 heads and a two-layer feed-forward network with GELU nonlinearity.

To get the total parameter count in Hugging Face, we can load the model weights and use the .num_parameters() method provided by the model (or sum manually). Here’s an example:

model = AutoModel.from_pretrained(model_name)
param_count = model.num_parameters()
print(f"Total model parameters: {param_count}")

If we use bert-base-uncased as model_name, this should output about 110 million parameters (which is known for BERT-base). Indeed, Hugging Face’s documentation demonstrates this: DistilBERT has ~67M and BERT-base has ~110M parameters (Fine-tuning a masked language model - Hugging Face NLP Course). The num_parameters() function conveniently includes all model parameters by default (How to get model size? - Hugging Face Forums).

Comments:

  • Loading the full model can be memory heavy for very large models. If you only need the count, you might avoid loading optimizer states or unnecessary components. However, .num_parameters() is straightforward and widely used for Hugging Face models.
  • Under the hood, model.num_parameters() iterates through model.parameters() and sums their .numel(). You could replicate this manually: sum(p.numel() for p in model.parameters()) gives the same result.
  • The param_count includes all weights (embeddings, transformer layers, output heads, etc.). If you want only trainable params or a subset, you could filter by p.requires_grad or layer name. By default, Hugging Face models mark all model weights as trainable (unless you freeze some layers).

Using .num_parameters() on a known model provides a sanity check against the config. For example, if a config says 12 layers, 768 hidden size, etc., one can roughly estimate parameters (there are formulae for Transformers) and the actual count will align. For BERT-base: 110M parameters as expected (Fine-tuning a masked language model - Hugging Face NLP Course).
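
For intuition, here is a rough back-of-the-envelope estimate for BERT-base computed from config values alone; it ignores biases, LayerNorm weights, and the pooler, so it slightly undercounts:

V, P, T, H, I, L = 30522, 512, 2, 768, 3072, 12   # vocab, positions, token types, hidden, intermediate, layers
embeddings = (V + P + T) * H                       # token + position + token-type embedding matrices
per_layer  = 4 * H * H + 2 * H * I                 # Q/K/V/output projections + two feed-forward matrices
estimate   = embeddings + L * per_layer
print(f"~{estimate/1e6:.0f}M parameters")          # ~109M, close to the reported ~110M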

Activation Functions and Layer Details: The config’s hidden_act tells us the activation in the feed-forward layers (e.g., GELU for BERT, sometimes config.activation_function for GPT-2 which might be "gelu_new"). Also, layer_norm_eps in config gives the epsilon used in layer normalization layers, etc. All such details are accessible via config. So, for Hugging Face models, reading the config is usually enough to know the architectural hyperparameters. In GGUF, these details are in metadata keys (for example, llama.attention.layer_norm_rms_epsilon might be a key for LLaMA’s RMSNorm epsilon (huggingface.co)).
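
These attribute names vary slightly between architectures, so a tolerant lookup is handy (a sketch using GPT-2 as the example model):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("gpt2")
act = getattr(cfg, "activation_function", getattr(cfg, "hidden_act", None))
eps = getattr(cfg, "layer_norm_epsilon", getattr(cfg, "layer_norm_eps", None))
print(f"Activation: {act}, LayerNorm epsilon: {eps}")   # e.g., "gelu_new", 1e-05 for GPT-2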

Optimizer and Training Configuration Data

Finally, we consider optimizer states and training configuration. These are generally not part of a model’s core serialization for inference, but when saving checkpoints during training, one might save optimizer momentum, learning rate schedules, or training hyperparameters. The availability of this data depends on how the model was saved:

  • GGUF: Since GGUF is designed for inference, it typically does not include optimizer states or training arguments. It contains only what is needed to load and run the model, plus optional descriptive metadata about how it was trained (for example, the original training context length or the dataset, if provided in the description). There are no standard keys in GGUF for optimizer states. (At most, you might see a general.training_* field if the converter recorded something, but this is not common.) So for GGUF, parsing optimizer info is usually not applicable; you would refer back to the original training logs or model card for those details.

  • Hugging Face: If a model checkpoint is saved via the Trainer API or similar, you may have auxiliary files like optimizer.pt, scheduler.pt, and training_args.bin in the output directory. These contain the optimizer’s internal state (e.g., Adam moments), the learning rate scheduler state, and the training arguments (hyperparameters used for training). They are not loaded by from_pretrained by default, but you can load them manually with PyTorch.

Parsing training configuration (Hyperparameters): Often, training_args.bin (or .json) holds the training hyperparameters (like learning rate, epochs, batch size, etc.). In many cases this is a binary pickled TrainingArguments object. You can load it with torch.load:

import torch

# Load training arguments (if available)
train_args = torch.load("path/to/training_args.bin", map_location="cpu")
print("Training arguments:", train_args)
# This might print a dataclass TrainingArguments with fields like learning_rate, num_train_epochs, etc.

# Load optimizer state dict (if available)
opt_state = torch.load("path/to/optimizer.pt", map_location="cpu")
# For example, opt_state might be a dict with keys 'state' (per-parameter states) and 'param_groups'
print("Optimizer state keys:", opt_state.keys())
if "param_groups" in opt_state:
    lr = opt_state["param_groups"][0].get('lr', None)
    print(f"Optimizer learning rate: {lr}")

Comments:

  • We use torch.load to deserialize the objects. This requires that the environment has the same class definitions (for TrainingArguments) if the object is not a plain dictionary. In practice, training_args.bin is often just a pickled TrainingArguments (which is a simple dataclass from Hugging Face). After loading, printing it will show something like:
    TrainingArguments(output_dir='...', num_train_epochs=3, learning_rate=5e-5, per_device_train_batch_size=8, ... )
    giving all the training hyperparameters.
  • The optimizer state (optimizer.pt) when loaded is usually a state dict (a Python dict). Typically it has two main entries: "state" (a dict of parameter-specific states like momentum vectors) and "param_groups" (which contains the hyperparameters for the optimizer, like the learning rate for each group) – this is how PyTorch saves optimizer.state_dict(). We print the keys to confirm and then, if available, extract the learning rate from the first param group as an example.
  • Note that these files are optional. If the model was not saved mid-training or the uploader only provided the final weights, you may not have any training_args.bin or optimizer.pt. On the Hugging Face Hub, usually only the model weights and config are uploaded, not the optimizer. The training arguments might be documented in the model card instead.

By loading these, you can programmatically inspect how the model was trained. For instance, verifying the learning rate or number of epochs can be done via the TrainingArguments object. This information is external to the model’s architecture but important for reproducibility.
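
For example, once train_args has been loaded as above, individual hyperparameters are plain attributes of the TrainingArguments object:

print(train_args.learning_rate)     # e.g., 5e-05
print(train_args.num_train_epochs)  # e.g., 3.0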

In summary, GGUF files focus on model inference data and omit optimizer/training specifics, whereas Hugging Face training checkpoints can include optimizer states and training configs as separate files. We demonstrated how to load those with PyTorch for completeness.

Performance Considerations (Brief)

GGUF models are optimized for efficient inference, often using quantized weights to reduce size, which leads to faster loading and lower memory usage for CPU-bound deployments (GGUF and interaction with Transformers). The single-file design and memory-mappable format means startup is quick and all necessary data (weights + vocab + config) is loaded in one go. However, using GGUF in frameworks like PyTorch may require converting back to full precision, incurring some overhead (GGUF and interaction with Transformers). Hugging Face format models (in PyTorch or TensorFlow) are typically in higher precision and may leverage GPUs for faster computation, which can give better throughput on supported hardware. In practice, GGUF (with llama.cpp or similar executors) can deliver excellent CPU inference performance with minimal resources, while Hugging Face models excel in flexibility (fine-tuning, GPU acceleration) at the cost of larger memory footprints. Ultimately, the choice of format can affect load time and inference speed: GGUF offers a portable, compact model for deployment, whereas the standard Hugging Face format integrates seamlessly with training pipelines and broad hardware acceleration, making each format preferable for different stages of the model lifecycle (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev) (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev).

Prepared with OpenAI o1-pro & deep-research