Model File Formats: GGUF vs HF Metadata and Token Dictionaries
Introduction
Machine learning models can be saved in different formats depending on the use case. Hugging Face models are typically saved as a set of files (e.g. pytorch_model.bin or .safetensors for the weights, plus config.json and tokenizer files) that together define the model’s architecture and vocabulary (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev). In contrast, GGUF (GPT-Generated Unified Format) is a single-file binary format introduced in 2023 as an evolution of GGML. GGUF files bundle model weights and metadata (such as architecture details and vocabulary) into one container. This research compares the two formats with a focus on extracting key information: metadata, token dictionaries, model architecture details, and training parameters. We provide Python code examples using common libraries to demonstrate how to read these details from both GGUF and Hugging Face model files.
Comparison at a Glance: The table below summarizes key differences between GGUF and the standard Hugging Face format:
Feature | GGUF (Unified File) | Hugging Face Format |
---|---|---|
File Structure | Single binary file (.gguf) containing all data | Multiple files: weights (pytorch_model.bin or .safetensors) plus config.json and tokenizer files |
Metadata (Architecture, etc.) | Embedded key-value metadata (architecture, hyperparameters, etc.) | Stored in a separate config.json |
Token Vocabulary | Typically embedded in-file (e.g. the tokenizer.ggml.tokens list) (GGUF and interaction with Transformers) (gguf-parser · PyPI) | Provided via separate tokenizer files (e.g. tokenizer.json, vocab.txt, merges.txt) |
Number of Parameters | Not explicitly stored (can be derived from tensor shapes) | Not explicitly stored (compute from the model state or use model.num_parameters()) |
Optimizer/Training Info | Not typically included (inference-focused format) | Saved separately if at all (e.g. training_args.bin, optimizer.pt in checkpoints) |
Primary Use | Efficient inference (often with quantization, e.g. llama.cpp) (GGUF and interaction with Transformers) | Training and inference in frameworks (PyTorch/TF), flexible for fine-tuning |
Next, we delve into how to extract metadata, vocabularies, and model information from each format with code examples.
Reading Model Metadata
Both GGUF and Hugging Face formats contain metadata about the model, but they store it differently. GGUF files include a rich set of key-value metadata within the binary file itself (e.g. model architecture, context length) (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev). Hugging Face models store metadata in a separate JSON config. Below, we show how to read this metadata for each format.
GGUF: Extracting Metadata
GGUF’s internal metadata covers general information (architecture type, model name, etc.) and architecture-specific hyperparameters. For example, a LLaMA model converted to GGUF might have metadata keys like general.architecture: "llama", llama.block_count: 80 (the number of transformer blocks), and llama.attention.head_count: 64 embedded in the file (gguf-parser · PyPI). To read this in Python, we can use a GGUF parsing library; one convenient option is gguf-parser, which can load a GGUF file and expose its metadata.
Below is a code example using gguf-parser to load a GGUF file and access some metadata fields:
from gguf_parser import GGUFParser
# Initialize parser for the GGUF file
parser = GGUFParser("path/to/model.gguf")
parser.parse() # Parse the file structure and metadata
# Retrieve general metadata fields
arch = None
model_name = None
if hasattr(parser, "metadata"):
    # In gguf-parser, parsed metadata may be available as an attribute or dict
    meta = parser.metadata  # Assume parser.metadata returns a dict of key-values
    arch = meta.get("general.architecture")
    model_name = meta.get("general.name")
else:
    # Fallback: use parser.print() to display or manually parse if needed
    parser.print()  # This will print all metadata and tensor info to the console
    # (In practice, gguf-parser's parser.print() already shows the metadata fields)
print(f"Model architecture: {arch}")        # e.g., "llama"
print(f"Model name (if any): {model_name}")  # e.g., "Llama-2-13B-chat-hf"
Comments:
- We use GGUFParser.parse() to read the .gguf file structure. The parser then holds the metadata in a structure (for example, parser.metadata might be a dictionary of all metadata fields).
- We attempt to fetch general.architecture and general.name from the metadata, which indicate the model’s architecture type and name respectively (gguf-parser · PyPI). For instance, general.architecture could be "llama" or "gptneox", and general.name might be a human-friendly model name if provided.
- If the library doesn’t directly expose a metadata dict, we could alternatively use parser.print() to output all metadata. (In this example, we assume parser.metadata exists for clarity. Depending on the library version, you might need to inspect the parser internals or use the provided methods to get metadata fields.)
After parsing, the GGUF parser provides a rich set of metadata. For example, printing the metadata of a LLaMA GGUF might show:
general.architecture: llama
general.name: Llama2-13B-chat
llama.context_length: 4096
llama.embedding_length: 5120
llama.block_count: 80
llama.attention.head_count: 64
... (other architecture-specific keys) ...
tokenizer.ggml.tokens: [...] (list of vocabulary tokens)
This confirms that the GGUF file contains all necessary config info (context length, layer counts, etc.) inside the file (gguf-parser · PyPI). You can access these programmatically via the parser.
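As an alternative, the official gguf package (published from the llama.cpp repository) exposes the same key-value pairs through its GGUFReader. The following is a minimal sketch based on gguf-py’s field layout, where a field’s value bytes live in its last parts entry; the read_string_field helper is our own, and the decoding details are assumptions worth checking against your installed version:

```python
from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")  # parses the header, metadata fields, and tensor info

def read_string_field(name: str):
    # Assumption: for a string-valued key, the UTF-8 bytes of the value
    # are held in the last entry of field.parts (as in gguf-py's dump script).
    field = reader.fields.get(name)
    if field is None:
        return None
    return bytes(field.parts[-1]).decode("utf-8")

print("Architecture:", read_string_field("general.architecture"))  # e.g. "llama"
print("Model name:", read_string_field("general.name"))

# Numeric fields (e.g. llama.block_count) come back as one-element arrays.
block_count = reader.fields.get("llama.block_count")
if block_count is not None:
    print("Block count:", int(block_count.parts[-1][0]))
```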
Hugging Face: Extracting Metadata
Hugging Face model repositories save the model configuration metadata in a separate JSON file (commonly config.json). This config defines architecture parameters such as the number of layers, hidden size, number of attention heads, activation function, etc. (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev). Instead of manually reading the JSON, the Hugging Face Transformers library provides convenient classes: we can load the config (or the model) and then inspect its config attribute.
Below is a code example using the Transformers library to load a model’s metadata:
from transformers import AutoConfig, AutoModel
model_name = "bert-base-uncased" # example model
# Load the configuration without weights
config = AutoConfig.from_pretrained(model_name)
# Access some key metadata fields from the config
architecture = config.model_type # e.g., "bert"
num_layers = config.num_hidden_layers # e.g., 12 for BERT-base
hidden_size = config.hidden_size # e.g., 768 for BERT-base
activation = config.hidden_act # e.g., "gelu"
print(f"Architecture: {architecture}")
print(f"Number of layers: {num_layers}")
print(f"Hidden size (embedding): {hidden_size}")
print(f"Activation function: {activation}")
Comments:
- We use AutoConfig.from_pretrained to fetch the config.json for the given model. This returns a PretrainedConfig object (in this case a BertConfig, since the model is BERT) populated with architecture details.
- We then retrieve the fields model_type (architecture name), num_hidden_layers, hidden_size, and hidden_act (the activation function used in the layers). Most Transformer configs provide these or similarly named attributes. For example, BERT’s config specifies hidden_act="gelu" and num_hidden_layers=12 (HuggingFace Config Params Explained).
- We could also load the model directly with AutoModel.from_pretrained and read model.config to get the same config object. Loading just the config (without the large weight file) is more efficient when we only need metadata.
Running this code for bert-base-uncased would output something like:
Architecture: bert
Number of layers: 12
Hidden size (embedding): 768
Activation function: gelu
These values match the entries in the model’s config.json (e.g., BERT-base’s config has 12 layers, hidden size 768, and uses GELU activation (HuggingFace Config Params Explained)). Hugging Face’s config covers similar information to GGUF metadata: vocab_size, num_attention_heads, dropout rates, etc., are all available as config attributes (HuggingFace Config Params Explained).
Note: The Hugging Face config does not directly store the number of parameters in the model – that must be computed from the weights. We will address that in a later section.
Extracting Token Dictionaries (Vocabulary)
A crucial part of any language model is the tokenizer vocabulary – the mapping of tokens (subword pieces or words) to token IDs. Both GGUF and Hugging Face formats include this information, but again in different ways:
- GGUF: Can embed the entire token dictionary inside the .gguf file as part of the metadata (under keys like tokenizer.ggml.tokens, tokenizer.ggml.scores, etc.) (huggingface.co) (huggingface.co). This makes the GGUF file self-contained for inference, since it knows how to tokenize inputs.
- Hugging Face: Provides vocabulary files separately (e.g. vocab.txt for WordPiece, or tokenizer.json and merges.txt for BPE). When using the Transformers library, the vocabulary is loaded via AutoTokenizer.
We will demonstrate how to retrieve the vocabulary from each format.
GGUF: Reading the Vocabulary
If a GGUF model has an embedded vocabulary, the parser will expose it. For instance, a GPT-2 style model in GGUF might have tokenizer.ggml.tokens as an array of strings (the tokens), and possibly tokenizer.ggml.merges for the BPE merges if applicable (huggingface.co) (huggingface.co). Let’s extend our GGUF parsing example to extract the token list:
# Continuing from the earlier GGUFParser usage
# After parser.parse()
vocab = None
if hasattr(parser, "metadata"):
    meta = parser.metadata
    vocab = meta.get("tokenizer.ggml.tokens")
if vocab is not None:
    print(f"Vocabulary size (GGUF): {len(vocab)} tokens")
    # Print the first 10 tokens as a sample
    print("Sample tokens:", vocab[:10])
Comments:
- We assume the parser.metadata dict contains the key tokenizer.ggml.tokens, which holds a list of all tokens in order of their token IDs (gguf-parser · PyPI). In the example metadata above, this list had 128256 entries (for a LLaMA 70B vocabulary) and included tokens like "!", "\"", etc.
- We print the length of this list to get the vocabulary size. This should match the model’s vocab_size, which is also likely present as a metadata field (e.g., llama.vocab_size) (gguf-parser · PyPI).
- We also print a sample of the first 10 tokens to verify the content. (Typically the first few tokens are special tokens or punctuation.)
If the tokens are stored differently (for example, some GGUF files might store a raw tokenizer JSON under tokenizer.huggingface.json (huggingface.co)), additional parsing would be needed. But commonly, for LLMs, the simple list of tokens is available. The gguf-parser library we use should have already decoded the byte arrays into Python strings for us. If using the official gguf library’s GGUFReader, one would retrieve the ReaderField for tokenizer.ggml.tokens and convert the bytes to strings manually.
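For reference, here is a minimal sketch of that manual route with the official gguf package; it assumes the ReaderField layout in which field.data lists the indices of the parts holding each token’s UTF-8 bytes:

```python
from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")
field = reader.fields.get("tokenizer.ggml.tokens")
if field is not None:
    # Assumption: for an array-of-strings field, each index in field.data
    # points at a parts entry containing one token's raw UTF-8 bytes.
    tokens = [bytes(field.parts[i]).decode("utf-8", errors="replace") for i in field.data]
    print(f"Vocabulary size (GGUF): {len(tokens)} tokens")
    print("Sample tokens:", tokens[:10])
```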
Hugging Face: Reading the Vocabulary
In the Hugging Face format, the vocabulary is handled by the tokenizer. To get the token dictionary (i.e., the token-to-index mapping), we use the AutoTokenizer class. The tokenizer will load from the files provided in the model repository (tokenizer.json, vocab.txt, etc.). Most tokenizers have a get_vocab() method that returns a dict of token -> ID mappings.
Here’s how we can load a tokenizer and extract its vocabulary in Python:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab_dict = tokenizer.get_vocab()
print(f"Vocabulary size (HF): {len(vocab_dict)} tokens")
# Print a few sample tokens and their IDs
for token, idx in list(vocab_dict.items())[:10]:
print(f"{token!r}: {idx}")
Comments:
- AutoTokenizer.from_pretrained(model_name) will automatically download or load the tokenizer files for the given model. For example, for GPT-2 it will load vocab.json and merges.txt; for BERT it will load vocab.txt; for LLaMA it might load a SentencePiece model file. AutoTokenizer handles the details internally.
- We then call tokenizer.get_vocab(). This returns a dictionary where keys are token strings and values are their integer IDs (Understanding the Llama2 Tokenizer: Working with the ... - Medium). (If using a “fast” tokenizer, get_vocab() might raise an error for some models; in that case, one can use tokenizer.vocab for WordPiece tokenizers or access the internal tokenizer.decoder for BPE. But get_vocab() works for most standard cases.)
- We print len(vocab_dict) to get the vocabulary size. This should correspond to the vocab_size field in the model config. For instance, bert-base-uncased has a vocab size of 30522 (BERT - Hugging Face), and indeed the tokenizer’s vocab dict will have 30522 entries.
- Finally, we iterate over a slice of the vocabulary dictionary to print sample tokens. The !r in the print format ensures special characters are visible (for example, it might show tokens like '[CLS]' or whitespace tokens).
For a BPE-based tokenizer like GPT-2, the vocabulary might include byte-level tokens. For example, running the above on "gpt2" yields tokens like '!': 0, '"': 1, '#': 2, and so on, mapping punctuation and letters to IDs. The merges (BPE rules) are not directly shown by get_vocab() but are used internally by the tokenizer. In contrast, the GGUF format might store those merges explicitly under a metadata key like tokenizer.ggml.merges (as a list of merge rule strings) (huggingface.co).
In summary, obtaining the token dictionary in Hugging Face is straightforward via the tokenizer API, whereas in GGUF you extract it from the file’s metadata. In both cases, you can get the vocabulary size and contents for further analysis or verification.
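As a quick cross-check between the two, you can compare the tokenizer’s vocabulary size against the vocab_size recorded in the config. A small sketch using bert-base-uncased (note that some models legitimately differ here, e.g. when the embedding matrix is padded beyond the tokenizer’s vocabulary):

```python
from transformers import AutoConfig, AutoTokenizer

name = "bert-base-uncased"
config = AutoConfig.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

vocab = tokenizer.get_vocab()
print(f"Tokenizer vocab size: {len(vocab)}")  # 30522 for bert-base-uncased
print(f"Config vocab_size:    {config.vocab_size}")
print("Match:", len(vocab) == config.vocab_size)
```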
Model Architecture Details and Number of Parameters
Understanding a model’s architecture (layer structure, activation functions, etc.) and its scale (number of parameters) is often necessary for research and deployment. We’ve seen how architecture hyperparameters (layers, heads, etc.) appear in metadata/config. Here we demonstrate retrieving a bit more detail, including calculating the number of parameters in the model and identifying layer structures, for both formats.
GGUF: Architecture and Parameters
The GGUF metadata provides the fundamental architecture parameters as keys (especially under the [architecture].* namespace, e.g. llama. for LLaMA models) (huggingface.co). This includes things like the number of layers (block_count), the hidden dimension (embedding_length), the intermediate feed-forward size (feed_forward_length), the number of attention heads (head_count), the type of rotary embedding (rope.* keys), etc. Using these, one can infer the layer structure (e.g., an LLM with N blocks, each containing self-attention and feed-forward sublayers). Activation function information is sometimes implicit (for instance, many models use GELU or ReLU by convention; some configs might include an explicit key if the choice is non-standard).
To get the total number of parameters from a GGUF file, we usually need to sum up the sizes of all tensors. Each tensor’s shape is available via the GGUF parser’s tensor list. We can iterate through parser.tensors (or use the official GGUFReader.tensors) and sum up the product of each tensor’s dimensions. Below is a code example to compute the parameter count and list some layer weights:
# Assuming parser.parse() was done earlier
total_params = 0
for tensor in parser.tensors:  # each tensor has a .shape attribute (e.g., [768, 3072])
    # Calculate the number of elements in this tensor
    count = 1
    for dim in tensor.shape:
        count *= dim
    total_params += count
print(f"Total model parameters (approx): {total_params}")
# For example, this might print ~6.7 billion for a 7B model
# (embedding matrices and the like contribute to the total).
# You can also examine specific tensor names to infer architecture details:
for t in parser.tensors[:5]:
    print(f"Tensor: {t.name}, shape: {tuple(t.shape)}, type: {t.tensor_type}")
Comments:
- parser.tensors is assumed to be a list of tensor metadata objects, each with attributes like name (the layer weight name, e.g., "transformer.h.0.attn.q_proj.weight" in a GPT-style model), shape (the dimensions of the tensor), and possibly tensor_type (the data type or quantization type, e.g., GGML_TYPE_Q4_K for a quantized 4-bit weight) (gguf-parser · PyPI). This is based on gguf-parser, which prints tensor info after the metadata.
- We loop through each tensor to accumulate total_params. The product of the dimensions gives the number of elements (parameters) in that weight, and summing these yields the total parameter count. Note this is a raw count of all parameters, including embeddings, and does not distinguish between encoder and decoder if both are present.
- We print out a few tensor names and shapes to see the layer structure. The naming conventions in GGUF (inherited from the original model) often reveal the model architecture. For example, you might see names like layers.0.attention.wq.weight or decoder.block.0.ffn.weight, which tell you the model has a layered structure and what each tensor represents.
Using this method on a known model should match the expected parameter count. For instance, if you sum parameters of a LLaMA-7B GGUF, you should get roughly 6.7 billion, aligning with the model’s advertised size. Keep in mind quantized GGUF models store weights in compressed form, but the count of logical parameters remains the same (just stored with fewer bits).
Also, if the GGUF is a sharded model (split into multiple files), you would need to parse each shard to get the full parameter count. GGUF supports sharding, indicated by filename and metadata, but each file would be parsed similarly (huggingface.co).
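If you do need to handle a split model, a sketch of summing across shards could look like the following; the glob pattern assumes the common -NNNNN-of-NNNNN.gguf naming for split files and uses the official gguf reader’s tensor list:

```python
import glob
from gguf import GGUFReader

total_params = 0
# Assumption: shards follow the usual "model-00001-of-00003.gguf" naming scheme.
for shard_path in sorted(glob.glob("path/to/model-*-of-*.gguf")):
    reader = GGUFReader(shard_path)
    for tensor in reader.tensors:
        n = 1
        for dim in tensor.shape:
            n *= int(dim)
        total_params += n
print(f"Total parameters across shards: {total_params}")
```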
Hugging Face: Architecture and Parameters
For Hugging Face models, the architecture details are readily available via the config, as shown before. To reiterate, typical fields include:
- config.num_hidden_layers – number of transformer layers (e.g., 12 for BERT-base, 24 for BERT-large) (HuggingFace Config Params Explained).
- config.num_attention_heads – number of attention heads per layer (HuggingFace Config Params Explained).
- config.hidden_size – the dimensionality of embeddings and layer inputs/outputs.
- config.intermediate_size – the size of the feed-forward layer’s inner dimension (for Transformer models) (HuggingFace Config Params Explained).
- config.hidden_act – activation function for the feed-forward layers (e.g., "gelu", "relu") (HuggingFace Config Params Explained).
These allow us to understand the layer structure. For example, a BERT config might say 12 layers, 12 heads, hidden_size 768, intermediate_size 3072, activation "gelu" (HuggingFace Config Params Explained), meaning each of the 12 layers has a self-attention with 12 heads and a 2-layer feed-forward network with GELU nonlinearity.
To get the total parameter count in Hugging Face, we can load the model weights and use the .num_parameters() method provided by the model (or sum manually). Here’s an example:
model = AutoModel.from_pretrained(model_name)
param_count = model.num_parameters()
print(f"Total model parameters: {param_count}")
If we use bert-base-uncased as model_name, this should output about 110 million parameters (the well-known size of BERT-base). Indeed, Hugging Face’s documentation demonstrates this: DistilBERT has ~67M and BERT-base has ~110M parameters (Fine-tuning a masked language model - Hugging Face NLP Course). The num_parameters() function conveniently includes all model parameters by default (How to get model size? - Hugging Face Forums).
Comments:
- Loading the full model can be memory-heavy for very large models. If you only need the count, you might avoid loading optimizer states or unnecessary components; however, .num_parameters() is straightforward and widely used for Hugging Face models.
- Under the hood, model.num_parameters() iterates through model.parameters() and sums their .numel(). You could replicate this manually: sum(p.numel() for p in model.parameters()) gives the same result.
- The param_count includes all weights (embeddings, transformer layers, output heads, etc.). If you want only trainable parameters or a subset, you can filter by p.requires_grad or by layer name. By default, Hugging Face models mark all model weights as trainable (unless you freeze some layers).
Using .num_parameters() on a known model provides a sanity check against the config. For example, if a config says 12 layers, hidden size 768, etc., one can roughly estimate the parameter count (there are standard formulae for Transformers) and the actual count will align. For BERT-base: ~110M parameters, as expected (Fine-tuning a masked language model - Hugging Face NLP Course).
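To make that estimate concrete, here is a rough back-of-the-envelope calculation for BERT-base from its config values (a sketch that ignores biases, LayerNorm weights, and the pooler, so it slightly undercounts):

```python
# Config values for bert-base-uncased
vocab_size, hidden, layers, intermediate, max_pos = 30522, 768, 12, 3072, 512

embeddings = (vocab_size + max_pos + 2) * hidden             # token + position + segment embeddings
per_layer = 4 * hidden * hidden + 2 * hidden * intermediate  # Q/K/V/O projections + feed-forward
estimate = embeddings + layers * per_layer

print(f"Estimated parameters: {estimate / 1e6:.0f}M")  # ~109M, close to the reported ~110M
```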
Activation Functions and Layer Details: The config’s hidden_act tells us the activation in the feed-forward layers (e.g., GELU for BERT; GPT-2 instead uses config.activation_function, which might be "gelu_new"). Also, layer_norm_eps in the config gives the epsilon used in the layer normalization layers, and so on. All such details are accessible via the config, so for Hugging Face models, reading the config is usually enough to know the architectural hyperparameters. In GGUF, these details are in metadata keys (for example, llama.attention.layer_norm_rms_epsilon might be the key for LLaMA’s RMSNorm epsilon (huggingface.co)).
Optimizer and Training Configuration Data
Finally, we consider optimizer states and training configuration. These are generally not part of a model’s core serialization for inference, but when saving checkpoints during training, one might save optimizer momentum, learning rate schedules, or training hyperparameters. The availability of this data depends on how the model was saved:
- GGUF: Since GGUF is designed for inference, it typically does not include optimizer states or training arguments. It contains only what is needed to load and run the model (plus metadata about how it was trained, such as the original training context length or dataset, if provided in the description). There are no standard keys in GGUF for optimizer states. (At most, you might see a general.training_* field if the converter recorded something, but this is not common.) So for GGUF, parsing optimizer info is usually not applicable; you would refer back to the original training logs or model card for those details.
- Hugging Face: If a model checkpoint is saved via the Trainer API or similar, you may have auxiliary files like optimizer.pt, scheduler.pt, and training_args.bin in the output directory. These contain the optimizer’s internal state (e.g., Adam moments), the learning rate scheduler state, and the training arguments (hyperparameters used for training). They are not loaded by from_pretrained by default, but you can load them manually with PyTorch.
Parsing training configuration (hyperparameters): Often, training_args.bin (or a .json equivalent) holds the training hyperparameters (learning rate, epochs, batch size, etc.). In many cases this is a binary pickled TrainingArguments object. You can load it with torch.load:
import torch
# Load training arguments (if available)
train_args = torch.load("path/to/training_args.bin", map_location="cpu")
print("Training arguments:", train_args)
# This might print a dataclass TrainingArguments with fields like learning_rate, num_train_epochs, etc.
# Load optimizer state dict (if available)
opt_state = torch.load("path/to/optimizer.pt", map_location="cpu")
# For example, opt_state might be a dict with keys 'state' (per-parameter states) and 'param_groups'
print("Optimizer state keys:", opt_state.keys())
if "param_groups" in opt_state:
    lr = opt_state["param_groups"][0].get('lr', None)
    print(f"Optimizer learning rate: {lr}")
Comments:
- We use torch.load to deserialize the objects. This requires that the environment has the same class definitions (for TrainingArguments) if the object is not a plain dictionary; with recent PyTorch versions you may also need to pass weights_only=False, since torch.load now defaults to weights-only loading. In practice, training_args.bin is often just a pickled TrainingArguments (a simple dataclass from Hugging Face). After loading, printing it will show something like TrainingArguments(output_dir='...', num_train_epochs=3, learning_rate=5e-5, per_device_train_batch_size=8, ...), giving all the training hyperparameters.
- The optimizer state (optimizer.pt), when loaded, is usually a state dict (a Python dict). Typically it has two main entries: "state" (a dict of parameter-specific states like momentum vectors) and "param_groups" (which contains the hyperparameters for the optimizer, like the learning rate for each group) – this is how PyTorch saves optimizer.state_dict(). We print the keys to confirm and then, if available, extract the learning rate from the first param group as an example.
- Note that these files are optional. If the model was not saved mid-training, or the uploader only provided the final weights, you may not have any training_args.bin or optimizer.pt. On the Hugging Face Hub, usually only the model weights and config are uploaded, not the optimizer; the training arguments might be documented in the model card instead.
By loading these files, you can programmatically inspect how the model was trained. For instance, verifying the learning rate or number of epochs can be done via the TrainingArguments object. This information is external to the model’s architecture but important for reproducibility.
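A small sketch of that kind of check, assuming the train_args loaded earlier really is a TrainingArguments instance:

```python
# Inspect individual hyperparameters on the loaded TrainingArguments object
print("Learning rate:", train_args.learning_rate)
print("Epochs:", train_args.num_train_epochs)
print("Batch size:", train_args.per_device_train_batch_size)

# TrainingArguments can also be serialized for record-keeping / reproducibility
print(train_args.to_json_string())
```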
In summary, GGUF files focus on model inference data and omit optimizer/training specifics, whereas Hugging Face training checkpoints can include optimizer states and training configs as separate files. We demonstrated how to load those with PyTorch for completeness.
Performance Considerations (Brief)
GGUF models are optimized for efficient inference, often using quantized weights to reduce size, which leads to faster loading and lower memory usage for CPU-bound deployments (GGUF and interaction with Transformers). The single-file design and memory-mappable format means startup is quick and all necessary data (weights + vocab + config) is loaded in one go. However, using GGUF in frameworks like PyTorch may require converting back to full precision, incurring some overhead (GGUF and interaction with Transformers). Hugging Face format models (in PyTorch or TensorFlow) are typically in higher precision and may leverage GPUs for faster computation, which can give better throughput on supported hardware. In practice, GGUF (with llama.cpp or similar executors) can deliver excellent CPU inference performance with minimal resources, while Hugging Face models excel in flexibility (fine-tuning, GPU acceleration) at the cost of larger memory footprints. Ultimately, the choice of format can affect load time and inference speed: GGUF offers a portable, compact model for deployment, whereas the standard Hugging Face format integrates seamlessly with training pipelines and broad hardware acceleration, making each format preferable for different stages of the model lifecycle (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev) (Understanding Hugging Face Model File Formats, GGML, and GGUF! | by Rajesh | DevOps.dev).
Prepared with OpenAI o1-pro & deep-research