CLIP (Contrastive Language–Image Pre-training)

CLIP (Contrastive Language–Image Pre-training) is a neural network architecture that learns visual concepts from natural language supervision. It is trained on a large dataset of image-text pairs to create a unified vision-language model that can understand both images and text in a shared semantic space.

CLIP consists of two main components:

1. A vision encoder (Vision Transformer) that processes images into visual features
2. A text encoder (Transformer) that processes text into textual features

The model is trained using contrastive learning, where it learns to maximize the cosine similarity between the embeddings of matching image-text pairs while minimizing it for non-matching pairs. This allows CLIP to perform zero-shot classification by comparing image embeddings with text embeddings of potential labels.
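
The objective and the zero-shot procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not jimm's training code: the helper names are hypothetical, random vectors stand in for encoder outputs, and the temperature is fixed at 0.07 for simplicity (CLIP learns this value during training).

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return float(0.5 * (loss_i2t + loss_t2i))

def zero_shot_predict(image_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """For each image, return the index of the most similar label embedding."""
    return np.argmax(l2_normalize(image_emb) @ l2_normalize(label_emb).T, axis=-1)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))                     # stand-in text embeddings
aligned = text_emb + 0.01 * rng.normal(size=(4, 8))    # images close to their captions
shuffled = aligned[::-1]                               # every pair deliberately mismatched

loss_aligned = contrastive_loss(aligned, text_emb)
loss_shuffled = contrastive_loss(shuffled, text_emb)
predictions = zero_shot_predict(aligned, text_emb)
```

The loss is low when each image is most similar to its own caption and high when pairs are mismatched. Zero-shot classification reuses the same similarity matrix, with the text side replaced by embeddings of candidate label prompts.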

CLIP was introduced in the paper "Learning Transferable Visual Models From Natural Language Supervision" and has shown strong zero-shot generalization across a wide range of visual classification tasks.

Flash / Splash Attention

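CLIP supports hardware-accelerated attention via Tokamax. Pass an attention_fn at construction time; jimm.make_tokamax_attention builds one from one or more of the backend names below, tried in the order given: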

Backend	Hardware	Notes
`"mosaic"`	NVIDIA H100 (SM90) / B100 (SM100)	Pallas Mosaic GPU kernel
`"triton"`	Any NVIDIA GPU	Pallas Triton kernel
`"cudnn"`	NVIDIA GPU	Via JAX-NN / cuDNN
`"mosaic_tpu"`	TPU v5 / v7	Splash attention (block-sparse)
`"xla_chunked"`	GPU / TPU	Flash-style chunked XLA
`"xla"`	Any	Standard XLA fallback

import jimm

# GPU: try H100 Mosaic kernel, fall back to Triton, then XLA
model = jimm.CLIP.from_pretrained("openai/clip-vit-large-patch14",
                                   attention_fn=jimm.make_tokamax_attention(["mosaic", "triton", "xla"]))

# TPU: try Splash attention, fall back to chunked XLA
model = jimm.CLIP.from_pretrained("openai/clip-vit-large-patch14",
                                   attention_fn=jimm.make_tokamax_attention(["mosaic_tpu", "xla_chunked"]))

You can also apply different kernels to each encoder via vision_attention_fn and text_attention_fn.

Note: Flash/Splash attention does not provide a speedup at typical CLIP context lengths (256 image tokens, 77 text tokens). The primary benefit is memory reduction at longer sequence lengths.
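
A back-of-envelope calculation shows why the benefit only appears at longer sequences. Standard attention materialises a score tensor of shape (batch, heads, seq_len, seq_len), which grows quadratically with sequence length, whereas flash-style kernels avoid storing it. The batch size, head count, and float32 scores below are assumed values chosen for illustration, and the helper is hypothetical.

```python
def attention_scores_bytes(batch: int, heads: int, seq_len: int, bytes_per_elem: int = 4) -> int:
    """Size in bytes of a materialised (batch, heads, seq_len, seq_len) score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

# Assumed settings: batch of 32, 16 heads, float32 scores.
short = attention_scores_bytes(batch=32, heads=16, seq_len=256)
long = attention_scores_bytes(batch=32, heads=16, seq_len=4096)

print(f"seq_len=256:  {short / 2**20:.0f} MiB per layer")   # 128 MiB
print(f"seq_len=4096: {long / 2**30:.0f} GiB per layer")    # 32 GiB
```

A 16x longer sequence costs 256x more memory for the scores alone, which is where kernels that never store the full matrix start to matter.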

FSDP / Explicit Sharding

CLIP supports JAX explicit sharding (FSDP-style) out of the box via CLIPSharding. Large weight matrices are sharded on the contracting (in_features) dimension so that activations carry only the batch-axis sharding, avoiding duplicate-axis conflicts.

from jax.experimental import mesh_utils
from jax.sharding import AxisType, Mesh, NamedSharding, PartitionSpec as P
import jax
import jimm

n_devices = jax.device_count()
mesh = Mesh(
    mesh_utils.create_device_mesh((1, n_devices)),
    ("data", "fsdp"),
    axis_types=(AxisType.Explicit, AxisType.Explicit),
)
jax.set_mesh(mesh)

model = jimm.CLIP.from_pretrained("openai/clip-vit-large-patch14")
# model params are automatically sharded across fsdp axis

CLIPSharding specs represent per-layer shapes. The Transformer stack prepends None for the scan axis to the Variable metadata after nnx.vmap, so the optimizer (e.g. nnx.Optimizer with AdamW) receives the correct stacked spec and initialises its state without any manual fixups.

To disable sharding, pass sharding=jimm.common.sharding.NoSharding().
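
The reason contracting-dimension sharding leaves activations with only batch-axis sharding can be seen from the arithmetic of a sharded matmul. The NumPy sketch below only illustrates that arithmetic. It makes no claim about how JAX or XLA schedule the underlying collectives, and the shard count and shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shards = 4                    # stand-in for the size of the "fsdp" mesh axis
x = rng.normal(size=(2, 8))     # (batch, in_features) activations
w = rng.normal(size=(8, 6))     # (in_features, out_features) kernel

# Each "device" holds a contiguous slice of the kernel rows (in_features),
# paired with the matching slice of the activation features.
w_shards = np.split(w, n_shards, axis=0)
x_slices = np.split(x, n_shards, axis=1)

# Contracting over in_features sums the per-shard partial products, so the
# sharded axis does not appear in the result at all.
y_sharded = sum(xs @ ws for xs, ws in zip(x_slices, w_shards))
y_full = x @ w
```

Because the sharded axis is summed away, the output has shape (batch, out_features) with nothing left to partition along the fsdp axis.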

`jimm.models.clip.CLIPVisionModel`

Bases: Module

Source code in src/jimm/models/clip/clip_model.py

class CLIPVisionModel(nnx.Module):
    def __init__(
        self,
        image_resolution: int,
        vision_layers: int,
        vision_hidden_size: int,
        vision_patch_size: int,
        projection_dim: int,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
    ):
        """Initialize the Vision Encoder with projection.

        Args:
            image_resolution (int): The resolution of the input images.
            vision_layers (int): The number of layers in the vision transformer.
            vision_hidden_size (int): The hidden dimension size of the vision transformer.
            vision_patch_size (int): The patch size of the vision transformer.
            projection_dim (int): The output dimension after projection.
            use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.
            rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec, optional): Sharding specification for parameters. Defaults to CLIPSharding.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        self.vision_layers = vision_layers
        self.vision_hidden_size = vision_hidden_size
        self.vision_patch_size = vision_patch_size
        self.projection_dim = projection_dim
        self.dtype = dtype

        vision_heads = vision_hidden_size // 64

        self.encoder = VisionTransformerBase(
            img_size=image_resolution,
            patch_size=vision_patch_size,
            in_channels=3,
            hidden_size=vision_hidden_size,
            num_layers=vision_layers,
            num_heads=vision_heads,
            mlp_dim=vision_hidden_size * 4,
            use_pre_norm=True,
            use_patch_bias=False,
            use_quick_gelu=True,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            pooling_type="CLS",
            layernorm_epsilon=1e-5,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )
        self.visual_projection = nnx.Linear(
            vision_hidden_size,
            projection_dim,
            use_bias=False,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            kernel_init=nnx.with_partitioning(
                nnx.initializers.xavier_uniform(),
                sharding.proj_kernel,
            ),
        )

    def __call__(self, image: Float[Array, "batch height width channels"], do_projection: bool = True) -> Float[Array, "batch vision_hidden_size_or_projection_dim"]:
        """Encode images into embeddings.

        Args:
            image (Float[Array, "batch height width channels"]): Batch of input images.
            do_projection (bool): Whether to apply the visual projection layer. Defaults to True.

        Returns:
            Float[Array, "batch vision_hidden_size_or_projection_dim"]: Image embeddings.
            Shape depends on do_projection: vision_hidden_size if False, projection_dim if True.
        """
        features = self.encoder(image)
        if do_projection:
            out_shard = named_sharding_like(features, P(sharding_of(features).spec[0], sharding_of(self.visual_projection.kernel[...]).spec[-1]))
            return self.visual_projection(features, out_sharding=out_shard)
        return features

    @classmethod
    def from_pretrained(
        cls,
        model_name_or_path: str,
        use_pytorch: bool = False,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "CLIPVisionModel":
        """Load a pretrained vision encoder from a CLIP checkpoint.

        Args:
            model_name_or_path (str): Path to local weights or HuggingFace model ID.
            use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
            rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec): Sharding specification for parameters. Defaults to CLIPSharding.
            use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

        Returns:
            CLIPVisionModel: Pretrained CLIP vision model
        """
        from .params import load_vision_from_pretrained

        return load_vision_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

    @classmethod
    def from_config(
        cls,
        config: dict[str, Any],
        *,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "CLIPVisionModel":
        """Create model from HuggingFace-compatible config dict.

        Args:
            config: Configuration with "vision_config" and "text_config" keys.
            rngs: Random number generator state.
            dtype: Data type for computations.
            param_dtype: Data type for parameters.
            sharding: Sharding specification for parameters.
            use_gradient_checkpointing: Enable gradient checkpointing.
            attention_fn: Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

        Returns:
            CLIPVisionModel with random weights.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        vision_config = config["vision_config"]
        text_config = config["text_config"]

        return cls(
            image_resolution=vision_config["image_size"],
            vision_layers=vision_config["num_hidden_layers"],
            vision_hidden_size=vision_config["hidden_size"],
            vision_patch_size=vision_config["patch_size"],
            projection_dim=text_config["hidden_size"],
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def save_pretrained(self, save_directory: str) -> None:
        """Save model weights and config in HuggingFace format.

        Args:
            save_directory (str): Directory path where the model will be saved.
        """
        from .params import save_vision_pretrained

        save_vision_pretrained(self, save_directory)

`__call__(image, do_projection=True)`

Encode images into embeddings.

Parameters:

Name	Type	Description	Default
`image`	`Float[Array, 'batch height width channels']`	Batch of input images.	required
`do_projection`	`bool`	Whether to apply the visual projection layer. Defaults to True.	`True`

Returns:

Type	Description
`Float[Array, 'batch vision_hidden_size_or_projection_dim']`	Image embeddings. Shape depends on do_projection: vision_hidden_size if False, projection_dim if True.

Source code in src/jimm/models/clip/clip_model.py

def __call__(self, image: Float[Array, "batch height width channels"], do_projection: bool = True) -> Float[Array, "batch vision_hidden_size_or_projection_dim"]:
    """Encode images into embeddings.

    Args:
        image (Float[Array, "batch height width channels"]): Batch of input images.
        do_projection (bool): Whether to apply the visual projection layer. Defaults to True.

    Returns:
        Float[Array, "batch vision_hidden_size_or_projection_dim"]: Image embeddings.
        Shape depends on do_projection: vision_hidden_size if False, projection_dim if True.
    """
    features = self.encoder(image)
    if do_projection:
        out_shard = named_sharding_like(features, P(sharding_of(features).spec[0], sharding_of(self.visual_projection.kernel[...]).spec[-1]))
        return self.visual_projection(features, out_sharding=out_shard)
    return features

`__init__(image_resolution, vision_layers, vision_hidden_size, vision_patch_size, projection_dim, use_gradient_checkpointing=False, attention_fn=None, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding)`

Initialize the Vision Encoder with projection.

Parameters:

Name	Type	Description	Default
`image_resolution`	`int`	The resolution of the input images.	required
`vision_layers`	`int`	The number of layers in the vision transformer.	required
`vision_hidden_size`	`int`	The hidden dimension size of the vision transformer.	required
`vision_patch_size`	`int`	The patch size of the vision transformer.	required
`projection_dim`	`int`	The output dimension after projection.	required
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`
`rngs`	`Rngs \| None`	The random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	The data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	The data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to CLIPSharding.	`CLIPSharding`

Source code in src/jimm/models/clip/clip_model.py

def __init__(
    self,
    image_resolution: int,
    vision_layers: int,
    vision_hidden_size: int,
    vision_patch_size: int,
    projection_dim: int,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
):
    """Initialize the Vision Encoder with projection.

    Args:
        image_resolution (int): The resolution of the input images.
        vision_layers (int): The number of layers in the vision transformer.
        vision_hidden_size (int): The hidden dimension size of the vision transformer.
        vision_patch_size (int): The patch size of the vision transformer.
        projection_dim (int): The output dimension after projection.
        use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.
        rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec, optional): Sharding specification for parameters. Defaults to CLIPSharding.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    self.vision_layers = vision_layers
    self.vision_hidden_size = vision_hidden_size
    self.vision_patch_size = vision_patch_size
    self.projection_dim = projection_dim
    self.dtype = dtype

    vision_heads = vision_hidden_size // 64

    self.encoder = VisionTransformerBase(
        img_size=image_resolution,
        patch_size=vision_patch_size,
        in_channels=3,
        hidden_size=vision_hidden_size,
        num_layers=vision_layers,
        num_heads=vision_heads,
        mlp_dim=vision_hidden_size * 4,
        use_pre_norm=True,
        use_patch_bias=False,
        use_quick_gelu=True,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        pooling_type="CLS",
        layernorm_epsilon=1e-5,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
    self.visual_projection = nnx.Linear(
        vision_hidden_size,
        projection_dim,
        use_bias=False,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        kernel_init=nnx.with_partitioning(
            nnx.initializers.xavier_uniform(),
            sharding.proj_kernel,
        ),
    )

`from_config(config, *, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Create model from HuggingFace-compatible config dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration with "vision_config" and "text_config" keys.	required
`rngs`	`Rngs \| None`	Random number generator state.	`None`
`dtype`	`DTypeLike`	Data type for computations.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`CLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`

Returns:

Type	Description
`CLIPVisionModel`	CLIPVisionModel with random weights.

Source code in src/jimm/models/clip/clip_model.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    *,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "CLIPVisionModel":
    """Create model from HuggingFace-compatible config dict.

    Args:
        config: Configuration with "vision_config" and "text_config" keys.
        rngs: Random number generator state.
        dtype: Data type for computations.
        param_dtype: Data type for parameters.
        sharding: Sharding specification for parameters.
        use_gradient_checkpointing: Enable gradient checkpointing.
        attention_fn: Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

    Returns:
        CLIPVisionModel with random weights.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    vision_config = config["vision_config"]
    text_config = config["text_config"]

    return cls(
        image_resolution=vision_config["image_size"],
        vision_layers=vision_config["num_hidden_layers"],
        vision_hidden_size=vision_config["hidden_size"],
        vision_patch_size=vision_config["patch_size"],
        projection_dim=text_config["hidden_size"],
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
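
To make the key lookups concrete, the sketch below mirrors them on a plain dict without importing jimm. The helper name is hypothetical, and the config holds illustrative values in the style of a ViT-L/14 checkpoint, restricted to the keys that from_config reads. The patch count uses the usual ViT formula, which is an assumption about VisionTransformerBase rather than something shown in this listing.

```python
from typing import Any

def vision_kwargs_from_config(config: dict[str, Any]) -> dict[str, int]:
    """Mirror the key lookups performed by from_config (illustrative helper, not part of jimm)."""
    vision_config = config["vision_config"]
    text_config = config["text_config"]
    return {
        "image_resolution": vision_config["image_size"],
        "vision_layers": vision_config["num_hidden_layers"],
        "vision_hidden_size": vision_config["hidden_size"],
        "vision_patch_size": vision_config["patch_size"],
        # The projection width is taken from the text encoder's hidden size.
        "projection_dim": text_config["hidden_size"],
    }

# Illustrative values; a real config contains many more keys.
config = {
    "vision_config": {"image_size": 224, "num_hidden_layers": 24, "hidden_size": 1024, "patch_size": 14},
    "text_config": {"hidden_size": 768},
}
kwargs = vision_kwargs_from_config(config)

# Quantities __init__ derives from these arguments.
vision_heads = kwargs["vision_hidden_size"] // 64   # 64-dimensional heads
mlp_dim = kwargs["vision_hidden_size"] * 4
# Assumed standard ViT patch grid (not shown in the listing above).
num_patches = (kwargs["image_resolution"] // kwargs["vision_patch_size"]) ** 2
```

Note that both "vision_config" and "text_config" must be present even though only the vision encoder is built, because the projection dimension is read from the text side.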

`from_pretrained(model_name_or_path, use_pytorch=False, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Load a pretrained vision encoder from a CLIP checkpoint.

Parameters:

Name	Type	Description	Default
`model_name_or_path`	`str`	Path to local weights or HuggingFace model ID.	required
`use_pytorch`	`bool`	Whether to load from PyTorch weights. Defaults to False.	`False`
`rngs`	`Rngs \| None`	Random number generator keys. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to CLIPSharding.	`CLIPSharding`
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`

Returns:

Type	Description
`CLIPVisionModel`	Pretrained CLIP vision model.

Source code in src/jimm/models/clip/clip_model.py

@classmethod
def from_pretrained(
    cls,
    model_name_or_path: str,
    use_pytorch: bool = False,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "CLIPVisionModel":
    """Load a pretrained vision encoder from a CLIP checkpoint.

    Args:
        model_name_or_path (str): Path to local weights or HuggingFace model ID.
        use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
        rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec): Sharding specification for parameters. Defaults to CLIPSharding.
        use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

    Returns:
        CLIPVisionModel: Pretrained CLIP vision model
    """
    from .params import load_vision_from_pretrained

    return load_vision_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

`save_pretrained(save_directory)`

Save model weights and config in HuggingFace format.

Parameters:

Name	Type	Description	Default
`save_directory`	`str`	Directory path where the model will be saved.	required

Source code in src/jimm/models/clip/clip_model.py

def save_pretrained(self, save_directory: str) -> None:
    """Save model weights and config in HuggingFace format.

    Args:
        save_directory (str): Directory path where the model will be saved.
    """
    from .params import save_vision_pretrained

    save_vision_pretrained(self, save_directory)

`jimm.models.clip.CLIPTextModel`

Bases: Module

Source code in src/jimm/models/clip/clip_model.py

class CLIPTextModel(nnx.Module):
    def __init__(
        self,
        context_length: int,
        vocab_size: int,
        text_hidden_size: int,
        num_text_heads: int,
        num_text_layers: int,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
    ):
        """Initialize CLIP text encoder.

        Args:
            context_length (int): Maximum sequence length.
            vocab_size (int): Size of vocabulary.
            text_hidden_size (int): Hidden dimension size of the text transformer.
            num_text_heads (int): Number of attention heads in the text transformer.
            num_text_layers (int): Number of transformer layers in the text transformer.
            use_gradient_checkpointing (bool): Enable gradient checkpointing.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.
            rngs (rnglib.Rngs | None): RNG state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Computation dtype.
            param_dtype (DTypeLike): Parameter dtype.
            sharding (ShardingSpec): Sharding specification for parameters.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        self.context_length = context_length
        self.vocab_size = vocab_size
        self.text_hidden_size = text_hidden_size
        self.num_text_heads = num_text_heads
        self.num_text_layers = num_text_layers
        self.dtype = dtype

        self.token_embedding = nnx.Embed(
            num_embeddings=vocab_size,
            features=text_hidden_size,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            embedding_init=nnx.with_partitioning(
                nnx.initializers.xavier_uniform(),
                sharding.embed,
            ),
        )
        self.positional_embedding = nnx.Param(
            nnx.with_partitioning(
                nnx.initializers.truncated_normal(stddev=0.02),
                sharding.text_pos_embed,
            )(rngs.params(), (context_length, text_hidden_size))
        )

        attn_mask = jnp.tril(jnp.ones((context_length, context_length), dtype=dtype))
        self.transformer = Transformer(
            hidden_size=text_hidden_size,
            mlp_dim=text_hidden_size * 4,
            num_layers=num_text_layers,
            num_heads=num_text_heads,
            dropout_rate=0.0,
            attn_mask=attn_mask,
            use_quick_gelu=True,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

        self.ln_final = nnx.LayerNorm(
            text_hidden_size,
            epsilon=1e-5,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            scale_init=nnx.with_partitioning(
                nnx.initializers.ones_init(),
                sharding.layernorm,
            ),
            bias_init=nnx.with_partitioning(
                nnx.initializers.zeros_init(),
                sharding.layernorm,
            ),
        )

        self.text_projection = nnx.Linear(
            text_hidden_size,
            text_hidden_size,
            use_bias=False,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            kernel_init=nnx.with_partitioning(
                nnx.initializers.xavier_uniform(),
                sharding.proj_kernel,
            ),
        )

    def __call__(self, text: Int[Array, "batch context_length"], do_projection: bool = True) -> Float[Array, "batch text_hidden_size"]:
        """Encode text tokens into embeddings.

        Args:
            text (Int[Array, "batch context_length"]): Token sequences.
            do_projection (bool): Apply text projection layer.

        Returns:
            Float[Array, "batch text_hidden_size"]: Text embeddings.
        """
        seq_len = text.shape[1]
        text_sharding = sharding_of(text)
        embed_sharding = named_sharding_like(text, P(*text_sharding.spec, None))
        x = self.token_embedding.embedding[...].at[text].get(out_sharding=embed_sharding)
        pos_embed = jnp.broadcast_to(self.positional_embedding[...][:seq_len], x.shape)
        x = x + reshard_like(pos_embed, x)
        x = self.transformer(x)
        x = self.ln_final(x)

        eot_mask = jax.nn.one_hot(jnp.argmax(text, axis=-1), x.shape[1])
        x_spec = sharding_of(x).spec
        pooled_sharding = named_sharding_like(x, P(x_spec[0], x_spec[2]))
        x = jnp.einsum("bsh,bs->bh", x, eot_mask, out_sharding=pooled_sharding)

        if do_projection:
            out_shard = named_sharding_like(x, P(sharding_of(x).spec[0], sharding_of(self.text_projection.kernel[...]).spec[-1]))
            x = self.text_projection(x, out_sharding=out_shard)
        return x

    @classmethod
    def from_pretrained(
        cls,
        model_name_or_path: str,
        use_pytorch: bool = False,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "CLIPTextModel":
        """Load pretrained text encoder from CLIP checkpoint.

        Args:
            model_name_or_path (str): Local path or HuggingFace model ID.
            use_pytorch (bool): Load from PyTorch weights.
            rngs (rnglib.Rngs | None): RNG state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Computation dtype.
            param_dtype (DTypeLike): Parameter dtype.
            sharding (ShardingSpec): Sharding specification for parameters.
            use_gradient_checkpointing (bool): Enable gradient checkpointing.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

        Returns:
            CLIPTextModel: Pretrained text model.
        """
        from .params import load_text_from_pretrained

        return load_text_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

    @classmethod
    def from_config(
        cls,
        config: dict[str, Any],
        *,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "CLIPTextModel":
        """Create model from HuggingFace-compatible config dict.

        Args:
            config: Configuration with "text_config" key.
            rngs: Random number generator state.
            dtype: Data type for computations.
            param_dtype: Data type for parameters.
            sharding: Sharding specification for parameters.
            use_gradient_checkpointing: Enable gradient checkpointing.
            attention_fn: Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

        Returns:
            CLIPTextModel with random weights.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        text_config = config["text_config"]

        return cls(
            context_length=text_config["max_position_embeddings"],
            vocab_size=text_config["vocab_size"],
            text_hidden_size=text_config["hidden_size"],
            num_text_heads=text_config["num_attention_heads"],
            num_text_layers=text_config["num_hidden_layers"],
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def save_pretrained(self, save_directory: str) -> None:
        """Save model weights and config in HuggingFace format.

        Args:
            save_directory (str): Directory path where the model will be saved.
        """
        from .params import save_text_pretrained

        save_text_pretrained(self, save_directory)

`__call__(text, do_projection=True)`

Encode text tokens into embeddings.

Parameters:

Name	Type	Description	Default
`text`	`Int[Array, 'batch context_length']`	Token sequences.	required
`do_projection`	`bool`	Apply text projection layer.	`True`

Returns:

Type	Description
`Float[Array, 'batch text_hidden_size']`	Text embeddings.

Source code in src/jimm/models/clip/clip_model.py

def __call__(self, text: Int[Array, "batch context_length"], do_projection: bool = True) -> Float[Array, "batch text_hidden_size"]:
    """Encode text tokens into embeddings.

    Args:
        text (Int[Array, "batch context_length"]): Token sequences.
        do_projection (bool): Apply text projection layer.

    Returns:
        Float[Array, "batch text_hidden_size"]: Text embeddings.
    """
    seq_len = text.shape[1]
    text_sharding = sharding_of(text)
    embed_sharding = named_sharding_like(text, P(*text_sharding.spec, None))
    x = self.token_embedding.embedding[...].at[text].get(out_sharding=embed_sharding)
    pos_embed = jnp.broadcast_to(self.positional_embedding[...][:seq_len], x.shape)
    x = x + reshard_like(pos_embed, x)
    x = self.transformer(x)
    x = self.ln_final(x)

    eot_mask = jax.nn.one_hot(jnp.argmax(text, axis=-1), x.shape[1])
    x_spec = sharding_of(x).spec
    pooled_sharding = named_sharding_like(x, P(x_spec[0], x_spec[2]))
    x = jnp.einsum("bsh,bs->bh", x, eot_mask, out_sharding=pooled_sharding)

    if do_projection:
        out_shard = named_sharding_like(x, P(sharding_of(x).spec[0], sharding_of(self.text_projection.kernel[...]).spec[-1]))
        x = self.text_projection(x, out_sharding=out_shard)
    return x
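
The pooling step in the listing above takes `argmax` over the token ids, builds a one-hot mask from it, and contracts that mask against the hidden states. This works because the end-of-text token has the largest id in the CLIP vocabulary, so the argmax position is the end-of-text position. The following is a minimal dependency-free sketch of that idea, not the library's implementation; `eot_pool` and the toy batch are hypothetical, and padding with id 0 is an assumption.

```python
def eot_pool(token_ids, hidden_states):
    """Select each sequence's hidden state at the position of its largest token id."""
    pooled = []
    for ids, states in zip(token_ids, hidden_states):
        # Counterpart of jnp.argmax(text, axis=-1): first index of the maximum id.
        eot_index = max(range(len(ids)), key=lambda i: ids[i])
        # Counterpart of the one-hot mask followed by einsum("bsh,bs->bh").
        one_hot = [1.0 if i == eot_index else 0.0 for i in range(len(ids))]
        hidden_size = len(states[0])
        pooled.append(
            [sum(one_hot[s] * states[s][h] for s in range(len(ids))) for h in range(hidden_size)]
        )
    return pooled


# Toy batch (hypothetical values): 49406 = start-of-text, 49407 = end-of-text, 0 = assumed padding.
token_ids = [[49406, 320, 49407, 0], [49406, 320, 1125, 49407]]
hidden_states = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]],
    [[4.0, 4.1], [5.0, 5.1], [6.0, 6.1], [7.0, 7.1]],
]
pooled = eot_pool(token_ids, hidden_states)
```

Expressing the gather as a mask-and-contract keeps the operation a plain `einsum`, which is presumably why the listing can pass an explicit `out_sharding` for the pooled result.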

`__init__(context_length, vocab_size, text_hidden_size, num_text_heads, num_text_layers, use_gradient_checkpointing=False, attention_fn=None, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding)`

Initialize CLIP text encoder.

Parameters:

Name	Type	Description	Default
`context_length`	`int`	Maximum sequence length.	required
`vocab_size`	`int`	Size of vocabulary.	required
`text_hidden_size`	`int`	Hidden dimension size of the text transformer.	required
`num_text_heads`	`int`	Number of attention heads in the text transformer.	required
`num_text_layers`	`int`	Number of transformer layers in the text transformer.	required
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`
`rngs`	`Rngs \| None`	RNG state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Computation dtype.	`float32`
`param_dtype`	`DTypeLike`	Parameter dtype.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`CLIPSharding`

Source code in src/jimm/models/clip/clip_model.py

def __init__(
    self,
    context_length: int,
    vocab_size: int,
    text_hidden_size: int,
    num_text_heads: int,
    num_text_layers: int,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
):
    """Initialize CLIP text encoder.

    Args:
        context_length (int): Maximum sequence length.
        vocab_size (int): Size of vocabulary.
        text_hidden_size (int): Hidden dimension size of the text transformer.
        num_text_heads (int): Number of attention heads in the text transformer.
        num_text_layers (int): Number of transformer layers in the text transformer.
        use_gradient_checkpointing (bool): Enable gradient checkpointing.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.
        rngs (rnglib.Rngs | None): RNG state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Computation dtype.
        param_dtype (DTypeLike): Parameter dtype.
        sharding (ShardingSpec): Sharding specification for parameters.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    self.context_length = context_length
    self.vocab_size = vocab_size
    self.text_hidden_size = text_hidden_size
    self.num_text_heads = num_text_heads
    self.num_text_layers = num_text_layers
    self.dtype = dtype

    self.token_embedding = nnx.Embed(
        num_embeddings=vocab_size,
        features=text_hidden_size,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        embedding_init=nnx.with_partitioning(
            nnx.initializers.xavier_uniform(),
            sharding.embed,
        ),
    )
    self.positional_embedding = nnx.Param(
        nnx.with_partitioning(
            nnx.initializers.truncated_normal(stddev=0.02),
            sharding.text_pos_embed,
        )(rngs.params(), (context_length, text_hidden_size))
    )

    attn_mask = jnp.tril(jnp.ones((context_length, context_length), dtype=dtype))
    self.transformer = Transformer(
        hidden_size=text_hidden_size,
        mlp_dim=text_hidden_size * 4,
        num_layers=num_text_layers,
        num_heads=num_text_heads,
        dropout_rate=0.0,
        attn_mask=attn_mask,
        use_quick_gelu=True,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

    self.ln_final = nnx.LayerNorm(
        text_hidden_size,
        epsilon=1e-5,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        scale_init=nnx.with_partitioning(
            nnx.initializers.ones_init(),
            sharding.layernorm,
        ),
        bias_init=nnx.with_partitioning(
            nnx.initializers.zeros_init(),
            sharding.layernorm,
        ),
    )

    self.text_projection = nnx.Linear(
        text_hidden_size,
        text_hidden_size,
        use_bias=False,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        kernel_init=nnx.with_partitioning(
            nnx.initializers.xavier_uniform(),
            sharding.proj_kernel,
        ),
    )
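
The constructor above builds its attention mask with `jnp.tril(jnp.ones((context_length, context_length)))`, that is, a lower-triangular matrix. As a rough illustration of what that mask contains, here is a dependency-free sketch; `causal_mask` is a hypothetical helper, and reading 1.0 as "may attend" assumes the usual mask convention rather than anything confirmed about `Transformer`.

```python
def causal_mask(context_length):
    """Lower-triangular mask: entry [i][j] is 1.0 when position i may attend to position j."""
    return [
        [1.0 if j <= i else 0.0 for j in range(context_length)]
        for i in range(context_length)
    ]


# A 4-token context: row i has ones in columns 0..i and zeros afterwards.
mask = causal_mask(4)
```

Under that convention each token sees only itself and earlier tokens, which is what makes the final end-of-text position a summary of the whole sequence.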

`from_config(config, *, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Create model from HuggingFace-compatible config dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration with "text_config" key.	required
`rngs`	`Rngs \| None`	Random number generator state.	`None`
`dtype`	`DTypeLike`	Data type for computations.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`CLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`

Returns:

Type	Description
`CLIPTextModel`	CLIPTextModel with random weights.

Source code in src/jimm/models/clip/clip_model.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    *,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "CLIPTextModel":
    """Create model from HuggingFace-compatible config dict.

    Args:
        config: Configuration with "text_config" key.
        rngs: Random number generator state.
        dtype: Data type for computations.
        param_dtype: Data type for parameters.
        sharding: Sharding specification for parameters.
        use_gradient_checkpointing: Enable gradient checkpointing.
        attention_fn: Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

    Returns:
        CLIPTextModel with random weights.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    text_config = config["text_config"]

    return cls(
        context_length=text_config["max_position_embeddings"],
        vocab_size=text_config["vocab_size"],
        text_hidden_size=text_config["hidden_size"],
        num_text_heads=text_config["num_attention_heads"],
        num_text_layers=text_config["num_hidden_layers"],
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
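
The mapping from HuggingFace config keys to constructor arguments in the listing above can be checked in isolation. The sketch below copies only that key mapping into a hypothetical helper, `text_kwargs_from_config`; the example values are illustrative numbers for a small CLIP text tower and are not read from any checkpoint.

```python
from typing import Any


def text_kwargs_from_config(config: dict[str, Any]) -> dict[str, int]:
    """Translate the "text_config" entries into CLIPTextModel constructor keyword names."""
    text_config = config["text_config"]  # raises KeyError if the key is missing
    return {
        "context_length": text_config["max_position_embeddings"],
        "vocab_size": text_config["vocab_size"],
        "text_hidden_size": text_config["hidden_size"],
        "num_text_heads": text_config["num_attention_heads"],
        "num_text_layers": text_config["num_hidden_layers"],
    }


# Illustrative config values (assumed for the example).
example_config = {
    "text_config": {
        "max_position_embeddings": 77,
        "vocab_size": 49408,
        "hidden_size": 512,
        "num_attention_heads": 8,
        "num_hidden_layers": 12,
    }
}
kwargs = text_kwargs_from_config(example_config)
```

Because the method indexes the dict directly, every one of these keys has to be present; extra keys in `text_config` are simply ignored.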

`from_pretrained(model_name_or_path, use_pytorch=False, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Load pretrained text encoder from CLIP checkpoint.

Parameters:

Name	Type	Description	Default
`model_name_or_path`	`str`	Local path or HuggingFace model ID.	required
`use_pytorch`	`bool`	Load from PyTorch weights.	`False`
`rngs`	`Rngs \| None`	RNG state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Computation dtype.	`float32`
`param_dtype`	`DTypeLike`	Parameter dtype.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`CLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`

Returns:

Name	Type	Description
`CLIPTextModel`	`CLIPTextModel`	Pretrained text model.

Source code in src/jimm/models/clip/clip_model.py

@classmethod
def from_pretrained(
    cls,
    model_name_or_path: str,
    use_pytorch: bool = False,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "CLIPTextModel":
    """Load pretrained text encoder from CLIP checkpoint.

    Args:
        model_name_or_path (str): Local path or HuggingFace model ID.
        use_pytorch (bool): Load from PyTorch weights.
        rngs (rnglib.Rngs | None): RNG state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Computation dtype.
        param_dtype (DTypeLike): Parameter dtype.
        sharding (ShardingSpec): Sharding specification for parameters.
        use_gradient_checkpointing (bool): Enable gradient checkpointing.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

    Returns:
        CLIPTextModel: Pretrained text model.
    """
    from .params import load_text_from_pretrained

    return load_text_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

`save_pretrained(save_directory)`

Save model weights and config in HuggingFace format.

Parameters:

Name	Type	Description	Default
`save_directory`	`str`	Directory path where the model will be saved.	required

Source code in src/jimm/models/clip/clip_model.py

def save_pretrained(self, save_directory: str) -> None:
    """Save model weights and config in HuggingFace format.

    Args:
        save_directory (str): Directory path where the model will be saved.
    """
    from .params import save_text_pretrained

    save_text_pretrained(self, save_directory)

`jimm.models.clip.CLIP`

Bases: Module

Source code in src/jimm/models/clip/clip_model.py

class CLIP(nnx.Module):
    def __init__(
        self,
        image_resolution: int,
        vision_layers: int,
        vision_hidden_size: int,
        vision_patch_size: int,
        context_length: int,
        vocab_size: int,
        text_hidden_size: int,
        num_text_heads: int,
        num_text_layers: int,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        vision_attention_fn: Callable[..., Any] | None = None,
        text_attention_fn: Callable[..., Any] | None = None,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
    ):
        """Initialize the CLIP model.

        Args:
            image_resolution (int): The resolution of the input images.
            vision_layers (int): The number of layers in the vision transformer.
            vision_hidden_size (int): The hidden dimension size of the vision transformer.
            vision_patch_size (int): The patch size of the vision transformer.
            context_length (int): The maximum sequence length for text.
            vocab_size (int): The size of the vocabulary.
            text_hidden_size (int): The hidden dimension size of the text transformer.
            num_text_heads (int): The number of attention heads in the text transformer.
            num_text_layers (int): The number of layers in the text transformer.
            use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function applied to both encoders. Defaults to None.
            vision_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for the vision encoder only. Defaults to None.
            text_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for the text encoder only. Defaults to None.
            rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec): Sharding specification for parameters. Defaults to CLIPSharding.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        self.vision_layers = vision_layers
        self.vision_hidden_size = vision_hidden_size
        self.vision_patch_size = vision_patch_size
        self.context_length = context_length
        self.vocab_size = vocab_size
        self.text_hidden_size = text_hidden_size
        self.num_text_heads = num_text_heads
        self.num_text_layers = num_text_layers
        self.dtype = dtype
        self._original_config = None

        self.vision_model = CLIPVisionModel(
            image_resolution=image_resolution,
            vision_layers=vision_layers,
            vision_hidden_size=vision_hidden_size,
            vision_patch_size=vision_patch_size,
            projection_dim=text_hidden_size,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=vision_attention_fn or attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

        self.text_model = CLIPTextModel(
            context_length=context_length,
            vocab_size=vocab_size,
            text_hidden_size=text_hidden_size,
            num_text_heads=num_text_heads,
            num_text_layers=num_text_layers,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=text_attention_fn or attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )
        self.logit_scale = nnx.Param(nnx.with_partitioning(nnx.initializers.ones_init(), ())(rngs.params(), ()))

    def encode_image(self, image: Float[Array, "batch height width channels"], do_projection: bool = True) -> Float[Array, "batch text_hidden_size"]:
        """Encode images into embeddings.

        Args:
            image (Float[Array, "batch height width channels"]): Batch of input images.
            do_projection (bool): Whether to apply the visual projection layer. Defaults to True.

        Returns:
            Float[Array, "batch text_hidden_size"]: Image embeddings.
        """
        return self.vision_model(image, do_projection)

    def encode_text(self, text: Int[Array, "batch context_length"]) -> Float[Array, "batch text_hidden_size"]:
        """Encode text tokens into embeddings.

        Args:
            text (Int[Array, "batch context_length"]): Batch of token sequences.

        Returns:
            Float[Array, "batch text_hidden_size"]: Text embeddings.
        """
        return self.text_model(text, do_projection=True)

    def __call__(self, image: Float[Array, "batch height width channels"], text: Int[Array, "batch context_length"]) -> Float[Array, "batch batch"]:
        """Calculate similarity between image and text embeddings.

        Args:
            image (Float[Array, "batch height width channels"]): Batch of input images.
            text (Int[Array, "batch context_length"]): Batch of token sequences.

        Returns:
            Float[Array, "batch batch"]: Similarity scores between all pairs of images and texts.
        """
        image_features: Float[Array, "batch text_hidden_size"] = self.encode_image(image, do_projection=True)
        text_features: Float[Array, "batch text_hidden_size"] = self.encode_text(text)

        image_features: Float[Array, "batch text_hidden_size"] = image_features / jnp.linalg.norm(image_features, axis=-1, keepdims=True)
        text_features: Float[Array, "batch text_hidden_size"] = text_features / jnp.linalg.norm(text_features, axis=-1, keepdims=True)

        logit_scale: Float[Array, ""] = jnp.exp(self.logit_scale[...])
        image_spec = sharding_of(image_features).spec
        logits_sharding = named_sharding_like(image_features, P(image_spec[0], None))
        logits: Float[Array, "batch batch"] = logit_scale * jnp.matmul(image_features, text_features.T, out_sharding=logits_sharding)
        return logits

    @classmethod
    def from_pretrained(
        cls,
        model_name_or_path: str,
        use_pytorch: bool = False,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "CLIP":
        """Load a pretrained CLIP model from a local path or HuggingFace Hub.

        Args:
            model_name_or_path (str): Path to local weights or HuggingFace model ID.
            use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
            rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec): Sharding specification for parameters. Defaults to CLIPSharding.
            use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

        Returns:
            CLIP: Pretrained CLIP model
        """
        from .params import load_from_pretrained

        return load_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

    @classmethod
    def from_config(
        cls,
        config: dict[str, Any],
        *,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = CLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        vision_attention_fn: Callable[..., Any] | None = None,
        text_attention_fn: Callable[..., Any] | None = None,
    ) -> "CLIP":
        """Create model from HuggingFace-compatible config dict.

        Args:
            config: Configuration with "text_config" and "vision_config" keys.
            rngs: Random number generator state.
            dtype: Data type for computations.
            param_dtype: Data type for parameters.
            sharding: Sharding specification for parameters.
            use_gradient_checkpointing: Enable gradient checkpointing.
            attention_fn: Custom attention function applied to both encoders. Defaults to None.
            vision_attention_fn: Override attention_fn for the vision encoder only. Defaults to None.
            text_attention_fn: Override attention_fn for the text encoder only. Defaults to None.

        Returns:
            CLIP model with random weights.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        text_config = config["text_config"]
        vision_config = config["vision_config"]

        return cls(
            image_resolution=vision_config["image_size"],
            vision_layers=vision_config["num_hidden_layers"],
            vision_hidden_size=vision_config["hidden_size"],
            vision_patch_size=vision_config["patch_size"],
            context_length=text_config["max_position_embeddings"],
            vocab_size=text_config["vocab_size"],
            text_hidden_size=text_config["hidden_size"],
            num_text_heads=text_config["num_attention_heads"],
            num_text_layers=text_config["num_hidden_layers"],
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            vision_attention_fn=vision_attention_fn,
            text_attention_fn=text_attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def save_pretrained(self, save_directory: str) -> None:
        """Save the model weights and config in HuggingFace format.

        Args:
            save_directory (str): Directory path where the model will be saved.
        """
        from .params import save_pretrained

        save_pretrained(self, save_directory)

`__call__(image, text)`

Calculate similarity between image and text embeddings.

Parameters:

Name	Type	Description	Default
`image`	`Float[Array, 'batch height width channels']`	Batch of input images.	required
`text`	`Int[Array, 'batch context_length']`	Batch of token sequences.	required

Returns:

Type	Description
`Float[Array, 'batch batch']`	Similarity scores between all pairs of images and texts.

Source code in src/jimm/models/clip/clip_model.py

def __call__(self, image: Float[Array, "batch height width channels"], text: Int[Array, "batch context_length"]) -> Float[Array, "batch batch"]:
    """Calculate similarity between image and text embeddings.

    Args:
        image (Float[Array, "batch height width channels"]): Batch of input images.
        text (Int[Array, "batch context_length"]): Batch of token sequences.

    Returns:
        Float[Array, "batch batch"]: Similarity scores between all pairs of images and texts.
    """
    image_features: Float[Array, "batch text_hidden_size"] = self.encode_image(image, do_projection=True)
    text_features: Float[Array, "batch text_hidden_size"] = self.encode_text(text)

    image_features: Float[Array, "batch text_hidden_size"] = image_features / jnp.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features: Float[Array, "batch text_hidden_size"] = text_features / jnp.linalg.norm(text_features, axis=-1, keepdims=True)

    logit_scale: Float[Array, ""] = jnp.exp(self.logit_scale[...])
    image_spec = sharding_of(image_features).spec
    logits_sharding = named_sharding_like(image_features, P(image_spec[0], None))
    logits: Float[Array, "batch batch"] = logit_scale * jnp.matmul(image_features, text_features.T, out_sharding=logits_sharding)
    return logits
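
The similarity computation in the listing above has three steps: L2-normalize both sets of embeddings, take all pairwise dot products (which are then cosine similarities), and multiply by the exponential of the stored `logit_scale`. The following dependency-free sketch walks through those steps on toy vectors; `l2_normalize`, `clip_logits`, and the scale value are assumptions made for the example, not part of the library.

```python
import math


def l2_normalize(vec):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def clip_logits(image_features, text_features, log_logit_scale):
    """Scaled cosine similarity between every image and every text embedding."""
    images = [l2_normalize(v) for v in image_features]
    texts = [l2_normalize(v) for v in text_features]
    # The parameter is stored in log space, so it is exponentiated before use.
    scale = math.exp(log_logit_scale)
    return [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]


# Toy 2-dimensional embeddings (hypothetical values).
image_features = [[3.0, 4.0], [1.0, 0.0]]
text_features = [[6.0, 8.0], [0.0, 2.0]]
logits = clip_logits(image_features, text_features, log_logit_scale=math.log(100.0))
```

Row `i` of the result scores image `i` against every text, so parallel vectors such as `[3, 4]` and `[6, 8]` reach the maximum value (the scale itself) regardless of their lengths, while orthogonal vectors score zero.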

`__init__(image_resolution, vision_layers, vision_hidden_size, vision_patch_size, context_length, vocab_size, text_hidden_size, num_text_heads, num_text_layers, use_gradient_checkpointing=False, attention_fn=None, vision_attention_fn=None, text_attention_fn=None, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding)`

Initialize the CLIP model.

Parameters:

Name	Type	Description	Default
`image_resolution`	`int`	The resolution of the input images.	required
`vision_layers`	`int`	The number of layers in the vision transformer.	required
`vision_hidden_size`	`int`	The hidden dimension size of the vision transformer.	required
`vision_patch_size`	`int`	The patch size of the vision transformer.	required
`context_length`	`int`	The maximum sequence length for text.	required
`vocab_size`	`int`	The size of the vocabulary.	required
`text_hidden_size`	`int`	The hidden dimension size of the text transformer.	required
`num_text_heads`	`int`	The number of attention heads in the text transformer.	required
`num_text_layers`	`int`	The number of layers in the text transformer.	required
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function applied to both encoders. Defaults to None.	`None`
`vision_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for the vision encoder only. Defaults to None.	`None`
`text_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for the text encoder only. Defaults to None.	`None`
`rngs`	`Rngs \| None`	The random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	The data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	The data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to CLIPSharding.	`CLIPSharding`

Source code in src/jimm/models/clip/clip_model.py

def __init__(
    self,
    image_resolution: int,
    vision_layers: int,
    vision_hidden_size: int,
    vision_patch_size: int,
    context_length: int,
    vocab_size: int,
    text_hidden_size: int,
    num_text_heads: int,
    num_text_layers: int,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    vision_attention_fn: Callable[..., Any] | None = None,
    text_attention_fn: Callable[..., Any] | None = None,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
):
    """Initialize the CLIP model.

    Args:
        image_resolution (int): The resolution of the input images.
        vision_layers (int): The number of layers in the vision transformer.
        vision_hidden_size (int): The hidden dimension size of the vision transformer.
        vision_patch_size (int): The patch size of the vision transformer.
        context_length (int): The maximum sequence length for text.
        vocab_size (int): The size of the vocabulary.
        text_hidden_size (int): The hidden dimension size of the text transformer.
        num_text_heads (int): The number of attention heads in the text transformer.
        num_text_layers (int): The number of layers in the text transformer.
        use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function applied to both encoders. Defaults to None.
        vision_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for the vision encoder only. Defaults to None.
        text_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for the text encoder only. Defaults to None.
        rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec): Sharding specification for parameters. Defaults to CLIPSharding.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    self.vision_layers = vision_layers
    self.vision_hidden_size = vision_hidden_size
    self.vision_patch_size = vision_patch_size
    self.context_length = context_length
    self.vocab_size = vocab_size
    self.text_hidden_size = text_hidden_size
    self.num_text_heads = num_text_heads
    self.num_text_layers = num_text_layers
    self.dtype = dtype
    self._original_config = None

    self.vision_model = CLIPVisionModel(
        image_resolution=image_resolution,
        vision_layers=vision_layers,
        vision_hidden_size=vision_hidden_size,
        vision_patch_size=vision_patch_size,
        projection_dim=text_hidden_size,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=vision_attention_fn or attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

    self.text_model = CLIPTextModel(
        context_length=context_length,
        vocab_size=vocab_size,
        text_hidden_size=text_hidden_size,
        num_text_heads=num_text_heads,
        num_text_layers=num_text_layers,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=text_attention_fn or attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
    self.logit_scale = nnx.Param(nnx.with_partitioning(nnx.initializers.ones_init(), ())(rngs.params(), ()))
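
The constructor above picks each encoder's attention function with `vision_attention_fn or attention_fn` and `text_attention_fn or attention_fn`, so an encoder-specific argument wins and the shared one is the fallback. A small sketch of that resolution rule follows; `resolve_attention_fns` and the two placeholder functions are hypothetical names introduced for illustration.

```python
def resolve_attention_fns(attention_fn=None, vision_attention_fn=None, text_attention_fn=None):
    """Return (vision_fn, text_fn) using the same `x or fallback` rule as the constructor."""
    return (vision_attention_fn or attention_fn, text_attention_fn or attention_fn)


def shared_fn(*args, **kwargs):
    """Placeholder standing in for an attention function used by both encoders."""


def text_only_fn(*args, **kwargs):
    """Placeholder standing in for an attention function used by the text encoder only."""


# The vision encoder falls back to the shared function; the text encoder uses its override.
vision_fn, text_fn = resolve_attention_fns(attention_fn=shared_fn, text_attention_fn=text_only_fn)

# With nothing passed, both resolve to None and the default attention path is used.
default_pair = resolve_attention_fns()
```

One consequence of this rule is that an override cannot be used to switch a single encoder back to the default: passing `None` as the override just selects the shared `attention_fn` again.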

`encode_image(image, do_projection=True)`

Encode images into embeddings.

Parameters:

Name	Type	Description	Default
`image`	`Float[Array, 'batch height width channels']`	Batch of input images.	required
`do_projection`	`bool`	Whether to apply the visual projection layer. Defaults to True.	`True`

Returns:

Type	Description
`Float[Array, 'batch text_hidden_size']`	Image embeddings.

Source code in src/jimm/models/clip/clip_model.py

def encode_image(self, image: Float[Array, "batch height width channels"], do_projection: bool = True) -> Float[Array, "batch text_hidden_size"]:
    """Encode images into embeddings.

    Args:
        image (Float[Array, "batch height width channels"]): Batch of input images.
        do_projection (bool): Whether to apply the visual projection layer. Defaults to True.

    Returns:
        Float[Array, "batch text_hidden_size"]: Image embeddings.
    """
    return self.vision_model(image, do_projection)

`encode_text(text)`

Encode text tokens into embeddings.

Parameters:

Name	Type	Description	Default
`text`	`Int[Array, 'batch context_length']`	Batch of token sequences.	required

Returns:

Type	Description
`Float[Array, 'batch text_hidden_size']`	Text embeddings.

Source code in src/jimm/models/clip/clip_model.py

def encode_text(self, text: Int[Array, "batch context_length"]) -> Float[Array, "batch text_hidden_size"]:
    """Encode text tokens into embeddings.

    Args:
        text (Int[Array, "batch context_length"]): Batch of token sequences.

    Returns:
        Float[Array, "batch text_hidden_size"]: Text embeddings.
    """
    return self.text_model(text, do_projection=True)
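The embeddings returned by `encode_image` and `encode_text` share the same trailing dimension, so they can be compared directly for zero-shot classification. The following is a minimal, dependency-free sketch of that comparison, not jimm's own implementation. It assumes the embeddings are non-zero and not yet normalized, uses plain Python lists in place of arrays, and treats `scale` as a hypothetical temperature, since how the model applies `logit_scale` is not shown in this section.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, scale=1.0):
    """Return softmax probabilities over labels from scaled cosine similarities."""
    img = l2_normalize(image_emb)
    logits = [scale * sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]
    # Subtract the maximum logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional embeddings standing in for real encoder outputs.
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one row per candidate label
probs = zero_shot_probs(image_emb, text_embs, scale=10.0)
best_label = max(range(len(probs)), key=probs.__getitem__)
```

With real model outputs, the same steps would typically be written with `jax.numpy` operations over the batch dimension.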

`from_config(config, *, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding, use_gradient_checkpointing=False, attention_fn=None, vision_attention_fn=None, text_attention_fn=None)` `classmethod`

Create model from HuggingFace-compatible config dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration with "text_config" and "vision_config" keys.	required
`rngs`	`Rngs \| None`	Random number generator state.	`None`
`dtype`	`DTypeLike`	Data type for computations.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`CLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function applied to both encoders. Defaults to None.	`None`
`vision_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for the vision encoder only. Defaults to None.	`None`
`text_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for the text encoder only. Defaults to None.	`None`

Returns:

Type	Description
`CLIP`	CLIP model with random weights.

Source code in src/jimm/models/clip/clip_model.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    *,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    vision_attention_fn: Callable[..., Any] | None = None,
    text_attention_fn: Callable[..., Any] | None = None,
) -> "CLIP":
    """Create model from HuggingFace-compatible config dict.

    Args:
        config: Configuration with "text_config" and "vision_config" keys.
        rngs: Random number generator state.
        dtype: Data type for computations.
        param_dtype: Data type for parameters.
        sharding: Sharding specification for parameters.
        use_gradient_checkpointing: Enable gradient checkpointing.
        attention_fn: Custom attention function applied to both encoders. Defaults to None.
        vision_attention_fn: Override attention_fn for the vision encoder only. Defaults to None.
        text_attention_fn: Override attention_fn for the text encoder only. Defaults to None.

    Returns:
        CLIP model with random weights.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    text_config = config["text_config"]
    vision_config = config["vision_config"]

    return cls(
        image_resolution=vision_config["image_size"],
        vision_layers=vision_config["num_hidden_layers"],
        vision_hidden_size=vision_config["hidden_size"],
        vision_patch_size=vision_config["patch_size"],
        context_length=text_config["max_position_embeddings"],
        vocab_size=text_config["vocab_size"],
        text_hidden_size=text_config["hidden_size"],
        num_text_heads=text_config["num_attention_heads"],
        num_text_layers=text_config["num_hidden_layers"],
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        vision_attention_fn=vision_attention_fn,
        text_attention_fn=text_attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
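`from_config` indexes the config dict directly, so a missing key surfaces as a `KeyError`. The sketch below lists the keys read by the source above and checks a config for them before construction. The helper `missing_config_keys` and the example values are hypothetical. The numbers are placeholders for illustration, not a statement about any published checkpoint.

```python
from typing import Any

# Keys that from_config reads from each sub-config, taken from the source listing above.
VISION_KEYS = ("image_size", "num_hidden_layers", "hidden_size", "patch_size")
TEXT_KEYS = (
    "max_position_embeddings",
    "vocab_size",
    "hidden_size",
    "num_attention_heads",
    "num_hidden_layers",
)


def missing_config_keys(config: dict[str, Any]) -> list[str]:
    """Return dotted names of keys from_config would fail on; an empty list means none are missing."""
    missing: list[str] = []
    for section, keys in (("vision_config", VISION_KEYS), ("text_config", TEXT_KEYS)):
        sub = config.get(section)
        if sub is None:
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in sub)
    return missing


# Placeholder values for illustration only.
example_config = {
    "vision_config": {"image_size": 224, "num_hidden_layers": 12, "hidden_size": 768, "patch_size": 32},
    "text_config": {
        "max_position_embeddings": 77,
        "vocab_size": 49408,
        "hidden_size": 512,
        "num_attention_heads": 8,
        "num_hidden_layers": 12,
    },
}
problems = missing_config_keys(example_config)
```

A config that passes this check could then be handed to `CLIP.from_config(example_config)`, which returns a model with random weights.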

`from_pretrained(model_name_or_path, use_pytorch=False, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=CLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Load a pretrained CLIP model from a local path or HuggingFace Hub.

Parameters:

Name	Type	Description	Default
`model_name_or_path`	`str`	Path to local weights or HuggingFace model ID.	required
`use_pytorch`	`bool`	Whether to load from PyTorch weights. Defaults to False.	`False`
`rngs`	`Rngs \| None`	Random number generator keys. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to CLIPSharding.	`CLIPSharding`
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.	`None`

Returns:

Type	Description
`CLIP`	Pretrained CLIP model.

Source code in src/jimm/models/clip/clip_model.py

@classmethod
def from_pretrained(
    cls,
    model_name_or_path: str,
    use_pytorch: bool = False,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = CLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "CLIP":
    """Load a pretrained CLIP model from a local path or HuggingFace Hub.

    Args:
        model_name_or_path (str): Path to local weights or HuggingFace model ID.
        use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
        rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec): Sharding specification for parameters. Defaults to CLIPSharding.
        use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention or jimm.make_tokamax_attention("mosaic_tpu")). Defaults to None.

    Returns:
        CLIP: Pretrained CLIP model
    """
    from .params import load_from_pretrained

    return load_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

`save_pretrained(save_directory)`

Save the model weights and config in HuggingFace format.

Parameters:

Name	Type	Description	Default
`save_directory`	`str`	Directory path where the model will be saved.	required

Source code in src/jimm/models/clip/clip_model.py

def save_pretrained(self, save_directory: str) -> None:
    """Save the model weights and config in HuggingFace format.

    Args:
        save_directory (str): Directory path where the model will be saved.
    """
    from .params import save_pretrained

    save_pretrained(self, save_directory)
