SigLIP (Sigmoid Loss for Language Image Pre-Training)

SigLIP (Sigmoid Loss for Language Image Pre-Training) is a vision-language model that builds on the principles of CLIP but changes the training objective: it uses a pairwise sigmoid loss instead of the softmax-based contrastive loss. There are also some slight implementation differences: the text encoder takes no attention_mask, the text inputs are padded, and the vision encoder uses multi-head attention pooling rather than a linear projection layer.

This change simplifies the training objective by treating every image-text pair as an independent binary classification problem (is the pair a positive or a negative match?). Because the loss needs no global normalization over all pairs in a batch, it scales more easily to large batches and is more robust to noisy, web-scale data.
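The pairwise objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not jimm's training code: the function name `sigmoid_loss` and the constant `temperature` and `bias` values are assumptions chosen for the example (in a real model both are learned parameters).

```python
import numpy as np


def sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss for a batch where row i of each input is a matching pair."""
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = temperature * img @ txt.T + bias  # shape (batch, batch)
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * np.eye(logits.shape[0]) - 1.0
    # -log(sigmoid(z)) == log(1 + exp(-z)); logaddexp keeps this numerically stable.
    per_pair = np.logaddexp(0.0, -labels * logits)
    # Every pair contributes independently: sum over pairs, average over the batch.
    return per_pair.sum() / logits.shape[0]


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss_matched = sigmoid_loss(emb, emb)         # each text matches its image
loss_shuffled = sigmoid_loss(emb, emb[::-1])  # texts paired with the wrong images
```

Note that no softmax appears anywhere: each entry of the similarity matrix is scored on its own, which is why the loss does not need a normalization across the whole batch.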

Key features of SigLIP:

1. Vision Encoder: A Vision Transformer (ViT) with a Multi-Head Attention Pooling (MAP) head.
2. Text Encoder: A standard Transformer model.
3. Sigmoid Loss: Enables training on larger batches and noisier datasets without requiring careful data curation or complex negative sampling strategies.
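To make the Multi-Head Attention Pooling (MAP) idea concrete, here is a simplified single-head sketch in NumPy: one learned query (the "probe") attends over all patch tokens and returns a single pooled vector per image. It is an assumption-laden illustration rather than jimm's implementation; the names `attention_pool`, `probe`, `w_q`, `w_k` and `w_v` are hypothetical, and the layer norm and MLP that a full MAP head applies after the attention step are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(tokens, probe, w_q, w_k, w_v):
    """Pool (batch, seq, dim) tokens into (batch, dim) with one learned query."""
    q = probe @ w_q                          # (1, dim): the single learned query
    k = tokens @ w_k                         # (batch, seq, dim)
    v = tokens @ w_v                         # (batch, seq, dim)
    scores = k @ q.T / np.sqrt(q.shape[-1])  # (batch, seq, 1)
    weights = softmax(scores, axis=1)        # normalize over the sequence axis
    return (weights * v).sum(axis=1)         # weighted sum of values: (batch, dim)


rng = np.random.default_rng(0)
dim = 8
tokens = rng.normal(size=(2, 16, dim))       # 2 images, 16 patch tokens each
probe = rng.normal(size=(1, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
pooled = attention_pool(tokens, probe, w_q, w_k, w_v)
```

Unlike a class-token or mean-pooling head, the weights given to each patch are computed from the patch contents, so the pooled vector can emphasize the most informative regions.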

SigLIP was introduced in the paper "Sigmoid Loss for Language Image Pre-Training" (Zhai et al., 2023), whose authors report improved performance and training efficiency compared with softmax-based contrastive training.

Flash / Splash Attention

SigLIP supports hardware-accelerated attention via Tokamax. Pass an attention_fn at construction time:

Backend	Hardware	Notes
`"mosaic"`	NVIDIA H100 (SM90) / B100 (SM100)	Pallas Mosaic GPU kernel
`"triton"`	Any NVIDIA GPU	Pallas Triton kernel
`"cudnn"`	NVIDIA GPU	Via JAX-NN / cuDNN
`"mosaic_tpu"`	TPU v5 / v7	Splash attention (block-sparse)
`"xla_chunked"`	GPU / TPU	Flash-style chunked XLA
`"xla"`	Any	Standard XLA fallback

import jimm

# GPU: try H100 Mosaic kernel, fall back to Triton, then XLA
model = jimm.SigLIP.from_pretrained("google/siglip-base-patch16-256",
                                     attention_fn=jimm.make_tokamax_attention(["mosaic", "triton", "xla"]))

# TPU: try Splash attention, fall back to chunked XLA
model = jimm.SigLIP.from_pretrained("google/siglip-base-patch16-256",
                                     attention_fn=jimm.make_tokamax_attention(["mosaic_tpu", "xla_chunked"]))

Note: Flash/Splash attention does not provide a speedup at typical SigLIP context lengths (256 image tokens, 64 text tokens). The primary benefit is memory reduction at longer sequence lengths.
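A back-of-the-envelope calculation shows why the memory benefit only appears at longer sequences. The sketch below counts the bytes needed to materialize the full attention score matrix, which flash-style kernels avoid storing. The helper names are hypothetical, and the 12 heads and float32 scores are assumptions for a base-sized model.

```python
def num_patches(image_resolution: int, patch_size: int) -> int:
    """Number of image tokens produced by a square, non-overlapping patch grid."""
    per_side = image_resolution // patch_size
    return per_side * per_side


def attention_scores_bytes(seq_len: int, num_heads: int, batch: int = 1, bytes_per_el: int = 4) -> int:
    """Bytes needed to hold one (seq_len x seq_len) score matrix per head."""
    return batch * num_heads * seq_len * seq_len * bytes_per_el


image_tokens = num_patches(256, 16)                                # 16 x 16 patch grid
short_seq = attention_scores_bytes(image_tokens, num_heads=12)     # per image, per layer
long_seq = attention_scores_bytes(4096, num_heads=12)              # a much longer sequence

print(image_tokens, short_seq // 2**20, long_seq // 2**20)         # tokens, MiB, MiB
```

The score matrix grows quadratically with sequence length, so it is small at 256 tokens and only starts to dominate memory at sequence lengths well beyond the typical SigLIP configuration.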

FSDP / Explicit Sharding

SigLIP supports JAX explicit sharding (FSDP-style) out of the box via SigLIPSharding. Large weight matrices are sharded on the contracting (in_features) dimension so that activations carry only the batch-axis sharding.

import jax
from jax.experimental import mesh_utils
from jax.sharding import AxisType, Mesh

import jimm

n_devices = jax.device_count()
mesh = Mesh(
    mesh_utils.create_device_mesh((1, n_devices)),
    ("data", "fsdp"),
    axis_types=(AxisType.Explicit, AxisType.Explicit),
)
jax.set_mesh(mesh)

model = jimm.SigLIP.from_pretrained("google/siglip-base-patch16-256")
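The contracting-dimension layout can be illustrated with plain NumPy, independent of JAX. The sketch below only demonstrates the arithmetic: a weight split along in_features can either be gathered before the matmul or combined through partial products, and in both cases the output has no feature-axis split. It does not show how JAX or jimm actually schedule collectives, and the shapes and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_features, out_features, n_shards = 4, 8, 6, 2

x = rng.normal(size=(batch, in_features))
w = rng.normal(size=(in_features, out_features))

# Each "device" stores only a slice of rows along the contracting (in_features) axis.
w_shards = np.split(w, n_shards, axis=0)

# View 1 (FSDP-style): gather the full weight right before it is used.
w_gathered = np.concatenate(w_shards, axis=0)
y = x @ w_gathered

# View 2: multiply matching slices and sum the partial products.
x_shards = np.split(x, n_shards, axis=1)
y_partial = sum(xs @ ws for xs, ws in zip(x_shards, w_shards))
```

Either way the result has shape (batch, out_features) with the feature axis intact, so only the batch axis of the activations needs a sharding annotation.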

SigLIPSharding specs represent per-layer shapes. The Transformer stack prepends None for the scan axis to the Variable metadata after nnx.vmap, so the optimizer receives the correct stacked spec natively without any manual fixups.
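The relationship between a per-layer spec and the stacked spec can be shown with plain tuples standing in for partition specs. This is a hypothetical sketch: `stack_spec` is not a jimm function, and the `("fsdp", None)` kernel spec is an assumed example consistent with sharding on the contracting dimension.

```python
def stack_spec(per_layer_spec: tuple) -> tuple:
    """Prepend None (unsharded) for the leading layer axis of stacked parameters."""
    return (None, *per_layer_spec)


# Assumed spec for one (in_features, out_features) kernel.
per_layer_kernel_spec = ("fsdp", None)
stacked_kernel_spec = stack_spec(per_layer_kernel_spec)

# After stacking, the parameter gains a leading num_layers axis,
# so the spec must have one entry per array dimension.
num_layers, in_features, out_features = 12, 768, 3072
stacked_shape = (num_layers, in_features, out_features)
assert len(stacked_kernel_spec) == len(stacked_shape)
```

Because the layer axis is never sharded, each device holds the same slice of every layer, and the per-layer specs can be written without reference to how many layers the stack contains.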

To disable sharding, pass sharding=jimm.common.sharding.NoSharding().

`jimm.models.siglip.SigLIPVisionModel`

Bases: Module

Source code in src/jimm/models/siglip/siglip_model.py

class SigLIPVisionModel(nnx.Module):
    def __init__(
        self,
        image_resolution: int,
        vision_layers: int,
        vision_hidden_size: int,
        vision_patch_size: int,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
    ):
        """Initialize the SigLIP Vision Encoder.

        Args:
            image_resolution (int): The resolution of the input images.
            vision_layers (int): The number of layers in the vision transformer.
            vision_hidden_size (int): The hidden dimension size of the vision transformer.
            vision_patch_size (int): The patch size of the vision transformer.
            use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.
            rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec, optional): Sharding specification for parameters. Defaults to SigLIPSharding.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        self.vision_layers = vision_layers
        self.vision_hidden_size = vision_hidden_size
        self.vision_patch_size = vision_patch_size
        self.dtype = dtype

        vision_heads = vision_hidden_size // 64

        self.encoder = VisionTransformerBase(
            img_size=image_resolution,
            patch_size=vision_patch_size,
            in_channels=3,
            hidden_size=vision_hidden_size,
            num_layers=vision_layers,
            num_heads=vision_heads,
            mlp_dim=vision_hidden_size * 4,
            use_pre_norm=False,
            use_patch_bias=True,
            use_quick_gelu=False,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            pooling_type="MAP",
            layernorm_epsilon=1e-6,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def __call__(self, image: Float[Array, "batch height width channels"], do_projection: bool = True) -> Float[Array, "batch vision_hidden_size"]:
        """Encode images into embeddings.

        Args:
            image (Float[Array, "batch height width channels"]): Batch of input images.
            do_projection (bool): Included for API compatibility with CLIP. SigLIP vision model doesn't have a projection layer. Defaults to True.

        Returns:
            Float[Array, "batch vision_hidden_size"]: Image embeddings.
        """
        return self.encoder(image)

    @classmethod
    def from_pretrained(
        cls,
        model_name_or_path: str,
        use_pytorch: bool = False,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "SigLIPVisionModel":
        """Load a pretrained vision encoder from a SigLIP checkpoint.

        Args:
            model_name_or_path (str): Path to local weights or HuggingFace model ID.
            use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
            rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec): Sharding specification for parameters. Defaults to SigLIPSharding.
            use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

        Returns:
            SigLIPVisionModel: Pretrained SigLIP vision model
        """
        from .params import load_vision_from_pretrained

        return load_vision_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

    @classmethod
    def from_config(
        cls,
        config: dict[str, Any],
        *,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "SigLIPVisionModel":
        """Create model from HuggingFace-compatible config dict.

        Args:
            config: Configuration with "vision_config" key.
            rngs: Random number generator state. If None, initializes to nnx.Rngs(0).
            dtype: Data type for computations.
            param_dtype: Data type for parameters.
            sharding: Sharding specification for parameters.
            use_gradient_checkpointing: Enable gradient checkpointing.
            attention_fn: Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

        Returns:
            SigLIPVisionModel with random weights.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        vision_config = config["vision_config"]

        return cls(
            image_resolution=vision_config["image_size"],
            vision_layers=vision_config["num_hidden_layers"],
            vision_hidden_size=vision_config["hidden_size"],
            vision_patch_size=vision_config["patch_size"],
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def save_pretrained(self, save_directory: str) -> None:
        """Save model weights and config in HuggingFace format.

        Args:
            save_directory (str): Directory path where the model will be saved.
        """
        from .params import save_vision_pretrained

        save_vision_pretrained(self, save_directory)

`__call__(image, do_projection=True)`

Encode images into embeddings.

Parameters:

Name	Type	Description	Default
`image`	`Float[Array, 'batch height width channels']`	Batch of input images.	required
`do_projection`	`bool`	Ignored. Included for API compatibility with CLIP; the SigLIP vision model has no projection layer.	`True`

Returns:

Type	Description
`Float[Array, 'batch vision_hidden_size']`	Image embeddings.

Source code in src/jimm/models/siglip/siglip_model.py

def __call__(self, image: Float[Array, "batch height width channels"], do_projection: bool = True) -> Float[Array, "batch vision_hidden_size"]:
    """Encode images into embeddings.

    Args:
        image (Float[Array, "batch height width channels"]): Batch of input images.
        do_projection (bool): Included for API compatibility with CLIP. SigLIP vision model doesn't have a projection layer. Defaults to True.

    Returns:
        Float[Array, "batch vision_hidden_size"]: Image embeddings.
    """
    return self.encoder(image)

`__init__(image_resolution, vision_layers, vision_hidden_size, vision_patch_size, use_gradient_checkpointing=False, attention_fn=None, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding)`

Initialize the SigLIP Vision Encoder.

Parameters:

Name	Type	Description	Default
`image_resolution`	`int`	The resolution of the input images.	required
`vision_layers`	`int`	The number of layers in the vision transformer.	required
`vision_hidden_size`	`int`	The hidden dimension size of the vision transformer.	required
`vision_patch_size`	`int`	The patch size of the vision transformer.	required
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`
`rngs`	`Rngs \| None`	The random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	The data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	The data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to SigLIPSharding.	`SigLIPSharding`
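Two hyperparameters of the underlying transformer are not constructor arguments: the constructor derives the number of attention heads as `vision_hidden_size // 64` and the MLP width as `vision_hidden_size * 4`. The small helper below mirrors that arithmetic; the function name and the example hidden sizes are assumptions for illustration.

```python
def derived_vision_hparams(vision_hidden_size: int) -> dict[str, int]:
    """Mirror the values the constructor derives from the hidden size."""
    return {
        "num_heads": vision_hidden_size // 64,  # fixed head dimension of 64
        "mlp_dim": vision_hidden_size * 4,      # 4x expansion in the MLP block
    }


base = derived_vision_hparams(768)
large = derived_vision_hparams(1024)
print(base, large)
```

Since the head count comes from integer division, a hidden size that is not a multiple of 64 would not split evenly into 64-dimensional heads, which is worth checking before constructing a custom configuration.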

Source code in src/jimm/models/siglip/siglip_model.py

def __init__(
    self,
    image_resolution: int,
    vision_layers: int,
    vision_hidden_size: int,
    vision_patch_size: int,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
):
    """Initialize the SigLIP Vision Encoder.

    Args:
        image_resolution (int): The resolution of the input images.
        vision_layers (int): The number of layers in the vision transformer.
        vision_hidden_size (int): The hidden dimension size of the vision transformer.
        vision_patch_size (int): The patch size of the vision transformer.
        use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.
        rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec, optional): Sharding specification for parameters. Defaults to SigLIPSharding.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    self.vision_layers = vision_layers
    self.vision_hidden_size = vision_hidden_size
    self.vision_patch_size = vision_patch_size
    self.dtype = dtype

    vision_heads = vision_hidden_size // 64

    self.encoder = VisionTransformerBase(
        img_size=image_resolution,
        patch_size=vision_patch_size,
        in_channels=3,
        hidden_size=vision_hidden_size,
        num_layers=vision_layers,
        num_heads=vision_heads,
        mlp_dim=vision_hidden_size * 4,
        use_pre_norm=False,
        use_patch_bias=True,
        use_quick_gelu=False,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        pooling_type="MAP",
        layernorm_epsilon=1e-6,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

`from_config(config, *, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Create model from HuggingFace-compatible config dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration with "vision_config" key.	required
`rngs`	`Rngs \| None`	Random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`SigLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`

Returns:

Type	Description
`SigLIPVisionModel`	SigLIPVisionModel with random weights.
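The mapping from config keys to constructor arguments can be reproduced without importing jimm. In this sketch `vision_kwargs_from_config` is a hypothetical helper that copies the four lookups performed by `from_config`, and the config values are example numbers rather than a real checkpoint's configuration.

```python
from typing import Any


def vision_kwargs_from_config(config: dict[str, Any]) -> dict[str, int]:
    """Extract the constructor arguments that from_config reads from the config."""
    vision_config = config["vision_config"]
    return {
        "image_resolution": vision_config["image_size"],
        "vision_layers": vision_config["num_hidden_layers"],
        "vision_hidden_size": vision_config["hidden_size"],
        "vision_patch_size": vision_config["patch_size"],
    }


config = {
    "vision_config": {
        "image_size": 256,
        "num_hidden_layers": 12,
        "hidden_size": 768,
        "patch_size": 16,
        "num_attention_heads": 12,  # present in the dict but not read by the mapping
    }
}
kwargs = vision_kwargs_from_config(config)
```

Only these four keys are consulted; any other entries in `vision_config`, such as a head count, are not passed to the constructor.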

Source code in src/jimm/models/siglip/siglip_model.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    *,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "SigLIPVisionModel":
    """Create model from HuggingFace-compatible config dict.

    Args:
        config: Configuration with "vision_config" key.
        rngs: Random number generator state. If None, initializes to nnx.Rngs(0).
        dtype: Data type for computations.
        param_dtype: Data type for parameters.
        sharding: Sharding specification for parameters.
        use_gradient_checkpointing: Enable gradient checkpointing.
        attention_fn: Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

    Returns:
        SigLIPVisionModel with random weights.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    vision_config = config["vision_config"]

    return cls(
        image_resolution=vision_config["image_size"],
        vision_layers=vision_config["num_hidden_layers"],
        vision_hidden_size=vision_config["hidden_size"],
        vision_patch_size=vision_config["patch_size"],
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

`from_pretrained(model_name_or_path, use_pytorch=False, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Load a pretrained vision encoder from a SigLIP checkpoint.

Parameters:

Name	Type	Description	Default
`model_name_or_path`	`str`	Path to local weights or HuggingFace model ID.	required
`use_pytorch`	`bool`	Whether to load from PyTorch weights. Defaults to False.	`False`
`rngs`	`Rngs \| None`	Random number generator keys. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to SigLIPSharding.	`SigLIPSharding`
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`

Returns:

Type	Description
`SigLIPVisionModel`	Pretrained SigLIP vision model.

Source code in src/jimm/models/siglip/siglip_model.py

@classmethod
def from_pretrained(
    cls,
    model_name_or_path: str,
    use_pytorch: bool = False,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "SigLIPVisionModel":
    """Load a pretrained vision encoder from a SigLIP checkpoint.

    Args:
        model_name_or_path (str): Path to local weights or HuggingFace model ID.
        use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
        rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec): Sharding specification for parameters. Defaults to SigLIPSharding.
        use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

    Returns:
        SigLIPVisionModel: Pretrained SigLIP vision model
    """
    from .params import load_vision_from_pretrained

    return load_vision_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

`save_pretrained(save_directory)`

Save model weights and config in HuggingFace format.

Parameters:

Name	Type	Description	Default
`save_directory`	`str`	Directory path where the model will be saved.	required

Source code in src/jimm/models/siglip/siglip_model.py

def save_pretrained(self, save_directory: str) -> None:
    """Save model weights and config in HuggingFace format.

    Args:
        save_directory (str): Directory path where the model will be saved.
    """
    from .params import save_vision_pretrained

    save_vision_pretrained(self, save_directory)

`jimm.models.siglip.SigLIPTextModel`

Bases: Module

Source code in src/jimm/models/siglip/siglip_model.py

class SigLIPTextModel(nnx.Module):
    def __init__(
        self,
        context_length: int,
        vocab_size: int,
        text_hidden_size: int,
        num_text_heads: int,
        num_text_layers: int,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
    ):
        """Initialize SigLIP text encoder.

        Args:
            context_length (int): Maximum sequence length.
            vocab_size (int): Size of vocabulary.
            text_hidden_size (int): Hidden dimension size of the text transformer.
            num_text_heads (int): Number of attention heads in the text transformer.
            num_text_layers (int): Number of transformer layers in the text transformer.
            use_gradient_checkpointing (bool): Enable gradient checkpointing.
            attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.
            rngs (rnglib.Rngs | None): RNG state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Computation dtype.
            param_dtype (DTypeLike): Parameter dtype.
            sharding (ShardingSpec): Sharding specification for parameters.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        self.context_length = context_length
        self.vocab_size = vocab_size
        self.text_hidden_size = text_hidden_size
        self.num_text_heads = num_text_heads
        self.num_text_layers = num_text_layers
        self.dtype = dtype

        self.token_embedding = nnx.Embed(
            num_embeddings=vocab_size,
            features=text_hidden_size,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            embedding_init=nnx.with_partitioning(
                nnx.initializers.xavier_uniform(),
                sharding.embed,
            ),
        )
        self.positional_embedding = nnx.Param(
            nnx.with_partitioning(
                nnx.initializers.truncated_normal(stddev=0.02),
                sharding.text_pos_embed,
            )(rngs.params(), (context_length, text_hidden_size))
        )

        self.transformer = Transformer(
            hidden_size=text_hidden_size,
            mlp_dim=text_hidden_size * 4,
            num_layers=num_text_layers,
            num_heads=num_text_heads,
            dropout_rate=0.0,
            layernorm_epsilon=1e-6,
            use_quick_gelu=False,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

        self.ln_final = nnx.LayerNorm(
            text_hidden_size,
            epsilon=1e-6,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            scale_init=nnx.with_partitioning(
                nnx.initializers.ones_init(),
                sharding.layernorm,
            ),
            bias_init=nnx.with_partitioning(
                nnx.initializers.zeros_init(),
                sharding.layernorm,
            ),
        )

        self.text_projection = nnx.Linear(
            text_hidden_size,
            text_hidden_size,
            use_bias=True,
            dtype=dtype,
            param_dtype=param_dtype,
            rngs=rngs,
            kernel_init=nnx.with_partitioning(
                nnx.initializers.xavier_uniform(),
                sharding.proj_kernel,
            ),
            bias_init=nnx.with_partitioning(
                nnx.initializers.zeros_init(),
                sharding.proj_bias,
            ),
        )

    def __call__(self, text: Int[Array, "batch context_length"], do_projection: bool = True) -> Float[Array, "batch text_hidden_size"]:
        """Encode text tokens into embeddings.

        Args:
            text (Int[Array, "batch context_length"]): Token sequences.
            do_projection (bool): Whether to apply the text projection layer. Defaults to True.

        Returns:
            Float[Array, "batch text_hidden_size"]: Text embeddings.
        """
        seq_len = text.shape[1]
        text_sharding = sharding_of(text)
        embed_sharding = named_sharding_like(text, P(*text_sharding.spec, None))
        x = self.token_embedding.embedding[...].at[text].get(out_sharding=embed_sharding)
        pos_embed = jnp.broadcast_to(self.positional_embedding[...][:seq_len], x.shape)
        x = x + reshard_like(pos_embed, x)
        x = self.transformer(x)
        x = self.ln_final(x)
        pooled_output = x[:, -1, :]
        if do_projection:
            kernel_spec = sharding_of(self.text_projection.kernel[...]).spec
            projection_sharding = named_sharding_like(pooled_output, P(sharding_of(pooled_output).spec[0], kernel_spec[1]))
            x = self.text_projection(pooled_output, out_sharding=projection_sharding)
        else:
            x = pooled_output
        return x

    @classmethod
    def from_pretrained(
        cls,
        model_name_or_path: str,
        use_pytorch: bool = False,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "SigLIPTextModel":
        """Load pretrained text encoder from SigLIP checkpoint.

        Args:
            model_name_or_path (str): Local path or HuggingFace model ID.
            use_pytorch (bool): Load from PyTorch weights.
            rngs (rnglib.Rngs | None): RNG state.
            dtype (DTypeLike): Computation dtype.
            param_dtype (DTypeLike): Parameter dtype.
            sharding (ShardingSpec): Sharding specification for parameters.
            use_gradient_checkpointing (bool): Enable gradient checkpointing.
            attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

        Returns:
            SigLIPTextModel: Pretrained text model.
        """
        from .params import load_text_from_pretrained

        return load_text_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

    @classmethod
    def from_config(
        cls,
        config: dict[str, Any],
        *,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "SigLIPTextModel":
        """Create model from HuggingFace-compatible config dict.

        Args:
            config: Configuration with "text_config" key.
            rngs: Random number generator state. If None, initializes to nnx.Rngs(0).
            dtype: Data type for computations.
            param_dtype: Data type for parameters.
            sharding: Sharding specification for parameters.
            use_gradient_checkpointing: Enable gradient checkpointing.
            attention_fn: Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

        Returns:
            SigLIPTextModel with random weights.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        text_config = config["text_config"]

        return cls(
            context_length=text_config["max_position_embeddings"],
            vocab_size=text_config["vocab_size"],
            text_hidden_size=text_config["hidden_size"],
            num_text_heads=text_config["num_attention_heads"],
            num_text_layers=text_config["num_hidden_layers"],
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def save_pretrained(self, save_directory: str) -> None:
        """Save model weights and config in HuggingFace format.

        Args:
            save_directory (str): Directory path where the model will be saved.
        """
        from .params import save_text_pretrained

        save_text_pretrained(self, save_directory)

`__call__(text, do_projection=True)`

Encode text tokens into embeddings.

Parameters:

Name	Type	Description	Default
`text`	`Int[Array, 'batch context_length']`	Token sequences.	required
`do_projection`	`bool`	Whether to apply the text projection layer. Defaults to True.	`True`

Returns:

Type	Description
`Float[Array, 'batch text_hidden_size']`	Text embeddings.
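The data flow of the text forward pass, stripped of sharding and of the transformer itself, can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the table sizes are made up, `embed_and_pool` is a hypothetical name, and the transformer stack, final layer norm and optional projection are replaced by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, context_length, hidden = 100, 64, 8  # illustrative sizes

token_embedding = rng.normal(size=(vocab_size, hidden))
positional_embedding = rng.normal(size=(context_length, hidden))


def embed_and_pool(text):
    """Embed token ids, add positions, and pool the last position."""
    seq_len = text.shape[1]
    x = token_embedding[text]               # lookup: (batch, seq_len, hidden)
    x = x + positional_embedding[:seq_len]  # broadcast over the batch axis
    # The transformer layers and the final layer norm would be applied here.
    return x[:, -1, :]                      # last-token pooling: (batch, hidden)


text = rng.integers(0, vocab_size, size=(2, context_length))
pooled = embed_and_pool(text)
```

Because the pooled vector is read from the last position of the sequence, the length of the input determines which position is pooled.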

Source code in src/jimm/models/siglip/siglip_model.py

def __call__(self, text: Int[Array, "batch context_length"], do_projection: bool = True) -> Float[Array, "batch text_hidden_size"]:
    """Encode text tokens into embeddings.

    Args:
        text (Int[Array, "batch context_length"]): Token sequences.
        do_projection (bool): Whether to apply the text projection layer. Defaults to True.

    Returns:
        Float[Array, "batch text_hidden_size"]: Text embeddings.
    """
    seq_len = text.shape[1]
    text_sharding = sharding_of(text)
    embed_sharding = named_sharding_like(text, P(*text_sharding.spec, None))
    x = self.token_embedding.embedding[...].at[text].get(out_sharding=embed_sharding)
    pos_embed = jnp.broadcast_to(self.positional_embedding[...][:seq_len], x.shape)
    x = x + reshard_like(pos_embed, x)
    x = self.transformer(x)
    x = self.ln_final(x)
    pooled_output = x[:, -1, :]
    if do_projection:
        kernel_spec = sharding_of(self.text_projection.kernel[...]).spec
        projection_sharding = named_sharding_like(pooled_output, P(sharding_of(pooled_output).spec[0], kernel_spec[1]))
        x = self.text_projection(pooled_output, out_sharding=projection_sharding)
    else:
        x = pooled_output
    return x

`__init__(context_length, vocab_size, text_hidden_size, num_text_heads, num_text_layers, use_gradient_checkpointing=False, attention_fn=None, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding)`

Initialize SigLIP text encoder.

Parameters:

Name	Type	Description	Default
`context_length`	`int`	Maximum sequence length.	required
`vocab_size`	`int`	Size of vocabulary.	required
`text_hidden_size`	`int`	Hidden dimension size of the text transformer.	required
`num_text_heads`	`int`	Number of attention heads in the text transformer.	required
`num_text_layers`	`int`	Number of transformer layers in the text transformer.	required
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`
`rngs`	`Rngs \| None`	RNG state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Computation dtype.	`float32`
`param_dtype`	`DTypeLike`	Parameter dtype.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`SigLIPSharding`

Source code in src/jimm/models/siglip/siglip_model.py

def __init__(
    self,
    context_length: int,
    vocab_size: int,
    text_hidden_size: int,
    num_text_heads: int,
    num_text_layers: int,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
):
    """Initialize SigLIP text encoder.

    Args:
        context_length (int): Maximum sequence length.
        vocab_size (int): Size of vocabulary.
        text_hidden_size (int): Hidden dimension size of the text transformer.
        num_text_heads (int): Number of attention heads in the text transformer.
        num_text_layers (int): Number of transformer layers in the text transformer.
        use_gradient_checkpointing (bool): Enable gradient checkpointing.
        attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.
        rngs (rnglib.Rngs | None): RNG state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Computation dtype.
        param_dtype (DTypeLike): Parameter dtype.
        sharding (ShardingSpec): Sharding specification for parameters.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    self.context_length = context_length
    self.vocab_size = vocab_size
    self.text_hidden_size = text_hidden_size
    self.num_text_heads = num_text_heads
    self.num_text_layers = num_text_layers
    self.dtype = dtype

    self.token_embedding = nnx.Embed(
        num_embeddings=vocab_size,
        features=text_hidden_size,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        embedding_init=nnx.with_partitioning(
            nnx.initializers.xavier_uniform(),
            sharding.embed,
        ),
    )
    self.positional_embedding = nnx.Param(
        nnx.with_partitioning(
            nnx.initializers.truncated_normal(stddev=0.02),
            sharding.text_pos_embed,
        )(rngs.params(), (context_length, text_hidden_size))
    )

    self.transformer = Transformer(
        hidden_size=text_hidden_size,
        mlp_dim=text_hidden_size * 4,
        num_layers=num_text_layers,
        num_heads=num_text_heads,
        dropout_rate=0.0,
        layernorm_epsilon=1e-6,
        use_quick_gelu=False,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

    self.ln_final = nnx.LayerNorm(
        text_hidden_size,
        epsilon=1e-6,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        scale_init=nnx.with_partitioning(
            nnx.initializers.ones_init(),
            sharding.layernorm,
        ),
        bias_init=nnx.with_partitioning(
            nnx.initializers.zeros_init(),
            sharding.layernorm,
        ),
    )

    self.text_projection = nnx.Linear(
        text_hidden_size,
        text_hidden_size,
        use_bias=True,
        dtype=dtype,
        param_dtype=param_dtype,
        rngs=rngs,
        kernel_init=nnx.with_partitioning(
            nnx.initializers.xavier_uniform(),
            sharding.proj_kernel,
        ),
        bias_init=nnx.with_partitioning(
            nnx.initializers.zeros_init(),
            sharding.proj_bias,
        ),
    )

`from_config(config, *, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Create model from HuggingFace-compatible config dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration with "text_config" key.	required
`rngs`	`Rngs \| None`	Random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`SigLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`

Returns:

Type	Description
`SigLIPTextModel`	SigLIPTextModel with random weights.

Source code in src/jimm/models/siglip/siglip_model.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    *,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "SigLIPTextModel":
    """Create model from HuggingFace-compatible config dict.

    Args:
        config: Configuration with "text_config" key.
        rngs: Random number generator state. If None, initializes to nnx.Rngs(0).
        dtype: Data type for computations.
        param_dtype: Data type for parameters.
        sharding: Sharding specification for parameters.
        use_gradient_checkpointing: Enable gradient checkpointing.
        attention_fn: Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

    Returns:
        SigLIPTextModel with random weights.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    text_config = config["text_config"]

    return cls(
        context_length=text_config["max_position_embeddings"],
        vocab_size=text_config["vocab_size"],
        text_hidden_size=text_config["hidden_size"],
        num_text_heads=text_config["num_attention_heads"],
        num_text_layers=text_config["num_hidden_layers"],
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
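
The key mapping performed by `from_config` can be written out as a standalone function. This is a sketch that mirrors the source above without importing jimm; `text_kwargs_from_config` is a hypothetical helper and the values in `example_config` are illustrative, not taken from a real checkpoint.

```python
from typing import Any


def text_kwargs_from_config(config: dict[str, Any]) -> dict[str, int]:
    """Translate HuggingFace-style text_config keys into constructor keyword names."""
    text_config = config["text_config"]
    return {
        "context_length": text_config["max_position_embeddings"],
        "vocab_size": text_config["vocab_size"],
        "text_hidden_size": text_config["hidden_size"],
        "num_text_heads": text_config["num_attention_heads"],
        "num_text_layers": text_config["num_hidden_layers"],
    }


# Illustrative values only.
example_config = {
    "text_config": {
        "max_position_embeddings": 64,
        "vocab_size": 32000,
        "hidden_size": 768,
        "num_attention_heads": 12,
        "num_hidden_layers": 12,
    }
}

text_kwargs = text_kwargs_from_config(example_config)
print(text_kwargs)
```

Only these five keys are read from `text_config`; any other entries in the dict are ignored by this method.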

`from_pretrained(model_name_or_path, use_pytorch=False, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Load pretrained text encoder from SigLIP checkpoint.

Parameters:

Name	Type	Description	Default
`model_name_or_path`	`str`	Local path or HuggingFace model ID.	required
`use_pytorch`	`bool`	Load from PyTorch weights.	`False`
`rngs`	`Rngs \| None`	RNG state.	`None`
`dtype`	`DTypeLike`	Computation dtype.	`float32`
`param_dtype`	`DTypeLike`	Parameter dtype.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`SigLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`

Returns:

Name	Type	Description
`SigLIPTextModel`	`SigLIPTextModel`	Pretrained text model.

Source code in src/jimm/models/siglip/siglip_model.py

@classmethod
def from_pretrained(
    cls,
    model_name_or_path: str,
    use_pytorch: bool = False,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "SigLIPTextModel":
    """Load pretrained text encoder from SigLIP checkpoint.

    Args:
        model_name_or_path (str): Local path or HuggingFace model ID.
        use_pytorch (bool): Load from PyTorch weights.
        rngs (rnglib.Rngs | None): RNG state.
        dtype (DTypeLike): Computation dtype.
        param_dtype (DTypeLike): Parameter dtype.
        sharding (ShardingSpec): Sharding specification for parameters.
        use_gradient_checkpointing (bool): Enable gradient checkpointing.
        attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

    Returns:
        SigLIPTextModel: Pretrained text model.
    """
    from .params import load_text_from_pretrained

    return load_text_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

`save_pretrained(save_directory)`

Save model weights and config in HuggingFace format.

Parameters:

Name	Type	Description	Default
`save_directory`	`str`	Directory path where the model will be saved.	required

Source code in src/jimm/models/siglip/siglip_model.py

def save_pretrained(self, save_directory: str) -> None:
    """Save model weights and config in HuggingFace format.

    Args:
        save_directory (str): Directory path where the model will be saved.
    """
    from .params import save_text_pretrained

    save_text_pretrained(self, save_directory)

`jimm.models.siglip.SigLIP`

Bases: Module

Source code in src/jimm/models/siglip/siglip_model.py

class SigLIP(nnx.Module):
    def __init__(
        self,
        image_resolution: int,
        vision_layers: int,
        vision_hidden_size: int,
        vision_patch_size: int,
        context_length: int,
        vocab_size: int,
        text_hidden_size: int,
        num_text_heads: int,
        num_text_layers: int,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        vision_attention_fn: Callable[..., Any] | None = None,
        text_attention_fn: Callable[..., Any] | None = None,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
    ):
        """Initialize the SigLIP model.

        Args:
            image_resolution (int): The resolution of the input images.
            vision_layers (int): The number of layers in the vision transformer.
            vision_hidden_size (int): The hidden dimension size of the vision transformer.
            vision_patch_size (int): The patch size of the vision transformer.
            context_length (int): The maximum sequence length for text.
            vocab_size (int): The size of the vocabulary.
            text_hidden_size (int): The hidden dimension size of the text transformer.
            num_text_heads (int): The number of attention heads in the text transformer.
            num_text_layers (int): The number of transformer layers in the text transformer.
            use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None, optional): Custom attention function applied to both encoders. Defaults to None.
            vision_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for vision encoder only. Defaults to None.
            text_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for text encoder only. Defaults to None.
            rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec, optional): Sharding specification for parameters. Defaults to SigLIPSharding.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        self.vision_layers = vision_layers
        self.vision_hidden_size = vision_hidden_size
        self.vision_patch_size = vision_patch_size
        self.context_length = context_length
        self.vocab_size = vocab_size
        self.text_hidden_size = text_hidden_size
        self.num_text_heads = num_text_heads
        self.num_text_layers = num_text_layers
        self.dtype = dtype
        self._original_config = None

        self.vision_heads = vision_hidden_size // 64
        self.vision_model = SigLIPVisionModel(
            image_resolution=image_resolution,
            vision_layers=vision_layers,
            vision_hidden_size=vision_hidden_size,
            vision_patch_size=vision_patch_size,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=vision_attention_fn or attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

        self.text_model = SigLIPTextModel(
            context_length=context_length,
            vocab_size=vocab_size,
            text_hidden_size=text_hidden_size,
            num_text_heads=num_text_heads,
            num_text_layers=num_text_layers,
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=text_attention_fn or attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

        self.logit_scale = nnx.Param(nnx.with_partitioning(nnx.initializers.ones_init(), ())(rngs.params(), ()))
        self.logit_bias = nnx.Param(nnx.with_partitioning(nnx.initializers.ones_init(), ())(rngs.params(), ()))

    def encode_image(self, image: Float[Array, "batch height width channels"]) -> Float[Array, "batch vision_hidden_size"]:
        """Encode images into embeddings.

        Args:
            image (Float[Array, "batch height width channels"]): Batch of input images.

        Returns:
            Float[Array, "batch vision_hidden_size"]: Image embeddings.
        """
        return self.vision_model(image)

    def encode_text(self, text: Int[Array, "batch context_length"]) -> Float[Array, "batch text_hidden_size"]:
        """Encode text tokens into embeddings.

        Args:
            text (Int[Array, "batch context_length"]): Batch of token sequences.

        Returns:
            Float[Array, "batch text_hidden_size"]: Text embeddings.
        """
        return self.text_model(text)

    def __call__(self, image: Float[Array, "batch height width channels"], text: Int[Array, "batch context_length"]) -> Float[Array, "batch batch"]:
        """Calculate similarity between image and text embeddings.

        Args:
            image (Float[Array, "batch height width channels"]): Batch of input images.
            text (Int[Array, "batch context_length"]): Batch of token sequences.

        Returns:
            Float[Array, "batch batch"]: Similarity scores between all pairs of images and texts.
        """
        image_features: Float[Array, "batch vision_hidden_size"] = self.encode_image(image)
        text_features: Float[Array, "batch text_hidden_size"] = self.encode_text(text)

        image_features: Float[Array, "batch vision_hidden_size"] = image_features / jnp.linalg.norm(image_features, axis=-1, keepdims=True)
        text_features: Float[Array, "batch text_hidden_size"] = text_features / jnp.linalg.norm(text_features, axis=-1, keepdims=True)

        logit_scale: Float[Array, ""] = jnp.exp(self.logit_scale[...])
        image_spec = sharding_of(image_features).spec
        logits_sharding = named_sharding_like(image_features, P(image_spec[0], None))
        logits: Float[Array, "batch batch"] = logit_scale * jnp.matmul(image_features, text_features.T, out_sharding=logits_sharding) + self.logit_bias[...]
        return logits

    @classmethod
    def from_pretrained(
        cls,
        model_name_or_path: str,
        use_pytorch: bool = False,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
    ) -> "SigLIP":
        """Load a pretrained SigLIP model from a local path or HuggingFace Hub.

        Args:
            model_name_or_path (str): Path to local weights or HuggingFace model ID.
            use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
            rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
            dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
            param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
            sharding (ShardingSpec): Sharding specification for parameters. Defaults to SigLIPSharding.
            use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
            attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

        Returns:
            SigLIP: Pretrained SigLIP model
        """
        from .params import load_from_pretrained

        return load_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)

    @classmethod
    def from_config(
        cls,
        config: dict[str, Any],
        *,
        rngs: rnglib.Rngs | None = None,
        dtype: DTypeLike = jnp.float32,
        param_dtype: DTypeLike = jnp.float32,
        sharding: ShardingSpec = SigLIPSharding,
        use_gradient_checkpointing: bool = False,
        attention_fn: Callable[..., Any] | None = None,
        vision_attention_fn: Callable[..., Any] | None = None,
        text_attention_fn: Callable[..., Any] | None = None,
    ) -> "SigLIP":
        """Create model from HuggingFace-compatible config dict.

        Args:
            config: Configuration with "text_config" and "vision_config" keys.
            rngs: Random number generator state. If None, initializes to nnx.Rngs(0).
            dtype: Data type for computations.
            param_dtype: Data type for parameters.
            sharding: Sharding specification for parameters.
            use_gradient_checkpointing: Enable gradient checkpointing.
            attention_fn: Custom attention function applied to both encoders. Defaults to None.
            vision_attention_fn: Override attention_fn for vision encoder only. Defaults to None.
            text_attention_fn: Override attention_fn for text encoder only. Defaults to None.

        Returns:
            SigLIP model with random weights.
        """
        if rngs is None:
            rngs = nnx.Rngs(0)
        text_config = config["text_config"]
        vision_config = config["vision_config"]

        return cls(
            image_resolution=vision_config["image_size"],
            vision_layers=vision_config["num_hidden_layers"],
            vision_hidden_size=vision_config["hidden_size"],
            vision_patch_size=vision_config["patch_size"],
            context_length=text_config["max_position_embeddings"],
            vocab_size=text_config["vocab_size"],
            text_hidden_size=text_config["hidden_size"],
            num_text_heads=text_config["num_attention_heads"],
            num_text_layers=text_config["num_hidden_layers"],
            use_gradient_checkpointing=use_gradient_checkpointing,
            attention_fn=attention_fn,
            vision_attention_fn=vision_attention_fn,
            text_attention_fn=text_attention_fn,
            rngs=rngs,
            dtype=dtype,
            param_dtype=param_dtype,
            sharding=sharding,
        )

    def save_pretrained(self, save_directory: str) -> None:
        """Save the model weights and config in HuggingFace format.

        Args:
            save_directory (str): Directory path where the model will be saved.
        """
        from .params import save_pretrained

        save_pretrained(self, save_directory)

`__call__(image, text)`

Calculate similarity between image and text embeddings.

Parameters:

Name	Type	Description	Default
`image`	`Float[Array, 'batch height width channels']`	Batch of input images.	required
`text`	`Int[Array, 'batch context_length']`	Batch of token sequences.	required

Returns:

Type	Description
`Float[Array, 'batch batch']`	Similarity scores between all pairs of images and texts.

Source code in src/jimm/models/siglip/siglip_model.py

def __call__(self, image: Float[Array, "batch height width channels"], text: Int[Array, "batch context_length"]) -> Float[Array, "batch batch"]:
    """Calculate similarity between image and text embeddings.

    Args:
        image (Float[Array, "batch height width channels"]): Batch of input images.
        text (Int[Array, "batch context_length"]): Batch of token sequences.

    Returns:
        Float[Array, "batch batch"]: Similarity scores between all pairs of images and texts.
    """
    image_features: Float[Array, "batch vision_hidden_size"] = self.encode_image(image)
    text_features: Float[Array, "batch text_hidden_size"] = self.encode_text(text)

    image_features: Float[Array, "batch vision_hidden_size"] = image_features / jnp.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features: Float[Array, "batch text_hidden_size"] = text_features / jnp.linalg.norm(text_features, axis=-1, keepdims=True)

    logit_scale: Float[Array, ""] = jnp.exp(self.logit_scale[...])
    image_spec = sharding_of(image_features).spec
    logits_sharding = named_sharding_like(image_features, P(image_spec[0], None))
    logits: Float[Array, "batch batch"] = logit_scale * jnp.matmul(image_features, text_features.T, out_sharding=logits_sharding) + self.logit_bias[...]
    return logits
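
The logit computation above reduces to scaled cosine similarity plus a bias. The NumPy sketch below reproduces that arithmetic on a tiny hand-checkable input; `similarity_logits` is a hypothetical name, sharding is omitted, and the final sigmoid is not part of `__call__`. It is added here only to show how a logit can be read as a per-pair match probability.

```python
import numpy as np


def similarity_logits(image_features, text_features, logit_scale, logit_bias):
    """Scaled cosine-similarity logits between every image and every text."""
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    # logit_scale is stored in log space, so it is exponentiated before use.
    return np.exp(logit_scale) * (image_features @ text_features.T) + logit_bias


image_features = np.array([[3.0, 4.0], [1.0, 0.0]])
text_features = np.array([[6.0, 8.0], [0.0, 2.0]])

# With logit_scale=0 (scale of 1) and logit_bias=0, the logits are plain cosine similarities.
logits = similarity_logits(image_features, text_features, logit_scale=0.0, logit_bias=0.0)
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, applied independently to each pair
print(logits)
```

Row `i`, column `j` of `logits` scores image `i` against text `j`, so the matrix is square only when the two batches have the same size.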

`__init__(image_resolution, vision_layers, vision_hidden_size, vision_patch_size, context_length, vocab_size, text_hidden_size, num_text_heads, num_text_layers, use_gradient_checkpointing=False, attention_fn=None, vision_attention_fn=None, text_attention_fn=None, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding)`

Initialize the SigLIP model.

Parameters:

Name	Type	Description	Default
`image_resolution`	`int`	The resolution of the input images.	required
`vision_layers`	`int`	The number of layers in the vision transformer.	required
`vision_hidden_size`	`int`	The hidden dimension size of the vision transformer.	required
`vision_patch_size`	`int`	The patch size of the vision transformer.	required
`context_length`	`int`	The maximum sequence length for text.	required
`vocab_size`	`int`	The size of the vocabulary.	required
`text_hidden_size`	`int`	The hidden dimension size of the text transformer.	required
`num_text_heads`	`int`	The number of attention heads in the text transformer.	required
`num_text_layers`	`int`	The number of transformer layers in the text transformer.	required
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function applied to both encoders. Defaults to None.	`None`
`vision_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for vision encoder only. Defaults to None.	`None`
`text_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for text encoder only. Defaults to None.	`None`
`rngs`	`Rngs \| None`	The random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	The data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	The data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to SigLIPSharding.	`SigLIPSharding`

Source code in src/jimm/models/siglip/siglip_model.py

def __init__(
    self,
    image_resolution: int,
    vision_layers: int,
    vision_hidden_size: int,
    vision_patch_size: int,
    context_length: int,
    vocab_size: int,
    text_hidden_size: int,
    num_text_heads: int,
    num_text_layers: int,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    vision_attention_fn: Callable[..., Any] | None = None,
    text_attention_fn: Callable[..., Any] | None = None,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
):
    """Initialize the SigLIP model.

    Args:
        image_resolution (int): The resolution of the input images.
        vision_layers (int): The number of layers in the vision transformer.
        vision_hidden_size (int): The hidden dimension size of the vision transformer.
        vision_patch_size (int): The patch size of the vision transformer.
        context_length (int): The maximum sequence length for text.
        vocab_size (int): The size of the vocabulary.
        text_hidden_size (int): The hidden dimension size of the text transformer.
        num_text_heads (int): The number of attention heads in the text transformer.
        num_text_layers (int): The number of transformer layers in the text transformer.
        use_gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None, optional): Custom attention function applied to both encoders. Defaults to None.
        vision_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for vision encoder only. Defaults to None.
        text_attention_fn (Callable[..., Any] | None, optional): Override attention_fn for text encoder only. Defaults to None.
        rngs (rnglib.Rngs | None, optional): The random number generator state. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike, optional): The data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike, optional): The data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec, optional): Sharding specification for parameters. Defaults to SigLIPSharding.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    self.vision_layers = vision_layers
    self.vision_hidden_size = vision_hidden_size
    self.vision_patch_size = vision_patch_size
    self.context_length = context_length
    self.vocab_size = vocab_size
    self.text_hidden_size = text_hidden_size
    self.num_text_heads = num_text_heads
    self.num_text_layers = num_text_layers
    self.dtype = dtype
    self._original_config = None

    self.vision_heads = vision_hidden_size // 64
    self.vision_model = SigLIPVisionModel(
        image_resolution=image_resolution,
        vision_layers=vision_layers,
        vision_hidden_size=vision_hidden_size,
        vision_patch_size=vision_patch_size,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=vision_attention_fn or attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

    self.text_model = SigLIPTextModel(
        context_length=context_length,
        vocab_size=vocab_size,
        text_hidden_size=text_hidden_size,
        num_text_heads=num_text_heads,
        num_text_layers=num_text_layers,
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=text_attention_fn or attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )

    self.logit_scale = nnx.Param(nnx.with_partitioning(nnx.initializers.ones_init(), ())(rngs.params(), ()))
    self.logit_bias = nnx.Param(nnx.with_partitioning(nnx.initializers.ones_init(), ())(rngs.params(), ()))
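
The constructor resolves the attention function for each encoder with an `override or shared` rule and derives the vision head count from the hidden size. The sketch below isolates those two rules in plain Python; `resolve_attention_fns` and the two placeholder attention functions are hypothetical, and 768 is only an illustrative hidden size.

```python
from typing import Any, Callable, Optional

AttentionFn = Callable[..., Any]


def resolve_attention_fns(
    attention_fn: Optional[AttentionFn] = None,
    vision_attention_fn: Optional[AttentionFn] = None,
    text_attention_fn: Optional[AttentionFn] = None,
) -> tuple[Optional[AttentionFn], Optional[AttentionFn]]:
    """Return (vision_fn, text_fn); an encoder-specific function wins over the shared one."""
    return (vision_attention_fn or attention_fn, text_attention_fn or attention_fn)


def shared_attention(*args: Any, **kwargs: Any) -> None:
    """Placeholder standing in for a shared attention implementation."""


def text_only_attention(*args: Any, **kwargs: Any) -> None:
    """Placeholder standing in for a text-specific attention implementation."""


vision_fn, text_fn = resolve_attention_fns(
    attention_fn=shared_attention,
    text_attention_fn=text_only_attention,
)

# The vision head count is derived as hidden_size // 64, i.e. a fixed head dimension of 64.
vision_heads = 768 // 64
print(vision_fn.__name__, text_fn.__name__, vision_heads)
```

When all three arguments are left as None, both encoders receive None and fall back to whatever default attention the underlying transformer uses.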

`encode_image(image)`

Encode images into embeddings.

Parameters:

Name	Type	Description	Default
`image`	`Float[Array, 'batch height width channels']`	Batch of input images.	required

Returns:

Type	Description
`Float[Array, 'batch vision_hidden_size']`	Image embeddings.

Source code in src/jimm/models/siglip/siglip_model.py

def encode_image(self, image: Float[Array, "batch height width channels"]) -> Float[Array, "batch vision_hidden_size"]:
    """Encode images into embeddings.

    Args:
        image (Float[Array, "batch height width channels"]): Batch of input images.

    Returns:
        Float[Array, "batch vision_hidden_size"]: Image embeddings.
    """
    return self.vision_model(image)

`encode_text(text)`

Encode text tokens into embeddings.

Parameters:

Name	Type	Description	Default
`text`	`Int[Array, 'batch context_length']`	Batch of token sequences.	required

Returns:

Type	Description
`Float[Array, 'batch text_hidden_size']`	Text embeddings.

Source code in src/jimm/models/siglip/siglip_model.py

def encode_text(self, text: Int[Array, "batch context_length"]) -> Float[Array, "batch text_hidden_size"]:
    """Encode text tokens into embeddings.

    Args:
        text (Int[Array, "batch context_length"]): Batch of token sequences.

    Returns:
        Float[Array, "batch text_hidden_size"]: Text embeddings.
    """
    return self.text_model(text)

`from_config(config, *, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding, use_gradient_checkpointing=False, attention_fn=None, vision_attention_fn=None, text_attention_fn=None)` `classmethod`

Create model from HuggingFace-compatible config dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration with "text_config" and "vision_config" keys.	required
`rngs`	`Rngs \| None`	Random number generator state. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters.	`SigLIPSharding`
`use_gradient_checkpointing`	`bool`	Enable gradient checkpointing.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function applied to both encoders. Defaults to None.	`None`
`vision_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for vision encoder only. Defaults to None.	`None`
`text_attention_fn`	`Callable[..., Any] \| None`	Override attention_fn for text encoder only. Defaults to None.	`None`

Returns:

Type	Description
`SigLIP`	SigLIP model with random weights.

Source code in src/jimm/models/siglip/siglip_model.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    *,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
    vision_attention_fn: Callable[..., Any] | None = None,
    text_attention_fn: Callable[..., Any] | None = None,
) -> "SigLIP":
    """Create model from HuggingFace-compatible config dict.

    Args:
        config: Configuration with "text_config" and "vision_config" keys.
        rngs: Random number generator state. If None, initializes to nnx.Rngs(0).
        dtype: Data type for computations.
        param_dtype: Data type for parameters.
        sharding: Sharding specification for parameters.
        use_gradient_checkpointing: Enable gradient checkpointing.
        attention_fn: Custom attention function applied to both encoders. Defaults to None.
        vision_attention_fn: Override attention_fn for vision encoder only. Defaults to None.
        text_attention_fn: Override attention_fn for text encoder only. Defaults to None.

    Returns:
        SigLIP model with random weights.
    """
    if rngs is None:
        rngs = nnx.Rngs(0)
    text_config = config["text_config"]
    vision_config = config["vision_config"]

    return cls(
        image_resolution=vision_config["image_size"],
        vision_layers=vision_config["num_hidden_layers"],
        vision_hidden_size=vision_config["hidden_size"],
        vision_patch_size=vision_config["patch_size"],
        context_length=text_config["max_position_embeddings"],
        vocab_size=text_config["vocab_size"],
        text_hidden_size=text_config["hidden_size"],
        num_text_heads=text_config["num_attention_heads"],
        num_text_layers=text_config["num_hidden_layers"],
        use_gradient_checkpointing=use_gradient_checkpointing,
        attention_fn=attention_fn,
        vision_attention_fn=vision_attention_fn,
        text_attention_fn=text_attention_fn,
        rngs=rngs,
        dtype=dtype,
        param_dtype=param_dtype,
        sharding=sharding,
    )
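The listing above reads a fixed set of keys from the two sub-configs, and a missing key surfaces as a bare `KeyError`. As a minimal sketch, the check below names the missing keys before the model is built. The helper names `missing_keys` and `build_random_siglip` are hypothetical and not part of jimm, and the config values are illustrative rather than copied from a released checkpoint.

```python
from typing import Any

# Keys that from_config reads, taken from the source listing above.
REQUIRED_KEYS = {
    "vision_config": ("image_size", "num_hidden_layers", "hidden_size", "patch_size"),
    "text_config": (
        "max_position_embeddings",
        "vocab_size",
        "hidden_size",
        "num_attention_heads",
        "num_hidden_layers",
    ),
}


def missing_keys(config: dict[str, Any]) -> list[str]:
    """Hypothetical helper: dotted paths of keys that from_config would fail on."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in config:
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in config[section])
    return missing


# Illustrative values only (assumption), shaped like a HuggingFace config dict.
config = {
    "vision_config": {
        "image_size": 224,
        "num_hidden_layers": 12,
        "hidden_size": 768,
        "patch_size": 16,
    },
    "text_config": {
        "max_position_embeddings": 64,
        "vocab_size": 32000,
        "hidden_size": 768,
        "num_attention_heads": 12,
        "num_hidden_layers": 12,
    },
}


def build_random_siglip(config: dict[str, Any]):
    """Build a randomly initialised SigLIP; requires jimm to be installed."""
    problems = missing_keys(config)
    if problems:
        raise KeyError(f"config is missing: {problems}")

    import jimm  # imported lazily so the key check runs without jimm

    return jimm.SigLIP.from_config(config)
```

Judging by the listing, `from_config` reads `num_attention_heads` only from `text_config`, so the sketch does not require a vision head count.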

`from_pretrained(model_name_or_path, use_pytorch=False, rngs=None, dtype=jnp.float32, param_dtype=jnp.float32, sharding=SigLIPSharding, use_gradient_checkpointing=False, attention_fn=None)` `classmethod`

Load a pretrained SigLIP model from a local path or HuggingFace Hub.

Parameters:

Name	Type	Description	Default
`model_name_or_path`	`str`	Path to local weights or HuggingFace model ID.	required
`use_pytorch`	`bool`	Whether to load from PyTorch weights. Defaults to False.	`False`
`rngs`	`Rngs \| None`	Random number generator keys. If None, initializes to nnx.Rngs(0).	`None`
`dtype`	`DTypeLike`	Data type for computations. Defaults to jnp.float32.	`float32`
`param_dtype`	`DTypeLike`	Data type for parameters. Defaults to jnp.float32.	`float32`
`sharding`	`ShardingSpec`	Sharding specification for parameters. Defaults to SigLIPSharding.	`SigLIPSharding`
`use_gradient_checkpointing`	`bool`	Whether to use gradient checkpointing. Defaults to False.	`False`
`attention_fn`	`Callable[..., Any] \| None`	Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.	`None`

Returns:

Name	Type	Description
`SigLIP`	`SigLIP`	Pretrained SigLIP model

Source code in src/jimm/models/siglip/siglip_model.py

@classmethod
def from_pretrained(
    cls,
    model_name_or_path: str,
    use_pytorch: bool = False,
    rngs: rnglib.Rngs | None = None,
    dtype: DTypeLike = jnp.float32,
    param_dtype: DTypeLike = jnp.float32,
    sharding: ShardingSpec = SigLIPSharding,
    use_gradient_checkpointing: bool = False,
    attention_fn: Callable[..., Any] | None = None,
) -> "SigLIP":
    """Load a pretrained SigLIP model from a local path or HuggingFace Hub.

    Args:
        model_name_or_path (str): Path to local weights or HuggingFace model ID.
        use_pytorch (bool): Whether to load from PyTorch weights. Defaults to False.
        rngs (rnglib.Rngs | None): Random number generator keys. If None, initializes to nnx.Rngs(0).
        dtype (DTypeLike): Data type for computations. Defaults to jnp.float32.
        param_dtype (DTypeLike): Data type for parameters. Defaults to jnp.float32.
        sharding (ShardingSpec): Sharding specification for parameters. Defaults to SigLIPSharding.
        use_gradient_checkpointing (bool): Whether to use gradient checkpointing. Defaults to False.
        attention_fn (Callable[..., Any] | None): Custom attention function (e.g. jimm.tokamax_attention). Defaults to None.

    Returns:
        SigLIP: Pretrained SigLIP model
    """
    from .params import load_from_pretrained

    return load_from_pretrained(cls, model_name_or_path, use_pytorch, rngs, dtype, param_dtype, sharding, use_gradient_checkpointing, attention_fn)
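Per the signatures on this page, `from_pretrained` accepts a single `attention_fn`, while the per-encoder `vision_attention_fn` and `text_attention_fn` overrides appear only on `from_config`. The sketch below is one way to catch that mismatch early. `load_siglip` is a hypothetical wrapper, not a jimm function, and it assumes the keyword names listed in the parameter table above.

```python
from typing import Any

# Keyword arguments documented for SigLIP.from_pretrained on this page.
ALLOWED_KWARGS = {
    "use_pytorch",
    "rngs",
    "dtype",
    "param_dtype",
    "sharding",
    "use_gradient_checkpointing",
    "attention_fn",
}


def load_siglip(model_name_or_path: str, **overrides: Any):
    """Hypothetical wrapper: forward only documented keyword arguments."""
    unknown = sorted(set(overrides) - ALLOWED_KWARGS)
    if unknown:
        # e.g. vision_attention_fn / text_attention_fn, which from_config accepts
        raise TypeError(f"from_pretrained does not accept: {unknown}")

    import jimm  # imported lazily so the argument check runs without jimm

    return jimm.SigLIP.from_pretrained(model_name_or_path, **overrides)
```

A call such as `load_siglip("google/siglip-base-patch16-256", use_gradient_checkpointing=True)` would then defer to the library defaults for every argument not passed.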

`save_pretrained(save_directory)`

Save the model weights and config in HuggingFace format.

Parameters:

Name	Type	Description	Default
`save_directory`	`str`	Directory path where the model will be saved.	required

Source code in src/jimm/models/siglip/siglip_model.py

def save_pretrained(self, save_directory: str):
    """Save the model weights and config in HuggingFace format.

    Args:
        save_directory (str): Directory path where the model will be saved.
    """
    from .params import save_pretrained

    save_pretrained(self, save_directory)
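Because `from_pretrained` accepts a local path as well as a Hub ID, a save followed by a reload is a natural round trip. The sketch below assumes that a directory written by `save_pretrained` can be read back by `from_pretrained`, which this page does not state explicitly. `save_and_reload` is a hypothetical helper, not part of jimm.

```python
from pathlib import Path


def save_and_reload(model, save_directory: str):
    """Save `model`, then load a fresh copy from the same directory.

    Works with any object exposing save_pretrained(dir) and a
    from_pretrained(dir) classmethod, such as the models on this page.
    """
    # Creating the directory up front is harmless if save_pretrained also does it.
    Path(save_directory).mkdir(parents=True, exist_ok=True)
    model.save_pretrained(save_directory)
    return type(model).from_pretrained(save_directory)
```

Reloading through `type(model)` keeps the helper usable for `SigLIP`, `SigLIPVisionModel`, and `SigLIPTextModel` alike, since all three list the same pair of methods.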
