---
language:
  - en
license: apache-2.0
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:3284
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-m-v1.5
widget:
  - source_sentence: >-
      Does ZenML officially support Macs running on Apple Silicon, and are there
      any specific configurations needed?
    sentences:
      - >-
        ding ZenML to learn more!


        Do you support Windows?ZenML officially supports Windows if you're using
        WSL. Much of ZenML will also work on Windows outside a WSL environment,
        but we don't officially support it and some features don't work (notably
        anything that requires spinning up a server process).


        Do you support Macs running on Apple Silicon?


        Yes, ZenML does support Macs running on Apple Silicon. You just need to
        make sure that you set the following environment variable:


        export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES


        This is a known issue with how forking works on Macs running on Apple
        Silicon and it will enable you to use ZenML and the server. This
        environment variable is needed if you are working with a local server on
        your Mac, but if you're just using ZenML as a client / CLI and
        connecting to a deployed server then you don't need to set it.


        How can I make ZenML work with my custom tool? How can I extend or build
        on ZenML?


        This depends on the tool and its respective MLOps category. We have a
        full guide on this over here!


        How can I contribute?


        We develop ZenML together with our community! To get involved, the best
        way to get started is to select any issue from the good-first-issue
        label. If you would like to contribute, please review our Contributing
        Guide for all relevant details.


        How can I speak with the community?


        The first port of call should be our Slack group. Ask your
        questions about bugs or specific use cases and someone from the core
        team will respond.


        Which license does ZenML use?


        ZenML is distributed under the terms of the Apache License Version 2.0.
        A complete version of the license is available in the LICENSE.md in this
        repository. Any contribution made to this project will be licensed under
        the Apache License Version 2.0.


        PreviousCommunity & content


        Last updated 3 months ago
      - |-
        Registering a Model

        PreviousUse the Model Control PlaneNextDeleting a Model

        Last updated 4 months ago
      - >-
        Synthetic data generation


        Generate synthetic data with distilabel to finetune embeddings.


        PreviousImprove retrieval by finetuning embeddingsNextFinetuning
        embeddings with Sentence Transformers


        Last updated 21 days ago
  - source_sentence: >-
      How can I change the logging verbosity level in ZenML for both local and
      remote pipeline runs?
    sentences:
      - >-
        ncepts covered in this guide to your own projects.By the end of this
        guide, you'll have a solid understanding of how to leverage LLMs in your
        MLOps workflows using ZenML, enabling you to build powerful, scalable,
        and maintainable LLM-powered applications. First up, let's take a look
        at a super simple implementation of the RAG paradigm to get started.


        PreviousAn end-to-end projectNextRAG with ZenML


        Last updated 21 days ago
      - >-
        Configuring a pipeline at runtime


        Configuring a pipeline at runtime.


        PreviousUse pipeline/step parametersNextReference environment variables
        in configurations


        Last updated 28 days ago
      - >-
        Set logging verbosity


        How to set the logging verbosity in ZenML.


        By default, ZenML sets the logging verbosity to INFO. If you wish to
        change this, you can do so by setting the following environment
        variable:


        export ZENML_LOGGING_VERBOSITY=INFO


        Choose from INFO, WARN, ERROR, CRITICAL, DEBUG. This will set the logs
        to whichever level you specify.


        Note that setting this on the client environment (e.g. your local
        machine which runs the pipeline) will not automatically set the same
        logging verbosity for remote pipeline runs. That means setting this
        variable locally will only affect pipelines that run locally.


        If you wish to control the logging verbosity for remote pipeline runs,
        you can set the ZENML_LOGGING_VERBOSITY environment variable in your
        pipeline run's environment as follows:


        docker_settings = DockerSettings(environment={"ZENML_LOGGING_VERBOSITY":
        "DEBUG"})


        # Either add it to the decorator

        @pipeline(settings={"docker": docker_settings})

        def my_pipeline() -> None:
            my_step()

        # Or configure the pipeline's options

        my_pipeline = my_pipeline.with_options(
            settings={"docker": docker_settings}
        )


        PreviousEnable or disable logs storageNextDisable rich traceback output


        Last updated 21 days ago
  - source_sentence: >-
      How can I autogenerate a template yaml file for my specific pipeline using
      ZenML?
    sentences:
      - >-
        Autogenerate a template yaml file


        To help you figure out what you can put in your configuration file,
        simply autogenerate a template.


        If you want to generate a template yaml file of your specific pipeline,
        you can do so by using the .write_run_configuration_template() method.
        This will generate a yaml file with all options commented out. This way
        you can pick and choose the settings that are relevant to you.


        from zenml import pipeline

        ...


        @pipeline(enable_cache=True) # set cache behavior at step level

        def simple_ml_pipeline(parameter: int):
            dataset = load_data(parameter=parameter)
            train_model(dataset)

        simple_ml_pipeline.write_run_configuration_template(path="<Insert_path_here>")


        When you want to configure your pipeline with a certain stack in mind,
        you can do so as well:
        `...write_run_configuration_template(stack=<Insert_stack_here>)`


        PreviousFind out which configuration was used for a runNextCustomize
        Docker builds


        Last updated 21 days ago
      - |-
        Deleting a Model

        Learn how to delete models.

        PreviousRegistering a ModelNextAssociate a pipeline with a Model

        Last updated 4 months ago
      - >-
        Load artifacts into memory


        Often ZenML pipeline steps consume artifacts produced by one another
        directly in the pipeline code, but there are scenarios where you need to
        pull external data into your steps. Such external data could be
        artifacts produced by non-ZenML code. For those cases, it is advised to
        use ExternalArtifact, but what if we plan to exchange data created with
        other ZenML pipelines?


        ZenML pipelines are first compiled and only executed at some later
        point. During the compilation phase, all function calls are executed,
        and this data is fixed as step input parameters. Given all this, the
        late materialization of dynamic objects, like data artifacts, is
        crucial. Without late materialization, it would not be possible to pass
        not-yet-existing artifacts as step inputs, or their metadata, which is
        often the case in a multi-pipeline setting.


        We identify two major use cases for exchanging artifacts between
        pipelines:


        You semantically group your data products using ZenML Models


        You prefer to use ZenML Client to bring all the pieces together


        We recommend using models to group and access artifacts across
        pipelines. Find out how to load an artifact from a ZenML Model here.


        Use client methods to exchange artifacts


        If you don't yet use the Model Control Plane, you can still exchange
        data between pipelines with late materialization. Let's rework the
        do_predictions pipeline code as follows:


        from typing import Annotated

        from zenml import step, pipeline

        from zenml.client import Client

        import pandas as pd

        from sklearn.base import ClassifierMixin
  - source_sentence: >-
      How can I create a Kubernetes cluster on EKS and configure it to run Spark
      with a custom Docker image?
    sentences:
      - |-
        View logs on the dashboard

        PreviousControl loggingNextEnable or disable logs storage

        Last updated 21 days ago
      - >-
        Datasets in ZenML


        Model datasets using simple abstractions.


        As machine learning projects grow in complexity, you often need to work
        with various data sources and manage intricate data flows. This chapter
        explores how to use custom Dataset classes and Materializers in ZenML to
        handle these challenges efficiently. For strategies on scaling your data
        processing for larger datasets, refer to scaling strategies for big
        data.


        Introduction to Custom Dataset Classes


        Custom Dataset classes in ZenML provide a way to encapsulate data
        loading, processing, and saving logic for different data sources.
        They're particularly useful when:


        Working with multiple data sources (e.g., CSV files, databases, cloud
        storage)


        Dealing with complex data structures that require special handling


        Implementing custom data processing or transformation logic


        Implementing Dataset Classes for Different Data Sources


        Let's create a base Dataset class and implement it for CSV and BigQuery
        data sources:


        from abc import ABC, abstractmethod

        import pandas as pd

        from google.cloud import bigquery

        from typing import Optional


        class Dataset(ABC):
            @abstractmethod
            def read_data(self) -> pd.DataFrame:
                pass

        class CSVDataset(Dataset):
            def __init__(self, data_path: str, df: Optional[pd.DataFrame] = None):
                self.data_path = data_path
                self.df = df

            def read_data(self) -> pd.DataFrame:
                if self.df is None:
                    self.df = pd.read_csv(self.data_path)
                return self.df

        class BigQueryDataset(Dataset):
            def __init__(
                self,
                table_id: str,
                df: Optional[pd.DataFrame] = None,
                project: Optional[str] = None,
            ):
                self.table_id = table_id
                self.project = project
                self.df = df
                self.client = bigquery.Client(project=self.project)

            def read_data(self) -> pd.DataFrame:
                query = f"SELECT * FROM `{self.table_id}`"
                self.df = self.client.query(query).to_dataframe()
                return self.df
      - >-
        e the correct region is selected on the top right.Click on Add cluster
        and select Create.


        Enter a name and select the cluster role for Cluster service role.


        Keep the default values for the networking and logging steps and create
        the cluster.


        Note down the cluster name and the API server endpoint:


        EKS_CLUSTER_NAME=<EKS_CLUSTER_NAME>

        EKS_API_SERVER_ENDPOINT=<API_SERVER_ENDPOINT>


        After the cluster is created, select it and click on Add node group in
        the Compute tab.


        Enter a name and select the node role.


        For the instance type, we recommend t3a.xlarge, as it provides up to 4
        vCPUs and 16 GB of memory.


        Docker image for the Spark drivers and executors


        When you want to run your steps on a Kubernetes cluster, Spark will
        require you to choose a base image for the driver and executor pods.
        Normally, for this purpose, you can either use one of the base images in
        Spark’s dockerhub or create an image using the docker-image-tool which
        will use your own Spark installation and build an image.


        When using Spark in EKS, you need to use the latter and utilize the
        docker-image-tool. However, before the build process, you also need to
        download the following packages


        hadoop-aws = 3.3.1


        aws-java-sdk-bundle = 1.12.150


        and put them in the jars folder within your Spark installation. Once
        that is set up, you can build the image as follows:


        cd $SPARK_HOME # If this is empty for you, you need to set the
        SPARK_HOME variable to point to your Spark installation


        SPARK_IMAGE_TAG=<SPARK_IMAGE_TAG>


        ./bin/docker-image-tool.sh -t $SPARK_IMAGE_TAG -p
        kubernetes/dockerfiles/spark/bindings/python/Dockerfile -u 0 build


        BASE_IMAGE_NAME=spark-py:$SPARK_IMAGE_TAG


        If you are working on an M1 Mac, you will need to build the image for
        the amd64 architecture by adding the -X flag to the previous command.
        For example:


        ./bin/docker-image-tool.sh -X -t $SPARK_IMAGE_TAG -p
        kubernetes/dockerfiles/spark/bindings/python/Dockerfile -u 0 build


        Configuring RBAC
  - source_sentence: How can I configure a pipeline with a YAML file in ZenML?
    sentences:
      - |-
        atically retry steps

        Run pipelines asynchronouslyControl execution order of steps

        Using a custom step invocation ID

        Name your pipeline runs

        Use failure/success hooks

        Hyperparameter tuning

        Access secrets in a step

        Run an individual step

        Fetching pipelines

        Get past pipeline/step runs

        🚨Trigger a pipeline

        Use templates: Python SDK

        Use templates: Dashboard

        Use templates: Rest API

        📃Use configuration files

        How to configure a pipeline with a YAML

        What can be configured

        Runtime settings for Docker, resources, and stack components

        Configuration hierarchy

        Find out which configuration was used for a run

        Autogenerate a template yaml file

        🐳Customize Docker builds

        Docker settings on a pipeline

        Docker settings on a step

        Use a prebuilt image for pipeline execution

        Specify pip dependencies and apt packages

        Use your own Dockerfiles

        Which files are built into the image

        How to reuse builds

        Define where an image is built

        📔Run remote pipelines from notebooks

        Limitations of defining steps in notebook cells

        Run a single step from a notebook

        🤹Manage your ZenML server

        Best practices for upgrading ZenML

        Upgrade your ZenML server

        Using ZenML server in production

        Troubleshoot your ZenML server

        Migration guide

        Migration guide 0.13.2 → 0.20.0

        Migration guide 0.23.0 → 0.30.0

        Migration guide 0.39.1 → 0.41.0

        Migration guide 0.58.2 → 0.60.0

        📍Develop locally

        Use config files to develop locally

        Keep your pipelines and dashboard clean

        ⚒️Manage stacks & components

        Deploy a cloud stack with ZenML

        Deploy a cloud stack with Terraform

        Register a cloud stack

        Reference secrets in stack configuration

        Implement a custom stack component

        🚜Train with GPUs

        Distributed Training with 🤗 Accelerate

        🌲Control logging

        View logs on the dashboard

        Enable or disable logs storage

        Set logging verbosity

        Disable rich traceback output

        Disable colorful logging

        🗄️Handle Data/Artifacts

        How ZenML stores data

        Return multiple outputs from a step

        Delete an artifact

        Organize data with tags

        Get arbitrary artifacts in a step
      - >-
        Security best practices


        Best practices concerning the various authentication methods implemented
        by Service Connectors.


        Service Connector Types, especially those targeted at cloud providers,
        offer a plethora of authentication methods matching those supported by
        remote cloud platforms. While there is no single authentication standard
        that unifies this process, there are some patterns that are easily
        identifiable and can be used as guidelines when deciding which
        authentication method to use to configure a Service Connector.


        This section explores some of those patterns and gives some advice
        regarding which authentication methods are best suited for your needs.


        This section may require some general knowledge about authentication and
        authorization to be properly understood. We tried to keep it simple and
        limit ourselves to talking about high-level concepts, but some areas may
        get a bit too technical.


        Username and password


        The key takeaway is this: you should avoid using your primary account
        password as authentication credentials as much as possible. If there are
        alternative authentication methods that you can use or other types of
        credentials (e.g. session tokens, API keys, API tokens), you should
        always try to use those instead.


        Ultimately, if you have no choice, be cognizant of the third parties you
        share your passwords with. If possible, they should never leave the
        premises of your local host or development environment.


        This is the typical authentication method that uses a username or
        account name plus the associated password. While this is the de facto
        method used to log in with web consoles and local CLIs, this is the
        least secure of all authentication methods and never something you want
        to share with other members of your team or organization or use to
        authenticate automated workloads.
      - >-
        ━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━┛$ zenml orchestrator
        connect <ORCHESTRATOR_NAME> --connector aws-iam-multi-us

        Running with active stack: 'default' (repository)

        Successfully connected orchestrator `<ORCHESTRATOR_NAME>` to the
        following resources:

        ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┓

                     CONNECTOR ID              CONNECTOR NAME    CONNECTOR
        TYPE  RESOURCE TYPE          RESOURCE NAMES   

        ┠──────────────────────────────────────┼──────────────────┼────────────────┼───────────────────────┼──────────────────┨

         ed528d5a-d6cb-4fc4-bc52-c3d2d01643e5  aws-iam-multi-us  🔶
        aws          🌀 kubernetes-cluster  zenhacks-cluster 

        ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┛


        # Register and activate a stack with the new orchestrator

        $ zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set


        if you don't have a Service Connector on hand and you don't want to
        register one , the local Kubernetes kubectl client needs to be
        configured with a configuration context pointing to the remote cluster.
        The kubernetes_context stack component must also be configured with the
        value of that context:


        zenml orchestrator register <ORCHESTRATOR_NAME> \
            --flavor=kubernetes \
            --kubernetes_context=<KUBERNETES_CONTEXT>

        # Register and activate a stack with the new orchestrator

        zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set


        ZenML will build a Docker image called
        <CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME> which includes your code
        and use it to run your pipeline steps in Kubernetes. Check out this page
        if you want to learn more about how ZenML builds these images and how
        you can customize them.


        You can now run any ZenML pipeline using the Kubernetes orchestrator:


        python file_that_runs_a_zenml_pipeline.py
datasets: []
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: zenml/finetuned-snowflake-arctic-embed-m-v1.5
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 384
          type: dim_384
        metrics:
          - type: cosine_accuracy@1
            value: 0.1863013698630137
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.4794520547945205
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6602739726027397
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7972602739726027
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.1863013698630137
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.1598173515981735
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.13205479452054794
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07972602739726026
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.1863013698630137
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.4794520547945205
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6602739726027397
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7972602739726027
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.47459290361092754
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.3725994781474232
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.37953809566266083
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: cosine_accuracy@1
            value: 0.18356164383561643
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.4876712328767123
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6602739726027397
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7917808219178082
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.18356164383561643
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.16255707762557076
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1320547945205479
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07917808219178081
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.18356164383561643
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.4876712328767123
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6602739726027397
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7917808219178082
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.47334554819769054
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.3724179169384647
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.37931260226095775
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 128
          type: dim_128
        metrics:
          - type: cosine_accuracy@1
            value: 0.18356164383561643
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.4684931506849315
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6356164383561644
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7780821917808219
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.18356164383561643
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.1561643835616438
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.12712328767123285
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07780821917808219
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.18356164383561643
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.4684931506849315
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6356164383561644
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7780821917808219
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.46219638130094637
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.3628680147858229
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.37047490630037583
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 64
          type: dim_64
        metrics:
          - type: cosine_accuracy@1
            value: 0.2054794520547945
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.4767123287671233
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.6273972602739726
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.7534246575342466
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.2054794520547945
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.15890410958904108
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.12547945205479452
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.07534246575342465
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.2054794520547945
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.4767123287671233
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.6273972602739726
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.7534246575342466
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.46250756548591326
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.37069906501413347
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.37874559284369463
            name: Cosine Map@100
---

zenml/finetuned-snowflake-arctic-embed-m-v1.5

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-m-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("zenml/finetuned-snowflake-arctic-embed-m-v1.5")
# Run inference
sentences = [
    'How can I configure a pipeline with a YAML file in ZenML?',
    'atically retry steps\n\nRun pipelines asynchronouslyControl execution order of steps\n\nUsing a custom step invocation ID\n\nName your pipeline runs\n\nUse failure/success hooks\n\nHyperparameter tuning\n\nAccess secrets in a step\n\nRun an individual step\n\nFetching pipelines\n\nGet past pipeline/step runs\n\n🚨Trigger a pipeline\n\nUse templates: Python SDK\n\nUse templates: Dashboard\n\nUse templates: Rest API\n\n📃Use configuration files\n\nHow to configure a pipeline with a YAML\n\nWhat can be configured\n\nRuntime settings for Docker, resources, and stack components\n\nConfiguration hierarchy\n\nFind out which configuration was used for a run\n\nAutogenerate a template yaml file\n\n🐳Customize Docker builds\n\nDocker settings on a pipeline\n\nDocker settings on a step\n\nUse a prebuilt image for pipeline execution\n\nSpecify pip dependencies and apt packages\n\nUse your own Dockerfiles\n\nWhich files are built into the image\n\nHow to reuse builds\n\nDefine where an image is built\n\n📔Run remote pipelines from notebooks\n\nLimitations of defining steps in notebook cells\n\nRun a single step from a notebook\n\n🤹Manage your ZenML server\n\nBest practices for upgrading ZenML\n\nUpgrade your ZenML server\n\nUsing ZenML server in production\n\nTroubleshoot your ZenML server\n\nMigration guide\n\nMigration guide 0.13.2 → 0.20.0\n\nMigration guide 0.23.0 → 0.30.0\n\nMigration guide 0.39.1 → 0.41.0\n\nMigration guide 0.58.2 → 0.60.0\n\n📍Develop locally\n\nUse config files to develop locally\n\nKeep your pipelines and dashboard clean\n\n⚒️Manage stacks & components\n\nDeploy a cloud stack with ZenML\n\nDeploy a cloud stack with Terraform\n\nRegister a cloud stack\n\nReference secrets in stack configuration\n\nImplement a custom stack component\n\n🚜Train with GPUs\n\nDistributed Training with 🤗 Accelerate\n\n🌲Control logging\n\nView logs on the dashboard\n\nEnable or disable logs storage\n\nSet logging verbosity\n\nDisable rich traceback output\n\nDisable colorful logging\n\n🗄️Handle Data/Artifacts\n\nHow ZenML stores data\n\nReturn multiple outputs from a step\n\nDelete an artifact\n\nOrganize data with tags\n\nGet arbitrary artifacts in a step',
    "━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━┛$ zenml orchestrator connect <ORCHESTRATOR_NAME> --connector aws-iam-multi-us\nRunning with active stack: 'default' (repository)\nSuccessfully connected orchestrator `<ORCHESTRATOR_NAME>` to the following resources:\n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┓\n┃             CONNECTOR ID             │ CONNECTOR NAME   │ CONNECTOR TYPE │ RESOURCE TYPE         │ RESOURCE NAMES   ┃\n┠──────────────────────────────────────┼──────────────────┼────────────────┼───────────────────────┼──────────────────┨\n┃ ed528d5a-d6cb-4fc4-bc52-c3d2d01643e5 │ aws-iam-multi-us │ 🔶 aws         │ 🌀 kubernetes-cluster │ zenhacks-cluster ┃\n┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┛\n\n# Register and activate a stack with the new orchestrator\n$ zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set\n\nif you don't have a Service Connector on hand and you don't want to register one , the local Kubernetes kubectl client needs to be configured with a configuration context pointing to the remote cluster. The kubernetes_context stack component must also be configured with the value of that context:\n\nzenml orchestrator register <ORCHESTRATOR_NAME> \\\n    --flavor=kubernetes \\\n    --kubernetes_context=<KUBERNETES_CONTEXT>\n\n# Register and activate a stack with the new orchestrator\nzenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set\n\nZenML will build a Docker image called <CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME> which includes your code and use it to run your pipeline steps in Kubernetes. Check out this page if you want to learn more about how ZenML builds these images and how you can customize them.\n\nYou can now run any ZenML pipeline using the Kubernetes orchestrator:\n\npython file_that_runs_a_zenml_pipeline.py",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
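
This model was trained with MatryoshkaLoss (see Training Details below), so its embeddings can be truncated to 384, 256, 128, or 64 dimensions with only a modest drop in retrieval quality. A minimal sketch of truncated inference, assuming sentence-transformers >= 2.7 for the truncate_dim argument:

from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns 256-dimensional embeddings
model = SentenceTransformer(
    "zenml/finetuned-snowflake-arctic-embed-m-v1.5",
    truncate_dim=256,
)

embeddings = model.encode([
    "How can I configure a pipeline with a YAML file in ZenML?",
])
print(embeddings.shape)
# (1, 256)

The truncated vectors can still be compared with model.similarity(), which uses cosine similarity for this model.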

Evaluation

Metrics

Information Retrieval (dataset: dim_384)

| Metric | Value |
|:-------|------:|
| cosine_accuracy@1 | 0.1863 |
| cosine_accuracy@3 | 0.4795 |
| cosine_accuracy@5 | 0.6603 |
| cosine_accuracy@10 | 0.7973 |
| cosine_precision@1 | 0.1863 |
| cosine_precision@3 | 0.1598 |
| cosine_precision@5 | 0.1321 |
| cosine_precision@10 | 0.0797 |
| cosine_recall@1 | 0.1863 |
| cosine_recall@3 | 0.4795 |
| cosine_recall@5 | 0.6603 |
| cosine_recall@10 | 0.7973 |
| cosine_ndcg@10 | 0.4746 |
| cosine_mrr@10 | 0.3726 |
| cosine_map@100 | 0.3795 |

Information Retrieval (dataset: dim_256)

| Metric | Value |
|:-------|------:|
| cosine_accuracy@1 | 0.1836 |
| cosine_accuracy@3 | 0.4877 |
| cosine_accuracy@5 | 0.6603 |
| cosine_accuracy@10 | 0.7918 |
| cosine_precision@1 | 0.1836 |
| cosine_precision@3 | 0.1626 |
| cosine_precision@5 | 0.1321 |
| cosine_precision@10 | 0.0792 |
| cosine_recall@1 | 0.1836 |
| cosine_recall@3 | 0.4877 |
| cosine_recall@5 | 0.6603 |
| cosine_recall@10 | 0.7918 |
| cosine_ndcg@10 | 0.4733 |
| cosine_mrr@10 | 0.3724 |
| cosine_map@100 | 0.3793 |

Information Retrieval (dataset: dim_128)

| Metric | Value |
|:-------|------:|
| cosine_accuracy@1 | 0.1836 |
| cosine_accuracy@3 | 0.4685 |
| cosine_accuracy@5 | 0.6356 |
| cosine_accuracy@10 | 0.7781 |
| cosine_precision@1 | 0.1836 |
| cosine_precision@3 | 0.1562 |
| cosine_precision@5 | 0.1271 |
| cosine_precision@10 | 0.0778 |
| cosine_recall@1 | 0.1836 |
| cosine_recall@3 | 0.4685 |
| cosine_recall@5 | 0.6356 |
| cosine_recall@10 | 0.7781 |
| cosine_ndcg@10 | 0.4622 |
| cosine_mrr@10 | 0.3629 |
| cosine_map@100 | 0.3705 |

Information Retrieval (dataset: dim_64)

| Metric | Value |
|:-------|------:|
| cosine_accuracy@1 | 0.2055 |
| cosine_accuracy@3 | 0.4767 |
| cosine_accuracy@5 | 0.6274 |
| cosine_accuracy@10 | 0.7534 |
| cosine_precision@1 | 0.2055 |
| cosine_precision@3 | 0.1589 |
| cosine_precision@5 | 0.1255 |
| cosine_precision@10 | 0.0753 |
| cosine_recall@1 | 0.2055 |
| cosine_recall@3 | 0.4767 |
| cosine_recall@5 | 0.6274 |
| cosine_recall@10 | 0.7534 |
| cosine_ndcg@10 | 0.4625 |
| cosine_mrr@10 | 0.3707 |
| cosine_map@100 | 0.3787 |
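
The four tables above report the same retrieval metrics at Matryoshka dimensions 384, 256, 128, and 64 (the dim_384, dim_256, dim_128, and dim_64 evaluators). A minimal sketch of how such numbers can be reproduced with InformationRetrievalEvaluator; the queries, corpus, and relevant_docs below are hypothetical stand-ins for the held-out evaluation split:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Evaluate at a single Matryoshka dimension
model = SentenceTransformer(
    "zenml/finetuned-snowflake-arctic-embed-m-v1.5",
    truncate_dim=384,
)

# Hypothetical evaluation split: id -> text
queries = {"q1": "How can I configure a pipeline with a YAML file in ZenML?"}
corpus = {"d1": "How to configure a pipeline with a YAML ..."}
relevant_docs = {"q1": {"d1"}}  # which corpus ids answer each query

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="dim_384",
)
results = evaluator(model)  # dict with accuracy@k, precision@k, recall@k, NDCG@10, MRR@10, MAP@100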

Training Details

Training Dataset

Unnamed Dataset

  • Size: 3,284 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:

    |         | positive | anchor |
    |:--------|:---------|:-------|
    | type    | string   | string |
    | details | min: 10 tokens, mean: 22.7 tokens, max: 48 tokens | min: 17 tokens, mean: 316.5 tokens, max: 512 tokens |
  • Samples:
    Sample 1

    positive: How does ZenML help in integrating machine learning with operational processes?

    anchor: ZenML - Bridging the gap between ML & Ops

    Legacy Docs

    Bleeding EdgeLegacy Docs0.67.0

    🧙‍♂️Find older version our docs

    Powered by GitBook
    Sample 2

    positive: How can I configure a data integrity check step in ZenML to perform outlier sample detection and string length verification on a dataset with specific conditions?

    anchor: ks. For example, the following step configuration:deepchecks_data_integrity_check_step(
    check_list=[
    DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION,
    DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS,
    ],
    dataset_kwargs=dict(label='class', cat_features=['country', 'state']),
    check_kwargs={
    DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION: dict(
    nearest_neighbors_percent=0.01,
    extent_parameter=3,
    condition_outlier_ratio_less_or_equal=dict(
    max_outliers_ratio=0.007,
    outlier_score_threshold=0.5,
    ),
    condition_no_outliers=dict(
    outlier_score_threshold=0.6,
    )
    ),
    DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS: dict(
    num_percentiles=1000,
    min_unique_values=3,
    condition_number_of_outliers_less_or_equal=dict(
    max_outliers=3,
    )
    ),
    },
    ...
    )

    is equivalent to running the following Deepchecks tests:

    import deepchecks.tabular.checks as tabular_checks
    from deepchecks.tabular import Suite
    from deepchecks.tabular import Dataset

    train_dataset = Dataset(
    reference_dataset,
    label='class',
    cat_features=['country', 'state']
    )

    suite = Suite(name="custom")
    check = tabular_checks.OutlierSampleDetection(
    nearest_neighbors_percent=0.01,
    extent_parameter=3,
    )
    check.add_condition_outlier_ratio_less_or_equal(
    max_outliers_ratio=0.007,
    outlier_score_threshold=0.5,
    )
    check.add_condition_no_outliers(
    outlier_score_threshold=0.6,
    )
    suite.add(check)
    check = tabular_checks.StringLengthOutOfBounds(
    num_percentiles=1000,
    min_unique_values=3,
    )
    check.add_condition_number_of_outliers_less_or_equal(
    max_outliers=3,
    )
    suite.run(train_dataset=train_dataset)

    The Deepchecks Data Validator
    Sample 3

    positive: How can I develop a custom data validator in ZenML?

    anchor: custom data validator

    📈Experiment Trackers

    CometMLflow

    Neptune

    Weights & Biases

    Develop a custom experiment tracker

    🏃‍♀️Model Deployers

    MLflow

    Seldon

    BentoML

    Hugging Face

    Databricks

    Develop a Custom Model Deployer

    👣Step Operators

    Amazon SageMaker

    Google Cloud VertexAI

    AzureML

    Kubernetes

    Spark

    Develop a Custom Step Operator

    ❗Alerters

    Discord Alerter

    Slack Alerter

    Develop a Custom Alerter

    🖼️Image Builders

    Local Image Builder

    Kaniko Image Builder

    Google Cloud Image Builder

    Develop a Custom Image Builder

    🏷️Annotators

    Argilla

    Label Studio

    Pigeon

    Prodigy

    Develop a Custom Annotator

    📓Model Registries

    MLflow Model Registry

    Develop a Custom Model Registry

    📊Feature Stores

    Feast

    Develop a Custom Feature Store

    Examples

    🚀Quickstart

    🔏End-to-End Batch Inference

    📚Basic NLP with BERT

    👁️Computer Vision with YoloV8

    📖LLM Finetuning

    🧩More Projects...

    Reference

    🐍Python Client

    📼Global settings

    🌎Environment Variables

    👀API reference

    🤷SDK & CLI reference

    📚How do I...?

    ♻️Migration guide

    Migration guide 0.13.2 → 0.20.0

    Migration guide 0.23.0 → 0.30.0

    Migration guide 0.39.1 → 0.41.0

    Migration guide 0.58.2 → 0.60.0

    💜Community & content

    ❓FAQ

    Powered by GitBook
  • Loss: MatryoshkaLoss with these parameters (see the construction sketch below):
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            384,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
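
    A minimal sketch of how this loss configuration is constructed with the Sentence Transformers API; the base model is the one named in Model Details:

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.losses import (
        MatryoshkaLoss,
        MultipleNegativesRankingLoss,
    )

    model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

    # In-batch negatives ranking loss, applied at every Matryoshka
    # dimension of the embedding with equal weight
    base_loss = MultipleNegativesRankingLoss(model)
    loss = MatryoshkaLoss(
        model,
        base_loss,
        matryoshka_dims=[384, 256, 128, 64],
        matryoshka_weights=[1, 1, 1, 1],
    )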
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
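
A minimal sketch of how these non-default values map onto the Sentence Transformers 3.x training API; output_dir is illustrative, and model, loss, train_dataset, and evaluator stand in for the objects described in the sections above:

from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned-snowflake-arctic-embed-m-v1.5",  # illustrative
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed, so load_best_model_at_end can compare checkpoints
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    tf32=False,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,                  # base model wrapped by the loss sketch above
    args=args,
    train_dataset=train_dataset,  # hypothetical: the 3,284 positive/anchor pairs
    loss=loss,                    # MatryoshkaLoss from the sketch above
    evaluator=evaluator,          # hypothetical: the dim_* IR evaluators
)
trainer.train()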

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: True
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_384_cosine_map@100 | dim_64_cosine_map@100 |
|:------:|:----:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
| 0.3893 | 10 | 1.7142 | - | - | - | - |
| 0.7786 | 20 | 0.4461 | - | - | - | - |
| 0.9732 | 25 | - | 0.3544 | 0.3592 | 0.3674 | 0.3523 |
| 1.1655 | 30 | 0.1889 | - | - | - | - |
| 1.5547 | 40 | 0.1196 | - | - | - | - |
| 1.9440 | 50 | 0.0717 | - | - | - | - |
| 1.9830 | 51 | - | 0.3672 | 0.3727 | 0.3728 | 0.3797 |
| 2.3309 | 60 | 0.0474 | - | - | - | - |
| 2.7202 | 70 | 0.0418 | - | - | - | - |
| 2.9927 | 77 | - | 0.3722 | 0.3772 | 0.3798 | 0.3783 |
| 3.1071 | 80 | 0.0355 | - | - | - | - |
| **3.8856** | **100** | **0.0276** | **0.3705** | **0.3793** | **0.3795** | **0.3787** |

  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.3
  • Sentence Transformers: 3.0.1
  • Transformers: 4.44.0
  • PyTorch: 2.5.0+cu124
  • Accelerate: 0.33.0
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}