---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:3284
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-m-v1.5
widget:
- source_sentence: >-
Does ZenML officially support Macs running on Apple Silicon, and are there
any specific configurations needed?
sentences:
- >-
ding ZenML to learn more!
Do you support Windows?ZenML officially supports Windows if you're using
WSL. Much of ZenML will also work on Windows outside a WSL environment,
but we don't officially support it and some features don't work (notably
anything that requires spinning up a server process).
Do you support Macs running on Apple Silicon?
Yes, ZenML does support Macs running on Apple Silicon. You just need to
make sure that you set the following environment variable:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
This is a known issue with how forking works on Macs running on Apple
Silicon and it will enable you to use ZenML and the server. This
environment variable is needed if you are working with a local server on
your Mac, but if you're just using ZenML as a client / CLI and
connecting to a deployed server then you don't need to set it.
How can I make ZenML work with my custom tool? How can I extend or build
on ZenML?
This depends on the tool and its respective MLOps category. We have a
full guide on this over here!
How can I contribute?
We develop ZenML together with our community! To get involved, the best
way to get started is to select any issue from the good-first-issue
label. If you would like to contribute, please review our Contributing
Guide for all relevant details.
How can I speak with the community?
The first point of the call should be our Slack group. Ask your
questions about bugs or specific use cases and someone from the core
team will respond.
Which license does ZenML use?
ZenML is distributed under the terms of the Apache License Version 2.0.
A complete version of the license is available in the LICENSE.md in this
repository. Any contribution made to this project will be licensed under
the Apache License Version 2.0.
PreviousCommunity & content
Last updated 3 months ago
- |-
Registering a Model
PreviousUse the Model Control PlaneNextDeleting a Model
Last updated 4 months ago
- >-
Synthetic data generation
Generate synthetic data with distilabel to finetune embeddings.
PreviousImprove retrieval by finetuning embeddingsNextFinetuning
embeddings with Sentence Transformers
Last updated 21 days ago
- source_sentence: >-
How can I change the logging verbosity level in ZenML for both local and
remote pipeline runs?
sentences:
- >-
ncepts covered in this guide to your own projects.By the end of this
guide, you'll have a solid understanding of how to leverage LLMs in your
MLOps workflows using ZenML, enabling you to build powerful, scalable,
and maintainable LLM-powered applications. First up, let's take a look
at a super simple implementation of the RAG paradigm to get started.
PreviousAn end-to-end projectNextRAG with ZenML
Last updated 21 days ago
- >-
Configuring a pipeline at runtime
Configuring a pipeline at runtime.
PreviousUse pipeline/step parametersNextReference environment variables
in configurations
Last updated 28 days ago
- >-
Set logging verbosity
How to set the logging verbosity in ZenML.
By default, ZenML sets the logging verbosity to INFO. If you wish to
change this, you can do so by setting the following environment
variable:
export ZENML_LOGGING_VERBOSITY=INFO
Choose from INFO, WARN, ERROR, CRITICAL, DEBUG. This will set the logs
to whichever level you suggest.
Note that setting this on the client environment (e.g. your local
machine which runs the pipeline) will not automatically set the same
logging verbosity for remote pipeline runs. That means setting this
variable locally with only effect pipelines that run locally.
If you wish to control for remote pipeline runs, you can set the
ZENML_LOGGING_VERBOSITY environment variable in your pipeline runs
environment as follows:
docker_settings = DockerSettings(environment={"ZENML_LOGGING_VERBOSITY":
"DEBUG"})
@pipeline(settings={"docker": docker_settings})
def my_pipeline() -> None:
my_step()
my_pipeline = my_pipeline.with_options(
settings={"docker": docker_settings}
)
PreviousEnable or disable logs storageNextDisable rich traceback output
Last updated 21 days ago
- source_sentence: >-
How can I autogenerate a template yaml file for my specific pipeline using
ZenML?
sentences:
- >-
Autogenerate a template yaml file
To help you figure out what you can put in your configuration file,
simply autogenerate a template.
If you want to generate a template yaml file of your specific pipeline,
you can do so by using the .write_run_configuration_template() method.
This will generate a yaml file with all options commented out. This way
you can pick and choose the settings that are relevant to you.
from zenml import pipeline
...
@pipeline(enable_cache=True)
def simple_ml_pipeline(parameter: int):
dataset = load_data(parameter=parameter)
train_model(dataset)
simple_ml_pipeline.write_run_configuration_template(path="<Insert_path_here>")
When you want to configure your pipeline with a certain stack in mind,
you can do so as well:
`...write_run_configuration_template(stack=<Insert_stack_here>)
PreviousFind out which configuration was used for a runNextCustomize
Docker builds
Last updated 21 days ago
- |-
Deleting a Model
Learn how to delete models.
PreviousRegistering a ModelNextAssociate a pipeline with a Model
Last updated 4 months ago
- >-
Load artifacts into memory
Often ZenML pipeline steps consume artifacts produced by one another
directly in the pipeline code, but there are scenarios where you need to
pull external data into your steps. Such external data could be
artifacts produced by non-ZenML codes. For those cases, it is advised to
use ExternalArtifact, but what if we plan to exchange data created with
other ZenML pipelines?
ZenML pipelines are first compiled and only executed at some later
point. During the compilation phase, all function calls are executed,
and this data is fixed as step input parameters. Given all this, the
late materialization of dynamic objects, like data artifacts, is
crucial. Without late materialization, it would not be possible to pass
not-yet-existing artifacts as step inputs, or their metadata, which is
often the case in a multi-pipeline setting.
We identify two major use cases for exchanging artifacts between
pipelines:
You semantically group your data products using ZenML Models
You prefer to use ZenML Client to bring all the pieces together
We recommend using models to group and access artifacts across
pipelines. Find out how to load an artifact from a ZenML Model here.
Use client methods to exchange artifacts
If you don't yet use the Model Control Plane, you can still exchange
data between pipelines with late materialization. Let's rework the
do_predictions pipeline code as follows:
from typing import Annotated
from zenml import step, pipeline
from zenml.client import Client
import pandas as pd
from sklearn.base import ClassifierMixin
- source_sentence: >-
How can I create a Kubernetes cluster on EKS and configure it to run Spark
with a custom Docker image?
sentences:
- |-
View logs on the dashboard
PreviousControl loggingNextEnable or disable logs storage
Last updated 21 days ago
- >-
Datasets in ZenML
Model datasets using simple abstractions.
As machine learning projects grow in complexity, you often need to work
with various data sources and manage intricate data flows. This chapter
explores how to use custom Dataset classes and Materializers in ZenML to
handle these challenges efficiently. For strategies on scaling your data
processing for larger datasets, refer to scaling strategies for big
data.
Introduction to Custom Dataset Classes
Custom Dataset classes in ZenML provide a way to encapsulate data
loading, processing, and saving logic for different data sources.
They're particularly useful when:
Working with multiple data sources (e.g., CSV files, databases, cloud
storage)
Dealing with complex data structures that require special handling
Implementing custom data processing or transformation logic
Implementing Dataset Classes for Different Data Sources
Let's create a base Dataset class and implement it for CSV and BigQuery
data sources:
from abc import ABC, abstractmethod
import pandas as pd
from google.cloud import bigquery
from typing import Optional
class Dataset(ABC):
@abstractmethod
def read_data(self) -> pd.DataFrame:
pass
class CSVDataset(Dataset):
def __init__(self, data_path: str, df: Optional[pd.DataFrame] = None):
self.data_path = data_path
self.df = df
def read_data(self) -> pd.DataFrame:
if self.df is None:
self.df = pd.read_csv(self.data_path)
return self.df
class BigQueryDataset(Dataset):
def __init__(
self,
table_id: str,
df: Optional[pd.DataFrame] = None,
project: Optional[str] = None,
):
self.table_id = table_id
self.project = project
self.df = df
self.client = bigquery.Client(project=self.project)
def read_data(self) -> pd.DataFrame:
query = f"SELECT * FROM `{self.table_id}`"
self.df = self.client.query(query).to_dataframe()
return self.df
- >-
e the correct region is selected on the top right.Click on Add cluster
and select Create.
Enter a name and select the cluster role for Cluster service role.
Keep the default values for the networking and logging steps and create
the cluster.
Note down the cluster name and the API server endpoint:
EKS_CLUSTER_NAME=<EKS_CLUSTER_NAME>
EKS_API_SERVER_ENDPOINT=<API_SERVER_ENDPOINT>
After the cluster is created, select it and click on Add node group in
the Compute tab.
Enter a name and select the node role.
For the instance type, we recommend t3a.xlarge, as it provides up to 4
vCPUs and 16 GB of memory.
Docker image for the Spark drivers and executors
When you want to run your steps on a Kubernetes cluster, Spark will
require you to choose a base image for the driver and executor pods.
Normally, for this purpose, you can either use one of the base images in
Spark’s dockerhub or create an image using the docker-image-tool which
will use your own Spark installation and build an image.
When using Spark in EKS, you need to use the latter and utilize the
docker-image-tool. However, before the build process, you also need to
download the following packages
hadoop-aws = 3.3.1
aws-java-sdk-bundle = 1.12.150
and put them in the jars folder within your Spark installation. Once
that is set up, you can build the image as follows:
cd $SPARK_HOME
SPARK_HOME variable which points to your Spark installation
SPARK_IMAGE_TAG=<SPARK_IMAGE_TAG>
./bin/docker-image-tool.sh -t $SPARK_IMAGE_TAG -p
kubernetes/dockerfiles/spark/bindings/python/Dockerfile -u 0 build
BASE_IMAGE_NAME=spark-py:$SPARK_IMAGE_TAG
If you are working on an M1 Mac, you will need to build the image for
the amd64 architecture, by using the prefix -X on the previous command.
For example:
./bin/docker-image-tool.sh -X -t $SPARK_IMAGE_TAG -p
kubernetes/dockerfiles/spark/bindings/python/Dockerfile -u 0 build
Configuring RBAC
- source_sentence: How can I configure a pipeline with a YAML file in ZenML?
sentences:
- |-
atically retry steps
Run pipelines asynchronouslyControl execution order of steps
Using a custom step invocation ID
Name your pipeline runs
Use failure/success hooks
Hyperparameter tuning
Access secrets in a step
Run an individual step
Fetching pipelines
Get past pipeline/step runs
🚨Trigger a pipeline
Use templates: Python SDK
Use templates: Dashboard
Use templates: Rest API
📃Use configuration files
How to configure a pipeline with a YAML
What can be configured
Runtime settings for Docker, resources, and stack components
Configuration hierarchy
Find out which configuration was used for a run
Autogenerate a template yaml file
🐳Customize Docker builds
Docker settings on a pipeline
Docker settings on a step
Use a prebuilt image for pipeline execution
Specify pip dependencies and apt packages
Use your own Dockerfiles
Which files are built into the image
How to reuse builds
Define where an image is built
📔Run remote pipelines from notebooks
Limitations of defining steps in notebook cells
Run a single step from a notebook
🤹Manage your ZenML server
Best practices for upgrading ZenML
Upgrade your ZenML server
Using ZenML server in production
Troubleshoot your ZenML server
Migration guide
Migration guide 0.13.2 → 0.20.0
Migration guide 0.23.0 → 0.30.0
Migration guide 0.39.1 → 0.41.0
Migration guide 0.58.2 → 0.60.0
📍Develop locally
Use config files to develop locally
Keep your pipelines and dashboard clean
⚒️Manage stacks & components
Deploy a cloud stack with ZenML
Deploy a cloud stack with Terraform
Register a cloud stack
Reference secrets in stack configuration
Implement a custom stack component
🚜Train with GPUs
Distributed Training with 🤗 Accelerate
🌲Control logging
View logs on the dashboard
Enable or disable logs storage
Set logging verbosity
Disable rich traceback output
Disable colorful logging
🗄️Handle Data/Artifacts
How ZenML stores data
Return multiple outputs from a step
Delete an artifact
Organize data with tags
Get arbitrary artifacts in a step
- >-
Security best practices
Best practices concerning the various authentication methods implemented
by Service Connectors.
Service Connector Types, especially those targeted at cloud providers,
offer a plethora of authentication methods matching those supported by
remote cloud platforms. While there is no single authentication standard
that unifies this process, there are some patterns that are easily
identifiable and can be used as guidelines when deciding which
authentication method to use to configure a Service Connector.
This section explores some of those patterns and gives some advice
regarding which authentication methods are best suited for your needs.
This section may require some general knowledge about authentication and
authorization to be properly understood. We tried to keep it simple and
limit ourselves to talking about high-level concepts, but some areas may
get a bit too technical.
Username and password
The key takeaway is this: you should avoid using your primary account
password as authentication credentials as much as possible. If there are
alternative authentication methods that you can use or other types of
credentials (e.g. session tokens, API keys, API tokens), you should
always try to use those instead.
Ultimately, if you have no choice, be cognizant of the third parties you
share your passwords with. If possible, they should never leave the
premises of your local host or development environment.
This is the typical authentication method that uses a username or
account name plus the associated password. While this is the de facto
method used to log in with web consoles and local CLIs, this is the
least secure of all authentication methods and never something you want
to share with other members of your team or organization or use to
authenticate automated workloads.
- >-
━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━┛$ zenml orchestrator
connect <ORCHESTRATOR_NAME> --connector aws-iam-multi-us
Running with active stack: 'default' (repository)
Successfully connected orchestrator `<ORCHESTRATOR_NAME>` to the
following resources:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┓
┃ CONNECTOR ID │ CONNECTOR NAME │ CONNECTOR
TYPE │ RESOURCE TYPE │ RESOURCE NAMES ┃
┠──────────────────────────────────────┼──────────────────┼────────────────┼───────────────────────┼──────────────────┨
┃ ed528d5a-d6cb-4fc4-bc52-c3d2d01643e5 │ aws-iam-multi-us │ 🔶
aws │ 🌀 kubernetes-cluster │ zenhacks-cluster ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┛
$ zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
if you don't have a Service Connector on hand and you don't want to
register one , the local Kubernetes kubectl client needs to be
configured with a configuration context pointing to the remote cluster.
The kubernetes_context stack component must also be configured with the
value of that context:
zenml orchestrator register <ORCHESTRATOR_NAME> \
--flavor=kubernetes \
--kubernetes_context=<KUBERNETES_CONTEXT>
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
ZenML will build a Docker image called
<CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME> which includes your code
and use it to run your pipeline steps in Kubernetes. Check out this page
if you want to learn more about how ZenML builds these images and how
you can customize them.
You can now run any ZenML pipeline using the Kubernetes orchestrator:
python file_that_runs_a_zenml_pipeline.py
datasets: []
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: zenml/finetuned-snowflake-arctic-embed-m-v1.5
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 384
type: dim_384
metrics:
- type: cosine_accuracy@1
value: 0.1863013698630137
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.4794520547945205
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.6602739726027397
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 0.7972602739726027
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.1863013698630137
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.1598173515981735
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.13205479452054794
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.07972602739726026
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.1863013698630137
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.4794520547945205
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.6602739726027397
name: Cosine Recall@5
- type: cosine_recall@10
value: 0.7972602739726027
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.47459290361092754
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.3725994781474232
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.37953809566266083
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 256
type: dim_256
metrics:
- type: cosine_accuracy@1
value: 0.18356164383561643
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.4876712328767123
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.6602739726027397
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 0.7917808219178082
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.18356164383561643
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.16255707762557076
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.1320547945205479
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.07917808219178081
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.18356164383561643
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.4876712328767123
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.6602739726027397
name: Cosine Recall@5
- type: cosine_recall@10
value: 0.7917808219178082
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.47334554819769054
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.3724179169384647
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.37931260226095775
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 128
type: dim_128
metrics:
- type: cosine_accuracy@1
value: 0.18356164383561643
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.4684931506849315
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.6356164383561644
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 0.7780821917808219
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.18356164383561643
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.1561643835616438
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.12712328767123285
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.07780821917808219
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.18356164383561643
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.4684931506849315
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.6356164383561644
name: Cosine Recall@5
- type: cosine_recall@10
value: 0.7780821917808219
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.46219638130094637
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.3628680147858229
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.37047490630037583
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 64
type: dim_64
metrics:
- type: cosine_accuracy@1
value: 0.2054794520547945
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.4767123287671233
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.6273972602739726
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 0.7534246575342466
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.2054794520547945
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.15890410958904108
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.12547945205479452
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.07534246575342465
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.2054794520547945
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.4767123287671233
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.6273972602739726
name: Cosine Recall@5
- type: cosine_recall@10
value: 0.7534246575342466
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.46250756548591326
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.37069906501413347
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.37874559284369463
name: Cosine Map@100
---

# zenml/finetuned-snowflake-arctic-embed-m-v1.5

This is a [sentence-transformers](https://www.sbert.net) model finetuned from [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details

### Model Description

- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: en
- License: apache-2.0
### Model Sources
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
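
Because the final Normalize() module L2-normalizes every embedding, cosine similarity between outputs reduces to a plain dot product. A quick sanity check (a sketch; the input sentence is an arbitrary example):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("zenml/finetuned-snowflake-arctic-embed-m-v1.5")
emb = model.encode(["How do I register a ZenML stack?"])  # example input

# The Normalize() module makes every vector unit-length,
# so the L2 norm should be ~1.0.
print(np.linalg.norm(emb[0]))
```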
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("zenml/finetuned-snowflake-arctic-embed-m-v1.5")
# Run inference
sentences = [
    'How can I configure a pipeline with a YAML file in ZenML?',
    'atically retry steps\n\nRun pipelines asynchronouslyControl execution order of steps\n\nUsing a custom step invocation ID\n\nName your pipeline runs\n\nUse failure/success hooks\n\nHyperparameter tuning\n\nAccess secrets in a step\n\nRun an individual step\n\nFetching pipelines\n\nGet past pipeline/step runs\n\n🚨Trigger a pipeline\n\nUse templates: Python SDK\n\nUse templates: Dashboard\n\nUse templates: Rest API\n\n📃Use configuration files\n\nHow to configure a pipeline with a YAML\n\nWhat can be configured\n\nRuntime settings for Docker, resources, and stack components\n\nConfiguration hierarchy\n\nFind out which configuration was used for a run\n\nAutogenerate a template yaml file\n\n🐳Customize Docker builds\n\nDocker settings on a pipeline\n\nDocker settings on a step\n\nUse a prebuilt image for pipeline execution\n\nSpecify pip dependencies and apt packages\n\nUse your own Dockerfiles\n\nWhich files are built into the image\n\nHow to reuse builds\n\nDefine where an image is built\n\n📔Run remote pipelines from notebooks\n\nLimitations of defining steps in notebook cells\n\nRun a single step from a notebook\n\n🤹Manage your ZenML server\n\nBest practices for upgrading ZenML\n\nUpgrade your ZenML server\n\nUsing ZenML server in production\n\nTroubleshoot your ZenML server\n\nMigration guide\n\nMigration guide 0.13.2 → 0.20.0\n\nMigration guide 0.23.0 → 0.30.0\n\nMigration guide 0.39.1 → 0.41.0\n\nMigration guide 0.58.2 → 0.60.0\n\n📍Develop locally\n\nUse config files to develop locally\n\nKeep your pipelines and dashboard clean\n\n⚒️Manage stacks & components\n\nDeploy a cloud stack with ZenML\n\nDeploy a cloud stack with Terraform\n\nRegister a cloud stack\n\nReference secrets in stack configuration\n\nImplement a custom stack component\n\n🚜Train with GPUs\n\nDistributed Training with 🤗 Accelerate\n\n🌲Control logging\n\nView logs on the dashboard\n\nEnable or disable logs storage\n\nSet logging verbosity\n\nDisable rich traceback output\n\nDisable colorful logging\n\n🗄️Handle Data/Artifacts\n\nHow ZenML stores data\n\nReturn multiple outputs from a step\n\nDelete an artifact\n\nOrganize data with tags\n\nGet arbitrary artifacts in a step',
    "━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━┛$ zenml orchestrator connect <ORCHESTRATOR_NAME> --connector aws-iam-multi-us\nRunning with active stack: 'default' (repository)\nSuccessfully connected orchestrator `<ORCHESTRATOR_NAME>` to the following resources:\n┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┓\n┃ CONNECTOR ID │ CONNECTOR NAME │ CONNECTOR TYPE │ RESOURCE TYPE │ RESOURCE NAMES ┃\n┠──────────────────────────────────────┼──────────────────┼────────────────┼───────────────────────┼──────────────────┨\n┃ ed528d5a-d6cb-4fc4-bc52-c3d2d01643e5 │ aws-iam-multi-us │ 🔶 aws │ 🌀 kubernetes-cluster │ zenhacks-cluster ┃\n┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┛\n\n# Register and activate a stack with the new orchestrator\n$ zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set\n\nif you don't have a Service Connector on hand and you don't want to register one , the local Kubernetes kubectl client needs to be configured with a configuration context pointing to the remote cluster. The kubernetes_context stack component must also be configured with the value of that context:\n\nzenml orchestrator register <ORCHESTRATOR_NAME> \\\n  --flavor=kubernetes \\\n  --kubernetes_context=<KUBERNETES_CONTEXT>\n\n# Register and activate a stack with the new orchestrator\nzenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set\n\nZenML will build a Docker image called <CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME> which includes your code and use it to run your pipeline steps in Kubernetes. Check out this page if you want to learn more about how ZenML builds these images and how you can customize them.\n\nYou can now run any ZenML pipeline using the Kubernetes orchestrator:\n\npython file_that_runs_a_zenml_pipeline.py",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
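
Since the model was trained with MatryoshkaLoss at dimensions 384, 256, 128, and 64 (see Training Details below), you can trade a little retrieval quality for smaller vectors by truncating embeddings to one of those sizes. A minimal sketch, assuming the `truncate_dim` argument available in recent sentence-transformers releases:

```python
from sentence_transformers import SentenceTransformer

# Ask the model to return embeddings truncated to 256 dimensions,
# one of the Matryoshka dimensions this model was trained with.
model = SentenceTransformer(
    "zenml/finetuned-snowflake-arctic-embed-m-v1.5",
    truncate_dim=256,
)

embeddings = model.encode([
    "How can I configure a pipeline with a YAML file in ZenML?",
])
print(embeddings.shape)  # (1, 256)
```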
## Evaluation

### Metrics

#### Information Retrieval

- Dataset: `dim_384`

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.1863 |
| cosine_accuracy@3   | 0.4795 |
| cosine_accuracy@5   | 0.6603 |
| cosine_accuracy@10  | 0.7973 |
| cosine_precision@1  | 0.1863 |
| cosine_precision@3  | 0.1598 |
| cosine_precision@5  | 0.1321 |
| cosine_precision@10 | 0.0797 |
| cosine_recall@1     | 0.1863 |
| cosine_recall@3     | 0.4795 |
| cosine_recall@5     | 0.6603 |
| cosine_recall@10    | 0.7973 |
| cosine_ndcg@10      | 0.4746 |
| cosine_mrr@10       | 0.3726 |
| cosine_map@100      | 0.3795 |
#### Information Retrieval

- Dataset: `dim_256`

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.1836 |
| cosine_accuracy@3   | 0.4877 |
| cosine_accuracy@5   | 0.6603 |
| cosine_accuracy@10  | 0.7918 |
| cosine_precision@1  | 0.1836 |
| cosine_precision@3  | 0.1626 |
| cosine_precision@5  | 0.1321 |
| cosine_precision@10 | 0.0792 |
| cosine_recall@1     | 0.1836 |
| cosine_recall@3     | 0.4877 |
| cosine_recall@5     | 0.6603 |
| cosine_recall@10    | 0.7918 |
| cosine_ndcg@10      | 0.4733 |
| cosine_mrr@10       | 0.3724 |
| cosine_map@100      | 0.3793 |
#### Information Retrieval

- Dataset: `dim_128`

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.1836 |
| cosine_accuracy@3   | 0.4685 |
| cosine_accuracy@5   | 0.6356 |
| cosine_accuracy@10  | 0.7781 |
| cosine_precision@1  | 0.1836 |
| cosine_precision@3  | 0.1562 |
| cosine_precision@5  | 0.1271 |
| cosine_precision@10 | 0.0778 |
| cosine_recall@1     | 0.1836 |
| cosine_recall@3     | 0.4685 |
| cosine_recall@5     | 0.6356 |
| cosine_recall@10    | 0.7781 |
| cosine_ndcg@10      | 0.4622 |
| cosine_mrr@10       | 0.3629 |
| cosine_map@100      | 0.3705 |
#### Information Retrieval

- Dataset: `dim_64`

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.2055 |
| cosine_accuracy@3   | 0.4767 |
| cosine_accuracy@5   | 0.6274 |
| cosine_accuracy@10  | 0.7534 |
| cosine_precision@1  | 0.2055 |
| cosine_precision@3  | 0.1589 |
| cosine_precision@5  | 0.1255 |
| cosine_precision@10 | 0.0753 |
| cosine_recall@1     | 0.2055 |
| cosine_recall@3     | 0.4767 |
| cosine_recall@5     | 0.6274 |
| cosine_recall@10    | 0.7534 |
| cosine_ndcg@10      | 0.4625 |
| cosine_mrr@10       | 0.3707 |
| cosine_map@100      | 0.3787 |
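
The four tables above report standard information-retrieval metrics at each Matryoshka truncation dimension. A minimal sketch of how such numbers can be computed with sentence-transformers' InformationRetrievalEvaluator; the query, corpus, and relevance entries below are hypothetical placeholders:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("zenml/finetuned-snowflake-arctic-embed-m-v1.5")

# Hypothetical toy evaluation set: query id -> text, doc id -> text,
# and query id -> set of relevant doc ids.
queries = {"q1": "How can I configure a pipeline with a YAML file in ZenML?"}
corpus = {
    "d1": "How to configure a pipeline with a YAML configuration file.",
    "d2": "Security best practices for Service Connectors.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy_eval",
)
results = evaluator(model)
print(results)  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
```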
## Training Details

### Training Dataset

#### Unnamed Dataset

- Size: 3,284 training samples
- Columns: positive and anchor
- Approximate statistics based on the first 1000 samples:

|         | positive | anchor |
|:--------|:---------|:-------|
| type    | string   | string |
| details | min: 10 tokens, mean: 22.7 tokens, max: 48 tokens | min: 17 tokens, mean: 316.5 tokens, max: 512 tokens |
- Samples:

| positive | anchor |
|:---------|:-------|
| How does ZenML help in integrating machine learning with operational processes? | ZenML - Bridging the gap between ML & Ops<br>Legacy Docs<br>Bleeding EdgeLegacy Docs0.67.0<br>🧙♂️Find older version our docs<br>Powered by GitBook |
| How can I configure a data integrity check step in ZenML to perform outlier sample detection and string length verification on a dataset with specific conditions? | ks. For example, the following step configuration:deepchecks_data_integrity_check_step( check_list=[ DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION, DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS, ], dataset_kwargs=dict(label='class', cat_features=['country', 'state']), check_kwargs={ DeepchecksDataIntegrityCheck.TABULAR_OUTLIER_SAMPLE_DETECTION: dict( nearest_neighbors_percent=0.01, extent_parameter=3, condition_outlier_ratio_less_or_equal=dict( max_outliers_ratio=0.007, outlier_score_threshold=0.5, ), condition_no_outliers=dict( outlier_score_threshold=0.6, ) ), DeepchecksDataIntegrityCheck.TABULAR_STRING_LENGTH_OUT_OF_BOUNDS: dict( num_percentiles=1000, min_unique_values=3, condition_number_of_outliers_less_or_equal=dict( max_outliers=3, ) ), }, ... )<br>is equivalent to running the following Deepchecks tests:<br>import deepchecks.tabular.checks as tabular_checks from deepchecks.tabular import Suite from deepchecks.tabular import Dataset<br>train_dataset = Dataset( reference_dataset, label='class', cat_features=['country', 'state'] )<br>suite = Suite(name="custom") check = tabular_checks.OutlierSampleDetection( nearest_neighbors_percent=0.01, extent_parameter=3, ) check.add_condition_outlier_ratio_less_or_equal( max_outliers_ratio=0.007, outlier_score_threshold=0.5, ) check.add_condition_no_outliers( outlier_score_threshold=0.6, ) suite.add(check) check = tabular_checks.StringLengthOutOfBounds( num_percentiles=1000, min_unique_values=3, ) check.add_condition_number_of_outliers_less_or_equal( max_outliers=3, ) suite.run(train_dataset=train_dataset)<br>The Deepchecks Data Validator |
| How can I develop a custom data validator in ZenML? | custom data validator<br>📈Experiment Trackers<br>CometMLflow<br>Neptune<br>Weights & Biases<br>Develop a custom experiment tracker<br>🏃♀️Model Deployers<br>MLflow<br>Seldon<br>BentoML<br>Hugging Face<br>Databricks<br>Develop a Custom Model Deployer<br>👣Step Operators<br>Amazon SageMaker<br>Google Cloud VertexAI<br>AzureML<br>Kubernetes<br>Spark<br>Develop a Custom Step Operator<br>❗Alerters<br>Discord Alerter<br>Slack Alerter<br>Develop a Custom Alerter<br>🖼️Image Builders<br>Local Image Builder<br>Kaniko Image Builder<br>Google Cloud Image Builder<br>Develop a Custom Image Builder<br>🏷️Annotators<br>Argilla<br>Label Studio<br>Pigeon<br>Prodigy<br>Develop a Custom Annotator<br>📓Model Registries<br>MLflow Model Registry<br>Develop a Custom Model Registry<br>📊Feature Stores<br>Feast<br>Develop a Custom Feature Store<br>Examples<br>🚀Quickstart<br>🔏End-to-End Batch Inference<br>📚Basic NLP with BERT<br>👁️Computer Vision with YoloV8<br>📖LLM Finetuning<br>🧩More Projects...<br>Reference<br>🐍Python Client<br>📼Global settings<br>🌎Environment Variables<br>👀API reference<br>🤷SDK & CLI reference<br>📚How do I...?<br>♻️Migration guide<br>Migration guide 0.13.2 → 0.20.0<br>Migration guide 0.23.0 → 0.30.0<br>Migration guide 0.39.1 → 0.41.0<br>Migration guide 0.58.2 → 0.60.0<br>💜Community & content<br>❓FAQ<br>Powered by GitBook |
- Loss: MatryoshkaLoss with these parameters:

  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          384,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
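
In sentence-transformers, the configuration above corresponds to wrapping MultipleNegativesRankingLoss (in-batch negatives) in MatryoshkaLoss, so the ranking objective is applied at every truncated dimension. A minimal sketch:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

# Rank the true (anchor, positive) pair above all in-batch negatives...
inner_loss = MultipleNegativesRankingLoss(model)
# ...at each Matryoshka dimension, all weighted equally.
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[384, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1],
)
```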
### Training Hyperparameters

#### Non-Default Hyperparameters

- eval_strategy: epoch
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 16
- learning_rate: 2e-05
- num_train_epochs: 4
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- tf32: False
- load_best_model_at_end: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates
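
For reference, these non-default values map onto sentence-transformers v3 training arguments roughly as follows (a sketch; output_dir and save_strategy are assumptions not listed in this card):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned-arctic-embed",  # hypothetical output path
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed: must match eval_strategy when load_best_model_at_end=True
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    tf32=False,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate texts within a batch
)
```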
#### All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 16
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: False
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: True
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
### Training Logs

| Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_384_cosine_map@100 | dim_64_cosine_map@100 |
|:------:|:----:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
| 0.3893 | 10   | 1.7142        | -      | -      | -      | -      |
| 0.7786 | 20   | 0.4461        | -      | -      | -      | -      |
| 0.9732 | 25   | -             | 0.3544 | 0.3592 | 0.3674 | 0.3523 |
| 1.1655 | 30   | 0.1889        | -      | -      | -      | -      |
| 1.5547 | 40   | 0.1196        | -      | -      | -      | -      |
| 1.9440 | 50   | 0.0717        | -      | -      | -      | -      |
| 1.9830 | 51   | -             | 0.3672 | 0.3727 | 0.3728 | 0.3797 |
| 2.3309 | 60   | 0.0474        | -      | -      | -      | -      |
| 2.7202 | 70   | 0.0418        | -      | -      | -      | -      |
| 2.9927 | 77   | -             | 0.3722 | 0.3772 | 0.3798 | 0.3783 |
| 3.1071 | 80   | 0.0355        | -      | -      | -      | -      |
| 3.4964 | 90   | 0.0351        | -      | -      | -      | -      |
| **3.8856** | **100** | **0.0276** | **0.3705** | **0.3793** | **0.3795** | **0.3787** |

- The bold row denotes the saved checkpoint (its values match the final metrics reported in the model index above).
### Framework Versions
- Python: 3.12.3
- Sentence Transformers: 3.0.1
- Transformers: 4.44.0
- PyTorch: 2.5.0+cu124
- Accelerate: 0.33.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```