Creating datasets using the SDK
First, choose where your data will be hosted by selecting the appropriate StorageLocation.
Note
Creating a dataset and adding data to a dataset are two distinct steps. Click here to learn how to add data to an existing dataset.
Note
Datasets cannot be deleted using the SDK or the API. Please use the web-app to delete datasets.
The following example will create a dataset called “Example Title” that will expect data hosted on AWS S3. Substitute <private_key_path> with the file path for your private key.
# Import dependencies
from encord import EncordUserClient
from encord.orm.dataset import StorageLocation
# Authenticate with Encord using the path to your private key
user_client = EncordUserClient.create_with_ssh_private_key(ssh_private_key_path='<private_key_path>')
# Create a dataset by specifying a title as well as a storage location
dataset = user_client.create_dataset(
"Example Title", StorageLocation.AWS
)
# Prints the dataset, as shown in the example output
print(dataset)
{
"title": "Example Title",
"type": 1,
"dataset_hash": "<dataset_hash>",
"user_hash": "<user_hash>",
}
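A mistyped key path is an easy mistake at the authentication step, so it can help to check the file before calling the SDK. The following is a minimal sketch, not part of the Encord SDK: the helper name load_user_client is hypothetical, and the only SDK call it makes is the create_with_ssh_private_key call shown in the example above.

```python
from pathlib import Path


def load_user_client(private_key_path: str):
    """Hypothetical helper: verify the key file exists, then authenticate."""
    key_file = Path(private_key_path).expanduser()
    if not key_file.is_file():
        # Fail early with a clear message instead of an error from deeper in the SDK
        raise FileNotFoundError(f"No private key file found at {key_file}")

    # Imported here so the path check above runs before the SDK is loaded
    from encord import EncordUserClient

    return EncordUserClient.create_with_ssh_private_key(
        ssh_private_key_path=str(key_file)
    )
```

The returned client can then be used exactly like user_client in the example above, for instance user_client = load_user_client('<private_key_path>').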
If your data is hosted with a different cloud provider, replace StorageLocation.AWS with the relevant StorageLocation member from the table below.
Storage location | StorageLocation member |
---|---|
AWS S3 | AWS |
GCP | GCP |
Azure Blob | AZURE |
Open Telekom Cloud | OTC |
Encord storage | CORD_STORAGE |
Tip
If you wish to upload your data from local storage to Encord-hosted storage, pass StorageLocation.CORD_STORAGE when creating the dataset. Click here to learn how to upload data to Encord-hosted storage.
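If the storage provider comes from configuration rather than being hard-coded, the table above can be expressed as a lookup. This is a minimal sketch: the dictionary and the function storage_member_name are hypothetical names, not part of the SDK, and the mapping only restates the table.

```python
# Storage location label -> StorageLocation member name, copied from the table above
STORAGE_MEMBER_NAMES = {
    "aws s3": "AWS",
    "gcp": "GCP",
    "azure blob": "AZURE",
    "open telekom cloud": "OTC",
    "encord storage": "CORD_STORAGE",
}


def storage_member_name(label: str) -> str:
    """Hypothetical helper: return the member name for a storage label."""
    key = label.strip().lower()
    if key not in STORAGE_MEMBER_NAMES:
        known = ", ".join(sorted(STORAGE_MEMBER_NAMES))
        raise ValueError(f"Unknown storage location {label!r}; expected one of: {known}")
    return STORAGE_MEMBER_NAMES[key]
```

Assuming StorageLocation behaves like a standard Python Enum, the member itself could then be looked up by name with StorageLocation[storage_member_name("Azure blob")] and passed to create_dataset.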
Listing existing datasets
Use the EncordUserClient.get_datasets() method to list the datasets available to the authenticated user.
In the example below, a user authenticates with Encord and then fetches all datasets available to them. Substitute <private_key_path> with the file path for your private key.
Tip
The dataset hash can be found within the URL once a dataset has been selected:
app.encord.com/projects/view/<dataset_hash>/summary
# Import dependencies
from encord import EncordUserClient
# Authenticate with Encord using the path to your private key
user_client = EncordUserClient.create_with_ssh_private_key(ssh_private_key_path='<private_key_path>')
# List existing datasets
datasets = user_client.get_datasets()
print(datasets)
[
{
"dataset": DatasetInfo(
dataset_hash="<dataset_hash>",
user_hash="<user_hash>",
title="Example title",
description="Example description ... ",
type=0, # encord.orm.dataset.StorageLocation
created_at=datetime.datetime(...),
last_edited_at=datetime.datetime(...)
),
"user_role": DatasetUserRole.ADMIN
},
# ...
]
The type attribute in the output refers to the StorageLocation used when a dataset was created.
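Because type is printed as a bare integer, it can be convenient to turn it back into a readable name. The sketch below assumes StorageLocation is an integer-valued Python Enum, which the example outputs suggest but do not confirm; the helper storage_location_name is hypothetical and takes the enum class as an argument.

```python
from enum import IntEnum
from typing import Type


def storage_location_name(storage_enum: Type[IntEnum], type_value: int) -> str:
    """Hypothetical helper: map an integer `type` to its enum member name."""
    try:
        return storage_enum(type_value).name
    except ValueError:
        # The value has no matching member, e.g. a newer storage type
        return f"UNKNOWN({type_value})"
```

With the listing above, this might be called as storage_location_name(StorageLocation, entry["dataset"].type) for each entry in datasets.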
Tip
EncordUserClient.get_datasets() has multiple optional arguments that allow you to query datasets with specific characteristics. For example, if you only want datasets with titles starting with “Validation”, you could use user_client.get_datasets(title_like="Validation%"). Other keyword arguments such as created_before or edited_after may also be of interest.
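To work with the filtered results, the list returned by get_datasets() can be reduced to the fields you need. The following is a minimal sketch that assumes each entry has the shape shown in the example output (a dictionary whose "dataset" value exposes dataset_hash and title); the helper dataset_titles_by_hash is a hypothetical name, and any keyword arguments are passed to get_datasets() unchanged.

```python
from typing import Any, Dict


def dataset_titles_by_hash(user_client: Any, **filters: Any) -> Dict[str, str]:
    """Hypothetical helper: return {dataset_hash: title} for matching datasets."""
    # Filters such as title_like="Validation%" are forwarded as-is
    entries = user_client.get_datasets(**filters)
    return {
        entry["dataset"].dataset_hash: entry["dataset"].title
        for entry in entries
    }
```

For example, dataset_titles_by_hash(user_client, title_like="Validation%") would return only the datasets whose titles start with “Validation”.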