Creating a Dataset and adding files to a Dataset are two distinct steps. To add data to an existing Dataset, see the documentation on adding data to Datasets.
Datasets cannot be deleted using the SDK or the API. Use the Encord platform to delete Datasets.
The following example creates a Dataset called “Houses” that expects data hosted on AWS S3.
Substitute <private_key_path> with the file path for your private key.
Replace “Houses” with the name you want your Dataset to have.
```python
# Import dependencies
from encord import EncordUserClient
from encord.orm.dataset import StorageLocation

# Authenticate with Encord using the path to your private key
user_client = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path="<private_key_path>"
)

# Create a new Dataset
dataset_response = user_client.create_dataset(
    dataset_title="Houses",
    dataset_type=StorageLocation.AWS,
    create_backing_folder=False,
)

# Print the CreateDatasetResponse object to verify that the Dataset was created
print(dataset_response)

# Print the storage location
print("Using storage location: AWS")
```
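If you prefer not to hard-code the private key path in the script, one option is to read it from an environment variable and check that the file exists before authenticating. The following is a minimal sketch using only the Python standard library. The variable name MY_ENCORD_KEY_PATH and the helper resolve_private_key_path are hypothetical names chosen for this illustration and are not part of the Encord SDK.

```python
import os
from pathlib import Path


def resolve_private_key_path(env_var: str = "MY_ENCORD_KEY_PATH") -> Path:
    """Return the private key path stored in `env_var`, checking that the file exists.

    The environment variable name is a hypothetical choice for this sketch.
    """
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"Set {env_var} to the path of your private key file")

    # Expand a leading "~" so paths such as ~/.ssh/my_key also work
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No private key file found at {path}")
    return path
```

The returned value can then be passed as ssh_private_key_path=str(resolve_private_key_path()) in place of the "<private_key_path>" placeholder.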
Use the following script to create a new Dataset from the label rows of a specific Project.
Replace <private_key_path> with the path to your private key.
Replace <project_hash> with the hash of the Project containing the data units you want to create a new Dataset from.
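Replace "My new Dataset" with the name you want to give your new Dataset.
Replace subset_me with the text that the titles of the data units you want to include must contain. Only label rows whose data title contains this text are added to the new Dataset.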
If create_backing_folder is True, a mirrored Dataset is created. Mirrored Datasets sync the content of the backed Folder with the Dataset.
```python
# Import dependencies
from encord.orm.dataset import StorageLocation
from encord.user_client import EncordUserClient

# Authenticate with Encord using the path to your private key
user_client = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path="<private_key_path>"
)

# Specify a Project
project = user_client.get_project("<project_hash>")

# Get the UUIDs of the items to be added to the new Dataset
item_uuids = [
    lr.backing_item_uuid
    for lr in project.list_label_rows_v2()
    if "subset_me" in lr.data_title
]

# Create the new Dataset
response = user_client.create_dataset(
    dataset_title="My new Dataset",
    dataset_type=StorageLocation.CORD_STORAGE,
    create_backing_folder=False,
)

# Fetch the new Dataset with the authenticated user client and link the items
dataset = user_client.get_dataset(response.dataset_hash)
dataset.link_items(item_uuids)
```
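The selection step in the script above is a plain substring filter on each label row's data title. The following sketch reproduces that filter without connecting to Encord, so you can check which titles would match before creating the Dataset. LabelRowStub and select_item_uuids are hypothetical names used only for this illustration; the stub exposes just the two attributes the script reads (data_title and backing_item_uuid), and the file names are made-up sample values.

```python
from dataclasses import dataclass
from typing import Iterable, List
from uuid import UUID, uuid4


@dataclass
class LabelRowStub:
    """Hypothetical stand-in exposing the two attributes the script reads."""

    data_title: str
    backing_item_uuid: UUID


def select_item_uuids(label_rows: Iterable[LabelRowStub], needle: str) -> List[UUID]:
    """Collect backing item UUIDs for rows whose title contains `needle`."""
    return [lr.backing_item_uuid for lr in label_rows if needle in lr.data_title]


# Made-up sample rows: two titles contain "subset_me", one does not
rows = [
    LabelRowStub("subset_me_house_01.jpg", uuid4()),
    LabelRowStub("garden_02.jpg", uuid4()),
    LabelRowStub("roof_subset_me.png", uuid4()),
]

selected = select_item_uuids(rows, "subset_me")
print(len(selected))  # 2
```

The match is case-sensitive and applies anywhere in the title, so a title containing "SUBSET_ME" would not be selected.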
Use the EncordUserClient get_datasets() method to query and list the Datasets available to the user client.
The following example fetches all Datasets available to the user. Substitute <private_key_path> with the file path for your private key.
The Dataset hash can be found within the URL once a Dataset has been selected:
app.encord.com/projects/view/\<dataset_hash>/summary or app.us.encord.com/projects/view/\<dataset_hash>/summary
```python
# Import dependencies
from encord import EncordUserClient

# Authenticate with Encord using the path to your private key
user_client = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path="<private_key_path>"
)

# List existing Datasets
datasets = user_client.get_datasets()
print(datasets)
```
The type attribute in the output refers to the StorageLocation of the Dataset.
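If you copy the Dataset hash from the browser, the following sketch pulls it out of a URL programmatically. It assumes the URL follows the pattern shown above, with the hash as the path segment directly after view. The function name dataset_hash_from_url and the hash value abc123 are hypothetical and used only for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def dataset_hash_from_url(url: str) -> Optional[str]:
    """Return the path segment that follows "view", or None if it is absent."""
    # urlparse only separates host and path reliably when a scheme is present
    if "://" not in url:
        url = "https://" + url

    segments = [s for s in urlparse(url).path.split("/") if s]
    if "view" in segments:
        idx = segments.index("view")
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


# "abc123" is a placeholder, not a real Dataset hash
print(dataset_hash_from_url("app.encord.com/projects/view/abc123/summary"))  # abc123
```

The function returns None when the URL has no view segment, so callers can detect a URL that does not match the assumed pattern.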