Uploading data in bulk¶

This tutorial follows a Flattened Data Layout, using a project that has this example layout:

.
├── biospecimen_experiment_1
│   ├── fileA.txt
│   └── fileB.txt
├── biospecimen_experiment_2
│   ├── fileC.txt
│   └── fileD.txt
├── single_cell_RNAseq_batch_1
│   ├── SRR12345678_R1.fastq.gz
│   └── SRR12345678_R2.fastq.gz
└── single_cell_RNAseq_batch_2
    ├── SRR12345678_R1.fastq.gz
    └── SRR12345678_R2.fastq.gz

Tutorial Purpose¶

In this tutorial you will:

Find the Synapse ID of your project
Create a manifest CSV file to upload data in bulk
Upload all of the files for your project
Add an annotation to all of your files
Add a provenance/activity record to one of your files

Preferred API

The recommended way to upload files in bulk is Project.sync_to_synapse (or Folder.sync_to_synapse). The legacy synapseutils.syncToSynapse is deprecated and will be removed in v5.0.0.

Uploading Very Large Files

The bulk upload approach using Project.sync_to_synapse() is optimized for uploading many files efficiently. However, if you are uploading very large files (>100 GiB each), consider using sequential uploads with the async API instead.

For very large file uploads, see the execute_walk_file_sequential() function in uploadBenchmark.py as a reference implementation. This approach uses asyncio.run(file.store_async()) with the newer async API, which has been optimized for handling very large files efficiently. In benchmarks, this pattern successfully uploaded 45 files of 100 GB each (4.5 TB total) in approximately 20.6 hours.
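
The sequential pattern described above can be sketched roughly as follows. This is not the code from uploadBenchmark.py; `upload_sequentially` and `upload_one` are hypothetical names used only for illustration, and the actual upload call is supplied by the caller.

```python
import asyncio
from typing import Any, Callable, Coroutine, Iterable, List


def upload_sequentially(
    paths: Iterable[str],
    upload_one: Callable[[str], Coroutine[Any, Any, object]],
) -> List[str]:
    """Upload each path one after another and return the paths that finished.

    `upload_one` is a hypothetical coroutine function supplied by the caller,
    for example one that awaits File(path=..., parent_id=...).store_async().
    """
    finished = []
    for path in paths:
        # One event loop run per file: the next upload starts only after
        # the previous one has completed.
        asyncio.run(upload_one(path))
        finished.append(path)
    return finished
```

With a logged-in client, `upload_one` could be an `async def` that awaits `File(path=path, parent_id=my_project_id).store_async()`, where `File` is imported from `synapseclient.models`. Keeping the upload call as a parameter also makes the loop easy to try out with a stand-in coroutine before pointing it at real data.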

Prerequisites¶

Make sure that you have completed the following tutorials:
- Project
This tutorial is set up to upload the data from ~/my_ad_project. Make sure that this directory, or another directory of your choice, exists.
Pandas is used in this tutorial. Refer to our installation guide to install it. Feel free to skip this portion of the tutorial if you do not wish to use Pandas. You may also use external tools to open and manipulate CSV files.

1. Find the Synapse ID of your project¶

First let's set up some constants we'll use in this script, and find the ID of our project.

import os

import synapseclient
from synapseclient.models import Project

syn = synapseclient.Synapse()
syn.login()

# Create some constants to store the paths to the data
DIRECTORY_FOR_MY_PROJECT = os.path.expanduser(os.path.join("~", "my_ad_project"))
PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.csv"))

# Step 1: Let's find the synapse ID of our project:
my_project_id = syn.findEntityId(
    name="My uniquely named project about Alzheimer's Disease"
)
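
A lookup by name can come back empty, for example when the project name is misspelled. The sketch below wraps the lookup in a small guard so the script stops early instead of writing an empty value into the manifest. `require_entity_id` is a hypothetical helper, not part of synapseclient, and it assumes the lookup reports a missing entity as `None`.

```python
from typing import Callable, Optional


def require_entity_id(lookup: Callable[..., Optional[str]], name: str) -> str:
    """Return the ID found by `lookup(name=name)` or raise if nothing matched.

    `lookup` is any callable with the same shape as syn.findEntityId; this
    helper is hypothetical and assumes a missing entity is reported as None.
    """
    entity_id = lookup(name=name)
    if entity_id is None:
        raise ValueError(f"No Synapse entity named {name!r} was found")
    return entity_id
```

It could be called as `require_entity_id(syn.findEntityId, "My uniquely named project about Alzheimer's Disease")` in place of the direct lookup.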

2. Create a manifest CSV file to upload data in bulk¶

Let's walk our local directory and build a CSV manifest with the required path and parentId columns. In a future release Project.sync_from_synapse will support writing a manifest CSV directly; for now we build one with pandas.

# Walk the local directory tree and build a manifest with the required "path" and
# "parentId" columns.  Folders that do not yet exist in Synapse are created
# automatically by sync_to_synapse, so we set parentId to the project for every file.
# TODO: https://sagebionetworks.jira.com/browse/SYNPY-1804
# In a future release, Project.sync_from_synapse will support writing a manifest CSV directly, removing the need to build one manually.
import pandas as pd

rows = []
for dirpath, _dirnames, filenames in os.walk(DIRECTORY_FOR_MY_PROJECT):
    for filename in filenames:
        rows.append(
            {
                "path": os.path.join(dirpath, filename),
                "parentId": my_project_id,
            }
        )

df = pd.DataFrame(rows)
df.to_csv(PATH_TO_MANIFEST_FILE, index=False)

# Step 3: After generating the manifest file, we can upload the data in bulk
project = Project(id=my_project_id)

After running this code, the CSV file it creates will look similar to this:

path,parentId
/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R2.fastq.gz,syn60109500
/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R1.fastq.gz,syn60109500
/home/user_name/my_ad_project/biospecimen_experiment_2/fileD.txt,syn60109500
/home/user_name/my_ad_project/biospecimen_experiment_2/fileC.txt,syn60109500
/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R2.fastq.gz,syn60109500
/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz,syn60109500
/home/user_name/my_ad_project/biospecimen_experiment_1/fileA.txt,syn60109500
/home/user_name/my_ad_project/biospecimen_experiment_1/fileB.txt,syn60109500
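
If you prefer not to use pandas, the same two-column manifest can be produced with the standard library. The following is a minimal sketch under that assumption; `build_manifest_rows` and `write_manifest` are hypothetical helper names, not part of synapseclient.

```python
import csv
import os
from typing import Dict, List


def build_manifest_rows(directory: str, parent_id: str) -> List[Dict[str, str]]:
    """Walk `directory` and return one {"path", "parentId"} row per file."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in sorted(filenames):
            rows.append(
                {"path": os.path.join(dirpath, filename), "parentId": parent_id}
            )
    return rows


def write_manifest(rows: List[Dict[str, str]], manifest_path: str) -> None:
    """Write the rows to a CSV file with a path,parentId header."""
    with open(manifest_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["path", "parentId"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_manifest(build_manifest_rows(DIRECTORY_FOR_MY_PROJECT, my_project_id), PATH_TO_MANIFEST_FILE)` would write a file with the same columns as the pandas version.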

3. Upload the data in bulk¶

# Upload every file listed in the manifest to the project
project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)

While this is running you'll see output in your console similar to:

Validating manifest: /home/user_name/manifest-for-upload.csv
Validating that all paths exist...
Validating that all files are unique...
Validating that all the files are not empty...
Validating file names...
Validating provenance and parent containers...
About to upload 8 files with a total size of 8 bytes.
Uploading 8 files: 100%|███████████████████| 8.00/8.00 [00:01<00:00, 6.09B/s]
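
The messages above name several checks that run before the upload starts. A rough local pre-check along the same lines can catch problems earlier. The sketch below is only an illustration and is not the library's validation code; `precheck_manifest_paths` is a hypothetical helper.

```python
import os
from typing import Iterable, List


def precheck_manifest_paths(paths: Iterable[str]) -> List[str]:
    """Return human-readable problems found in a list of manifest paths."""
    problems = []
    seen = set()
    for path in paths:
        # A path listed twice is reported once per repeat.
        if path in seen:
            problems.append(f"duplicate path: {path}")
            continue
        seen.add(path)
        if not os.path.isfile(path):
            problems.append(f"missing file: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"empty file: {path}")
    return problems
```

Passing it the "path" column of the manifest returns an empty list when nothing looks wrong.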

4. Add an annotation to our manifest file¶

At this point in the tutorial we will use pandas to manipulate the CSV manifest. If you are not comfortable with pandas you may use any tool that can open and manipulate CSV files such as Excel or Google Sheets.

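# Step 4: Let's add an annotation to our manifest file
# Pandas is a powerful data manipulation library in Python, although it is not required
# for this tutorial, it is used here to demonstrate how you can manipulate the manifest
# file before uploading it to Synapse.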

# Read CSV file into a pandas DataFrame
df = pd.read_csv(PATH_TO_MANIFEST_FILE)

# Add a new column to the DataFrame
df["species"] = "Homo sapiens"

# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, index=False)

project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)

Now that you have uploaded and annotated your files, you can inspect your data on the Files tab of your project in the Synapse web UI. Each file will have the single annotation that you added in the previous step. In more advanced workflows you'll likely need to build a more complex manifest file, but this should give you a good starting point.
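
As one illustration of a slightly richer manifest, annotation columns can also vary per file. The sketch below uses the standard library instead of pandas and adds, next to the constant "species" column, a column derived from the folder that holds each file. The "experiment" column name and the `add_annotation_columns` helper are assumptions made for this example.

```python
import csv
import os
from typing import Dict, List


def add_annotation_columns(manifest_path: str) -> List[Dict[str, str]]:
    """Add a constant and a folder-derived annotation to every manifest row."""
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    for row in rows:
        row["species"] = "Homo sapiens"
        # Hypothetical annotation: the name of the folder that holds the file.
        row["experiment"] = os.path.basename(os.path.dirname(row["path"]))

    # Keep the existing columns first, followed by the new ones.
    fieldnames = list(rows[0].keys()) if rows else ["path", "parentId"]
    with open(manifest_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The function rewrites the manifest in place, so the file can be passed to the upload step afterwards in the same way as before.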

5. Create an Activity/Provenance¶

Let's create an Activity/Provenance record for one of our files. In other words, we will record the steps taken to generate the file.

This code finds a row in our CSV file and points its "used" column at the path of another file in the manifest, which creates a relationship between the two files. This is a simple example of how you can create a provenance record in Synapse. We'll also link to a sample URL that describes a process we may have executed to generate the file.

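# Step 5: Let's create an Activity/Provenance
# First let's find the row in the CSV we want to update. This code finds the row number
# that we would like to update.
row_index = df[
    df["path"] == f"{DIRECTORY_FOR_MY_PROJECT}/biospecimen_experiment_1/fileA.txt"
].index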


# After finding the row we want to update let's go ahead and add a relationship to
# another file in our manifest. This allows us to say "We used 'this' file in some way".
df.loc[row_index, "used"] = (
    f"{DIRECTORY_FOR_MY_PROJECT}/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz"
)

# Let's also link to the pipeline that we ran in order to produce these results. In a
# real scenario you may want to link to a specific run of the tool where the results
# were produced.
df.loc[row_index, "executed"] = "https://nf-co.re/rnaseq/3.14.0"

# Let's also add a description for this Activity/Provenance
df.loc[row_index, "activityDescription"] = (
    "Experiment results created as a result of the linked data while running the pipeline."
)

# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, index=False)

project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)

After running this code we may again inspect the Synapse web UI. In this screenshot I've navigated to the Files tab and selected the file that we added a Provenance record to.

[Screenshot: edit provenance button]

[Screenshot: edit provenance screen]
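
The provenance columns can also be filled in on plain rows, without pandas. The sketch below is a minimal illustration that works on manifest rows held as dictionaries. `add_provenance` is a hypothetical helper, and it returns the number of rows it changed so that a path that matches nothing is easy to notice.

```python
from typing import Dict, List


def add_provenance(
    rows: List[Dict[str, str]],
    target_path: str,
    used: str,
    executed: str,
    description: str,
) -> int:
    """Fill the provenance columns on rows whose path equals `target_path`."""
    updated = 0
    for row in rows:
        # Give every row the three columns so the CSV stays rectangular.
        row.setdefault("used", "")
        row.setdefault("executed", "")
        row.setdefault("activityDescription", "")
        if row["path"] == target_path:
            row["used"] = used
            row["executed"] = executed
            row["activityDescription"] = description
            updated += 1
    return updated
```

A return value of 0 means that no row had the given path, which usually points to a typo or a path separator mismatch.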

Source code for this tutorial¶


"""
Here is where you'll find the code for the uploading data in bulk tutorial.
"""

import os

import synapseclient
from synapseclient.models import Project

syn = synapseclient.Synapse()
syn.login()

# Create some constants to store the paths to the data
DIRECTORY_FOR_MY_PROJECT = os.path.expanduser(os.path.join("~", "my_ad_project"))
PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.csv"))

# Step 1: Let's find the synapse ID of our project:
my_project_id = syn.findEntityId(
    name="My uniquely named project about Alzheimer's Disease"
)

# Step 2: Create a manifest CSV file to upload data in bulk
# Walk the local directory tree and build a manifest with the required "path" and
# "parentId" columns.  Folders that do not yet exist in Synapse are created
# automatically by sync_to_synapse, so we set parentId to the project for every file.
# TODO: https://sagebionetworks.jira.com/browse/SYNPY-1804
# In a future release, Project.sync_from_synapse will support writing a manifest CSV directly, removing the need to build one manually.
import pandas as pd

rows = []
for dirpath, _dirnames, filenames in os.walk(DIRECTORY_FOR_MY_PROJECT):
    for filename in filenames:
        rows.append(
            {
                "path": os.path.join(dirpath, filename),
                "parentId": my_project_id,
            }
        )

df = pd.DataFrame(rows)
df.to_csv(PATH_TO_MANIFEST_FILE, index=False)

# Step 3: After generating the manifest file, we can upload the data in bulk
project = Project(id=my_project_id)
project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)

# Step 4: Let's add an annotation to our manifest file
# Pandas is a powerful data manipulation library in Python, although it is not required
# for this tutorial, it is used here to demonstrate how you can manipulate the manifest
# file before uploading it to Synapse.

# Read CSV file into a pandas DataFrame
df = pd.read_csv(PATH_TO_MANIFEST_FILE)

# Add a new column to the DataFrame
df["species"] = "Homo sapiens"

# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, index=False)

project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)

# Step 5: Let's create an Activity/Provenance
# First let's find the row in the CSV we want to update. This code finds the row number
# that we would like to update.
row_index = df[
    df["path"] == f"{DIRECTORY_FOR_MY_PROJECT}/biospecimen_experiment_1/fileA.txt"
].index


# After finding the row we want to update let's go ahead and add a relationship to
# another file in our manifest. This allows us to say "We used 'this' file in some way".
df.loc[row_index, "used"] = (
    f"{DIRECTORY_FOR_MY_PROJECT}/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz"
)

# Let's also link to the pipeline that we ran in order to produce these results. In a
# real scenario you may want to link to a specific run of the tool where the results
# were produced.
df.loc[row_index, "executed"] = "https://nf-co.re/rnaseq/3.14.0"

# Let's also add a description for this Activity/Provenance
df.loc[row_index, "activityDescription"] = (
    "Experiment results created as a result of the linked data while running the pipeline."
)

# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, index=False)

project.sync_to_synapse(manifest_path=PATH_TO_MANIFEST_FILE, send_messages=False)

References used in this tutorial¶

syn.login
syn.findEntityId
Project.sync_to_synapse
Manifest CSV format
Activity/Provenance