Step-by-step Guidance to Run GUANACO

Command-line Usage

Basic command-line options:

options:
  -h, --help            Show this help message and exit
  -c CONFIG, --config CONFIG
                        Name of configuration JSON file (relative to --data-dir)
                        (default: guanaco.json)
  -d DATA_DIR, --data-dir DATA_DIR
                        Directory containing AnnData files referenced in config
                        (default: current directory)
  -p PORT, --port PORT  Port to run the Dash server on (default: 4399)
  --host HOST           Host to run the Dash server on (default: 0.0.0.0)
  --debug               Run server in debug mode (default: False)
  --max-cells MAX_CELLS Maximum number of cells to load per dataset (default: 10000)
  --backed-mode         Enable backed mode for memory-efficient loading of
                        large datasets (default: False)

Tip

Quick start: If the config file and data files are in the current directory, you can simply run:

guanaco -c config.json

Config File Structure

The GUANACO configuration file is written in JSON format and has two main sections:

Studies

Each study is defined by a dictionary entry, where the key is the study name. A study can include either matrix-based data, track-based data, or both:
- description (optional but recommended)
  
  A short text description of the study. Displayed in the interface to help distinguish between datasets.
- Matrix-based data
  - sc_data (required): Name of a .h5ad or .h5mu file.
  - markers (optional): List of marker genes for plots (heatmaps, dot plots, violin plots, stacked bar plots, pseudotime plots).
- Track-based data
  - bucket_urls (required): List of URLs pointing to S3 buckets containing BigWig or other supported files. See example and instruction Additional Information 1.
  - genome (required): Reference genome build (e.g. hg38, mm10). All supported genome are Additional Information 2.
  - ATAC_name (optional): List of custom display names for ATAC tracks, aligned with bucket_urls. Defaults to Track1, Track2 etc.
  - max_height (optional): List of maximum display heights for tracks, aligned with bucket_urls. Defaults to autoscale.
General customization

Fields outside of individual studies apply globally to the entire GUANACO session:
- title (optional): Title of the GUANACO instance.
- color (optional): List of custom HEX colors applied to clusters and tracks.

Example Config File

{
  "Study1": {
    "description": "PBMC single-cell RNA and ATAC integration study",
    "sc_data": "pbmc.h5ad",
    "markers": ["MS4A1", "LYZ"],
    "genome": "hg38",
    "bucket_urls": ["https://atac-bucket-1/"],
    "ATAC_name": ["PBMC-ATAC"],
    "max_height": [20]
  },
  "Study2": {
    "description": "Mouse skin study",
    "sc_data": "skin.h5mu"
  },
  "title": "GUANACO",
  "color": ["#1f77b4", "#ff7f0e", "#2ca02c"]
}

Once the config file is set and the AnnData and BigWig files are prepared, GUANACO can be accessed locally or deployed on a server. The data will be displayed in the designated areas of the platform, as shown in the figures below.

Additional Information

1. Track Data Configuration (Formats, URLs, Policies, CORS)

Users should upload BigWig or other supported files to a cloud storage service that provides public URLs. AWS S3 is recommended for simplicity and scalability.

Supported URL formats

AWS-style: https://bucket.s3.region.amazonaws.com
Path-style: https://host/bucket[/prefix...]

Supported file formats

BigWig: .bigwig, .bw (ATAC peaks)

If you need to convert an .h5ad or .h5mu file to BigWig, refer to the Jupyter notebook example provided with scCAMEL: TACoWig_Template.ipynb

Interaction files: .bedpe
Annotation tracks (motifs, SNPs, etc.): .bed, .bigBed (.bb)

You can generate motif tracks in BigBed format in two ways:
1. Using the JASPAR TFBS extraction tool: https://jaspar.elixir.no/tfbs_extraction/
2. Using our provided script: https://github.com/Systems-Immunometabolism-Lab/guanaco-viz/blob/main/motif_extraction

Bucket policy for public read access

Your S3 bucket must allow public read access so that GUANACO can fetch the files. Use the following bucket policy as an example (replace name-of-your-bucket with your bucket name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicListBucket",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::name-of-your-bucket"
    },
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::name-of-your-bucket/*"
    }
  ]
}

CORS configuration

To enable cross-origin requests (web browser access), configure CORS on your bucket. Example configuration:

[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": ["ETag"],
    "MaxAgeSeconds": 3000
  }
]

⚠️ Security note: For better security, replace "*" in AllowedOrigins with a specific domain (e.g. "http://example.com") to restrict access.

2. Supported genome builds for track-based data

GUANACO supports the following reference genomes (UCSC 2bit format):

Human: hg38, hg19, hg18

Mouse: mm39, mm10, mm9

Rat: rn6, rn5

Zebrafish: danRer11, danRer10

Fruit fly: dm6, dm3

Worm (C. elegans): ce11, ce10

Yeast: sacCer3

Chicken: galGal6

Xenopus: xenTro9

Dog: canFam3

Cow: bosTau9

Pig: susScr11

Macaque: rheMac10

Each build corresponds to a UCSC-hosted 2bit file, e.g.: https://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit