1 Introduction

Chimeric Antigen Receptor T-cell (CAR-T) therapy represents a revolutionary approach to cancer treatment, where a patient’s T-cells are genetically modified to recognize and attack cancer cells. The success of CAR-T therapy critically depends on identifying suitable target antigens—proteins expressed on cancer cells that can serve as targets for the engineered T-cells.

This vignette provides a comprehensive workflow for evaluating potential CAR-T targets, guiding non-bioinformaticians through essential computational analyses. We demonstrate this workflow using ERBB2 (also known as HER2/neu), a receptor tyrosine kinase overexpressed in approximately 25-30% of breast cancers (Slamon, D.J. et al.) and an established target for monoclonal antibody therapy with trastuzumab (Yoon, J. et al.).

1.1 Target Audience and Methodology

This vignette is designed for clinicians and researchers without coding experience, utilizing web-based tools with graphical interfaces. While many tools necessarily operate through web servers, we recommend adopting computational approaches whenever possible for improved reproducibility, traceability, and reduced error rates. Collaborating with a bioinformatician can help implement more robust, automated workflows for comprehensive target assessment.

2 Retrieve Isoforms: Ensembl

Understanding protein isoform diversity is crucial for CAR-T target assessment, as different isoforms may have altered membrane topology, expression patterns, or epitope accessibility. We begin by identifying all coding isoforms of the potential CAR-T target using Ensembl.

Step 1: Search for the gene

  • Go to https://www.ensembl.org/
  • Enter the gene name (e.g. “ERBB2”) in the search box
  • Select “Human” from the species dropdown if not already selected
  • Click “Go” or press Enter

Step 2: Navigate to the gene page

  • From the search results, click on the gene entry (usually the first result showing the gene symbol, Ensembl gene ID and description)
  • This takes you to the main gene summary page

Step 3: Access the transcript information

  • Locate the “Transcripts” section on the gene summary page
  • Click “Show transcript table” to view all transcript variants

Step 4: Identify coding isoforms

  • In the transcripts table, examine the “Biotype” column - coding isoforms are labeled as “protein_coding”
  • Sort by this column to focus on protein-coding transcripts
  • Each row shows transcript ID, number of amino acids (Protein), UniProt match, and other relevant information

Ensembl transcripts table showing protein-coding isoforms with transcript IDs, protein lengths, and UniProt matches.

Step 5: Examine individual isoforms

  • Click on any transcript ID (e.g., ENST00000269571) to view detailed information
  • This shows the transcript structure, exons, protein sequence, and other annotations
  • The transcript page also indicates if it is the canonical/principal isoform

Step 6: Export transcript data

  • The transcript table can be downloaded and used for further analyses
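
For readers working with a bioinformatician, the same transcript table can also be retrieved programmatically. The sketch below is a minimal illustration, not a definitive implementation. It assumes the Ensembl REST endpoint "lookup/symbol" called with "expand=1", and it assumes the JSON field names "Transcript", "biotype", "is_canonical", and "Translation"; these should be checked against the current Ensembl REST documentation. The function names are hypothetical helpers introduced here.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def fetch_gene(symbol, species="homo_sapiens"):
    """Download the gene record with its transcripts (needs internet access)."""
    url = f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}?expand=1"
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def protein_coding_transcripts(gene_record):
    """Return (transcript ID, protein length, is canonical) per coding transcript.

    The field names used here are assumptions about the REST response.
    """
    rows = []
    for transcript in gene_record.get("Transcript", []):
        if transcript.get("biotype") != "protein_coding":
            continue  # skip non-coding biotypes such as retained introns
        translation = transcript.get("Translation") or {}
        rows.append((
            transcript["id"],
            translation.get("length"),
            bool(transcript.get("is_canonical")),
        ))
    return rows


# Example (requires internet access):
# for row in protein_coding_transcripts(fetch_gene("ERBB2")):
#     print(*row, sep="\t")
```

Keeping the filtering step separate from the download makes it possible to re-run the filter on a saved copy of the record, which helps with traceability.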

3 Retrieve Isoform Sequences: UniProt

Protein sequences are essential for downstream computational analyses. They can be found in UniProt, which provides high-quality, manually curated protein sequences, though not all Ensembl-predicted isoforms may be represented. UniProt applies rigorous curation standards requiring experimental evidence, while Ensembl includes all computationally predicted coding transcripts. This discrepancy means some predicted isoforms may lack UniProt entries.

To retrieve FASTA sequences of a gene and its protein-coding isoforms from UniProt:

Step 1: Search for the protein

  • Go to https://www.uniprot.org/
  • Enter the gene name (e.g., “ERBB2”) or Ensembl transcript ID in the search box
  • Select “UniProtKB” from the dropdown menu, if not already selected
  • Click search or press Enter

Step 2: Identify the canonical entry

  • On the left, in the section “Popular organisms”, select “Human”
  • From the search results, look for entries marked as “Reviewed” (Swiss-Prot) - these are manually curated
  • Click on the main protein entry (e.g., P04626 for ERBB2_HUMAN)

Step 3: Download FASTA sequences

  • Click on “Download”
  • In the download menu:
    • Select “Entry” under “Dataset”
    • Choose “FASTA (canonical & isoform)” under “Format”
  • Click “Download” to obtain all the sequence variants
UniProt download interface showing options for retrieving FASTA sequences of canonical and isoform variants.
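
The download described above can also be scripted. The following is a minimal sketch, assuming the UniProt REST search endpoint with the "includeIsoform=true" and "format=fasta" parameters; the helper names are hypothetical. The parser keeps the full header line of each record as the key, so canonical and isoform entries remain distinguishable.

```python
import urllib.parse
import urllib.request

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def isoform_fasta_url(accession):
    """Build a URL that requests canonical and isoform sequences as FASTA."""
    query = urllib.parse.urlencode({
        "query": f"accession:{accession}",
        "format": "fasta",
        "includeIsoform": "true",
    })
    return f"{UNIPROT_SEARCH}?{query}"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line  # sequences may span several lines
    return sequences


def download_isoforms(accession):
    """Fetch and parse all isoform sequences (needs internet access)."""
    url = isoform_fasta_url(accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Example (requires internet access):
# for header, sequence in download_isoforms("P04626").items():
#     print(header, len(sequence))
```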

Step 4: Individual isoform access

4 Investigate Subcellular Localization with DeepLoc2.1

Subcellular localization prediction is fundamental for CAR-T target validation. Targets must localize to the cell membrane to be accessible for CAR recognition. DeepLoc2.1 provides state-of-the-art localization and membrane association predictions.

DeepLoc2.1 uses transformer-based protein language models to predict where proteins localize within cells. The method analyzes amino acid sequences to identify sorting signals and structural features that determine cellular targeting. The deep learning model was trained on thousands of experimentally validated protein localizations and can predict multiple simultaneous localizations, reflecting the biological reality that some proteins function in multiple cellular compartments.

Step 1: Access the web server

Step 2: Input your protein sequence

  • Paste your protein FASTA sequences (obtained from UniProt) into the text box, or upload a FASTA file containing your sequence(s)

Step 3: Submit the prediction

  • Click “Submit” to start the analysis

Step 4: Interpret the results

The results page contains several key sections for each analysed isoform. For comprehensive interpretation of predicted localizations, confidence scores, and attention plots, refer to the DeepLoc2.1 user manual.
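
When many isoforms are analysed, the exported results can be screened with a short script. The sketch below is illustrative only. It assumes that the results were saved as a CSV file, that the columns are named "Protein_ID" and "Localizations", and that multiple localizations are separated by "|". All three are assumptions about the export format and must be checked against the header of your own file; the function name is hypothetical.

```python
import csv
import io


def membrane_localized(csv_text, id_column="Protein_ID",
                       loc_column="Localizations", target="Cell membrane"):
    """Map each protein ID to True if `target` is among its predicted localizations.

    The default column names and the "|" separator are assumptions about
    the exported CSV; adjust them to match your own results file.
    """
    results = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        localizations = [item.strip() for item in row[loc_column].split("|")]
        results[row[id_column]] = target in localizations
    return results


# Example with a saved results file (the file name is hypothetical):
# with open("deeploc_results.csv", encoding="utf-8") as handle:
#     print(membrane_localized(handle.read()))
```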

5 Examine Target Expression Across Tissues (TCGA and GTEx): Xena Browser

Expression analysis across healthy and cancer tissues is crucial for assessing target specificity and the risk of on-target, off-tumor toxicity, which arises when healthy tissues also express the target. The UCSC Xena Browser provides access to uniformly processed TCGA (cancer) and GTEx (normal tissue) expression data.

Note: The Xena Browser interface can be challenging to navigate for new users. While the following workflow provides basic functionality, we strongly recommend utilizing computational approaches for more comprehensive and reliable expression analysis (refer to the coding vignette for advanced methods).

Step 1: Access the Xena Browser

Step 2: Select the TCGA TARGET GTEx cohort

  • In the study selection box, type “TCGA TARGET GTEx”
  • Select “TCGA TARGET GTEx” from the dropdown menu
  • Click “To first variable”

This cohort contains uniformly processed samples from TCGA (cancer), TARGET (pediatric cancer), and GTEx (normal tissue).

Step 3: Add your gene of interest

  • Type your gene name (e.g., “ERBB2”) in the variable search box and select it from the dropdown menu
  • Select “Gene Expression” under “Dataset” in the “Basic” tab
  • Click “To second variable”
Xena Browser interface showing gene selection and data type options.

Step 4: Add the phenotype categories

  • Select “Phenotypic” under “Select Data Type”
  • In the “Phenotype” dropdown menu, select “Main category”
  • Press “Done”

Step 5: Visualize and interpret the data

  • Click the bar chart icon in the top right
  • Select “Compare subgroups”
  • Configure the comparison:
    • Show data from: Column B - ERBB2 gene expression
    • Subgroup samples by: Column C - main category
  • Choose the visualization type. We recommend the box plot because it shows medians and outliers clearly and supports statistical comparison between cancer and normal tissues.
Box plot visualization comparing ERBB2 expression across cancer types and normal tissues.

Step 6: Download results

From here you can:

  • Export the visualization as an image
  • Download the underlying data as TSV files for further analysis (top right of the page)
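
Once the TSV file is downloaded, per-group summaries can be computed outside the browser. The sketch below is a minimal example under stated assumptions: the file is tab-separated with a header row, one column holds the sample category, and one holds the expression value. The column names are passed in as arguments because they depend on the export, and the function name is hypothetical.

```python
import csv
import io
import math
import statistics
from collections import defaultdict


def median_by_group(tsv_text, group_column, value_column):
    """Return {group: median value}, skipping missing or non-numeric cells."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            value = float(row[value_column])
        except (TypeError, ValueError):
            continue  # empty cells or placeholders such as "NA"
        if math.isnan(value):
            continue
        values[row[group_column]].append(value)
    return {group: statistics.median(v) for group, v in values.items()}


# Example with a downloaded file (file and column names are hypothetical):
# with open("xena_export.tsv", encoding="utf-8") as handle:
#     medians = median_by_group(handle.read(), "main category", "ERBB2")
```

Medians are used here because expression distributions are often skewed, which matches the box plot view recommended above.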

5.1 Transcript-Level Expression Analysis

For isoform-specific expression analysis:

  • Go to the transcript-dedicated section: https://xenabrowser.net/transcripts/
  • Type the gene name in the search box
  • In the dropdown menus, select the two tissues you want to compare (e.g., TCGA Breast invasive carcinoma and GTEX Breast)
  • Examine the expression of each isoform of the gene in the selected healthy and cancer tissues
Transcript-level expression comparison showing isoform-specific patterns across cancer and normal tissues.

6 Investigate Membrane Topology: DeepTMHMM

Membrane topology prediction is essential for understanding protein architecture and identifying accessible extracellular domains for CAR targeting. DeepTMHMM provides accurate predictions for both α-helical and β-barrel transmembrane proteins.

DeepTMHMM employs deep learning protein language models to predict transmembrane protein topology. The method analyzes amino acid sequences to identify hydrophobic transmembrane regions, signal peptides, and the orientation of protein domains relative to cellular membranes. The model integrates evolutionary information and physical properties of amino acids to achieve state-of-the-art accuracy in distinguishing between cytoplasmic, extracellular, and membrane-spanning regions.

Step 1: Access DeepTMHMM

Step 2: Prepare your protein sequences

  • Use the FASTA sequences obtained from UniProt for your target protein and all its isoforms
  • Ensure sequences are in proper FASTA format with descriptive headers

Step 3: Submit the analysis

  • Paste your protein sequence(s) in FASTA format into the text box, or upload it from your computer
  • Click “Submit” to start the prediction
DeepTMHMM submission interface for protein topology prediction.

Step 4: Interpret the results

DeepTMHMM provides comprehensive topology predictions through several output sections. For detailed interpretation of topology classifications, coordinate systems, and confidence assessments, refer to the DeepTMHMM user manual.
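
The per-residue topology labels can be converted into residue ranges for later comparison with epitope positions. The sketch below assumes the "3line" output style, in which each record consists of a header line, the amino acid sequence, and a label string of equal length, with "O" marking residues outside the cell, "M" membrane-spanning residues, "I" inside, and "S" signal peptide. This layout is an assumption to verify against your own output file, and the function names are hypothetical.

```python
from itertools import groupby


def parse_3line(text):
    """Parse 3line text into {header: (sequence, labels)}."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = {}
    for i in range(0, len(lines) - 2, 3):
        header, sequence, labels = lines[i:i + 3]
        if not header.startswith(">") or len(sequence) != len(labels):
            raise ValueError(f"Unexpected record starting at: {header!r}")
        records[header[1:]] = (sequence, labels)
    return records


def regions(labels, wanted="O"):
    """Return 1-based inclusive (start, end) ranges where the label is `wanted`."""
    found = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        if label == wanted:
            found.append((position, position + length - 1))
        position += length
    return found


# Example on a toy record (not a real protein):
# records = parse_3line(">iso1 | SP+TM\nMKTAYIAKQRQIS\nSSSOOOOMMMIII\n")
# sequence, labels = records["iso1 | SP+TM"]
# extracellular = regions(labels, "O")
```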

7 Investigate the 3D Structure: PDB or AlphaFold3

Three-dimensional structural analysis provides crucial insights into epitope accessibility, surface exposure, and potential binding sites. We examine both experimental structures (PDB) and computational predictions (AlphaFold3).

7.1 Finding Experimental Structures in the Protein Data Bank (PDB)

Step 1: Access the RCSB PDB

  • Go to https://www.rcsb.org/
  • The RCSB PDB is the primary repository for experimentally determined protein structures

Step 2: Search for your protein

  • Enter your protein name (e.g., “ERBB2”) or UniProt ID (e.g., “P04626”) in the search box
  • Click the search button to retrieve all related structures
PDB search results showing available experimental structures for ERBB2.

Note: PDB often lacks complete structures of potential CAR-T targets, particularly full-length membrane proteins. In such cases, AlphaFold3 (detailed below) provides comprehensive structural predictions. However, PDB is particularly valuable when it contains antibody-bound complexes, as these reveal clinically validated epitopes that could potentially be used in CAR-T therapy design.

Example PDB structure showing antibody-bound complex.

7.2 Predicting Structures with AlphaFold3

For isoforms or complete proteins lacking experimental structures, AlphaFold3 provides highly accurate computational predictions.

AlphaFold3 uses a deep learning architecture that combines attention-based (transformer-style) modules with a diffusion network to predict protein structures. The method analyzes amino acid sequences and evolutionary relationships to predict how proteins fold into their three-dimensional shapes. AlphaFold3 can also model protein complexes with DNA, RNA, and small molecules, which makes it useful for studying molecular interactions.

Step 1: Check existing predictions

  • Search https://alphafold.ebi.ac.uk/ for existing predictions
  • The database contains over 200 million protein structure predictions
  • Search using UniProt ID or gene name
  • Download available predictions
AlphaFold database ERBB2 existing prediction.

Step 2: Submit new predictions (if needed)

The non-canonical isoforms of the potential CAR-T target may not be in the database. To obtain their 3D structures, predict them with AlphaFold3:

  • Navigate to the AlphaFold Server: https://alphafoldserver.com/
  • Create an account or log in
  • Input your protein sequence in FASTA format
  • For CAR-T analysis, you can predict:
    • Individual protein isoforms
    • Protein-protein complexes, for example the protein of interest bound to an antibody that could be used as part of the CAR on the T-cell
  • Click “Continue and preview job”

Step 3: Interpret AlphaFold3 results

For comprehensive understanding of confidence scores (pLDDT), predicted aligned error (PAE), and structural interpretation, refer to the AlphaFold documentation.

AlphaFold3 prediction results showing structure with confidence coloring and quality metrics.
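
Confidence over a region of interest can be summarized numerically. In PDB-format files from the AlphaFold database, the per-residue pLDDT is stored in the B-factor field, so it can be read with a few lines of code. The sketch below handles only PDB-format text (not the mmCIF files returned by the AlphaFold Server), reads one value per residue from the alpha-carbon atoms, and assumes a single chain; the function names are hypothetical.

```python
def plddt_per_residue(pdb_text):
    """Return {residue number: pLDDT} read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (1-based), hence the slices below.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def mean_plddt(scores, start, end):
    """Average pLDDT over residues start..end (inclusive) that are present."""
    window = [scores[i] for i in range(start, end + 1) if i in scores]
    if not window:
        raise ValueError("No residues found in the requested range")
    return sum(window) / len(window)


# Example with a downloaded model (the file name is hypothetical):
# with open("AF-P04626-F1-model.pdb", encoding="utf-8") as handle:
#     scores = plddt_per_residue(handle.read())
# print(mean_plddt(scores, 1, 100))
```

A low average over a candidate epitope region suggests that the predicted local structure should be interpreted with caution.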

8 Multiple Sequence Alignment

Multiple sequence alignment (MSA) is a critical step that allows comparison of protein sequences across isoforms to understand structural and functional conservation. This analysis is essential for determining whether potential epitopes are maintained across different protein variants.

Multiple sequence alignment algorithms identify regions of similarity and difference between related protein sequences by optimally aligning amino acids based on evolutionary relationships and structural constraints. MSA reveals conserved domains, variable regions, and helps predict which areas of the protein are functionally important and structurally preserved across isoforms.

At this stage, MSA becomes crucial because you need to determine:

  • Epitope conservation: Whether your target epitope spans all isoforms or only specific variants
  • Domain preservation: If the epitope consistently localizes to extracellular domains across isoforms
  • Structural context: How sequence variations might affect the epitope’s accessibility and presentation
  • Isoform selection: Which protein variants maintain the target epitope in accessible regions

Recommended Tools: Use online MSA tools such as Clustal Omega or MUSCLE to align your protein isoforms. Analyze the alignment in conjunction with your topology predictions to identify the most suitable target variants.

For computational MSA approaches and advanced analysis methods, refer to the coding vignette.

Multiple sequence alignment of ERBB2 protein isoforms using MUSCLE.
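
Once an alignment has been produced, simple conservation measures can be computed from the gapped sequences. The sketch below is a minimal example, assuming the alignment is already loaded as a dictionary of equal-length strings in which "-" denotes a gap. The function names are hypothetical, and the approach is a plain column comparison rather than the scoring used by Clustal Omega or MUSCLE.

```python
def conserved_columns(aligned):
    """Return 1-based alignment columns where every sequence has the same residue."""
    sequences = list(aligned.values())
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("Aligned sequences must all have the same length")
    return [
        index
        for index, column in enumerate(zip(*sequences), start=1)
        if "-" not in column and len(set(column)) == 1
    ]


def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# Example on toy sequences (not real isoforms):
# aligned = {"iso1": "MKTAYI", "iso2": "MK-AYL", "iso3": "MKTAYI"}
# conserved_columns(aligned) lists the columns shared by all three.
```

Note that column numbers refer to positions in the alignment, not to residue numbers in any single isoform.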

9 Epitope Identification and Validation

The goal of this workflow is to verify whether a specific epitope represents a suitable CAR-T target based on the comprehensive analyses performed in previous steps. To conduct this validation, you first need to obtain the epitope sequence, which can be found through literature searches, patent databases, or other published sources. However, there is no standardized workflow for epitope identification, as the available information varies greatly depending on the target protein and existing research.

Once you have identified a potential epitope sequence, use the results from this workflow to assess its suitability: confirm the epitope is located in accessible extracellular domains (topology analysis), verify it is conserved across therapeutically relevant isoforms (sequence alignment), ensure it is expressed in target cancers while minimally present in healthy tissues (expression analysis), and validate its structural accessibility (3D structure analysis).
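
Two of these checks, presence across isoforms and location within extracellular regions, can be combined in a short script. The sketch below is illustrative only. It assumes a linear epitope that can be found by exact string matching (conformational epitopes are not handled) and that extracellular ranges are supplied as 1-based inclusive residue intervals, for example from a topology prediction. The function names and data layout are hypothetical.

```python
def locate_epitope(sequence, epitope):
    """Return the 1-based (start, end) of the first exact match, or None."""
    index = sequence.find(epitope)
    if index == -1:
        return None
    return (index + 1, index + len(epitope))


def assess_epitope(epitope, isoforms, extracellular):
    """Report, per isoform, whether the epitope is present and fully extracellular.

    isoforms: {name: protein sequence}
    extracellular: {name: [(start, end), ...]} with 1-based inclusive ranges
    """
    report = {}
    for name, sequence in isoforms.items():
        location = locate_epitope(sequence, epitope)
        exposed = location is not None and any(
            start <= location[0] and location[1] <= end
            for start, end in extracellular.get(name, [])
        )
        report[name] = {"location": location, "extracellular": exposed}
    return report


# Example on toy data (not real ERBB2 sequences):
# isoforms = {"full": "MKTAYIAKQRQISFVK", "short": "MKTAYI"}
# topology = {"full": [(4, 12)], "short": [(1, 6)]}
# assess_epitope("AKQR", isoforms, topology)
```

Expression and structural accessibility still need to be assessed separately, as described above.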