Using Rosetta denovo and Rosetta EM refinement tools for the CryoEM challenge 2019
  May 2019
  Frank DiMaio (dimaio@uw.edu)
  Dan Farrell (danpf@uw.edu)

This document describes the Rosetta pipeline used to complete the 2019 cryoEM modelling challenge.  A Rosetta version newer than week 18 of 2019 is needed for the tutorial (older versions can perform every step except ligand docking).


I. Preparation
==============

1) Download the maps for all four targets.  They are not included, to keep this file small.  Place them in the 'inputs/' folder without renaming them.

2) Generate fragment files using Robetta (http://robetta.bakerlab.org/fragmentsubmit.jsp) for both sequences.  Be sure to check "exclude homologues".  These files are included in 'inputs/'.

3) Generate monomer maps (for initial denovo modelling) with a tight bounding box:
    - For apoferritin, the deposited monomer model was used.  Chimera's subregion selection tool was used to manually minimize the padding
    - For alcohol dehydrogenase, chain A of the deposited model was used to define a 4 Angstrom mask using Chimera's "zone" tool.
   Both maps used for the challenge are included in the inputs/ folder.

4) Build symmetry files.  The 'sym' command in Chimera was used to generate complexes from BIOMT lines.  These files were then saved, each asymmetric unit was given a unique chain ID, and the result was processed with Rosetta's make_symmdef_file.pl.
   - For apoferritin:
        make_symmdef_file.pl -m pseudo -a A -p apoferritin_symm.pdb > O.symm
   - For alcohol dehydrogenase:
        make_symmdef_file.pl -m ncs -a A -b B -p alc_dehyd_symm.pdb > C2.symm
   Both symmetry definition files used for the challenge are included in the inputs/ folder.


II. Apoferritin models (challenges 1-3)
=======================================

In all three cases, the same procedure was used, outlined below.  Only the input maps change.

1) denovo modelling.  The folder '1_denovo' contains the commands for this step.  There are four steps in total, run sequentially:
    - 1_search.sh
    - 2_score.sh
    - 3_assemble.sh
    - 4_consensus.sh

Steps 1 and 3 are run multiple times (in parallel).  Step 1 needs to be run once for each residue in the monomer.  Using GNU parallel:
    parallel -j24 ./1_search.sh ::: {1..182}
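
The upper bound of the range (182 here) is the number of residues in the monomer.  As a minimal sketch, assuming the monomer sequence is available as a single-entry FASTA file, the count can be computed rather than typed in.  The function name is made up for illustration and is not part of the tutorial scripts:

```shell
# Count residues in a single-sequence FASTA file.
# Header lines (starting with ">") are skipped; whitespace and line
# breaks inside the sequence are ignored.
count_residues() {
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d '[:space:]'
}
```

It could then be used as follows, where 'monomer.fasta' is a placeholder name:
    parallel -j24 ./1_search.sh {} ::: $(seq 1 "$(count_residues monomer.fasta)")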

Step 2 is run on a single CPU.

Step 3 needs to be run as a large number of independent trajectories:
    parallel -j24 ./3_assemble.sh {} ::: {1..250}

Step 4 outputs a single model, 'S_0001.pdb', which is used for the next step (and is copied to 'denovo.pdb' in the step 2 folder).

2) completion with Rosetta CM.  The folder '2_cm_asu' contains scripts for model completion, using the same map as the denovo modelling.  Here, 24 models were generated independently (the exact number of models is unimportant).  Using GNU parallel:
    parallel -j24 ./hybrid.sh {} ::: {1..24}

The lowest-energy model is carried over to the next step.  To get the lowest energy model:
    grep SCORE: *.sc | grep -v desc | sort -nk 2 | head -1 | awk '{print $NF}'
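
The one-liner above relies on the score being the second field.  A slightly more defensive sketch looks the column up by name instead.  It assumes each score file has a "SCORE:" header row in which one column is named total_score and the last is named description; the function name is made up for illustration:

```shell
# Print "<total_score> <description>" for the lowest-scoring model
# found in the score files given as arguments.
lowest_energy_model() {
    awk '
        # Header row: find which column holds total_score.
        $1 == "SCORE:" && $NF == "description" {
            col = 0
            for (i = 2; i <= NF; i++) if ($i == "total_score") col = i
            next
        }
        # Data row: keep the lowest score seen so far.
        $1 == "SCORE:" && col > 0 {
            if (best == "" || $col + 0 < best + 0) { best = $col; name = $NF }
        }
        END { if (name != "") print best, name }
    ' "$@"
}
```

For example:
    lowest_energy_model *.sc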

3) model refinement.  The folder '3_symm_refine' contains scripts for the final refinement.  Here, we use half map 1 for scoring, and half map 2 for validation.  Refinement is run in the context of the symmetric assembly, and B-factor fitting is carried out.

The script symm_refine.sh carries this out.  It is run in a large number (here 120) of independent trajectories:
    parallel -j24 ./symm_refine.sh best_cm.pdb {} ::: {1..120}

Output PDBs have a line in the header:
    REMARK   1 FSC[mask=3.2](10:1.8) = 0.536788 / 0.537512
The two numbers are the integrated FSCs against half map 1 ("training") and half map 2 ("testing"), respectively.  The following command lists the five models with the highest testing FSC, best last:
    grep FSC *.pdb | sort -nk 7 | tail -5
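
To see the training and testing values side by side, the same ranking can be sketched as a small function.  It assumes the REMARK line has exactly the layout shown above (testing FSC as the last field, training FSC two fields before it); the function name is made up for illustration:

```shell
# Print "<testing FSC> <training FSC> <file>" for each PDB given as an
# argument, sorted so that the highest testing (free) FSC comes first.
rank_by_free_fsc() {
    awk '$1 == "REMARK" && $3 ~ /^FSC\[/ { print $NF, $(NF-2), FILENAME }' "$@" |
        sort -k1,1nr
}
```

For example, to list the top five:
    rank_by_free_fsc *.pdb | head -5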


III. Alcohol dehydrogenase models (challenge 4)
===============================================

Model-building for alcohol dehydrogenase was largely similar to apoferritin, though with several additional steps:
  - two rounds of denovo modelling needed to be carried out
  - RosettaES was required to rebuild loops in the structure
  - A new ligand-docking method was used to build the ligand

1) denovo modelling.  The folders '1A_denovo_rd1' and '1B_denovo_rd2' contain commands for this step.  Two iterations of the four-step protocol above were used, with the output of the first serving as input for the second.

2) loop building with RosettaES.  The folder '2_rosettaES' contains commands for this step.  Unlike in challenges 1-3, the model after step 1 is still quite incomplete, with unassigned segments as long as 85 residues.  We use RosettaES to rebuild each of these segments independently.  The script 'SampleSegment.sh' carries this out.  The '-c' argument inside this script sets the number of processors to use, while the argument passed to SampleSegment.sh is the number of the loop to build.

Depending on the number of cores available, it makes sense to run each segment independently:
    parallel -j8 ./SampleSegment.sh {} ::: {1..8}
This will run all 8 loops separately, each using '-c' cores (currently 4), for a total of 32 processes.

This will create 8 folders, 'loop_N', one per loop.  When complete, each will contain 8 models with the energy appended to the file name (e.g., after_filter_35__0_-684.874.pdb).  For _each_ loop, take the lowest-energy model in its folder and copy it to the step 3 folder as 'es_loopN.pdb'.
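
The selection can be sketched as a small function.  It assumes the energy is always the text between the last underscore and '.pdb', as in the example file name above; the function name is made up for illustration:

```shell
# Print the path of the lowest-energy model in one loop folder.
# The energy is read from the file name: everything between the last
# underscore and the ".pdb" extension.
lowest_energy_in_loop() {
    for f in "$1"/*.pdb; do
        [ -e "$f" ] || continue          # folder has no models yet
        e=${f##*_}                       # drop everything up to the last "_"
        e=${e%.pdb}                      # drop the extension
        printf '%s %s\n' "$e" "$f"
    done | LC_ALL=C sort -n | head -1 | cut -d' ' -f2-
}
```

Run from the '2_rosettaES' folder, and assuming the step 3 folder sits next to it, the copies could then be made with:
    for n in 1 2 3 4 5 6 7 8; do cp "$(lowest_energy_in_loop loop_$n)" ../3_cm_asu/es_loop$n.pdb; done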

3) completion with Rosetta CM.  The folder '3_cm_asu' contains scripts for model completion, as with apoferritin.  The inputs were the lowest-energy models _for each loop_ from the previous step.

As with apoferritin, 24 models were generated independently.  Using GNU parallel:
    parallel -j24 ./hybrid.sh {} ::: {1..24}

4) model refinement.  The folder '4_symm_refine' contains scripts for the final refinement.  It is run in the same way as for apoferritin.  Copy the model with the highest free FSC to the next folder.

5) ligand docking.  The folder '5_ligdock' contains scripts for ligand docking.  Note that this is still an unpublished experimental feature that is subject to change in future versions of Rosetta.

We first need to prepare the ligand:
    - download the NAD model from the PDB's chemical component dictionary
    - generate AM1-BCC charges using Chimera's "Add charge" and save as a mol2 file
    - run the Rosetta script:
        python $ROSETTA/source/scripts/python/public/generic_potential/mol2genparams.py --nm=NAD --amide_chi -s NAD.mol2
      This will create two files, a Rosetta params file (NAD.params), and an "ideal" ligand PDB (NAD_0001.pdb).
    - place the ligand approximately in the pocket.  This was done crudely in Chimera; it does not have to be accurate.  Save this as model_lig.pdb.

We then run ligand docking on the model from the previous step.  We run 5 trajectories:
    parallel -j5 ./run_ligdock.sh model_lig.pdb {} ::: {1..5}

Each will output 20 models for a total of 100.  We refine all of these into density:
    parallel -j24 ./relax.sh {} ::: model_lig*.pdb

As in step 4, we finally select the best model by free FSC.

