Author Topic: Chain names in DSSR JSON output for PDB Biological Assemblies (Read 63083 times)

chemikeris · « **on:** December 14, 2017, 06:02:19 am »

I am using DSSR for large-scale analysis of nucleic acid structures taken from Biological Assemblies in the Protein Data Bank. For this I use the '--symmetry' option, which I find very useful.

Unfortunately, I noticed an inconsistency in the naming of chains in the DSSR JSON output, which causes trouble when parsing the DSSR results.

When chains in the Biological Assembly file come from different asymmetric units of the crystal structure, their names usually include the MODEL number and the chain name from the PDB file. In PDB entry 4ZSF, for example, we see two chains named 'B', one from MODEL 1 and the other from MODEL 2:

Code:

[justas@catfish tmp]$ x3dna-dssr -i=4zsf.pdb1 --symm --json | jq .chains
{
  "m1_chain_B": {
    "num_nts": 14,
    "bseq": "CTCGACCGGTCGAG",
    "sstr": "((((((((((((((",
    "form": "ABBBB...BBBB.-",
    "helical_rise": 3.489,
    "helical_rise_std": 0.789,
    "helical_axis": [
      0.828,
      0.076,
      0.555
    ],
    "point1": [
      -18.24,
      -31.778,
      5.458
    ],
    "point2": [
      19.36,
      -28.329,
      30.661
    ],
    "num_chars": 40,
    "suite": "C1bT!!C4bG!!A!!C1bC!!G4bG!!T!!C!!G!!A!!G"
  },
  "m2_chain_B": {
    "num_nts": 14,
    "bseq": "CTCGACCGGTCGAG",
    "sstr": "))))))))))))))",
    "form": "ABBBB...BBBB.-",
    "helical_rise": 3.489,
    "helical_rise_std": 0.789,
    "helical_axis": [
      -0.828,
      0.076,
      -0.555
    ],
    "point1": [
      18.24,
      -31.778,
      31.283
    ],
    "point2": [
      -19.36,
      -28.329,
      6.08
    ],
    "num_chars": 40,
    "suite": "C1bT!!C4bG!!A!!C1bC!!G4bG!!T!!C!!G!!A!!G"
  }
}

However, when the chains come from two asymmetric units (MODEL 1 and MODEL 2 in the input file) but have different names, no model numbers appear in the chains section of the output.
For example, in PDB entry 4ILM, Biological Assembly 2, we see only chains E and I:

Code:

[justas@catfish tmp]$ x3dna-dssr -i=4ilm.pdb2 --symm --json | jq .chains
{
  "chain_E": {
    "num_nts": 16,
    "bseq": "GCUAAUCUACUAUAGA",
    "sstr": "......((.....)).",
    "form": "A.....A......AA-",
    "helical_rise": 0.115,
    "helical_rise_std": 3.392,
    "helical_axis": [
      -0.734,
      -0.496,
      -0.464
    ],
    "point1": [
      64.548,
      -24.513,
      89.342
    ],
    "point2": [
      62.86,
      -25.653,
      88.275
    ],
    "num_chars": 46,
    "suite": "G!!C!!U!!A!!A4bU4nC1aU!!A!!C4pU2[A6pU!!A1aG1aA"
  },
  "chain_I": {
    "num_nts": 16,
    "bseq": "GCUAAUCUACUAUAGA",
    "sstr": "......((.....)).",
    "form": "A....BA......A.-",
    "helical_rise": 0.305,
    "helical_rise_std": 3.547,
    "helical_axis": [
      0.584,
      0.616,
      0.528
    ],
    "point1": [
      79.844,
      -52.783,
      70.422
    ],
    "point2": [
      83.132,
      -49.316,
      73.395
    ],
    "num_chars": 46,
    "suite": "G!!C!!U!!A!!A!!U4nC1aU!!A!!C4pU2[A6pU2aA1aG!!A"
  }
}

When analyzing the results in more detail (pairs, helices, multiplets, etc.), we see that chain E comes from MODEL 1 in the Biological Assembly file, and chain I is from MODEL 2:

Code:

[justas@catfish tmp]$ x3dna-dssr -i=4ilm.pdb2 --symm --json | jq .pairs
[
  {
    "index": 1,
    "nt1": "1:E.C7",
    "nt2": "1:E.G15",
    "bp": "C-G",
    "name": "WC",
    "Saenger": "19-XIX",
    "LW": "cWW",
    "DSSR": "cW-W"
  },
  {
    "index": 2,
    "nt1": "1:E.U8",
    "nt2": "1:E.A14",
    "bp": "U-A",
    "name": "WC",
    "Saenger": "20-XX",
    "LW": "cWW",
    "DSSR": "cW-W"
  },
  {
    "index": 3,
    "nt1": "2:I.U6",
    "nt2": "2:I.A16",
    "bp": "U+A",
    "name": "--",
    "Saenger": "n/a",
    "LW": "cWH",
    "DSSR": "cW+M"
  },
  {
    "index": 4,
    "nt1": "2:I.C7",
    "nt2": "2:I.G15",
    "bp": "C-G",
    "name": "WC",
    "Saenger": "19-XIX",
    "LW": "cWW",
    "DSSR": "cW-W"
  },
  {
    "index": 5,
    "nt1": "2:I.U8",
    "nt2": "2:I.A14",
    "bp": "U-A",
    "name": "WC",
    "Saenger": "20-XX",
    "LW": "cWW",
    "DSSR": "cW-W"
  }
]

This inconsistency causes trouble when parsing multiple DSSR output files generated for the PDB Biological Assemblies. Could the model number of the PDB chain be included everywhere in the DSSR output when the '--symmetry' option is used?
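As a stopgap on the parsing side, the model of each chain can be recovered from the nucleotide identifiers in the pairs section. The following is only a minimal Python sketch, assuming identifiers follow the `<model>:<chain>.<residue>` layout seen in the output above (with the `<model>:` part optional); the function names `split_nt_id` and `models_by_chain` are hypothetical and not part of DSSR.

```python
import re

# Assumed layout, inferred from the output above: "<model>:<chain>.<residue>",
# where the "<model>:" prefix may be absent.
NT_ID = re.compile(r"^(?:(?P<model>\d+):)?(?P<chain>[^.:]*)\.(?P<residue>.+)$")


def split_nt_id(nt_id):
    """Split an id such as '2:I.U6' into (model, chain, residue)."""
    match = NT_ID.match(nt_id)
    if match is None:
        raise ValueError(f"unrecognized nucleotide id: {nt_id!r}")
    model = int(match.group("model")) if match.group("model") else None
    return model, match.group("chain"), match.group("residue")


def models_by_chain(pairs):
    """Map each chain name to the set of model numbers it appears in."""
    result = {}
    for pair in pairs:
        for field in ("nt1", "nt2"):
            model, chain, _ = split_nt_id(pair[field])
            result.setdefault(chain, set()).add(model)
    return result
```

Here `pairs` would be the list stored under the "pairs" key of the parsed JSON. Chains that never take part in a base pair would not show up in the result, so this covers only part of the problem.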

Thank you very much in advance for your feedback.

xiangjun · « **Reply #1 on:** December 14, 2017, 08:33:46 am »

Good catch, and your detailed report is exemplary. I’ll look into the issue and get back to you on the Forum, probably by tonight.

Thanks,

Xiang-Jun

xiangjun · « **Reply #2 on:** December 14, 2017, 11:11:22 pm »

I've updated DSSR so that the model number will be included in the key to identify a chain, with the --json option. So for 4ilm.pdb2, the two chains (E on model 1, and I on model 2) will be identified as m1_chain_E and m2_chain_I respectively.

See the updated output below for 4ilm.pdb2:

Code:

# x3dna-dssr -i=4ilm.pdb2 --symm --json | jq .chains
{
  "m1_chain_E": {
    "num_nts": 16,
    "bseq": "GCUAAUCUACUAUAGA",
    "sstr": "......((.....)).",
    "form": "A.....A......AA-",
    "helical_rise": 0.115,
    "helical_rise_std": 3.392,
    "helical_axis": [
      -0.734,
      -0.496,
      -0.464
    ],
    "point1": [
      64.548,
      -24.513,
      89.342
    ],
    "point2": [
      62.86,
      -25.653,
      88.275
    ],
    "num_chars": 46,
    "suite": "G!!C!!U!!A!!A4bU4nC1aU!!A!!C4pU2[A6pU!!A1aG1aA"
  },
  "m2_chain_I": {
    "num_nts": 16,
    "bseq": "GCUAAUCUACUAUAGA",
    "sstr": "......((.....)).",
    "form": "A....BA......A.-",
    "helical_rise": 0.305,
    "helical_rise_std": 3.547,
    "helical_axis": [
      0.584,
      0.616,
      0.528
    ],
    "point1": [
      79.844,
      -52.783,
      70.422
    ],
    "point2": [
      83.132,
      -49.316,
      73.395
    ],
    "num_chars": 46,
    "suite": "G!!C!!U!!A!!A!!U4nC1aU!!A!!C4pU2[A6pU2aA1aG!!A"
  }
}
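Keys of this form can be split into model number and chain name with a single pattern. What follows is a minimal Python sketch, assuming the keys look like `m<model>_chain_<name>` as in the output above; it also tolerates the unprefixed `chain_<name>` form. The function name `parse_chain_key` is hypothetical and not part of DSSR.

```python
import re

# Assumed key layout, inferred from the output above:
# "m<model>_chain_<name>", with the "m<model>_" prefix optional.
CHAIN_KEY = re.compile(r"^(?:m(?P<model>\d+)_)?chain_(?P<chain>.+)$")


def parse_chain_key(key):
    """Return (model, chain) for a key of the 'chains' object.

    model is None when the key carries no model prefix.
    """
    match = CHAIN_KEY.match(key)
    if match is None:
        raise ValueError(f"unrecognized chain key: {key!r}")
    model = int(match.group("model")) if match.group("model") else None
    return model, match.group("chain")
```

For example, `parse_chain_key("m2_chain_I")` gives `(2, "I")`, so results from many assemblies can be indexed uniformly by (model, chain).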

Please verify that the update solves the inconsistency issue you have.

Best regards,

Xiang-Jun

chemikeris · « **Reply #3 on:** December 19, 2017, 06:13:57 am »

Sorry for the delayed reply. I checked the software on ~5500 PDB entries and found that the fixed version of DSSR works correctly.

Thank you very much for the fast update.

xiangjun · « **Reply #4 on:** December 19, 2017, 01:30:02 pm »

Quote

I checked the software on ~5500 PDB entries and found that the fixed version of DSSR works correctly.

Glad to hear that!

In retrospect, the original goal was just to have a unique key for each chain in the JSON output. So when two chains already had different identifiers, as with chains E (in model #1) and I (in model #2) in 4ilm.pdb2, the model numbers were omitted for simplicity. I did not expect that the key itself would be used or useful. Given a specific use case, the inconsistency in keys across models and chains was easily fixed.

For other viewers of this thread: if you find any inconsistency, or just feel something is missing or not right with DSSR, please do not hesitate to let me know. By reporting DSSR-related issues on the Forum, you are likely to receive a quick fix that lets you proceed with your project, and you help improve the software itself, which benefits the community at large.

Best regards,

Xiang-Jun
