I am using DSSR for large-scale analysis of nucleic acid structures taken from the Biological Assemblies in the Protein Data Bank. For this I use the '--symmetry' option, which I find very useful.
Unfortunately, I noticed an inconsistency in the naming of chains in the DSSR JSON output, which causes trouble when parsing the DSSR results.
When chains in the Biological Assembly file come from different asymmetric units of the crystal structure, their names in the DSSR output usually include the MODEL number and the chain name from the PDB file. For PDB entry 4ZSF, for example, we see two chains named 'B', one from MODEL 1 and the other from MODEL 2:
[justas@catfish tmp]$ x3dna-dssr -i=4zsf.pdb1 --symm --json | jq .chains
{
  "m1_chain_B": {
    "num_nts": 14,
    "bseq": "CTCGACCGGTCGAG",
    "sstr": "((((((((((((((",
    "form": "ABBBB...BBBB.-",
    "helical_rise": 3.489,
    "helical_rise_std": 0.789,
    "helical_axis": [
      0.828,
      0.076,
      0.555
    ],
    "point1": [
      -18.24,
      -31.778,
      5.458
    ],
    "point2": [
      19.36,
      -28.329,
      30.661
    ],
    "num_chars": 40,
    "suite": "C1bT!!C4bG!!A!!C1bC!!G4bG!!T!!C!!G!!A!!G"
  },
  "m2_chain_B": {
    "num_nts": 14,
    "bseq": "CTCGACCGGTCGAG",
    "sstr": "))))))))))))))",
    "form": "ABBBB...BBBB.-",
    "helical_rise": 3.489,
    "helical_rise_std": 0.789,
    "helical_axis": [
      -0.828,
      0.076,
      -0.555
    ],
    "point1": [
      18.24,
      -31.778,
      31.283
    ],
    "point2": [
      -19.36,
      -28.329,
      6.08
    ],
    "num_chars": 40,
    "suite": "C1bT!!C4bG!!A!!C1bC!!G4bG!!T!!C!!G!!A!!G"
  }
}
However, when the chains come from two asymmetric units (MODEL 1 and MODEL 2 in the input file) but have different names, the chains section of the output contains no model numbers.
For example, in PDB entry 4ILM, Biological Assembly 2, we see only chains E and I:
[justas@catfish tmp]$ x3dna-dssr -i=4ilm.pdb2 --symm --json | jq .chains
{
  "chain_E": {
    "num_nts": 16,
    "bseq": "GCUAAUCUACUAUAGA",
    "sstr": "......((.....)).",
    "form": "A.....A......AA-",
    "helical_rise": 0.115,
    "helical_rise_std": 3.392,
    "helical_axis": [
      -0.734,
      -0.496,
      -0.464
    ],
    "point1": [
      64.548,
      -24.513,
      89.342
    ],
    "point2": [
      62.86,
      -25.653,
      88.275
    ],
    "num_chars": 46,
    "suite": "G!!C!!U!!A!!A4bU4nC1aU!!A!!C4pU2[A6pU!!A1aG1aA"
  },
  "chain_I": {
    "num_nts": 16,
    "bseq": "GCUAAUCUACUAUAGA",
    "sstr": "......((.....)).",
    "form": "A....BA......A.-",
    "helical_rise": 0.305,
    "helical_rise_std": 3.547,
    "helical_axis": [
      0.584,
      0.616,
      0.528
    ],
    "point1": [
      79.844,
      -52.783,
      70.422
    ],
    "point2": [
      83.132,
      -49.316,
      73.395
    ],
    "num_chars": 46,
    "suite": "G!!C!!U!!A!!A!!U4nC1aU!!A!!C4pU2[A6pU2aA1aG!!A"
  }
}
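To illustrate what this means for a parser, here is a minimal Python sketch that splits a key of the chains section into a model number and a chain name. It assumes only the two key shapes seen in the outputs above ('m1_chain_B' and 'chain_E'); the function name split_chain_key is hypothetical, and I do not know whether DSSR emits other key shapes.

```python
import re

# Key shapes taken from the two outputs above: "m1_chain_B" and "chain_E".
# Any other shape is treated as an error rather than guessed at.
_CHAIN_KEY = re.compile(r"^(?:m(\d+)_)?chain_(.+)$")


def split_chain_key(key):
    """Return (model, chain); model is None when the key has no m<N>_ prefix."""
    match = _CHAIN_KEY.match(key)
    if match is None:
        raise ValueError(f"unrecognized chain key: {key!r}")
    model, chain = match.groups()
    return (int(model) if model is not None else None, chain)
```

With the 4ZSF keys this gives a model number for both chains, but for the 4ILM keys the model comes back as None, so the parser has to look elsewhere in the output to find it.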
Looking at the results in more detail (pairs, helices, multiplets, etc.), the nucleotide identifiers show that chain E comes from MODEL 1 in the Biological Assembly file and chain I from MODEL 2:
[justas@catfish tmp]$ x3dna-dssr -i=4ilm.pdb2 --symm --json | jq .pairs
[
  {
    "index": 1,
    "nt1": "1:E.C7",
    "nt2": "1:E.G15",
    "bp": "C-G",
    "name": "WC",
    "Saenger": "19-XIX",
    "LW": "cWW",
    "DSSR": "cW-W"
  },
  {
    "index": 2,
    "nt1": "1:E.U8",
    "nt2": "1:E.A14",
    "bp": "U-A",
    "name": "WC",
    "Saenger": "20-XX",
    "LW": "cWW",
    "DSSR": "cW-W"
  },
  {
    "index": 3,
    "nt1": "2:I.U6",
    "nt2": "2:I.A16",
    "bp": "U+A",
    "name": "--",
    "Saenger": "n/a",
    "LW": "cWH",
    "DSSR": "cW+M"
  },
  {
    "index": 4,
    "nt1": "2:I.C7",
    "nt2": "2:I.G15",
    "bp": "C-G",
    "name": "WC",
    "Saenger": "19-XIX",
    "LW": "cWW",
    "DSSR": "cW-W"
  },
  {
    "index": 5,
    "nt1": "2:I.U8",
    "nt2": "2:I.A14",
    "bp": "U-A",
    "name": "WC",
    "Saenger": "20-XX",
    "LW": "cWW",
    "DSSR": "cW-W"
  }
]
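The model number can be recovered from these identifiers. The following Python sketch collects, for each chain name, the model numbers found in the nt1 and nt2 fields. It assumes the identifier layout seen above, '<model>:<chain>.<residue>', with the '<model>:' part possibly absent; models_by_chain is a hypothetical helper name, not part of DSSR.

```python
import re
from collections import defaultdict

# Assumed id layout, based on the ids shown above: "<model>:<chain>.<residue>",
# where the "<model>:" part may be missing.
_NT_ID = re.compile(r"^(?:(\d+):)?([^.:]*)\.")


def models_by_chain(pairs):
    """Map each chain name to the set of model numbers seen in nt1/nt2."""
    found = defaultdict(set)
    for pair in pairs:
        for nt_id in (pair["nt1"], pair["nt2"]):
            match = _NT_ID.match(nt_id)
            if match is None:
                continue  # unknown id shape; skip rather than guess
            model, chain = match.groups()
            found[chain].add(int(model) if model is not None else None)
    return dict(found)


# Two of the 4ILM pairs from the output above, reduced to the fields used here.
pairs = [
    {"nt1": "1:E.C7", "nt2": "1:E.G15"},
    {"nt1": "2:I.U6", "nt2": "2:I.A16"},
]
print(models_by_chain(pairs))  # {'E': {1}, 'I': {2}}
```

A chain that takes part in no base pair would not show up in this mapping, so it only partly replaces having the model number in the chain names themselves.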
This inconsistency causes trouble when parsing multiple DSSR output files generated for PDB Biological Assemblies. Could the model number of each chain be included everywhere in the DSSR output when the '--symmetry' option is used?
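For now I can work around this on the parsing side, roughly as in the Python sketch below. It rewrites un-prefixed keys of the chains section into the 'm<N>_chain_<X>' form, taking the model number from the nt1/nt2 identifiers in the pairs section. The function name add_model_prefixes is hypothetical, and the key and identifier layouts are assumed from the 4ZSF and 4ILM outputs only.

```python
import re


def add_model_prefixes(chains, pairs):
    """Return a copy of `chains` whose keys look like m<N>_chain_<X> where possible.

    A key without a prefix is rewritten only when the pair ids place that
    chain in exactly one model; otherwise it is left unchanged.
    """
    # Assumed id layout: "<model>:<chain>.<residue>", e.g. "1:E.C7".
    nt_id = re.compile(r"^(\d+):([^.:]*)\.")
    models = {}
    for pair in pairs:
        for value in (pair["nt1"], pair["nt2"]):
            match = nt_id.match(value)
            if match:
                models.setdefault(match.group(2), set()).add(int(match.group(1)))

    renamed = {}
    for key, entry in chains.items():
        chain = key[len("chain_"):] if key.startswith("chain_") else None
        if chain is not None and len(models.get(chain, ())) == 1:
            (model,) = models[chain]
            key = f"m{model}_chain_{chain}"
        renamed[key] = entry
    return renamed
```

This is fragile: chains without any base pair keep their old key, and a chain name that occurs in several models cannot be resolved this way, which is why consistent naming in the DSSR output itself would be preferable.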
Thank you very much in advance for your feedback.