Author Topic: modified nucleotides incorrect. (Read 6999 times)

tctcab · « **on:** February 04, 2020, 01:26:15 am »

Hi, Dr. Li,

I've read your post regarding the modified nucleotides issue. https://x3dna.org/highlights/modified-nucleotides-in-the-pdb

However, during my usage, I noticed some modified nucleotides are still incorrect:

PDB: 1ASY_R

output of x3dna-dssr:

UCCGUGAUAGUUPAAuGGuCAGAAUGGGCGCPUGUCgCGUGCCAGAUcGGGGtPCAAUUCCCCGUCGCGGAGCCA

50 G ( R.G651 0.012 anti,~C3'-endo,BI,canonical,non-pair-contact,helix,stem,coaxial-stack
51 G ( R.G652 0.011 anti,~C3'-endo,BI,canonical,non-pair-contact,helix,stem,coaxial-stack
52 G ( R.G653 0.013 anti,~C3'-endo,BI,canonical,non-pair-contact,helix,stem-end,coaxial-stack,hairpin-loop,kissing-loop
53 t . R.5MU654 0.017 modified,anti,~C3'-endo,BI,non-canonical,non-pair-contact,helix,hairpin-loop,kissing-loop
54 P . R.PSU655 0.011 modified,u-turn,anti,~C3'-endo,BI,non-canonical,non-pair-contact,helix,hairpin-loop,kissing-loop,cap-acceptor
55 C ] R.C656 0.019 pseudoknotted,turn,u-turn,anti,~C3'-endo,BI,isolated-canonical,non-pair-contact,helix-end,hairpin-loop,kissing-loop
56 A . R.A657 0.010 u-turn,anti,~C3'-endo,non-pair-contact,hairpin-loop,kissing-loop,cap-donor,phosphate

the PDB fasta from PDB database:

>1ASY_R
UCCGUGAUAGUUUAAUGGUCAGAAUGGGCGCUUGUCGCGUGCCAGAUCGGGGUUCAAUUCCCCGUCGCGGAGCCA

According to the list in your provided https://x3dna.org/luxfiles/modified-bases-2013oct18.txt

5MU should be U, instead of t, PSU should be u, instead of P

hope this helps.

TC

xiangjun · « **Reply #1 on:** February 04, 2020, 11:14:49 am »

Hi TC,

Thanks for using 3DNA/DSSR and for posting your questions on the 3DNA Forum.

Quote

However, during my usage, I noticed some modified nucleotides are still incorrect:

I checked PDB entry 1ASY_R, and can reproduce your reported results regarding the DSSR auto-assigned modified nucleotides. However, I disagree with you that DSSR has made a mistake here, especially with regard to the pseudouridine (PSU).

As noted in the DSSR paper, in the section on "Identification of nucleotides":

Quote

In the derived base sequence, DSSR uses a one-letter shorthand for each identified nucleotide: upper case A, C, G, U and T for standard RNA and DNA bases, and lower case letters for modified nucleotides mapped to their canonical counterparts (e.g. ‘c’ for 5-methylcytidine, 5MC; Figure 2 and Supplementary Sample Output). Note that pseudouridine (PSU) is shortened to ‘P’, due to its special C1′–C5 glycosidic linkage (Figure 2).

Taking PSU as a modified U, i.e., using the standard base reference frame of U, would lead to wrong base-pair parameters. Thus 3DNA/DSSR specifically adds the P symbol -- this is a deliberate choice, a feature, not a bug.

As for taking 5MU as modified T ('t'), that's because of the 5-methyl group. I agree that the choice between assigning 5MU as a modified U or a modified T is arbitrary. Users can take it as a modified U explicitly in 3DNA via the 'baselist.dat' file, as documented in that blogpost. For the purpose of 3DNA/DSSR, however, taking 5MU as a modified T or U does not have a noticeable effect on the derived parameters. So DSSR always uses an implicit assignment, for simplicity.

For users who want to compare DSSR-derived sequences with other resources, they need to pay attention to the lower-case letters and P and take proper actions. By design, the DSSR-derived base sequences from 3D atomic coordinates would be different from those listed in the PDB when pseudouridine (PSU) is involved. I could add a new DSSR option so that users can explicitly set the mapping in cases like 5MU. In my support of DSSR for more than 6 years, however, this ambiguity has not been a concern in practice. Do you think such a feature would be useful to you?

Best regards,

Xiang-Jun

tctcab · « **Reply #2 on:** February 04, 2020, 07:26:53 pm »

Hi, Xiang-Jun,

Thanks for your explanation, now I understand your choice.

However, regarding 1ASY_R and PSU, the basepair classification of DSSR would be:

command: x3dna-dssr -i=1ASY_R.pdb --json -o=1ASY_R.dssr.json

pairs:
...
index nt1 nt2 bp name Saenger LW DSSR
28 28 R.C631 R.G639 C-G WC 19-XIX cWW cW-W
29 29 R.PSU632 R.C638 P-C -- n/a cWW cW-W
30 30 R.PSU632 R.G639 P-G -- n/a cWW cW-W
...

You should notice the inconsistency in the BP classification. Briefly, the name and Saenger columns do not recognize the WC pair of 28,29, while LW and DSSR annotate them as canonical. So if I want to get canonical pairs, it seems that I can't use the former two columns, right? What's your advice for retrieving canonical base pairs when the sequence has PSU?

Quote

For users who want to compare DSSR-derived sequences with other resources, they need to pay attention to the lower-case letters and P and take proper actions. By design, the DSSR-derived base sequences from 3D atomic coordinates would be different from those listed in the PDB when pseudouridine (PSU) is involved. I could add a new DSSR option so that users can explicitly set the mapping in cases like 5MU. In my support of DSSR for more than 6 years, however, this ambiguity has not been a concern in practice. Do you think such a feature would be useful to you?

This would definitely help, and I believe it would be useful for other users too.

In my workflow, I used your list https://x3dna.org/luxfiles/modified-bases-2013oct18.txt to convert the sequence back to a standard RNA sequence (AUGCNX) and then run a sequence search. A letter P in the DSSR output will be treated as proline. My suggestion is to keep the output sequence in line with the IUPAC code in order to reduce ambiguity.
https://www.bioinformatics.org/sms/iupac.html

xiangjun · « **Reply #3 on:** February 04, 2020, 08:26:42 pm »

Quote

index nt1 nt2 bp name Saenger LW DSSR
28 28 R.C631 R.G639 C-G WC 19-XIX cWW cW-W
29 29 R.PSU632 R.C638 P-C -- n/a cWW cW-W
30 30 R.PSU632 R.G639 P-G -- n/a cWW cW-W

You should notice the inconsistency in the BP classification. Briefly, the name and Saenger columns do not recognize the WC pair of 28,29, while LW and DSSR annotate them as canonical. So if I want to get canonical pairs, it seems that I can't use the former two columns, right? What's your advice for retrieving canonical base pairs when the sequence has PSU?

I am confused by the first two columns here. Also, you mentioned "the WC pair of 28,29". What is it? What about 30? Please clarify.

DSSR follows the convention that "canonical pairs" include only WC and G-U wobble pairs. The DSSR pair annotations (and its implementation of the LW annotations) are geometry based. If you want to retrieve WC-like pairs involving PSU, you can check for the 'cW-W' DSSR notation. You may check the "RNA Structure Atlas" website for 'authentic' LW annotation of base pairs.
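As a rough illustration of that check, here is a minimal Python sketch. It assumes the pairs are available as a list of dictionaries carrying the same columns as the table quoted above (index, nt1, nt2, bp, name, Saenger, LW, DSSR), for example the "pairs" list of the --json output. The function names are hypothetical, and treating "WC" and "Wobble" as the canonical values of the name column is an assumption. The rows are copied from this thread.

```python
# Rows copied from this thread. In practice they might be read from the
# "pairs" list of the DSSR --json output (assumed layout).
pairs = [
    {"index": 28, "nt1": "R.C631", "nt2": "R.G639", "bp": "C-G",
     "name": "WC", "Saenger": "19-XIX", "LW": "cWW", "DSSR": "cW-W"},
    {"index": 29, "nt1": "R.PSU632", "nt2": "R.C638", "bp": "P-C",
     "name": "--", "Saenger": "n/a", "LW": "cWW", "DSSR": "cW-W"},
    {"index": 30, "nt1": "R.PSU632", "nt2": "R.G639", "bp": "P-G",
     "name": "--", "Saenger": "n/a", "LW": "cWW", "DSSR": "cW-W"},
    {"index": 36, "nt1": "R.5MU654", "nt2": "R.A658", "bp": "t-A",
     "name": "rHoogsteen", "Saenger": "24-XXIV", "LW": "tWH", "DSSR": "tW-M"},
]


def wc_like_pairs(rows):
    """Keep pairs with WC-like geometry, whatever the base identities."""
    return [row for row in rows if row["DSSR"] == "cW-W"]


def canonical_pairs(rows):
    """Keep pairs named WC or Wobble (assumed labels in the name column)."""
    return [row for row in rows if row["name"] in ("WC", "Wobble")]


print([row["index"] for row in wc_like_pairs(pairs)])
print([row["index"] for row in canonical_pairs(pairs)])
```

The geometry-based filter keeps the pairs involving PSU, while the name-based filter drops them, which is the difference discussed above.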

As mentioned in my previous response, the mapping of PSU to the symbol P is a deliberate decision in 3DNA/DSSR. The P symbol makes the most common modified nucleotide, pseudouridine, stand out. Users can easily replace P with U in a DSSR-derived base sequence, as they wish. So it should not be an issue in practice.
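For readers who want to do that replacement programmatically, here is a minimal Python sketch. The function name `normalize_dssr_sequence` is hypothetical (it is not part of DSSR), and it assumes an RNA chain, so both P and any T (such as a 5MU reported as 't') are folded back to U. The two sequences are the ones posted at the top of this thread.

```python
def normalize_dssr_sequence(seq: str) -> str:
    """Fold a DSSR-derived RNA sequence back to plain ACGU letters.

    Hypothetical helper, not part of DSSR. Assumes an RNA chain.
    """
    out = []
    for ch in seq:
        base = ch.upper()        # lower case marks a modified nucleotide
        if base in ("P", "T"):   # pseudouridine, or 5MU reported as 't'
            base = "U"
        out.append(base)
    return "".join(out)


dssr_seq = "UCCGUGAUAGUUPAAuGGuCAGAAUGGGCGCPUGUCgCGUGCCAGAUcGGGGtPCAAUUCCCCGUCGCGGAGCCA"
pdb_fasta = "UCCGUGAUAGUUUAAUGGUCAGAAUGGGCGCUUGUCGCGUGCCAGAUCGGGGUUCAAUUCCCCGUCGCGGAGCCA"

# Compare the normalized DSSR sequence with the PDB FASTA entry for 1ASY_R.
print(normalize_dssr_sequence(dssr_seq) == pdb_fasta)
```

The normalization discards the information that a position is modified, so it is best applied to a copy used only for sequence searches.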

I will add a new option to DSSR so users can have control over the mapping of 5MU to U, for example. However, PSU to U mapping will not be allowed: PSU is topologically different from U in terms of sugar-base connectivity.

Best regards,

Xiang-Jun

xiangjun · « **Reply #4 on:** February 05, 2020, 12:41:02 pm »

As a follow-up, I've updated DSSR to v1.9.9-2020feb06 on the download page. The update introduces a new option --nt-mapping that takes a comma-separated list of modified nucleotides in the form of 3-letter-id:1-letter-symbol. For example, to map 5MU to u, one can use --nt-mapping='5MU:u'. Multiple modified nucleotides can be given, separated by commas. The one-letter symbol must be among ACGTUP (or acgtup). By design, PSU is assigned to P by default, and this cannot be changed via the option.
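For scripted use, the rules above can be checked before calling DSSR. The following is a minimal Python sketch with a hypothetical helper, `build_nt_mapping`, that is not part of DSSR. It only assembles the option string from a dictionary and enforces the two constraints stated above: the symbol must be one of ACGTUP/acgtup, and PSU cannot be remapped.

```python
def build_nt_mapping(mapping: dict) -> str:
    """Assemble a --nt-mapping option from {residue id: one-letter symbol}.

    Hypothetical helper, not part of DSSR; it only formats and validates.
    """
    allowed = set("ACGTUPacgtup")
    items = []
    for res_id, symbol in mapping.items():
        if res_id.upper() == "PSU":
            raise ValueError("PSU is always assigned to P and cannot be remapped")
        if len(symbol) != 1 or symbol not in allowed:
            raise ValueError(f"symbol for {res_id} must be one of ACGTUP or acgtup")
        items.append(f"{res_id}:{symbol}")
    # Entries are joined by commas, with no spaces around ':' or ','.
    return "--nt-mapping=" + ",".join(items)


print(build_nt_mapping({"5MU": "u"}))
print(build_nt_mapping({"5MU": "u", "5MC": "c"}))
```

The resulting string can be passed as one argument to x3dna-dssr, for example through `subprocess.run`, which avoids the shell quoting needed on the command line.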

Using 1ASY_R as an example, here are the detailed steps (viewers can follow):

Code: Text

curl https://files.rcsb.org/download/1ASY.pdb -o 1ASY.pdb
x3dna-dssr -i=1ASY.pdb --select-chain=R -o=1ASY-R.pdb
x3dna-dssr -i=1ASY-R.pdb
x3dna-dssr -i=1ASY-R.pdb --nt-mapping='5MU:u'

The DSSR-derived sequences are listed below:

  UCCGUGAUAGUUPAAuGGuCAGAAUGGGCGCPUGUCgCGUGCCAGAUcGGGGtPCAAUUCCCCGUCGCGGAGCCA
  UCCGUGAUAGUUPAAuGGuCAGAAUGGGCGCPUGUCgCGUGCCAGAUcGGGGuPCAAUUCCCCGUCGCGGAGCCA

Note that mapping 5MU to 't' or 'u' has minimal influence on DSSR-derived base-pair parameters, as shown below. 3DNA/DSSR is robust against the (potential) ambiguity in designating a modified nucleotide to its nearest canonical counterpart.

Code: Text

36 R.5MU654       R.A658         t-A rHoogsteen  24-XXIV   tWH  tW-M
     bp-pars: [4.19    -2.18   -0.08   -4.49   6.74    -93.68]
# with 5MU:u
36 R.5MU654       R.A658         u-A rHoogsteen  24-XXIV   tWH  tW-M
     bp-pars: [4.20    -2.20   -0.08   -4.48   6.74    -93.47]
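To put a number on "minimal", the two bp-pars lists above can be compared position by position. This is a small Python sketch using only the values printed in the listing. The variable names are illustrative.

```python
# bp-pars for pair 36 (R.5MU654 with R.A658), copied from the listing above.
pars_t = [4.19, -2.18, -0.08, -4.49, 6.74, -93.68]   # 5MU taken as 't' (default)
pars_u = [4.20, -2.20, -0.08, -4.48, 6.74, -93.47]   # with --nt-mapping='5MU:u'

# Absolute difference per parameter, rounded to the two printed decimals.
diffs = [round(abs(a - b), 2) for a, b in zip(pars_t, pars_u)]

print(diffs)
print(max(diffs))
```

The largest change is in the last value (0.21), and the other five differ by 0.02 or less.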

As a side note, v1.9.9-2020feb06 also contains many refinements to the DSSR-PyMOL interface for producing the characteristic block schematics. See http://skmatic.x3dna.org.

Xiang-Jun

tctcab · « **Reply #5 on:** February 05, 2020, 10:13:05 pm »

Many thanks for your time!

I find DSSR to be handy and really like your work and your devotion to maintaining it for so long.
