Print Page - Parsing the base pair identifiers - separating base type from base number

Questions and answers => RNA structures (DSSR) => Topic started by: jyvdf3asdg2 on May 30, 2013, 06:44:40 pm


Title: Parsing the base pair identifiers - separating base type from base number
Post by: jyvdf3asdg2 on May 30, 2013, 06:44:40 pm

Hi again,

Great program, like all the improvements you've made so far.

Now, I'm trying to parse the output given by DSSR using Python, and so far it is quite easy. However, I'm running into a bit of trouble when trying to parse the base identifiers.

For example, with "0.C309" from 1S72, it is easy enough to split the strand from the base type/residue number, but separating C from 309 is more difficult.

Separating by chars vs. integers would be okay, but some alt. residues have numbers in them which makes it more difficult.

Is there any way you would want to add another separator for base type from base number?

Ex. "0.C_309" or the like?

Thanks

Title: Re: Parsing the base pair identifiers - separating base type from base number
Post by: xiangjun on May 30, 2013, 08:39:47 pm

Thanks for your kind words about DSSR.

Quote

Separating by chars vs. integers would be okay, but some alt. residues have numbers in them which makes it more difficult.

Could you provide some specific cases to make your point clearer?

Indeed, there are more complications in the nt identifier than the very simple case you mentioned. For example, the model number, insertion code, etc. also need to be considered in DSSR.

Quote

Is there any way you would want to add another separator for base type from base number?
Ex. "0.C_309" or the like?

I'd like to keep the default settings for DSSR simple/succinct, aimed more at human readability than computer parsing. That said, I may consider adding an option to make the id string software-friendly.

Xiang-Jun

Title: Re: Parsing the base pair identifiers - separating base type from base number
Post by: jyvdf3asdg2 on May 31, 2013, 10:30:09 am

Okay, a separate option would be nice, but I understand if that's not what you intend for the output.

Quote from: xiangjun on May 30, 2013, 08:39:47 pm

Thanks for your kind words about DSSR.

Quote
Separating by chars vs. integers would be okay, but some alt. residues have numbers in them which makes it more difficult.
Could you provide some specific cases to make your point clearer?

Xiang-Jun

An example from PDB 1D9H, you have a modified base U31 of residue number 16 on chain A.

DSSR displays it as "B.U31/16", separating the numeral in the base type from the numeral of the residue number. If they all took that format, it would be nice for those who wish to parse the DSSR data.

Title: Re: Parsing the base pair identifiers - separating base type from base number
Post by: xiangjun on May 31, 2013, 11:25:44 am

Hi,

I am glad that you noticed this subtle point. Since the nucleotide is named U31, ending with digits, it could obviously be confused with the residue number 16. That's why I decided to add a slash (/) in between. I will write a post on the details of the nt id string in DSSR.

HTH,

Xiang-Jun
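
For readers parsing the default (short) id strings discussed above, here is a minimal Python sketch. It is not part of DSSR; the function name split_short_id is hypothetical, and the rules are inferred only from the two examples in this thread ("0.C309" and "B.U31/16"). It assumes the chain id comes before the first dot, that a slash (when present) separates the nucleotide name from the residue number, and that otherwise the trailing run of digits is the residue number. Model numbers, insertion codes, and negative residue numbers are not handled.

```python
import re

# Assumption: without a slash, the trailing run of digits is the residue
# number and everything before it is the nucleotide name.
_NAME_THEN_NUMBER = re.compile(r"^(?P<name>.*?)(?P<num>\d+)$")


def split_short_id(idstr):
    """Split a short id such as '0.C309' or 'B.U31/16' into
    (chain, nt_name, nt_number). Hypothetical helper, not a DSSR API."""
    chain, dot, rest = idstr.partition(".")
    if not dot:
        raise ValueError(f"no chain separator in {idstr!r}")
    if "/" in rest:
        # The slash marks the boundary when the name itself ends in digits.
        name, _, num = rest.partition("/")
    else:
        match = _NAME_THEN_NUMBER.match(rest)
        if match is None:
            raise ValueError(f"cannot split name from number in {idstr!r}")
        name, num = match.group("name"), match.group("num")
    # int() raises ValueError for anything this sketch does not cover,
    # e.g. a number followed by an insertion code.
    return chain, name, int(num)


print(split_short_id("0.C309"))
print(split_short_id("B.U31/16"))
```

Raising an error on anything unexpected is deliberate: it is safer to stop on an id the sketch does not understand than to guess silently.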

Title: Re: Parsing the base pair identifiers - separating base type from base number
Post by: xiangjun on June 03, 2013, 11:38:37 pm

I've updated DSSR to beta-r11-on-20130603, which contains a new option, --long-idstr, to delineate the fields of the nucleotide id string. The format is:

model-number.chain-id.nucleotide-name.nt-sequence-number.insertion-code

It has five fields, and some of them (model number, insertion code) can be missing. For example, with the new option, "B.U31/16" in 1d9h would become ".B.U31.16." (the model number and insertion code fields are empty).

I believe this DSSR update would fulfill your needs -- please verify and report back how it goes.

Xiang-Jun

Updated on 2013-06-18: the new format is:

model-number.seqid.chain-id.nt-name.nt-number.insertion-code
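
With the dot-delimited --long-idstr layouts described above, splitting becomes straightforward. The following Python sketch is an illustration only, not DSSR code: the function name parse_long_id and the dictionary keys are hypothetical, and it assumes that no field contains a dot itself. Both field layouts mentioned in this thread are provided, so pick the one matching your DSSR version.

```python
# Field layouts taken from this thread; the key names are illustrative.
LONG_FIELDS_R11 = ("model", "chain", "nt_name", "nt_number", "icode")
LONG_FIELDS_UPDATED = ("model", "seqid", "chain", "nt_name", "nt_number", "icode")


def parse_long_id(idstr, fields=LONG_FIELDS_UPDATED):
    """Map a dot-delimited long id string onto named fields.

    Hypothetical helper, not a DSSR API. Empty fields (for example a
    missing model number or insertion code) are returned as None.
    Assumes no field contains a '.' character.
    """
    parts = idstr.split(".")
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, got {len(parts)} in {idstr!r}"
        )
    return {name: (value or None) for name, value in zip(fields, parts)}


# The five-field example given in the post above:
print(parse_long_id(".B.U31.16.", LONG_FIELDS_R11))
```

Values are kept as strings so that nothing is lost; convert nt_number with int() where a numeric value is needed.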

Title: Re: Parsing the base pair identifiers - separating base type from base number
Post by: jyvdf3asdg2 on June 07, 2013, 10:04:59 am

Works great, thanks!

Funded by the NIH R24GM153869 grant on X3DNA-DSSR, an NIGMS National Resource for Structural Bioinformatics of Nucleic Acids

Created and maintained by Dr. Xiang-Jun Lu, Department of Biological Sciences, Columbia University

3DNA Forum
