Author Topic: Parsing the base pair identifiers - separating base type from base number (Read 55325 times)

jyvdf3asdg2 · « **on:** May 30, 2013, 06:44:40 pm »

Hi again,

Great program, like all the improvements you've made so far.

Now, I'm trying to parse the output given by DSSR using Python, and so far it is quite easy. However, I'm running into a bit of trouble when trying to parse the base identifiers.

For example, with "0.C309" from 1S72, it is easy enough to split the strand from the base type/residue number, but separating C from 309 is more difficult.

Separating by chars vs. integers would be okay, but some alt. residues have numbers in them which makes it more difficult.

Is there any way you would want to add another separator for base type from base number?

Ex. "0.C_309" or the like?

Thanks

xiangjun · « **Reply #1 on:** May 30, 2013, 08:39:47 pm »

Thanks for your kind words about DSSR.

Quote

Separating by chars vs. integers would be okay, but some alt. residues have numbers in them which makes it more difficult.

Could you provide some specific cases to make your point clearer?

Indeed, there are more complications in the nt identifier than the very simple case you mentioned. For example, the model number and insertion code also need to be considered in DSSR.

Quote

Is there any way you would want to add another separator for base type from base number?
Ex. "0.C_309" or the like?

I'd like to keep the default settings for DSSR simple and succinct, targeting human readability more than computer parsing. That said, I may consider adding an option to make the id string software friendly.

Xiang-Jun

jyvdf3asdg2 · « **Reply #2 on:** May 31, 2013, 10:30:09 am »

Okay, a separate option would be nice, but I understand if that's not what you intend for the output.

Quote from: xiangjun on May 30, 2013, 08:39:47 pm

Thanks for your kind words about DSSR.

Quote
Separating by chars vs. integers would be okay, but some alt. residues have numbers in them which makes it more difficult.
Could you provide some specific cases to make your point clearer?

Xiang-Jun

An example from PDB 1D9H, you have a modified base U31 of residue number 16 on chain A.

DSSR displays it as "B.U31/16", separating the numeral in the base type from the numeral of the residue number. If they all took that format, it would be nice for those who wish to parse the DSSR data.
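
A minimal Python sketch of how the two short forms quoted in this thread ("0.C309" and "B.U31/16") could be split. It assumes the id is `chain.name` followed by the residue number, with a slash present only when the name ends in a digit, and that there is no model number or insertion code; the helper name `split_short_id` is hypothetical and not part of DSSR.

```python
import re

def split_short_id(nt_id):
    """Split a short id such as '0.C309' or 'B.U31/16' into (chain, name, number).

    Assumption: no model number or insertion code is present in the id.
    """
    chain, _, rest = nt_id.partition(".")
    if "/" in rest:
        # Slash form: the name itself ends in digits, e.g. 'U31/16'.
        name, _, number = rest.partition("/")
    else:
        # No slash: take the trailing (optionally negative) integer as the number.
        match = re.fullmatch(r"(.*?)(-?\d+)", rest)
        if match is None:
            raise ValueError(f"cannot split residue name and number in {nt_id!r}")
        name, number = match.group(1), match.group(2)
    return chain, name, int(number)

print(split_short_id("0.C309"))    # ('0', 'C', 309)
print(split_short_id("B.U31/16"))  # ('B', 'U31', 16)
```

Ids that carry a model number or an insertion code would need extra handling that this sketch does not attempt.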

xiangjun · « **Reply #3 on:** May 31, 2013, 11:25:44 am »

Hi,

I am glad that you noticed this subtle point. Since the nucleotide is named U31, ending with digits, the name could obviously be confused with the residue number 16. That's why I decided to add a slash (/) in between. I will write a post on the details of the nt id string in DSSR.

HTH,

Xiang-Jun

xiangjun · « **Reply #4 on:** June 03, 2013, 11:38:37 pm »

I've updated DSSR to beta-r11-on-20130603, which contains a new option --long-idstr to delineate the fields of the nucleotide id string. The format is:

model-number.chain-id.nucleotide-name.nt-sequence-number.insertion-code

It has five fields, and some of them (model number, insertion code) can be empty. For example, with the new option, "B.U31/16" in 1d9h would become ".B.U31.16." (empty model number and empty insertion code).

I believe this DSSR update would fulfill your needs -- please verify and report back how it goes.

Xiang-Jun

Updated on 2013-06-18: the new format is:

model-number.seqid.chain-id.nt-name.nt-number.insertion-code
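
A minimal Python sketch of parsing the dot-delimited id produced with --long-idstr. It assumes that no field itself contains a dot and that empty fields appear as empty strings; the function name `parse_long_id` and the dictionary keys are hypothetical labels for the fields listed above, not DSSR names. The only example string used is the one quoted in this post, which follows the original five-field layout.

```python
# Field layouts as described in this post; the key names are illustrative.
FIELDS_ORIGINAL = ("model", "chain", "nt_name", "nt_number", "insertion_code")
FIELDS_UPDATED = ("model", "seqid", "chain", "nt_name", "nt_number", "insertion_code")

def parse_long_id(id_str, fields=FIELDS_ORIGINAL):
    """Split a long id string on '.' and label each field.

    Assumption: fields never contain a dot, so a plain split is sufficient.
    """
    parts = id_str.split(".")
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, got {len(parts)} in {id_str!r}"
        )
    return dict(zip(fields, parts))

record = parse_long_id(".B.U31.16.")
print(record)
# {'model': '', 'chain': 'B', 'nt_name': 'U31', 'nt_number': '16', 'insertion_code': ''}
```

For output in the updated six-field layout, pass `fields=FIELDS_UPDATED`. Values are kept as strings, so the caller decides how to treat an empty model number or insertion code.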

jyvdf3asdg2 · « **Reply #5 on:** June 07, 2013, 10:04:59 am »

Works great, thanks!
