Need Help Developing Algorithm to Decode MathML

Saladsamurai · Aug 10, 2009

Yeah. I know it is going to be a pain. But I need to do it for work.

If you are not familiar with MathML, it is similar to LaTeX. Here is an example:

The expression [tex]x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}[/tex]would me represented in MathML as

Code:

  <mrow>
    <mi>x</mi>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mo form="prefix">&minus;</mo>
        <mi>b</mi>
        <mo>&PlusMinus;</mo>
        <msqrt>
          <msup>
            <mi>b</mi>
            <mn>2</mn>
          </msup>
          <mo>&minus;</mo>
          <mn>4</mn>
          <mo>&InvisibleTimes;</mo>
          <mi>a</mi>
          <mo>&InvisibleTimes;</mo>
          <mi>c</mi>
        </msqrt>
      </mrow>
      <mrow>
        <mn>2</mn>
        <mo>&InvisibleTimes;</mo>
        <mi>a</mi>
      </mrow>
    </mfrac>
  </mrow>
</math>

Now my objective is to have my code (written in VBA excel) take in this MathML and output it as something that I can plug into the function toolbar in Excel. For example the above would come out (omitting the tricky plus-minus symbol)as:

=( -b + (b^2-4*a*c)^(1/2))/(2*a)

Now what I have accomplished so far is to have all of the code read onto VBA and inserted into an array such that each element of the ML is an element of the array.

For example: Array(1) = <mrow> Array(2) = <mi> Array(3) = x Array(4) = <mi>

and soooo on...

Now I have to figure out how to turn it into regular old 'typed math.'

Any suggestions? (Besides give up

)

I figure the fact that the ML could be seen having been typed from 'left to right" should help... but I am having trouble seeing how to attack this.Even the most "trivial" of suggestions would be appreciated.

Thanks!

EDIT: Okay let's add in some of these element's definitions:

mrow—displays its subelements in a horizontal row.
mi—represents an identifier such as the name of a function or variable.
mo—represents an operator or delimiter.
mn—represents a number.

Saladsamurai · Aug 10, 2009

So I had an idea for this. It is pretty vague and I know there will be a lot of special cases, but it is a start.

The tags that get wrapped around each term are analogous to sets of parenthesis; so I will describe my idea in terms of this analogy.

My idea is based on the distance between a left and right parenthesis. When working with a simple arithmetic expression with multiple sets of parentheses, one could denote the location of the 'innermost' left parenthesis and then denote its corresponding right parenthesis as the 'closest' right-parenthesis to it.

The backing out from the center of the expression, the next left-parenthesis would be matched to the 2nd closest right-parenthesis and so on.

This analogy could be extended on to the MathML by noting that there now different 'flavors' of parenthesis.

I know there will be some difficulties that arise. For one, how do actually determine what should be the 'innermost' set?

Secondly, what if there is a 'lone pair' somewhere for example ((()))+()

Thoughts?

mXSCNT · Aug 10, 2009

I told you, you should use a VBA library for parsing XML, perhaps in conjunction with the MathML DTD (if you want to validate). Hacking up your own parser will take more effort to get working and it will likely have bugs that an XML parser library would not have.

Saladsamurai · Aug 10, 2009

mXSCNT said:

I told you, you should use a VBA library for parsing XML, perhaps in conjunction with the MathML DTD (if you want to validate). Hacking up your own parser will take more effort to get working and it will likely have bugs that an XML parser library would not have.

I have no programming experience, have read the links you provided me and still have no idea what an XML Parser Library is or does. Maybe you could dumb it down for me.

mXSCNT · Aug 10, 2009

A text file has some format it is written in. In this case your file is written in the XML format, specifically MathML. When you want to extract information from the file, it has to be processed into a different format which your program can more easily work with. That processing is the job of a parser. In your posts about this, you describe your attempts to hand-code a parser for MathML. You might not have used that word for it until now, but that's what you've been doing.

An XML parsing library will do something similar to what you've been attempting. It will break down the MathML file into the parts that you are interested in. For example, if you used the library with the example you gave in this thread, it would tell you things like "The file has a top-level tag called mrow, which has a number of children. The first child of mrow is called mi, and it has the text value 'x.' The second child of mrow is called mo, and it has the text value '=.' The third child of mrow is called mfrac, and it has several children. The first child of mfrac is another tag called mrow ... " and so on, except not in English but in a way your program can deal with. The XML parsing library might also have other tools; for example it might be able to validate the file, which means to determine whether the file contains well-formed MathML or whether the MathML has some syntax error.

The question at this point would be what to do with that information. One simple way to organize your code is to have a set of functions, each designed to output the Excel function toolbar code for a particular tag (a tag is a name like mrow). You could have another function dispatch() which takes a tag as an argument and decides which function to call. So for example, your mfrac code might look vaguely like this (I don't know VBA, but I hope you can gather what I mean):

Code:

function handle_mfrac(tag mfrac){
  if the tag mfrac does not have exactly 2 children, signal an error and terminate.
  let numerator = the first child of the tag mfrag
  let denominator = the second child of the tag mfrac
  return "(" + dispatch(numerator) + "/" + dispatch(denominator) + ")"
}
function dispatch(tag sometag){
  if the text of sometag is "mfrac" then return handle_mfrac(sometag)
  else if the text of sometag is "mrow" then return handle_mrow(sometag)
  else if the text of sometag is "msqrt" then return handle_msqrt(sometag)
  ... (and so on like that)
  else if we don't recognize the tag, signal an error and terminate
}

Saladsamurai · Aug 10, 2009

I see. I think I am with you now. But it seems from your description, the PL does not do anything beyond what I have already accomplished, i.e. break down the XML file into its subelements.

Or perhaps it does. Does it determine the hierarchy as well? I think it does... that's the whole "children" thing.

Hmmm. I need to turn this over in my head for awhile. It seems like if it determines the hierarchy, then that should take a lot of the burden off of me.

It should also (at least I think from what you describe) eliminate the need to find a particular tags "partner" right?

For example it knows which <mrow> goes with which </mrow> right?

Thanks for your help too!

mXSCNT · Aug 10, 2009

Yes, a parsing library determines the hierarchy of which nodes are children of which other nodes, and it spares you from having to deal with details like closing tags. If the tags have attributes (an attribute is like <foo attr="someval">) then the parsing library will handle those as well, and properly escape ampersands, deal with a doctype declaration if one is there, and handle any other syntactical oddities so that you don't have to.

Saladsamurai · Aug 10, 2009

Okay then. So let's say I have a bunch of MathML files saved as .txt files. Where do I go to input them? Is this a free service?

junglebeast · Aug 11, 2009

The simplest way to parse this is with a recursive descent parser, which are typically hand written due to their simplicity (ie, not using some bloated parsing library). I just wrote a recursive descent parser for parsing math expressions last week, actually. It worked great.

Saladsamurai · Aug 11, 2009

Oh drat. Now I am torn between the different approaches. I feel like since I have already put so much thought into writing my own I should continue. But the Library might be able to pick up on things that I cannot.

Any chance you could post your parser junglebeast (at your leisure) so I could see how you tackled yours?

mXSCNT · Aug 11, 2009

Writing a recursive descent parser might be a learning exercise, and not too difficult in this case... but not worth the trouble if you just want results. A parsing library is the easier way to go, it means you have to write less code yourself and the less code you write the less chance there is for bugs.

Saladsamurai, to open a file you need to read a VBA tutorial or book. I'm sure that opening a file is one of the things that would be covered.

Edit: However, if you are just learning to program, I would recommend Python over VBA. Avoid the MS lock-in and learn a skill that is cross-platform.

Saladsamurai · Aug 11, 2009

mXSCNT said:

Writing a recursive descent parser might be a learning exercise, and not too difficult in this case... but not worth the trouble if you just want results. A parsing library is the easier way to go, it means you have to write less code yourself and the less code you write the less chance there is for bugs.

Saladsamurai, to open a file you need to read a VBA tutorial or book. I'm sure that opening a file is one of the things that would be covered.

Edit: However, if you are just learning to program, I would recommend Python over VBA. Avoid the MS lock-in and learn a skill that is cross-platform.

Good Advice, but I do need to use VBA since it's what we use at my company. I can open a file. I have already written a code that opens the file, trims the whitespace and stores each element of the XML as an array element.

Once I get the file opn, where do I actually 'plug it in' to a tool that parses it for me?

I am currently reading those links again, but I am not seeing that part of it.

Thanks guys

mXSCNT · Aug 11, 2009

Well, first you have to find a suitable XML parsing library for VBA, then you have to read the documentation or tutorial for that library to learn how to use it. Probably at some point you call a library parsing function on the text of the file or on the filename.

One of the links I gave you in your other thread about this appeared to be an XML parsing tutorial.

junglebeast · Aug 11, 2009

mXSCNT said:

Writing a recursive descent parser might be a learning exercise, and not too difficult in this case... but not worth the trouble if you just want results. A parsing library is the easier way to go, it means you have to write less code yourself and the less code you write the less chance there is for bugs.

I don't believe that using a library will reduce the amount of code that needs to be written. I think it will increase the amount of code, and will also have a significant learning curve that takes a while to learn how to use it. And because you are depending on someone elses code, it will icnrease the chance of bugs.

In my case, I got frustrated trying to learn parsing libraries and discovered it was easier to write a new parser from the ground up. After researching a bit, I discovered that recursive descent was the easiest type of parser for this job, and after reading the wikipedia example I had it up and running less than an hour later.

http://en.wikipedia.org/wiki/Recursive_descent_parser

Saladsamurai · Aug 11, 2009

I am just going to go with my original idea. I just read the Wiki and did not understand a word of it. I am not a programmer, so the added learning curve of learning all of the jargon necessary to understand that article is just too much.

Same goes for the Library. I don't see what is wrong with the original idea: read the ML, make some decisions as to how it should be handled, output what I need.

junglebeast · Aug 11, 2009

Saladsamurai said:

I am just going to go with my original idea. I just read the Wiki and did not understand a word of it. I am not a programmer, so the added learning curve of learning all of the jargon necessary to understand that article is just too much.

Same goes for the Library. I don't see what is wrong with the original idea: read the ML, make some decisions as to how it should be handled, output what I need.

Don't let me stop you, although I expect you'll just end up wasting a lot of time trying to get something that only half-works some of the time. Parsing theory is a very mature field of computer science, they have already figured out several algorithms that work and those which don't, and the simplest of those working algorithms is what I already suggested. Another simpel strategy is SLR,

http://en.wikipedia.org/wiki/Simple_LR_parser

although I find this one to be a little more confusing. The chances of you inventing an original parsing algorithm that doesn't turn out to be identical to an already known method of parsing is near zero in my opinion..although it may take you weeks or months or programming to finally design this new parser yourself. Just saying...why try to reinvent the wheel? It's not so simple as it first looks. Perhaps after dallying around for a while with no success you will be more receptive to using an established method.

Saladsamurai · Aug 11, 2009

Well, it is not that I am don't believe you, but it is like you said: "Parsing theory is a very mature field of computer science..."

But I know nothing about computer science. I just know how to write some simple If-Then, For Loops in C++ and MATLAB.

So when I look at the Wiki, everything in it is completely foreign to me. I will see if I can find like a "Parsing for Dummies" webpages...

lrowe_uk_phy · Aug 17, 2009

The format of MathML is XML.

Hook a SAX Parser into your VBA environment (You can call a c version or see if VBA has an xml SAX interface, or DOM Tree navigation will do if you don't want to use basic SAX.

Use the handler call backs to walk through the MathML code and write a state machine parser that will map that to the standard equation syntax format.

Note: You may have already created a good enough custom parse of MathML to bypass SAX, but checking out this option might be interesting even if you don't decide to use it.

Build up some basic bottom up transformations and then see if you can stitch them together with an overlaying structure that allows you to invoke them recursively as needed to make the transformation.

I'm simply trying to give you a method of approach, not over complicate the issues, read the comments a couple of times and think about it.

Good Luck!

lrowe_uk_phy · Aug 17, 2009

Here is a ref to the xml and vba interface:

http://en.allexperts.com/q/XML-1469/Manipulating-XML-files-Excel.htm

Saladsamurai · Aug 18, 2009

lrowe_uk_phy said:

Here is a ref to the xml and vba interface:

http://en.allexperts.com/q/XML-1469/Manipulating-XML-files-Excel.htm

Thanks!

I will look into this. This is a great learning experience so far!

Need Help Developing Algorithm to Decode MathML

Is A.I. more than the sum of its parts?

AI vs. Humans as Processors in an Environment

France to ditch Windows for Linux

Sweetspot of data compression

Other than just FizzBuzz to test programmer candidates

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Need Help Developing Algorithm to Decode MathML

Similar threads