Need Help Developing Algorithm to Decode MathML

  • Thread starter Saladsamurai
  • Start date
  • Tags
    Algorithm
In summary: That's all I'm going to say about it. If you don't understand, then you need to find a tutorial or other resource that explains it better than I can.In summary, the conversation discusses the use of MathML and the goal of converting it into a format that can be used in Excel. The speaker mentions their progress so far and their idea for converting the MathML. They also mention difficulties they anticipate and ask for suggestions. The expert suggests using a VBA library for parsing XML and explains the purpose of a parser.
  • #1
Saladsamurai
3,020
7
Yeah. I know it is going to be a pain. But I need to do it for work.

If you are not familiar with MathML, it is similar to LaTeX. Here is an example:

The expression [tex]x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}[/tex]would me represented in MathML as

Code:
  <mrow>
    <mi>x</mi>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mo form="prefix">&minus;</mo>
        <mi>b</mi>
        <mo>&PlusMinus;</mo>
        <msqrt>
          <msup>
            <mi>b</mi>
            <mn>2</mn>
          </msup>
          <mo>&minus;</mo>
          <mn>4</mn>
          <mo>&InvisibleTimes;</mo>
          <mi>a</mi>
          <mo>&InvisibleTimes;</mo>
          <mi>c</mi>
        </msqrt>
      </mrow>
      <mrow>
        <mn>2</mn>
        <mo>&InvisibleTimes;</mo>
        <mi>a</mi>
      </mrow>
    </mfrac>
  </mrow>
</math>

Now my objective is to have my code (written in VBA excel) take in this MathML and output it as something that I can plug into the function toolbar in Excel. For example the above would come out (omitting the tricky plus-minus symbol)as:

=( -b + (b^2-4*a*c)^(1/2))/(2*a)

Now what I have accomplished so far is to have all of the code read onto VBA and inserted into an array such that each element of the ML is an element of the array.

For example: Array(1) = <mrow> Array(2) = <mi> Array(3) = x Array(4) = <mi>

and soooo on...

Now I have to figure out how to turn it into regular old 'typed math.'

Any suggestions? (Besides give up :smile:)

I figure the fact that the ML could be seen having been typed from 'left to right" should help... but I am having trouble seeing how to attack this.Even the most "trivial" of suggestions would be appreciated.

Thanks! :smile:

EDIT: Okay let's add in some of these element's definitions:

mrow—displays its subelements in a horizontal row.
mi—represents an identifier such as the name of a function or variable.
mo—represents an operator or delimiter.
mn—represents a number.
 
Last edited:
Computer science news on Phys.org
  • #2
So I had an idea for this. It is pretty vague and I know there will be a lot of special cases, but it is a start.

The tags that get wrapped around each term are analogous to sets of parenthesis; so I will describe my idea in terms of this analogy.

My idea is based on the distance between a left and right parenthesis. When working with a simple arithmetic expression with multiple sets of parentheses, one could denote the location of the 'innermost' left parenthesis and then denote its corresponding right parenthesis as the 'closest' right-parenthesis to it.

The backing out from the center of the expression, the next left-parenthesis would be matched to the 2nd closest right-parenthesis and so on.

This analogy could be extended on to the MathML by noting that there now different 'flavors' of parenthesis.

I know there will be some difficulties that arise. For one, how do actually determine what should be the 'innermost' set?

Secondly, what if there is a 'lone pair' somewhere for example ((()))+()

Thoughts?
 
  • #3
I told you, you should use a VBA library for parsing XML, perhaps in conjunction with the MathML DTD (if you want to validate). Hacking up your own parser will take more effort to get working and it will likely have bugs that an XML parser library would not have.
 
  • #4
mXSCNT said:
I told you, you should use a VBA library for parsing XML, perhaps in conjunction with the MathML DTD (if you want to validate). Hacking up your own parser will take more effort to get working and it will likely have bugs that an XML parser library would not have.

I have no programming experience, have read the links you provided me and still have no idea what an XML Parser Library is or does. Maybe you could dumb it down for me.
 
  • #5
A text file has some format it is written in. In this case your file is written in the XML format, specifically MathML. When you want to extract information from the file, it has to be processed into a different format which your program can more easily work with. That processing is the job of a parser. In your posts about this, you describe your attempts to hand-code a parser for MathML. You might not have used that word for it until now, but that's what you've been doing.

An XML parsing library will do something similar to what you've been attempting. It will break down the MathML file into the parts that you are interested in. For example, if you used the library with the example you gave in this thread, it would tell you things like "The file has a top-level tag called mrow, which has a number of children. The first child of mrow is called mi, and it has the text value 'x.' The second child of mrow is called mo, and it has the text value '=.' The third child of mrow is called mfrac, and it has several children. The first child of mfrac is another tag called mrow ... " and so on, except not in English but in a way your program can deal with. The XML parsing library might also have other tools; for example it might be able to validate the file, which means to determine whether the file contains well-formed MathML or whether the MathML has some syntax error.

The question at this point would be what to do with that information. One simple way to organize your code is to have a set of functions, each designed to output the Excel function toolbar code for a particular tag (a tag is a name like mrow). You could have another function dispatch() which takes a tag as an argument and decides which function to call. So for example, your mfrac code might look vaguely like this (I don't know VBA, but I hope you can gather what I mean):
Code:
function handle_mfrac(tag mfrac){
  if the tag mfrac does not have exactly 2 children, signal an error and terminate.
  let numerator = the first child of the tag mfrag
  let denominator = the second child of the tag mfrac
  return "(" + dispatch(numerator) + "/" + dispatch(denominator) + ")"
}
function dispatch(tag sometag){
  if the text of sometag is "mfrac" then return handle_mfrac(sometag)
  else if the text of sometag is "mrow" then return handle_mrow(sometag)
  else if the text of sometag is "msqrt" then return handle_msqrt(sometag)
  ... (and so on like that)
  else if we don't recognize the tag, signal an error and terminate
}
 
  • #6
I see. I think I am with you now. But it seems from your description, the PL does not do anything beyond what I have already accomplished, i.e. break down the XML file into its subelements.

Or perhaps it does. Does it determine the hierarchy as well? I think it does... that's the whole "children" thing.

Hmmm. I need to turn this over in my head for awhile. It seems like if it determines the hierarchy, then that should take a lot of the burden off of me.

It should also (at least I think from what you describe) eliminate the need to find a particular tags "partner" right?

For example it knows which <mrow> goes with which </mrow> right?


Thanks for your help too! :smile:
 
  • #7
Yes, a parsing library determines the hierarchy of which nodes are children of which other nodes, and it spares you from having to deal with details like closing tags. If the tags have attributes (an attribute is like <foo attr="someval">) then the parsing library will handle those as well, and properly escape ampersands, deal with a doctype declaration if one is there, and handle any other syntactical oddities so that you don't have to.
 
  • #8
Okay then. So let's say I have a bunch of MathML files saved as .txt files. Where do I go to input them? Is this a free service?
 
  • #9
The simplest way to parse this is with a recursive descent parser, which are typically hand written due to their simplicity (ie, not using some bloated parsing library). I just wrote a recursive descent parser for parsing math expressions last week, actually. It worked great.
 
  • #10
Oh drat. Now I am torn between the different approaches. I feel like since I have already put so much thought into writing my own I should continue. But the Library might be able to pick up on things that I cannot.

Any chance you could post your parser junglebeast (at your leisure) so I could see how you tackled yours?
 
  • #11
Writing a recursive descent parser might be a learning exercise, and not too difficult in this case... but not worth the trouble if you just want results. A parsing library is the easier way to go, it means you have to write less code yourself and the less code you write the less chance there is for bugs.

Saladsamurai, to open a file you need to read a VBA tutorial or book. I'm sure that opening a file is one of the things that would be covered.

Edit: However, if you are just learning to program, I would recommend Python over VBA. Avoid the MS lock-in and learn a skill that is cross-platform.
 
Last edited:
  • #12
mXSCNT said:
Writing a recursive descent parser might be a learning exercise, and not too difficult in this case... but not worth the trouble if you just want results. A parsing library is the easier way to go, it means you have to write less code yourself and the less code you write the less chance there is for bugs.

Saladsamurai, to open a file you need to read a VBA tutorial or book. I'm sure that opening a file is one of the things that would be covered.

Edit: However, if you are just learning to program, I would recommend Python over VBA. Avoid the MS lock-in and learn a skill that is cross-platform.

Good Advice, but I do need to use VBA since it's what we use at my company. I can open a file. I have already written a code that opens the file, trims the whitespace and stores each element of the XML as an array element.

Once I get the file opn, where do I actually 'plug it in' to a tool that parses it for me?

I am currently reading those links again, but I am not seeing that part of it.

Thanks guys :smile:
 
  • #13
Well, first you have to find a suitable XML parsing library for VBA, then you have to read the documentation or tutorial for that library to learn how to use it. Probably at some point you call a library parsing function on the text of the file or on the filename.

One of the links I gave you in your other thread about this appeared to be an XML parsing tutorial.
 
  • #14
mXSCNT said:
Writing a recursive descent parser might be a learning exercise, and not too difficult in this case... but not worth the trouble if you just want results. A parsing library is the easier way to go, it means you have to write less code yourself and the less code you write the less chance there is for bugs.

I don't believe that using a library will reduce the amount of code that needs to be written. I think it will increase the amount of code, and will also have a significant learning curve that takes a while to learn how to use it. And because you are depending on someone elses code, it will icnrease the chance of bugs.

In my case, I got frustrated trying to learn parsing libraries and discovered it was easier to write a new parser from the ground up. After researching a bit, I discovered that recursive descent was the easiest type of parser for this job, and after reading the wikipedia example I had it up and running less than an hour later.

http://en.wikipedia.org/wiki/Recursive_descent_parser
 
  • #15
I am just going to go with my original idea. I just read the Wiki and did not understand a word of it. I am not a programmer, so the added learning curve of learning all of the jargon necessary to understand that article is just too much.

Same goes for the Library. I don't see what is wrong with the original idea: read the ML, make some decisions as to how it should be handled, output what I need.
 
  • #16
Saladsamurai said:
I am just going to go with my original idea. I just read the Wiki and did not understand a word of it. I am not a programmer, so the added learning curve of learning all of the jargon necessary to understand that article is just too much.

Same goes for the Library. I don't see what is wrong with the original idea: read the ML, make some decisions as to how it should be handled, output what I need.

Don't let me stop you, although I expect you'll just end up wasting a lot of time trying to get something that only half-works some of the time. Parsing theory is a very mature field of computer science, they have already figured out several algorithms that work and those which don't, and the simplest of those working algorithms is what I already suggested. Another simpel strategy is SLR,

http://en.wikipedia.org/wiki/Simple_LR_parser

although I find this one to be a little more confusing. The chances of you inventing an original parsing algorithm that doesn't turn out to be identical to an already known method of parsing is near zero in my opinion..although it may take you weeks or months or programming to finally design this new parser yourself. Just saying...why try to reinvent the wheel? It's not so simple as it first looks. Perhaps after dallying around for a while with no success you will be more receptive to using an established method.
 
  • #17
Well, it is not that I am don't believe you, but it is like you said: "Parsing theory is a very mature field of computer science..."

But I know nothing about computer science. I just know how to write some simple If-Then, For Loops in C++ and MATLAB.

So when I look at the Wiki, everything in it is completely foreign to me. I will see if I can find like a "Parsing for Dummies" webpages...
 
  • #18
The format of MathML is XML.

Hook a SAX Parser into your VBA environment (You can call a c version or see if VBA has an xml SAX interface, or DOM Tree navigation will do if you don't want to use basic SAX.

Use the handler call backs to walk through the MathML code and write a state machine parser that will map that to the standard equation syntax format.

Note: You may have already created a good enough custom parse of MathML to bypass SAX, but checking out this option might be interesting even if you don't decide to use it.

Build up some basic bottom up transformations and then see if you can stitch them together with an overlaying structure that allows you to invoke them recursively as needed to make the transformation.

I'm simply trying to give you a method of approach, not over complicate the issues, read the comments a couple of times and think about it.

Good Luck!
 
  • #19
Here is a ref to the xml and vba interface:

http://en.allexperts.com/q/XML-1469/Manipulating-XML-files-Excel.htm [Broken]
 
Last edited by a moderator:
  • #20
lrowe_uk_phy said:
Here is a ref to the xml and vba interface:

http://en.allexperts.com/q/XML-1469/Manipulating-XML-files-Excel.htm [Broken]

Thanks! :smile: I will look into this. This is a great learning experience so far!
 
Last edited by a moderator:

1. What is MathML?

MathML (Mathematical Markup Language) is an XML-based language used for representing mathematical equations and expressions in a structured format. It allows for the display and communication of mathematical content on the web and in other applications.

2. Why do we need an algorithm to decode MathML?

MathML is not easily readable by humans and requires special software or algorithms to interpret and display properly. Therefore, an algorithm is needed to convert MathML code into a format that can be easily understood and used by mathematicians, scientists, and other users.

3. How does the algorithm for decoding MathML work?

The algorithm for decoding MathML involves breaking down the MathML code into smaller components and then converting them into corresponding mathematical symbols and expressions. This requires the use of parsing techniques and knowledge of mathematical notations and syntax.

4. What challenges are involved in developing an algorithm to decode MathML?

Some of the challenges in developing an algorithm to decode MathML include handling complex mathematical expressions, dealing with different notations and syntax, and ensuring accuracy and efficiency in the conversion process. Additionally, the algorithm may need to be updated to support new versions of MathML.

5. How can I contribute to the development of an algorithm to decode MathML?

If you are interested in contributing to the development of an algorithm to decode MathML, you can learn more about the language and its syntax, and familiarize yourself with parsing techniques and algorithms. You can also join online communities and forums dedicated to MathML and contribute to open-source projects related to MathML decoding.

Similar threads

  • Introductory Physics Homework Help
Replies
3
Views
794
  • Introductory Physics Homework Help
Replies
9
Views
712
  • Introductory Physics Homework Help
Replies
2
Views
740
  • MATLAB, Maple, Mathematica, LaTeX
Replies
1
Views
3K
  • Programming and Computer Science
Replies
4
Views
2K
  • MATLAB, Maple, Mathematica, LaTeX
Replies
1
Views
2K
  • Programming and Computer Science
Replies
17
Views
2K
  • Engineering and Comp Sci Homework Help
Replies
5
Views
1K
  • Engineering and Comp Sci Homework Help
Replies
8
Views
1K
  • Programming and Computer Science
Replies
4
Views
1K
Back
Top