
Parsing HTML and Searching Text

  1. Oct 29, 2012 #1
    I need to parse HTML and search it for some text. For example, suppose the string XXX (just an arbitrary placeholder) appears somewhere in the HTML. I need to parse the HTML, find XXX, and get the CSS attributes of the element that contains it. However, XXX could be inside any HTML tag.


    Java, JavaScript, and PHP are all available for this task. Can anybody help me with this?
     
  3. Oct 29, 2012 #2

    gabbagabbahey

    Homework Helper
    Gold Member

    Can you clarify your requirements a little? Where is the HTML that you wish to search? Where are you outputting the results?
     
  4. Oct 29, 2012 #3
    <html>
    <head>
    <title>asd
    </title>
    </head>
    <body>
    <div class ="abc"> xxx</div>
    <div class ="yyy"> foo</div>
    <div class ="zzz"> zoo</div>
    </body>
    </html>

    For example, this is the HTML file. I'm searching for xxx, but I don't know whether it is in a div, a span, or an a element. In this example it is in the div with class abc, but it could be inside any HTML tag. I need to find xxx and get its CSS styles (e.g. bold, italic, font size).
     
  5. Oct 29, 2012 #4
    It's a pity you're not using Python. This is something that could be solved trivially using a combination of Python, regular expressions, and Beautiful Soup.
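
    For what it's worth, here is a rough sketch of that idea. It uses the standard library's html.parser in place of Beautiful Soup (a swap made only so the example has no third-party dependency), and it skips regular expressions in favor of a plain substring test. The names VOID_TAGS, TextFinder, find_text and SAMPLE are made up for illustration; SAMPLE is the page from post #3. The parser keeps a stack of open tags and records the innermost tag and its attributes whenever a run of text contains the search string:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextFinder(HTMLParser):
    """Record the enclosing tag and attributes of each text run containing needle."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []    # currently open elements as (tag, attrs) pairs
        self.matches = []  # (tag, attrs) of the innermost element around each hit

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.needle in data and self.stack:
            self.matches.append(self.stack[-1])


def find_text(html, needle):
    """Return a list of (tag, attrs) for every text run that contains needle."""
    finder = TextFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.matches


SAMPLE = """<html>
<head>
<title>asd
</title>
</head>
<body>
<div class ="abc"> xxx</div>
<div class ="yyy"> foo</div>
<div class ="zzz"> zoo</div>
</body>
</html>"""

print(find_text(SAMPLE, "xxx"))  # [('div', {'class': 'abc'})]
```

    Note that this only tells you which tag holds the text and what its attributes are (class, inline style). Working out the styles that a stylesheet actually applies needs something that evaluates CSS, such as a browser.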
     
  6. Oct 29, 2012 #5

    gabbagabbahey

    Homework Helper
    Gold Member

    That doesn't really answer my questions. Are you writing the HTML files yourself? Are they given to you in a directory somewhere, or are you getting them from one or more websites? What do you plan to do with the CSS info once you find it?

    The answers to these questions will help you choose between Java, JavaScript, and PHP.
     
  7. Nov 2, 2012 #6
    Each of those languages has a DOM parser; a quick Google search turns up many tutorials on how to read XML or HTML tags programmatically.

    If you don't know what the DOM is yet, that's the first thing you should find out.
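
    To give a feel for what working with the DOM looks like, here is a minimal sketch in Python using xml.dom.minidom from the standard library. It assumes the markup is well-formed enough to parse as XML, which the sample from post #3 happens to be; real-world HTML often is not and would need a proper HTML parser. The names find_text_nodes, describe_ancestors and SAMPLE are hypothetical. The same walk (visit text nodes, then climb parentNode) should carry over to the DOM APIs in Java, JavaScript, and PHP:

```python
from xml.dom import minidom


def find_text_nodes(node, needle):
    """Yield every text node under node whose data contains needle."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if needle in child.data:
                yield child
        else:
            yield from find_text_nodes(child, needle)


def describe_ancestors(text_node):
    """Return (tag, class, style) for each enclosing element, innermost first."""
    chain = []
    element = text_node.parentNode
    while element is not None and element.nodeType == element.ELEMENT_NODE:
        chain.append((element.tagName,
                      element.getAttribute("class"),   # "" when absent
                      element.getAttribute("style")))  # "" when absent
        element = element.parentNode
    return chain


SAMPLE = """<html>
<head>
<title>asd
</title>
</head>
<body>
<div class ="abc"> xxx</div>
<div class ="yyy"> foo</div>
<div class ="zzz"> zoo</div>
</body>
</html>"""

doc = minidom.parseString(SAMPLE)
for hit in find_text_nodes(doc, "xxx"):
    print(describe_ancestors(hit))
    # [('div', 'abc', ''), ('body', '', ''), ('html', '', '')]
```

    The whole ancestor chain is returned rather than just the parent because properties such as font weight and size are inherited, so the element that makes the text bold may sit several levels above it.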
     