Parsing HTML and Searching Text

  • Thread starter NerseC
  • Tags
    Html Text
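In summary: The thread asks how to find a given piece of text in an HTML document and read the styling of the element that contains it. The suggested approach is a DOM (Document Object Model) parser. The DOM represents the document as a tree, so a program can reach the title, the paragraphs, any other element, and the attributes of each element.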
  • #1
NerseC
I need to parse HTML and search it for some text. For example, given a string such as XXX (just an arbitrary input), I need to find it in the HTML and get the CSS attributes of the element that contains it. The text can appear inside any HTML tag.


Java, JavaScript, and PHP are all available for this task. Can anybody help me with this?
 
  • #2
Can you clarify your requirements a little? Where is the HTML that you wish to search? Where are you outputting the results?
 
  • #3
<html>
<head>
<title>asd
</title>
</head>
<body>
<div class="abc"> xxx</div>
<div class="yyy"> foo</div>
<div class="zzz"> zoo</div>
</body>
</html>

For example, this is the HTML file. I'm searching for xxx, but I don't know whether it is in a div, a span, or an a element. In this example it is in the div with class abc, but it could be in any HTML tag. I need to find each occurrence of xxx and get the CSS styles applied to it (e.g. bold, italic, font size).
 
  • #4
It's a pity you're not using Python. This is something that could be solved trivially using a combination of Python, regular expressions, and Beautiful Soup.
 
  • #5
That doesn't really answer my questions. Are you writing the HTML files yourself? Are they given to you in a directory somewhere, or are you getting them from one or more websites? What do you plan to do with the CSS info once you find it?

The answers to these questions will help you choose between Java, JavaScript, and PHP.
 
  • #6
Each of those languages has a DOM parser. A quick Google search turns up many tutorials on how to read XML or HTML tags programmatically.

If you don't know what the DOM is yet, that's the first thing you should find out.
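
As a rough illustration of the parser-based approach, here is a minimal sketch in Python (the language mentioned in post #4) that uses only the standard library's html.parser. The names TextFinder, find_text, and SAMPLE are made up for this example, and the sample page is the one from post #3. It reports the innermost tag around each match together with that tag's attributes. Note the assumption: a parser like this only sees what is written in the markup (class and inline style attributes). Styles that come from a stylesheet would need a browser or a CSS engine to resolve.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextFinder(HTMLParser):
    """Remember the innermost open tag around every text node containing needle."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle.lower()
        self.stack = []    # currently open elements as (tag, attributes)
        self.matches = []  # (tag, attributes) for each matching text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.needle in data.lower():
            self.matches.append(self.stack[-1])


def find_text(html, needle):
    """Return a list of (tag, attributes) pairs for text nodes containing needle."""
    finder = TextFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.matches


SAMPLE = """<html>
<head><title>asd</title></head>
<body>
<div class="abc"> xxx</div>
<div class="yyy"> foo</div>
<div class="zzz"> zoo</div>
</body>
</html>"""

if __name__ == "__main__":
    print(find_text(SAMPLE, "xxx"))  # [('div', {'class': 'abc'})]
```

The class name found this way (abc here) is what you would then look up in the stylesheet to see whether the text is bold, italic, and so on.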
 

1. What is HTML parsing?

HTML parsing is the process of breaking an HTML document down into its individual components: tags, attributes, and text. Once the document is in that form, a program can search and manipulate its content by structure rather than by raw characters.
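
To make "breaking down into components" concrete, the following sketch logs the pieces that Python's standard html.parser reports for a small fragment. TokenLogger and tokenize are illustrative names, and the fragment is invented for the example.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Collect start tags (with attributes), end tags, and non-blank text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.tokens.append(("text", text))


def tokenize(html):
    """Return the list of components found in html, in document order."""
    logger = TokenLogger()
    logger.feed(html)
    logger.close()
    return logger.tokens


if __name__ == "__main__":
    for token in tokenize('<p class="note">Hello <b>world</b></p>'):
        print(token)
```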

2. What tools can be used for parsing HTML and searching text?

Several kinds of tools are available: regular expressions for simple pattern matching, DOM (Document Object Model) parsing libraries, HTML parsing libraries such as BeautifulSoup, and web scraping frameworks such as Scrapy.

3. How do regular expressions help with parsing HTML and searching text?

Regular expressions, also known as regex, are patterns used to match and manipulate strings of text. They can locate specific patterns or keywords in an HTML document, but they do not understand nesting, so they cannot reliably tell which element a match belongs to. They work best for searching text that a parser has already extracted.
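
A small sketch of that limitation, using Python's re module on an invented fragment (HTML, raw_hits, and text_hits are names chosen for the example). A plain search over the markup also matches inside an attribute value, and the crude tag-stripping pattern used to work around that is itself an assumption that breaks on comments, scripts, or a ">" inside an attribute.

```python
import re

# Invented fragment: "xxx" occurs once in an attribute value and once in text.
HTML = '<div class="xxx-box">intro</div><p>some xxx here</p>'

pattern = re.compile(r"xxx", re.IGNORECASE)

# Searching the raw markup finds both occurrences, with no idea which is text.
raw_hits = [match.start() for match in pattern.finditer(HTML)]

# Crude workaround: blank out anything that looks like a tag, then search.
text_only = re.sub(r"<[^>]*>", " ", HTML)
text_hits = pattern.findall(text_only)

if __name__ == "__main__":
    print(len(raw_hits), "hits in raw markup")
    print(len(text_hits), "hits in visible text")
```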

4. What is the Document Object Model (DOM) and how does it aid in parsing HTML?

The Document Object Model (DOM) is a programming interface that represents the structure of an HTML document as a tree of objects. Each element, attribute, and piece of text is a node in that tree, so a program can move between parents and children and read or change any node, which makes the DOM well suited to parsing and searching HTML.
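
The tree navigation described above can be sketched with Python's standard xml.dom.minidom, which implements a subset of the DOM interface. This assumes the page is well-formed XML (as the sample from post #3 happens to be). Real-world HTML often is not, and would need a more forgiving parser. The helper names iter_text_nodes, path_to, and find_in_dom are made up for the example.

```python
from xml.dom import Node, minidom

# Assumption: the input is well-formed enough to be parsed as XML.
XHTML = """<html>
<head><title>asd</title></head>
<body>
<div class="abc"> xxx</div>
<div class="yyy"> foo</div>
<div class="zzz"> zoo</div>
</body>
</html>"""


def iter_text_nodes(node):
    """Yield every text node below node, depth first."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            yield child
        else:
            yield from iter_text_nodes(child)


def path_to(element):
    """Return the chain of tag names from the root down to element."""
    names = []
    while element is not None and element.nodeType == Node.ELEMENT_NODE:
        names.append(element.tagName)
        element = element.parentNode
    return " > ".join(reversed(names))


def find_in_dom(xhtml, needle):
    """Return (path, class attribute) for each element whose text contains needle."""
    document = minidom.parseString(xhtml)
    results = []
    for text in iter_text_nodes(document.documentElement):
        if needle in text.data:
            parent = text.parentNode
            results.append((path_to(parent), parent.getAttribute("class")))
    return results


if __name__ == "__main__":
    print(find_in_dom(XHTML, "xxx"))  # [('html > body > div', 'abc')]
```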

5. Why is parsing HTML and searching text important for data analysis?

Parsing HTML and searching its text make it possible to extract specific information from a website or web page. That information can then feed tasks such as market research, sentiment analysis, or data mining.
