How would one read website content using a program?

In summary: to code a program that reads the textual contents of a website and searches for a specific word, you can use tools like WebClient in PowerShell, curl on Linux/macOS, or file_get_contents in PHP. These tools retrieve the source code of a webpage, which you can then search with string functions or regular expressions, for example with PHP's preg_match_all:
PHP:
if (preg_match_all("/Test/", $output, $matches) !== false) {
    // $matches[0] contains all the matches
    echo implode("\n", $matches[0]);
}
It is important to comply with the website's robots.txt file and terms of use.
  • #1
jackparker5
I'm basically looking to find out how I could code a program to read the textual contents of a website (in either batch, Dev-C++ or PowerShell script, or a combination) and then search for a specific word which, when found, will trigger a set of commands. It's basically like the findstr command in batch, which can look through local text files.
 
  • #3
Some folks use the curl command in Linux/macOS to get a webpage and then use awk, perl, python or ruby to extract the data of interest.

https://en.wikipedia.org/wiki/CURL

There is also a library of APIs (libcurl) that can be used in a C program, for example.
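For instance, a minimal sketch of that approach, assuming Python as the extraction language (the search word is just a placeholder):
Python:
# Call the curl command from Python and search its output for a word.
import subprocess

result = subprocess.run(
    ["curl", "-s", "https://en.wikipedia.org/wiki/CURL"],  # -s: silent mode
    capture_output=True,
    text=True,
)
if "Test" in result.stdout:
    print("Word found in the page source")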
 
  • #4
It's meant to fetch the URLs listed in urls.txt and, as you can see, search the content of each URL for the word "Test". This is the code so far:

$webClient = new-object System.Net.WebClient
$webClient.Headers.Add("user-agent", "PowerShell Script")

$info = get-content c:\users\dell\desktop\urls.txt

foreach ($i in $info) {
$output = ""

$startTime = get-date
$output = $webClient.DownloadString($i)
$endTime = get-date

if ($output -like "Test") {
"Success`t`t" + $i + "`t`t" + ($endTime - $startTime).TotalSeconds + " seconds"
}

}

But for some reason, it won't work. Does anyone have a suggestion as to why not?
 
  • #5
In Python there are two modules for this:
Python:
import requests; from bs4 import BeautifulSoup
They allow you to request, process and parse information from a URL. You should know a bit about the URL's source code (HTML) to be able to get the data you want; with my Chrome browser I just right-click on the site's page and choose "View page source".
Once you retrieve the data, you can process it in the same script as if you had read it in from a text-like file on your local PC (e.g. output it, use it as variables or in logical expressions and so on).
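A minimal sketch along those lines, assuming the requests and beautifulsoup4 packages are installed (the URL and search word are placeholders):
Python:
# Fetch a page, strip the HTML tags, and search the visible text for a word.
import requests
from bs4 import BeautifulSoup

url = "http://www.example.com/"
response = requests.get(url, headers={"User-Agent": "Python Script"})
soup = BeautifulSoup(response.text, "html.parser")

text = soup.get_text()  # the textual content only, tags removed
if "Test" in text:
    print("Found 'Test' at " + url)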
 
  • #6
The keywords you need seem to be web scraping.

As far as I can tell, Python is used most often, and there are likely some very explicit tutorials for this.

If and when you deploy such a scraper, I would urge you to comply with the robots.txt file if the website provides one.
That file will tell you whether you're "allowed" to scrape the website. Otherwise you might be blacklisted by some sites, depending on the traffic you generate.
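Assuming Python, the standard library can check a site's robots.txt for you; a minimal sketch (the URLs are placeholders):
Python:
# Check whether a user agent may fetch a URL, per the site's robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # download and parse the file

if rp.can_fetch("*", "http://www.example.com/some/page"):
    print("Scraping this page is allowed")
else:
    print("robots.txt disallows this page")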
 
  • #7
With PHP, this can be really easy with a simple file_get_contents:
PHP:
$pageSource = file_get_contents('http://www.example.com/');
And $pageSource is a string containing the source code of the web page!

If you want to do more, then there is the cURL library. This is basically a web browser (really, if done properly, a web host won't know if it's a human or cURL requesting the web page). You can set the headers sent and read the headers received. So you can POST (i.e. fill in forms and send them), use cookies, read over a secure protocol (HTTPS), upload files ... and so much more.

I'm used to the PHP version, but cURL is also available as a free command-line tool (Tutorial, How to Use, curl usage explained) and libcurl is available for C (and probably other languages as well). The cURL website even offers a comparison table with its competitors, if you want to explore more options.
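As an example of those other bindings, here is a sketch using libcurl through Python's pycurl package (pip install pycurl; the URL is a placeholder), doing roughly what the PHP call above does:
Python:
# Fetch a page through libcurl and search its source for a word.
from io import BytesIO

import pycurl

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.example.com/")
c.setopt(pycurl.WRITEDATA, buffer)     # collect the response body
c.setopt(pycurl.FOLLOWLOCATION, True)  # follow redirects, like a browser would
c.perform()
c.close()

pageSource = buffer.getvalue().decode("utf-8", errors="replace")
print("Test" in pageSource)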
 
  • #8
jack action said:
With PHP, this can be really easy with a simple file_get_contents ...

Thanks, that seems quite easy. Do you know how I could search the variable $pageSource for a list of words? Does it need to be printed somewhere, or can it be done directly? I'm not good with PHP at all, and that would be an overstatement.
 
  • #9
To manipulate strings in PHP, you have the string functions and the regular expression functions. For simply searching, you can use:

  • stripos — Find the position of the first occurrence of a case-insensitive substring in a string
  • stristr — Case-insensitive strstr
  • strpbrk — Search a string for any of a set of characters
  • strpos — Find the position of the first occurrence of a substring in a string
  • strrchr — Find the last occurrence of a character in a string
  • strripos — Find the position of the last occurrence of a case-insensitive substring in a string
  • strrpos — Find the position of the last occurrence of a substring in a string
  • strstr — Find the first occurrence of a string
  • substr — Return part of a string
Rewriting your code in PHP could simply be:
PHP:
// Assuming one URL per line in the file; the flags strip the newlines and any blank lines:
$info = file('c:\users\dell\desktop\urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($info as $url) {
    $output = "";

    $startTime = microtime(true);
    $output = file_get_contents($url);
    $endTime = microtime(true);

    if (strpos($output, "Test") !== false) {
        // I'm not sure if «`t`t» is a special syntax I'm not familiar with
        // or if you want to write it literally, I assume the latter
        echo "Success`t`t" . $url . "`t`t" . ($endTime - $startTime) . " seconds";
    }
}
 
  • #10
jack action said:
To manipulate strings in PHP, you have the string functions and the regular expression functions ...

Well, what I had as code was a PowerShell script (.ps1). I'm not sure how PHP code can be put together with it; can it? I tried to run the code you wrote in the PowerShell ISE and it said:

Missing 'in' after variable in foreach loop.
At line:4 char:16

Sorry for my lack of knowledge..
 
  • #11
There is a slight difference in the syntax, as you can read about here: http://ss64.com/ps/foreach.html

When encountering such an error, you can Google "Powershell" + "Missing 'in' after variable in foreach loop".
Or, if you don't know the name of the construct you are using, search for "Powershell foreach", which is what gave me the link above.
 
  • #12
First, I must apologize, as I didn't notice your code in my first post, hence why I brought up PHP: I thought you were looking for a more general method. And PHP is not PowerShell, so it won't work with that.

Got to work with PS and I think I found your mistake. I just checked the part about fetching a URL and looking for a word:
Code:
$webClient = new-object System.Net.WebClient
$webClient.Headers.Add("user-agent", "PowerShell Script")

$info = get-content c:\users\dell\desktop\urls.txt

foreach ($i in $info) {
    $output = ""

    $startTime = get-date
    $output = $webClient.DownloadString($i)
    $endTime = get-date

    if ($output -like "*Test*") {
        $time = New-TimeSpan $startTime $endTime
        Write-Host "Success`t`t" $i "`t`t"  $time.TotalSeconds " seconds"
    }
}

The "*" before and after "Test" means match anything before and after "Test". I also add a "Write-Host" command before your string to write it on screen and the "+" signs aren't necessary to concatenate the strings. As I suspected "`t" is a tab. And I introduced the TimeSpan command.

Or you could use the regular expression version (apparently it is faster):
Code:
$webClient = new-object System.Net.WebClient
$webClient.Headers.Add("user-agent", "PowerShell Script")

$info = get-content c:\users\dell\desktop\urls.txt

foreach ($i in $info) {
    $output = ""

    $startTime = get-date
    $output = $webClient.DownloadString($i)
    $endTime = get-date

    if ($output -match "Test") {
        $time = New-TimeSpan $startTime $endTime
        Write-Host "Success`t`t" $i "`t`t" $time.TotalSeconds " seconds"
    }
}
 
  • #13
jack action said:
Got to work with PS and I think I found your mistake.

The code you provided worked fine : ) Thank you
 
  • #14
For non-text materials, such as online documents (.pdf or .doc), how do you read the contents?
 
  • #15
You have to download the file and open it with a program that can read such contents.
 
  • #16
jack action said:
You have to download the file and open it with a program that can read such contents.
I mean using a program so it can be automated?
 
  • #17
SleepDeprived said:
I mean using a program so it can be automated?
Same answer. For example, when you open a .pdf in your browser, your browser must have access to a program that can read .pdf files if it wants to show the contents. The browser just downloads the file. If it doesn't have such access, it will show you a 'Save as ...' window.
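To automate both steps in one script: a minimal sketch, assuming Python and the third-party pypdf package (the URL is a placeholder):
Python:
# Download a PDF and search its extracted text for a word.
import io
import urllib.request

from pypdf import PdfReader  # pip install pypdf

url = "http://www.example.com/document.pdf"
with urllib.request.urlopen(url) as response:
    data = response.read()

reader = PdfReader(io.BytesIO(data))
text = "".join(page.extract_text() or "" for page in reader.pages)

if "Test" in text:
    print("Found 'Test' in " + url)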
 

FAQ: How would one read website content using a program?

What is the purpose of reading website content using a program?

The purpose of reading website content using a program is to extract information from the website in a structured format that can be used for various purposes such as data analysis or automation.

What programming language can be used to read website content?

There are various programming languages that can be used to read website content, such as Python, JavaScript, and Java. The choice of language depends on the specific needs and requirements of the project.

How does a program read website content?

A program reads website content by using a combination of web scraping and parsing techniques. Web scraping involves downloading the HTML code of a webpage and then extracting specific data from it. Parsing involves analyzing the HTML code and extracting relevant information based on the structure and tags of the webpage.

Can website content be read without using a program?

Yes, website content can be read without using a program by manually viewing and reading the content on a web browser. However, using a program can automate this process and make it more efficient for extracting large amounts of data.

Are there any limitations or restrictions to reading website content using a program?

Yes, there can be limitations or restrictions to reading website content using a program, such as the website's terms of use or technical barriers. It is important to check the website's policies and regulations before extracting any data using a program.
