To read website content programmatically, various programming languages and tools can be utilized, such as PowerShell, Python, and PHP. PowerShell scripts can download webpage content using the WebClient class and search for specific keywords using string matching or regular expressions. Python is popular for web scraping, leveraging libraries like Requests and BeautifulSoup for data extraction. Compliance with a website's robots.txt file is essential to avoid being blacklisted while scraping. For non-text files like PDFs, automated reading requires downloading the file and using appropriate software to process its contents.
#1
jackparker5
I'm basically looking to find out how I could code a program to read the textual contents of a website (in batch, Dev-C++, or PowerShell script, or a combination) and then search for a specific word which, when found, will trigger a set of commands. It's basically like the findstr command in batch, which can look through local text files.
There are also libraries that expose this kind of functionality through an API usable from a C program, for example.
#4
jackparker5
It's meant to go through the URLs listed in urls.txt and, as you can see, search for the word "Test" in each URL's content. This is the code so far:
They allow you to request, process, and parse information from a URL. You should know a bit about the URL's source code (HTML) to be able to get the data you want; with my Chrome browser I just right-click on the site's page and choose to view the page source.
Once you have retrieved the data, you can process it in the same script as if you had read it in from a text file on your local PC (e.g. output it, use it in variables or logical expressions, and so on).
As far as I can tell, Python is used most often for this. There are likely some very thorough tutorials that walk through exactly this.
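As a minimal sketch of the fetch-and-search idea in Python, using only the standard library (html.parser instead of BeautifulSoup): the HTML here is inlined so the example is self-contained, and the example.com URL in the comment is just a placeholder.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, ignoring tags and scripts."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_script = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)


def page_contains(html, keyword):
    """Return True if the page's visible text contains the keyword."""
    parser = TextExtractor()
    parser.feed(html)
    return keyword in " ".join(parser.chunks)


# With a live page you would instead do:
#   html = urlopen("http://example.com").read().decode("utf-8", "replace")
html = "<html><body><p>This is a Test page.</p></body></html>"
print(page_contains(html, "Test"))   # True
```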
If and when you deploy such a scraper I would urge you to comply with the robots.txt file if provided by the website.
That file will tell you if you're "allowed" to scrape the website. Otherwise you might be blacklisted by some sites depending on the traffic you generate.
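As a sketch of that check, Python's standard library ships a robots.txt parser. Here the file's content is inlined rather than fetched from a live site (normally you would point RobotFileParser at the site's /robots.txt URL and call read()), and the "MyScraper" user agent is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt content; a real scraper would fetch this from
# http://<site>/robots.txt instead.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Allowed: the path is not covered by a Disallow rule.
print(rp.can_fetch("MyScraper", "http://example.com/index.html"))   # True
# Forbidden: /private/ is disallowed for all user agents.
print(rp.can_fetch("MyScraper", "http://example.com/private/x"))    # False
```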
And $pageSource is a string containing the source code of the web page!
If you want to do more, then there is the cURL library. This is basically a web browser (really, if done properly, a web host won't know whether it's a human or cURL requesting the page). You can set the headers sent and read the headers received. So you can POST (i.e. fill in forms and send them), use cookies, read over a secure protocol (HTTPS), upload files ... and so much more.
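For what it's worth, the capabilities described here (custom headers, POST bodies, cookies) aren't exclusive to cURL; as a sketch, Python's standard library can build the same kind of request. The endpoint and form fields below are made up, and the request is only constructed, not actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form data, URL-encoded as a browser would submit it.
form = urlencode({"user": "jack", "query": "Test"}).encode("ascii")

req = Request(
    "http://example.com/search",   # hypothetical endpoint
    data=form,                     # the presence of a body makes this a POST
    headers={
        "User-Agent": "Mozilla/5.0",   # mimic a browser, as cURL can
        "Cookie": "session=abc123",    # hand-set cookie header
    },
)

print(req.get_method())            # POST
# To actually send it: urllib.request.urlopen(req)
```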
Thanks, that seems quite easy. Do you know how I could search the variable $pageSource for a list of words? Does it need to be printed somewhere, or can it be done directly? I'm not good with PHP at all, and that would be an overstatement.
// Assuming one URL per line in the file. FILE_IGNORE_NEW_LINES strips the
// trailing newline from each line, so the URLs are usable as-is.
$info = file('c:\users\dell\desktop\urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($info as $url) {
    $startTime = microtime(true);
    $output = file_get_contents($url);
    $endTime = microtime(true);
    if (strpos($output, "Test") !== false) {
        // "`t" is PowerShell's tab escape; in PHP the equivalent is "\t"
        echo "Success\t\t" . $url . "\t\t" . ($endTime - $startTime) . " seconds\n";
    }
}
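For comparison, the same loop can be sketched in Python with only the standard library. The file path and keyword here are whatever you pass in; nothing is fetched until you call the function.

```python
import time
from urllib.request import urlopen


def check_urls(url_file, keyword):
    """Fetch each URL listed (one per line) in url_file and report the
    ones whose body contains keyword, with the fetch time in seconds."""
    results = []
    with open(url_file) as f:
        # strip() drops the trailing newline -- the same pitfall as
        # PHP's file() without FILE_IGNORE_NEW_LINES
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        start = time.monotonic()
        body = urlopen(url).read().decode("utf-8", "replace")
        elapsed = time.monotonic() - start
        if keyword in body:
            results.append((url, elapsed))
            print(f"Success\t\t{url}\t\t{elapsed:.3f} seconds")
    return results

# e.g. check_urls("urls.txt", "Test")
```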
Well, what I had as code was a PowerShell script (.ps1); I'm not sure how PHP code could be combined with it, can it? I tried to run the code you wrote in the PowerShell ISE and it said:
Missing 'in' after variable in foreach loop.
At line:4 char:16
When you encounter such an error, you can google "PowerShell" + "Missing 'in' after variable in foreach loop".
Or, if you don't know the construct you are using, search for "PowerShell foreach", which is how I found the link above.
First, I must apologize: I didn't notice the code in your first post, which is why I brought up PHP; I thought you were looking for a more general method. And PHP is not PowerShell, so it won't work there.
I got to work with PS and I think I found your mistake; I checked just the part that fetches a URL and looks for a word.
The "*" before and after "Test" is a wildcard pattern: it matches anything before and after "Test". I also added a Write-Host command before your string to write it to the screen, and the "+" signs aren't necessary to concatenate the strings. As I suspected, "`t" is a tab. And I introduced TimeSpan for the timing.
Same answer. For example, when you open a .pdf in your browser, the browser must have access to a program that can read .pdf files in order to show its contents; the browser itself just downloads the file. If it doesn't have such access, it will show you a 'Save as ...' dialog.
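One practical detail worth sketching: a downloaded response is just bytes, and PDF files announce themselves with the magic marker %PDF- at the start. So a script can at least detect that it needs a PDF-capable tool before trying to treat the content as text.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the downloaded bytes start with the PDF magic marker."""
    return data.startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7 ..."))       # True
print(looks_like_pdf(b"<html>...</html>"))   # False
```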