A Web Scraper is a piece of software that extracts data from websites. It can be used to extract typical information such as emails, telephone numbers and addresses from different URLs.
I created this tutorial to teach you how to create your own Web Scraper in ASP.NET MVC and jQuery. This scraper extracts all emails and telephone numbers from a specified URL and shows them in an HTML div.
It is quite easy to create, and the code I have provided is simple to follow.
The HTML design of the Web Scraper consists of a text box for the URL, a submit button, a loading image, and two divs that show the results.
First, create a controller in your ASP.NET MVC application. Name it WebScrapingController, or give it any other name (if you pick a different name, adjust the URL used in the AJAX call accordingly).
Now create a function named GetUrlSource in this controller and mark it with the [HttpPost] attribute. The jQuery AJAX method calls this function on the button click event.
The code of the GetUrlSource function is:
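// Requires: using System; using System.Net;
[HttpPost]
public string GetUrlSource(string url)
{
    // Nothing to fetch for an empty value.
    if (string.IsNullOrWhiteSpace(url))
        return "";

    // Prepend a scheme when the value does not already start with "http".
    // StartsWith avoids the exception Substring(0, 4) throws on inputs shorter than four characters.
    url = url.StartsWith("http", StringComparison.OrdinalIgnoreCase) ? url : "http://" + url;
    string htmlCode = "";
    using (WebClient client = new WebClient())
    {
        try
        {
            htmlCode = client.DownloadString(url);
        }
        catch (Exception)
        {
            // The download failed (for example, the host is unreachable); an empty string is returned.
        }
    }
    return htmlCode;
}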
Explanation – The GetUrlSource function receives the URL of the page as its parameter. It reads the HTML (page source) of that URL with the WebClient.DownloadString() method and returns it. If the download fails, it returns an empty string.
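The scheme check at the top of the function is easy to try out in isolation. The sketch below reproduces the same rule in JavaScript so it can be tested in a browser console. The name `ensureScheme` is hypothetical and not part of the tutorial's code, and the trimming and case-insensitive comparison are assumptions added for robustness.

```javascript
// Hypothetical helper that reproduces the scheme check from GetUrlSource.
// It prepends "http://" when the input does not already start with "http".
function ensureScheme(url) {
    // Assumption: surrounding whitespace is not meaningful, so it is trimmed.
    var trimmed = (url || "").trim();
    // A prefix test with a regular expression is safe for inputs of any length.
    return /^http/i.test(trimmed) ? trimmed : "http://" + trimmed;
}

console.log(ensureScheme("www.example.com"));
console.log(ensureScheme("https://example.com"));
```

Values that already carry a scheme pass through unchanged, so the function can be applied more than once without stacking prefixes.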
Create a view named Index for the WebScrapingController controller and place the following HTML code in it.
<div id="message"></div>
<input id="urlInput" type="text" placeholder="Enter URL" />
<button id="submit">Submit</button>
<button id="reset">Reset</button>
<div class="textAlignCenter">
    <img src="~/Content/Image/loading.gif" alt="Loading" />
</div>
<div id="twoColumn">
    <div></div>
    <div></div>
</div>
Explanation – The HTML above contains a div with the id twoColumn that holds two inner divs. The first inner div shows the fetched emails, and the second one shows the fetched telephone numbers.
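Each inner div ends up holding a heading followed by numbered entries. One way to produce that markup is a small function that returns an HTML string. The sketch below assumes a hypothetical helper named `renderList`; escaping each item is a precaution added here, not something the tutorial's own code does.

```javascript
// Hypothetical helper: returns the markup for one result column.
function renderList(title, items) {
    // Escape characters that have a special meaning in HTML.
    function escapeHtml(text) {
        return String(text)
            .replace(/&/g, "&amp;")
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;")
            .replace(/"/g, "&quot;");
    }

    // Heading first, then one numbered paragraph per item.
    var html = "<p><u>" + escapeHtml(title) + "</u></p>";
    items.forEach(function (item, index) {
        html += "<p>" + (index + 1) + ". " + escapeHtml(item) + "</p>";
    });
    return html;
}

console.log(renderList("Emails Found:", ["info@example.com", "sales@example.com"]));
```

In the view, the returned string could be passed to jQuery's `.html()` on the matching inner div, for example `$("#twoColumn > div").first().html(renderList("Emails Found:", list))`, where `list` is an array of strings.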
Now add the following jQuery code to the view:
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.0/jquery.min.js"></script>
<script>
    $(document).ready(function () {
        // Clear the input and both result columns.
        $("#reset").click(function (e) {
            $("#urlInput").val("");
            $("#twoColumn > div").html("");
        });

        $("#submit").click(function (e) {
            var validate = Validate();
            $("#message").html(validate);
            if (validate.length == 0) {
                $.ajax({
                    type: "POST",
                    url: "/WebScraping/GetUrlSource",
                    contentType: "application/json; charset=utf-8",
                    // Serialize the payload so special characters in the URL are escaped.
                    data: JSON.stringify({ url: $("#urlInput").val() }),
                    dataType: "html",
                    success: function (result, status, xhr) {
                        GetUrlTelePhone(result);
                    },
                    error: function (xhr, status, error) {
                        $("#message").html("Result: " + status + " " + error + " " + xhr.status + " " + xhr.statusText);
                    }
                });
            }
        });
        function GetUrlTelePhone(html) {
            // Remove duplicates ($.uniqueSort is meant for arrays of DOM elements, not strings).
            function unique(list) {
                return list.filter(function (value, index) {
                    return list.indexOf(value) === index;
                });
            }

            // "@@" is the Razor escape for a literal "@" inside a view.
            var emails = unique(html.match(/([a-zA-Z0-9._-]+@@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi) || []);
            var email = $("<p><u>Emails Found:-</u></p>");
            for (var i = 0; i < emails.length; i++)
                email.append("<p>" + (i + 1) + ". " + emails[i] + "</p>");
            $("#twoColumn > div").first().html(email);

            // The "g" flag returns every full match instead of the first match plus its groups.
            var tels = unique(html.match(/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/g) || []);
            var tel = $("<p><u>Telephones Found:-</u></p>");
            for (var j = 0; j < tels.length; j++)
                tel.append("<p>" + (j + 1) + ". " + tels[j] + "</p>");
            $("#twoColumn > div:nth-child(2)").html(tel);
        }
        // Show the loading image while a request is running.
        $(document).ajaxStart(function () {
            $("img").show();
        });

        $(document).ajaxStop(function () {
            $("img").hide();
        });

        function Validate() {
            var errorMessage = "";
            if ($("#urlInput").val() == "") {
                errorMessage += "► Enter URL<br/>";
            }
            else if (!(isUrlValid($("#urlInput").val()))) {
                errorMessage += "► Invalid URL<br/>";
            }
            return errorMessage;
        }

        function isUrlValid(url) {
            var urlregex = new RegExp(
                "^(http[s]?:\\/\\/(www\\.)?|ftp:\\/\\/(www\\.)?|www\\.){1}([0-9A-Za-z-\\.@@:%_\+~#=]+)+((\\.[a-zA-Z]{2,3})+)(/(.)*)?(\\?(.)*)?");
            return urlregex.test(url);
        }
    });
</script>
Explanation – On the button click event, the jQuery AJAX method calls the C# function GetUrlSource of the controller.
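When building the request body for this call, serializing an object is generally safer than concatenating strings, because a quote or backslash in the input would otherwise produce invalid JSON. A minimal sketch, assuming a hypothetical helper named `buildPayload`:

```javascript
// Hypothetical helper: builds the JSON body sent to GetUrlSource.
function buildPayload(url) {
    // JSON.stringify escapes quotes and backslashes inside the value.
    return JSON.stringify({ url: url });
}

var body = buildPayload('www.example.com/?q="test"');
console.log(body);
// Parsing the body gives back the original value unchanged.
console.log(JSON.parse(body).url);
```

The returned string can be used directly as the `data` option of `$.ajax` when `contentType` is set to `application/json`.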
Also note that in the success callback of the jQuery AJAX method, I call the GetUrlTelePhone function and pass the HTML of the URL as its argument.
The GetUrlTelePhone function extracts the emails and telephone numbers with regular expressions and then shows them in the two inner divs.
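The extraction step can also be tested without a page or a server. The sketch below is a standalone version under a few assumptions: the function name `extractContacts` and the sample string are hypothetical, the patterns are non-capturing variants of the tutorial's two expressions, and a single `@` is used because the doubled `@@` is only needed inside a Razor view.

```javascript
// Hypothetical standalone version of the extraction step.
function extractContacts(html) {
    var emailPattern = /[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/g;
    // \1 requires the same separator in both positions, e.g. 555-123-4567.
    var phonePattern = /\(?[0-9]{3}\)?([ .-]?)[0-9]{3}\1[0-9]{4}/g;

    // Keep only the first occurrence of each match.
    function unique(list) {
        return list.filter(function (value, index) {
            return list.indexOf(value) === index;
        });
    }

    // match() returns null when nothing is found, so fall back to an empty array.
    return {
        emails: unique(html.match(emailPattern) || []),
        phones: unique(html.match(phonePattern) || [])
    };
}

var sample = "<p>Mail info@example.com or info@example.com, call 555-123-4567.</p>";
var found = extractContacts(sample);
console.log(found.emails);
console.log(found.phones);
```

Because of the backreference, a number with mixed separators such as "(555) 123-4567" is not matched by this pattern; loosening that rule is a design choice that trades precision for recall.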
Use the link below to download the code: