How Can I Scrape Data From A Webpage After Searching Desired Data Using Html Agility
Solution 1:
Search results on the website you mentioned are rendered dynamically with JavaScript, and the data arrives as a JSON response via Ajax. HtmlAgilityPack is intended to parse HTML, not JSON.
Consider using the Selenium or iMacros drivers for .NET, or the WebBrowser class provided by the .NET Framework. These tools run a browser in the background, so they can execute the page's JavaScript and render the HTML you want to scrape.
You just need to set a suitable timeout so that they keep waiting until the search results appear on the page.
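A minimal sketch of such a wait, assuming the Selenium .NET bindings together with the Selenium.Support package (which provides WebDriverWait). The 30-second timeout is an arbitrary choice, and the `//tr[@class='altBG']` row selector is borrowed from the code in Solution 2; treat both as assumptions to adjust for your page.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Support.UI;

class WaitForResults
{
    static void Main()
    {
        IWebDriver driver = new FirefoxDriver();
        try
        {
            driver.Navigate().GoToUrl("https://enquiry.indianrail.gov.in");

            // ... fill in and submit the search form here; without that step
            // the wait below will time out because no result rows are rendered ...

            // Poll for up to 30 seconds (assumed timeout) until at least one
            // result row exists. Throws WebDriverTimeoutException otherwise.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
            wait.Until(d => d.FindElements(By.XPath("//tr[@class='altBG']")).Count > 0);

            // The rendered HTML can now be handed to HtmlAgilityPack.
            string html = driver.PageSource;
            Console.WriteLine(html.Length);
        }
        finally
        {
            // Always close the browser, even if the wait times out.
            driver.Quit();
        }
    }
}
```

An explicit wait like this returns as soon as the condition holds, so it is usually both faster and more reliable than a fixed pause.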
Solution 2:
As @derloopkat already said, just use Selenium.
The site uses JavaScript and Ajax to update the HTML of the page. Even if you made a plain HTTP request to a URL like the following:
https://enquiry.indianrail.gov.in/ntes/NTES?action=getTrainsViaStn&viaStn=NDLS&toStn=null&withinHrs=2&trainType=ALL&6iop0ssrpi=1m1ol4ha86
You will only get back the following:
(function(){location.reload();/*ho ho ho ho*/})()
This means that the last parameter of the URL:
&6iop0ssrpi=1m1ol4ha86
is some kind of "password" (for lack of a better word) that makes sure you can't simply replay the requests. You could try to crack it, but it is obscured in a JavaScript file of 3396 lines of very dense code, so it is very hard (maybe even impossible) to find out what to send the server in order to receive the data you want.
On top of that, the response from the server is never HTML but JSON-like data (strictly speaking a JavaScript object literal, since it has unquoted keys and embedded functions), formatted like this:
_obj_1511003507337 = {
    trainsInStnDataFound:"trainRunningDataFound",
    allTrains:[
        {
            trainNo:"14316",
            startDate:"18 Nov 2017",
            trainName:"INTERCITY EXP",
            trnName:function(){return _LANG==="en-us"?"INTERCITY EXP":"इंटरसिटीएक्स."},
            trainSrc:"NDLS",
            trainDstn:"BE",
            runsOn:"NA",
            schArr:"Source",
            schDep:"16:35, 18 Nov",
            schHalt:"Source",
            actArr:"Source",
            delayArr:"RIGHT TIME",
            actDep:"16:35, 18 Nov",
            delayDep:"RIGHT TIME",
            actHalt:"Source",
            trainType:"MEX",
            pfNo:"9"
        },
        {
            trainNo:"12625",
            startDate:"16 Nov 2017",
            trainName:"KERALA EXPRESS",
            trnName:function(){return _LANG==="en-us"?"KERALA EXPRESS":"केरलएक्स."},
            trainSrc:"TVC",
            trainDstn:"NDLS",
            runsOn:"NA",
            schArr:"13:45, 18 Nov",
            schDep:"Destination",
            schHalt:"Destination",
            actArr:"16:56, 18 Nov",
            delayArr:"03:11",
            actDep:"Destination",
            delayDep:"RIGHT TIME",
            actHalt:"Destination",
            trainType:"SUF",
            pfNo:"4"
        }
    ]
}
Here is the solution to get the HTML and data using Selenium.
using System;
using System.Threading;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

namespace test
{
    class Program
    {
        public static void Main(string[] args)
        {
            IWebDriver driver = new FirefoxDriver();
            driver.Navigate().GoToUrl("https://enquiry.indianrail.gov.in");

            Console.WriteLine("Step 1");
            // Click the link with id 'ui-id-2', then give the page time to update.
            driver.FindElement(By.XPath("//a[@id='ui-id-2']")).Click();
            Thread.Sleep(10000);

            Console.WriteLine("Step 2");
            // Type the station into the 'viaStation' input.
            driver.FindElement(By.XPath("//input[@id='viaStation']")).SendKeys("NEW DELHI [NDLS]");
            Thread.Sleep(2000);

            Console.WriteLine("Step 3");
            // Submit the search.
            driver.FindElement(By.XPath("//button[@id='viaStnGoBtn']")).Click();

            // PRESS A KEY WHEN THE HTML IS FULLY LOADED
            Console.ReadKey();

            // Hand the browser-rendered HTML to HtmlAgilityPack.
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(driver.PageSource);

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection nodeCol = doc.DocumentNode.SelectNodes("//body//tr[@class='altBG']");
            if (nodeCol == null)
            {
                Console.WriteLine("No result rows found.");
            }
            else
            {
                foreach (HtmlNode node in nodeCol)
                {
                    Console.WriteLine("Trip:");
                    foreach (HtmlNode child in node.ChildNodes)
                    {
                        Console.WriteLine("\t" + child.InnerText);
                    }
                }
            }

            //Console.WriteLine(doc.DocumentNode.InnerHtml);
            Console.ReadKey();

            // Close the browser and end the driver session.
            driver.Quit();
        }
    }
}
The Thread.Sleep() calls should not be necessary; I just put them in as a precaution. Speed can also be improved by using a headless driver such as PhantomJS.
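One way to run headless while keeping the Firefox driver used above is to pass Firefox's `-headless` argument through FirefoxOptions. This is only a sketch under that assumption (it swaps in headless Firefox rather than PhantomJS), and printing the page title is just a placeholder for the real scraping steps.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class HeadlessExample
{
    static void Main()
    {
        // Ask Firefox to start without a visible window.
        var options = new FirefoxOptions();
        options.AddArgument("-headless");

        IWebDriver driver = new FirefoxDriver(options);
        try
        {
            driver.Navigate().GoToUrl("https://enquiry.indianrail.gov.in");

            // Placeholder: the search and parsing steps would go here.
            Console.WriteLine(driver.Title);
        }
        finally
        {
            // Close the browser and end the driver session.
            driver.Quit();
        }
    }
}
```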