- How to Instruct IWebBrowser Not to Download Images?
- Posted by darko.vukoje@gmail.com on February 16th, 2006
Hi,
I am writing a simple web crawler application that should navigate a
couple of web sites and save the data locally. For that purpose, I am
using IWebBrowser2 and its Navigate2 method. Everything is done with C++
and ATL. Since only text is needed, the idea is to improve performance
by turning off picture downloading, as can be done manually in
Internet Explorer (Tools->Internet Options->Advanced etc).
Is there any way I can set this programmatically, or otherwise instruct
IWebBrowser not to download images? This would dramatically improve
application performance.
Thanks,
Darko Vukoje
- Posted by Jerry Coffin on February 16th, 2006
darko.vukoje@gmail.com wrote:
I'd forget about using the web browser control at all, and just use
InternetOpen, InternetOpenUrl and InternetReadFile. This will do what
you want by default. If you're embedding a web browser control, chances
are that you're using MFC, in which case you can use CInternetSession
and CStdioFile to make life a bit easier still (though the difference
is pretty minor, since it's pretty simple either way).
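For illustration, the WinInet route could look roughly like the sketch below. It is not code from this thread: the function name fetch_page and the agent string are made up, and error handling is kept minimal. WinInet exists only on Windows, so on other platforms the sketch simply reports failure.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <wininet.h>
#pragma comment(lib, "wininet.lib")  // MSVC only; with other compilers link wininet manually
#endif

// Hypothetical helper: download the raw response body for `url` into `body`.
// Only the document itself is requested, so no images, scripts, or frames
// are fetched. Returns false on any failure.
bool fetch_page(const std::string& url, std::string& body) {
    body.clear();
#ifdef _WIN32
    HINTERNET session = InternetOpenA("SimpleCrawler/0.1",  // made-up agent name
                                      INTERNET_OPEN_TYPE_PRECONFIG,
                                      nullptr, nullptr, 0);
    if (!session) return false;

    HINTERNET request = InternetOpenUrlA(session, url.c_str(), nullptr, 0,
                                         INTERNET_FLAG_RELOAD, 0);
    if (!request) {
        InternetCloseHandle(session);
        return false;
    }

    char buffer[4096];
    DWORD bytesRead = 0;
    bool ok = true;
    // InternetReadFile reports end of data as success with zero bytes read.
    for (;;) {
        if (!InternetReadFile(request, buffer, sizeof(buffer), &bytesRead)) {
            ok = false;
            break;
        }
        if (bytesRead == 0) break;
        body.append(buffer, bytesRead);
    }

    InternetCloseHandle(request);
    InternetCloseHandle(session);
    if (!ok) body.clear();
    return ok;
#else
    (void)url;     // WinInet is unavailable here; this sketch does nothing
    return false;
#endif
}
```

A caller would declare a std::string, pass it along with the URL, and check the return value before using the contents.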
--
Later,
Jerry.
The universe is a figment of its own imagination.
- Posted by Lucian Wischik on February 16th, 2006
"darko.vukoje@gmail.com" <darko.vukoje@gmail.com> wrote:
As someone else said, you should use InternetOpen &c. instead of the
MSHTML control.
Actually, I wrote a similar data-crawling program. But last month I
decided to rewrite my entire app in PHP. That means that my app now
resides on a website: the user visits my site, fills out some fields,
clicks the SUBMIT button; then my PHP app crawls those websites and
retrieves the data and parses it and generates output. This means that
anyone in the world can use my crawler. So maybe you should consider
switching platform... Anyway, here's my app:
http://www.wischik.com/lu/travel/sunmoon.html
--
Lucian
- Posted by darko.vukoje@gmail.com on February 17th, 2006
Using WinInet (or even the WinHTTP library) was my first try, but it is
not good enough. I can retrieve the HTML response as a string, but in most
cases the web site to be crawled returns just a small HTML page that
contains tons of JavaScript, which triggers and furthermore opens new
pages. So the resulting HTML is something entirely different.
As far as I can see, WinInet (InternetOpen etc.) returns only the initial
HTML source. So I would have to parse it, execute the JavaScript manually,
etc. Then there are also potential problems with frames. That is the
reason I chose to play with the IE ActiveX control, which does all this
automatically.
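To make the limitation concrete, here is a rough sketch of how a plain fetcher could at least detect that a page relies on scripts or frames. The names PageHints and inspect_page are invented for this example, and the check is a crude substring match, not a real HTML parser.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case copy so tag matching is case-insensitive (<SCRIPT>, <IFrame>, ...).
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Count occurrences of an opening-tag prefix such as "<script" in
// already lower-cased HTML.
static int count_tag(const std::string& html, const std::string& tag) {
    int count = 0;
    for (std::size_t pos = html.find(tag); pos != std::string::npos;
         pos = html.find(tag, pos + tag.size())) {
        ++count;
    }
    return count;
}

// Hypothetical summary of what a static fetch cannot resolve by itself.
struct PageHints {
    int scripts;  // number of <script ...> opening tags
    int frames;   // <frame>, <frameset>, and <iframe> opening tags combined
};

PageHints inspect_page(const std::string& raw_html) {
    const std::string html = to_lower(raw_html);
    PageHints hints;
    hints.scripts = count_tag(html, "<script");
    // "<frame" also matches "<frameset"; that is acceptable for a rough hint.
    hints.frames = count_tag(html, "<frame") + count_tag(html, "<iframe");
    return hints;
}
```

A non-zero count only says that the initial source is probably not the final page; actually running the scripts or following the frames still needs something like the browser control.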
I also had the idea of writing this in PHP. I guess PHP behaves like
IE rather than like the WinInet library. Any guidelines for PHP?
Thanks,
Darko
- Posted by Lucian Wischik on February 17th, 2006
"darko.vukoje@gmail.com" <darko.vukoje@gmail.com> wrote:
PHP will work like WinInet.
--
Lucian