Theos PowerBasic Museum 2017

Archive => Archived Posts => Topic started by: Edwin Knoppert on September 14, 2009, 02:41:56 PM

Title: get plain text from webpage
Post by: Edwin Knoppert on September 14, 2009, 02:41:56 PM
I need to save the HTML from my web control as text.
It has to be saved to a predetermined file, with no user interaction.
Do I need the document stuff?
I don't mind if it saves to a temp file first, but streaming would be nice.
Title: Re: get plain text from webpage
Post by: Edwin Knoppert on September 14, 2009, 02:57:50 PM
The simplest approach would be:
Object Get pWB.Document.body.innertext To v
but this may skip the head section?
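
A minimal sketch of that approach, writing the result to a fixed file without any dialog. It assumes pWB is the same late-bound reference to the WebBrowser control used in the line above and that the page has finished loading; the output path is hypothetical and error checking is left out.

```
Local v As Variant
Local hFile As Long

' Visible text of the body only; the head section is not part of it.
' (Asking for pWB.Document.documentElement.outerHTML instead should return
' the full markup including the head, but as HTML rather than plain text.)
Object Get pWB.Document.body.innertext To v

hFile = FreeFile
Open "C:\temp\page.txt" For Output As #hFile   ' hypothetical path
Print #hFile, Variant$(v);
Close #hFile
```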
Title: Re: get plain text from webpage
Post by: José Roca on September 14, 2009, 05:41:45 PM
 
You can call the Document property and then query for the IPersistFile interface.

The ExecWB method, with the OLECMDID_SAVEAS command, can be used, but, for security reasons, it shows a save dialog even if you use the OLECMDEXECOPT_DONTPROMPTUSER flag, so you will need to install a hook.

See: http://www.codeproject.com/KB/shell/iesaveas.aspx

There have also been suggestions to use URLDownloadToFile or INet for the purpose:

http://support.microsoft.com/kb/q244757/
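
A minimal sketch of the URLDownloadToFile route, for illustration only. The declaration is written out by hand here (an assumption; the Windows API include files may already provide one), and the URL and target path are hypothetical. Note that this saves the raw HTML as served, not the plain text, and it fetches the page again instead of using what the web control has already loaded.

```
#Compile Exe
#Dim All

' ANSI entry point exported by urlmon.dll; returns an HRESULT (0 = S_OK).
Declare Function URLDownloadToFile Lib "urlmon.dll" Alias "URLDownloadToFileA" ( _
    ByVal pCaller As Dword, szURL As Asciiz, szFileName As Asciiz, _
    ByVal dwReserved As Dword, ByVal lpfnCB As Dword) As Long

Function PBMain () As Long
    Local hr As Long

    ' No caller object and no status callback, so no user interaction is involved.
    hr = URLDownloadToFile(0, "http://www.example.com/", "C:\temp\page.html", 0, 0)

    If hr = 0 Then
        MsgBox "Page saved."
    Else
        MsgBox "Download failed, HRESULT = &H" & Hex$(hr, 8)
    End If
End Function
```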
Title: Re: get plain text from webpage
Post by: Edwin Knoppert on September 14, 2009, 06:00:41 PM
These are all rather poor solutions haha :)
For the moment I'll stick with my solution.
If that fails, I'll try to parse the stream with a regular expression; I once saw one on the PB forum.
Maybe that helps.
The IPersist tip is a good one (and simple to do).

Thanks,
Title: Re: get plain text from webpage
Post by: José Roca on September 14, 2009, 06:25:58 PM
 
If they are poor, it is because the good ones are forbidden. Otherwise, a malicious web page could use a script to save anything it wants on your computer without you ever noticing.

If the IPersistFile way is still available it is only because scripts can't use it.