egghelp/eggheads community

wondering

2006-05-10T20:26:36-04:00

The more I understand your question, the less I understand what you are actually asking.

So you want to know if someone has used XPath via tDOM in a script and managed to use a more flexible statement than your example in the FAQ?

Statistics: Posted by De Kus — Wed May 10, 2006 8:26 pm

wondering

2006-05-10T12:13:46-04:00

dude, do you actually read my posts???

I am NOT asking you - or anyone else for that matter - to show parsing webpages via regexps or otherwise or/and to elaborate on that; I am simply advocating XPath - which you obviously don't know and haven't used - as a superior tool for the job

I was asking if anyone knows a script using XPath - and I already gathered you don't - so let's put this to rest and move on

Statistics: Posted by demond — Wed May 10, 2006 12:13 pm

wondering

2006-05-10T05:25:48-04:00

so you are looking for something like

Code:

set goal 5
set id 0
set num 0
while {$num < $goal} {
    if {[set id [string first $body "" $id] > [set t [string first $body "bar='id'" $id]] && $t != -1} {
        set id $t
        incr num
    } elseif {$t == -1} {
        return -1
    } else {
        incr id [string length "

Continue with each condition... To find the end of "stuff", keep searching for the index (you can also count the tags to check that the closing one belongs to the wanted open tag) and string range the stuff between the > and <. Though I REALLY doubt this is any easier than regexp (however, I am sure it would be faster).
However, you will now run into trouble with case sensitivity, and you will have trouble treating bar=id, bar='id' and bar="id" as equal. You could of course temporarily convert all " to ' and check for ' (according to the W3C, bar=id is wrong syntax anyway).

Or do you want to split the XML tree into a multidimensional array? But then I wonder how to *match* parameters. I don't know if an endless sublist with tag and data would be possible. Maybe parents would just contain a list of direct children as "data". And even then, searching would be difficult in this non-linear list tree.

Statistics: Posted by De Kus — Wed May 10, 2006 5:25 am



wondering
2006-05-10T00:04:30-04:00

what you seem to be unable to comprehend is that any regexp emulation of XPath's predicates would be ridiculously complicated and hard to read/understand

it's like doing numerical analysis in Roman numerals - if you know what I mean

Statistics: Posted by demond — Wed May 10, 2006 12:04 am




wondering
2006-05-09T12:52:37-04:00

Well, you are asking for XPath without using XPath. XPath has been in development for almost 10 years now (going by the links you gave). Do you believe you can write some fast Tcl script to emulate it? I am offering alternatives for achieving similar results without developing a module worth years of work.

Statistics: Posted by De Kus — Tue May 09, 2006 12:52 pm




wondering
2006-05-09T11:59:59-04:00

either you are a regexp fanatic, or you don't get my point since you don't know XPath

Statistics: Posted by demond — Tue May 09, 2006 11:59 am




wondering
2006-05-09T11:07:11-04:00

demond wrote:
no no, you misunderstood that; perhaps my example was bad
basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position

So you want the regexp to "intelligently" skip the uninteresting first 4 and only begin to really parse the whole regexp from there on? Depending on the complexity of the expressions, you could use my suggestion of string first and either give the index of the 4th match to regexp, or string range the relevant part out by using string first to find a logical end. However, this will never be exactly the same as XPath.
However, you could create it as a module, but then again people who are unable to compile the bot might not be able to use it.

Statistics: Posted by De Kus — Tue May 09, 2006 11:07 am
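A rough sketch of the string first idea described above, in plain Tcl with no extensions. The helper name nth_index, the made-up page body and the <foo bar='moo'> tag are assumptions for illustration only; the idea is to find the index of the 5th occurrence first and then let regexp start parsing from there via -start.

```tcl
# Return the index of the nth occurrence of needle in haystack, or -1.
# (nth_index is a hypothetical helper, not part of Tcl or eggdrop.)
proc nth_index {haystack needle n} {
    set idx -1
    for {set i 0} {$i < $n} {incr i} {
        set idx [string first $needle $haystack [expr {$idx + 1}]]
        if {$idx == -1} { return -1 }
    }
    return $idx
}

# Made-up page body containing five matching tags.
set body ""
foreach word {one two three four five} {
    append body "<foo bar='moo'>$word</foo>\n"
}

# Skip the first four matches, then hand the index to regexp.
set start [nth_index $body "<foo bar='moo'>" 5]
if {$start != -1 && [regexp -start $start {<foo bar='moo'>(.*?)</foo>} $body -> stuff]} {
    puts $stuff
}
```

This only counts literal occurrences of the opening tag, so it shares the quoting and case-sensitivity problems of any string-based approach and is not equivalent to an XPath positional predicate.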




wondering
2006-05-09T07:26:30-04:00

Just an idea, but maybe using tclperl with XML::XPath works?

Statistics: Posted by Kappa007 — Tue May 09, 2006 7:26 am




wondering
2006-05-09T03:00:10-04:00

no no, you misunderstood that; perhaps my example was bad

basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position 

in general, XPath is vastly superior to using regexps for parsing webpages; the problem is, users need to install some XML parser extension for it to work, and most eggdrop users are too lame to do that

Statistics: Posted by demond — Tue May 09, 2006 3:00 am
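To make the positional predicate concrete, here is a minimal sketch using tDOM, assuming the tdom package is installed. The page fragment and its tag names are invented for illustration; for real, non-well-formed web pages one would presumably pass -html to dom parse instead of parsing strict XML as done here.

```tcl
package require tdom

# Made-up, well-formed fragment: three <foo bar="moo"> siblings with
# extra stuff added under the first one and an unrelated node in between.
set page {<page>
  <foo bar="moo">one<extra>added later</extra></foo>
  <foo bar="other">skip me</foo>
  <foo bar="moo">two</foo>
  <foo bar="moo">three</foo>
</page>}

set doc  [dom parse $page]
set root [$doc documentElement]

# Third <foo> whose bar attribute is "moo", counted among its siblings.
set node  [lindex [$root selectNodes {//foo[@bar='moo'][3]}] 0]
set stuff [$node text]
puts $stuff

$doc delete
```

The point of the sketch is that the extra element under the first node and the non-matching sibling do not shift the result, since the predicate counts matching nodes rather than characters.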




wondering
2006-05-06T07:15:34-04:00

Well, doesn't tDOM, which you mentioned in your own link, support that flexibility with something like this?
set node [$root selectNodes {//tr[@id='foo']/td[@id='bar']}]

Though I am still confident about the regexp.
Give us an example where
Code:
regexp {(?i).*?(?:\n\s*|)(.+?)(?:\n\s*|)} $body {} stuff
doesn't find the wanted piece from your example (though I don't want to talk about the speed of such an expression on a 50 KB HTML file; however, I successfully used string first and string range to limit the actual string that regexp parses). The given example should still work even if all the \n and \t are removed, or if spaces are used instead of \t. (?:\n\s*|) should match as much as possible (or nothing) and thereby "eat" the input. Alternatively (and probably faster), use string trim $stuff " \t\n" on stuff.

You could even go so far as to "regsub -all {} {} $stuff stuff" to remove any comments (or run it on body, to remove them before looking for matches).

Statistics: Posted by De Kus — Sat May 06, 2006 7:15 am
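The cleanup steps mentioned above (removing comments, then trimming whitespace) might look roughly like this. The sample text is invented, and the comment pattern {<!--.*?-->} is an assumption, since the pattern in the post did not survive.

```tcl
# Made-up captured text with padding and HTML comments in it.
set stuff "\n\t  <!-- ad banner -->real stuff<!-- end -->\t\n"

# Remove HTML comments (assumed pattern; non-greedy so that each
# comment is matched separately).
regsub -all {<!--.*?-->} $stuff {} stuff

# Trim leading and trailing spaces, tabs and newlines.
set stuff [string trim $stuff " \t\n"]
puts $stuff
```

Running regsub on the whole body before matching, rather than on the captured piece, would work the same way; which is cheaper likely depends on the page size.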




wondering
2006-05-06T06:32:15-04:00

I don't think there is any script (at least a public one) which features flexible HTML parsing, but it would be nice to see it implemented in future scripts. It would also be easier for the scripter, since he wouldn't have to keep following the website's changes.

Statistics: Posted by Sir_Fz — Sat May 06, 2006 6:32 am




wondering
2006-05-06T06:28:03-04:00

we aren't on a PHP forum, so:

nope, I meant eggdrop scripts that fetch info from webpages

and no, you can't compensate for webpage changes with regexps alone; it's nowhere near XPath's ability to do that

Statistics: Posted by demond — Sat May 06, 2006 6:28 am




wondering
2006-05-06T06:04:05-04:00

Are you talking about a web script as in PHP? There are tDOM and XML (or was tDOM the XML module?!) modules for PHP, but I have no idea how to use them. Maybe their manuals will give you some hints on whether they support your intended way of manipulation.

PS: the advantage of a regular expression over a scan(f) expression is the flexibility. Using \t+ or \s+ instead of a specific number of characters should give a certain flexibility.
I mean, something like "\n\t+(.+?)\n\t*" should be flexible in the way you just showed.

Statistics: Posted by De Kus — Sat May 06, 2006 6:04 am
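As a small illustration of that whitespace flexibility, the sketch below runs one pattern against three layouts of the same cell. The <td id="bar"> tag and the sample layouts are assumptions, and the pattern is a greedy-only variant rather than the exact expression from the post, to sidestep Tcl's rules for mixing greedy and non-greedy quantifiers.

```tcl
# Skip any leading whitespace, capture up to the last non-space
# character, then allow any trailing whitespace before the closing tag.
set pattern {<td id="bar">\s*([^<]*[^<\s])\s*</td>}

# Three made-up layouts of the same cell.
set layouts [list \
    "<td id=\"bar\">\n\t\tstuff\n\t</td>" \
    "<td id=\"bar\">    stuff  </td>" \
    "<td id=\"bar\">stuff</td>"]

set results {}
foreach body $layouts {
    if {[regexp $pattern $body -> stuff]} {
        lappend results $stuff
    }
}
puts $results
```

This copes with indentation changes, but it still depends on the tag and attribute text staying exactly the same, which is the limitation the thread is arguing about.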




wondering
2006-05-06T02:01:57-04:00

just out of curiosity:

does anyone know of a webscript which features flexible HTML parsing, utilizing XPath or similar technique? 

e.g. as soon as the following page code:
Code:
...
is changed to:
Code:
...
you are screwed if you use a script that gets to "stuff" using regexp/regsub to locate the tag with id=bar (which is pretty much every script known to me)

naturally, XPath is not a panacea against web page changes, but I'd imagine it could provide a far greater degree of flexibility

Statistics: Posted by demond — Sat May 06, 2006 2:01 am