egghelp/eggheads community

Help about a link

2006-06-22T07:56:25-04:00

Most of the time www.domain is the same as domain, as www is the default subdomain.

You are correct about 2; I hadn't thought of it that way.

I'll describe exactly what I want to do:

I want to create a package which extracts data from web pages. I'll give it an initial web page, and the script will follow every link and check each page for data. I'll keep a list of all the links the script has found, and visit every one of them.

My problem is that many pages contain links in different formats. A page may contain two links to the same target ("http://domain/hello.htm" and "/hello.htm"), and I want my code to be clever enough to recognise that they are the same.

That's why I want to add links to the list in the format "http://(subdomain.)domain/file.htm": I can then check whether a link already exists in the list and not lose time parsing it again.

So, I need a procedure which returns a link in this format (as a web browser does with links).
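The "don't parse it again" check described above could be sketched roughly as follows. The proc name `add_link` and the variables `seen` and `queue` are made up for illustration, and the sketch assumes every URL passed in has already been converted to the canonical "http://(subdomain.)domain/file.htm" form.

```tcl
# Hypothetical helper: queue a link only if it has not been seen before.
# Assumes $url is already in the canonical absolute format.
proc add_link {seenVar queueVar url} {
    upvar 1 $seenVar seen $queueVar queue
    if {[info exists seen($url)]} {
        return 0    ;# already known, do not queue it again
    }
    set seen($url) 1
    lappend queue $url
    return 1
}

array set seen {}
set queue {}
add_link seen queue "http://domain/hello.htm"
add_link seen queue "http://domain/hello.htm"    ;# duplicate, skipped
add_link seen queue "http://domain/other.htm"
```

An array keyed by URL keeps the membership test cheap even when the list of links grows large, while `queue` preserves the order in which pages should be visited.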

Statistics: Posted by cerberus_gr — Thu Jun 22, 2006 7:56 am

Help about a link

2006-06-22T07:19:04-04:00

Your request is weird.

1) "www.domain" != "domain"
2) links not starting with a protocol are relative, so the absolute version of "www.domain" would be "http://base.href/www.domain"
3) your last example doesn't make any sense to me at all
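To illustrate point 2, here is a rough sketch that classifies a link purely by its syntax. The proc name `link_kind` and the labels it returns are only illustrative; the rule it applies (no scheme means the link is a relative reference) is the standard one.

```tcl
# Hypothetical helper: classify a link by syntax only.
proc link_kind {link} {
    if {[regexp {^[A-Za-z][A-Za-z0-9+.-]*:} $link]} {
        return absolute         ;# starts with a scheme such as "http:"
    } elseif {[string match //* $link]} {
        return network-path     ;# "//host/path", scheme taken from the base
    } elseif {[string match /* $link]} {
        return root-relative    ;# resolved against the base host
    } else {
        return relative         ;# resolved against the base path
    }
}

puts [link_kind "http://www.domain/folder/file.htm"]
puts [link_kind "www.domain/folder/file.htm"]
puts [link_kind "/folder/file.htm"]
```

The second call reports `relative`, which is why treating "www.domain/..." as a host name has to be an explicit heuristic rather than something the link syntax provides.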

Statistics: Posted by user — Thu Jun 22, 2006 7:19 am

Help about a link

2006-06-22T07:46:29-04:00

Let's try again

I have a web page in HTML format with 100 links inside. The links don't all have the same format. For the file file.htm, the possible formats are:

1) http://www.domain/folder/file.htm
2) www.domain/folder/file.htm
3) http://domain/folder/file.htm
4) /folder/file.htm
5) file.htm (relative)

I have written code which extracts all the links from the web page and adds them to a list. So, I have a list like the following:

Code:

(bin) 49 % echo $links
{http://www.domain/folder/file.htm www.domain/folder/file.htm http://domain/folder/file.htm /folder/file.htm file.htm}

Now, I want to create a procedure which takes each of the links and returns it in the format:

http://www.domain/folder/file.htm or
http://domain/folder/file.htm

Example:

Code:

proc format_url { link_found parent_link } {}

(bin) 50 % set a [format_url "http://www.domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 51 % set a [format_url "www.domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 52 % set a [format_url "http://domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 53 % set a [format_url "/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 54 % set a [format_url "file.htm" "http://www.domain/lala"]
http://www.domain/lala/file.htm

I'm not so good with regular expressions, so I need some help with this.
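One possible shape for such a procedure is sketched below. It is not from the thread, and it rests on several assumptions: only http links are handled, a link starting with "www." is treated as a host name (a heuristic, since such a link is strictly relative), "www.domain" and "domain" are taken to be the same site, and the parent URL is treated as a directory, as in the last example above. A browser would instead drop the last path segment of the parent unless it ends with "/".

```tcl
# Hypothetical sketch of format_url under the assumptions stated above.
proc format_url {link parent} {
    # Split the parent URL into host and directory path.
    if {![regexp -nocase {^http://([^/]+)(/.*)?$} $parent -> phost ppath]} {
        error "parent must be an absolute http URL: $parent"
    }
    if {![string match */ $ppath]} { append ppath "/" }

    if {[regexp -nocase {^http://([^/]+)(/.*)?$} $link -> host path]} {
        # Formats 1 and 3: already absolute, host and path are set.
    } elseif {[regexp -nocase {^(www\.[^/]+)(/.*)?$} $link -> host path]} {
        # Format 2: heuristic, a leading "www." is taken as a host name.
    } elseif {[string match /* $link]} {
        # Format 4: relative to the site root.
        set host $phost
        set path $link
    } else {
        # Format 5: relative to the parent, treated as a directory.
        set host $phost
        set path $ppath$link
    }
    if {$path eq ""} { set path "/" }

    # Canonical host: lower case, without the "www." prefix.
    set host [string tolower $host]
    regsub {^www\.} $host {} host
    return "http://$host$path"
}
```

With the parent "http://www.domain/lala", the first four link formats all come back as "http://domain/folder/file.htm", and "file.htm" comes back as "http://domain/lala/file.htm". The sketch always returns the form without "www." so that the result does not depend on how the parent page happened to be written. It does not handle "../" segments, query strings, fragments, or other schemes such as https or mailto.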

Statistics: Posted by cerberus_gr — Wed Jun 21, 2006 10:30 am

Help about a link

2006-06-21T08:55:57-04:00

try to be clearer...

Statistics: Posted by SaPrOuZy — Wed Jun 21, 2006 8:55 am

Help about a link

2006-06-20T21:50:55-04:00

Hello,

I have code which gets all the links from a web page. The formats could be:

1) http://www.domain/folder/file.htm
2) www.domain/folder/file.htm
3) http://domain/folder/file.htm
4) /folder/file.htm
5) file.htm (relative)

I want to create a procedure which takes as parameters the link and the URL of the HTML page it was parsed from, and returns the link in the format:

1) http://domain/folder/file.htm or
2) http://www.domain/folder/file.htm

Example:

Code:

proc format_url { link parent } { ... }

Thanks

Statistics: Posted by cerberus_gr — Tue Jun 20, 2006 9:50 pm