Discussion:
uri regex
cjackson
2005-06-15 02:54:24 UTC
Hi,

I flunked the IQ test so I need some help. I want to match all domains
in the body that are not in .com, .org, .us, .edu, .gov and .mil. But there's
more. I need to match some characters at the end of the URI that can
often be found there such as >.?)*!"';

The rule would match http://www.go.za and http://www.go.za), but not
match http://www.go.com

Here's my regex that does not work...

m{https?://[^\s/:"')!?>*]+(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(?:"|'|:|\?|!|>|\*|\)|$)}


It works for all of the characters except for an ending "." such as
http://www.go.com.
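
For readers following along, here is a minimal sketch that reproduces the behaviour described above. It transcribes the pattern to Python's re module (an assumption made for ease of testing; the post targets Perl-style regexes, and these fixed-width lookbehinds are written the same way in both). The helper name `hits` is hypothetical. One likely reading of the failure is that the trailing "." is consumed by the character class, so the lookbehinds see "com." rather than ".com" and no longer reject the URI.

```python
import re

# The pattern from the post, transcribed for Python's re module.
ORIGINAL = re.compile(
    r"""https?://[^\s/:"')!?>*]+"""
    r"""(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)"""
    r"""(?:"|'|:|\?|!|>|\*|\)|$)"""
)


def hits(text):
    """Return the matched text, or None when the pattern does not match."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in (
        "http://www.go.za",
        "http://www.go.za)",
        "http://www.go.com",
        "http://www.go.com.",  # the problem case: this one still matches
    ):
        print(sample, "->", hits(sample))
```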

I have grappled with this for some time and read the pcrepattern.txt
accompanying Exim source, but damn if I can get it to work. Anybody want
to spit out the answer?

Thanks,
Craig Jackson
Bret Miller
2005-06-15 16:04:28 UTC
I'm no regex expert, but your ending (?:"|'|:|\?|!|>|\*|\)|$) doesn't
list a ., so it wouldn't catch it.

Maybe
(?:"|'|:|\?|!|>|\*|\)|$|\.)
Would be better?

Bret
Craig Jackson
2005-06-15 16:25:24 UTC
Thanks, Bret, but I tried that and got matches up to all the "."s in the
uri.
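
A short sketch of why adding \. to the terminator list over-matches, plus one possible rework. Both are transcribed to Python's re module as an assumption for testing; `WITH_DOT`, `REWORKED`, and `hit` are hypothetical names, and the rework is only one way to approach it, checked against the sample URIs from this thread and nothing more. The idea is to require the host to end in a non-dot character before the lookbehinds run, and to accept a single trailing dot only when it is directly followed by a terminator or the end of the text.

```python
import re

LOOKBEHINDS = (
    r"(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)"
)
HOST_CHAR = r"""[^\s/:"')!?>*]"""

# Variant with \. added to the terminator list. Backtracking can now stop
# in front of any dot inside the host, so .com URIs match up to a dot.
WITH_DOT = re.compile(
    r"https?://" + HOST_CHAR + "+" + LOOKBEHINDS
    + r"""(?:"|'|:|\?|!|>|\*|\)|$|\.)"""
)

# Hypothetical rework: the host must end in a non-dot character, and an
# optional trailing dot is only accepted right before a terminator or the end.
REWORKED = re.compile(
    r"https?://" + HOST_CHAR + r"""*[^\s/:"')!?>*.]"""
    + LOOKBEHINDS + r"""\.?(?:["':?!>*)]|$)"""
)


def hit(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in ("http://www.go.com", "http://www.go.com.",
                   "http://www.go.za.", "http://www.go.za)"):
        print(sample, "->", hit(WITH_DOT, sample), "|", hit(REWORKED, sample))
```

Like the original pattern, this sketch does not match a URI that continues with a path, because "/" is not among the terminators.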
Craig Jackson
2005-06-15 16:34:09 UTC
Post by Stuart Johnston
Assuming that you are creating a SA rule, have you considered using a
uri test? That way you wouldn't have to worry about the extra
characters at the end. SA would take care of it for you.
Yes, it is a uri test which I patterned after WEIRD_PORTS in 20_uri

Mine is like this...

uri SUSPECT_DOM_CJ <expression>
score SUSPECT_DOM_CJ <score>

I didn't know that SA took care of the ending characters in uri tests.
I'll take another look to consider this. Thanks.
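
If the uri test really does receive URIs with the trailing punctuation already stripped, as suggested in this thread, the expression could probably drop the terminator list altogether. The following is a sketch under that assumption only, written against Python's re module for testing; `CLEAN_URI` and `suspect` are hypothetical names, and the pattern has not been tried inside SpamAssassin.

```python
import re

# Hypothetical simplified pattern for input that is already a clean URI:
# no trailing-punctuation handling, only the TLD check on the host part.
CLEAN_URI = re.compile(
    r"^https?://[^/:?#\s]+"
    r"(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)"
    r"(?:[/:?#]|$)"
)


def suspect(uri):
    """True when the host does not end in one of the excluded TLDs."""
    return CLEAN_URI.search(uri) is not None


if __name__ == "__main__":
    for uri in ("http://www.go.za", "http://www.go.za/index.html",
                "http://www.go.com", "http://www.go.com/index.html"):
        print(uri, "->", suspect(uri))
```

Unlike the body-oriented pattern, this version also accepts a host that is followed by a port, path, query, or fragment.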
Bret Miller
2005-06-15 17:28:09 UTC
Post by cjackson
I didn't know that SA took care of the ending characters in
uri tests. I'll take another look to consider this. Thanks.
That I do know a little about. The developers have been working on
handling extra characters on the end of URIs. I think the fix got into
3.0.4 so you should probably upgrade if you haven't.

Bret
Stuart Johnston
2005-06-15 16:07:07 UTC
Assuming that you are creating a SA rule, have you considered using a
uri test? That way you wouldn't have to worry about the extra
characters at the end. SA would take care of it for you.
