Posted: Sep 24, 2010
[ # 16 ]
Experienced member
Total posts: 42
Joined: Oct 16, 2009

Erwin Van Lun - Sep 23, 2010: this has grown into a cool thread. I would actually love to be a RegEx expert. Can anyone recommend a manual or tools?
...
Anyone experienced?
http://regex.info//
Then pick the owl of your preferred language. 
Quote:
“What’s New
New in the Third Edition are a new chapter on PHP (and upgraded PHP coverage throughout the core chapters),...”

Posted: Sep 24, 2010
[ # 17 ]
Administrator
Total posts: 841
Joined: Aug 14, 2006

tx Richard, I’ll check this out!

Posted: Jan 5, 2011
[ # 18 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Arthur De Wolf - Sep 23, 2010: I think I figured out a regex that does exactly what I want:
$term = "chatbo*(t|ts)";
$link = "http://www.chatbots.org/chatbot/";
$str = preg_replace('/(?!(?:[^<]+>|[^>]+<\/a>))\b('.$term.')\b(\. |\.$| )/is','<a class="term" href="'.$link.'">$1</a>$3 ', $str, 1);
I’ve added (\. |\.$| ) behind the word boundary \b. So after the term it needs either 1. a period and a space, or 2. a period at the end of the string, or 3. a space.
I’ve also added $3 behind the link so that the period is repeated in case there was one.
This way the term “chatbots” does not get matched in chatbots.org but it does when chatbots. is at the end of a sentence or string.
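The effect of the trailing group can be checked with a minimal JavaScript sketch. It is a simplified stand-in for the PHP above, not a port: it assumes the term is written as chatbots?, leaves out the lookahead that skips text inside existing tags, and does not add the extra space after the link. The function name linkTerm is made up for the example.

```javascript
// Simplified version of the quoted idea: the term, a word boundary, then
// either ". ", a "." at the end of the string, or a single space.
const term = "chatbots?";
const link = "http://www.chatbots.org/chatbot/";
const termPattern = new RegExp("\\b(" + term + ")\\b(\\. |\\.$| )", "i");

// Wrap the first occurrence of the term in a link and put back whatever
// followed it (group 2 here, because this term has no inner group).
function linkTerm(text) {
  return text.replace(termPattern, '<a class="term" href="' + link + '">$1</a>$2');
}

console.log(linkTerm("Visit chatbots.org today")); // unchanged: ".o" is none of the three endings
console.log(linkTerm("I like chatbots."));         // linked: "." at the end of the string
console.log(linkTerm("A chatbot is fun"));         // linked: followed by a space
```

The domain name is left alone because the period after "chatbots" is followed by "org" rather than a space or the end of the string.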
Arthur De Wolf - Sep 18, 2010:
Now I would also like it to find words with “s” behind it, so that ‘chatbots’ would be matched for ‘chatbot’. I tried this by adding (|s) behind the term, but then some things went bad.
I don’t know if you are still working on this but the correct way to add an “s” to chatbot is
/chatbots?/ - the question mark indicates 0 or 1 of the letter before it, in this case, “s”.
The regex that I am using to pick this up is:
/chat+(?:er)? ?bots?/
This picks up:
chatbot
chat bot
chatbots
chat bots
chaterbot
chater bot
chaterbots
chater bots
chatterbot
chatter bot
chatterbots
chatter bots
It is another good example of “category compression” that I talked about in the JAIL thread.
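As a quick check, here is a small JavaScript sketch that runs the pattern against the twelve spellings listed above. The ^ and $ anchors and the variable names are additions for the test, so that each string has to match in full.

```javascript
// Pattern from the post, anchored with ^ and $ for whole-string matching.
const chatbotPattern = /^chat+(?:er)? ?bots?$/;

const variants = [
  "chatbot", "chat bot", "chatbots", "chat bots",
  "chaterbot", "chater bot", "chaterbots", "chater bots",
  "chatterbot", "chatter bot", "chatterbots", "chatter bots",
];

// true only if every listed spelling matches.
const allMatch = variants.every((v) => chatbotPattern.test(v));
console.log(allMatch); // true

// "t+" means one or more "t", so unusual spellings are accepted as well.
console.log(chatbotPattern.test("chatttbot")); // true
console.log(chatbotPattern.test("chatbox"));   // false
```

Without the anchors the pattern also matches inside longer words, so for real text it may be worth wrapping it in \b word boundaries.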
Your pattern “chatbo*(t|ts)” allows 0 or more “o”, so besides “chatbot” it also matches things like “chatbt” and “chatboot”. I don’t think that is what you want.
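The difference between the two patterns can be seen side by side in a short JavaScript sketch. The anchors and the helper name compare are assumptions added for the comparison.

```javascript
const originalPattern = /^chatbo*(t|ts)$/; // "o*" = zero or more "o"
const simplerPattern = /^chatbots?$/;      // "s?" = optional "s"

// Returns [matches original pattern, matches simpler pattern].
function compare(word) {
  return [originalPattern.test(word), simplerPattern.test(word)];
}

for (const word of ["chatbot", "chatbots", "chatbt", "chatboot"]) {
  console.log(word, compare(word));
}
// chatbot  [ true, true ]
// chatbots [ true, true ]
// chatbt   [ true, false ]
// chatboot [ true, false ]
```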

Posted: Jan 5, 2011
[ # 19 ]
Administrator
Total posts: 73
Joined: Aug 7, 2009

Thank you so much for that, Merlin. I ended up doing it differently because adding an “s” doesn’t always work to make something plural, so we made an extra plural field for each word and search for that.
I am actually having a few other regex problems that I could really use your help with.
The first problem is about adding a class to hyperlinks with a certain URL. The purpose is to add for instance class=“dead_link” to URLs in an article that we’ve found to give a 404. What I have now is this:
$url = "http://www.google.com";
$class_name = "dead_link";
// preg_quote() escapes the slashes and dots in the URL so they do not clash with the / delimiter
$pattern = "/<a href=[\'\"](" . preg_quote($url, '/') . ")[\'\"]>(.*)<\/a>/is";
$replace = "<a href=\"$1\" class=\"" . $class_name . "\">$2</a>";
$html = preg_replace($pattern, $replace, $html);
But this only works if the original <a> tag has no special attributes, such as an id, style, target, title or another class.
Also, if the hyperlink already has a class I want to replace it with my new class.
I have been struggling with this for a while but don’t know enough about regex to make it work.

Posted: Jan 5, 2011
[ # 20 ]
Administrator
Total posts: 73
Joined: Aug 7, 2009

Now I ran into the same problem as you did; it removed everything between <a and > in both the $pattern and $replace lines. I found a workaround by replacing < with the &lt; entity, but this seems to be another bug besides the disappearing backslashes.

Posted: Jan 5, 2011
[ # 21 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

If you have control of the input you should be doing a JavaScript
“encodeURIComponent(uri)”
on everything in the code blocks. That might take care of it.
OK, now on to links (be aware that I don’t do PHP, so all of my regex is from the JavaScript perspective).
First, regexes in their most complex form can be tricky. But as you continue to build experience they can be very powerful and beautiful. I have found sifting through things like links to be some of the most challenging work.
If I understand what you are trying to do, you are trying to add a “class=$class_name” to an “A” tag.
Unless there is a reason I don’t know about, you should not be trying to include the linked text and the final closing “A” in your pattern. First, you don’t need it. Second, wildcards in a regex are greedy: the (.*) will match as much as it can. If you have 2 links on the same line it will match from the start of the first to the end of the last.
The simplest form of what you are trying to do is the following:
/(<a CLASS=“DEADLINK”

Posted: Jan 5, 2011
[ # 22 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

OK, now that’s aggravating.
Yours system at my Regex.

Posted: Jan 5, 2011
[ # 23 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Second Try-
/(\<a CLASS=“DEADLINK”

Posted: Jan 5, 2011
[ # 24 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Posted: Jan 5, 2011
[ # 25 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Third Try.
“(<a CLASS=“DEADLINK”

Posted: Jan 5, 2011
[ # 26 ]
Administrator
Total posts: 73
Joined: Aug 7, 2009

Sorry, this is very annoying. I reported this bug to the makers of the forum.
If you replace <a with &lt;a then it doesn’t remove everything after <a
Otherwise you can send me your code via email: http://www.chatbots.org/expert/contact/354/

Posted: Jan 5, 2011
[ # 27 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Try 5
Delete the others and replace [OPENBRACKET] and [CLOSEBRACKET] with the corresponding single characters (< and >).
If you have control of the input you should be doing a JavaScript
“encodeURIComponent(uri)”
on everything in the code blocks. That might take care of it.
OK, now on to links (be aware that I don’t do PHP, so all of my regex is from the JavaScript perspective).
First, regexes in their most complex form can be tricky. But as you continue to build experience they can be very powerful and beautiful. I have found sifting through things like links to be some of the most challenging work.
If I understand what you are trying to do, you are trying to add a “class=$class_name” to an “A” tag.
Unless there is a reason I don’t know about, you should not be trying to include the linked text and the final closing “A” in your pattern. First, you don’t need it. Second, wildcards in a regex are greedy: the (.*) will match as much as it can. If you have 2 links on the same line it will match from the start of the first to the end of the last.
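To make the greedy behaviour concrete, here is a small JavaScript sketch with two links on one line. The sample HTML and the variable names are made up for the illustration.

```javascript
const sample = '<a href="one.html">first</a> and <a href="two.html">second</a>';

// Greedy: .* grabs as much as possible, so the match runs to the last ">".
const greedyMatch = sample.match(/<a .*>/)[0];

// Non-greedy: .*? stops at the first ">" it can.
const lazyMatch = sample.match(/<a .*?>/)[0];

console.log(greedyMatch === sample); // true: both links were swallowed in one match
console.log(lazyMatch);              // <a href="one.html">
```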
The simplest form of what you are trying to do is the following:
/([OPENBRACKET]a )(.+?[CLOSEBRACKET])/i
This capture says: look for an “open bracket-A-SPACE” and save it in group 1. Then look for “anything followed by a close bracket” and save it in group 2. The question mark after the .+ makes the wildcard non-greedy.
You then simply rebuild the tag with your class between the 2 groups:
$1CLASS="DEADLINK" $2
The one thing that we might still have to do is change it a bit if your “A” links already have a class value in them.
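Put together in JavaScript, the two groups and the rebuilt tag could look like the sketch below. The function name addClass, the lowercase class attribute and the g flag (to handle every link in the string) are assumptions added here. Links that already carry a class are not handled.

```javascript
// Insert class="..." right after "<a " in every opening anchor tag.
// Group 1 is "<a ", group 2 is the rest of the tag up to the first ">".
function addClass(html, className) {
  return html.replace(/(<a )(.+?>)/gi, '$1class="' + className + '" $2');
}

console.log(addClass('<a href="x.html" target="_blank">x</a>', "dead_link"));
// <a class="dead_link" href="x.html" target="_blank">x</a>
```

Because only the opening tag is matched, other attributes such as id, style, target or title are carried over unchanged in group 2.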

Posted: Jan 5, 2011
[ # 28 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Of course that should have been “Your system ATE my RegEX”. Could have used that edit button.
Tell me how this works. Chatbot.org link search
I have shared the regex at the on-line regex tester

Posted: Jan 5, 2011
[ # 29 ]
Administrator
Total posts: 73
Joined: Aug 7, 2009

The Edit button should work for 15 minutes after you post the message. Do you not see the Edit button?
But (<a )(.+?>) matches every hyperlink. How do we add the class only to hyperlinks with a certain URL? Like here: http://regexr.com?2srcn
RegExr is a very handy app. Thanks! But it looks like it too has a problem with <, as it changes it to &lt;

Posted: Jan 5, 2011
[ # 30 ]
Senior member
Total posts: 466
Joined: Dec 16, 2010

Did not see the edit button.
Try this.
(<a )(.+?(?:www.yahhoo.com|www.other.com).+?>)
The “(?:” starts a non-capturing group. If you remove the “?:”, the group would also capture the site.
You could add any site you want inside the group separated with the “|” (OR).
The regex site did let me see exactly what you wanted to do though it was also having open bracket issues.
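A JavaScript sketch of the same idea, under a few assumptions that are not in the posts above: the helper names escapeRegExp and markDeadLinks are made up, the site names are escaped so that a dot only matches a literal dot, and [^>]* is used instead of .+? so that a match cannot run past the end of one tag into the next link. It also drops an existing class attribute so the new class replaces it, which is one possible reading of the requirement stated earlier in the thread.

```javascript
// Escape characters that have a special meaning inside a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Add className to every opening <a ...> tag that mentions one of the sites.
function markDeadLinks(html, sites, className) {
  const alternatives = sites.map(escapeRegExp).join("|");
  const pattern = new RegExp("(<a )([^>]*?(?:" + alternatives + ")[^>]*>)", "gi");
  return html.replace(pattern, (match, open, rest) => {
    // Remove an existing class attribute, if any, so the new one replaces it.
    const cleaned = rest.replace(/(^|\s+)class=(["']).*?\2/i, "").trimStart();
    return open + 'class="' + className + '" ' + cleaned;
  });
}

const page =
  '<a href="http://www.yahhoo.com/x" class="old" title="t">Y</a> ' +
  '<a href="http://ok.com">ok</a>';

console.log(markDeadLinks(page, ["www.yahhoo.com", "www.other.com"], "dead_link"));
// <a class="dead_link" href="http://www.yahhoo.com/x" title="t">Y</a> <a href="http://ok.com">ok</a>
```

Regular expressions are only a rough tool for HTML, so this sketch assumes simple, well-formed anchor tags with quoted attribute values.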