Regular Expressions:
Amazing and Dangerous

https://www.sw.it.aoyama.ac.jp/2021/pub/RubyKaigi/

Ruby Kaigi Takeout 2021, Online, September 9, 2021

Martin J. DÜRST (マーティンと呼んでください)

duerst@it.aoyama.ac.jp, Aoyama Gakuin University

Ruby Programming Language

© 2021 Martin J. Dürst, Aoyama Gakuin University

Abstract

Many Ruby programmers use regular expressions frequently. They are an amazingly powerful tool for many different kinds of text processing. However, if not used carefully, they can also be dangerous: They may not exactly match what their writer thinks they match, and they may execute very slowly on certain inputs. This talk will help you understand regular expressions better, so that you can make good use of their amazing power while avoiding their dangerous sides. It will also discuss recent changes to Ruby in the area of regular expressions.

For Best Viewing

These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

 

Topic

Regular Expressions can be Amazingly convenient but also extremely Dangerous

 

Motivation

Feature #17837: Add support for Regexp timeouts (by Sam Saffron)

How to detect/prevent dangerous regular expressions

Dangerous: Very slow (on some inputs)

Current ideas:

 

Background




[1] On the Impact and Defeat of Regular Expression Denial of Service, James C. Davis, PhD thesis, Virginia Politechnic, 2020.

 

Conventions Used

Code is mostly green, monospace
puts 'Hello Ruby!'

Variable parts are orange
puts "some string"

Results are indicated with "⇒"
1 + 12

 

Audience Self-Intro

 

Speaker Self-Intro

 

Contributions to Ruby

Mainly in the following areas:

 

Why Amazing

 

Amazing Examples




[1] Fun with Regular Expressions: An Implementation of the Unicode Bidi Algorithm, Martin J. Dürst, 26th Internationalization & Unicode Conference, San Jose, USA, Sept. 2004

 

Surreal: Smallest Prime Factor

/^(11+?)\1*$/

How does it work?

/^___$/: match from start to end, not just a part

1+: one or more 1s

11+: two or more 1s

11+?: two or more 1s, lazy (prefer as few as possible)

(___): remember what you matched (as \1 and $1)

\1*: what you remembered, 0 or more times

 

Surreal at Work

"1"*9 =~ /^(11+?)\1*$/: Smallest prime factor of 9

First try: (11) 11 11 11 1  ⇒ fail

Second try: (111) 111 111 ⇒ succeed, 3

 

Surreal gets Slow (Exhibit P)

"1"*63 =~ /^(11+?)\1*$/; puts $1.size7, 0.054s

"1"*11939 =~ /^(11+?)\1*$/; puts $1.size11939, 0.078s

"1"*99371 =~ /^(11+?)\1*$/; puts $1.size99371, 2.698s

"1"*193939 =~ /^(11+?)\1*$/; puts $1.size193939, 10.930s

⇒ Regular expressions are dangerous

(prime numbers from Wikipedia)

 

Dangerous Regular Expressions

"Some people, when confronted with a problem, think
'I know, I'll use regular expressions.'
Now they have two problems."

- Jamie Zawiski

background on this quote


[1] The 'mailto' URI Scheme, M. Dürst, L. Masinter, and J. Zawinski, RFC 6068, October 2010.

 

Why Dangerous

 

Unpredictable Time Behavior

 

A Very Flat Regular Expression

irb> 'a'.unicode_normalize :nfc

irb> p UnicodeNormalize::REGEXP_C

/[̀́̓̈́ʹ;·क़-य़ড়ঢ়য়ਲ਼ਸ਼ਖ਼-ਜ਼ਫ਼ଡ଼ଢ଼གྷཌྷདྷབྷཛྷཀྵཱཱིུྲྀླཱྀྀྒྷྜྷྡྷྦྷྫྷྐྵάέήίόύώΆιΈΉΐΊΰΎ΅`ΌΏ´  ΩKÅ〈〉⫝̸豈-嗀塚晴凞-羽蘒諸逸都飯-舘並-龎יִײַשׁ-זּטּ-לּמּנּסּףּפּצּ-פֿ텞-텤톻-퇀-精][̀-͎͐-ͯ҃-֑҇-ׇֽֿׁׂׅׄؐ-ًؚ-ٰٟۖ-ۜ۟-۪ۤۧۨ-ܑۭܰ-݊߫-߽߳ࠖ-࠙ࠛ-ࠣࠥ-ࠧࠩ-࡙࠭-࡛࣓-ࣣ࣡-़्ࣿ॑-়॔া্ৗ਼઼଼੍્৾ା୍ୖୗா்ௗ಼్ౕౖೂ್ೕೖ഻഼ാ്ൗ්ාෟุ-ฺ่-๋ຸ-຺່-໋ཱིེུ༹༘༙༵༷-ཽྀྂ-྄࿆྆྇ီ့္်ႍ፝-᜔᜴្᤹ᢩ፟៝-᩠᤻ᨘᨗ᩵-᩿᩼᪰-᬴᪽ᬵ᭄᭫-᯦᰷᮪᮫᯲᯳᭳᳐-᳔᳒-᳢᳠-᳨᳭᳴᳸᳹᷀-᷹᷻-᷿⃐-⃥⃜⃡-⃰⳯-⵿⳱ⷠ-〪ⷿ-゙゚〯꙯ꙴ-꠆꣄꙽ꚞꚟ꛰꛱꣠-꤫꣱-꦳꥓꧀꤭ꪰꪲ-꫶꯭ﬞꪴꪷꪸꪾ꪿꫁︠-︯ǽˠͶ-ͺ਍ਏਸ-਺ਿ૥૦ത-ധཆ-ཐ၆ၿႹႺᄀ-ᄂᄧᄳᄴᅳᇀᇊስሶዩዪጻጼጾፍፗ፦-፬፰-፴ᑂᑆᑞᒰᒺᒽᓂᓃᖯᖿᗀᘿᚶᚷᜫᠹᠺ᧠ᨴᩇ᪙᰿ᵂᵄᵅᶗ櫰-櫴欰-欶벞텥-텩텭-텲텻-톂톅-톋톪-톭퉂-퉄--------]*|[<->A-PR-Za-pr-z¨À-ÏÑ-ÖØ-Ýà-ïñ-öø-ýÿ-ďĒ-ĥĨ-İĴ-ķĹ-ľŃ-ňŌ-őŔ-ťŨ-ſƠơƯưƷǍ-ǜǞ-ǣǦ-ǰǴǵǸ-țȞȟȦ-ȳʒ΅ΆΈ-ΊΌΎ-ΑΕΗῚ?ΡΥΩ-αεηιορυω-ώϒ-ϔЀЁЃІЇЌ-ЎАГЕ-КОУЧЫЭаге-коучыэѐёѓіїќ-ўѴ-ѷӁӂӐ-ӓӖ-ӟӢ-ӵӸӹآ-اويۀ-ۂےۓەनऩरऱळऴেোৌେୈୋୌஒஔெேொ-ௌెైಿೀೆ-ೈೊೋെേൊ-ൌෙේො-ෞဥဦᬅ-ᬎᬑᬒᬺ-ᭃḀ-ẙẛẠ-ỹἀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ὰὲὴὶὸὺὼᾀ-ᾴᾶ-Ὰᾼ᾿῁-ῄῆ-ῈῊῌ-ῒῖ-Ὶ῝-ῢῤ-ῪῬ῭ῲ-ῴῶ-ῸῺῼ῾←→↔↚↛↮⇍-⇐⇒⇔∃∄∈∉∋∌∣-∦∼≁≃-≅≇-≉≍≠-≢≤≥≭-≽⊀-⊉⊑⊒⊢⊨⊩⊫-⊯⊲-⊵⋠-⋣⋪-⋭うか-ぢつ-どは-ぽゔゝゞウカ-ヂツ-ドハ-ポワ-ヲヴヷ-ヺヽヾ႙-ႜႥႫᄮᄯᄱᄲፇፋፌᒹᒻᒼᒾᖸ-ᖻ]?[̀-͎͐-ͯ҃-֑҇-ׇֽֿׁׂׅׄؐ-ًؚ-ٰٟۖ-ۜ۟-۪ۤۧۨ-ܑۭܰ-݊߫-߽߳ࠖ-࠙ࠛ-ࠣࠥ-ࠧࠩ-࡙࠭-࡛࣓-ࣣ࣡-़्ࣿ॑-়॔া্ৗ਼઼଼੍્৾ା୍ୖୗா்ௗ಼్ౕౖೂ್ೕೖ഻഼ാ്ൗ්ාෟุ-ฺ่-๋ຸ-຺່-໋ཱིེུ༹༘༙༵༷-ཽྀྂ-྄࿆྆྇ီ့္်ႍ፝-᜔᜴្᤹ᢩ፟៝-᩠᤻ᨘᨗ᩵-᩿᩼᪰-᬴᪽ᬵ᭄᭫-᯦᰷᮪᮫᯲᯳᭳᳐-᳔᳒-᳢᳠-᳨᳭᳴᳸᳹᷀-᷹᷻-᷿⃐-⃥⃜⃡-⃰⳯-⵿⳱ⷠ-〪ⷿ-゙゚〯꙯ꙴ-꠆꣄꙽ꚞꚟ꛰꛱꣠-꤫꣱-꦳꥓꧀꤭ꪰꪲ-꫶꯭ﬞꪴꪷꪸꪾ꪿꫁︠-︯ǽˠͶ-ͺ਍ਏਸ-਺ਿ૥૦ത-ധཆ-ເ?၆ၿႹႺᄀ-ᄂᄧᄳᄴᅳᇀᇊስሶዩዪጻጼጾፍፗ፦-፬፰-፴ᑂᑆᑞᒰᒺᒽᓂᓃᖯᖿᗀᘿᚶᚷᜫᠹᠺ᧠ᨴᩇ᪙᰿ᵂᵄᵅᶗ櫰-櫴欰-欶벞텥-텩텭-텲텻-톂톅-톋톪-톭퉂-퉄--------]+|[가개갸걔거게겨계고과괘괴교구궈궤귀규그긔기까깨꺄꺠꺼께껴꼐꼬꽈꽤꾀꾜꾸꿔꿰뀌뀨끄끠끼나내냐냬너네녀녜노놔놰뇌뇨누눠눼뉘뉴느늬니다대댜댸더데뎌뎨도돠돼되됴두둬뒈뒤듀드듸디따때땨떄떠떼뗘뗴또똬뙈뙤뚀뚜뚸뛔뛰뜌뜨띄띠라래랴럐러레려례로롸뢔뢰료루뤄뤠뤼류르릐리마매먀먜머메며몌모뫄뫠뫼묘무뭐뭬뮈뮤므믜미바배뱌뱨버베벼볘보봐봬뵈뵤부붜붸뷔뷰브븨비빠빼뺘뺴뻐뻬뼈뼤뽀뽜뽸뾔뾰뿌뿨쀄쀠쀼쁘쁴삐사새샤섀서세셔셰소솨쇄쇠쇼수숴쉐쉬슈스싀시싸쌔쌰썌써쎄쎠쎼쏘쏴쐐쐬쑈쑤쒀쒜쒸쓔쓰씌씨아애야얘어에여예오와왜외요우워웨위유으의이자재쟈쟤저제져졔조좌좨죄죠주줘줴쥐쥬즈즤지짜째쨔쨰쩌쩨쪄쪠쪼쫘쫴쬐쬬쭈쭤쮀쮜쮸쯔쯰찌차채챠챼처체쳐쳬초촤쵀최쵸추춰췌취츄츠츼치카캐캬컈커케켜켸코콰쾌쾨쿄쿠쿼퀘퀴큐크킈키타태탸턔터테텨톄토톼퇘퇴툐투퉈퉤튀튜트틔티파패퍄퍠퍼페펴폐포퐈퐤푀표푸풔풰퓌퓨프픠피하해햐햬허헤혀혜호화홰회효후훠훼휘휴흐희히][ᆨ-ᇂ]|[ᄀ-ᄒ][ᅡ-ᅵ][ᆨ-ᇂ]?/x

 

ReDoS

  

We Need Real Data

 

Regular Expression Data Origin


[1] Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions,
J.C. Davis, L.G. Michael IV, C.A Coghlan, F. Servant, and D. Lee,
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, August 2019, pp. 443–454

 

Regular Expression Evaluation

 

Hardware Used

 

 

Exhibit E

 

Exhibit E, Structure

 

Exhibit E, Attack

 

Exhibit E, Timing

 

Evaluation Result

 

How to Improve Ruby

My opinion:
Fixing ReDoS problem would make Ruby a "better citizen"
Basics available for evaluation of solutions, but more work needed

 

Exhibit Q

 

Recommendations (Summary)

 

Acknowledgments

 

Q & A

Send questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open/contribute to a bug report or feature request



The latest version of this presentation is available at:

https://www.sw.it.aoyama.ac.jp/2021/pub/RubyKaigi/