Wednesday, February 10, 2010

A Regular Expression for Bible Cross References

I've been working on this bit by bit for weeks. It's a regular expression (see here for a tutorial) that recognizes Bible cross references (e.g., 1 Sam 1:1; Matt 2:1) in a variety of formats. The hardest part was dealing with book names that contain numbers (1 Kings, 2 Corinthians, etc.). Even now I'm not thrilled with the way it handles those, but I'm satisfied.

The "flavor" of regex that this was developed in is VBScript. Other flavors might be able to handle things more elegantly. For example, I wish that I could just use \s for whitespace (including non-breaking spaces), but that shorthand character class doesn't seem to work for this. The only thing I've found that works for the space between a number and a word in a book name is ([^A-Za-z0-9]| | ), where the final character is a non-breaking space.
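To illustrate the whitespace issue, here is a minimal sketch in Python rather than VBScript (the language choice, the shortened book list, and the names `SEP` and `numbered_book` are assumptions for illustration, not part of the expression in this post). It lists the non-breaking space explicitly in the separator class, so the match does not depend on what `\s` covers in a given flavor.

```python
import re

NBSP = "\u00a0"  # non-breaking space

# Hypothetical, simplified pattern: a digit, an optional separator, a book name.
# The separator class names the non-breaking space explicitly.
SEP = r"[\s\u00a0]"
numbered_book = re.compile(r"\b[1-4]" + SEP + r"?(?:Kings|Kgs|Sam(?:uel)?)\b")

# Ordinary space between the number and the book name.
plain = numbered_book.search("See 1 Kings 3:5")

# Non-breaking space between the number and the book name.
nbsp = numbered_book.search("See 1" + NBSP + "Kings 3:5")

# With re.ASCII, Python's \s is limited to [ \t\n\r\f\v] and misses U+00A0,
# which is similar to the behavior described for VBScript above.
ascii_only = re.search(r"1\sKings", "1" + NBSP + "Kings", re.ASCII)
```

In this sketch both `plain` and `nbsp` are match objects, while `ascii_only` is `None`, which is why spelling out the non-breaking space is the safer choice when the flavor's `\s` is ASCII-only.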

Anyway, here it is:

"\b((G(e(nesis)|e?n)|Ex(o(d(us)?)?)?|L(eviticus|e?v)|N(u(mbers)?|u?m)|D(euteronomy|(eu)?t)|J(os(hua)?|o?sh)|J(udg(es)?|gs|d)|Ru(th?)?|Ezra?|Ne(h(emiah)?)?|Est(h(er)?)?|Jo?b|Ps(alm)?s?|Pr(ov(erbs)?)?|Ec(c(les(iastes)?)?)?|S(o(ng( of (Songs|Solomon))?)?|g)|Is(a(iah)?)?|J(e(remiah)?|e?r)|L(a(mentations)?|a?m)|Ez(e(kiel)?|e?k)|D(a(niel)?|a?n)|Ho(s(ea)?)?|J(oe)?l|Am(os)?|Ob(a(d(iah)?)?)?|Jon(ah)?|M(i(c(ah)?)?|c)|N(a(h(um)?)?|h)|Hab(akkuk)?|Z(ep(h(aniah)?)?|p)|H(ag(g(ai)?)?|g)|Z(ec(h(ariah)?)?|c)|M(al(a(chi)?)?|l)|M(at(thew)?|(at)?t)|M(ar)?k|L(uke|[uk])|J(oh)?n|Ac(ts)?|R(o(mans)?|o?m)|G(al(atians)?|l)|Ep(h(esians)?)?|Ph(il(ippians)?|p)|C(o(l(ossians)?)?|l)|Ti(t(us)?)?|Ph(ile(m(on)?)?|l?m)|H(e(b(rews)?)?|b)|Ja((me)?s|m)|J(ude?|d)|R(e(velation)?|e?v)|Bar(uch)?|Add([^A-Za-z0-9]| | )?Dan|Pr(ayer)?[^A-Za-z0-9]?(of )?Azar(iah)?|Bel( and the Dragon)?|S(on)?g( of the |([^A-Za-z0-9]| | )?)Three( Children)?|Sus(anna)?|Add(itions to |([^A-Za-z0-9]| | )?)Esth(er)?|Ep(istle of |([^A-Za-z0-9]| | )?)Jer(emiah)?|J(udith|dt)|Pr(ayer of([^A-Za-z0-9]| | )?)Man(asseh)?|Sir(ach)?|Tob(it)?|Wis(dom of Solomon)?)|(([1-4]|First|Second|Third|Fourth|I{1,3}|IV)([^A-Za-z0-9]| | )?(S(amuel|a?m)|K((in)?gs)?|Ch(r(on(icles)?)?)?|Co(r(inthians)?)?|Th(ess?(alonians)?)?|T(i(mothy)?|i?m)|P(eter|e?t)?|J((oh)?n)?|Esdr(as)?|Macc(abees)?)|(SAM|KGS|CHR|COR|THE|TIM|PET|JOH)[1-3]))(\.?)([^A-Za-z0-9]| | )?([0-9]{1,3})(([:\.,]([0-9]{1,3})(f{1,2}|[a-z])?[-–]([0-9]{1,3})[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?|[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?[-–]([0-9]{1,3})(f{1,2}|[a-z])?|[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?|[-–]([0-9]{1,3}))?)([,;]([^A-Za-z0-9]| | )?(((([1-4]|First|Second|Third|Fourth|I{1,3}|IV)([^A-Za-z0-9]| | )?(S(amuel|a?m)|K((in)?gs)?|Ch(r(on(icles)?)?)?|Co(r(inthians)?)?|Th(ess?(alonians)?)?|T(i(mothy)?|i?m)|P(eter|e?t)?|J((oh)?n)?|Esdr(as)?|Macc(abees)?)|(SAM|KGS|CHR|COR|THE|TIM|PET|JOH)[1-3])(\.?)([^A-Za-z0-9]| | 
)?([0-9]{1,3})([:\.,]([0-9]{1,3})(f{1,2}|[a-z])?[-–]([0-9]{1,3})[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?|[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?[-–]([0-9]{1,3})(f{1,2}|[a-z])?|[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?|[-–]([0-9]{1,3}))?)|([0-9]{1,3})([:\.,]([0-9]{1,3})(f{1,2}|[a-z])?[-–]([0-9]{1,3})[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?|[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?[-–]([0-9]{1,3})(f{1,2}|[a-z])?|[:\.,]([0-9]{1,3})(f{1,2}|[a-z])?|[-–]([0-9]{1,3}))?))*\b"
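For readers wondering how to put a pattern like this to work, the following is a rough Python sketch of the same overall shape: a book-name alternation, an optional period and separator, a chapter, and an optional verse or verse range. It is a heavily cut-down stand-in, not the full expression above; the four-book list and the names `BOOK`, `REF`, and `pattern` are assumptions made for illustration.

```python
import re

# Hypothetical subset of book names, including two numbered books.
BOOK = r"(?:Gen(?:esis)?|Matt(?:hew)?|[1-4]\s?Sam(?:uel)?|[1-4]\s?K(?:in)?gs)"

# Chapter, then an optional verse, then an optional verse range.
REF = r"[0-9]{1,3}(?:[:.][0-9]{1,3}(?:[-–][0-9]{1,3})?)?"

pattern = re.compile(r"\b" + BOOK + r"\.?\s?" + REF + r"\b")

text = "See 1 Sam 1:1; Matt 2:1 and Gen 12:1-3."

# Collect every cross reference found in the text.
refs = [m.group(0) for m in pattern.finditer(text)]
print(refs)
```

Unlike the full expression, this sketch does not chain follow-on references after a comma or semicolon, so each reference comes back as its own match.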

2 comments:

Carlos Alvidrez said...

I am working on something similar... how can I put your regex to work? Could you please provide an example? I am trying to make it work with javascript (jQuery).

Thanks!

Terrance Wood said...

This is very helpful, thank you.