On 24/11/2012 03:29, Kevin Kenny wrote:
> Tcl is slower than it needs to be, because of all the conversion
> back and forth between UTF-8 (CESU-8, actually) and UCS-2 that it
> does. That has to be revisited sometime fairly soon anyway, if we're
> to break the BMP barrier. If we can get [string index] (and
> [string range] and friends), and [regexp] working on UTF-8 - which
> ought to be possible with a little bit of auxiliary indexing
> in place of the UCS-2 representation - then a lot of that disappears.
As a matter of fact, I've made some improvements to the regexp package
for my CoATL library (Colibri Advanced Type Library) that could be
reused for native UTF-8 support. The original code uses an abstraction
layer for character types, but addresses characters directly in flat
arrays. This implies fixed-width chars and thus excludes variable-width
schemes such as UTF-8 (and UTF-16 with surrogates is a mess).
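To illustrate why flat addressing rules out UTF-8, here is a minimal
sketch. The function names are hypothetical (they come from neither Tcl
nor CoATL), and the code assumes well-formed standard UTF-8 with no
validation. With fixed-width chars, character i sits in array slot i;
with UTF-8, finding character i means walking every sequence before it.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width text: character i lives in array slot i (O(1) addressing). */
static uint32_t fixed_char_at(const uint32_t *s, size_t i) {
    return s[i];
}

/* Length in bytes of the UTF-8 sequence whose lead byte is b.
 * Assumes well-formed input; no validation is attempted. */
static size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Variable-width text: the byte offset of character i is only known
 * after stepping over the i sequences that precede it (O(i)). */
static size_t utf8_byte_offset(const unsigned char *s, size_t i) {
    size_t off = 0;
    while (i-- > 0) {
        off += utf8_seq_len(s[off]);
    }
    return off;
}
```

This is why code written as s[i] or p += n cannot simply be pointed at a
UTF-8 buffer: the index and the address no longer move in lockstep.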
As you may know, Colibri strings support the full Unicode range and
various character representations:
- Fixed-width UCS-1 (actually the BMP subset for characters 0-255),
UCS-2 (AKA BMP) and UCS-4 (full 32-bit Unicode range, though only the
lower 21 bits are significant)
- Variable-width UTF-8 and UTF-16 (with surrogate pairs)
Moreover Colibri ropes can be composed of string chunks with distinct
representations.
To abstract things away, Colibri provides an iterator API (plus a chunk
enumerator, but this is of no use here). Colibri rope iterators are
relatively heavy stack-allocated structures (10 machine words) because
they store traversal info for better performance. However, at a higher
level an iterator is simply a glorified index variable.
Back to the regexp package. To allow regexps over Colibri ropes, I've
replaced all direct char array addressing with accessor macros. The CoATL
regexp package provides macro definitions for both Colibri rope
iterators and char arrays. Earlier development versions of CoATL had a
compile-time option to switch between the two as a proof of concept (the
char array version converted ropes to flat arrays, like Tcl does with
UTF-8 and byte arrays). The release only builds the iterator version,
but the regexp package itself still works with both. Here are the
current definitions (in file re/regcustom.h):
#ifdef REGEXP_USE_ITERATORS
typedef Col_RopeIterator rchr; /* Reference to chr. */
#define RCHR_INDEX(p,start) Col_RopeIterIndex(p)
#define RCHR_FWD(p,o) Col_RopeIterForward((p),(o))
#define RCHR_BWD(p,o) Col_RopeIterBackward((p),(o))
#define RCHR_CHR(p) Col_RopeIterAt((p))
#define RCHR_LT(p1,p2) (RCHR_ISNULL(p2)?0:RCHR_ISNULL(p1)?1: \
(Col_RopeIterCompare((p1),(p2))<0))
#define RCHR_GT(p1,p2) RCHR_LT((p2),(p1))
#define RCHR_EQ(p1,p2) \
(RCHR_ISNULL(p1)?RCHR_ISNULL(p2):RCHR_ISNULL(p2)?0:\
(Col_RopeIterCompare((p1),(p2))==0))
#define RCHR_INIT(begin,end,data,len) \
(Col_RopeIterString((begin),COL_UCS4,(data),(len)),\
Col_RopeIterSet((end),(begin)), Col_RopeIterForward((end),\
(len)))
#define RCHR_SET(p1,p2) Col_RopeIterSet((p1), (p2))
#define RCHR_ISNULL(p) Col_RopeIterNull(p)
#define RCHR_SETNULL(p) Col_RopeIterSetNull(p)
#define RCHR_NULL COL_ROPEITER_NULL
#else
typedef const chr *rchr; /* Reference to chr. */
#define RCHR_INDEX(p,start) ((p)-(start))
#define RCHR_FWD(p,o) ((p) += (o))
#define RCHR_BWD(p,o) ((p) -= (o))
#define RCHR_CHR(p) (*(p))
#define RCHR_LT(p1,p2) ((p1)<(p2))
#define RCHR_GT(p1,p2) ((p1)>(p2))
#define RCHR_EQ(p1,p2) ((p1)==(p2))
#define RCHR_INIT(begin,end,data,len) ((begin) = (data),\
(end) = (begin)+(len))
#define RCHR_SET(p1,p2) ((p1)=(p2))
#define RCHR_ISNULL(p) ((p)==NULL)
#define RCHR_SETNULL(p) ((p)=NULL)
#define RCHR_NULL NULL
#endif
Backporting the modified package to Tcl would be easy. Once done, we
just have to define proper accessor macros to get direct UTF-8 support.
We only need a pretty basic iterator-like structure storing a character
address along with its numeric index; the rest is simple UTF-8 arithmetic.
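As a rough idea of what such accessors could look like, here is a
minimal sketch. It is not taken from CoATL or Tcl: the Utf8Iter type and
the utf8_* helpers are hypothetical names, only a subset of the RCHR_*
macros is shown, and the code assumes well-formed standard UTF-8 (Tcl's
internal encoding differs in some details and would need extra care).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator: a byte address plus the character index it denotes. */
typedef struct {
    const unsigned char *ptr;  /* first byte of the current character */
    size_t index;              /* character (not byte) index of ptr */
} Utf8Iter;

/* Advance by n characters, using the lead byte to find each sequence length. */
static void utf8_fwd(Utf8Iter *it, size_t n) {
    while (n-- > 0) {
        unsigned char b = *it->ptr;
        it->ptr += (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        it->index++;
    }
}

/* Step back by n characters, skipping continuation bytes (10xxxxxx). */
static void utf8_bwd(Utf8Iter *it, size_t n) {
    while (n-- > 0) {
        do {
            it->ptr--;
        } while ((*it->ptr & 0xC0) == 0x80);
        it->index--;
    }
}

/* Decode the code point at the current position. */
static uint32_t utf8_chr(const Utf8Iter *it) {
    const unsigned char *p = it->ptr;
    if (p[0] < 0x80) return p[0];
    if (p[0] < 0xE0)
        return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    if (p[0] < 0xF0)
        return ((uint32_t)(p[0] & 0x0F) << 12)
             | ((uint32_t)(p[1] & 0x3F) << 6) | (uint32_t)(p[2] & 0x3F);
    return ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
         | ((uint32_t)(p[2] & 0x3F) << 6) | (uint32_t)(p[3] & 0x3F);
}

/* Possible accessor definitions on top of the helpers (assumed, untested
 * against the real regexp package). */
typedef Utf8Iter rchr;
#define RCHR_INDEX(p,start) ((p).index)
#define RCHR_FWD(p,o)       utf8_fwd(&(p),(o))
#define RCHR_BWD(p,o)       utf8_bwd(&(p),(o))
#define RCHR_CHR(p)         utf8_chr(&(p))
#define RCHR_LT(p1,p2)      ((p1).ptr < (p2).ptr)
#define RCHR_EQ(p1,p2)      ((p1).ptr == (p2).ptr)
```

Keeping the index next to the address makes RCHR_INDEX a plain field
read, at the cost of moves being linear in the number of characters
skipped.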
The CoATL source distribution is available here:
http://sourceforge.net/projects/tcl9/files/colibri/colibri0.14/colibri0.14.src.zip/download
I don't know if it will improve the raw performance of regular
expressions, but it will certainly improve memory usage and prevent
shimmering by removing the need for UTF-8 -> byte array conversion.
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core