Re: [TCLCORE] Making Tcl 9 go (way) faster

From: Frédéric Bonnet <fred...@free.fr>
Sun, 25 Nov 2012 00:31:30 +0100
On 24/11/2012 03:29, Kevin Kenny wrote:
> Tcl is slower than it needs to be, because of all the conversion
> back and forth between UTF-8 (CESU-8, actually) and UCS-2 that it
> does.  That has to be revisited sometime fairly soon anyway, if we're
> to break the BMP barrier.  If we can get [string index] (and
> [string range] and friends), and [regexp] working on UTF-8 - which
> ought to be possible with a little bit of auxiliary indexing
> in place of the UCS-2 representation - then a lot of that disappears.

As a matter of fact, I've made some improvements to the regexp package 
for my CoATL library (Colibri Advanced Type Library) that could be 
reused for native UTF-8 support. The original code uses an abstraction 
layer for character types, but addresses characters directly in flat 
arrays. This implies fixed-width chars and thus excludes variable-width 
schemes such as UTF-8 (and UTF-16 with surrogates is a mess).

As you may know, Colibri strings support the full Unicode range and 
various character representations:

- Fixed-width UCS-1 (actually the BMP restricted to characters 0-255), 
UCS-2 (AKA BMP) and UCS-4 (full 32-bit Unicode range, though only the 
lower 21 bits are significant)

- Variable-width UTF-8 and UTF-16 (with surrogate pairs)

Moreover Colibri ropes can be composed of string chunks with distinct 
representations.
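
To make the idea of mixed-representation chunks concrete, here is a minimal sketch. It is not Colibri's actual data structure: the `Chunk` type, the `FMT_*` tags and both functions are hypothetical names, a real rope is a tree rather than a flat array of chunks, and only the fixed-width formats are covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format tags, not Colibri's actual constants. */
typedef enum { FMT_UCS1 = 1, FMT_UCS2 = 2, FMT_UCS4 = 4 } ChunkFormat;

/* Hypothetical string chunk: one fixed-width representation per chunk. */
typedef struct Chunk {
    ChunkFormat format;   /* bytes per character */
    size_t length;        /* number of characters in the chunk */
    const void *data;     /* character array in the given format */
} Chunk;

/* Character at index i within a single chunk. */
static uint32_t chunk_char_at(const Chunk *c, size_t i)
{
    assert(c != NULL && i < c->length);
    switch (c->format) {
    case FMT_UCS1: return ((const uint8_t *)c->data)[i];
    case FMT_UCS2: return ((const uint16_t *)c->data)[i];
    default:       return ((const uint32_t *)c->data)[i];
    }
}

/* Character at index i of a "rope" made of n consecutive chunks. */
static uint32_t rope_char_at(const Chunk *chunks, size_t n, size_t i)
{
    size_t k;

    for (k = 0; k < n; k++) {
        if (i < chunks[k].length) return chunk_char_at(&chunks[k], i);
        i -= chunks[k].length;   /* skip this chunk */
    }
    assert(!"index out of range");
    return 0;
}
```

The caller only ever sees code points and character indices; which representation a given chunk uses stays an internal detail.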

To abstract things away, Colibri provides an iterator API (plus a chunk 
enumerator but this is of no use here). Colibri rope iterators are 
relatively heavy stack-allocated structures (10 machine words) because 
they store traversal info for better performance. However, at a higher 
level an iterator is simply a glorified index variable.

Back to the regexp package. To allow regexps over Colibri ropes I've 
replaced all direct char array addressing with accessor macros. The CoATL 
regexp package provides macro definitions for both Colibri rope 
iterators and char arrays. Earlier development versions of CoATL had a 
compile-time option to switch between the two as a proof of concept (the 
char array version converted ropes to flat arrays, like Tcl does with 
UTF-8 and byte arrays). The release only builds the iterator version, 
but the regexp package itself still works with both. Here are the 
current definitions (in file re/regcustom.h):


     #ifdef REGEXP_USE_ITERATORS
     typedef Col_RopeIterator rchr;    /* Reference to chr. */
     #define RCHR_INDEX(p,start)    Col_RopeIterIndex(p)
     #define RCHR_FWD(p,o)   Col_RopeIterForward((p),(o))
     #define RCHR_BWD(p,o)   Col_RopeIterBackward((p),(o))
     #define RCHR_CHR(p)     Col_RopeIterAt((p))
     #define RCHR_LT(p1,p2)  (RCHR_ISNULL(p2)?0:RCHR_ISNULL(p1)?1: \
         (Col_RopeIterCompare((p1),(p2))<0))
     #define RCHR_GT(p1,p2)  RCHR_LT((p2),(p1))
     #define RCHR_EQ(p1,p2)  \
         (RCHR_ISNULL(p1)?RCHR_ISNULL(p2):RCHR_ISNULL(p2)?0:\
         (Col_RopeIterCompare((p1),(p2))==0))
     #define RCHR_INIT(begin,end,data,len) \
         (Col_RopeIterString((begin),COL_UCS4,(data),(len)),\
         Col_RopeIterSet((end),(begin)), Col_RopeIterForward((end),\
         (len)))
     #define RCHR_SET(p1,p2) Col_RopeIterSet((p1), (p2))
     #define RCHR_ISNULL(p)  Col_RopeIterNull(p)
     #define RCHR_SETNULL(p) Col_RopeIterSetNull(p)
     #define RCHR_NULL       COL_ROPEITER_NULL
     #else
     typedef const chr *rchr;    /* Reference to chr. */
     #define RCHR_INDEX(p,start)    ((p)-(start))
     #define RCHR_FWD(p,o)   ((p) += (o))
     #define RCHR_BWD(p,o)   ((p) -= (o))
     #define RCHR_CHR(p)     (*(p))
     #define RCHR_LT(p1,p2)  ((p1)<(p2))
     #define RCHR_GT(p1,p2)  ((p1)>(p2))
     #define RCHR_EQ(p1,p2)  ((p1)==(p2))
     #define RCHR_INIT(begin,end,data,len) ((begin) = (data),\
         (end) = (begin)+(len))
     #define RCHR_SET(p1,p2) ((p1)=(p2))
     #define RCHR_ISNULL(p)  ((p)==NULL)
     #define RCHR_SETNULL(p) ((p)=NULL)
     #define RCHR_NULL        NULL
     #endif
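
To illustrate how code written against these macros stays independent of the representation, here is a small sketch using the char array flavor. The `chr` typedef (assumed here to be a 32-bit code point type) and the `count_chr` helper are hypothetical and not part of CoATL; only the macro bodies are copied from the definitions above.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int chr;          /* assumption: wide enough for a code point */
typedef const chr *rchr;           /* char array flavor: reference to chr */

#define RCHR_FWD(p,o)   ((p) += (o))
#define RCHR_CHR(p)     (*(p))
#define RCHR_LT(p1,p2)  ((p1)<(p2))
#define RCHR_INIT(begin,end,data,len) ((begin) = (data),\
    (end) = (begin)+(len))

/* Hypothetical helper: count occurrences of c in data[0..len-1].
 * The text is only touched through RCHR_* macros, so the loop itself
 * does not assume that characters live in a flat array. */
static size_t count_chr(const chr *data, size_t len, chr c)
{
    rchr p, end;
    size_t n = 0;

    assert(data != NULL);
    RCHR_INIT(p, end, data, len);
    while (RCHR_LT(p, end)) {
        if (RCHR_CHR(p) == c) n++;
        RCHR_FWD(p, 1);
    }
    return n;
}
```

The intent is that a loop like this would compile unchanged against the iterator flavor, since it never dereferences or increments `p` directly.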


Backporting the modified package to Tcl would be easy. Once done, we 
just have to define proper accessor macros to get direct UTF-8 support. 
We only need a pretty basic iterator-like structure storing a character 
address along with its numeric index; the rest is simple UTF-8 arithmetic.
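
As a rough sketch of what such a structure and its accessor macros might look like (this is not existing Tcl or CoATL code): the `Utf8Iter` type and the `utf8_*` helpers are hypothetical names, and the code assumes well-formed standard UTF-8, so it handles neither malformed input nor the CESU-8 surrogate encoding mentioned in the quoted message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator: a character address plus its numeric index. */
typedef struct Utf8Iter {
    const unsigned char *ptr;   /* lead byte of the current character */
    size_t index;               /* character (not byte) index */
} Utf8Iter;

/* Continuation bytes have the form 10xxxxxx. */
static int utf8_is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Sequence length in bytes, derived from the lead byte. */
static size_t utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Move forward by n characters. */
static void utf8_fwd(Utf8Iter *it, size_t n)
{
    assert(it != NULL && it->ptr != NULL);
    while (n-- > 0) {
        it->ptr += utf8_len(*it->ptr);
        it->index++;
    }
}

/* Move backward by n characters, skipping continuation bytes. */
static void utf8_bwd(Utf8Iter *it, size_t n)
{
    assert(it != NULL && it->ptr != NULL && n <= it->index);
    while (n-- > 0) {
        do { it->ptr--; } while (utf8_is_cont(*it->ptr));
        it->index--;
    }
}

/* Decode the code point at the current position. */
static unsigned long utf8_chr(const Utf8Iter *it)
{
    const unsigned char *p = it->ptr;
    size_t len = utf8_len(*p), i;
    unsigned long c;

    if (len == 1) return *p;
    c = *p & (0xFF >> (len + 1));            /* payload bits of the lead byte */
    for (i = 1; i < len; i++) c = (c << 6) | (p[i] & 0x3F);
    return c;
}

/* Accessor macros in the style of re/regcustom.h (subset only). */
typedef Utf8Iter rchr;
#define RCHR_INDEX(p,start) ((p).index)
#define RCHR_FWD(p,o)   utf8_fwd(&(p),(o))
#define RCHR_BWD(p,o)   utf8_bwd(&(p),(o))
#define RCHR_CHR(p)     utf8_chr(&(p))
#define RCHR_EQ(p1,p2)  ((p1).ptr == (p2).ptr)
```

Storing the index next to the address makes `RCHR_INDEX` a constant-time lookup, while forward and backward moves cost time proportional to the number of characters skipped.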

The CoATL source distribution is available here:

 
http://sourceforge.net/projects/tcl9/files/colibri/colibri0.14/colibri0.14.src.zip/download

I don't know if it will improve the raw performance of regular 
expressions, but it will certainly improve memory usage and prevent 
shimmering by removing the need for UTF-8 -> byte array conversion.

