[CWB] Large Corpus Port (64bit)

Tino Didriksen tino at didriksen.cc
Tue Jun 18 11:18:19 CEST 2019


G'day...

Has anyone else done serious work on porting the tools to support large
corpora?

I have forked the core tools to https://github.com/TinoDidriksen/cwb and am
working on changing the internals to make use of 64-bit types and algorithms.

Step one was to make it build natively on Windows in Visual Studio with
vcpkg, which the https://github.com/TinoDidriksen/cwb/tree/cmake branch
does. This branch is compatible with existing code and corpora - it's
really just build fixes plus a switch to CMake for cross-platform builds.

Step two will be a new branch with a thorough overhaul of the codebase,
using C99 and possibly C++ if I get tired of C's limitations. This will be
incompatible with existing code and corpora, since everything from hash
functions to random number generator algorithms needs to be bumped to
64-bit, and most stored values change from int to uint64_t.
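
To illustrate the kind of change involved, here is a minimal C99 sketch. It
is not taken from the CWB sources: the names cpos64_t, fits_in_int32 and
fnv1a_64 are hypothetical, and FNV-1a is only a stand-in for whichever
64-bit hash ends up being used. It shows where signed 32-bit positions run
out and what a hash with a 64-bit result looks like.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name: a corpus position wide enough for more than
 * 2^31 tokens. Not an identifier from the CWB sources. */
typedef uint64_t cpos64_t;

/* Returns 1 if a corpus of n_tokens can still be indexed with signed
 * 32-bit positions (0 .. INT32_MAX), 0 otherwise. */
static int fits_in_int32(cpos64_t n_tokens)
{
    if (n_tokens == 0)
        return 1;
    return (n_tokens - 1) <= (cpos64_t)INT32_MAX;
}

/* FNV-1a with a 64-bit state, used here only as an example of a hash
 * function whose result is 64 bits wide. */
static uint64_t fnv1a_64(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = UINT64_C(0xcbf29ce484222325); /* FNV offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= (uint64_t)p[i];
        h *= UINT64_C(0x100000001b3);          /* FNV prime */
    }
    return h;
}
```

With these definitions, fits_in_int32(2147483648) is true and
fits_in_int32(2147483649) is false, which is the limit a 64-bit position
type removes. Any on-disk data that stores positions or hash values changes
width along with the types, hence the incompatibility with existing corpora.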

But if anyone else has already done this, I'd like to know. I am aware of
the 4.0 effort and the papers from 2015 promising new features, but they're
still too far off for our use.

-- Tino Didriksen