What is Loso?

Loso (囉嗦 in Chinese, meaning "wordy") is a Chinese word segmentation system written in Python.
It was originally developed to improve Chinese segmentation in Plurk search, primarily for Traditional Chinese, but it is applicable to Simplified Chinese as well.

Source

You can download a copy of loso through our trac: http://opensource.plurk.com/tr...

Installation

To install loso, clone the repository and run the following commands:

cd loso
python setup.py develop

You also need a running Redis server, which stores the lexicon database. Then copy the configuration template and modify it:

cp default.yaml myconf.yaml
vim myconf.yaml
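
Before going further, it can help to confirm that the Redis server is actually reachable. The following is a minimal sketch, not part of Loso: the function name redis_is_reachable is hypothetical, and the host and port are assumptions (127.0.0.1 and 6379, the Redis default port), so use whatever values you put in myconf.yaml. It uses only the standard library and assumes Python 3 and a server that does not require authentication.

```python
import socket


def redis_is_reachable(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if a Redis server answers PING at host:port.

    Hypothetical helper for illustration; host and port are assumptions
    and should match the values in your Loso configuration file.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Redis accepts plain "inline" commands terminated by CRLF.
            conn.sendall(b"PING\r\n")
            # A healthy, unauthenticated server replies with "+PONG".
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("redis reachable:", redis_is_reachable())
```

If this prints False, start Redis (or fix the host and port) before feeding any text to Loso.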

To use your configuration, set the environment variable LOSO_CONFIG_FILE to the path of your configuration file. For example:

LOSO_CONFIG_FILE=myconf.yaml python setup.py serve
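
If you launch Loso from another Python script rather than from a shell, the same variable can be set on the child process. This is a small sketch under stated assumptions: the helper name loso_env is hypothetical, and converting the path to an absolute one is a convenience added here, not something Loso is known to require. It assumes Python 3.

```python
import os


def loso_env(config_path, base_env=None):
    """Return a copy of the environment with LOSO_CONFIG_FILE set.

    Hypothetical helper for illustration. The path is made absolute so
    that it still resolves if the child process uses another directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LOSO_CONFIG_FILE"] = os.path.abspath(config_path)
    return env


# Example use (not executed here):
#   import subprocess
#   subprocess.call(["python", "setup.py", "serve"], env=loso_env("myconf.yaml"))
```

Copying the environment instead of modifying os.environ keeps the setting local to the process being started.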

How to use it?

Loso determines segmentation according to the lexicon database, using an algorithm based on the Hidden Markov Model. The service therefore cannot be used until a lexicon database has been built.
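
To see why segmentation depends on a lexicon, here is a simplified sketch. It is not Loso's algorithm: instead of a Hidden Markov Model it uses a plain unigram model with dynamic programming, and the lexicon is a hand-written dictionary with made-up counts rather than a Redis database. The names segment, lexicon and max_len are hypothetical. It assumes Python 3.

```python
import math


def segment(text, lexicon, max_len=4):
    """Split text into the most probable sequence of lexicon terms.

    lexicon maps a term to its count. Unknown single characters are
    allowed with a small smoothing count; unknown longer strings are not.
    """
    total = sum(lexicon.values()) or 1
    # best[i] = (log-probability of the best split of text[:i],
    #            start index of the last term in that split)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            term = text[start:end]
            count = lexicon.get(term, 0)
            if count == 0 and len(term) > 1:
                continue  # multi-character terms must be in the lexicon
            score = best[start][0] + math.log((count or 0.5) / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk backwards through the recorded start indexes.
    terms, end = [], len(text)
    while end > 0:
        start = best[end][1]
        terms.append(text[start:end])
        end = start
    return terms[::-1]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    lexicon = {"太空梭": 5, "太空": 4, "發射": 8, "影片": 6, "的": 50}
    print(" ".join(segment("太空梭發射影片", lexicon)))  # 太空梭 發射 影片
    print(" ".join(segment("太空梭發射影片", {})))  # 太 空 梭 發 射 影 片
```

With an empty lexicon every character becomes its own term, which is the kind of result to expect from any lexicon-driven segmenter that has not been fed text yet.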

To feed a text file to the database, run the following command, replacing the path with the path to your own text file:

python setup.py feed -f /home/victorlin/plurk_src/realtime_search/word_segment/sample_data/sample_tr_ch

To clear the database, run the following command:

python setup.py reset

To try out term splitting interactively, run:

python setup.py interact

Example

Original Text: 留下鉅細靡遺的太空梭發射影片,供世人回味

Segmented by Loso: 留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味

Start as an XML-RPC service

To start the segmentation service as an XML-RPC service, run:

python setup.py serve

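Here's a sample Python 2 program to demonstrate how to use it:

# -*- coding: utf-8 -*-
import xmlrpclib  # Python 2 module; on Python 3 it is xmlrpc.client

# Connect to the service started with "python setup.py serve"
proxy = xmlrpclib.ServerProxy("http://localhost:5566/")

# splitTerms returns the segmented terms of the given text
terms = proxy.splitTerms(u'留下鉅細靡遺的太空梭發射影片,供世人回味')
print ' '.join(terms)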

And the output should be:

留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味
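
For clients running Python 3, the same call can be made with the standard xmlrpc.client module. The sketch below is an illustration under assumptions: the wrapper name split_terms is hypothetical, the URL is the one used in the sample above, and returning None when the service cannot be reached is a design choice made here, not Loso behaviour.

```python
import xmlrpc.client


def split_terms(text, url="http://localhost:5566/"):
    """Ask a running Loso XML-RPC service to segment text.

    Hypothetical wrapper for illustration. Returns the list of terms,
    or None if the service cannot be reached (for example because
    "python setup.py serve" has not been started).
    """
    proxy = xmlrpc.client.ServerProxy(url)
    try:
        return proxy.splitTerms(text)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return None


# Example use (requires a running service):
#   terms = split_terms("留下鉅細靡遺的太空梭發射影片,供世人回味")
#   if terms is not None:
#       print(" ".join(terms))
```

Errors reported by the server itself (xmlrpc.client.Fault) are deliberately not caught, so problems on the Loso side stay visible to the caller.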

License

Loso is copyrighted by Plurk Inc. and is licensed under the BSD license.

Victor Lin wrote it for Plurk Inc.
