Coaxing Open Semantic Server Into Operating Condition

An associate earlier this week mentioned having trouble getting Open Semantic Desktop Search to behave. This system offers an intriguing collection of capabilities, including an interface for Elasticsearch. Many hours later, we are picking our way through a minefield. This project is about to shift from Debian 9 to 10, and things are in terrible disarray.

First, some words about frustration free experimentation. If you store your virtual machines on a ZFS file system you can snapshot each time you complete and install step. If something goes wrong later, the snapshot/rollback procedure is essentially instantaneous. This is dramatically more useful than exporting VMs to OVA as a checkpoint. Keep in mind the file system will be dismounted during rollback; it’s best to have some VM specific space set aside.

The project wants Debian proper, so take the time to get Debian 9.9 installed. The desktop OVA wanted a single processor and five gig of ram. Four cores and eight gig seemed to be a sensible amount for a server. Do remember to add a host-only interface under VirtualBox so you can have direct ssh and web access.2

There are some precursors that you will need to put in place before trying to install the monolithic package.

  • apt install celeryd
  • apt install python3-pip
  • apt install python3-celery
  • apt install python-flower

Celery is a task queue manager and Flower provides a graphical interface to it at port 5555. These are missing from the monolithic package. You will also need to add the following to your /etc/security/limits.conf

Now Reboot The System So The Limits Are In Effect

Now you’re ready to install the monolithic package. This is going to produce an error indicating there are packages missing. You correct this problem with this command:

apt install -f

This is going to take a long time to run, maybe ten or fifteen minutes. It will reach 99% pretty quickly – and that’s about the 50% mark in terms of time and tasks. Once this is done, shut the system down, and take a snapshot. Be patient when you reboot it, the services are complex, hefty, and took a couple minutes to all become available on our i7 test system. This is what the system looks like when fully operational.

  • 25672 – RabbitMQ message broker
  • 8080 – spaCy natural language processing
  • 4369 – RabbitMQ beam protocol
  • 22 – ssh, installed for remote access
  • 25 – SMTP for local email, part of Debian
  • 7687 – Neo4j BOLT (server to server) protocol
  • 5672 – RabbitMQ
  • 9998 – Apache Tika document handling service
  • 7983 – Apache Solr
  • 80 – Apache web server
  • 7473 – Neo4j SSL web console
  • 7474 – Neo4j web console
  • 8983 – Apache Solr

Once this is done, you must address Github issue #29 flower doesn’t start automatically. You’ll need this /etc/rc.local file, which their process installs early on, then later removes.

The Celery daemon config also needs attention. The config in /etc/default/celeryd must be edited so that it is ENABLED, and the chroot to /opt/Myproject will cause a failure to start due to missing directory. It seems safe to just turn this off.

Neo4j will be bound to just localhost and will not have a password. Since we’re building a server, rather than a specialty desktop, let’s fix this, too. The file is /etc/neo4j/neo4j.conf, these steps will permit remote access.

  • dbms.security.auth_enabled=true
  • dbms.connectors.default_listen_address=0.0.0.0
  • systemctl restart neo4j
  • visit http://yoursolrIP:7474 and set password
  • visit Config area in OSS web interface, add Neo4j credentials

Having completed these tasks, reboot the system to ensure it starts cleanly. You should find the Open Semantic Search interface here:

http://<IP of VM>/search

This seems like a good stopping point, but we are by no means finished. You can manually add content from the command line with the opensemanticsearch commands:

  • opensemanticsearch-delete
  • opensemanticsearch-enrich
  • opensemanticsearch-filemonitoring
  • opensemanticsearch-index-dir
  • opensemanticsearch-index-file
  • opensemanticsearch-index-rss
  • opensemanticsearch-index-sitemap
  • opensemanticsearch-index-sparql
  • opensemanticsearch-index-web
  • opensemanticsearch-index-web-crawl

There are still many problems to be resolved. Periodic collection from data sources is not working, and web interface submissions are problematic as well. Attempts to parse RSS feeds generate numerous parse errors. Web pages do not import smoothly from our WordPress based site as well as one hosted on the WordPress commercial site.

We will keep coming back to this area, hopefully quickly moving past the administration details, and getting into some actual OSINT collection tradecraft.