Migrating To Elasticsearch 6.8.1

Consulting our combined collection this morning, we found the oldest to be this 2008 anniversary tweet from @vipe248.

This tweet came into the collection thanks to a study of #Qanon we did earlier this year. The actual inception of our current cluster hardware appears to have been on January 29th of 2019. The very earliest it could have been created was December 19th of 2018 – the release date for Elasticsearch 6.5.4.

The system is resilient to the loss of any one node, a property that received an unintended test last night when one of the servers in the cluster was inadvertently shut down. Recovery takes a couple of minutes given the number of services and virtual machines involved, but processing continued without interruption.

Today we began the process of upgrading to the June 20th, 2019 release of Elasticsearch 6.8.1. There are a number of reasons for doing this:

  • Index Lifecycle Management (6.6)
  • Cross Cluster Replication (6.6)
  • Elasticsearch 7 Upgrade Assistant (6.6)
  • Rolling Upgrade To Elasticsearch 7 (6.7)
  • Better Index Type Labeling (6.7)
  • Security Features Bundled for Community Edition (6.8)
  • Conversion From Ubuntu to Debian Linux

We are not jumping directly to Elasticsearch 7.x due to some fairly esoteric issues involving field formats, and concerns regarding some of the Python libraries that we use. Ubuntu has been fine for both desktop and server use, but we recently began using the very fussy Open Semantic Search, and it behaves well with Debian. Best of all, the OVA of a working virtual machine with the Netwar System code installed and running is just 1.9 gig.

Alongside the production ready Elasticsearch based system we are including Neo4j with some example data and working code. The example data is a small network taken from Canadian Parliament members and the code produces flat files suitable for import as well as native GML file output for Gephi. We ought to be storing relationships to Neo4j as we see them in streams, but this is still new enough that we are not confident shipping it.
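For context, GML is a simple text format that Gephi reads directly. A minimal file of the kind our export code produces looks something like this (the member names here are invented for illustration):

  graph [
    directed 0
    node [ id 1 label "MP Example One" ]
    node [ id 2 label "MP Example Two" ]
    edge [ source 1 target 2 ]
  ]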

Some questions that have cropped up and our best answers as of today:

Is Open Semantic Search going to be part of Netwar System?

We are certainly going to be doing a lot of work with OSS and this seems like a likely outcome, given that it has both Elasticsearch and Neo4j connectors. The driver here is the desire to maintain visibility into Mastodon instances as communities shift off Twitter – we can use OSS to capture RSS feeds.

Will Netwar System still support Search Guard?

Yes, because their academic licensing permits things that the community edition of Elasticsearch does not. We are not going to do Search Guard integration into the OVA, however. There are several reasons for that:

  • Doesn’t make sense on a single virtual machine.
  • Duplicated configs would mean a bad actor has certificate based access to the system.
  • Eager, unprepared system operators could expose much more than just their collection system if they try to use it online.
  • Netdata monitoring provides new users insight into Elasticsearch behavior, and we have not managed to make that work with SSL secured systems.
  • We are seeking a sensible free/paid break point for this system. It’s not clear where a community system would end and an enterprise system would begin.

Is there a proper FOSS license?

Not yet, but we are going to follow the customs in this area. A university professor should expect to be able to run a secure system for a team oriented class project without incurring any expense. Commercial users who want phone support will incur an annual cost. There will be value add components that will only be available to paying customers. Right now 100% of revenue is based on software as a service and we expect that to continue to be the norm.

So the BSD license seems likely.

When will the OVA be available?

It’s online this morning for internal users. If it doesn’t explode during testing today, a version with our credentials removed should be available Tuesday or Wednesday. Most of the work required to support http/https transparently was finished during the first quarter. Once it’s up we’ll post a link to it here, and there will be announcements on Twitter and LinkedIn.

Coaxing Open Semantic Server Into Operating Condition

An associate earlier this week mentioned having trouble getting Open Semantic Desktop Search to behave. This system offers an intriguing collection of capabilities, including an interface for Elasticsearch. Many hours later, we are picking our way through a minefield. This project is about to shift from Debian 9 to 10, and things are in terrible disarray.

First, some words about frustration free experimentation. If you store your virtual machines on a ZFS file system, you can snapshot each time you complete an install step. If something goes wrong later, the snapshot/rollback procedure is essentially instantaneous. This is dramatically more useful than exporting VMs to OVA as a checkpoint. Keep in mind the file system will be dismounted during rollback; it’s best to have some VM specific space set aside.
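As a sketch, assuming your VMs live on a dataset named tank/vms (a placeholder for your own pool and dataset), the cycle looks like this:

  zfs snapshot tank/vms@after-celery-install
  # ...an install step goes sideways...
  zfs rollback tank/vms@after-celery-install

Each checkpoint costs one command and a few seconds, which is what makes stepwise experimentation painless.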

The project wants Debian proper, so take the time to get Debian 9.9 installed. The desktop OVA wanted a single processor and five gig of RAM; four cores and eight gig seemed a sensible amount for a server. Do remember to add a host-only interface under VirtualBox so you can have direct ssh and web access.
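If you script your VM setup, the equivalent VBoxManage commands look roughly like this, where "oss-server" and vboxnet0 are placeholders for your VM name and host-only network:

  VBoxManage modifyvm "oss-server" --cpus 4 --memory 8192
  VBoxManage modifyvm "oss-server" --nic2 hostonly --hostonlyadapter2 vboxnet0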

There are some precursors that you will need to put in place before trying to install the monolithic package.

  • apt install celeryd
  • apt install python3-pip
  • apt install python3-celery
  • apt install python-flower

Celery is a task queue manager and Flower provides a graphical interface to it at port 5555. These are missing from the monolithic package. You will also need to add the following to your /etc/security/limits.conf:
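Something along these lines will do – raising the open file and process limits well above the Debian defaults, per the usual Solr guidance; treat the exact numbers as a starting point rather than gospel:

  * soft nofile 65535
  * hard nofile 65535
  * soft nproc 65535
  * hard nproc 65535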

Now Reboot The System So The Limits Are In Effect

Now you’re ready to install the monolithic package. This is going to produce an error indicating there are packages missing. You correct this problem with this command:

apt install -f
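Putting the two steps together, assuming you fetched the monolithic .deb from the project site (the file name below is a placeholder for whatever release you downloaded):

  dpkg -i open-semantic-search_*.deb   # reports missing dependencies
  apt install -f                       # pulls them in and finishes configuration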

This is going to take a long time to run, maybe ten or fifteen minutes. It will reach 99% pretty quickly – and that’s about the 50% mark in terms of time and tasks. Once this is done, shut the system down and take a snapshot. Be patient when you reboot: the services are complex and hefty, and they took a couple of minutes to all become available on our i7 test system. This is what the system looks like when fully operational.

  • 22 – ssh, installed for remote access
  • 25 – SMTP for local email, part of Debian
  • 80 – Apache web server
  • 4369 – RabbitMQ EPMD (Erlang port mapper)
  • 5672 – RabbitMQ AMQP message broker
  • 7473 – Neo4j web console (SSL)
  • 7474 – Neo4j web console
  • 7687 – Neo4j Bolt protocol
  • 7983 – Apache Solr (stop port)
  • 8080 – spaCy natural language processing
  • 8983 – Apache Solr
  • 9998 – Apache Tika document handling service
  • 25672 – RabbitMQ inter-node and CLI communication
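You can verify the same picture on your own install with ss, which lists listening TCP sockets along with the processes that own them:

  ss -tlnp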

Once this is done, you must address GitHub issue #29: flower doesn’t start automatically. You’ll need to restore the /etc/rc.local file, which their process installs early on, then later removes.
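A minimal replacement that brings Flower up at boot looks like this; the path assumes the Debian python-flower package put the binary in /usr/bin, so check yours with which flower:

  #!/bin/sh -e
  # start the Flower monitoring UI for Celery on port 5555
  /usr/bin/flower --port=5555 &
  exit 0

Remember to make the file executable with chmod +x /etc/rc.local, or systemd will not run it.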

The Celery daemon config also needs attention. /etc/default/celeryd must be edited so the service is enabled, and the chroot to /opt/Myproject must be disabled, since the missing directory causes the daemon to fail on startup. It seems safe to just turn this off.
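Assuming the stock file uses the usual variable names from the Celery generic init script, the relevant lines end up looking like this (if your copy names the chroot setting differently, comment that one out instead):

  # the package ships with this set to false
  ENABLED="true"
  # the stock entry points at a directory that does not exist
  #CELERYD_CHDIR="/opt/Myproject"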

Neo4j will be bound to just localhost and will not have a password. Since we’re building a server, rather than a specialty desktop, let’s fix this, too. The file is /etc/neo4j/neo4j.conf, these steps will permit remote access.

  • dbms.security.auth_enabled=true
  • dbms.connectors.default_listen_address=0.0.0.0
  • systemctl restart neo4j
  • visit http://<IP of VM>:7474 and set a password
  • visit Config area in OSS web interface, add Neo4j credentials

Having completed these tasks, reboot the system to ensure it starts cleanly. You should find the Open Semantic Search interface here:

http://<IP of VM>/search

This seems like a good stopping point, but we are by no means finished. You can manually add content from the command line with the opensemanticsearch commands:

  • opensemanticsearch-delete
  • opensemanticsearch-enrich
  • opensemanticsearch-filemonitoring
  • opensemanticsearch-index-dir
  • opensemanticsearch-index-file
  • opensemanticsearch-index-rss
  • opensemanticsearch-index-sitemap
  • opensemanticsearch-index-sparql
  • opensemanticsearch-index-web
  • opensemanticsearch-index-web-crawl
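For example, to index a local directory and an RSS feed (the path and URL here are placeholders):

  opensemanticsearch-index-dir /home/user/documents
  opensemanticsearch-index-rss https://example.com/feed.xml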

There are still many problems to be resolved. Periodic collection from data sources is not working, and web interface submissions are problematic as well. Attempts to parse RSS feeds generate numerous parse errors. Web pages do not import smoothly, either from our WordPress based site or from one hosted on the commercial WordPress.com service.

We will keep coming back to this area, hopefully quickly moving past the administration details, and getting into some actual OSINT collection tradecraft.