Online conflicts are evolving rapidly and escalating. Three years ago we judged it best not to release a conflict-oriented tool, even one used purely for observation. Given the events since then, this notion of not proliferating seems … quaint.
So we released the Netwar System code and the companion ELK utilities, and this week we are going to revisit the Twitter Utils, a set of small scripts from our first-generation software that are still used for some day-to-day tasks.
When you live with a programming language and a couple of fairly complex distributed systems, there are troubles that arise which can be dispatched almost without thought. A new person attempting to use such a system might founder on one of these, so this post memorializes what is required for a from-scratch install on a fresh Ubuntu 18.04 system.
We converted to Python 3 a while ago. The default install includes Python 3.6.7, but you need pip, and git, too.
apt install python3-pip
apt install git
ln -s /usr/bin/python3 /usr/bin/python
ln -s /usr/bin/pip3 /usr/bin/pip
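A quick sanity check that the interpreter and tools are wired up (a sketch; on stock Ubuntu 18.04 this should report Python 3.6.x):

```shell
# Confirm a Python 3 interpreter and the supporting tools are on the path.
# After the symlinks above, plain `python` should behave like python3.
python3 --version
command -v git || echo "git is missing"
command -v pip3 || echo "pip is missing"
```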
The next step is to clone the Netwar System repository, make the commands executable, and place them on your path.
git clone git@github.com:NetwarSystem/NetwarSystem.git
cd NetwarSystem
chmod 755 tw-*
chmod 755 F-queue
cp tw-* /usr/local/bin/
cp F-queue /usr/local/bin/
Once that’s done, it’s time to install lots of packages. This is normally done like this:
pip install -r REQUIREMENTS.txt
But our REQUIREMENTS.txt for the Netwar System was pretty stale. We think it’s OK now, but here is how we updated it: a little bit of grep/sort/uniq produced a list of the missing packages.
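For the curious, a pipeline like the following approximates that grep/sort/uniq step (a sketch from memory, run from the repository directory; the exact pattern we used may have differed):

```shell
# List the top-level modules the tw-* scripts import, one per line.
# Comparing this against REQUIREMENTS.txt shows which packages are missing.
grep -hE '^(import|from) ' tw-* 2>/dev/null \
  | awk '{print $2}' \
  | cut -d. -f1 \
  | sort | uniq
```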
You can manually install those and they’ll all work, except for squish2, the name for our internal package that contains the code to “squish” bulky, low value fields out of tweets and user profiles. This requires special handling like so.
pip install -e .
If you hit any errors related to urllib3, SSL, or XML, those might be subtle dependency problems. Post them as issues on GitHub.
There are a bunch of Elasticsearch related scripts in the ELKSG repository. You should clone them and then copy them into your path.
git clone git@github.com:NetwarSystem/ELKSG.git
cd ELKSG
chmod 755 elk*
cp elk* /usr/local/bin/
The ELK software can handle a simple install or one with Search Guard. This is the simple setup, so add a final line to your ~/.profile so the scripts know where to find Elasticsearch.
You need the following four pieces of software to get the system running in standalone mode.
Redis and Netdata are simple.
apt install redis
Netdata has a really slick install procedure: copy one command from their site, paste it into a shell, and it does the install and makes the service active on port 19999.
Elasticsearch and Neo4j require a bit more work to get the correct version:
apt install oracle-java8-installer
apt install curl apt-transport-https
curl -s https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee /etc/apt/sources.list.d/elastic-6.x.list
apt update
apt install elasticsearch=6.5.4
apt install kibana=6.5.4
mv /etc/apt/sources.list.d/elastic-6.x.list /etc/apt/sources.list.d/elastic-6.x.idle
systemctl enable elasticsearch
systemctl start elasticsearch
systemctl enable kibana
systemctl start kibana
The mv line leaves the Elasticsearch repository file in your sources directory but disables it, so you can update the rest of your system without stepping on the specific version needed.
Neo4j is similar, but it’s fine to track the latest version. Also note that Neo4j is a Java app – it needs the same Java installer we added for Elasticsearch.
wget -O - https://debian.neo4j.org/neotechnology.gpg.key | apt-key add -
echo 'deb https://debian.neo4j.org/repo stable/' | tee -a /etc/apt/sources.list.d/neo4j.list
apt update
apt install neo4j=1:3.5.4
The version shown is just what happened to be in the Neo4j install instructions on the day this article was written; Neo4j is not version-sensitive the way Elasticsearch is.
At this point you should have all four applications running. The one potential problem is Kibana, which may fail to start because it depends on Elasticsearch, and Elasticsearch takes a couple of minutes to come alive the first time it is run. Try these commands to verify:
systemctl status redis
systemctl status elasticsearch
systemctl status kibana
systemctl status neo4j
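If Kibana did die because Elasticsearch wasn’t ready, a small helper can poll the port before restarting it. This is a sketch, assuming the default Elasticsearch port of 9200 on localhost:

```shell
# Poll until something answers HTTP on 127.0.0.1:$1, making up to $2
# attempts one second apart. Returns non-zero if nothing ever answers.
wait_for_port() {
  tries=$2
  while [ "$tries" -gt 0 ]; do
    if curl -s "http://127.0.0.1:$1/" >/dev/null; then return 0; fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Typical use: give Elasticsearch a few minutes, then bring Kibana back up.
# wait_for_port 9200 180 && systemctl restart kibana
```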
In terms of open TCP ports, try the following, which checks the access ports for Kibana, Redis, Neo4j, Elasticsearch, and Netdata.
netstat -lan | awk '/:5601|:6379|:7474|:9200|:19999/'
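The same check can be scripted into a pass/fail report per service (a sketch using the ports above; nothing here is specific to the Netwar scripts):

```shell
# Report which of the stack's ports currently have a listener.
for port in 5601 6379 7474 9200 19999; do
  if netstat -lan 2>/dev/null | grep -q ":$port "; then
    echo "port $port: listening"
  else
    echo "port $port: nothing listening"
  fi
done
```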
And that’s that – you’ve got the software installed. Now we need to configure some things.
Linux & Packages Configuration
There are a number of things that need adjusting in order for the system to run smoothly. Elasticsearch will cause dropped packets under a load, so let's add these two lines to /etc/sysctl.conf:
net.core.netdev_budget=3500
net.core.netdev_budget_usecs=35000
And then make them immediately active:
sysctl -w net.core.netdev_budget=3500
sysctl -w net.core.netdev_budget_usecs=35000
We also need to adjust the file handle and process limits upward for Elasticsearch’s Lucene component and Neo4j’s worker threads. Add these lines to /etc/security/limits.conf; note that the actual file uses tab stops, which don’t survive on the blog. Here it’s simplest to reboot to make these settings active.
elasticsearch - nofile 300000
neo4j - nofile 300000
root - nofile 300000
neo4j hard nproc 10000
neo4j soft nproc 10000
If you’re running this software on your desktop, pointing a web browser at port 5601 will show Kibana and 7474 will show Neo4j. If you’re using a standalone or virtual machine, you’ll need to open some access. Here are three sed one-liners that will do that.
sed -i 's/#network.host: 192.168.0.1/network.host: 0.0.0.0/' /etc/elasticsearch/elasticsearch.yml
sed -i 's/#server.host: \"localhost\"/server.host: 0.0.0.0/' /etc/kibana/kibana.yml
sed -i 's/#dbms.connectors.default_listen/dbms.connectors.default_listen/' /etc/neo4j/neo4j.conf
systemctl restart elasticsearch
systemctl restart kibana
systemctl restart neo4j
Elasticsearch doesn’t require a password in this configuration, but Neo4j does, and it’ll make you change it from the default of ‘neo4j’ the first time you log in to the system.
OK, point your browser at port 19999 and you should see the Netdata dashboard. Notice the elasticsearch local and redis local tabs at the lower right. You can get really detailed information on what Elasticsearch is doing, which is helpful when you are just starting to explore its capabilities.
Configuring Your First Twitter Account
You must have a set of Twitter application keys to take the next step. You’ll need to add the Consumer Key and Consumer Secret to the tw-auth command. Run it, paste the URL it offers into a browser, log in with your Twitter account, enter the seven-digit PIN from the browser into the script, and it will create a ~/.twitter file holding your credentials.
You’ll need to enter the Neo4j password you set earlier. The elksg variable has to point to the correct host and port. The elksguser/elksgpass entries are just placeholders. If you got this right, this command will cough up your login shell name and Twitter screen name.
Next, you can check that your Elasticsearch commands are working:
Now is the time to get Elasticsearch ready to accept Twitter data. Mostly this involves making sure it recognizes timestamps. Issue these commands:
The first three ensure that timestamps work for the master user index, any tu* index related to a specific collection, and any tw* index containing tweets. The mylog command ensures the perflog index is searchable. The last command bumps the field limit on indices. Experienced Elasticsearch users will be scratching their heads on this one – we still have much to learn here, so feel free to educate us on how to permanently handle that problem.
If you want to see what these did, this command will show you a lot of JSON.
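For readers new to Elasticsearch, the general shape of such a setup is an index template with a date mapping. The sketch below is only illustrative: the field name, template name, and date format are our assumptions, not what the elk* scripts actually send (the format shown is the Joda pattern matching Twitter's created_at strings on Elasticsearch 6.x):

```shell
# Write an illustrative index template that maps a created_at field as a
# date for any index whose name starts with "tw".
cat > tw-template.json <<'EOF'
{
  "index_patterns": ["tw*"],
  "mappings": {
    "doc": {
      "properties": {
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        }
      }
    }
  }
}
EOF
# Applying it would look like this (assumes Elasticsearch on localhost:9200):
# curl -s -H 'Content-Type: application/json' \
#      -X PUT localhost:9200/_template/tw-template -d @tw-template.json
```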
And now we’re dangerously close to actually getting some content in Elasticsearch. Try the following commands:
tw-friendquick NetwarSystem > test.txt
This should produce a file with around 180 numeric Twitter IDs that are followed by @NetwarSystem; the next commands load them into Redis for processing, and the last gives a count of how many are loaded. This is the big moment, so try this command next:
That command should spew a bunch of JSON as it runs. The preceding time command will tell you how long it took, a useful thing when performance-tuning long-running processes.
Now try this one:
You should get back two very long lines of text – one for the usertest index, showing about 180 documents, and the other for perflog, which will have just a few.
There, you’ve done it! Now let’s examine the results.
Your next steps require the Kibana graphical interface. Point your browser at port 5601 on your system. You’ll be presented with the Kibana welcome page. You can follow their tutorial if you’d like. Once you’ve done that, or skipped it, you will do the following:
- Go to the Management tab
- Select Index Patterns
- Create an Index Pattern for the usertest index
There should be a couple of choices for time fields – one for when the user account was created, the other for the date of the user’s last tweet. Once you’ve done this, go to the Discover tab, which should default to your newly created Index Pattern. Play with the time picker at the upper right, find the Relative option, and set it to 13 years. You should see a creation date histogram.
Writing this post involved grinding off every burr we found in the GitHub repositories, which was an all-day job, but we’ve come to the point where you have cut & pasted all you can. The next steps will involve watching videos about how to use Kibana, laying hands on a copy of Elasticsearch: The Definitive Guide, and installing Graphileon so you can explore the Neo4j data.