If you are reading this blog post via a 3rd party source it is very likely that many parts of it will not render correctly (usually, the interactive graphs). Please view the post on dogesec.com for the full interactive viewing experience.

tl;dr

txt2stix + stix2arango + arango_taxii_server = a robust and flexible setup for storing and distributing cyber threat intelligence you’ve produced.

The problem

You’ve spent years developing your skills.

Your intel gathering techniques are second-to-none.

The data you’ve curated is unmatched.

And then you get bogged down with the question: can you build an integration with product X?

Where X is different for every single one of your customers.

Then you waste weeks trying to build these integrations, neglecting the research that sets you apart, by which time said product has changed its API so you’re back to square one.

To solve this problem we built a series of open-source tools that can be knitted together to create, store, and retrieve cyber threat intelligence, allowing us to focus on the quality of our intelligence versus the tooling around it.

This post will walk you through how you can knit them together to distribute your own intel.

STIX as a data format for your intel

If you’ve followed this blog for any period of time, you’ll know that all the intelligence we produce is structured as STIX 2.1 objects.

My recommendation is you go down this route too.

If you’re new to STIX, I’d recommend taking a look at my old STIX posts to try and understand why;

  1. A Beginners Guide to STIX 2.1 Objects
  2. A Quick-start Guide for the STIX 2 Python Library
  3. Creating Your Own Custom STIX Objects

Putting this theory into practice, here’s an example of some code I wrote to convert Abuse.ch’s SSL Certificate Blacklist feed into STIX Objects.

Above is a sample of the output from my code. It nicely demonstrates the numerous benefits of using STIX 2.1 to represent your intelligence;

  1. Python libraries already exist to generate objects (stix2 on pypi)
  2. There are almost 50 object types that already exist to use (or you can create your own)
  3. You can represent the intelligence in a graph structure, with rich relationships
  4. It’s natively understood by a lot of software (like the graph I’m using to render it)

Alternatively, if you write up intelligence reports you can automate this process, without writing any code, using txt2stix.

Run your Word doc or PDF file through txt2stix and it will automatically identify the STIX objects and the relationships between them described in the document.

OK, now the structure of the data is decided. We need somewhere to store it, where your consumers can access it.

Graph Databases to store your intel

STIX 2.1 has a concept of a bundle. Put in simple terms, a JSON wrapper for the JSON STIX objects.
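
To make the "JSON wrapper" idea concrete, here is a minimal sketch that builds a bundle using only the Python standard library. The two objects are trimmed copies of the BazaLoader objects shown later in this post, and make_bundle is a hypothetical helper name, not part of any of the tools described here. In practice the stix2 library would normally do this for you.

```python
import json
import uuid


def make_bundle(objects):
    """Wrap a list of STIX object dicts in a STIX 2.1 bundle dict."""
    return {
        "type": "bundle",
        "id": f"bundle--{uuid.uuid4()}",  # bundles get a random identifier
        "objects": list(objects),
    }


# Trimmed for brevity: a full Malware SDO also carries created/modified times.
malware = {
    "type": "malware",
    "spec_version": "2.1",
    "id": "malware--b1ab6e24-6ed8-585c-b497-d2b8c4b0a23b",
    "name": "BazaLoader",
    "is_family": True,
}
sample = {
    "type": "file",
    "spec_version": "2.1",
    "id": "file--571a2e2b-ee77-52d3-b6da-2715269873fc",
    "hashes": {"SHA-1": "14d0b902caad60435ad3c32a025a24c1f97929be"},
}

bundle = make_bundle([malware, sample])
print(json.dumps(bundle, indent=2))
```

The bundle itself carries no intelligence; it is only the envelope that lets you ship the objects as one file.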

Bundling all objects around a theme in a single JSON file makes it very easy to share. Here’s the bazaloader bundle rendered in the graph above. In txt2stix a bundle is created for each file entered, containing all the STIX objects extracted from it.

This works nicely when curating distinct intelligence reports. However, if you’re producing the intelligence, these bundles can often grow quickly as new information is identified. Similarly, you’ll quickly start publishing the same data (e.g. observed IP addresses) in numerous reports.

Having this data in a database makes for easier storage and retrieval. As STIX is represented as a graph structure, it clearly makes sense to use a graph database.

If you have used any of our tools you will know we use ArangoDB to store STIX objects. I won’t explain why we chose ArangoDB here, because this post from Sekoia explains it perfectly.

The good;

  • ArangoDB supports JSON objects by default (which is how STIX objects are constructed).

The bad;

Additional logic is required to…

  • separate graph vertices and edges (STIX objects and relationships respectively)
  • perform updates to STIX objects already in the database
  • handle embedded references (in _ref and _refs properties) to other STIX objects

To do this, we built a lightweight wrapper called stix2arango that deals with this logic.

stix2arango will take a bundle of STIX objects and insert them into ArangoDB correctly.
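
To give a feel for the first and third points in the list above (splitting vertices from edges, and handling embedded references), here is a rough Python sketch. It is not stix2arango's actual implementation: split_bundle, the "embedded" flag, and the use of the property name as the relationship label are all assumptions made for illustration. It also ignores updates entirely.

```python
def split_bundle(bundle):
    """Split a STIX bundle into vertex documents and edge documents.

    Relationship objects become edges as-is. Every *_ref / *_refs property
    on the remaining objects also yields one edge per referenced object.
    """
    vertices, edges = [], []
    for obj in bundle["objects"]:
        if obj["type"] == "relationship":
            edges.append(obj)  # already has source_ref and target_ref
            continue
        vertices.append(obj)
        for key, value in obj.items():
            if key.endswith("_refs"):
                targets = value  # a list of STIX ids
            elif key.endswith("_ref"):
                targets = [value]  # a single STIX id
            else:
                continue
            for target in targets:
                edges.append({
                    "source_ref": obj["id"],
                    "target_ref": target,
                    "relationship_type": key,  # hypothetical label choice
                    "embedded": True,  # hypothetical marker
                })
    return vertices, edges


demo_bundle = {
    "type": "bundle",
    "id": "bundle--00000000-0000-4000-8000-000000000000",  # placeholder id
    "objects": [
        {
            "type": "malware",
            "id": "malware--b1ab6e24-6ed8-585c-b497-d2b8c4b0a23b",
            "created_by_ref": "identity--a1cb37d2-3bd3-5b23-8526-47a22694b7e0",
            "sample_refs": [
                "file--571a2e2b-ee77-52d3-b6da-2715269873fc",
                "file--f87f30ba-5997-55c4-b3cb-b45720654143",
            ],
        },
        {"type": "file", "id": "file--571a2e2b-ee77-52d3-b6da-2715269873fc"},
        {"type": "file", "id": "file--f87f30ba-5997-55c4-b3cb-b45720654143"},
    ],
}

vertices, edges = split_bundle(demo_bundle)
print(len(vertices), "vertices,", len(edges), "edges")
```

Each vertex list entry would end up in the vertex collection and each edge in the edge collection.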

Let me add the bazaloader bundle using stix2arango;

python3 stix2arango.py \
  --file bazaloader.json \
  --database blog_demo \
  --collection abuse_ch_sslblacklist

For more information about how this command is constructed and further options available for other use-cases, check the stix2arango documentation.

I’ll jump straight to the ArangoDB query interface to show you how you can retrieve this data once it’s imported.

ArangoDB abuse.ch sslblacklist collections

Above, you can see stix2arango has split the objects into two collections, one representing the STIX SDOs and SCOs (vertices), the other covering the relationships (edges) between them.

I can now use the Arango Query Language (AQL) to search, filter and retrieve the STIX objects.

A simple query could be to return all Malware objects in the database.

FOR doc IN abuse_ch_sslblacklist_vertex_collection
  FILTER doc.type == "malware"
  LET keysToKeep = ATTRIBUTES(doc, true)[* FILTER CURRENT NOT LIKE "\\_%"]  // Find keys that do NOT start with "_" (a bare _ is a single-character wildcard in LIKE, so it is escaped)
  RETURN [KEEP(doc, keysToKeep)]  // Return only the keys to keep

Note, the keysToKeep is used to hide the internal ArangoDB keys attached to the object (i.e. those that start with _).
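
If you are post-processing documents in code rather than in AQL, the same idea is a one-liner. This is a small sketch only: strip_internal_keys is a hypothetical helper, and the _key, _id and _rev values below are made-up placeholders standing in for ArangoDB's system attributes.

```python
def strip_internal_keys(doc):
    """Return a copy of doc without any keys that start with an underscore."""
    return {key: value for key, value in doc.items() if not key.startswith("_")}


stored = {
    "_key": "example-key",  # placeholder value
    "_id": "abuse_ch_sslblacklist_vertex_collection/example-key",  # placeholder
    "_rev": "example-rev",  # placeholder value
    "id": "malware--b1ab6e24-6ed8-585c-b497-d2b8c4b0a23b",
    "name": "BazaLoader",
    "type": "malware",
}

print(strip_internal_keys(stored))
```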

[
  {
    "created": "2020-05-01T15:07:34.000Z",
    "created_by_ref": "identity--a1cb37d2-3bd3-5b23-8526-47a22694b7e0",
    "id": "malware--b1ab6e24-6ed8-585c-b497-d2b8c4b0a23b",
    "is_family": true,
    "malware_types": [
      "remote-access-trojan"
    ],
    "modified": "2021-08-02T17:30:25.000Z",
    "name": "BazaLoader",
    "object_marking_refs": [
      "marking-definition--94868c89-83c2-464b-929b-a1a8aa3c8487",
      "marking-definition--a1cb37d2-3bd3-5b23-8526-47a22694b7e0"
    ],
    "sample_refs": [
      "file--571a2e2b-ee77-52d3-b6da-2715269873fc",
      "file--f87f30ba-5997-55c4-b3cb-b45720654143",
      "<OTHER ENTRIES IN LIST REMOVED FOR BREVITY>"
    ],
    "spec_version": "2.1",
    "type": "malware"
  }
]

I could also find out what objects have a relationship to this one (malware--b1ab6e24-6ed8-585c-b497-d2b8c4b0a23b).

FOR edge IN abuse_ch_sslblacklist_edge_collection
  FILTER edge.source_ref == "malware--b1ab6e24-6ed8-585c-b497-d2b8c4b0a23b"
  RETURN [edge]

ArangoDB abuse.ch sslblacklist graph

Here you can start to see the benefits of having the data in a graph structure. In the middle is the Malware object, with relationships to all the File objects (listed in the sample_refs property above, a good example of an embedded relationship handled by stix2arango).

Both of these queries are VERY simple. AQL does allow for much more advanced query logic. Similarly, the commercial editions of ArangoDB also offer the ability to use the stored data in conjunction with AI models so you can start to ask questions using LLMs, for example.

However, let’s be honest; consumers don’t want to learn AQL.

Exposing this data to consumers

They probably don’t want to have to understand a custom API you’ve built to surface these objects in order to integrate it with their other tools. Nor do you want to write and manage the queries to power these API endpoints.

That’s where TAXII comes in.

In short the TAXII specification defines a single API for sharing threat intelligence. The idea being that all upstream and downstream technology can be built to support one standard, versus many individual API designs.

A TAXII Server is responsible for storing and distributing the intel. A TAXII Client consumes and posts intel from and to a TAXII server.

At this point, a TAXII Server is what’s needed, and yes, we have built a TAXII implementation on top of ArangoDB removing the need for you to have to write the logic yourself. It’s called Arango TAXII Server.

Follow the Arango TAXII Server docs to get it up and running.

The only other step that’s required is to create an account for an intelligence consumer on ArangoDB.

ArangoDB user permissions

Above I am creating demo_user and only providing them Read access to the blog_demo_database (and all the Collections housed within it, currently only 2).

Opening up the TAXII Server Swagger docs (running at http://127.0.0.1:8000/api/schema/swagger-ui/, if you’re running Arango TAXII Server locally), you can start to explore the endpoints.

Arango TAXII Server Swagger

Show the available TAXII API Roots (aka ArangoDB databases) for this user;

curl -X 'GET' \
  'http://127.0.0.1:8000/api/taxii2/' \
  -H 'accept: application/taxii+json;version=2.1' \
  -H 'Authorization: Basic ZGVtb191c2VyOnBhc3N3b3Jk'
{
  "title": "Arango TAXII Server",
  "description": "https://github.com/muchdogesec/arango_taxii_server/",
  "contact": "[email protected]",
  "api_roots": [
    "http://127.0.0.1:8000/api/taxii2/blog_demo_database/"
  ]
}
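
For anyone scripting this rather than using curl, here is a hedged sketch of the same request built with Python's standard library. The Authorization header above is simply the Base64 encoding of demo_user:password, which is where the credentials below come from. build_taxii_request is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import base64
import urllib.request

TAXII_ACCEPT = "application/taxii+json;version=2.1"


def build_taxii_request(url, username, password):
    """Build a GET request carrying the TAXII Accept and Basic auth headers."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        headers={
            "Accept": TAXII_ACCEPT,
            "Authorization": f"Basic {token}",
        },
        method="GET",
    )


request = build_taxii_request(
    "http://127.0.0.1:8000/api/taxii2/", "demo_user", "password"
)
print(request.get_header("Authorization"))  # prints: Basic ZGVtb191c2VyOnBhc3N3b3Jk

# To actually send it against a running server you could use:
#   with urllib.request.urlopen(request) as response:
#       discovery = json.load(response)
```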

Show the available TAXII Collections in this API Root for this user;

curl -X 'GET' \
  'http://127.0.0.1:8000/api/taxii2/blog_demo_database/collections/' \
  -H 'accept: application/taxii+json;version=2.1' \
  -H 'Authorization: Basic ZGVtb191c2VyOnBhc3N3b3Jk'
{
  "collections": [
    {
      "id": "abuse_ch_sslblacklist",
      "title": "abuse_ch_sslblacklist",
      "description": null,
      "can_read": true,
      "can_write": false,
      "media_types": [
        "application/stix+json;version=2.1"
      ]
    }
  ]
}

Note, Arango TAXII Server will automatically merge the vertex and edge collections into a single TAXII Collection when shown to consumers (e.g. the above TAXII Collection abuse_ch_sslblacklist, considers the ArangoDB Collections abuse_ch_sslblacklist_vertex_collection and abuse_ch_sslblacklist_edge_collection).
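
The naming convention behind that merge can be sketched in a few lines of Python. This only illustrates the mapping described above, assuming the two suffixes shown; it is not taken from the Arango TAXII Server code, and taxii_collection_ids is a hypothetical name.

```python
SUFFIXES = ("_vertex_collection", "_edge_collection")


def taxii_collection_ids(arango_collections):
    """Collapse vertex/edge collection names into one id per pair."""
    ids = set()
    for name in arango_collections:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                ids.add(name[: -len(suffix)])
    return sorted(ids)


print(taxii_collection_ids([
    "abuse_ch_sslblacklist_vertex_collection",
    "abuse_ch_sslblacklist_edge_collection",
]))
```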

And then finally, show the objects in this TAXII Collection;

curl -X 'GET' \
  'http://127.0.0.1:8000/api/taxii2/blog_demo_database/collections/abuse_ch_sslblacklist/objects/' \
  -H 'accept: application/taxii+json;version=2.1' \
  -H 'Authorization: Basic ZGVtb191c2VyOnBhc3N3b3Jk'
{
  "more": true,
  "next": "49628380_2024-08-29T08:55:09.469475Z",
  "objects": [
    {
      "hashes": {
        "SHA-1": "14d0b902caad60435ad3c32a025a24c1f97929be"
      },
      "id": "file--571a2e2b-ee77-52d3-b6da-2715269873fc",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "511cdfe4eb4b2aa10b6e4e153c7f8d2fde0baaa0"
      },
      "id": "file--f87f30ba-5997-55c4-b3cb-b45720654143",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "be741a5045c0ca95f8b78683d004e4a34562e3a9"
      },
      "id": "file--30b28bc3-4107-5cd1-87b9-13e6ac0d82ef",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "1d515bdb771dad480db077e214ac7de947e593ff"
      },
      "id": "file--e6b2e942-55b8-5c85-b1fa-40bc306ba320",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "f760eef17a056d0dbca8ffa7614ac2965997f8eb"
      },
      "id": "file--080f40a3-adc3-5d18-8ef5-f9caceffa770",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "fa310de69957a073acb83219ebead3d3d8c2b380"
      },
      "id": "file--bc128c8f-757e-5825-b96c-80d76e037835",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "140b7a09d2448d688ab2569cee7e932dce7cc6dc"
      },
      "id": "file--de2d78a4-e868-5608-95d4-5bcda0cc9bd8",
      "spec_version": "2.1",
      "type": "file"
    },
    {
      "hashes": {
        "SHA-1": "3f9ff233186cf48138a90190b0af5801404064f8"
      },
      "id": "file--f510fd98-f96c-593a-8b18-aa81483ae010",
      "spec_version": "2.1",
      "type": "file"
    },

The above response is cut for brevity.

I’ve glossed over a lot of the Arango TAXII Server features (e.g. the URL filtering parameters), however, let’s be honest with ourselves again; end-users don’t want to be writing these API requests – that’s the job of the TAXII client I described earlier.
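
Note the more and next properties at the top of the response. In TAXII 2.1, when more is true the client asks for the following page by passing the next value back as the next URL query parameter. Here is a minimal sketch of that loop in Python. The fetch_page callable is an assumption: it stands in for whatever function performs the HTTP request and returns the decoded JSON, and the pages and ids below are fake placeholders so the example runs without a server.

```python
def iter_objects(fetch_page):
    """Yield every object from a paginated TAXII objects endpoint.

    fetch_page(next_token) must return the decoded response dict;
    next_token is None when requesting the first page.
    """
    next_token = None
    while True:
        envelope = fetch_page(next_token)
        yield from envelope.get("objects", [])
        if not envelope.get("more"):
            break  # no further pages
        next_token = envelope["next"]


# Fake two-page response keyed by the token used to request each page.
fake_pages = {
    None: {
        "more": True,
        "next": "token-1",
        "objects": [{"id": "file--1"}, {"id": "file--2"}],
    },
    "token-1": {"more": False, "objects": [{"id": "file--3"}]},
}

ids = [obj["id"] for obj in iter_objects(fake_pages.get)]
print(ids)
```

Against a real server, fetch_page would append ?next=<token> to the objects URL whenever a token is supplied.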

Consuming the exposed data

Most leading security tools have a TAXII Client built in; Microsoft Sentinel, Palo Alto Networks XSOAR, Filigran OpenCTI, etc.

Let’s take OpenCTI, because it’s open-source and allows you to continue to follow along.

To install OpenCTI, follow their docs here. I use the Docker install, and the rest of the blog will assume this.

Once logged into the portal, navigate to Ingestion > TAXII Feeds (http://127.0.0.1:8080/dashboard/data/ingestion/taxii).

Most settings are obvious, however for clarity;

  • TAXII server URL: http://127.0.0.1:8000/api/taxii2/blog_demo_database/collections
  • TAXII Collection: abuse_ch_sslblacklist
  • Authentication type: Basic user / password
  • Username: <YOUR ARANGODB USER>
  • Password: <THE PASSWORD>

OpenCTI Results

If I run a search in OpenCTI (http://127.0.0.1:8080/dashboard/search/knowledge/?sortBy=_score&orderAsc=false) after the TAXII job is run, I can see all the imported objects from the Collection.

As new objects are added to the ArangoDB Collection, the OpenCTI Connector will continue to import them.

In summary

  1. use the stix2 Python library to create intel (or use txt2stix)
  2. import it to ArangoDB using stix2arango
  3. expose it using Arango TAXII Server
  4. let your consumers connect their TAXII Clients to it

Posted by:

David Greenwood

David Greenwood, Do Only Good Everyday



