Turn any Blog Post into Structured Threat Intelligence

If you are reading this blog post via a 3rd party source it is very likely that many parts of it will not render correctly (usually, the interactive graphs). Please view the post on dogesec.com for the full interactive viewing experience.

tl;dr

Obstracts is the blog feed reader used by the world's most targeted cyber-security teams. Let me show you why.

Overview

Most security teams use cyber-security blogs to keep up with the latest research. Our Awesome Threat Intel Blogs repository tracks some of the most popular (now over 300 stars!).

Almost all of the analysts we spoke to before building Obstracts told us they spend a significant amount of time copying and pasting data from blog posts into their intelligence tooling.

This is both time-consuming and prone to error.

What’s more, intelligence tooling doesn’t always support contextual relationships between the intelligence described in blog posts (e.g. Domain fake.com resolves to IP address 1.1.1.1).

Obstracts extracts IoCs and TTPs from blog posts, including capturing the relationships between them (as STIX 2.1 Objects).

We’ve recently open-sourced the core Obstracts engine. I therefore wanted to take the opportunity to walk you through a quick start guide to get up and running.

Before I continue

We offer a fully hosted web version of Obstracts which includes many additional features over those in the backend engine. You can find out more about the web version here.

Step 0: Install and setup

You can read the installation instructions here.

Once you’ve got Obstracts running, you can use the Swagger UI that ships with Obstracts to interact with the API (see the history4feed docs).

Obstracts Swagger UI Default

Step 1: Configuring profiles

Obstracts uses txt2stix in the backend to extract data using;

Extractors, which extract data from the text that is then converted into STIX objects.
Whitelists, which provide a list of values for Extractors to ignore.
Aliases, which replace strings in the blog post with values defined in the Alias.

You can see the options available for each of the above concepts via the API. For example, to get a full list of Extractors;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/extractors/' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 1,
  "page_results_count": 50,
  "total_results_count": 68,
  "extractors": [
    {
      "id": "pattern_ipv4_address_only",
      "name": "IPv4 Address Only",
      "type": "pattern",
      "description": "Extracts IPv4 addresses",
      "notes": "The logic for this is covered in the Python Validators library: https://validators.readthedocs.io/en/latest/#module-validators.ip_address. A good description of IPv4/IPv6 formats can be read here: https://www.ibm.com/docs/en/ts4500-tape-library?topic=functionality-ipv4-ipv6-address-formats.",
      "created": "2020-01-01",
      "modified": "2020-01-01",
      "created_by": "DOGESEC",
      "version": "1.0.0",
      "stix_mapping": "ipv4-addr"
    },
    {
      "id": "pattern_ipv4_address_cidr",
      "name": "IPv4 Address with CIDR",
      "type": "pattern",
      "description": "Extracts IPv4 addresses with CIDRs",
      "notes": "The pattern_ipv4_address_only base extration is used, in addition to logic to detect a port. Port numbers must be within the range 0-65535.",
      "created": "2020-01-01",
      "modified": "2020-01-01",
      "created_by": "DOGESEC",
      "version": "1.0.0",
      "stix_mapping": "ipv4-addr"
    },
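
Note that the response above is paginated: 68 extractors at a page size of 50 means a second page is needed to see them all. Here is a minimal Python sketch of walking every page. `fetch_page` is a hypothetical callable you supply yourself (for example a thin wrapper around an HTTP GET); how the page number is passed to the API is left to it, because the query parameter name is not shown in this post. The fake fetcher only exists so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List


def iter_results(fetch_page: Callable[[int], Dict], results_key: str) -> Iterator[Dict]:
    """Yield every item from a paginated response shaped like the one above.

    fetch_page(n) must return the decoded JSON for page n, containing
    total_results_count and a list stored under results_key.
    """
    page_number = 1
    seen = 0
    while True:
        page = fetch_page(page_number)
        items: List[Dict] = page.get(results_key, [])
        yield from items
        seen += len(items)
        # Stop once everything has been read, or on an empty page (safety net).
        if not items or seen >= page["total_results_count"]:
            break
        page_number += 1


def fake_fetch(page_number: int) -> Dict:
    """Stand-in for the API: 68 made-up extractors served 50 per page."""
    all_items = [{"id": f"extractor_{i}"} for i in range(68)]
    start = (page_number - 1) * 50
    chunk = all_items[start:start + 50]
    return {
        "page_size": 50,
        "page_number": page_number,
        "page_results_count": len(chunk),
        "total_results_count": len(all_items),
        "extractors": chunk,
    }


extractor_ids = [e["id"] for e in iter_results(fake_fetch, "extractors")]
print(len(extractor_ids))
```

The same helper should work for any of the paginated endpoints in this post by changing `results_key` (e.g. `posts`, `images` or `objects`).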

To save you from having to define which extractions, whitelists and aliases to use each time you process blog posts, you can create a Profile with these concepts defined.

Obstracts profiles

Here is an example request to create a very simple profile (no whitelists or aliases are set);

curl -X 'POST' \
  'http://127.0.0.1:8001/api/v1/profiles/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "Basic threat intel extractions. AI relationship. Extract text from images.",
  "extractions": [
    "pattern_ipv4_address_only",
    "pattern_domain_name_only",
    "pattern_email_address",
    "lookup_mitre_attack_enterprise_id"
  ],
  "relationship_mode": "ai",
  "extract_text_from_image": true,
  "defang": true
}'

The extractions I’ve shown above are very simple, intended only to give you an idea of how profiles work.
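
If you are scripting profile creation rather than using curl, it can help to assemble the request body in code first. The following is only a sketch: `build_profile_body` is a hypothetical helper of mine (not part of Obstracts), and the only validation it does is a local check of `relationship_mode` against the two modes described in this post.

```python
import json

# The two relationship modes described in this post.
VALID_RELATIONSHIP_MODES = ("standard", "ai")


def build_profile_body(name, extractions, relationship_mode,
                       extract_text_from_image, defang):
    """Assemble the JSON body for creating a profile, with a basic local check."""
    if relationship_mode not in VALID_RELATIONSHIP_MODES:
        raise ValueError(
            f"relationship_mode must be one of {VALID_RELATIONSHIP_MODES}"
        )
    return {
        "name": name,
        "extractions": list(extractions),
        "relationship_mode": relationship_mode,
        "extract_text_from_image": bool(extract_text_from_image),
        "defang": bool(defang),
    }


body = build_profile_body(
    name="Basic threat intel extractions. AI relationship. Extract text from images.",
    extractions=[
        "pattern_ipv4_address_only",
        "pattern_domain_name_only",
        "pattern_email_address",
        "lookup_mitre_attack_enterprise_id",
    ],
    relationship_mode="ai",
    extract_text_from_image=True,
    defang=True,
)
# This string is what would be sent as the -d payload of the curl request.
print(json.dumps(body, indent=2))
```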

Obstracts supports hundreds of extraction types, including AI-based TTP detection.

You’ll see I also define extract_text_from_image, defang and relationship_mode in the body of the request to create a profile.

relationship_mode is also a txt2stix setting and determines how the extractions should be connected. There are two options; standard or ai. standard creates basic relationships back to the master Report STIX object generated for the post. It's useful when you only want the IoCs. ai runs the report through an AI model to identify the relationships between the extracted data and creates STIX Relationship objects to represent them.

Obstracts also runs file2txt in the backend (in html_article mode). extract_text_from_image can be used to turn images into text (which extractions can then be made from). defang converts defanged observables back to their original form (e.g. 1.1.1[.]1 becomes 1.1.1.1).
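
To give a feel for what removing fangs involves, here is a rough Python sketch. It is not the logic Obstracts or file2txt actually uses; `refang` is a hypothetical function of mine that handles only a handful of common conventions (`[.]`, `(.)`, `[dot]`, `[@]`, `[at]`, `[://]` and `hxxp`).

```python
import re


def refang(text: str) -> str:
    """Undo a few common defanging conventions (illustrative, not exhaustive)."""
    # Bracketed or parenthesised dots: 1.1.1[.]1 -> 1.1.1.1
    text = re.sub(r"\[\.\]|\(\.\)|\[dot\]", ".", text, flags=re.IGNORECASE)
    # Bracketed at-signs: user[@]fake.com -> user@fake.com
    text = re.sub(r"\[@\]|\[at\]", "@", text, flags=re.IGNORECASE)
    # Bracketed scheme separator: https[://]fake.com -> https://fake.com
    text = text.replace("[://]", "://")
    # hxxp / hxxps schemes: hxxps://fake.com -> https://fake.com
    text = re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    return text


print(refang("1.1.1[.]1"))
print(refang("hxxps[://]fake[.]com/path"))
print(refang("user[@]fake[.]com"))
```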

The request returns a response that looks like this:

{
  "id": "bcf09ec5-d124-528a-bb21-480114231795",
  "created": "2024-10-18T13:54:33.688257Z",
  "name": "Basic threat intel extractions. AI relationship. Extract text from images.",
  "extractions": [
    "pattern_ipv4_address_only",
    "pattern_domain_name_only",
    "pattern_email_address",
    "lookup_mitre_attack_enterprise_id"
  ],
  "whitelists": [],
  "aliases": [],
  "relationship_mode": "ai",
  "extract_text_from_image": true,
  "defang": true
}

I can now use the profile id when adding a blog to define how text in each post should be extracted.

Step 2: Adding a blog

The request takes the form;

curl -X 'POST' \
  'http://127.0.0.1:8001/api/v1/feeds/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "profile_id": "bcf09ec5-d124-528a-bb21-480114231795",
  "url": "http://feeds.feedburner.com/Unit42",
  "include_remote_blogs": true
}'

In addition to the profile_id, I include the url of the RSS or ATOM feed of the blog. Here I use Unit42’s Threat Research feed.

You will also note I set include_remote_blogs. If set to false, Obstracts will ignore any posts that are not on the same domain as the URL of the feed. In this request I set it to true because the Unit42 posts are not on the feed's domain (feeds.feedburner.com); they are found at https://unit42.paloaltonetworks.com/.
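
To make the include_remote_blogs decision concrete, here is a small Python sketch of the kind of comparison involved. It is an illustration only: `is_remote_post` is a hypothetical helper, and comparing exact hostnames is my assumption (the real check may compare domains differently).

```python
from urllib.parse import urlsplit


def is_remote_post(feed_url: str, post_url: str) -> bool:
    """Return True when the post is hosted on a different host than the feed."""
    feed_host = urlsplit(feed_url).hostname or ""
    post_host = urlsplit(post_url).hostname or ""
    return feed_host != post_host


feed = "http://feeds.feedburner.com/Unit42"
post = "https://unit42.paloaltonetworks.com/gatekeeper-bypass-macos/"
# The Unit42 posts live on a different host than the feed itself, so they
# would be skipped unless include_remote_blogs is true.
print(is_remote_post(feed, post))
```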

This request will return a Job object, with a Job id.

{
  "id": "e922574f-f4d4-4409-8f3d-3d43115a2eb7",
  "feed_id": "b4e3f13c-0ad6-5abe-be01-2475d341bf84",
  "profile_id": "bcf09ec5-d124-528a-bb21-480114231795",
  "created": "2024-10-18T14:11:13.385904Z",
  "state": "retrieving",
  "history4feed_status": "pending",
  "history4feed_job": null,
  "item_count": 0,
  "processed_items": 0,
  "failed_processes": 0
}

Jobs are responsible for crawling the history of the blog (history4feed), converting each post to markdown (file2txt), extracting threat intel from the markdown for each post (txt2stix), and then finally storing it into the database (stix2arango).

Jobs can take a few minutes for blogs with a large archive whilst these processes happen.

You can track a job using its id as follows;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/jobs/e922574f-f4d4-4409-8f3d-3d43115a2eb7/' \
  -H 'accept: application/json'

{
  "id": "e922574f-f4d4-4409-8f3d-3d43115a2eb7",
  "feed_id": "b4e3f13c-0ad6-5abe-be01-2475d341bf84",
  "profile_id": "bcf09ec5-d124-528a-bb21-480114231795",
  "created": "2024-10-18T14:11:13.385904Z",
  "state": "retrieving",
  "history4feed_status": "running",
  "history4feed_job": {
    "id": "e922574f-f4d4-4409-8f3d-3d43115a2eb7",
    "info": "",
    "urls": {
      "failed": [],
      "skipped": [],
      "retrieved": [
        {
          "id": "3a0b4921-1318-5166-8330-2925b7477880",
          "url": "https://unit42.paloaltonetworks.com/script-based-malware/"
        },
        {
          "..."
        },
        {
          "id": "cc0f4faa-da4c-51b8-a907-776ed0cb8d5c",
          "url": "https://unit42.paloaltonetworks.com/cloud-threat-report-2h-2021/"
        }
      ]
    },
    "state": "running",
    "feed_id": "b4e3f13c-0ad6-5abe-be01-2475d341bf84",
    "profile_id": "bcf09ec5-d124-528a-bb21-480114231795",
    "run_datetime": "2024-10-18T14:11:13.359660Z",
    "count_of_items": 76,
    "include_remote_blogs": true,
    "latest_item_requested": "2024-10-18T14:11:13.359169Z",
    "earliest_item_requested": "2020-01-01T00:00:00Z"
  },
  "item_count": 18,
  "processed_items": 0,
  "failed_processes": 0
}

I’ve shortened the above response for brevity. Once the state of the job is processed, Obstracts has completed the collection of posts and the extraction of intelligence from them.
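
Rather than re-running the request by hand, you could poll the job until it finishes. Below is a minimal Python sketch. `fetch_job` is a hypothetical callable you supply that returns the decoded job JSON, and the poll interval and limit are arbitrary choices of mine. Only the retrieving and processed states appear in this post, so any failure states would need adding to `done_states` yourself.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_job(fetch_job: Callable[[], Dict],
                 done_states: Iterable[str] = ("processed",),
                 poll_seconds: float = 10.0,
                 max_polls: int = 360,
                 sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll fetch_job() until the job reaches one of done_states."""
    done = set(done_states)
    for _ in range(max_polls):
        job = fetch_job()
        if job["state"] in done:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} polls")


# Offline demonstration: a fake job that finishes on the third poll.
_states = iter(["retrieving", "retrieving", "processed"])


def fake_fetch_job() -> Dict:
    return {"id": "e922574f-f4d4-4409-8f3d-3d43115a2eb7", "state": next(_states)}


final_job = wait_for_job(fake_fetch_job, sleep=lambda seconds: None)
print(final_job["state"])
```

Passing `sleep` in as a parameter keeps the example testable without actually waiting.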

Step 3: Reading a blog

Once the job is complete, I can get the feed information for the feed_id shown in the job from the Feeds endpoint;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/feeds/b4e3f13c-0ad6-5abe-be01-2475d341bf84/' \
  -H 'accept: application/json'

{
  "id": "b4e3f13c-0ad6-5abe-be01-2475d341bf84",
  "count_of_posts": 158,
  "title": "Unit 42",
  "description": "Palo Alto Networks",
  "url": "http://feeds.feedburner.com/Unit42",
  "earliest_item_pubdate": "2020-06-04T02:00:17Z",
  "latest_item_pubdate": "2024-10-17T10:00:05Z",
  "datetime_added": "2024-10-18T14:11:13.355532Z",
  "feed_type": "rss"
}

One really cool feature I want to point out here is that Obstracts is not limited to the posts shown in the feed URL entered.

If you go to the feed I used, http://feeds.feedburner.com/Unit42, you’ll see there are only 15 <item> entries (posts) shown. Obstracts is able to crawl the entirety of a blog. As you can see above, Obstracts retrieved 158 posts for this feed.

This becomes very useful when searching for historic intelligence across posts – you’ll be able to uncover everything ever published (more on that later in this post).

I can request the posts for this feed like so;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/feeds/b4e3f13c-0ad6-5abe-be01-2475d341bf84/posts/' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 1,
  "page_results_count": 50,
  "total_results_count": 158,
  "posts": [
    {
      "id": "a3cda6e7-e0bc-5379-b908-9c9f196a64b0",
      "profile_id": "bcf09ec5-d124-528a-bb21-480114231795",
      "datetime_added": "2024-10-18T14:12:54.696154Z",
      "datetime_updated": "2024-10-18T14:12:58.608481Z",
      "title": "Gatekeeper Bypass: Uncovering Weaknesses in a macOS Security Mechanism",
      "description": "<html><body><div><div class=\"be__contents-wrapper\">\n                <h2><a id=\"post-137110-_nz6n5y1svbul\"></a><strong>Executive Summary</strong></h2>\n<p>Unit 42 researchers have found that certain third-party utilities and applications pertaining to archiving, virtualization and Apple’s native command-line tools do not enforce the quarantine attribute. This can pose a threat to the integrity of a security feature on macOS known as Gatekeeper, which is responsible for ensuring that only trusted software runs on the system. A bypass of Gatekeeper could leave the user unprotected from risky applications that may attempt to execute malicious content.</p>\n<p>One of the key components of the Gatekeeper security feature on macOS is a metadata attribute that the browser adds to downloaded files, which triggers Gatekeeper to validate the application. Apple assumes that developers will comply with their security guidelines regarding the inheritance of extended attributes, to ensure that this scanning mechanism can properly function. Because this is not necessarily the case, this can pose a weakness in the Gatekeeper mechanism.</p>\n<p>We urge all third-party developers to comply with Gatekeeper’s security requirements by enforcing the attribute on all files their applications handle. This will help to reduce the risk of malicious Gatekeeper bypasses.</p>\n<p>Gatekeeper is an essential security mechanism...",
      "link": "https://unit42.paloaltonetworks.com/gatekeeper-bypass-macos/",
      "pubdate": "2024-10-17T10:00:05Z",
      "author": "Adva Gabay and Maor Dokhanian",
      "is_full_text": true,
      "content_type": "text/html; charset=utf-8",
      "added_manually": false,
      "categories": [
        "threat-research",
        "vulnerabilities",
        "macos",
        "apple-gatekeeper",
        "third-party-applications"
      ]
    },

I’ve only included the first post for this blog, and cut a lot of its content, for brevity. It does, however, show another very useful ability of Obstracts; getting the full text of an article.

If you go back to the feed http://feeds.feedburner.com/Unit42 you’ll see that it only contains summaries of each post. Obstracts crawls for the entire post content and stores it as HTML in the description property of the post.

As new posts are added to blogs, you can run a request to PATCH /api/v1/feeds/{feed_id}/ to poll the blog for any new posts since the time of the last post indexed by Obstracts.
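
If you want to schedule that update from a script, the request can be built with the Python standard library. This is a sketch under assumptions: `build_feed_update_request` is a hypothetical helper, and the request body (if one is required) is not shown in this post, so it is left to the caller and defaults to an empty JSON object.

```python
import json
import urllib.request


def build_feed_update_request(base_url, feed_id, body=None):
    """Build (but do not send) a PATCH request for an existing feed."""
    payload = body if body is not None else {}  # body contents are an assumption
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/feeds/{feed_id}/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="PATCH",
    )


request = build_feed_update_request(
    "http://127.0.0.1:8001", "b4e3f13c-0ad6-5abe-be01-2475d341bf84"
)
print(request.get_method(), request.full_url)
# To send it against a running instance: urllib.request.urlopen(request)
```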

Obstracts also creates a markdown copy of the post using file2txt. This is what the extractions are made from. You can see the created markdown using the markdown endpoint;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/feeds/b4e3f13c-0ad6-5abe-be01-2475d341bf84/posts/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/markdown/' \
  -H 'accept: */*'

[comment]:<> (===START PAGE 1===)

## **Executive Summary**

Unit 42 researchers have found that certain third-party utilities and applications pertaining to archiving, virtualization and Apple’s native command-line tools do not enforce the quarantine attribute.This can pose a threat to the integrity of a security feature on macOS known as Gatekeeper, which is responsible for ensuring that only trusted software runs on the system.A bypass of Gatekeeper could leave the user unprotected from risky applications that may attempt to execute malicious content.

One of the key components of the Gatekeeper security feature on macOS is a metadata attribute that the browser adds to downloaded files, which triggers Gatekeeper to validate the application.Apple assumes that developers will comply with their security guidelines regarding the inheritance of extended attributes, to ensure that this scanning mechanism can properly function.Because this is not necessarily the case, this can pose a weakness in the Gatekeeper mechanism.

We urge all third-party developers to comply with Gatekeeper’s security requirements by enforcing the attribute on all files their applications handle.This will help to reduce the risk of malicious Gatekeeper bypasses.

Gatekeeper is an essential security mechanism on macOS; ideally its integrity will not rely on the goodwill of developers but on Apple’s enforcement of the quarantine attribute propagation where relevant.While Apple expects third-party application developers to keep a certain standard, some built-in utilities do not comply with this standard.Should Apple choose to do so, addressing this issue could be a positive step toward making the system more secure.

Palo Alto Networks customers are protected from malicious content from third-party applications through
[Cortex XDR](https://docs-cortex.paloaltonetworks.com/p/XDR)
and
[XSIAM](https://docs-cortex.paloaltonetworks.com/p/XSIAM)
.

I haven’t included the full markdown of the post here for brevity.

Obstracts also stores a copy of all images found in the blog post locally…

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/feeds/b4e3f13c-0ad6-5abe-be01-2475d341bf84/posts/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/images/' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 1,
  "page_results_count": 4,
  "total_results_count": 4,
  "images": [
    {
      "name": "0_image_0.png",
      "url": "http://127.0.0.1:8001/staticfiles/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/files/0_image_0.png"
    },
    {
      "name": "0_image_1.png",
      "url": "http://127.0.0.1:8001/staticfiles/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/files/0_image_1.png"
    },
    {
      "name": "0_image_2.png",
      "url": "http://127.0.0.1:8001/staticfiles/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/files/0_image_2.png"
    },
    {
      "name": "0_image_3.png",
      "url": "http://127.0.0.1:8001/staticfiles/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/files/0_image_3.png"
    }
  ]
}

Here is what 0_image_2.png looks like;

Example Obstract extracted image

Remember earlier I mentioned the extract_text_from_image setting in the profile. Here is an example of what that text extraction looks like for the above image in the markdown.

[comment]:<> (===START IMAGE DETECTED===)

![Screenshot of a computer terminal displaying commands related to xattr and gzip operations, focused on handling a file named 'bypass.app.gz' and interacting with macOS extended attributes.](http://127.0.0.1:8001/api/v1/feeds/b4e3f13c-0ad6-5abe-be01-2475d341bf84/posts/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/markdown/a3cda6e7-e0bc-5379-b908-9c9f196a64b0/files/0_image_3.png)

[comment]:<> (===START EMBEDDED IMAGE EXTRACTION===)
% xattr bypass.app.gz
com.apple.macl
com.apple.quarantine
% gzip -d bypass.app.gz
% xattr bypass.app
%
[comment]:<> (===END EMBEDDED IMAGE EXTRACTION===)

[comment]:<> (===END IMAGE DETECTED===)

This is incredibly useful, because Obstracts will use the text found in images when creating extractions (the text in this image will create host name type extractions).
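
If you want to pull just the image-derived text out of the markdown yourself, the comment markers shown above make that straightforward. Here is a short Python sketch; `embedded_image_text` is a hypothetical helper, and it assumes the markers always appear exactly as they do in the example.

```python
import re
from typing import List

# Matches everything between the START/END embedded image extraction markers.
_EMBEDDED_BLOCK = re.compile(
    r"\[comment\]:<> \(===START EMBEDDED IMAGE EXTRACTION===\)\n"
    r"(.*?)"
    r"\[comment\]:<> \(===END EMBEDDED IMAGE EXTRACTION===\)",
    re.DOTALL,
)


def embedded_image_text(markdown: str) -> List[str]:
    """Return the text extracted from each image in a post's markdown."""
    return [block.strip("\n") for block in _EMBEDDED_BLOCK.findall(markdown)]


sample = """[comment]:<> (===START IMAGE DETECTED===)

[comment]:<> (===START EMBEDDED IMAGE EXTRACTION===)
% xattr bypass.app.gz
com.apple.macl
com.apple.quarantine
[comment]:<> (===END EMBEDDED IMAGE EXTRACTION===)

[comment]:<> (===END IMAGE DETECTED===)
"""

blocks = embedded_image_text(sample)
print(blocks[0])
```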

Step 4: Browsing extracted intelligence

As noted, Obstracts uses txt2stix to extract intelligence from a post.

To get all the objects extracted for a single post, you can use the objects endpoint like so;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/feeds/b4e3f13c-0ad6-5abe-be01-2475d341bf84/posts/3cc32201-473f-5667-82b2-7f7dddea21ab/objects/?include_txt2stix_notes=false' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 2,
  "page_results_count": 50,
  "total_results_count": 283,
  "objects": [
    {
      "id": "domain-name--fed14a83-377e-58c2-916c-fc61641576b2",
      "spec_version": "2.1",
      "type": "domain-name",
      "value": "www.virusbulletin.com"
    },
    {
      "id": "email-addr--f0771d0e-5475-5da1-b79d-d906a2de73d3",
      "spec_version": "2.1",
      "type": "email-addr",
      "value": "cfa4a551515dc742s@gmail.com"
    },
    {
      "id": "file--04aab333-b837-5078-b795-64308d8500e6",
      "name": "SysInfo_21_04_94.txt",
      "spec_version": "2.1",
      "type": "file"
    },

All objects in Obstracts are represented as STIX 2.1 objects. This makes it easy to sync the intelligence found in each post with downstream security tools that support STIX.
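
Many downstream tools expect STIX objects wrapped in a bundle. As a rough sketch of that hand-off, the Python below wraps a list of objects (such as the `objects` array in the response above) in a STIX 2.1 bundle and counts them by type. `to_bundle` and `count_by_type` are hypothetical helpers of mine, not part of Obstracts.

```python
import uuid
from collections import Counter
from typing import Dict, Iterable, List


def to_bundle(objects: Iterable[Dict]) -> Dict:
    """Wrap STIX objects in a STIX 2.1 bundle with a random bundle id."""
    return {
        "type": "bundle",
        "id": f"bundle--{uuid.uuid4()}",
        "objects": list(objects),
    }


def count_by_type(objects: Iterable[Dict]) -> Counter:
    """Count objects per STIX type, handy for a quick sanity check."""
    return Counter(obj["type"] for obj in objects)


extracted: List[Dict] = [
    {
        "id": "domain-name--fed14a83-377e-58c2-916c-fc61641576b2",
        "spec_version": "2.1",
        "type": "domain-name",
        "value": "www.virusbulletin.com",
    },
    {
        "id": "file--04aab333-b837-5078-b795-64308d8500e6",
        "name": "SysInfo_21_04_94.txt",
        "spec_version": "2.1",
        "type": "file",
    },
]

bundle = to_bundle(extracted)
counts = count_by_type(extracted)
print(bundle["type"], len(bundle["objects"]), dict(counts))
```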

Obstracts x Arango TAXII Server

If your product supports TAXII, Obstracts also works seamlessly with Arango TAXII Server to make this integration even simpler.

Here is an example of the objects extracted from a blog post by Obstracts with ai relationship mode enabled…

You can see how the descriptive relationships in the post (see the description of the report object) have been identified by the AI model and represented in the network graph, e.g. fakedomain.com resolves-to 1.1.1.1.

Step 5: Starting from an IoC or TTP

Blogs are an incredibly useful starting point for research. However, when you’re trying to understand everything about a particular thing (whether that thing is an IP address, a domain name or a MITRE ATT&CK technique) you don’t want to have to go through every blog post to find other reports of it.

The useful thing about STIX SCOs (Observables/IoCs) is that the id of the object is created from the value property, meaning the id of an SCO, let’s say 1.1.1.1, will be the same in all posts. Read my post A Beginners Guide to Creating Threat Intelligence using STIX 2.1 Objects if this logic is unclear.
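
Here is a rough Python sketch of that idea. The STIX 2.1 specification describes deterministic SCO ids as UUIDv5 values, computed under a fixed namespace from the canonical JSON of the object's id-contributing properties. The `sco_id` helper is hypothetical, and `json.dumps` with sorted keys only approximates the canonical JSON form the specification calls for, so I make no claim that it reproduces every id Obstracts generates.

```python
import json
import uuid

# Namespace given in the STIX 2.1 specification for deterministic SCO ids.
STIX_SCO_NAMESPACE = uuid.UUID("00abedb4-aa42-466c-9c01-fed23315a9b7")


def sco_id(sco_type: str, contributing_properties: dict) -> str:
    """Derive a deterministic id from an SCO's id-contributing properties."""
    # Approximation of canonical JSON: sorted keys, no insignificant whitespace.
    canonical = json.dumps(
        contributing_properties,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return f"{sco_type}--{uuid.uuid5(STIX_SCO_NAMESPACE, canonical)}"


first = sco_id("ipv4-addr", {"value": "1.1.1.1"})
second = sco_id("ipv4-addr", {"value": "1.1.1.1"})
other = sco_id("ipv4-addr", {"value": "8.8.8.8"})

# The same value always yields the same id; a different value yields a different id.
print(first == second, first == other)
```

Because the id depends only on the value, two posts that mention the same IP address end up referencing the same object.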

That means you can easily search for it across blogs. Let me show you how…

Let’s start by assuming you have the email cfa4a551515dc742s@gmail.com (found in the example I showed previously). To get the STIX ID of that email you can use the SCO endpoint;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/objects/scos/?value=cfa4a551515dc742s%40gmail.com' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 1,
  "page_results_count": 1,
  "total_results_count": 1,
  "objects": [
    {
      "id": "email-addr--f0771d0e-5475-5da1-b79d-d906a2de73d3",
      "spec_version": "2.1",
      "type": "email-addr",
      "value": "cfa4a551515dc742s@gmail.com"
    }
  ]
}
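
One detail worth calling out in the request above is that the @ in the email address is URL-encoded as %40. If you are building these URLs in code, let a library do the encoding for you. A minimal Python sketch (`sco_search_url` is a hypothetical helper):

```python
from urllib.parse import urlencode


def sco_search_url(base_url: str, value: str) -> str:
    """Build the SCO search URL with the value safely URL-encoded."""
    query = urlencode({"value": value})
    return f"{base_url.rstrip('/')}/api/v1/objects/scos/?{query}"


url = sco_search_url("http://127.0.0.1:8001", "cfa4a551515dc742s@gmail.com")
print(url)
```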

Now you can search for all reports the email address appears in using its ID;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/object/email-addr--f0771d0e-5475-5da1-b79d-d906a2de73d3/reports/' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 1,
  "page_results_count": 3,
  "total_results_count": 3,
  "objects": [
    "report--3cc32201-473f-5667-82b2-7f7dddea21ab",
    "report--44dedfb7-4fd4-49dd-a3e2-0d371478c156",
    "report--00ac4a6d-76c5-4c5f-9089-a7ef8341912f"
  ]
}

And then finally, you can pivot to each report using the Get Object endpoint with the Report ID to get the full details of the blog post it represents;

curl -X 'GET' \
  'http://127.0.0.1:8001/api/v1/object/report--3cc32201-473f-5667-82b2-7f7dddea21ab/' \
  -H 'accept: application/json'

{
  "page_size": 50,
  "page_number": 1,
  "page_results_count": 1,
  "total_results_count": 1,
  "objects": [
    {
      "confidence": 0,
      "created": "2024-11-05T19:54:17.066162Z",
      "created_by_ref": "identity--a1f2e3ed-6241-5f05-ac2e-3394213b8e08",
      "description": "[comment]:<> (===START PAGE 1===)\n\n\n\n\n**Executive Summary**\n---------------------\n\n\n\n Unit 42 researchers discovered two malware samples used by the Sparkling Pisces (aka Kimsuky) threat group.This includes an undocumented keylogger, called...",
      "external_references": [
        {
          "source_name": "txt2stix_report_id",
          "external_id": "3cc32201-473f-5667-82b2-7f7dddea21ab"
        },
        {
          "source_name": "txt2stix Report MD5",
          "description": "4266cad2c25b7210803fdb62f63add24"
        },
        {
          "source_name": "post_link",
          "url": "https://unit42.paloaltonetworks.com/kimsuky-new-keylogger-backdoor-variant/"
        },
        {
          "source_name": "obstracts_feed_id",
          "external_id": "b4e3f13c-0ad6-5abe-be01-2475d341bf84"
        },
        {
          "source_name": "obstracts_profile_id",
          "external_id": "da4dddc2-86bd-52b7-8c09-37fc0f72b679"
        }
      ],
      "id": "report--3cc32201-473f-5667-82b2-7f7dddea21ab",
      "modified": "2024-11-05T20:01:21.747423Z",
      "name": "Unraveling Sparkling Pisces’s Tool Set: KLogEXE and FPSpy",
      "object_marking_refs": [
        "marking-definition--94868c89-83c2-464b-929b-a1a8aa3c8487",
        "marking-definition--f92e15d9-6afc-5ae2-bb3e-85a1fd83a3b5"
      ],
      "object_refs": [
        "indicator--e8ea42ab-52c8-5551-a411-672a50ab7e27",
        "..."
      ],
      "published": "2024-11-05T20:01:21.747351Z",
      "spec_version": "2.1",
      "type": "report"
    }
  ]
}
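
The external_references in the report are what let you get back to the original post. As a small Python sketch, the hypothetical helpers below pull out the post link and feed id, and derive the post id from the report id. That last step relies on the report id sharing its UUID with the post id, which holds in the example above but is my assumption in general.

```python
from typing import Dict, Optional


def external_reference(report: Dict, source_name: str, field: str) -> Optional[str]:
    """Return a field from the first external reference with this source_name."""
    for ref in report.get("external_references", []):
        if ref.get("source_name") == source_name:
            return ref.get(field)
    return None


def post_id_from_report_id(report_id: str) -> str:
    """Strip the 'report--' prefix to recover the UUID shared with the post."""
    prefix = "report--"
    if not report_id.startswith(prefix):
        raise ValueError(f"not a report id: {report_id}")
    return report_id[len(prefix):]


# Trimmed copy of the report object shown above.
report = {
    "id": "report--3cc32201-473f-5667-82b2-7f7dddea21ab",
    "type": "report",
    "external_references": [
        {"source_name": "txt2stix_report_id",
         "external_id": "3cc32201-473f-5667-82b2-7f7dddea21ab"},
        {"source_name": "post_link",
         "url": "https://unit42.paloaltonetworks.com/kimsuky-new-keylogger-backdoor-variant/"},
        {"source_name": "obstracts_feed_id",
         "external_id": "b4e3f13c-0ad6-5abe-be01-2475d341bf84"},
    ],
}

post_link = external_reference(report, "post_link", "url")
feed_id = external_reference(report, "obstracts_feed_id", "external_id")
post_id = post_id_from_report_id(report["id"])
print(post_link, feed_id, post_id)
```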