If you are reading this blog post via a 3rd party source it is very likely that many parts of it will not render correctly (usually, the interactive graphs). Please view the post on dogesec.com for the full interactive viewing experience.

tl;dr

txt2stix is a Python script designed to identify and extract IoCs and TTPs from text files, identify the relationships between them, convert them to STIX 2.1 objects, and output them as a STIX 2.1 bundle.

In this post I take a look at some of the newer features you might have missed and will find very useful.

AI Extractions

txt2stix first launched with pattern extractions (using regexes) and lookup extractions (extracting strings from text found in lookup tables). These extraction types still exist; however, they have some issues:

  1. they are very specific: they extract literal strings, as opposed to descriptive text referring to an extraction (e.g. text describing an ATT&CK technique without naming it)
  2. some reasoning is impossible: for example, deciding whether an extraction is a TLD or a file extension (.zip being the most annoying example of this)
  3. they can become outdated quickly: for example, the list of TLDs used for extraction
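The .zip problem is easy to demonstrate: a pure regex cannot tell whether a matched string is a file name with a .zip extension or a domain on Google's .zip TLD. A minimal sketch (the patterns here are illustrative only, not the regexes txt2stix actually ships with):

```python
import re

# Illustrative patterns only -- not txt2stix's actual extraction regexes.
DOMAIN_RE = re.compile(r"\b[a-z0-9-]+\.(?:com|net|zip)\b")
FILE_RE = re.compile(r"\b[a-z0-9-]+\.(?:zip|pdf|doc)\b")

text = "retrieve update.zip then visit files.zip today"

# Both patterns match both strings: the regex alone cannot decide
# whether "update.zip" is a file name or a domain on the .zip TLD.
print(DOMAIN_RE.findall(text))  # ['update.zip', 'files.zip']
print(FILE_RE.findall(text))    # ['update.zip', 'files.zip']
```

An LLM, by contrast, can use the surrounding sentence ("retrieve", "visit") to make the call.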

Strong AI reasoning should make these types of issues easily avoidable.

That said, when I first tried integrating public LLMs with txt2stix about a year-and-a-half ago I ran into lots of problems: hallucinations, lack of structure in response, and clearly erroneous results.

However, in the last year AI models have become much smarter when it comes to performing such logical reasoning tasks.

txt2stix ships with a number of AI based extractions already built in.

Here is an example of how an AI extraction configuration for MITRE ATT&CK Enterprise is structured:

ai_mitre_attack_enterprise:
  type: ai
  name: 'MITRE ATT&CK Enterprise'
  description: ''
  notes: 'lookup_mitre_attack_enterprise_id and lookup_mitre_attack_enterprise_name legacy extractions also exists if you cannot use AI'
  created: 2020-01-01
  modified: 2020-01-01
  created_by: DOGESEC
  version: 1.0.0
  prompt_base: 'Extract all references to MITRE ATT&CK Enterprise tactics, techniques, groups, data sources, mitigations, software, and campaigns described in the text. These references may not be explicit in the text so you should be careful to account for the natural language of the text your analysis. Do not include MITRE ATT&CK ICS or MITRE ATT&CK Mobile in the results.'
  prompt_helper: 'If you are unsure, you can learn more about MITRE ATT&CK Enterprise here: https://attack.mitre.org/matrices/enterprise/'
  prompt_conversion: 'You should respond with only the ATT&CK ID.'
  test_cases: ai_mitre_attack_enterprise
  stix_mapping: ctibutler-mitre-attack-enterprise-id

You can easily add your own AI extractions by adding a new entry in the AI extractions config.yaml.

The key parts are:

  • prompt_base (required): defines the LLM prompt to extract the data.
  • prompt_helper (optional): allows you to pass additional contextual information to help the model reason correctly.
  • prompt_conversion (optional): a final prompt that can be used to convert the output to a desired format.
  • stix_mapping (required): defines how the data should be modeled in the output. Available options are listed here.
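To illustrate, a hypothetical custom entry might look like the following. Everything here is made up for illustration — the key names mirror the ai_mitre_attack_enterprise example above, and the stix_mapping placeholder must be replaced with one of the documented options:

```yaml
# Hypothetical custom extraction -- illustrative only.
ai_suspicious_powershell:
  type: ai
  name: 'Suspicious PowerShell Commands'
  description: ''
  created: 2024-01-01
  modified: 2024-01-01
  created_by: EXAMPLE
  version: 1.0.0
  prompt_base: 'Extract all PowerShell commands in the text that could be used for malicious purposes.'
  prompt_helper: 'Commands may appear inline or in code blocks, and may be obfuscated.'
  prompt_conversion: 'You should respond with only the command string.'
  stix_mapping: REPLACE-WITH-A-DOCUMENTED-OPTION  # placeholder
```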

Running the extractions: ai_mitre_attack_enterprise, ai_ipv4_address_only, ai_url, ai_file_name and ai_email_address on the text attack_flow_demo.txt:

Victims receive spear phishing emails with from [email protected] malicious zip files attached named badfile.zip

Due to password protection, the zip files are able to bypass some AV detections.

The zip files are extracted and usually contain a malicious document, such as a .doc, .pdf, or .xls. Some examples are malware.pdf and bad.com

The extracted files contain malicious macros that connect to a C2 server 1.1.1.1
python3 txt2stix.py \
    --relationship_mode ai \
    --ai_settings_relationships openai:gpt-4o \
    --input_file tests/data/manually_generated_reports/attack_flow_demo.txt \
    --name 'dogesec blog att&ck extractions' \
    --tlp_level clear \
    --confidence 100 \
    --use_extractions ai_mitre_attack_enterprise,ai_ipv4_address_only,ai_url,ai_file_name,ai_email_address \
    --ai_settings_extractions openai:gpt-4o \
    --report_id e376e33c-e427-4d7c-afc4-7204e556e7a3
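Once the run completes, the resulting STIX 2.1 bundle is plain JSON, so it is easy to inspect with the standard library. A sketch of tallying objects by type (the bundle here is a trimmed-down inline stand-in — point `bundle` at your real output file, whose path depends on your txt2stix configuration):

```python
import json
from collections import Counter

# Trimmed-down stand-in for a txt2stix output bundle (inline for
# illustration -- load your real bundle file here instead).
bundle = json.loads("""
{
  "type": "bundle",
  "id": "bundle--e376e33c-e427-4d7c-afc4-7204e556e7a3",
  "objects": [
    {"type": "report", "name": "dogesec blog att&ck extractions"},
    {"type": "indicator", "name": "ipv4: 1.1.1.1"},
    {"type": "relationship", "relationship_type": "related-to"}
  ]
}
""")

# Tally the extracted objects by STIX type.
counts = Counter(obj["type"] for obj in bundle["objects"])
print(counts)
```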

Extraction Boundaries

Take the following string:

https://subdomain.google.com/file.txt

It contains a:

  • URL
  • Subdomain
  • File

txt2stix also allows for more granular classification of this string. For example, using the extractions pattern_url, pattern_url_file, and pattern_url_path could potentially create three separate extractions from the example string. This is especially problematic when extractions are not explicitly selected and wildcards are used (e.g. 'pattern_*').

This used to be the default txt2stix behaviour. In the latest release we’ve changed this.

ignore_extraction_boundary now controls this logic. Setting it to false (the default) ensures only one extraction will exist for a single string.

Take the input test_extraction_boundary.txt:

https://subdomain.google.com/file.txt
python3 txt2stix.py \
  --relationship_mode standard \
  --input_file tests/data/manually_generated_reports/test_extraction_boundary.txt \
  --name 'extraction boundary tests false' \
  --tlp_level clear \
  --confidence 100 \
  --ignore_extraction_boundary false \
  --use_extractions 'pattern_*' \
  --report_id 29516286-d079-46ee-aa13-e1332ba68ed8

Whereas setting ignore_extraction_boundary to true will allow for multiple extractions from the same string.

python3 txt2stix.py \
  --relationship_mode standard \
  --input_file tests/data/manually_generated_reports/test_extraction_boundary.txt \
  --name 'extraction boundary tests true' \
  --tlp_level clear \
  --confidence 100 \
  --ignore_extraction_boundary true \
  --use_extractions 'pattern_*' \
  --report_id 93aa7345-2ca2-4244-8e25-ac86192c1cf9

Ultimately you can choose either of the two options in combination with setting extractions to get the result you want.
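Conceptually, the boundary logic is overlap resolution between candidate matches. A rough sketch of the idea (this is an illustration of the behaviour, not txt2stix's actual implementation):

```python
# Illustrative overlap resolution -- not txt2stix's actual code.
# Each candidate is (start, end, extractor, value) over the input
# string "https://subdomain.google.com/file.txt".
candidates = [
    (0, 37, "pattern_url", "https://subdomain.google.com/file.txt"),
    (8, 28, "pattern_domain", "subdomain.google.com"),
    (29, 37, "pattern_url_file", "file.txt"),
]

def resolve(candidates, ignore_extraction_boundary=False):
    if ignore_extraction_boundary:
        return candidates  # keep every overlapping extraction
    # Otherwise keep only the longest match for any overlapping span.
    kept = []
    for cand in sorted(candidates, key=lambda c: c[1] - c[0], reverse=True):
        if not any(cand[0] < k[1] and k[0] < cand[1] for k in kept):
            kept.append(cand)
    return kept

print([c[2] for c in resolve(candidates)])
# ['pattern_url']
print(len(resolve(candidates, ignore_extraction_boundary=True)))
# 3
```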

Report Analysis

txt2stix is perfect for batch processing of cyber threat intelligence reports.

Reports processed in batches usually have no logical classification beyond the directory they are stored in. Adding additional metadata to a report, for example tags, makes it easier to filter them later on.

It is also very likely when batch processing reports that many will be irrelevant to cyber threat intelligence use-cases. A really good example of this is product marketing provided by security vendors.

If using AI extractions or relationship modes, it makes sense to ensure the tokens consumed (most of which are spent analysing the report) are used as effectively as possible.

Now you can pass ai_content_check_provider with an AI model to have the AI return a JSON object with three properties:

  • describes_incident (boolean): indicates whether the report describes some sort of threat intel
  • incident_classification (list): one or more classifications. At the time of writing the following incident classifications are supported:
    • other (the report does not fit into any of the following categories)
    • apt_group
    • vulnerability
    • data_leak
    • malware
    • ransomware
    • infostealer
    • threat_actor
    • campaign
    • exploit
    • cyber_crime
    • indicator_of_compromise
    • ttp
  • explanation (string): the AI's reasoning for how it came to these decisions

Using the same document as the first example in this post attack_flow_demo.txt with ai_content_check_provider included in the request:

python3 txt2stix.py \
    --relationship_mode ai \
    --ai_settings_relationships openai:gpt-4o \
    --input_file tests/data/manually_generated_reports/attack_flow_demo.txt \
    --name 'dogesec blog content check example' \
    --tlp_level clear \
    --confidence 100 \
    --use_extractions ai_mitre_attack_enterprise,ai_ipv4_address_only,ai_url,ai_file_name,ai_email_address \
    --ai_settings_extractions openai:gpt-4o \
    --ai_content_check_provider openai:gpt-4o \
    --report_id d20da66a-154d-4803-b897-6243fca7c135

Produces the following response from the AI (found in the data--d20da66a-154d-4803-b897-6243fca7c135.json file in the txt2stix output directory):

{
  "describes_incident": true,
  "explanation": "The document describes a cyber security incident involving spear phishing emails with malicious zip file attachments. These zip files contain documents with malicious macros that connect to a command and control server.",
  "incident_classification": [
    "malware",
    "indicator_of_compromise"
  ]
}

You might be wondering why you can pass three different models for content checking, extractions, and relationship reasoning. We did this to allow flexibility, because we’ve noticed different models perform better at different tasks (e.g. Gemini is particularly good at identifying relationships). However, if cost is an issue, selecting the same model across all three options reduces cost and the output is generally comparable.
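In a batch pipeline, the content check output makes it easy to discard irrelevant documents before spending any more tokens on them. A sketch, assuming the data--<report_id>.json layout shown above (the output directory name is illustrative — check where your txt2stix run writes its files):

```python
import json
from pathlib import Path

def is_relevant(data_file: Path) -> bool:
    """Return True if the AI content check flagged the report as
    describing an incident (JSON layout as shown in the example above)."""
    check = json.loads(data_file.read_text())
    return bool(check.get("describes_incident"))

# Usage sketch: filter a directory of txt2stix outputs
# ("output" is an assumed directory name).
# relevant = [p for p in Path("output").glob("data--*.json") if is_relevant(p)]
```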


Obstracts

The RSS reader for threat intelligence teams. Turn any blog into machine readable STIX 2.1 data ready for use with your security stack.


Stixify

Your automated threat intelligence analyst. Extract machine readable STIX 2.1 data ready for use with your security stack.


Discuss this post

Head on over to the dogesec community to discuss this post.


Posted by:

David Greenwood, Do Only Good Everyday


