1. Introduction
2. Overview of AI in Communication Surveillance
    2.1. Background
    2.2. AI Approach
    2.3. How does it work?
3. Benefits of AI in Communication Surveillance
    3.1. Performance Comparison
        3.1.1. Recall
        3.1.2. Precision
        3.1.3. F1 Score
    3.2. Case Study
        3.2.1. Improved Recall
        3.2.2. Reduction in Alert Volumes
4. Risks and Challenges Associated with AI
    4.1. Explainability
    4.2. Transparency
    4.3. Wider Risks of AI usage in Financial Services
5. Outlook and Recommendations
    5.1. Behavox LLM 2.0
    5.2. AI chatbot
6. Conclusion

1. Introduction

Behavox is a market leader in the application of Artificial Intelligence to the monitoring
of text and voice communications. Behavox’s software protects companies and their
employees from bad actors engaged in illegal and malicious activities, including market
abuse and non-financial conduct risk. Behavox provides its software solutions to a number
of entities regulated by the CFTC.

Behavox is grateful to the CFTC for the opportunity to comment on the use of
Artificial Intelligence in CFTC-Regulated Markets. Artificial Intelligence has enormous
potential to enhance the financial services industry. However, we fully acknowledge that
there are associated risks, and we appreciate and respect the CFTC’s commitment to responsible AI

2. Overview of AI in Communication Surveillance

2.1. Background

Traditionally, firms have used random sampling, lexicon-based rules, or a combination of
the two to monitor firm communications. Random sampling selects a random sample of
communications for review, while lexicons combine keywords with proximity indicators,
wildcards and boolean operators to form rules that generate alerts for any communication
that matches them.

Until recently, lexicons and random sampling were the only options available to
Compliance teams for monitoring communications. However, risk detection methods
based on lexicon rules suffer from a number of inherent limitations, including:

  1. Variations in communication are almost infinite and rule-based lexicons cannot
    cater for every linguistic pattern, resulting in the risk of missed alerts (false
    negatives). Consider the following as a non-exhaustive list of variations that need
    to be accounted for:

    • Differences in language used when talking to a client vs. a competitor vs. an
      internal colleague
    • Differences in language used based on closeness or importance of the
      relationship with the other party
    • Differences in language used based on demographics of the
      communicating parties
    • Differences in language used in a group vs. one-on-one
    • Differences in language used between employees with different levels of
      proficiency in the spoken language
    • Differences in language used across channels of communication
    • Differences in the way a single person can express the same thing
    • Differences in slang, etc. used across different geographies
    • Differences in language used when talking to a fellow countryman
    • Differences in language used across different desks or businesses
  2. Rule-based solutions are static, with no structured mechanism to incorporate
    feedback to improve the quality of scenario results over time (i.e. to increase true
    positives and reduce false positives)
  3. Poor performance on voice transcripts, because rigid rules do not allow for
    speech disfluency (false starts, filler words such as “um”, corrections, interruptions,
    etc.) or transcription errors
  4. Poor performance due to the high volume of typos and grammatical errors, which
    cannot be captured in a rigid rule structure

The table below provides some examples of how rigid lexicon rules can fail. In this
example, consider the lexicon rule below which has been enhanced to incorporate
intentional concealment and common typos.
Text(value = “spoof” “sp00f” “spoofing” “sp00fing” “spoofin” “sp00fin” “layer” “l@yer”
“layering” “l@yering”)
Text(value = “ur” “you” “u” “we” “i” “I’m” “algos” “algo” “algorithms” “algorithm” “algorithym”
“algoritym” “program” “programs” “programme” “programmes” “sell” “offer” “offers” “ask”
“buy” “bid” “bids” “market” “mkt”)
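
To make the limitation concrete, the following is a minimal sketch of how such a rule might be evaluated. It assumes, purely for illustration, that the rule fires when a term from the first list appears within a fixed number of tokens of a term from the second list. The function names, the tokenizer and the `max_gap` value are hypothetical and do not describe any vendor's actual rule engine.

```python
import re

# Term lists copied from the rule above (lowercased for matching).
RISK_TERMS = {"spoof", "sp00f", "spoofing", "sp00fing", "spoofin", "sp00fin",
              "layer", "l@yer", "layering", "l@yering"}
CONTEXT_TERMS = {"ur", "you", "u", "we", "i", "i'm", "algos", "algo",
                 "algorithms", "algorithm", "algorithym", "algoritym",
                 "program", "programs", "programme", "programmes", "sell",
                 "offer", "offers", "ask", "buy", "bid", "bids", "market", "mkt"}


def tokenize(text):
    # Lowercase, then keep runs of letters, digits, '@' and apostrophes.
    return re.findall(r"[a-z0-9@']+", text.lower())


def lexicon_alert(text, max_gap=5):
    # ASSUMPTION: the two lists are joined by a proximity condition of
    # `max_gap` tokens; the source does not state the actual operator.
    tokens = tokenize(text)
    risk_pos = [i for i, t in enumerate(tokens) if t in RISK_TERMS]
    ctx_pos = [i for i, t in enumerate(tokens) if t in CONTEXT_TERMS]
    return any(abs(r - c) <= max_gap for r in risk_pos for c in ctx_pos)
```

Under these assumptions, `lexicon_alert("we should spoof the bid")` fires, `lexicon_alert("we should sppof the bid")` does not because the typo "sppof" is not in the list (a false negative), and `lexicon_alert("the cake has a layer of cream, can you buy it")` fires on a benign sentence (a false positive).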

AI technology first advanced beyond the traditional lexicon rule approach by incorporating
machine learning models designed to identify and exclude certain benign types of
communications such as news, disclaimers and spam. The first generation of Behavox
scenarios used lexical patterns or keyword lists, built from examples in enforcement cases
and client feedback, to identify relevant content. Generic filters based on Machine
Learning were then applied in all applications to exclude common noise (e.g. spam, news
and automatically generated content), and the list and quality of such classifications
gradually grew. This noise filter approach has now been widely adopted by some of the
world’s largest financial institutions.

2.2. AI Approach

Over the last 3 years Behavox has invested heavily in leveraging recent advances in
NLP technology to bring to market fully AI-based models to monitor communications. This
AI solution negates the need for lexicons and addresses the inherent limitations of the
rules-based approach.

2.3. How does it work?

AI models operate at a sentence level

Language is made up of letters that are joined together to form words, and that in turn
get added together to make meaningful sentences. Those sentences are sometimes
grouped into paragraphs and ultimately multiple paragraphs form part of a whole

Lexicons look for keywords within proximity of other keywords. AI, on the other hand,
analyzes full sentences and therefore benefits from the context and meaning that a
sentence carries.

Below is a simplified illustration of the process in which AI models are trained and used to
detect problematic communications:

AI models use as their base large language models developed by Google, OpenAI, Mistral
and others, which have been pre-trained on extremely large volumes of data (e.g. Wikipedia
and Google Books). As a result of this base model, AI risk policies already understand
relationships between words and how the words in a sentence affect the context. For
example, the model will know from the words used in the sentence (i.e. the context)
whether the word “bank” refers to a financial institution or a river bank.

The base language model is an expert in language, but not in Compliance, and therefore
it needs to be fine-tuned to our target domain. To do this, we train it on thousands of
examples of each of the target risks (e.g. insider dealing, spoofing, etc.) so that it can
distinguish between BAU communications and compliance risks.

AI models identify correlations in sentences

When the model receives a new sentence to analyze, it considers how similar the
sentence is to what it has been trained to detect. Importantly, it is probabilistic in nature,
meaning that it produces a confidence score for its prediction, unlike lexicon rules, which
are binary and require an exact match. Because it has a vast language model as its base,
it does not need to have been trained on sentences that are identical to, or use the same
words as, the new sentence presented for analysis. It can still identify the semantic
similarity between the new sentence and the training data and generate an alert, even
though it may not have been trained on those exact words. It does this by calculating the
correlation between the new sentence and the sentences in the training data.
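
The scoring step described above can be sketched as follows. This is an illustrative simplification, not a description of the production models: it assumes each sentence has already been converted to a numeric vector by a sentence encoder, and it uses cosine similarity as one common way of measuring how closely two such vectors correlate. The function names and the `threshold` value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def risk_score(new_vec, training_vecs):
    # Highest similarity between the new sentence and any training example.
    return max(cosine_similarity(new_vec, t) for t in training_vecs)


def should_alert(new_vec, training_vecs, threshold=0.8):
    # ASSUMPTION: a fixed threshold on the similarity score; a real model
    # would typically output a calibrated confidence instead.
    return risk_score(new_vec, training_vecs) >= threshold


# Toy 3-dimensional vectors standing in for real sentence embeddings.
training = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
similar_sentence = [0.85, 0.15, 0.05]
unrelated_sentence = [0.0, 0.1, 0.9]
```

The key contrast with a lexicon rule is that the output is a graded score rather than a yes/no match, so a sentence that shares no exact words with the training data can still score highly if its vector points in a similar direction.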

3. Benefits of AI in Communication Surveillance

Artificial intelligence offers distinct advantages over traditional lexicon-based approaches.
Unlike lexicon methods, which rely on predefined lists of keywords to flag
communications, AI models are capable of understanding context at the sentence level.
This additional contextual understanding means that the alerts generated are significantly
more relevant than those produced by the rule-based lexicon approach. Additionally, AI can
cope with variations in language such as typos, slang, shorthand and acronyms, which often
render lexicon systems ineffective. As a result, AI provides a more dynamic, efficient, and
effective solution for monitoring communications.

Artificial Intelligence for communications surveillance has a number of key benefits over
lexicon alternatives.

  • Improved performance – Behavox has done multiple Outcomes Analysis
    comparisons between AI models and lexicon approaches, and AI outperforms
    lexicons on Recall, Precision and F1 score. These metrics are defined in Section 3.1.
  • Better quality alerts – because the AI models operate at the sentence level the
    alerts are relevant to the risk being targeted.
  • Reduced alert volumes
  • Improved efficiency – the cost of operating is reduced as firms no longer need to
    employ large numbers of surveillance analysts to close out vast numbers of false
    positive alerts.
  • The reduction in alert volumes means that surveillance teams can be deployed
    more effectively in investigating in depth the higher quality alerts that are

3.1. Performance Comparison

Behavox has performed multiple side-by-side outcomes analysis tests that have
incontrovertibly demonstrated the superior performance of AI over the incumbent lexicon
solutions. The key metrics used to evaluate classification models, which must distinguish
between positive and negative categories, are Recall, Precision and F1 Score.

3.1.1. Recall

Recall is the proportion of actual positives that are correctly identified by the model. It
answers the question, “Of all the true positives in the data, how many did the model
successfully identify?” It is particularly important when the cost of missing a positive
instance is high.

Recall = True Positives / (True Positives + False Negatives)

3.1.2. Precision

Precision is the proportion of positive identifications that were actually correct. It answers
the question, “Of all the positives identified by the model, how many were actually
positive?” This metric is crucial when the cost of a false positive is high.

Precision = True Positives / (True Positives + False Positives)

3.1.3. F1 Score

The F1 score is the harmonic mean of precision and recall. It is used to measure a test’s
accuracy, and it balances the trade-off between precision and recall. The F1 score is
particularly useful when you want to compare two models that have different precision
and recall values.

F1 = 2 * (Precision * Recall) / (Precision + Recall)
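
As a worked illustration of the three formulas above, the short sketch below computes them from raw counts. The counts used in the example are invented for illustration only and are not taken from any Behavox test.

```python
def recall(tp, fn):
    # Recall = TP / (TP + FN); defined as 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    # Precision = TP / (TP + FP); defined as 0.0 when nothing was flagged.
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 60 true positives, 40 false positives, 20 false negatives.
p = precision(tp=60, fp=40)   # 60 / 100
r = recall(tp=60, fn=20)      # 60 / 80
f1 = f1_score(p, r)
```

With these hypothetical counts, precision is 0.60, recall is 0.75 and the F1 score is approximately 0.67, which shows how F1 sits between the two values and is pulled towards the lower one.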

In all cases and on all metrics Behavox’s AI models significantly outperform lexicon

3.2. Case Study

This section describes two recent tests conducted with a client to compare a legacy
lexicon solution with Behavox’s AI: one measuring recall, and one measuring the volume
of alerts generated in a live client environment.

The client provided Behavox with 690 true positive sentences representing 6 different
risks. These were sentences that were considered “True Positives” i.e. sentences that you
would want to flag for further investigation if they were to appear in a monitored
employee’s communications. This set of sentences included the following variations:

  1. Typos
  2. Language variations
  3. Slang

3.2.1. Improved Recall

The dataset was then run against a legacy lexicon solution and the AI models, and the
table below shows the results of that test.

3.2.2. Reduction in Alert Volumes

Behavox also conducted an alert volume comparison test between the AI model
approach and the legacy lexicon approach on the client’s live communications over a 4
week period. The results of the volume assessment are found below:


4. Risks and Challenges Associated with AI

4.1. Explainability

Artificial intelligence models, particularly those based on deep learning, often struggle
with issues of explainability. This opacity stems from the complex and multilayered nature
of these models, which can make it difficult for the layman to understand how decisions
are derived. Unlike more transparent, rule-based systems where decision pathways are
clear, the processes within deep learning algorithms involve numerous nonlinear
computations that are not readily interpretable to humans.

This perceived “black box” nature of AI can pose significant challenges, especially in the
financial services sector where understanding the rationale behind decisions is crucial.
The perceived lack of explainability can hinder trust in AI. Addressing these explainability
issues is vital for wider adoption and responsible implementation of AI technologies.

Behavox Approach

In Behavox noise reduction models (i.e. models that are utilized to identify and remove
from the analysis communications that are known to be benign such as spam, news,
disclaimers, signatures, etc.), the model design selected is generally one that is inherently
transparent and for which the features are easily interpretable by humans. It is acceptable
for a noise reduction model’s performance to be slightly lower, as failure to identify (for
example) a spam message may simply result in a false positive alert; it is highly
unlikely to result in a failure to identify misconduct.

On the other hand, for the AI risk detection models, which are used to identify
misconduct, performance of the model is the most critical consideration and design
decisions have been made to optimize for performance. As a result, more complex
methodologies have been incorporated into the design, notably the use of
transformer-based encoders (sBERT/RoBERTa) which feature a deep learning neural
network. Deep learning neural networks are not inherently transparent: the trade-off
for improved performance is lower explainability.

Academic and industry research has not yet converged on a consensus for an effective
method of explainability for these types of models. Behavox has experimented with
various methods including LIME and Anchors (as described in this paper); however, these
present an over-simplification of the model’s functioning. Because these models are
heavily influenced by relationships between words, explainability techniques that
highlight single “important” words may mislead the user and lead to incorrect
conclusions being drawn. This risk is especially prevalent in the Compliance domain,
where issues are rarely black and white and where the order of words can have a large
impact on the riskiness of a communication. The two examples below provide an
illustration of this point.
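
The word-order point can also be shown with a small hypothetical sketch. The two sentences below are invented for illustration; they are not taken from any client data. They contain exactly the same words, so any explanation that only highlights individual words treats them identically, even though the order changes the meaning.

```python
from collections import Counter


def bag_of_words(sentence):
    # Word counts only; all ordering information is discarded.
    return Counter(sentence.lower().split())


# Hypothetical sentences: same words, different order, different risk.
risky = "we trade before the announcement is public"
benign = "the announcement is public before we trade"

same_words = bag_of_words(risky) == bag_of_words(benign)
```

Here `same_words` is `True`: a word-level view cannot distinguish trading ahead of an announcement from trading after it, which is why single-word highlighting can lead a reviewer to the wrong conclusion.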

There have also been attempts to provide explainability for transformer-based models like
BERT based on the analysis of attention heads (for example, in this paper). However, as
shown in this paper, attention modules do not provide statistically valid, meaningful
explanations and should not be treated as though they do.

As a result of all these experiments and observations, it is not entirely clear at this
stage of research how such tools would be used meaningfully in a production environment.
The Behavox R&D team will continue to assess new methods of explainability as they emerge.

As previously stated, many financial organizations already utilize AI models such
as filter classifiers (e.g. disclaimer detector, spam detector, news detector, etc.) designed to
reduce noise. As such, Behavox has already engaged with many model risk teams within
our client base and has successfully passed model risk validation.

That said, AI models are a significant change in approach and rely more heavily on
sophisticated AI techniques. As such, where appropriate, model risk teams should
be engaged at the earliest opportunity to enable an independent review to be performed.
While building out the AI models, Behavox has also redesigned its overall processes and
controls and has aligned these wherever possible to the guidance captured under
Fed/OCC SR 11-7, and similar guidance such as the draft guidance from the BoE in CP6/22.

A key element of SR 11-7 (and similar guidance) is ensuring that appropriate
documentation is made available to enable an understanding and independent
assessment of the model. To this end, Behavox has produced the documentation in the
table below (available to all customers and regulators) which describes the models and
their overall process and control environment.

Behavox continues to monitor regulatory developments related to responsible AI usage
(e.g. the EU Draft AI Regulation) and will adapt its processes and controls as needed to
align with any additional guidance and requirements published.

4.2. Transparency

Whilst lexicons are inherently flawed, they do offer clear transparency as they are
rule-driven. It is possible to understand what the rules are looking for, because one can
interpret the rules as a set of keywords/phrases that occur within a certain proximity of

AI models work in an analogous way. They are fed a set of training sentences, and they
look for sentences similar to those, i.e. sentences that correlate closely to the training
data. There is no black box. The replacement for the lexicon rule is therefore the training
data. If one understands the training data, one can understand what type of sentences the
risk policies will detect (i.e. any that are similar to those sentences).

Behavox makes its training (and testing) datasets available for client review (including
client stakeholders such as audit, model risk, regulators and monitors) in secure data
rooms in any major city.

4.3. Wider Risks of AI usage in Financial Services

The proliferation and widespread use of AI in financial services is already happening. For
example, JPMorgan Chase stated in its 2022 letter to shareholders:

“We already have more than 300 AI use cases in production today for risk, prospecting,
marketing, customer experience and fraud prevention, and AI runs throughout our
payments processing and money movement systems across the globe.”

Increasingly, AI will be used for a wide variety of use cases, including interacting with
and advising clients. Klarna recently disclosed that its chatbot handled ⅔ of all responses
to clients in its first month of operation. This raises a number of significant risks:

  • First and foremost among those risks is compliance – LLM chatbots need to
    adhere to financial regulations and laws in the same way that finance professionals
    are expected to.
  • Misinformation and errors / hallucinations
  • Lack of personalization – chatbots may not fully understand complex individual
    client needs or the nuanced context behind their questions, which can result in
    generic or unsuitable financial advice.
  • Ethical and bias concerns – as has been well documented, AI systems can
    perpetuate or amplify biases present in their training data, leading to
    unfair treatment of certain groups or individuals.
  • Currently there is no method of verifying adherence to security rules – no testing or
    evaluating for securities law.

To address these risks, we believe that regulators should issue guidelines stipulating that
Compliance and Model Risk teams are involved in every implementation where AI is
deployed to interact with clients, in order to verify and certify that the AI is aligned to
the relevant regulations and securities laws.

5. Outlook and Recommendations

The future of communications surveillance will be significantly shaped by the capabilities
of, and advances in, AI technology. For example, in the very near future Behavox will be
able to generate alerts based on the context of a whole communication (an email or
Bloomberg chat) rather than just at the sentence level. The additional contextual
understanding will mean that the quality of alerts generated will continue to improve,
helping to detect market abuse, identify bad actors, and maintain the integrity of the
financial markets.

With that in mind, we would like to encourage regulators, financial institutions and
technology providers to help foster innovation while ensuring ethical and responsible AI

Behavox has developed two significant technologies to benefit its clients:

5.1. Behavox LLM 2.0

Behavox’s proprietary LLM sets itself apart from Microsoft and OpenAI products as a
specialized AI model tailored for the financial services domain. Unlike large
general-purpose models from Microsoft, Google, Meta, and OpenAI, Behavox LLM 2.0 is
specifically trained with language, concepts, and knowledge relevant to finance. This
specialization enables it to excel in finance-specific tasks such as:

  • Explanation of alerts and regulations – the model will be able to give a reasoned
    explanation on why it thinks a particular communication is problematic making
    reference to regulations and past enforcement cases. The ability to explain the
    reasoning behind an alert being generated will significantly help to address the
    explainability issue.
  • Explanations of financial concepts and jargon
  • Summarization of technical financial texts and chats
  • Complex financial calculations
  • Customization: add your own documents to customize Behavox LLM 2.0 and
    expand its capabilities to explain compliance policies, operational procedures,
    security policies and many other technical documents.

5.2. AI chatbot

Behavox’s AI Chat Bot is powered by Behavox LLM 2.0, which has been trained to be an
expert in Finance and Regulatory Compliance. The Behavox Chat Bot has a number of
invaluable use cases:

  • Up-skilling and improving productivity of alert reviewers
  • Improve knowledge base and contextual understanding of the QA team
  • Save time for L1 front office supervisors
  • Improve knowledge base of junior front office, back office and compliance teams
  • Increase productivity of middle office

6. Conclusion

The integration of AI into communication surveillance within financial markets presents
significant opportunities to improve firms’ monitoring capabilities, and ultimately improve
the effectiveness of their internal controls to ensure compliance with regulatory
requirements. In turn AI will help regulators to maintain fair, transparent, and efficient
markets. This paper has explored the substantial improvements AI can deliver in terms of
surveillance accuracy and efficiency, demonstrated through enhanced recall rates and
increased precision. The highlighted case study demonstrated the real-world benefits of
reduced alert volumes and improved detection capabilities, thereby reducing the
operational burden on financial institutions and their compliance teams.

However, alongside these benefits, this paper has also acknowledged the inherent risks
and challenges associated with AI deployment in surveillance, particularly issues related
to explainability and transparency. Addressing these concerns is crucial for maintaining
trust in AI systems.

Looking forward, the paper suggests continued investment in technologies such as the
Behavox LLM 2.0 and further development of AI-driven tools like AI chatbots, which can
significantly improve the quality of communications monitoring. It is recommended that
regulators and financial institutions work collaboratively to establish frameworks and
guidelines that enhance the accountability and ethical use of AI in surveillance. By doing
so, the financial sector can harness the full potential of AI to foster an environment of
compliance and integrity, ultimately contributing to more stable and reliable financial

Behavox is grateful to the CFTC for this opportunity to engage on this topic and would
welcome the opportunity to maintain an ongoing dialogue with the CFTC and other
stakeholders to continue to refine and improve the use of AI in the financial services