OpenEFA Learning System

Privacy-Preserving Adaptive AI for Intelligent Spam Detection

Overview: The learning system is an adaptive, privacy-preserving AI component that learns from legitimate emails to improve spam detection accuracy. It builds a profile of normal communication patterns and uses this to refine spam score adjustments over time.

1. Core Concept

The system learns from emails with a low spam score (spam_score < 2.5), which it treats as legitimate. It uses this knowledge to:

  • Reduce spam scores for emails matching known legitimate patterns
  • Increase scores slightly for unfamiliar patterns when confidence is high
  • Build confidence progressively as more legitimate data is seen

2. What the System Learns

A. Vocabulary Patterns

  • Hashed words (SHA-256 with a private, environment-based salt; plain text is never stored)
  • Only learns words that are:
    • Longer than 3 characters
    • Alphanumeric only
    • Not numbers or email addresses
  • Tracks frequency and last_seen timestamp

Example: If your clients often use "deductible", "premium", or "coverage", these are learned as legitimate terms.
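
The word filters above can be sketched as follows. This is a minimal illustration, not the engine's actual tokenizer: the function name eligible_words and the whitespace-based splitting are assumptions made for the example.

```python
import string


def eligible_words(text):
    """Return lowercased words that pass the filters listed above.

    Hypothetical helper: the real tokenizer is not described in this document.
    """
    words = []
    for raw in text.split():
        # Trim surrounding punctuation such as trailing commas or colons.
        token = raw.strip(string.punctuation).lower()
        if len(token) <= 3:        # must be longer than 3 characters
            continue
        if not token.isalnum():    # rejects email addresses and hyphenated tokens
            continue
        if token.isdigit():        # pure numbers are not learned
            continue
        words.append(token)
    return words


sample = "Your premium and deductible for 2024: see client@example.com, co-pay info."
print(eligible_words(sample))
```

Each surviving word would then be hashed before storage, so the plain-text list exists only in memory during analysis.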

B. Domain Relationships

Tracks sender domain ↔ recipient domain communication frequency

Stores:

  • Message counts
  • Average spam score of communications
  • Last interaction timestamp

Example: Frequent low-spam exchanges between insurance.example and your domain improve relationship confidence.
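
As a rough sketch of how the three stored values could be maintained without keeping individual messages, the example below updates a running average in place. The DomainRelationship class, the record_exchange function, and the in-memory dictionary are hypothetical stand-ins for the persistent tables.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DomainRelationship:
    """Aggregate data for one sender-domain/recipient-domain pair (hypothetical)."""
    message_count: int = 0
    avg_spam_score: float = 0.0
    last_seen: Optional[datetime] = None


def record_exchange(rel, spam_score, when=None):
    """Fold one message into the aggregates using an incremental mean."""
    rel.message_count += 1
    # new_avg = old_avg + (x - old_avg) / n, so no per-message history is needed
    rel.avg_spam_score += (spam_score - rel.avg_spam_score) / rel.message_count
    rel.last_seen = when or datetime.now(timezone.utc)
    return rel


# Relationships keyed by (sender_domain, recipient_domain).
relationships = {}
key = ("insurance.example", "yourcompany.example")
rel = relationships.setdefault(key, DomainRelationship())
record_exchange(rel, 1.0)
record_exchange(rel, 2.0)
```

The same incremental-mean update would apply to the per-domain averages described under Domain Statistics.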

C. Professional Phrases

Recognizes predefined business phrases such as:

  • "per our discussion"
  • "please find attached"
  • "following up"
  • "invoice", "payment", "contract"
  • "best regards", "sincerely"

Tracks frequency and average spam score per phrase.
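
A minimal sketch of phrase detection, assuming case-insensitive matching on whole words. The phrase list is taken from the examples above; the find_phrases name and the word-boundary matching rule are assumptions.

```python
import re

# Phrases listed above; the engine's full predefined list may differ.
PROFESSIONAL_PHRASES = (
    "per our discussion",
    "please find attached",
    "following up",
    "invoice",
    "payment",
    "contract",
    "best regards",
    "sincerely",
)

# Word boundaries keep "contract" from matching inside "contractor".
_PATTERNS = {p: re.compile(r"\b" + re.escape(p) + r"\b") for p in PROFESSIONAL_PHRASES}


def find_phrases(text):
    """Return the known phrases that occur in text, in list order."""
    lowered = text.lower()
    return [p for p, pattern in _PATTERNS.items() if pattern.search(lowered)]


print(find_phrases("Per our discussion, please find attached the updated policy."))
```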

D. Conversation Style Indicators

  • Greeting detected in the first 100 characters (e.g., "hi", "dear")
  • Signature or closing detected in the last 200 characters (e.g., "thanks", "regards")
  • Normalized message length range (100–10,000 characters)
  • Structural markers like sentence and question patterns
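
The indicators above could be extracted along these lines. This is a sketch under stated assumptions: the greeting and closing word lists contain only the examples given, the length check is read as a simple range test, and counting question marks stands in for the unspecified structural markers.

```python
import re

GREETINGS = ("hi", "dear")          # examples from the list above
CLOSINGS = ("thanks", "regards")    # examples from the list above


def style_indicators(text):
    """Return simple style flags for a message body (hypothetical helper)."""
    head = text[:100].lower()    # greeting is looked for in the first 100 characters
    tail = text[-200:].lower()   # closing is looked for in the last 200 characters
    return {
        "has_greeting": any(re.search(r"\b" + g + r"\b", head) for g in GREETINGS),
        "has_closing": any(re.search(r"\b" + c + r"\b", tail) for c in CLOSINGS),
        "length_in_range": 100 <= len(text) <= 10_000,
        "question_count": text.count("?"),
    }
```

Only these derived flags and counts would need to be kept; the body text itself can be discarded once they are computed.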

E. Domain Statistics

Maintains per-domain communication metrics:

  • Total messages
  • Average message length
  • Average spam score

3. Learning Process Overview

  1. Email arrives → spam score is assigned.
  2. Eligibility check → only low-score messages (spam_score < 2.5) are used for learning.
  3. Feature extraction → vocabulary hashes, domain relationships, phrase matches, and message structure.
  4. Data update → frequency and averages updated in persistent storage.
  5. Learning progress tracking → daily metrics logged.
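
The steps above can be sketched as a single function. The in-memory store below is a hypothetical stand-in for the persistent storage, and feature extraction is assumed to have already produced a list of hashed features.

```python
from collections import Counter
from datetime import date

LEARNING_THRESHOLD = 2.5  # eligibility threshold from the Core Concept section


class InMemoryStore:
    """Hypothetical stand-in for the persistent tables."""

    def __init__(self):
        self.feature_counts = Counter()   # frequency per hashed feature
        self.daily_learned = Counter()    # messages learned per day


def process_email(store, features, spam_score, today=None):
    """Learn from one scored message; return True if it was eligible."""
    if spam_score >= LEARNING_THRESHOLD:
        return False                      # step 2: not eligible, nothing is stored
    store.feature_counts.update(features) # step 4: update frequencies
    day = (today or date.today()).isoformat()
    store.daily_learned[day] += 1         # step 5: daily progress metric
    return True
```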

4. Legitimacy Scoring

The system computes a legitimacy score (0–1) based on several weighted components:

  • Vocabulary similarity
  • Domain relationship strength
  • Professional phrase occurrence
  • Conversation style conformity

These factors are combined to adjust the spam score. Weights and formulas are configurable and can be tuned for specific environments.
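
One plausible way to combine the four components is a normalized weighted sum, sketched below. The weight values are purely illustrative assumptions, not the engine's defaults, and each component is assumed to be pre-scaled to the range 0 to 1.

```python
# Hypothetical weights for illustration; the real values are configurable.
WEIGHTS = {
    "vocabulary": 0.4,
    "relationship": 0.3,
    "phrases": 0.2,
    "style": 0.1,
}


def legitimacy_score(components, weights=WEIGHTS):
    """Combine per-component scores (each 0 to 1) into one score in 0 to 1."""
    total_weight = sum(weights.values())
    weighted = 0.0
    for name, weight in weights.items():
        # Missing components count as 0; out-of-range values are clamped.
        value = min(max(components.get(name, 0.0), 0.0), 1.0)
        weighted += weight * value
    return weighted / total_weight
```

Dividing by the total weight keeps the result within 0 to 1 even if the configured weights do not sum to 1.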

5. Configuration Overview

Configuration settings are stored in a database table or configuration file. Example keys:

Config Key                          Default   Description
max_adjustment                      2.0       Maximum spam score adjustment (positive or negative)
learning_enabled                    true      Enables/disables the learning module
min_messages_for_learning           10        Minimum number of messages before applying adjustments
vocab_learning_threshold            3         Minimum word frequency before inclusion
relationship_confidence_threshold   5         Number of exchanges needed for high confidence
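
As an illustration of how these keys might be read, the sketch below merges string values (as they would typically arrive from a database row or a .env file) over the documented defaults. The load_config function and its parsing rules are assumptions, not the project's actual loader.

```python
from typing import Mapping

# Defaults as documented in the table above.
DEFAULTS = {
    "max_adjustment": 2.0,
    "learning_enabled": True,
    "min_messages_for_learning": 10,
    "vocab_learning_threshold": 3,
    "relationship_confidence_threshold": 5,
}


def load_config(overrides: Mapping[str, str]) -> dict:
    """Return the defaults with string overrides converted to the right type."""
    config = dict(DEFAULTS)
    for key, raw in overrides.items():
        if key not in DEFAULTS:
            continue  # ignore unknown keys rather than guessing their type
        default = DEFAULTS[key]
        if isinstance(default, bool):
            config[key] = raw.strip().lower() in ("1", "true", "yes", "on")
        else:
            config[key] = type(default)(raw)  # int or float, matching the default
    return config


config = load_config({"max_adjustment": "1.5", "learning_enabled": "false"})
```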

6. Privacy Protection

Privacy-First Principles

The system follows strict privacy protection guidelines:

1. Hashed Vocabulary

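Each word is lowercased and hashed using SHA-256 with an environment-defined private salt; only the first 16 hex characters of the digest are kept:

import hashlib
word_hash = hashlib.sha256(f"{env_salt}{word.lower()}".encode("utf-8")).hexdigest()[:16]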

The actual words are never stored.

2. Domain-Level Tracking Only

Only domain names are stored, not full email addresses.
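
A small sketch of reducing an address to its domain before anything is stored, using the standard library's email.utils.parseaddr. The domain_of helper name is hypothetical.

```python
from email.utils import parseaddr


def domain_of(address):
    """Return the lowercased domain of an email address, or None if absent."""
    _display_name, addr = parseaddr(address)
    if "@" not in addr:
        return None
    # Keep only the part after the last "@"; the local part is discarded.
    domain = addr.rsplit("@", 1)[1].lower()
    return domain or None


print(domain_of("Client Name <Client@Insurance.Example>"))
```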

3. No Message Storage

Body text and subjects are discarded after analysis.

4. Aggregate Metrics Only

Frequency counts, averages, and timestamps only.

7. Learning Dashboard

A web interface at /learning provides insights into the system's progress:

  • Vocabulary size (unique patterns)
  • Relationship count (domain pairs)
  • Phrase statistics
  • Learning confidence (0–100%)
  • Top communicating domains
  • Common phrases
  • Daily learning rates

The dashboard only displays data for domains the viewer is authorized to access.

8. Training and Manual Input

The system automatically learns during normal operation. You can also manually provide legitimate emails to improve learning:

python3 scripts/feed_good_emails_to_learning.py \
  --sender "client@example.com" \
  --recipients "team@example.com,support@example.com" \
  --subject "Follow-up on contract" \
  --body "Per our discussion, please find attached the updated policy." \
  --score 0.5

This helps bootstrap or correct model learning for new domains.
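
To feed several known-good messages in one go, the command above could be assembled programmatically. The sketch below only builds the argument list, using the flags shown in the example; the build_feed_command helper is hypothetical and the script's behavior beyond those flags is not assumed.

```python
def build_feed_command(sender, recipients, subject, body, score,
                       script="scripts/feed_good_emails_to_learning.py"):
    """Build an argv list for the manual training script shown above."""
    return [
        "python3", script,
        "--sender", sender,
        "--recipients", ",".join(recipients),  # comma-separated, as in the example
        "--subject", subject,
        "--body", body,
        "--score", str(score),
    ]


cmd = build_feed_command(
    "client@example.com",
    ["team@example.com", "support@example.com"],
    "Follow-up on contract",
    "Per our discussion, please find attached the updated policy.",
    0.5,
)
```

Passing the list to subprocess.run(cmd, check=True) from the project root avoids shell quoting problems with subjects and bodies that contain quotes or spaces.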

9. Integration Points

The learning module can be integrated into spam filtering pipelines via a call such as:

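# Assumed import path, based on the module listed under Key Components.
from modules.conversation_learner_mysql import ConversationLearner

def analyze_with_learning(msg, text_content, spam_score):
    learner = ConversationLearner()
    # Score the message against the patterns learned so far (0 to 1).
    legitimacy = learner.calculate_legitimacy_score(msg, text_content)
    # Only low-score (presumed legitimate) mail feeds the learner.
    if spam_score < 2.5:
        learner.learn_from_email(msg, text_content, spam_score)
    return legitimacy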

The returned legitimacy score is used to adjust the spam score before final spam-handling decisions are made.
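
How a legitimacy score translates into a spam score change is not specified here. The sketch below assumes a simple linear mapping, bounded by max_adjustment and gated by min_messages_for_learning; the mapping and the adjusted_spam_score function are illustrative assumptions only.

```python
def adjusted_spam_score(spam_score, legitimacy, messages_seen,
                        max_adjustment=2.0, min_messages=10):
    """Apply a bounded adjustment to spam_score (hypothetical mapping)."""
    if messages_seen < min_messages:
        return spam_score  # too little data: leave the score untouched
    # Assumed linear mapping: 0.5 is neutral, 1.0 gives the full reduction,
    # 0.0 gives the full increase.
    adjustment = (0.5 - legitimacy) * 2 * max_adjustment
    # Never exceed the configured bound in either direction.
    adjustment = max(-max_adjustment, min(max_adjustment, adjustment))
    return max(0.0, spam_score + adjustment)
```

A real deployment would likely use a smaller upward adjustment than downward, since the system is described as increasing scores only slightly for unfamiliar patterns.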

10. Database Overview

Main logical components:

  • Vocabulary and phrase frequency tables
  • Domain relationship tracking
  • Configuration and learning progress data

Database names and schema details can be found in the developer documentation.

11. Example Scenario

Example: client@insurance.example frequently emails yourcompany.example.

  1. System learns vocabulary: "policy", "coverage", "renewal"
  2. Builds relationship confidence (message_count = 10, avg_spam = 1.2)
  3. Gains phrase familiarity (e.g., "please find attached")
  4. For a new email, the combined familiarity leads to a small spam score reduction, keeping legitimate traffic from being quarantined.

All domains shown here are fictional and for demonstration only.

12. Key Components

  • [project_root]/modules/conversation_learner_mysql.py – MySQL-based learning engine
  • [project_root]/scripts/feed_good_emails_to_learning.py – Manual training script
  • [project_root]/scripts/get_learning_stats.py – Statistics collector
  • [config_dir]/.env – Environment configuration

Summary

The OpenEFA Learning System is a self-adapting, privacy-conscious filter that:

  • Learns from legitimate communications
  • Adjusts spam scores intelligently using contextual patterns
  • Stores only hashed and aggregated data
  • Provides clear insights through a dashboard
  • Encourages community contributions and transparency

As the system sees more legitimate email, its confidence in learned patterns grows, improving the accuracy of spam and ham classification without compromising privacy.


Last Updated: October 27, 2025

System Version: OpenEFA Learning Engine v2.0

Author: OpenEFA Team