OpenEFA Learning System - Documentation

Overview: The learning system is an adaptive, privacy-preserving AI component that learns from legitimate emails to improve spam detection accuracy. It builds a profile of normal communication patterns and uses this to refine spam score adjustments over time.

1. Core Concept
2. What the System Learns
3. Learning Process Overview
4. Legitimacy Scoring
5. Configuration Overview
6. Privacy Protection
7. Learning Dashboard
8. Training and Manual Input
9. Integration Points
10. Database Overview
11. Example Scenario
12. Key Components
Summary

1. Core Concept

The system learns from emails with a low spam score (spam_score < 2.5), which it treats as legitimate. It uses this knowledge to:

Reduce spam scores for emails matching known legitimate patterns
Increase scores slightly for unfamiliar patterns when confidence is high
Build confidence progressively as more legitimate data is seen

2. What the System Learns

A. Vocabulary Patterns

Hashed words (SHA256 with a private, environment-based salt; plain text is never stored)
Only learns words that are:
  Longer than 3 characters
  Alphanumeric only
  Not numbers or email addresses
Tracks frequency and last_seen timestamp

Example: If your clients often use "deductible", "premium", or "coverage", these are learned as legitimate terms.
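
The filtering and hashing rules above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the function names, the punctuation handling, the email regex, and the ENV_SALT placeholder are all assumptions made for the example.

```python
import hashlib
import re
from datetime import datetime, timezone

# Hypothetical salt for illustration; the real system reads it from the environment.
ENV_SALT = "example-salt"

# Rough pattern used only to drop email addresses before tokenizing.
EMAIL_RE = re.compile(r"\S+@\S+")


def learnable_words(text: str) -> list[str]:
    """Return lowercase words that pass the three filters listed above."""
    text = EMAIL_RE.sub(" ", text)
    words = []
    for token in text.split():
        token = token.strip(".,;:!?\"'()[]").lower()
        # Longer than 3 characters, alphanumeric only, and not a plain number.
        if len(token) > 3 and token.isalnum() and not token.isdigit():
            words.append(token)
    return words


def hash_word(word: str, salt: str = ENV_SALT) -> str:
    """Salted SHA256 of the lowercased word, truncated to 16 hex characters."""
    return hashlib.sha256(f"{salt}{word.lower()}".encode("utf-8")).hexdigest()[:16]


def record_word(vocab: dict, word: str, now: datetime) -> None:
    """Update frequency and last_seen for a word, keyed by its hash only."""
    entry = vocab.setdefault(hash_word(word), {"frequency": 0, "last_seen": None})
    entry["frequency"] += 1
    entry["last_seen"] = now
```

Because only the hash is used as the key, the vocabulary store in this sketch never contains the plain word.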

B. Domain Relationships

Tracks sender domain ↔ recipient domain communication frequency

Stores:

Message counts
Average spam score of communications
Last interaction timestamp

Example: Frequent low-spam exchanges between insurance.example and your domain improve relationship confidence.
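
One way to keep the three stored values up to date without retaining any message history is an incremental average. The sketch below assumes an in-memory dictionary keyed by (sender domain, recipient domain); the Relationship class and record_exchange function are hypothetical names, and the real engine persists this data in its database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Relationship:
    message_count: int = 0
    avg_spam_score: float = 0.0
    last_seen: Optional[datetime] = None

    def record(self, spam_score: float, when: Optional[datetime] = None) -> None:
        """Fold one message into the running average."""
        self.message_count += 1
        self.avg_spam_score += (spam_score - self.avg_spam_score) / self.message_count
        self.last_seen = when or datetime.now(timezone.utc)


# In-memory stand-in for the persistent relationship table.
relationships: dict[tuple[str, str], Relationship] = {}


def record_exchange(sender_domain: str, recipient_domain: str, spam_score: float) -> Relationship:
    """Update the relationship for one sender/recipient domain pair."""
    key = (sender_domain.lower(), recipient_domain.lower())
    rel = relationships.setdefault(key, Relationship())
    rel.record(spam_score)
    return rel
```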

C. Professional Phrases

Recognizes predefined business phrases such as:

"per our discussion"
"please find attached"
"following up"

"invoice", "payment", "contract"
"best regards", "sincerely"

Tracks frequency and average spam score per phrase.
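
As a rough illustration, phrase tracking can be done with case-insensitive substring matching and a per-phrase running average. The PHRASES tuple below contains only the phrases listed above; the function names and the (count, average) tuple layout are assumptions for this sketch. Plain substring matching is a simplification: it would also count "contract" inside "contractor".

```python
PHRASES = (
    "per our discussion",
    "please find attached",
    "following up",
    "invoice",
    "payment",
    "contract",
    "best regards",
    "sincerely",
)


def match_phrases(text: str) -> list[str]:
    """Return the predefined phrases that occur in the text, in list order."""
    lowered = text.lower()
    return [phrase for phrase in PHRASES if phrase in lowered]


def record_phrases(stats: dict, text: str, spam_score: float) -> None:
    """Update (count, average spam score) for every phrase found in the text."""
    for phrase in match_phrases(text):
        count, avg = stats.get(phrase, (0, 0.0))
        count += 1
        avg += (spam_score - avg) / count
        stats[phrase] = (count, avg)
```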

D. Conversation Style Indicators

Greeting detected in the first 100 characters (e.g., "hi", "dear")
Signature or closing detected in the last 200 characters (e.g., "thanks", "regards")
Normalized message length range (100–10,000 characters)
Structural markers like sentence and question patterns
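
The indicators above could be extracted along these lines. The word lists contain only the examples given in this section, and the function name, the dictionary keys, and the punctuation-based sentence count are assumptions made for illustration.

```python
import re

GREETINGS = ("hi", "dear")
CLOSINGS = ("thanks", "regards")


def _has_word(words, segment: str) -> bool:
    """True if any of the words appears as a whole word in the segment."""
    return any(re.search(rf"\b{re.escape(w)}\b", segment) for w in words)


def style_indicators(text: str) -> dict:
    """Compute simple structural indicators for one message body."""
    lowered = text.lower()
    return {
        "has_greeting": _has_word(GREETINGS, lowered[:100]),   # first 100 characters
        "has_closing": _has_word(CLOSINGS, lowered[-200:]),    # last 200 characters
        "normal_length": 100 <= len(text) <= 10_000,
        "sentence_count": len(re.findall(r"[.!?]+", text)),
        "question_count": text.count("?"),
    }
```

Whole-word matching keeps "hi" from matching inside words such as "this".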

E. Domain Statistics

Maintains per-domain communication metrics:

Total messages
Average message length
Average spam score

3. Learning Process Overview

Email arrives → spam score is assigned.
Eligibility check → only low-score messages are used for learning.
Feature extraction → vocabulary hashes, domain relationships, phrase matches, and message structure.
Data update → frequency and averages updated in persistent storage.
Learning progress tracking → daily metrics logged.
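
The eligibility check and daily progress tracking can be sketched as below. The 2.5 threshold comes from the Core Concept section; everything else (the function name, the aggregate store layout, the in-memory counter) is a placeholder for illustration, with feature extraction reduced to a message length.

```python
from collections import Counter
from datetime import date
from typing import Optional

LEARNING_THRESHOLD = 2.5  # messages at or above this score are not learned from

# In-memory stand-in for the daily learning metrics.
daily_learned: Counter = Counter()


def learn_if_eligible(text: str, spam_score: float, store: dict,
                      today: Optional[date] = None) -> bool:
    """Return True if the message was used for learning."""
    if spam_score >= LEARNING_THRESHOLD:
        return False
    # Placeholder for feature extraction and the persistent data update:
    # only aggregates are kept, never the message itself.
    store["messages"] += 1
    store["total_length"] += len(text)
    # Log one learned message against the current day.
    daily_learned[(today or date.today()).isoformat()] += 1
    return True
```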

4. Legitimacy Scoring

The system computes a legitimacy score (0–1) based on several weighted components:

Vocabulary similarity
Domain relationship strength
Professional phrase occurrence
Conversation style conformity

These factors are combined to adjust the spam score. Weights and formulas are configurable and can be tuned for specific environments.
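
A weighted combination of this kind might look like the sketch below. The weights and the mapping from legitimacy to a score adjustment are hypothetical values chosen for the example, not the system's actual formula; only the 0–1 range, the four components, and the max_adjustment limit of 2.0 come from this document.

```python
# Hypothetical weights; the real values are configurable.
WEIGHTS = {"vocabulary": 0.4, "relationship": 0.3, "phrases": 0.2, "style": 0.1}


def legitimacy_score(components: dict) -> float:
    """Weighted average of component scores, each clamped to the 0-1 range."""
    total = sum(WEIGHTS.values())
    weighted = sum(
        WEIGHTS[name] * min(max(components.get(name, 0.0), 0.0), 1.0)
        for name in WEIGHTS
    )
    return weighted / total


def adjusted_spam_score(spam_score: float, legitimacy: float,
                        max_adjustment: float = 2.0) -> float:
    """Assumed mapping: legitimacy 1.0 gives -max_adjustment, 0.5 gives no change."""
    adjustment = (0.5 - legitimacy) * 2.0 * max_adjustment
    # Never move the score by more than max_adjustment in either direction.
    adjustment = max(-max_adjustment, min(max_adjustment, adjustment))
    return spam_score + adjustment
```

Under these assumed numbers, a message scored 3.0 with legitimacy 1.0 would end at 1.0, and one with legitimacy 0.5 would stay at 3.0.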

5. Configuration Overview

Configuration settings are stored in a database table or configuration file. Example keys:

Config Key	Default	Description
`max_adjustment`	2.0	Maximum spam score adjustment (positive or negative)
`learning_enabled`	true	Enables/disables the learning module
`min_messages_for_learning`	10	Minimum number of messages before applying adjustments
`vocab_learning_threshold`	3	Minimum word frequency before inclusion
`relationship_confidence_threshold`	5	Number of exchanges needed for high confidence
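
For illustration, the keys and defaults in the table can be modeled as a typed structure that falls back to the defaults when a key is missing. The LearningConfig class and load_config function are hypothetical, and the sketch assumes values arrive as strings, as they typically would from a database row or an environment file.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LearningConfig:
    max_adjustment: float = 2.0
    learning_enabled: bool = True
    min_messages_for_learning: int = 10
    vocab_learning_threshold: int = 3
    relationship_confidence_threshold: int = 5


def load_config(raw: dict[str, str]) -> LearningConfig:
    """Build a config from string key/value pairs, using defaults for missing keys."""
    values = {}
    for field in fields(LearningConfig):
        if field.name not in raw:
            continue
        text = raw[field.name].strip()
        # Convert each value to the type of its default.
        default_type = type(field.default)
        if default_type is bool:
            values[field.name] = text.lower() in ("1", "true", "yes", "on")
        else:
            values[field.name] = default_type(text)
    return LearningConfig(**values)
```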

6. Privacy Protection

Privacy-First Principles

The system follows strict privacy protection guidelines:

1. Hashed Vocabulary

Each word is hashed using SHA256 with an environment-defined private salt:

import hashlib
word_hash = hashlib.sha256(f"{env_salt}{word.lower()}".encode("utf-8")).hexdigest()[:16]

The actual words are never stored.

2. Domain-Level Tracking Only

Only domain names are stored, not full email addresses.

3. No Message Storage

Body text and subjects are discarded after analysis.

4. Aggregate Metrics Only

Frequency counts, averages, and timestamps only.
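
The domain-level tracking principle can be illustrated with the standard library's address parser: the local part of the address is dropped before anything is stored. The domain_only function is a hypothetical helper, not part of the documented API.

```python
from email.utils import parseaddr


def domain_only(address: str) -> str:
    """Return the lowercase domain of an address, or '' if none is present."""
    _, addr = parseaddr(address)
    # Keep only what follows the last "@"; the mailbox name is discarded.
    _, sep, domain = addr.rpartition("@")
    return domain.lower() if sep else ""
```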

7. Learning Dashboard

A web interface at /learning provides insights into the system's progress:

Vocabulary size (unique patterns)
Relationship count (domain pairs)
Phrase statistics
Learning confidence (0–100%)

Top communicating domains
Common phrases
Daily learning rates

The dashboard only displays data for domains the viewer is authorized to access.

8. Training and Manual Input

The system automatically learns during normal operation. You can also manually provide legitimate emails to improve learning:

python3 scripts/feed_good_emails_to_learning.py \
  --sender "client@example.com" \
  --recipients "team@example.com,support@example.com" \
  --subject "Follow-up on contract" \
  --body "Per our discussion, please find attached the updated policy." \
  --score 0.5

This helps bootstrap or correct model learning for new domains.

9. Integration Points

The learning module can be integrated into spam filtering pipelines via a call such as:

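# Assumes ConversationLearner is importable from modules/conversation_learner_mysql.py
from modules.conversation_learner_mysql import ConversationLearner

def analyze_with_learning(msg, text_content, spam_score):
    """Score a message for legitimacy and learn from it if it is low-spam."""
    learner = ConversationLearner()
    legitimacy = learner.calculate_legitimacy_score(msg, text_content)
    # Only messages below the learning threshold feed the model.
    if spam_score < 2.5:
        learner.learn_from_email(msg, text_content, spam_score)
    return legitimacy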

The returned legitimacy score is used to adjust the spam score before final spam-handling decisions are made.

10. Database Overview

Main logical components:

Vocabulary and phrase frequency tables
Domain relationship tracking
Configuration and learning progress data

Database names and schema details can be found in the developer documentation.

11. Example Scenario

Example: client@insurance.example frequently emails yourcompany.example.

System learns vocabulary: "policy", "coverage", "renewal"
Builds relationship confidence (message_count = 10, avg_spam = 1.2)
Gains phrase familiarity (e.g., "please find attached")
For a new email, the combined familiarity leads to a small spam score reduction, keeping legitimate traffic from being quarantined.

All domains shown here are fictional and for demonstration only.

12. Key Components

[project_root]/modules/conversation_learner_mysql.py – MySQL-based learning engine
[project_root]/scripts/feed_good_emails_to_learning.py – Manual training script
[project_root]/scripts/get_learning_stats.py – Statistics collector
[config_dir]/.env – Environment configuration

Summary

The OpenEFA Learning System is a self-adapting, privacy-conscious filter that:

Learns from legitimate communications
Adjusts spam scores intelligently using contextual patterns
Stores only hashed and aggregated data
Provides clear insights through a dashboard
Encourages community contributions and transparency

The more legitimate email the system processes, the more reliable its adjustments become, improving spam and ham classification over time without compromising privacy.

Last Updated: October 27, 2025

System Version: OpenEFA Learning Engine v2.0

Author: OpenEFA Team
