The Challenge
The client had a email_maildir_ingest table with 103K+ email records — messages ingested from multiple IMAP mailboxes over time. The table was being used for both storage and as the backbone of a customer intelligence system: which contacts were emailing us, what were they asking about, and which ones needed follow-up.
The CRM dashboard was loading slowly. Individual contact lookups were taking 2+ seconds. The AI classification pipeline was processing emails but the results weren't being surfaced efficiently. Each problem had a distinct technical root cause.
The Domain Lookup Problem: 2.2 Seconds
The contact discovery feature works by finding email senders from the same domain as known contacts. For example: if a contact at acme.com is in the CRM, find all emails from other acme.com addresses in the archive.
The initial query used LIKE '%@acme.com' — a leading wildcard pattern that prevents any index from being used. MySQL has to read every row and evaluate the pattern on each one:
WHERE email
LIKE '%@acme.com'
-- EXPLAIN:
-- type: ALL
-- rows: 722,431
-- Extra: Using where
WHERE email_domain
= 'acme.com'
-- EXPLAIN:
-- type: ref
-- rows: 3
-- Extra: Using index
Stored Generated Column
The solution is a MySQL stored generated column. Instead of computing the domain from the email address at query time, MySQL computes it once when each row is inserted or updated — and stores the result. Combined with a regular index on the generated column, lookups become instant:
ALTER TABLE email_maildir_ingest
ADD COLUMN email_domain VARCHAR(100)
GENERATED ALWAYS AS (
SUBSTRING_INDEX(`from_address`, '@', -1)
) STORED,
ADD INDEX idx_email_domain (email_domain)
ALGORITHM=INPLACE LOCK=NONE;
-- The same column on contacts table
ALTER TABLE contacts
ADD COLUMN email_domain VARCHAR(100)
GENERATED ALWAYS AS (
SUBSTRING_INDEX(email, '@', -1)
) STORED,
ADD INDEX idx_sender_domain (email_domain)
ALGORITHM=INPLACE LOCK=NONE;
-- Domain lookup: 2.2s → 0.0003s (one index row vs 722K full scan)
STORED vs VIRTUAL: We chose STORED (persisted to disk) over VIRTUAL (computed on read) because this column is used in an index and in JOIN operations. Stored generated columns can be indexed; virtual ones generally cannot. The storage cost is negligible — VARCHAR(100) per row on 103K rows is ~10MB.
Covering Index for Dashboard Stats
The CRM dashboard shows per-employee email statistics: total emails, unread counts, emails by label, emails by urgency. These queries run on the email_labels table, which had 92K rows for a single employee (emp_id=3).
The original query was doing a full table scan — MySQL's EXPLAIN showed Using where with 92,000 rows examined. The fix was a covering index: an index that includes all the columns the query needs, so MySQL never has to touch the actual table rows at all. EXPLAIN then shows Using index — the entire query is answered from the index B-tree:
SELECT label, COUNT(*) as cnt
FROM email_labels
WHERE emp_id = 3
GROUP BY label;
-- Before: full scan, 92K rows examined
-- EXPLAIN: type=ALL, rows=92000, Extra=Using where
-- Covering index: includes all columns the query touches
CREATE INDEX idx_labels_cover
ON email_labels (emp_id, label)
ALGORITHM=INPLACE LOCK=NONE;
-- After: index-only scan
-- EXPLAIN: type=ref, rows=1, Extra=Using index
AFTER: type=ref key=idx_labels_cover rows=1 Extra=Using index
Contact Discovery
With the domain index in place, we built the contact discovery feature: when viewing an email from an unknown sender, the system automatically checks if any other senders from the same domain are in the contacts table. If yes, it suggests adding this person to the existing company record.
The query is now a simple join on two indexed email_domain columns — millisecond response time even as both tables grow:
SELECT DISTINCT
e.from_address,
e.from_name,
c.company_name
FROM email_maildir_ingest e
JOIN contacts c
ON e.email_domain = c.email_domain -- both indexed
LEFT JOIN contacts known
ON e.from_address = known.email
WHERE known.contact_id IS NULL -- not already in CRM
AND e.email_domain = 'acme.com';
-- Returns unknown senders from known companies instantly
Gemini AI Classification
Every new email ingested into the system is sent to Gemini for classification. The AI extracts:
- Category: Sales inquiry, support request, invoice, spam, partnership, complaint, etc.
- Urgency: critical / high / medium / low
- Sentiment: positive / neutral / negative
- Action needed: Free-text description of what response is required
- Key entities: Order numbers, product names, dollar amounts extracted automatically
Classification results are stored in a separate email_ai_labels table linked to the original email. This allows re-classification with updated models without modifying the source data.
Body Cleanup for Display
Emails stored in the archive are raw MIME — HTML bodies with full CSS stylesheets, tracking pixels, marketing layouts. Displaying raw HTML in the CRM would be a security risk (XSS) and would look terrible. Processing it for the AI with all the markup noise would waste tokens and reduce accuracy.
The cleanup pipeline strips all CSS and HTML, converting the email to clean plain text for both display and AI processing. A DOMParser-based approach handles well-formed HTML; a regex fallback handles malformed markup from older email clients.
Results
| Metric | Before | After |
|---|---|---|
| Domain lookup query | 2.2s (722K scan) | <1ms (1 row via index) |
| Dashboard stats query | Table scan, 92K rows | Using index, 1 row |
| Contact discovery | Not possible | Instant, both tables indexed |
| Email classification | Manual | AI-automated (Gemini) |
| EXPLAIN type (labels) | ALL | ref |
| EXPLAIN Extra (labels) | Using where | Using index |
| Email body in UI | Raw HTML (broken) | Clean plain text |
Tech Stack
Gemini 2.0 Flash API • MySQL 8.0 (InnoDB generated columns) • PHP 8.1 • IMAP • MIME parser • DOMParser • JavaScript (vanilla) • EXPLAIN ANALYZE