Technical Walkthrough

Self-Improving GTM Systems

The complete architecture for building auto-research loops that make your campaigns, copy, and targeting smarter every single week.

By Mitchell Keller | 2M+ cold emails sent | 18 min read

The Auto-Research Pattern

One loop. Any business metric. Infinite iteration.

Karpathy built auto-research to improve AI models overnight. The same pattern works for anything you can measure: booked calls, conversions, engagement, revenue.

Hypothesis → Experiment → Measure → Keep / Discard → Learn
The loop repeats autonomously. No human in the loop.

The brilliance isn't the loop itself. Marketers have been A/B testing for decades. The brilliance is that AI can now run the loop autonomously, log what it learns, and make each iteration smarter than the last.
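Stripped to its skeleton, the loop fits in a few lines of Python. This is an illustrative sketch, not any real library: `Experiment` and `run_loop` are hypothetical names standing in for whatever your stack provides.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str
    metric_delta: float  # measured change vs. baseline; positive = improvement

def run_loop(baseline: float, experiments: list[Experiment]) -> tuple[float, list[str]]:
    """Measure each experiment, keep winners, discard losers, log what was learned."""
    learnings = []
    for exp in experiments:
        candidate = baseline + exp.metric_delta
        if candidate > baseline:   # Measure → Keep
            baseline = candidate
            learnings.append(f"KEEP: {exp.hypothesis}")
        else:                      # Measure → Discard
            learnings.append(f"DISCARD: {exp.hypothesis}")
    return baseline, learnings
```

Feed it a baseline and a batch of measured experiments and it returns the new baseline plus a log. The log is the point: it is what makes the next iteration smarter than the last.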

But there's a catch. And it's the thing that separates systems that actually compound from systems that waste compute.

The Metric Problem

If you optimize the wrong thing, you build a machine that gets better at being wrong. Faster.

The first decision in any auto-research loop isn't “what should I test?” It's “what metric actually represents the outcome I want?”

Example: Cold Email Metrics

Vanity metric: 4.2% reply rate. Counts angry “remove me” responses, bot auto-replies, and “not interested” as wins. Optimizes for provocation, not pipeline.

Objective metric: 1 in 2,000 contacts books a call. Contacts per booked call measures campaign efficiency, not volume. Lower is better. Optimizes for attracting the right people and repelling the wrong ones.

Same campaign. Two completely different stories. A system optimizing for reply rate will write increasingly provocative subject lines that generate reactions. A system optimizing for booking rate will write emails that resonate with people who are ready to buy.
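Here is that divergence in code: the same campaign numbers, run through both metrics. A small sketch; the function names are illustrative.

```python
def reply_rate(replies: int, contacts: int) -> float:
    """Vanity metric: counts every reply, including 'remove me'."""
    return replies / contacts

def contacts_per_booked_call(contacts: int, booked_calls: int) -> float:
    """Objective metric: what one call costs in contacts. Lower is better."""
    if booked_calls == 0:
        return float("inf")  # no signal yet; don't divide by zero
    return contacts / booked_calls

# Same campaign, two stories:
contacts, replies, booked = 4000, 168, 2
reply_rate(replies, contacts)               # 0.042 → "4.2% reply rate" sounds great
contacts_per_booked_call(contacts, booked)  # 2000.0 → "1 call per 2,000 contacts"
```

An optimizer pointed at the first number will chase reactions; pointed at the second, it chases pipeline.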

The Metric Cheat Sheet

Domain        | Vanity Metric   | Objective Metric
Cold Email    | Reply rate      | Contacts per booked call
Landing Pages | Page views      | Conversion rate
YouTube       | Views           | Average view duration
Newsletters   | Open rate       | Revenue per subscriber
Ads           | CTR             | Cost per acquisition
Sales Calls   | Meetings booked | Meeting-to-close rate
Rule of thumb: The right metric is the one closest to revenue that your volume lets you measure. High volume? Contacts per booked call. Lower volume? Contacts per engaged lead. Either way, it's a number you can say out loud that means something. “It takes 2,000 contacts to book a call.” Pick the one your data can actually validate.

Loop 1: The Research Process

Every campaign ends smarter than it started. The next one starts where the last one left off.

Most people treat market research as a one-time activity. Pull some data, build a list, write emails, launch. If it doesn't work, blame the copy.

The research loop treats every campaign as a controlled experiment that produces validated knowledge. Not just metrics. Knowledge.

The Cross-Reference Analysis

After a campaign runs, pull every interested reply and cross-reference across six dimensions:

1. Geography
2. Company Size
3. Messaging Angle
4. Conversion Step
5. Response Intensity
6. Firmographic Segment

This produces patterns invisible to single-dimension analysis: “SaaS companies in North America, 50-200 employees, respond to ROI messaging at Step 1 with HOT intensity.”

That's not a guess. That's a validated finding. And it changes everything about the next campaign.

Finding Classification

Not all findings are equal. Without classification, you can't tell signal from noise.

Raw: desk research. No data.
Emergent: N=1. Could be noise.
Validated: N≥2 with conversions.
Proven: led to a booked call or closed revenue.
Pseudo-flow: research-loop.py

# 1. Pull interested replies via API
replies = emailbison.list_replies(campaign_id, status="interested")

# 2. Cross-reference across 6 dimensions
matrix = cross_reference(replies, dims=["geo", "size", "messaging", "step", "intensity", "segment"])

# 3. Classify findings by evidence strength
for pattern in matrix.significant_patterns():
    if pattern.booked_calls >= 2:
        pattern.status = "VALIDATED"
    elif pattern.booked_calls == 1:
        pattern.status = "EMERGENT"

# 4. Propagate validated findings to the system
update_master_file(client, validated_findings)
update_icp_definition(client, validated_findings)

# 5. Next campaign starts from updated knowledge
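One way to flesh out the cross-referencing step is to treat each interested reply as a dict keyed by the six dimensions and count how often pairs of dimension values co-occur. This sketch returns a flat list of (pattern, count) tuples rather than the matrix object in the pseudo-flow above; the field names and `min_count` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

DIMS = ["geo", "size", "messaging", "step", "intensity", "segment"]

def cross_reference(replies: list[dict], dims: list[str] = DIMS,
                    min_count: int = 2) -> list[tuple]:
    """Count co-occurring pairs of dimension values across interested
    replies; pairs seen min_count+ times are candidate patterns."""
    counts = Counter()
    for r in replies:
        # every pair of dimensions, e.g. (geo=NA, messaging=ROI)
        for a, b in combinations(dims, 2):
            counts[((a, r[a]), (b, r[b]))] += 1
    return [(pattern, n) for pattern, n in counts.most_common() if n >= min_count]
```

Pairs are the simplest case; the same counting extends to triples and beyond, which is where findings like “SaaS, North America, 50-200 employees, ROI messaging” come from.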
The compounding effect: Campaign 1 produces findings. Campaign 2 starts from those findings and produces more. By Campaign 5, you're operating on a completely different level of market intelligence than teams that reset to zero every time.

Loop 2: Email Copy Optimization

AI writes the challenger. Data picks the winner. Learnings compound.

This is the loop most people think of first. But there's a difference between basic A/B testing and a self-improving copy system.

🚫
Basic A/B Test

You write two emails, send them to the same list, and check which got more opens. Repeat manually. Learnings live in your head.

⚠️
Automated A/B

AI writes challengers. Measures reply rate. Keeps winner. Better, but still optimizing the wrong metric with no knowledge accumulation.

Auto-Research Copy

AI writes challengers from a growing learnings file. Measures contacts per booked call. Findings get classified. Patterns get codified.

The Copy Optimization Loop

The 6-Step Cycle (runs weekly)
# Step 1: Pull baseline performance
baseline = emailbison.campaign_stats(campaign_id)

# Step 2: Extract winning copy patterns
winning_copy = emailbison.view_sequence_steps(campaign_id)

# Step 3: Load the learnings file (grows every cycle)
learnings = load_learnings("learnings.md")
# Cycle 1: empty
# Cycle 10: 15 validated patterns
# Cycle 50: a complete playbook

# Step 4: Generate challenger based on learnings + baseline
challenger = ai.generate_variant(
    baseline=winning_copy,
    learnings=learnings,
    metric="contacts_per_booked_call",
    constraints=copy_skill  # brand voice, word count, formatting rules
)

# Step 5: Deploy as A/B variant
emailbison.add_sequence_step(
    campaign_id,
    variant_of=baseline_step_id,
    email_subject=challenger.subject,
    email_body=challenger.body
)

# Step 6: After test period, log results
results = emailbison.campaign_split_test_stats(campaign_id)
append_learnings("learnings.md", results)

What the Learnings File Looks Like After 20 Experiments

VALIDATED: Emails under 60 words book 1.4x more calls in SaaS
VALIDATED: Questions in subject lines reduce C-suite booking rate by 23%
VALIDATED: Step 2 follow-ups citing a specific metric from Step 1 increase bookings 40%
EMERGENT: “Specifically” line after company reference increases relevance
EMERGENT: Ghost pipeline angle outperforms ROI angle for PE-backed companies
RAW: Hypothesis: Including competitor name may increase booking rate for displacement campaigns

These aren't opinions. They're findings from controlled experiments. By Cycle 50, this file is a playbook no human could have compiled manually.
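If the file uses the `STATUS: finding` format shown above, loading only the trustworthy lines is a few lines of Python. A sketch, assuming one finding per line; the format and `min_status` default are assumptions.

```python
def load_learnings(path: str, min_status: str = "VALIDATED") -> list[str]:
    """Read a learnings file and return only findings at or above a given
    evidence level, so the challenger generator never builds on noise."""
    rank = {"RAW": 0, "EMERGENT": 1, "VALIDATED": 2, "PROVEN": 3}
    findings = []
    with open(path) as f:
        for line in f:
            status, _, body = line.partition(":")
            status = status.strip()
            if status in rank and rank[status] >= rank[min_status]:
                findings.append(body.strip())
    return findings
```

The filter is the design choice: challengers are generated from VALIDATED and PROVEN lines only, while EMERGENT and RAW lines wait their turn as hypotheses.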

Loop 3: Reply Intelligence

Everyone optimizes what they send. Almost nobody optimizes what they learn from what comes back.

When someone replies to your cold email, that reply is training data. But most people just read it, respond if positive, and move on. The reply disappears into the inbox.

What Replies Actually Tell You

💬
“We already use [Competitor X]”
This is not a rejection. This is market intelligence.

They're in-market. You know their current vendor. You know their language. Next campaign: build a displacement angle for Competitor X users specifically.

👥
“Not my department, try Sarah”
This is not a dead end. This is a targeting correction.

You're hitting the wrong title. The system now knows. Next campaign: adjust the title targeting. Also, you just got a warm referral.

📅
“Check back in Q2, we're mid-migration”
This is not “no.” This is “not yet.”

Timing signal. They're interested but blocked. Next action: add to Q2 nurture sequence. Don't waste a slot on the next campaign.
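Routing these three signal types can start as crude keyword heuristics. Everything below is an illustrative stand-in; a production system would use an LLM or trained classifier, but the routing logic is the same.

```python
import re

# Illustrative patterns and next actions; not an exhaustive taxonomy.
SIGNALS = [
    ("COMPETITOR", re.compile(r"already use|we use|switched to", re.I),
     "build displacement angle"),
    ("WRONG_PERSONA", re.compile(r"not my department|try |forward(ed)? to", re.I),
     "adjust title targeting; follow up with referral"),
    ("TIMING", re.compile(r"check back|next quarter|q[1-4]|mid-migration", re.I),
     "add to nurture sequence"),
]

def classify_reply(text: str) -> tuple[str, str]:
    """Return (signal label, next action) for a reply; first match wins."""
    for label, pattern, action in SIGNALS:
        if pattern.search(text):
            return label, action
    return "UNCLASSIFIED", "manual review"
```

The output is not a label for a dashboard. It is a routing decision: each signal type feeds a different next campaign.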

The 8-Phase Reply Analysis Pipeline

1. Clean Data
2. Extract Pain Points
3. Map Objections
4. Persona Analysis
5. Campaign Performance
6. Copy Effectiveness
7. GTM Recommendations
8. Knowledge Update
The key insight: “42% of objections were ‘we already have a solution.’” That's not a copy problem. That's a targeting problem. You're hitting companies that already bought. The fix isn't better emails. The fix is better list building.
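Getting to a number like that “42%” is a one-liner over categorized objections. The category labels here are hypothetical.

```python
from collections import Counter

def objection_breakdown(objections: list[str]) -> dict[str, float]:
    """Share of each objection category across all categorized replies."""
    counts = Counter(objections)
    total = sum(counts.values())
    return {k: round(n / total, 2) for k, n in counts.most_common()}

# Hypothetical sample: 50 categorized objections from one campaign
objections = (["already_have_solution"] * 21
              + ["no_budget"] * 15
              + ["bad_timing"] * 14)
```

When the largest bucket is “already have a solution,” the fix lives in list building, not in the copy.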

Reply intelligence feeds directly back into the research loop. Objection patterns reveal targeting gaps. Pain point language gets recycled into copy. Persona conversion rates sharpen the ICP. Every reply makes the next campaign smarter.

The Compounding Flywheel

Three loops. Each one makes the other two better.

🔍
Research Loop
✍️
Copy Loop
📩
Reply Loop
Compounds Weekly

Research → Copy

Better targeting means the copy optimizer is testing against the right audience. Experiments produce cleaner signal.

Copy → Replies

Better copy produces more replies. More replies means more training data for the reply intelligence system.

Replies → Research

Reply intelligence reveals targeting gaps, competitive intel, and persona patterns that refine the research loop's inputs.

Run this for a month and your campaigns are noticeably better. Run it for a quarter and you're operating on a level that teams resetting to zero every campaign simply cannot match.

Beyond Cold Email

The pattern works anywhere you have a clear metric and a way to test.

🎬
YouTube Content
Metric: Avg. view duration

Analyze every video against viral benchmarks. Score titles, hooks, structure. Generate ideas from what works.

📄
Landing Pages
Metric: Conversion rate

Auto-modify headlines, CTAs, social proof. Test against baseline. Page improves weekly without you touching it.

📢
Ad Creatives
Metric: Cost per acquisition

Generate variations. Test across audiences. Learnings from cold email copy often transfer directly to ad copy.

📨
Newsletter Subjects
Metric: Revenue per subscriber

Two subjects per send. Log winner. After 20 sends, you have a validated playbook for your audience.

💰
Pricing Pages
Metric: Signup conversion

Test anchoring, tier structure, guarantee framing. Small changes here compound into serious revenue differences.

📞
Sales Call Scripts
Metric: Meeting-to-close rate

Test discovery frameworks, objection handling, close sequences. Each call produces learnings for the next.

What You Need to Build This

Three non-negotiable requirements.

1. Objective Metric: Not a vanity metric. A number you can say out loud. “It takes 2,000 contacts to book a call.” Not reply rate. Not raw counts.
2. API Access: You need a way to programmatically deploy experiments and pull results. If your tool doesn't have an API, it's not ready for auto-research.
3. Knowledge Architecture: A structured place to store learnings. Not a flat text file. A classification system: Raw, Emergent, Validated, Proven. Without this, you forget what you learned.
The third one is the one everyone skips. They build the loop, run experiments, and never structure the learnings. Six months later, they're running the same experiments again because nobody remembers what worked.
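A minimal version of that knowledge architecture is a small record type that re-classifies itself as evidence accumulates. The schema below is one possible shape, an assumption rather than a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    statement: str
    status: str = "RAW"   # RAW → EMERGENT → VALIDATED → PROVEN
    evidence_n: int = 0   # independent experiments supporting it
    conversions: int = 0
    revenue: float = 0.0

    def record(self, conversions: int = 0, revenue: float = 0.0) -> None:
        """Log one experiment's outcome, then re-classify."""
        self.evidence_n += 1
        self.conversions += conversions
        self.revenue += revenue
        if self.revenue > 0:
            self.status = "PROVEN"
        elif self.evidence_n >= 2 and self.conversions > 0:
            self.status = "VALIDATED"
        elif self.evidence_n >= 1:
            self.status = "EMERGENT"
```

Because classification is computed from evidence rather than typed by hand, the system can't quietly promote a hunch to a playbook rule.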

Want This Built for Your Business?

We build self-improving GTM systems for B2B companies. Research loops, copy optimization, reply intelligence. The whole flywheel.

Talk to LeadGrow →
Or subscribe on YouTube for more technical breakdowns.