Technical Walkthrough

Self-Improving GTM Systems

The complete architecture for building auto-research loops that make your campaigns, copy, and targeting smarter every single week.

By Mitchell Keller | 2M+ cold emails sent | 18 min read

The Auto-Research Pattern

One loop. Any business metric. Infinite iteration.

Karpathy built auto-research to improve AI models overnight. The same pattern works for anything you can measure: booked calls, conversions, engagement, revenue.

Hypothesis → Experiment → Measure → Keep / Discard → Learn
The loop repeats autonomously. No human in the loop.

The brilliance isn't the loop itself. Marketers have been A/B testing for decades. The brilliance is that AI can now run the loop autonomously, log what it learns, and make each iteration smarter than the last.
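Stripped to its skeleton, the loop fits in a few lines of Python. This is an illustrative sketch, not any real library: `Experiment` and `run_loop` are hypothetical names standing in for whatever your stack provides.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str
    metric_delta: float  # measured change vs. baseline; positive = improvement

def run_loop(baseline: float, experiments: list[Experiment]) -> tuple[float, list[str]]:
    """Measure each experiment, keep winners, discard losers, log what was learned."""
    learnings = []
    for exp in experiments:
        candidate = baseline + exp.metric_delta
        if candidate > baseline:   # Measure → Keep
            baseline = candidate
            learnings.append(f"KEEP: {exp.hypothesis}")
        else:                      # Measure → Discard
            learnings.append(f"DISCARD: {exp.hypothesis}")
    return baseline, learnings
```

Feed it a baseline and a batch of measured experiments and it returns the new baseline plus a log. The log is the point: it is what makes the next iteration smarter than the last.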

But there's a catch. And it's the thing that separates systems that actually compound from systems that waste compute.

The Metric Problem

If you optimize the wrong thing, you build a machine that gets better at being wrong. Faster.

The first decision in any auto-research loop isn't “what should I test?” It's “what metric actually represents the outcome I want?”

Example: Cold Email Metrics

Vanity metric: 4.2% reply rate. Counts angry “remove me” responses, bot auto-replies, and “not interested” as wins. Optimizes for provocation, not pipeline.

Objective metric: 1 in 2,000 contacts books a call. Contacts per booked call measures campaign efficiency, not volume. Lower is better. Optimizes for attracting the right people and repelling the wrong ones.

Same campaign. Two completely different stories. A system optimizing for reply rate will write increasingly provocative subject lines that generate reactions. A system optimizing for booking rate will write emails that resonate with people who are ready to buy.
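Here is that divergence in code: the same campaign numbers, run through both metrics. A small sketch; the function names are illustrative.

```python
def reply_rate(replies: int, contacts: int) -> float:
    """Vanity metric: counts every reply, including 'remove me'."""
    return replies / contacts

def contacts_per_booked_call(contacts: int, booked_calls: int) -> float:
    """Objective metric: what one call costs in contacts. Lower is better."""
    if booked_calls == 0:
        return float("inf")  # no signal yet; don't divide by zero
    return contacts / booked_calls

# Same campaign, two stories:
contacts, replies, booked = 4000, 168, 2
reply_rate(replies, contacts)               # 0.042 → "4.2% reply rate" sounds great
contacts_per_booked_call(contacts, booked)  # 2000.0 → "1 call per 2,000 contacts"
```

An optimizer pointed at the first number will chase reactions; pointed at the second, it chases pipeline.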

The Metric Cheat Sheet

Domain        | Vanity Metric   | Objective Metric
Cold Email    | Reply rate      | Contacts per booked call
Landing Pages | Page views      | Conversion rate
YouTube       | Views           | Average view duration
Newsletters   | Open rate       | Revenue per subscriber
Ads           | CTR             | Cost per acquisition
Sales Calls   | Meetings booked | Meeting-to-close rate
Rule of thumb: The right metric is the one closest to revenue that your volume lets you measure. High volume? Contacts per booked call. Lower volume? Contacts per engaged lead. Either way, it's a number you can say out loud that means something. “It takes 2,000 contacts to book a call.” Pick the one your data can actually validate.

Loop 1: The Research Process

Every campaign ends smarter than it started. The next one starts where the last one left off.

Most people treat market research as a one-time activity. Pull some data, build a list, write emails, launch. If it doesn't work, blame the copy.

The research loop treats every campaign as a controlled experiment that produces validated knowledge. Not just metrics. Knowledge.

The Cross-Reference Analysis

After a campaign runs, pull every interested reply and cross-reference across six dimensions:

1. Geography
2. Company Size
3. Messaging Angle
4. Conversion Step
5. Response Intensity
6. Firmographic Segment

This produces patterns invisible to single-dimension analysis: “SaaS companies in North America, 50-200 employees, respond to ROI messaging at Step 1 with HOT intensity.”

That's not a guess. That's a validated finding. And it changes everything about the next campaign.

Finding Classification

Not all findings are equal. Without classification, you can't tell signal from noise.

Raw: desk research. No data.
Emergent: N=1. Could be noise.
Validated: N≥2 with conversions.
Proven: led to a booked call or closed revenue.
Pseudo-flow: research-loop.py

# 1. Pull interested replies via API
replies = emailbison.list_replies(campaign_id, status="interested")

# 2. Cross-reference across 6 dimensions
matrix = cross_reference(replies, dims=["geo", "size", "messaging", "step", "intensity", "segment"])

# 3. Classify findings by evidence strength
for pattern in matrix.significant_patterns():
    if pattern.booked_calls >= 2:
        pattern.status = "VALIDATED"
    elif pattern.booked_calls == 1:
        pattern.status = "EMERGENT"

# 4. Propagate validated findings to the system
update_master_file(client, validated_findings)
update_icp_definition(client, validated_findings)

# 5. Next campaign starts from updated knowledge
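One way to flesh out the cross-referencing step is to treat each interested reply as a dict keyed by the six dimensions and count how often pairs of dimension values co-occur. This sketch returns a flat list of (pattern, count) tuples rather than the matrix object in the pseudo-flow above; the field names and `min_count` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

DIMS = ["geo", "size", "messaging", "step", "intensity", "segment"]

def cross_reference(replies: list[dict], dims: list[str] = DIMS,
                    min_count: int = 2) -> list[tuple]:
    """Count co-occurring pairs of dimension values across interested
    replies; pairs seen min_count+ times are candidate patterns."""
    counts = Counter()
    for r in replies:
        # every pair of dimensions, e.g. (geo=NA, messaging=ROI)
        for a, b in combinations(dims, 2):
            counts[((a, r[a]), (b, r[b]))] += 1
    return [(pattern, n) for pattern, n in counts.most_common() if n >= min_count]
```

Pairs are the simplest case; the same counting extends to triples and beyond, which is where findings like “SaaS, North America, 50-200 employees, ROI messaging” come from.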
The compounding effect: Campaign 1 produces findings. Campaign 2 starts from those findings and produces more. By Campaign 5, you're operating on a completely different level of market intelligence than teams that reset to zero every time.

Loop 2: Email Copy Optimization

AI writes the challenger. Data picks the winner. Learnings compound.

This is the loop most people think of first. But there's a difference between basic A/B testing and a self-improving copy system.

🚫
Basic A/B Test

You write two emails, send them to the same list, and check which got more opens. Repeat manually. Learnings live in your head.

⚠️
Automated A/B

AI writes challengers. Measures reply rate. Keeps winner. Better, but still optimizing the wrong metric with no knowledge accumulation.

Auto-Research Copy

AI writes challengers from a growing learnings file. Measures contacts per booked call. Findings get classified. Patterns get codified.

The Copy Optimization Loop

The 6-Step Cycle (runs weekly)
# Step 1: Pull baseline performance
baseline = emailbison.campaign_stats(campaign_id)

# Step 2: Extract winning copy patterns
winning_copy = emailbison.view_sequence_steps(campaign_id)

# Step 3: Load the learnings file (grows every cycle)
learnings = load_learnings("learnings.md")
# Cycle 1: empty
# Cycle 10: 15 validated patterns
# Cycle 50: a complete playbook

# Step 4: Generate challenger based on learnings + baseline
challenger = ai.generate_variant(
    baseline=winning_copy,
    learnings=learnings,
    metric="contacts_per_booked_call",
    constraints=copy_skill  # brand voice, word count, formatting rules
)

# Step 5: Deploy as A/B variant
emailbison.add_sequence_step(
    campaign_id,
    variant_of=baseline_step_id,
    email_subject=challenger.subject,
    email_body=challenger.body
)

# Step 6: After test period, log results
results = emailbison.campaign_split_test_stats(campaign_id)
append_learnings("learnings.md", results)

What the Learnings File Looks Like After 20 Experiments

VALIDATED: Emails under 60 words book 1.4x more calls in SaaS
VALIDATED: Questions in subject lines reduce C-suite booking rate by 23%
VALIDATED: Step 2 follow-ups citing a specific metric from Step 1 increase bookings 40%
EMERGENT: “Specifically” line after company reference increases relevance
EMERGENT: Ghost pipeline angle outperforms ROI angle for PE-backed companies
RAW: Hypothesis: Including competitor name may increase booking rate for displacement campaigns

These aren't opinions. They're findings from controlled experiments. By Cycle 50, this file is a playbook no human could have compiled manually.
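If the file uses the `STATUS: finding` format shown above, loading only the trustworthy lines is a few lines of Python. A sketch, assuming one finding per line; the format and `min_status` default are assumptions.

```python
def load_learnings(path: str, min_status: str = "VALIDATED") -> list[str]:
    """Read a learnings file and return only findings at or above a given
    evidence level, so the challenger generator never builds on noise."""
    rank = {"RAW": 0, "EMERGENT": 1, "VALIDATED": 2, "PROVEN": 3}
    findings = []
    with open(path) as f:
        for line in f:
            status, _, body = line.partition(":")
            status = status.strip()
            if status in rank and rank[status] >= rank[min_status]:
                findings.append(body.strip())
    return findings
```

The filter is the design choice: challengers are generated from VALIDATED and PROVEN lines only, while EMERGENT and RAW lines wait their turn as hypotheses.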

Loop 3: Reply Intelligence

Everyone optimizes what they send. Almost nobody optimizes what they learn from what comes back.

When someone replies to your cold email, that reply is training data. But most people just read it, respond if positive, and move on. The reply disappears into the inbox.

What Replies Actually Tell You

💬
“We already use [Competitor X]”
This is not a rejection. This is market intelligence.

They're in-market. You know their current vendor. You know their language. Next campaign: build a displacement angle for Competitor X users specifically.

👥
“Not my department, try Sarah”
This is not a dead end. This is a targeting correction.

You're hitting the wrong title. The system now knows. Next campaign: adjust the title targeting. Also, you just got a warm referral.

📅
“Check back in Q2, we're mid-migration”
This is not “no.” This is “not yet.”

Timing signal. They're interested but blocked. Next action: add to Q2 nurture sequence. Don't waste a slot on the next campaign.
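Routing these three signal types can start as crude keyword heuristics. Everything below is an illustrative stand-in; a production system would use an LLM or trained classifier, but the routing logic is the same.

```python
import re

# Illustrative patterns and next actions; not an exhaustive taxonomy.
SIGNALS = [
    ("COMPETITOR", re.compile(r"already use|we use|switched to", re.I),
     "build displacement angle"),
    ("WRONG_PERSONA", re.compile(r"not my department|try |forward(ed)? to", re.I),
     "adjust title targeting; follow up with referral"),
    ("TIMING", re.compile(r"check back|next quarter|q[1-4]|mid-migration", re.I),
     "add to nurture sequence"),
]

def classify_reply(text: str) -> tuple[str, str]:
    """Return (signal label, next action) for a reply; first match wins."""
    for label, pattern, action in SIGNALS:
        if pattern.search(text):
            return label, action
    return "UNCLASSIFIED", "manual review"
```

The output is not a label for a dashboard. It is a routing decision: each signal type feeds a different next campaign.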

The 8-Phase Reply Analysis Pipeline

1. Clean Data
2. Extract Pain Points
3. Map Objections
4. Persona Analysis
5. Campaign Performance
6. Copy Effectiveness
7. GTM Recommendations
8. Knowledge Update
The key insight: “42% of objections were ‘we already have a solution.’” That's not a copy problem. That's a targeting problem. You're hitting companies that already bought. The fix isn't better emails. The fix is better list building.
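Getting to a number like that “42%” is a one-liner over categorized objections. The category labels here are hypothetical.

```python
from collections import Counter

def objection_breakdown(objections: list[str]) -> dict[str, float]:
    """Share of each objection category across all categorized replies."""
    counts = Counter(objections)
    total = sum(counts.values())
    return {k: round(n / total, 2) for k, n in counts.most_common()}

# Hypothetical sample: 50 categorized objections from one campaign
objections = (["already_have_solution"] * 21
              + ["no_budget"] * 15
              + ["bad_timing"] * 14)
```

When the largest bucket is “already have a solution,” the fix lives in list building, not in the copy.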

Reply intelligence feeds directly back into the research loop. Objection patterns reveal targeting gaps. Pain point language gets recycled into copy. Persona conversion rates sharpen the ICP. Every reply makes the next campaign smarter.

The Compounding Flywheel

Three loops. Each one makes the other two better.

🔍
Research Loop
✍️
Copy Loop
📩
Reply Loop
Compounds Weekly

Research → Copy

Better targeting means the copy optimizer is testing against the right audience. Experiments produce cleaner signal.

Copy → Replies

Better copy produces more replies. More replies means more training data for the reply intelligence system.

Replies → Research

Reply intelligence reveals targeting gaps, competitive intel, and persona patterns that refine the research loop's inputs.

Run this for a month and your campaigns are noticeably better. Run it for a quarter and you're operating on a level that teams resetting to zero every campaign simply cannot match.

Beyond Cold Email

The pattern works anywhere you have a clear metric and a way to test.

🎬
YouTube Content
Metric: Avg. view duration

Analyze every video against viral benchmarks. Score titles, hooks, structure. Generate ideas from what works.

📄
Landing Pages
Metric: Conversion rate

Auto-modify headlines, CTAs, social proof. Test against baseline. Page improves weekly without you touching it.

📢
Ad Creatives
Metric: Cost per acquisition

Generate variations. Test across audiences. Learnings from cold email copy often transfer directly to ad copy.

📨
Newsletter Subjects
Metric: Revenue per subscriber

Two subjects per send. Log winner. After 20 sends, you have a validated playbook for your audience.

💰
Pricing Pages
Metric: Signup conversion

Test anchoring, tier structure, guarantee framing. Small changes here compound into serious revenue differences.

📞
Sales Call Scripts
Metric: Meeting-to-close rate

Test discovery frameworks, objection handling, close sequences. Each call produces learnings for the next.

What You Need to Build This

Three non-negotiable requirements.

1. Objective Metric: Not a vanity metric. A number you can say out loud. “It takes 2,000 contacts to book a call.” Not reply rate. Not raw counts.
2. API Access: You need a way to programmatically deploy experiments and pull results. If your tool doesn't have an API, it's not ready for auto-research.
3. Knowledge Architecture: A structured place to store learnings. Not a flat text file. A classification system: Raw, Emergent, Validated, Proven. Without this, you forget what you learned.
The third one is the one everyone skips. They build the loop, run experiments, and never structure the learnings. Six months later, they're running the same experiments again because nobody remembers what worked.
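A minimal version of that knowledge architecture is a small record type that re-classifies itself as evidence accumulates. The schema below is one possible shape, an assumption rather than a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    statement: str
    status: str = "RAW"   # RAW → EMERGENT → VALIDATED → PROVEN
    evidence_n: int = 0   # independent experiments supporting it
    conversions: int = 0
    revenue: float = 0.0

    def record(self, conversions: int = 0, revenue: float = 0.0) -> None:
        """Log one experiment's outcome, then re-classify."""
        self.evidence_n += 1
        self.conversions += conversions
        self.revenue += revenue
        if self.revenue > 0:
            self.status = "PROVEN"
        elif self.evidence_n >= 2 and self.conversions > 0:
            self.status = "VALIDATED"
        elif self.evidence_n >= 1:
            self.status = "EMERGENT"
```

Because classification is computed from evidence rather than typed by hand, the system can't quietly promote a hunch to a playbook rule.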

Want This Built for Your Business?

We build self-improving GTM systems for B2B companies. Research loops, copy optimization, reply intelligence. The whole flywheel.

Talk to LeadGrow →
Or subscribe on YouTube for more technical breakdowns.