Skip to content

Email Document Schema

A document is the clean email record your agent reads. It contains the message body, sender, subject, timestamps, source links, parser metadata, and extracted asset references.

Documents are stored in SQLite and linked to files in the email workspace. A document can come from an .eml file, one message inside an mbox archive, a Gmail message, or an IMAP message.

The schema is alpha. Field names and compatibility guarantees may evolve before a stable schema version is published.

For most agent workflows, start with:

  • subject
  • sender_email
  • received_at
  • body_text
  • metadata.source
  • assets

Those fields give the agent the message content, who sent it, when it arrived, where it came from, and which files or embedded images belong with it.

The document has three groups of data:

  • Core fields: subject, sender, body text, source kind, timestamps, and document IDs.
  • Linked files: raw email, HTML view, and extracted assets in the email workspace.
  • Metadata: source provenance, parser notes, cleaning details, and provider IDs.

JSON export returns this document shape. Markdown, HTML, and PDF exports are derived views created from the stored document and its linked files.

A document is one stored email message. It is the object returned by mailatlas get, the Python API, and JSON export.

Use the document when your agent needs to read the message body, inspect sender and timestamp fields, follow links to source files, or decide which attachments matter.

An asset is a file MailAtlas extracted from the email. Assets include:

  • Embedded images used by HTML email, such as logos, charts, and inline screenshots.
  • Regular attachments, such as PDFs, spreadsheets, calendar files, and images.

Assets are stored as files in the email workspace and listed on the document as structured records. The asset record tells your agent what kind of file it is, where it lives, and whether it was embedded in HTML or attached as a normal file.

Metadata is extra information about where the email came from and how MailAtlas processed it. It is not the main message body.

Metadata answers questions such as:

  • Did this email come from Gmail, IMAP, .eml, or mbox?
  • Which Gmail message ID or IMAP UID produced it?
  • Did MailAtlas detect a forwarded-message chain?
  • Which cleaning options affected body_text?

Cleaning is the process that creates body_text, the plain text your agent usually reads first. Email bodies often include forwarded headers, quoted wrappers, footers, invisible characters, repeated whitespace, and boilerplate. Cleaning records explain what MailAtlas changed while preserving the raw email and HTML view in the workspace.

FieldDescription
idStable MailAtlas document ID used by CLI, Python API, exports, and asset references.
source_kindInput path that produced the document, such as eml, mbox, gmail, or imap.
message_idEmail Message-ID header when present. Used for dedupe when available.
thread_idThread identifier when MailAtlas can derive or store one.
subjectEmail subject.
sender_nameDisplay name from the sender header when present.
sender_emailSender email address when present.
authorNormalized author string for applications that need a single display value.
received_atReceived timestamp when available.
published_atPublication-like timestamp when MailAtlas derives one from the message.
body_textCleaned plain-text body.
body_html_pathPath to the normalized HTML view when available.
raw_pathPath to the stored raw email bytes.
content_hashNormalized hash used for dedupe when no usable message_id exists.
metadataSource details, parser notes, cleaning details, and provenance.
created_atTimestamp when MailAtlas created the stored document record.

An asset record describes one file extracted from the message. Use it when your agent needs to inspect attachments, preserve embedded images, or export a message with its files intact.

Example asset:

{
"id": "asset_1",
"document_id": "doc_123",
"ordinal": 1,
"kind": "inline",
"mime_type": "image/png",
"file_path": "assets/doc_123/001-chart.png",
"cid": "chart-1",
"sha256": "<sha256>"
}
FieldDescription
idAsset ID.
document_idID of the document that owns the asset.
ordinalAsset order within the document.
kindinline for files embedded in HTML, or attachment for regular email attachments.
mime_typeMIME type when known.
file_pathLocal path to the copied asset.
cidContent ID used by HTML email to reference an inline asset when present.
sha256SHA-256 hash of the stored asset file when available.

metadata.source records where the document came from. Use it when your agent needs to trace a document back to a file, Gmail message, or IMAP folder.

Metadata pathDescription
source.kindSource kind, such as eml, mbox, gmail, or imap.
source.pathSource file path for file-based ingest when available.
source.hostIMAP host for received IMAP documents when present.
source.folderIMAP folder for received IMAP documents when present.
source.uidIMAP UID for received IMAP documents when present.
source.uidvalidityIMAP UIDVALIDITY value for received IMAP documents when present.
source.gmail_message_idGmail message ID for Gmail documents when present.
source.gmail_thread_idGmail thread ID for Gmail documents when present.
source.gmail_label_idsGmail label IDs for Gmail documents when present.
source.gmail_history_idGmail history ID for Gmail documents when present.
source.gmail_internal_dateGmail internal date for Gmail documents when present.

Cleaning metadata records how MailAtlas shaped body_text. Use it when your agent or application needs to explain why the clean text differs from the raw email body.

Metadata pathDescription
cleaning.removed_forwarded_headersWhether forwarded headers were removed.
cleaning.dropped_line_countNumber of lines removed during cleaning.
cleaning.stopped_at_footerWhether cleaning stopped at a detected footer.
parser_config.*Parser configuration used for the run.

Provenance metadata records message structure detected during parsing. Use it when forwarded-message structure or parser decisions matter to your workflow.

Metadata pathDescription
provenance.is_forwardedWhether MailAtlas detected forwarded-message structure.
provenance.forwarded_chainForwarded-chain details when detected.
{
"id": "doc_123",
"source_kind": "imap",
"message_id": "<message@example.com>",
"subject": "Daily market digest",
"sender_email": "sender@example.com",
"received_at": "2026-04-18T15:05:00Z",
"body_text": "Cleaned message text...",
"body_html_path": "html/doc_123.html",
"raw_path": "raw/doc_123.eml",
"metadata": {
"source": {
"kind": "imap",
"folder": "INBOX",
"uid": "4812"
},
"cleaning": {
"dropped_line_count": 0
}
},
"assets": [
{
"id": "asset_1",
"kind": "inline",
"mime_type": "image/png",
"file_path": "assets/doc_123/001-chart.png"
}
]
}

For single-message files, source_kind is eml. Metadata can record the source path or equivalent source reference.

For mailbox archives, source_kind is mbox. Metadata can identify the mailbox file and message position or source reference.

For Gmail messages, source_kind is gmail. Metadata can include Gmail message ID, thread ID, label IDs, history ID, internal date, and receive account ID.

For IMAP messages, source_kind is imap. Metadata can include the mailbox host, folder, UID, and UIDVALIDITY value.

Paths such as raw_path, body_html_path, and assets[].file_path resolve inside the email workspace:

raw/doc_123.eml
html/doc_123.html
assets/doc_123/001-chart.png

Export commands can also write separate bundles outside the workspace when you pass --out.