Article Segmentation and Extraction Guidelines
Project Likkutei Sichos | April 08, 2024

Project Likkutei Sichos | December 10, 2025

Instructions:

Extract all individual articles from the provided PDF content.

Step 1: Article Segmentation

First, scan the full input text and identify all major article headings. Use these headings to accurately divide the text into separate articles. Each heading should correspond to the main title of an article, not the printout title. Ensure no two articles share the same title.
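The segmentation step can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name `segment_by_headings` is hypothetical, and it assumes the list of article headings has already been identified and that each heading sits on its own line.

```python
def segment_by_headings(text: str, headings: list[str]) -> list[dict]:
    """Split text into articles at lines that exactly match a known heading.

    Assumption: `headings` was produced by an earlier scan of the input.
    """
    if len(set(headings)) != len(headings):
        raise ValueError("article titles must be unique")

    heading_set = set(headings)
    seen = set()
    articles = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in heading_set and stripped not in seen:
            # First occurrence of a heading starts a new article.
            seen.add(stripped)
            current = {"title": stripped, "lines": []}
            articles.append(current)
        elif current is not None:
            current["lines"].append(line)
        # Lines before the first heading (e.g. the printout title) are skipped.

    return [
        {"title": a["title"], "body": "\n".join(a["lines"]).strip()}
        for a in articles
    ]
```

Text that precedes the first heading, such as the printout title, is left out of every article, and a repeated heading is treated as body text so that titles stay unique.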

Step 2: Article Extraction

Return a single JSON array of objects. Each object must contain:

  • "title": The article's unique main heading (clean text only, no special characters).
  • "content": Full article text, formatted in HTML using:
    • <p> for paragraphs
    • <strong> for bold
    • <em> for italics
    • <h2>/<h3> for subheadings
    Preserve original formatting and structure. Do NOT truncate, summarize, or add ellipses.
  • "tags": Array of at least 5 relevant tags/keywords derived from article content.
  • "printout_title": The title of the printout document.
  • "source_url": The original printout's URL or reference.
  • "beginning": Integer (1–5). Is the start logical?
  • "ending": Integer (1–5). Does the article end with punctuation or closure?
  • "completeness": Integer (1–5). Is the article a complete thought?
  • "readability": Integer (1–5). Does the article make sense independently?
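The field list above can be checked mechanically before the output is stored. The following is a sketch only: `validate_article` and its message strings are hypothetical names chosen for illustration, and the checks cover just what the list states (required fields, types, at least 5 tags, scores from 1 to 5).

```python
REQUIRED_FIELDS = {
    "title": str,
    "content": str,
    "tags": list,
    "printout_title": str,
    "source_url": str,
    "beginning": int,
    "ending": int,
    "completeness": int,
    "readability": int,
}
SCORE_FIELDS = ("beginning", "ending", "completeness", "readability")


def validate_article(article: dict) -> list[str]:
    """Return a list of problems; an empty list means the object passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        value = article.get(name)
        if name not in article:
            problems.append(f"missing field: {name}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems

    if len(article["tags"]) < 5:
        problems.append("tags: fewer than 5 entries")
    for name in SCORE_FIELDS:
        if not 1 <= article[name] <= 5:
            problems.append(f"{name}: score outside 1-5")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every defect in an object at once.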

Rules & Constraints:

  • Ignore page numbers entirely, even when they include an article name (e.g., "Pg. 4 Toras Avigdor"). Do not treat them as article headings or include them in article content.
  • Avoid making the title a question or answer.
  • Extract every article using its own heading for segmentation.
  • Do not truncate, summarize, or omit content.
  • Use multi-step extraction if needed for long inputs, but ensure full content is included in the final output.
  • Combine all articles into one single JSON array.
  • Output ONLY the array—no prefixes, suffixes, markdown, or explanations.
  • Structure must be directly mappable to a database.
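As an illustration of the page-number rule, the sketch below drops lines that look like page markers before segmentation. The pattern is an assumption based only on the "Pg. 4 Toras Avigdor" example; real printouts may use other layouts, and `strip_page_markers` is a hypothetical helper name.

```python
import re

# Assumed marker shape: "Pg." or "Page", a number, then optional trailing text.
PAGE_MARKER = re.compile(r"^\s*(?:Pg\.?|Page)\s*\d+\b.*$", re.IGNORECASE)


def strip_page_markers(text: str) -> str:
    """Remove whole lines that match the assumed page-marker pattern."""
    kept = [line for line in text.splitlines() if not PAGE_MARKER.match(line)]
    return "\n".join(kept)
```

A pattern this broad would also drop a body line that happens to begin with "Page 3 ...", so it may need tightening for a given printout.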

Output Format:

[ { "title": "", "content": "", "tags": [], "printout_title": "", "source_url": "", "beginning": 0, "ending": 0, "completeness": 0, "readability": 0 } ]
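One way to see what "directly mappable to a database" could mean in practice is the sketch below, which parses the raw output and loads it into SQLite. The table name `articles`, the choice of `title` as primary key, and storing `tags` as JSON text are all assumptions made for illustration; the guidelines do not name a database or schema.

```python
import json
import sqlite3

COLUMNS = (
    "title", "content", "tags", "printout_title", "source_url",
    "beginning", "ending", "completeness", "readability",
)


def load_articles(raw_output: str, conn: sqlite3.Connection) -> int:
    """Parse the raw output and insert one row per article; return the row count."""
    # json.loads fails if the output carries prefixes, suffixes, or markdown.
    articles = json.loads(raw_output)
    if not isinstance(articles, list):
        raise ValueError("output must be a JSON array")

    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "title TEXT PRIMARY KEY, content TEXT, tags TEXT, printout_title TEXT, "
        "source_url TEXT, beginning INTEGER, ending INTEGER, "
        "completeness INTEGER, readability INTEGER)"
    )
    rows = [
        # The tags array is serialized to JSON text (an assumed storage choice).
        tuple(json.dumps(a[c]) if c == "tags" else a[c] for c in COLUMNS)
        for a in articles
    ]
    placeholders = ", ".join("?" * len(COLUMNS))
    conn.executemany(f"INSERT INTO articles VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)
```

Because the parse step rejects anything that is not bare JSON, it doubles as a check on the "output ONLY the array" rule.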
