Parsing

R2R supports different parsing providers to extract text from various document formats. To configure the parsing provider:

example r2r.toml
[parsing]
provider = "unstructured_local" # | rag | unstructured_api
excluded_parsers = ["mp4"]

Available providers:

  • r2r: Default offering for light installations, a simple and lightweight parser included in R2R.
  • unstructured_local: Default offering for full installations, makes use of open source Unstructured package.
  • unstructured_api: Cloud offering of Unstructured

Supported File Types

R2R supports parsing for the following file types:

  • BMP (Bitmap Image)
  • CSV (Comma-Separated Values)
  • DOC (Microsoft Word Document)
  • DOCX (Microsoft Word Document)
  • EML (Electronic Mail)
  • EPUB (Electronic Publication)
  • GIF (Graphics Interchange Format)
  • HEIC (High-Efficiency Image Format)
  • HTM (HyperText Markup)
  • HTML (HyperText Markup Language)
  • JPEG (Joint Photographic Experts Group)
  • JPG (Joint Photographic Experts Group)
  • JSON (JavaScript Object Notation)
  • MD (Markdown)
  • MSG (Microsoft Outlook Message)
  • MP3 (MPEG Audio Layer III)
  • MP4 (MPEG-4 Part 14)
  • ODT (Open Document Text)
  • ORG (Org Mode)
  • PDF (Portable Document Format)
  • P7S (PKCS#7)
  • PNG (Portable Network Graphics)
  • PPT (PowerPoint)
  • PPTX (Microsoft PowerPoint Presentation)
  • RST (reStructured Text)
  • RTF (Rich Text Format)
  • SVG (Scalable Vector Graphics)
  • TSV (Tab-Separated Values)
  • TXT (Plain Text)
  • XLS (Microsoft Excel Spreadsheet)
  • XLSX (Microsoft Excel Spreadsheet)
  • XML (Extensible Markup Language)
  • TIFF (Tagged Image File Format)
  • MP4 (MPEG-4 Part 14)
Parsing providers for an R2R system cannot be configured at runtime and are instead configured server side.

Refer to the Unstructured documentation for details about their ingestion capabilities and limitations.

Chunking

R2R uses chunking to break down parsed documents into smaller, manageable pieces for efficient processing and retrieval. Configure the chunking settings in r2r.toml:

r2r.toml
[chunking]
provider = "unstructured_local"
strategy = "auto"
chunking_strategy = "by_title"
new_after_n_chars = 512
max_characters = 1_024
combine_under_n_chars = 128
overlap = 20

Key chunking configuration options:

  • provider: The chunking provider (defaults to “r2r”).

For R2R:

  • chunking_strategy: The chunking method (“recursive”).
  • chunk_size: The target size for each chunk.
  • chunk_overlap: The number of characters to overlap between chunks.
  • excluded_parsers: List of parsers to exclude (e.g., [“mp4”]).

For Unstructured:

  • strategy: The overall chunking strategy (“auto”, “fast”, or “hi_res”).
  • chunking_strategy: The specific chunking method (“by_title” or “basic”).
  • new_after_n_chars: Soft maximum size for a chunk.
  • max_characters: Hard maximum size for a chunk.
  • combine_under_n_chars: Minimum size for combining small sections.
  • overlap: Number of characters to overlap between chunks.

Supported Providers

# Ensure unstructured is installed
# Refer to the full installation docs here - [https://r2r-docs.sciphi.ai/introduction/documentation/installation/full/docker]

# Set 'provider = "unstructured_local"' for `ingestion` in `my_r2r.toml`.
r2r serve --config-path=my_r2r.toml

This is the default full provider, using the open-source Unstructured library for local processing.

Advanced Configuration Options

When using the Unstructured chunking provider, you can specify additional parameters in the configuration file:

[ingestion]
provider = "unstructured_local"
strategy = "auto"  # "auto", "fast", or "hi_res"
chunking_strategy = "by_title"  # "by_title" or "basic"

# Core chunking parameters
combine_under_n_chars = 128
max_characters = 500
new_after_n_chars = 1500
overlap = 0

# Additional chunking options
coordinates = false
encoding = "utf-8"
extract_image_block_types = []  # List of image block types to extract
gz_uncompressed_content_type = null
hi_res_model_name = null
include_orig_elements = true
include_page_breaks = false

languages = []  # List of languages to consider
multipage_sections = true
ocr_languages = []  # List of languages for OCR
output_format = "application/json"
overlap_all = false
pdf_infer_table_structure = true

similarity_threshold = null
skip_infer_table_types = []  # List of table types to skip inference
split_pdf_concurrency_level = 5
split_pdf_page = true
starting_page_number = null
unique_element_ids = false
xml_keep_tags = false

These options allow fine-tuning of the chunking process for specific document types or requirements. Refer to the Unstructured documentation here for more details on the available settings.

Runtime Configuration

The chunking configuration can be specified at runtime with the ingest_files endpoint, allowing dynamic adjustment of chunking parameters based on the input documents or specific use cases.

Combining Chunking with Other R2R Components

Chunking is a crucial part of the document processing pipeline in R2R. It works in conjunction with other components such as parsing, embedding, and retrieval. For example:

response = client.ingest_files(
    file_paths=["document.pdf"],
    ingestion_config={
        "provider": "unstructured_local",
        "chunking_strategy": "by_title",
        "max_characters": 1000
    },
    embedding_config={...},
    # ... other configurations
)

For more detailed information on configuring chunking and other ingestion settings, please refer to the Ingestion Configuration documentation.

Next Steps

To learn more about configuring other components of R2R, explore the following pages: