provider
str
default:"unstructured_local"
Which chunking provider to use. Options are “r2r”, “unstructured_local”, or “unstructured_api”.
max_chunk_size
Optional[int]
default:"None"
Sets a maximum size on output chunks.
Combine chunks smaller than this number of characters.
Maximum number of characters per chunk.
Whether to include coordinates in the output.
Encoding to use for text files.
Types of image blocks to extract.
gz_uncompressed_content_type
Content type for uncompressed gzip files.
Name of the high-resolution model to use.
include_orig_elements
Optional[bool]
default:"False"
Whether to include original elements in the output.
Whether to include page breaks in the output.
List of languages to consider for text processing.
Whether to allow sections to span multiple pages.
Start a new chunk after this many characters.
Languages to use for OCR.
output_format
str
default:"application/json"
Format of the output, given as a MIME type.
Number of characters to overlap between chunks.
Whether to overlap all chunks.
pdf_infer_table_structure
Whether to infer table structure in PDFs.
Threshold for considering chunks similar.
Types of tables to skip inferring.
split_pdf_concurrency_level
Concurrency level for splitting PDFs.
Whether to split PDFs by page.
Page number to start processing from.
Strategy for processing. Options are “auto”, “fast”, or “hi_res”.
chunking_strategy
Optional[str]
default:"by_title"
Strategy for chunking. Options are “by_title” or “basic”.
Whether to generate unique IDs for elements.
Whether to keep XML tags in the output.
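
As an illustration of how the named parameters above fit together, the following is a minimal sketch that merges user overrides onto the documented defaults and checks the two parameters whose options are listed (`provider` and `chunking_strategy`). The helper `build_chunking_config` and the constants `DEFAULTS` and `ALLOWED` are hypothetical names introduced for this example and are not part of any library API. Only parameters whose names and defaults appear in the list above are included, and the positive-value check on `max_chunk_size` is an assumption.

```python
from typing import Any, Dict

# Defaults copied from the parameter list above.
DEFAULTS: Dict[str, Any] = {
    "provider": "unstructured_local",
    "max_chunk_size": None,
    "include_orig_elements": False,
    "output_format": "application/json",
    "chunking_strategy": "by_title",
}

# Options listed in the descriptions of the two enumerated parameters.
ALLOWED: Dict[str, set] = {
    "provider": {"r2r", "unstructured_local", "unstructured_api"},
    "chunking_strategy": {"by_title", "basic"},
}


def build_chunking_config(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides onto the defaults and validate them."""
    config = {**DEFAULTS, **overrides}

    # Reject values outside the documented option sets.
    for key, allowed in ALLOWED.items():
        if config[key] not in allowed:
            raise ValueError(
                f"{key} must be one of {sorted(allowed)}, got {config[key]!r}"
            )

    # Assumption: a maximum chunk size, when set, should be a positive integer.
    size = config["max_chunk_size"]
    if size is not None and (not isinstance(size, int) or size <= 0):
        raise ValueError("max_chunk_size must be a positive integer or None")

    return config


config = build_chunking_config(
    provider="unstructured_api",
    chunking_strategy="basic",
    max_chunk_size=1024,
)
```

Parameters that are not overridden keep their documented defaults, so `config["output_format"]` remains `"application/json"` in this example. How the resulting dictionary is passed to the ingestion pipeline depends on the client in use and is not shown here.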