Files Module: Intelligent File Handling

The intellibricks.files module is designed to provide a robust and Pythonic way to handle files within your AI applications. It focuses on representing files as RawFile objects and provides functionalities for parsing and extracting content from various file types.

Core Concepts of the Files Module

  • RawFile Abstraction: The central class is RawFile, which represents a file in a structured manner. It encapsulates:
    • contents: The raw byte content of the file.

    • name: The name of the file (without path).

    • extension: The file extension (e.g., “pdf”, “docx”, “txt”).

  • File Loading: RawFile provides convenient class methods for loading files from:
    • File paths (RawFile.from_file_path)

    • Bytes data (RawFile.from_bytes)

    • In-memory file-like objects (RawFile.from_file_obj)

  • File Saving: You can save the contents of a RawFile to disk using RawFile.to_file_path.

  • File Parsing Infrastructure: The module provides a comprehensive system for file parsing, integrating various file parsers to enable extraction of structured information (text, images, tables) from different file types.

  • File Extension Management: The module helps in managing and determining file extensions, which is crucial for routing files to appropriate parsers.

Working with RawFile Objects

Let’s explore how to create and use RawFile objects.

Creating RawFile from a File Path

Assume you have a file named document.pdf in your project directory. You can create a RawFile object like this:

from intellibricks.files import RawFile

file_path = "document.pdf" # Or any path to your file
raw_file = RawFile.from_file_path(file_path)

print(f"File Name: {raw_file.name}")
print(f"File Extension: {raw_file.extension}")
# raw_file.contents now holds the raw bytes of the PDF file

Creating RawFile from Bytes Data

If you have file content in bytes format (e.g., read from a network stream or generated programmatically), you can create a RawFile using from_bytes:

file_bytes = b"%PDF-1.5... (PDF file content bytes) ..." # Example PDF bytes
raw_file_from_bytes = RawFile.from_bytes(file_bytes, "report.pdf")

print(f"File Name: {raw_file_from_bytes.name}") # Output: report.pdf
print(f"File Extension: {raw_file_from_bytes.extension}") # Output: pdf
# raw_file_from_bytes.contents holds the provided bytes

Creating RawFile from a File-Like Object

You can also create a RawFile from an in-memory file-like object (e.g., from io.BytesIO or when you receive a file object from a web request):

import io

# Simulate an in-memory file object
file_content_str = "This is the content of my text file."
file_obj = io.BytesIO(file_content_str.encode('utf-8'))

raw_file_from_obj = RawFile.from_file_obj(file_obj, "sample.txt")

print(f"File Name: {raw_file_from_obj.name}") # Output: sample.txt
print(f"File Extension: {raw_file_from_obj.extension}") # Output: txt
# raw_file_from_obj.contents holds the bytes read from file_obj

Saving RawFile Contents to Disk

To save the content of a RawFile to a new file path:

output_path = "output_documents/saved_document.pdf" # Define where to save
raw_file.to_file_path(output_path)

print(f"File saved to: {output_path}")

Using File Parsers

IntelliBricks provides a set of file parsers within the intellibricks.files.parsers module to extract structured content from different file formats. Here’s how you can use them:

Available Parsers

Currently, IntelliBricks offers the following file parsers:

  • TxtFileParser: For parsing plain text files (.txt).

  • PdfFileParser: For parsing PDF documents (.pdf).

You can import these parsers from intellibricks.files.parsers.

Basic Usage Example

Let’s demonstrate how to parse a PDF file to extract its content:

from intellibricks.files import RawFile
from intellibricks.files.parsers import PdfFileParser, TxtFileParser

# 1. Load a RawFile (e.g., from a file path)
raw_pdf_file = RawFile.from_file_path("document.pdf") # Replace with your PDF file
raw_txt_file = RawFile.from_file_path("document.txt") # Replace with your TXT file

# 2. Instantiate the appropriate parser
pdf_parser = PdfFileParser()
txt_parser = TxtFileParser()

# 3. Extract content using the parser
parsed_pdf_document = pdf_parser.parse(raw_pdf_file)
parsed_txt_document = txt_parser.parse(raw_txt_file)

# 4. Access parsed content (ParsedFile object)
print(f"Parsed document name: {parsed_pdf_document.name}")
for section in parsed_pdf_document.sections:
   print(f"\nSection {section.number}:")
   print(f"  Text (first 100 chars): {section.text[:100]}...")
   # ... access other parsed content like images, items, etc.

print(f"Parsed document name: {parsed_txt_document.name}")
for section in parsed_txt_document.sections:
   print(f"\nSection {section.number}:")
   print(f"  Text (first 100 chars): {section.text[:100]}...")

Handling Different File Types

To parse different file types, you would:

  1. Create a RawFile object for your file.

  2. Instantiate the appropriate parser class based on the file type (e.g., PdfFileParser for PDFs, DocxFileParser for DOCX files - when available).

  3. Call the parse() method of the parser, passing the RawFile object as input.

  4. Work with the returned ParsedFile object to access the extracted structured content.

Parsed File Structure

The output of file parsing is a ParsedFile object, which contains:

  • Sections: The document is divided into SectionContent objects, representing pages or logical sections. Each SectionContent contains:
    • Text: The extracted text content of the section.

    • Markdown: A Markdown representation of the section content, including headings, lists, and basic formatting.

    • Images: A list of Image objects found in the section, including image data and metadata.

    • Items: A list of structured PageItem objects, representing elements like paragraphs, headings, and tables.

Example of Parsed Content (Illustrative)

from intellibricks.files import RawFile, ParsedFile, PdfFileParser

# Assume you have a RawFile object loaded from a PDF
raw_pdf_file = RawFile.from_file_path("document.pdf")
pdf_parser = PdfFileParser()

parsed_document = pdf_parser.parse(raw_pdf_file)

print(f"Parsed document: {parsed_document.name}")
for section in parsed_document.sections:
   print(f"\nSection {section.number}:")
   print(f"  Text (first 100 chars): {section.text[:100]}...")
   if section.images:
      print(f"  Images found: {len(section.images)}")
   if section.items:
      print(f"  Items found: {len(section.items)}")
      # Iterate through items (TextPageItem, HeadingPageItem, TablePageItem, etc.)

API Reference