tests fixed

k
2026-02-17 15:55:45 +00:00 · 2025-03-06 16:58:27 -08:00 · 2025-03-06 16:51:17 -08:00 · 2025-03-06 16:37:45 -08:00 · 2025-03-06 16:30:07 -08:00 · 2025-03-06 16:30:07 -08:00
379 changed files with 15851 additions and 5516 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -0,0 +1 @@
+* @onyx-dot-app/onyx-core-team
--- a/.github/workflows/nightly-scan-licenses.yml
+++ b/.github/workflows/nightly-scan-licenses.yml
@@ -53,24 +53,90 @@ jobs:
          exclude: '(?i)^(pylint|aio[-_]*).*'
          
      - name: Print report
-        if: ${{ always() }}
+        if: always()
        run: echo "${{ steps.license_check_report.outputs.report }}"
      
      - name: Install npm dependencies
        working-directory: ./web
        run: npm ci
-        
-      - name: Run Trivy vulnerability scanner in repo mode
-        uses: aquasecurity/trivy-action@0.28.0
-        with:
-          scan-type: fs
-          scanners: license
-          format: table
-#           format: sarif
-#           output: trivy-results.sarif
-          severity: HIGH,CRITICAL

-#       - name: Upload Trivy scan results to GitHub Security tab
-#         uses: github/codeql-action/upload-sarif@v3
+        # Be careful enabling the SARIF output and upload: it may flood the Security tab
+        # with a huge number of items. Work out the issues before enabling the upload.
+#       - name: Run Trivy vulnerability scanner in repo mode
+#         if: always()
+#         uses: aquasecurity/trivy-action@0.29.0
 #         with:
-#           sarif_file: trivy-results.sarif
+#           scan-type: fs
+#           scan-ref: .
+#           scanners: license
+#           format: table
+#           severity: HIGH,CRITICAL
+# #           format: sarif
+# #           output: trivy-results.sarif
+# 
+# #       - name: Upload Trivy scan results to GitHub Security tab
+# #         uses: github/codeql-action/upload-sarif@v3
+# #         with:
+# #           sarif_file: trivy-results.sarif
+
+  scan-trivy:
+    # See https://runs-on.com/runners/linux/
+    runs-on: [runs-on,runner=2cpu-linux-x64,"run-id=${{ github.run_id }}"]
+      
+    steps:
+    - name: Set up Docker Buildx
+      uses: docker/setup-buildx-action@v3
+
+    - name: Login to Docker Hub
+      uses: docker/login-action@v3
+      with:
+        username: ${{ secrets.DOCKER_USERNAME }}
+        password: ${{ secrets.DOCKER_TOKEN }}
+
+    # Backend
+    - name: Pull backend docker image
+      run: docker pull onyxdotapp/onyx-backend:latest
+
+    - name: Run Trivy vulnerability scanner on backend
+      uses: aquasecurity/trivy-action@0.29.0
+      env:
+        TRIVY_DB_REPOSITORY: 'public.ecr.aws/aquasecurity/trivy-db:2'
+        TRIVY_JAVA_DB_REPOSITORY: 'public.ecr.aws/aquasecurity/trivy-java-db:1'
+      with:
+        image-ref: onyxdotapp/onyx-backend:latest
+        scanners: license
+        severity: HIGH,CRITICAL
+        vuln-type: library
+        exit-code: 0  # Set to 1 if we want a failed scan to fail the workflow
+
+    # Web server
+    - name: Pull web server docker image
+      run: docker pull onyxdotapp/onyx-web-server:latest
+          
+    - name: Run Trivy vulnerability scanner on web server
+      uses: aquasecurity/trivy-action@0.29.0
+      env:
+        TRIVY_DB_REPOSITORY: 'public.ecr.aws/aquasecurity/trivy-db:2'
+        TRIVY_JAVA_DB_REPOSITORY: 'public.ecr.aws/aquasecurity/trivy-java-db:1'
+      with:
+        image-ref: onyxdotapp/onyx-web-server:latest
+        scanners: license
+        severity: HIGH,CRITICAL
+        vuln-type: library
+        exit-code: 0
+
+    # Model server
+    - name: Pull model server docker image
+      run: docker pull onyxdotapp/onyx-model-server:latest
+
+    - name: Run Trivy vulnerability scanner on model server
+      uses: aquasecurity/trivy-action@0.29.0
+      env:
+        TRIVY_DB_REPOSITORY: 'public.ecr.aws/aquasecurity/trivy-db:2'
+        TRIVY_JAVA_DB_REPOSITORY: 'public.ecr.aws/aquasecurity/trivy-java-db:1'
+      with:
+        image-ref: onyxdotapp/onyx-model-server:latest
+        scanners: license
+        severity: HIGH,CRITICAL
+        vuln-type: library
+        exit-code: 0
--- a/.github/workflows/pr-integration-tests.yml
+++ b/.github/workflows/pr-integration-tests.yml
@@ -145,7 +145,7 @@ jobs:
        run: |
          cd deployment/docker_compose
          docker compose -f docker-compose.multitenant-dev.yml -p onyx-stack down -v
-      
+
      # NOTE: Use pre-ping/null pool to reduce flakiness due to dropped connections
      - name: Start Docker containers
        run: |
@@ -157,6 +157,7 @@ jobs:
          REQUIRE_EMAIL_VERIFICATION=false \
          DISABLE_TELEMETRY=true \
          IMAGE_TAG=test \
+          INTEGRATION_TESTS_MODE=true \
          docker compose -f docker-compose.dev.yml -p onyx-stack up -d
        id: start_docker

@@ -199,7 +200,7 @@ jobs:
          cd backend/tests/integration/mock_services
          docker compose -f docker-compose.mock-it-services.yml \
            -p mock-it-services-stack up -d
-      
+
      # NOTE: Use pre-ping/null to reduce flakiness due to dropped connections
      - name: Run Standard Integration Tests
        run: |
--- a/.github/workflows/pr-python-connector-tests.yml
+++ b/.github/workflows/pr-python-connector-tests.yml
@@ -1,6 +1,7 @@
 name: Connector Tests

 on:
+  merge_group:
  pull_request:
    branches: [main]
  schedule:
@@ -51,7 +52,7 @@ env:
 jobs:
  connectors-check:
    # See https://runs-on.com/runners/linux/
-    runs-on: [runs-on,runner=8cpu-linux-x64,"run-id=${{ github.run_id }}"]
+    runs-on: [runs-on, runner=8cpu-linux-x64, "run-id=${{ github.run_id }}"]

    env:
      PYTHONPATH: ./backend
@@ -76,7 +77,7 @@ jobs:
          pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt
          playwright install chromium
          playwright install-deps chromium
-          
+
      - name: Run Tests
        shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
        run: py.test -o junit_family=xunit2 -xv --ff backend/tests/daily/connectors
--- a/.github/workflows/pr-python-model-tests.yml
+++ b/.github/workflows/pr-python-model-tests.yml
@@ -1,18 +1,29 @@
-name: Connector Tests
+name: Model Server Tests

 on:
  schedule:
    # This cron expression runs the job daily at 16:00 UTC (9am PT)
    - cron: "0 16 * * *"
-
+  workflow_dispatch:
+    inputs:
+      branch:
+        description: 'Branch to run the workflow on'
+        required: false
+        default: 'main'
+        
 env:
  # Bedrock
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_REGION_NAME: ${{ secrets.AWS_REGION_NAME }}

-  # OpenAI
+  # API keys for testing
+  COHERE_API_KEY: ${{ secrets.COHERE_API_KEY }}
+  LITELLM_API_KEY: ${{ secrets.LITELLM_API_KEY }}
+  LITELLM_API_URL: ${{ secrets.LITELLM_API_URL }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  AZURE_API_KEY: ${{ secrets.AZURE_API_KEY }}
+  AZURE_API_URL: ${{ secrets.AZURE_API_URL }}

 jobs:
  model-check:
@@ -26,6 +37,23 @@ jobs:
      - name: Checkout code
        uses: actions/checkout@v4

+      - name: Login to Docker Hub
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKER_USERNAME }}
+          password: ${{ secrets.DOCKER_TOKEN }}
+
+      # tag every docker image with "test" so that we can spin up the correct set
+      # of images during testing
+
+      # We don't need to build the Web Docker image since it's not yet used
+      # in the integration tests. We have a separate action to verify that it builds
+      # successfully.
+      - name: Pull Model Server Docker image
+        run: |
+          docker pull onyxdotapp/onyx-model-server:latest
+          docker tag onyxdotapp/onyx-model-server:latest onyxdotapp/onyx-model-server:test
+          
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
@@ -41,6 +69,49 @@ jobs:
          pip install --retries 5 --timeout 30 -r backend/requirements/default.txt
          pip install --retries 5 --timeout 30 -r backend/requirements/dev.txt

+      - name: Start Docker containers
+        run: |
+          cd deployment/docker_compose
+          ENABLE_PAID_ENTERPRISE_EDITION_FEATURES=true \
+          AUTH_TYPE=basic \
+          REQUIRE_EMAIL_VERIFICATION=false \
+          DISABLE_TELEMETRY=true \
+          IMAGE_TAG=test \
+          docker compose -f docker-compose.model-server-test.yml -p onyx-stack up -d indexing_model_server
+        id: start_docker
+
+      - name: Wait for service to be ready
+        run: |
+          echo "Starting wait-for-service script..."
+
+          start_time=$(date +%s)
+          timeout=300  # 5 minutes in seconds
+
+          while true; do
+            current_time=$(date +%s)
+            elapsed_time=$((current_time - start_time))
+            
+            if [ $elapsed_time -ge $timeout ]; then
+              echo "Timeout reached. Service did not become ready in 5 minutes."
+              exit 1
+            fi
+            
+            # Treat any curl failure (for example exit code 56) as "not ready yet" and keep retrying
+            response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9000/api/health) || response="curl_error"
+            
+            if [ "$response" = "200" ]; then
+              echo "Service is ready!"
+              break
+            elif [ "$response" = "curl_error" ]; then
+              echo "Curl encountered an error, possibly exit code 56. Continuing to retry..."
+            else
+              echo "Service not ready yet (HTTP status $response). Retrying in 5 seconds..."
+            fi
+            
+            sleep 5
+          done
+          echo "Finished waiting for service."
+          
      - name: Run Tests
        shell: script -q -e -c "bash --noprofile --norc -eo pipefail {0}"
        run: |
@@ -56,3 +127,23 @@ jobs:
            -H 'Content-type: application/json' \
            --data '{"text":"Scheduled Model Tests failed! Check the run at: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"}' \
            $SLACK_WEBHOOK
+            
+      - name: Dump all-container logs (optional)
+        if: always()
+        run: |
+          cd deployment/docker_compose
+          docker compose -f docker-compose.model-server-test.yml -p onyx-stack logs --no-color > $GITHUB_WORKSPACE/docker-compose.log || true
+
+      - name: Upload logs
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: docker-all-logs
+          path: ${{ github.workspace }}/docker-compose.log
+          
+      - name: Stop Docker containers
+        if: always()
+        run: |
+          cd deployment/docker_compose
+          docker compose -f docker-compose.model-server-test.yml -p onyx-stack down -v
+          
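Review note: the "Wait for service to be ready" step above is a poll-with-deadline loop. A minimal Python sketch of the same control flow follows, assuming a caller-supplied `probe` that returns an HTTP status code or `None` on a transport error. `wait_until_ready`, `probe`, `clock`, and `sleep` are illustrative names, not part of this workflow or of Onyx.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    probe: Callable[[], Optional[int]],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns 200 or `timeout_s` elapses.

    `probe` is a hypothetical callable: it returns an HTTP status code,
    or None when the request itself failed (the "curl_error" case).
    """
    start = clock()
    while clock() - start < timeout_s:
        status = probe()
        if status == 200:
            return True
        # Any other status, or a transport error, means "not ready yet".
        sleep(interval_s)
    return False
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; the defaults mirror the 5-minute timeout and 5-second interval used in the step.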
--- a/README.md
+++ b/README.md
@@ -26,12 +26,12 @@

 <strong>[Onyx](https://www.onyx.app/)</strong> (formerly Danswer) is the AI platform connected to your company's docs, apps, and people.
 Onyx provides a feature rich Chat interface and plugs into any LLM of your choice.
-There are over 40 supported connectors such as Google Drive, Slack, Confluence, Salesforce, etc. which keep knowledge and permissions up to date.
-Create custom AI agents with unique prompts, knowledge, and actions the agents can take.
+Keep knowledge and access controls synced across more than 40 connectors such as Google Drive, Slack, Confluence, and Salesforce.
+Create custom AI agents with unique prompts, knowledge, and actions that the agents can take.
 Onyx can be deployed securely anywhere and for any scale - on a laptop, on-premise, or to cloud.


-<h3>Feature Showcase</h3>
+<h3>Feature Highlights</h3>

 **Deep research over your team's knowledge:**

@@ -63,22 +63,21 @@ We also have built-in support for high-availability/scalable deployment on Kuber
 References [here](https://github.com/onyx-dot-app/onyx/tree/main/deployment).


+## 🔍 Other Notable Benefits of Onyx
+- Custom deep learning models, available only through Onyx, used at indexing and inference time and improved by learning from user feedback.
+- Flexible security features like SSO (OIDC/SAML/OAuth2), RBAC, encryption of credentials, etc.
+- Knowledge curation features like document-sets, query history, usage analytics, etc.
+- Scalable deployment options tested up to tens of thousands of users and hundreds of millions of documents.
+
+
 ## 🚧 Roadmap
- Extensions to the Chrome Plugin
- Latest methods in information retrieval (StructRAG, LightGraphRAG, etc.)
+- New methods in information retrieval (StructRAG, LightGraphRAG, etc.)
 - Personalized Search
 - Organizational understanding and ability to locate and suggest experts from your team.
 - Code Search
 - SQL and Structured Query Language


-## 🔍 Other Notable Benefits of Onyx
- Custom deep learning models only through Onyx + learn from user feedback.
- Flexible security features like SSO (OIDC/SAML/OAuth2), RBAC, encryption of credentials, etc.
- Knowledge curation features like document-sets, query history, usage analytics, etc.
- Scalable deployment options tested up to many tens of thousands users and hundreds of millions of documents.
-
-
 ## 🔌 Connectors
 Keep knowledge and access up to sync across 40+ connectors:

--- a/backend/alembic/versions/3934b1bc7b62_update_github_connector_repo_name_to_.py
+++ b/backend/alembic/versions/3934b1bc7b62_update_github_connector_repo_name_to_.py
@@ -0,0 +1,125 @@
+"""Update GitHub connector repo_name to repositories
+
+Revision ID: 3934b1bc7b62
+Revises: b7c2b63c4a03
+Create Date: 2025-03-05 10:50:30.516962
+
+"""
+from alembic import op
+import sqlalchemy as sa
+import json
+import logging
+
+# revision identifiers, used by Alembic.
+revision = "3934b1bc7b62"
+down_revision = "b7c2b63c4a03"
+branch_labels = None
+depends_on = None
+
+logger = logging.getLogger("alembic.runtime.migration")
+
+
+def upgrade() -> None:
+    # Get all GitHub connectors
+    conn = op.get_bind()
+
+    # First get all GitHub connectors
+    github_connectors = conn.execute(
+        sa.text(
+            """
+            SELECT id, connector_specific_config
+            FROM connector
+            WHERE source = 'GITHUB'
+            """
+        )
+    ).fetchall()
+
+    # Update each connector's config
+    updated_count = 0
+    for connector_id, config in github_connectors:
+        try:
+            if not config:
+                logger.warning(f"Connector {connector_id} has no config, skipping")
+                continue
+
+            # Parse the config if it's a string
+            if isinstance(config, str):
+                config = json.loads(config)
+
+            if "repo_name" not in config:
+                continue
+
+            # Create new config with repositories instead of repo_name
+            new_config = dict(config)
+            repo_name_value = new_config.pop("repo_name")
+            new_config["repositories"] = repo_name_value
+
+            # Update the connector with the new config
+            conn.execute(
+                sa.text(
+                    """
+                    UPDATE connector
+                    SET connector_specific_config = :new_config
+                    WHERE id = :connector_id
+                    """
+                ),
+                {"connector_id": connector_id, "new_config": json.dumps(new_config)},
+            )
+            updated_count += 1
+        except Exception as e:
+            logger.error(f"Error updating connector {connector_id}: {str(e)}")
+
+
+def downgrade() -> None:
+    # Get all GitHub connectors
+    conn = op.get_bind()
+
+    logger.debug(
+        "Starting rollback of GitHub connectors from repositories to repo_name"
+    )
+
+    github_connectors = conn.execute(
+        sa.text(
+            """
+            SELECT id, connector_specific_config
+            FROM connector
+            WHERE source = 'GITHUB'
+            """
+        )
+    ).fetchall()
+
+    logger.debug(f"Found {len(github_connectors)} GitHub connectors to rollback")
+
+    # Revert each GitHub connector to use repo_name instead of repositories
+    reverted_count = 0
+    for connector_id, config in github_connectors:
+        try:
+            if not config:
+                continue
+
+            # Parse the config if it's a string
+            if isinstance(config, str):
+                config = json.loads(config)
+
+            if "repositories" not in config:
+                continue
+
+            # Create new config with repo_name instead of repositories
+            new_config = dict(config)
+            repositories_value = new_config.pop("repositories")
+            new_config["repo_name"] = repositories_value
+
+            # Update the connector with the new config
+            conn.execute(
+                sa.text(
+                    """
+                    UPDATE connector
+                    SET connector_specific_config = :new_config
+                    WHERE id = :connector_id
+                    """
+                ),
+                {"new_config": json.dumps(new_config), "connector_id": connector_id},
+            )
+            reverted_count += 1
+        except Exception as e:
+            logger.error(f"Error reverting connector {connector_id}: {str(e)}")
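Review note: both directions of the GitHub connector migration above apply the same per-row transformation, a key rename on a JSON config that may arrive as a dict or as a string. A small sketch of that step in isolation, under the assumption that it can be factored into a pure function; `rename_config_key` is a hypothetical helper and `repo_owner` is only an illustrative extra key.

```python
import json
from typing import Any, Optional


def rename_config_key(config: Any, old_key: str, new_key: str) -> Optional[dict]:
    """Return a copy of `config` with `old_key` renamed to `new_key`.

    Returns None when there is nothing to update (empty config or key absent),
    which corresponds to the `continue` branches in the migration.
    """
    if not config:
        return None
    # The column value may come back as a JSON string depending on the driver.
    if isinstance(config, str):
        config = json.loads(config)
    if old_key not in config:
        return None
    new_config = dict(config)  # do not mutate the row that was read
    new_config[new_key] = new_config.pop(old_key)
    return new_config
```

With a helper of this shape, upgrade and downgrade differ only in the argument order, for example `rename_config_key(cfg, "repo_name", "repositories")` and the reverse.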
--- a/backend/alembic/versions/3bd4c84fe72f_improved_index.py
+++ b/backend/alembic/versions/3bd4c84fe72f_improved_index.py
@@ -0,0 +1,84 @@
+"""improved index
+
+Revision ID: 3bd4c84fe72f
+Revises: 8f43500ee275
+Create Date: 2025-02-26 13:07:56.217791
+
+"""
+from alembic import op
+
+
+# revision identifiers, used by Alembic.
+revision = "3bd4c84fe72f"
+down_revision = "8f43500ee275"
+branch_labels = None
+depends_on = None
+
+
+# NOTE:
+# This migration addresses issues with the previous migration (8f43500ee275) which caused
+# an outage by creating an index without using CONCURRENTLY. This migration:
+#
+# 1. Creates more efficient full-text search capabilities using tsvector columns and GIN indexes
+# 2. Uses CONCURRENTLY for all index creation to prevent table locking
+# 3. Explicitly manages transactions with COMMIT statements to allow CONCURRENTLY to work
+# (see: https://www.postgresql.org/docs/9.4/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY)
+# (see: https://github.com/sqlalchemy/alembic/issues/277)
+# 4. Adds indexes to both chat_message and chat_session tables for comprehensive search
+
+
+def upgrade() -> None:
+    # Create a GIN index for full-text search on chat_message.message
+    op.execute(
+        """
+        ALTER TABLE chat_message
+        ADD COLUMN message_tsv tsvector
+        GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;
+        """
+    )
+
+    # Commit the current transaction before creating concurrent indexes
+    op.execute("COMMIT")
+
+    op.execute(
+        """
+        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_message_tsv
+        ON chat_message
+        USING GIN (message_tsv)
+        """
+    )
+
+    # Also add a stored tsvector column for chat_session.description
+    op.execute(
+        """
+        ALTER TABLE chat_session
+        ADD COLUMN description_tsv tsvector
+        GENERATED ALWAYS AS (to_tsvector('english', coalesce(description, ''))) STORED;
+        """
+    )
+
+    # Commit again before creating the second concurrent index
+    op.execute("COMMIT")
+
+    op.execute(
+        """
+        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_session_desc_tsv
+        ON chat_session
+        USING GIN (description_tsv)
+        """
+    )
+
+
+def downgrade() -> None:
+    # Drop the indexes first (use CONCURRENTLY for dropping too)
+    op.execute("COMMIT")
+    op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_message_tsv;")
+
+    op.execute("COMMIT")
+    op.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chat_session_desc_tsv;")
+
+    # Then drop the columns
+    op.execute("ALTER TABLE chat_message DROP COLUMN IF EXISTS message_tsv;")
+    op.execute("ALTER TABLE chat_session DROP COLUMN IF EXISTS description_tsv;")
+
+    op.execute("DROP INDEX IF EXISTS idx_chat_message_message_lower;")
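Review note: the NOTE in this migration says `CREATE INDEX CONCURRENTLY` only works once the surrounding transaction has been committed. The sketch below isolates that two-statement pattern behind an injected `execute` callable, which stands in for `op.execute`. `create_gin_index_concurrently` is a hypothetical helper, and the identifiers are assumed to be trusted constants rather than user input.

```python
from typing import Callable


def create_gin_index_concurrently(
    execute: Callable[[str], None],
    table: str,
    column: str,
    index_name: str,
) -> None:
    """Issue the COMMIT + CREATE INDEX CONCURRENTLY pair used in the migration.

    `execute` is a stand-in for alembic's op.execute (an assumption of this
    sketch); passing a list's `append` records the statements instead.
    """
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    # so the transaction opened by the migration runner is ended first.
    execute("COMMIT")
    execute(
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} USING GIN ({column})"
    )
```

Alembic also documents `op.get_context().autocommit_block()` for statements that must run outside a transaction; whether that would suit this migration better than explicit `COMMIT` statements is not something this diff settles.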
--- a/backend/alembic/versions/8f43500ee275_add_index.py
+++ b/backend/alembic/versions/8f43500ee275_add_index.py
@@ -0,0 +1,32 @@
+"""add index
+
+Revision ID: 8f43500ee275
+Revises: da42808081e3
+Create Date: 2025-02-24 17:35:33.072714
+
+"""
+from alembic import op
+
+
+# revision identifiers, used by Alembic.
+revision = "8f43500ee275"
+down_revision = "da42808081e3"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # Create a basic index on the lowercase message column for direct text matching
+    # Limit to 1500 characters to stay well under the 2704-byte row limit of btree version 4
+    # op.execute(
+    #     """
+    #     CREATE INDEX idx_chat_message_message_lower
+    #     ON chat_message (LOWER(substring(message, 1, 1500)))
+    #     """
+    # )
+    pass
+
+
+def downgrade() -> None:
+    # Drop the index
+    op.execute("DROP INDEX IF EXISTS idx_chat_message_message_lower;")
--- a/backend/alembic/versions/b7c2b63c4a03_add_background_reindex_enabled_field.py
+++ b/backend/alembic/versions/b7c2b63c4a03_add_background_reindex_enabled_field.py
@@ -0,0 +1,55 @@
+"""add background_reindex_enabled field
+
+Revision ID: b7c2b63c4a03
+Revises: f11b408e39d3
+Create Date: 2024-03-26 12:34:56.789012
+
+"""
+from alembic import op
+import sqlalchemy as sa
+
+from onyx.db.enums import EmbeddingPrecision
+
+
+# revision identifiers, used by Alembic.
+revision = "b7c2b63c4a03"
+down_revision = "f11b408e39d3"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # Add background_reindex_enabled column with default value of True
+    op.add_column(
+        "search_settings",
+        sa.Column(
+            "background_reindex_enabled",
+            sa.Boolean(),
+            nullable=False,
+            server_default="true",
+        ),
+    )
+
+    # Add embedding_precision column with default value of FLOAT
+    op.add_column(
+        "search_settings",
+        sa.Column(
+            "embedding_precision",
+            sa.Enum(EmbeddingPrecision, native_enum=False),
+            nullable=False,
+            server_default=EmbeddingPrecision.FLOAT.name,
+        ),
+    )
+
+    # Add reduced_dimension column with default value of None
+    op.add_column(
+        "search_settings",
+        sa.Column("reduced_dimension", sa.Integer(), nullable=True),
+    )
+
+
+def downgrade() -> None:
+    # Remove the columns added in upgrade()
+    op.drop_column("search_settings", "background_reindex_enabled")
+    op.drop_column("search_settings", "embedding_precision")
+    op.drop_column("search_settings", "reduced_dimension")
--- a/backend/alembic/versions/da42808081e3_migrate_jira_connectors_to_new_format.py
+++ b/backend/alembic/versions/da42808081e3_migrate_jira_connectors_to_new_format.py
@@ -0,0 +1,120 @@
+"""migrate jira connectors to new format
+
+Revision ID: da42808081e3
+Revises: f13db29f3101
+Create Date: 2025-02-24 11:24:54.396040
+
+"""
+from alembic import op
+import sqlalchemy as sa
+import json
+
+from onyx.configs.constants import DocumentSource
+from onyx.connectors.onyx_jira.utils import extract_jira_project
+
+
+# revision identifiers, used by Alembic.
+revision = "da42808081e3"
+down_revision = "f13db29f3101"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # Get all Jira connectors
+    conn = op.get_bind()
+
+    # First get all Jira connectors
+    jira_connectors = conn.execute(
+        sa.text(
+            """
+            SELECT id, connector_specific_config
+            FROM connector
+            WHERE source = :source
+            """
+        ),
+        {"source": DocumentSource.JIRA.value.upper()},
+    ).fetchall()
+
+    # Update each connector's config
+    for connector_id, old_config in jira_connectors:
+        if not old_config:
+            continue
+
+        # Extract project key from URL if it exists
+        new_config: dict[str, str | None] = {}
+        if project_url := old_config.get("jira_project_url"):
+            # Parse the URL to get base and project
+            try:
+                jira_base, project_key = extract_jira_project(project_url)
+                new_config = {"jira_base_url": jira_base, "project_key": project_key}
+            except ValueError:
+                # If URL parsing fails, just use the URL as the base
+                new_config = {
+                    "jira_base_url": project_url.split("/projects/")[0],
+                    "project_key": None,
+                }
+        else:
+            # For connectors without a project URL, we need admin intervention
+            # Mark these for review
+            print(
+                f"WARNING: Jira connector {connector_id} has no project URL configured"
+            )
+            continue
+
+        # Update the connector config
+        conn.execute(
+            sa.text(
+                """
+                UPDATE connector
+                SET connector_specific_config = :new_config
+                WHERE id = :id
+                """
+            ),
+            {"id": connector_id, "new_config": json.dumps(new_config)},
+        )
+
+
+def downgrade() -> None:
+    # Get all Jira connectors
+    conn = op.get_bind()
+
+    # First get all Jira connectors
+    jira_connectors = conn.execute(
+        sa.text(
+            """
+            SELECT id, connector_specific_config
+            FROM connector
+            WHERE source = :source
+            """
+        ),
+        {"source": DocumentSource.JIRA.value.upper()},
+    ).fetchall()
+
+    # Update each connector's config back to the old format
+    for connector_id, new_config in jira_connectors:
+        if not new_config:
+            continue
+
+        old_config = {}
+        base_url = new_config.get("jira_base_url")
+        project_key = new_config.get("project_key")
+
+        if base_url and project_key:
+            old_config = {"jira_project_url": f"{base_url}/projects/{project_key}"}
+        elif base_url:
+            old_config = {"jira_project_url": base_url}
+        else:
+            continue
+
+        # Update the connector config
+        conn.execute(
+            sa.text(
+                """
+                UPDATE connector
+                SET connector_specific_config = :old_config
+                WHERE id = :id
+                """
+            ),
+            {"id": connector_id, "old_config": json.dumps(old_config)},
+        )
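Review note: the Jira migration above converts between a single `jira_project_url` and a `jira_base_url` / `project_key` pair. The sketch below shows one way that round trip could work. It is not the real `extract_jira_project` from `onyx.connectors.onyx_jira.utils`, whose behavior this diff does not show; both function names and the example URL are hypothetical.

```python
from typing import Optional, Tuple


def split_jira_project_url(project_url: str) -> Tuple[str, Optional[str]]:
    """Split a project URL into (base_url, project_key).

    Hypothetical stand-in: when no key follows "/projects/", the key is None,
    mirroring the migration's fallback branch.
    """
    base, sep, rest = project_url.partition("/projects/")
    # Keep only the first path segment after /projects/, without any query string.
    key = rest.split("/")[0].split("?")[0] if sep else ""
    return base.rstrip("/"), (key or None)


def join_jira_project_url(base_url: str, project_key: Optional[str]) -> str:
    """Rebuild the old-format URL, as the downgrade path does."""
    if project_key:
        return f"{base_url}/projects/{project_key}"
    return base_url
```

Note that the round trip is lossy for URLs with extra path segments after the project key, which is consistent with the downgrade rebuilding only `{base_url}/projects/{project_key}`.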
--- a/backend/alembic/versions/f11b408e39d3_force_lowercase_all_users.py
+++ b/backend/alembic/versions/f11b408e39d3_force_lowercase_all_users.py
@@ -0,0 +1,36 @@
+"""force lowercase all users
+
+Revision ID: f11b408e39d3
+Revises: 3bd4c84fe72f
+Create Date: 2025-02-26 17:04:55.683500
+
+"""
+
+
+# revision identifiers, used by Alembic.
+revision = "f11b408e39d3"
+down_revision = "3bd4c84fe72f"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # 1) Convert all existing user emails to lowercase
+    from alembic import op
+
+    op.execute(
+        """
+        UPDATE "user"
+        SET email = LOWER(email)
+        """
+    )
+
+    # 2) Add a check constraint to ensure emails are always lowercase
+    op.create_check_constraint("ensure_lowercase_email", "user", "email = LOWER(email)")
+
+
+def downgrade() -> None:
+    # Drop the check constraint
+    from alembic import op
+
+    op.drop_constraint("ensure_lowercase_email", "user", type_="check")
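Review note: the migration above normalizes existing emails first and only then adds the `email = LOWER(email)` check, since the constraint would otherwise be rejected by rows that still contain uppercase characters. The sketch below demonstrates that ordering with the standard-library `sqlite3` module instead of PostgreSQL. SQLite cannot add a CHECK to an existing table, so the sketch copies rows into a constrained table; `user_raw` and `try_insert` are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_raw (email TEXT)")
conn.execute("INSERT INTO user_raw VALUES ('Alice@Example.com')")

# Step 1: normalize existing rows, as the migration's UPDATE does.
conn.execute("UPDATE user_raw SET email = LOWER(email)")

# Step 2: enforce the invariant going forward. In this SQLite sketch the
# constraint lives on a new table that the normalized rows are copied into.
conn.execute('CREATE TABLE "user" (email TEXT CHECK (email = LOWER(email)))')
conn.execute('INSERT INTO "user" SELECT email FROM user_raw')


def try_insert(email: str) -> bool:
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        conn.execute('INSERT INTO "user" VALUES (?)', (email,))
        return True
    except sqlite3.IntegrityError:
        return False
```

One consequence worth checking in review: after this constraint exists, any write path that stores an email without lowercasing it first will fail at the database rather than silently creating a mixed-case duplicate.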
--- a/backend/alembic/versions/f13db29f3101_add_composite_index_for_last_modified_.py
+++ b/backend/alembic/versions/f13db29f3101_add_composite_index_for_last_modified_.py
@@ -0,0 +1,27 @@
+"""Add composite index for last_modified and last_synced to document
+
+Revision ID: f13db29f3101
+Revises: acaab4ef4507
+Create Date: 2025-02-18 22:48:11.511389
+
+"""
+from alembic import op
+
+# revision identifiers, used by Alembic.
+revision = "f13db29f3101"
+down_revision = "acaab4ef4507"
+branch_labels: str | None = None
+depends_on: str | None = None
+
+
+def upgrade() -> None:
+    op.create_index(
+        "ix_document_sync_status",
+        "document",
+        ["last_modified", "last_synced"],
+        unique=False,
+    )
+
+
+def downgrade() -> None:
+    op.drop_index("ix_document_sync_status", table_name="document")
--- a/backend/alembic_tenants/versions/34e3630c7f32_lowercase_multi_tenant_user_auth.py
+++ b/backend/alembic_tenants/versions/34e3630c7f32_lowercase_multi_tenant_user_auth.py
@@ -0,0 +1,42 @@
+"""lowercase multi-tenant user auth
+
+Revision ID: 34e3630c7f32
+Revises: a4f6ee863c47
+Create Date: 2025-02-26 15:03:01.211894
+
+"""
+from alembic import op
+
+
+# revision identifiers, used by Alembic.
+revision = "34e3630c7f32"
+down_revision = "a4f6ee863c47"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # 1) Convert all existing rows to lowercase
+    op.execute(
+        """
+        UPDATE user_tenant_mapping
+        SET email = LOWER(email)
+        """
+    )
+    # 2) Add a check constraint so that emails cannot be written in uppercase
+    op.create_check_constraint(
+        "ensure_lowercase_email",
+        "user_tenant_mapping",
+        "email = LOWER(email)",
+        schema="public",
+    )
+
+
+def downgrade() -> None:
+    # Drop the check constraint
+    op.drop_constraint(
+        "ensure_lowercase_email",
+        "user_tenant_mapping",
+        schema="public",
+        type_="check",
+    )
--- a/backend/ee/onyx/background/celery/apps/primary.py
+++ b/backend/ee/onyx/background/celery/apps/primary.py
@@ -4,12 +4,11 @@ from ee.onyx.server.reporting.usage_export_generation import create_new_usage_re
 from onyx.background.celery.apps.primary import celery_app
 from onyx.background.task_utils import build_celery_task_wrapper
 from onyx.configs.app_configs import JOB_TIMEOUT
-from onyx.db.chat import delete_chat_sessions_older_than
-from onyx.db.engine import get_session_with_tenant
+from onyx.db.chat import delete_chat_session
+from onyx.db.chat import get_chat_sessions_older_than
+from onyx.db.engine import get_session_with_current_tenant
 from onyx.server.settings.store import load_settings
 from onyx.utils.logger import setup_logger
-from shared_configs.configs import MULTI_TENANT
-from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR

 logger = setup_logger()

@@ -18,11 +17,28 @@ logger = setup_logger()

@build_celery_task_wrapper(name_chat_ttl_task)
@celery_app.task(soft_time_limit=JOB_TIMEOUT)
-def perform_ttl_management_task(
-    retention_limit_days: int, *, tenant_id: str | None
-) -> None:
-    with get_session_with_tenant(tenant_id=tenant_id) as db_session:
-        delete_chat_sessions_older_than(retention_limit_days, db_session)
+def perform_ttl_management_task(retention_limit_days: int, *, tenant_id: str) -> None:
+    with get_session_with_current_tenant() as db_session:
+        old_chat_sessions = get_chat_sessions_older_than(
+            retention_limit_days, db_session
+        )
+
+    for user_id, session_id in old_chat_sessions:
+        # one session per delete so that we don't blow up if a deletion fails.
+        with get_session_with_current_tenant() as db_session:
+            try:
+                delete_chat_session(
+                    user_id,
+                    session_id,
+                    db_session,
+                    include_deleted=True,
+                    hard_delete=True,
+                )
+            except Exception:
+                logger.exception(
+                    "delete_chat_session raised an exception. "
+                    f"user_id={user_id} session_id={session_id}"
+                )


 #####
@@ -35,24 +51,19 @@ def perform_ttl_management_task(
    ignore_result=True,
    soft_time_limit=JOB_TIMEOUT,
 )
-def check_ttl_management_task(*, tenant_id: str | None) -> None:
+def check_ttl_management_task(*, tenant_id: str) -> None:
    """Runs periodically to check if any ttl tasks should be run and adds them
    to the queue"""
-    token = None
-    if MULTI_TENANT and tenant_id is not None:
-        token = CURRENT_TENANT_ID_CONTEXTVAR.set(tenant_id)

    settings = load_settings()
    retention_limit_days = settings.maximum_chat_retention_days
-    with get_session_with_tenant(tenant_id=tenant_id) as db_session:
+    with get_session_with_current_tenant() as db_session:
        if should_perform_chat_ttl_check(retention_limit_days, db_session):
            perform_ttl_management_task.apply_async(
                kwargs=dict(
                    retention_limit_days=retention_limit_days, tenant_id=tenant_id
                ),
            )
-    if token is not None:
-        CURRENT_TENANT_ID_CONTEXTVAR.reset(token)


 @celery_app.task(
@@ -60,9 +71,9 @@ def check_ttl_management_task(*, tenant_id: str | None) -> None:
    ignore_result=True,
    soft_time_limit=JOB_TIMEOUT,
 )
-def autogenerate_usage_report_task(*, tenant_id: str | None) -> None:
+def autogenerate_usage_report_task(*, tenant_id: str) -> None:
    """This generates usage report under the /admin/generate-usage/report endpoint"""
-    with get_session_with_tenant(tenant_id=tenant_id) as db_session:
+    with get_session_with_current_tenant() as db_session:
        create_new_usage_report(
            db_session=db_session,
            user_id=None,
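
Review note: the reworked `perform_ttl_management_task` reads the list of old sessions once, then opens a fresh database session for each delete so that one failing row cannot poison the rest of the batch. The sketch below illustrates that pattern only. `fake_session`, `delete_each_in_own_session`, and `flaky_delete` are hypothetical stand-ins, not the onyx helpers, and the returned list of failures is an addition for illustration (the task in the diff only logs).

```python
import logging
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


@contextmanager
def fake_session() -> Iterator[dict]:
    """Hypothetical stand-in for get_session_with_current_tenant()."""
    session: dict = {"open": True}
    try:
        yield session
    finally:
        session["open"] = False


def delete_each_in_own_session(
    items: Iterable[tuple[str, str]],
    delete_one: Callable[[str, str, dict], None],
) -> list[tuple[str, str]]:
    """Delete every item in its own session and return the items that failed."""
    failed: list[tuple[str, str]] = []
    for user_id, session_id in items:
        # A new session per item: a failure here leaves later items untouched.
        with fake_session() as db_session:
            try:
                delete_one(user_id, session_id, db_session)
            except Exception:
                logger.exception(
                    "delete failed user_id=%s session_id=%s", user_id, session_id
                )
                failed.append((user_id, session_id))
    return failed


def flaky_delete(user_id: str, session_id: str, db_session: dict) -> None:
    """Hypothetical delete that fails for one specific session id."""
    if session_id == "bad":
        raise ValueError("simulated deletion failure")
```

Calling `delete_each_in_own_session([("u1", "ok"), ("u2", "bad"), ("u3", "ok")], flaky_delete)` processes all three items and reports only the failing pair.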
--- a/backend/ee/onyx/background/celery/tasks/vespa/tasks.py
+++ b/backend/ee/onyx/background/celery/tasks/vespa/tasks.py
@@ -18,7 +18,7 @@ logger = setup_logger()


 def monitor_usergroup_taskset(
-    tenant_id: str | None, key_bytes: bytes, r: Redis, db_session: Session
+    tenant_id: str, key_bytes: bytes, r: Redis, db_session: Session
 ) -> None:
    """This function is likely to move in the worker refactor happening next."""
    fence_key = key_bytes.decode("utf-8")
--- a/backend/ee/onyx/configs/app_configs.py
+++ b/backend/ee/onyx/configs/app_configs.py
@@ -59,10 +59,14 @@ SUPER_CLOUD_API_KEY = os.environ.get("SUPER_CLOUD_API_KEY", "api_key")

 OAUTH_SLACK_CLIENT_ID = os.environ.get("OAUTH_SLACK_CLIENT_ID", "")
 OAUTH_SLACK_CLIENT_SECRET = os.environ.get("OAUTH_SLACK_CLIENT_SECRET", "")
-OAUTH_CONFLUENCE_CLIENT_ID = os.environ.get("OAUTH_CONFLUENCE_CLIENT_ID", "")
-OAUTH_CONFLUENCE_CLIENT_SECRET = os.environ.get("OAUTH_CONFLUENCE_CLIENT_SECRET", "")
-OAUTH_JIRA_CLIENT_ID = os.environ.get("OAUTH_JIRA_CLIENT_ID", "")
-OAUTH_JIRA_CLIENT_SECRET = os.environ.get("OAUTH_JIRA_CLIENT_SECRET", "")
+OAUTH_CONFLUENCE_CLOUD_CLIENT_ID = os.environ.get(
+    "OAUTH_CONFLUENCE_CLOUD_CLIENT_ID", ""
+)
+OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET = os.environ.get(
+    "OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET", ""
+)
+OAUTH_JIRA_CLOUD_CLIENT_ID = os.environ.get("OAUTH_JIRA_CLOUD_CLIENT_ID", "")
+OAUTH_JIRA_CLOUD_CLIENT_SECRET = os.environ.get("OAUTH_JIRA_CLOUD_CLIENT_SECRET", "")
 OAUTH_GOOGLE_DRIVE_CLIENT_ID = os.environ.get("OAUTH_GOOGLE_DRIVE_CLIENT_ID", "")
 OAUTH_GOOGLE_DRIVE_CLIENT_SECRET = os.environ.get(
    "OAUTH_GOOGLE_DRIVE_CLIENT_SECRET", ""
--- a/backend/ee/onyx/db/connector_credential_pair.py
+++ b/backend/ee/onyx/db/connector_credential_pair.py
@@ -4,6 +4,7 @@ from sqlalchemy.orm import Session
 from onyx.configs.constants import DocumentSource
 from onyx.db.connector_credential_pair import get_connector_credential_pair
 from onyx.db.enums import AccessType
+from onyx.db.enums import ConnectorCredentialPairStatus
 from onyx.db.models import Connector
 from onyx.db.models import ConnectorCredentialPair
 from onyx.db.models import UserGroup__ConnectorCredentialPair
@@ -35,10 +36,11 @@ def _delete_connector_credential_pair_user_groups_relationship__no_commit(
 def get_cc_pairs_by_source(
    db_session: Session,
    source_type: DocumentSource,
-    only_sync: bool,
+    access_type: AccessType | None = None,
+    status: ConnectorCredentialPairStatus | None = None,
 ) -> list[ConnectorCredentialPair]:
    """
-    Get all cc_pairs for a given source type (and optionally only sync)
+    Get all cc_pairs for a given source type with optional filtering by access_type and status
    result is sorted by cc_pair id
    """
    query = (
@@ -48,8 +50,11 @@ def get_cc_pairs_by_source(
        .order_by(ConnectorCredentialPair.id)
    )

-    if only_sync:
-        query = query.filter(ConnectorCredentialPair.access_type == AccessType.SYNC)
+    if access_type is not None:
+        query = query.filter(ConnectorCredentialPair.access_type == access_type)
+
+    if status is not None:
+        query = query.filter(ConnectorCredentialPair.status == status)

    cc_pairs = query.all()
    return cc_pairs
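
Review note: replacing the `only_sync` flag with optional `access_type` and `status` arguments means each filter is applied only when its argument is not `None`. A minimal in-memory sketch of that behaviour follows. `DemoAccessType`, `DemoStatus`, `DemoPair`, and `filter_pairs` are hypothetical names, and the enum members are assumptions rather than the real onyx enums.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DemoAccessType(Enum):
    PUBLIC = "public"
    SYNC = "sync"


class DemoStatus(Enum):
    ACTIVE = "active"
    PAUSED = "paused"


@dataclass(frozen=True)
class DemoPair:
    id: int
    access_type: DemoAccessType
    status: DemoStatus


def filter_pairs(
    pairs: list[DemoPair],
    access_type: Optional[DemoAccessType] = None,
    status: Optional[DemoStatus] = None,
) -> list[DemoPair]:
    """Return pairs sorted by id, applying each filter only when it is given."""
    result = sorted(pairs, key=lambda p: p.id)
    if access_type is not None:
        result = [p for p in result if p.access_type == access_type]
    if status is not None:
        result = [p for p in result if p.status == status]
    return result


PAIRS = [
    DemoPair(3, DemoAccessType.SYNC, DemoStatus.PAUSED),
    DemoPair(1, DemoAccessType.SYNC, DemoStatus.ACTIVE),
    DemoPair(2, DemoAccessType.PUBLIC, DemoStatus.ACTIVE),
]
```

Under these assumptions, a caller that used to pass `only_sync=True` would pass the sync access type instead, and can narrow further by status.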
--- a/backend/ee/onyx/db/query_history.py
+++ b/backend/ee/onyx/db/query_history.py
@@ -134,7 +134,9 @@ def fetch_chat_sessions_eagerly_by_time(
    limit: int | None = 500,
    initial_time: datetime | None = None,
 ) -> list[ChatSession]:
-    time_order: UnaryExpression = desc(ChatSession.time_created)
+    """Sorted by oldest to newest, then by message id"""
+
+    asc_time_order: UnaryExpression = asc(ChatSession.time_created)
    message_order: UnaryExpression = asc(ChatMessage.id)

    filters: list[ColumnElement | BinaryExpression] = [
@@ -147,8 +149,7 @@ def fetch_chat_sessions_eagerly_by_time(
    subquery = (
        db_session.query(ChatSession.id, ChatSession.time_created)
        .filter(*filters)
-        .order_by(ChatSession.id, time_order)
-        .distinct(ChatSession.id)
+        .order_by(asc_time_order)
        .limit(limit)
        .subquery()
    )
@@ -164,7 +165,7 @@ def fetch_chat_sessions_eagerly_by_time(
                ChatMessage.chat_message_feedbacks
            ),
        )
-        .order_by(time_order, message_order)
+        .order_by(asc_time_order, message_order)
    )

    chat_sessions = query.all()
--- a/backend/ee/onyx/db/usage_export.py
+++ b/backend/ee/onyx/db/usage_export.py
@@ -16,13 +16,18 @@ from onyx.db.models import UsageReport
 from onyx.file_store.file_store import get_default_file_store


-# Gets skeletons of all message
+# Gets skeletons of all messages in the given range
 def get_empty_chat_messages_entries__paginated(
    db_session: Session,
    period: tuple[datetime, datetime],
    limit: int | None = 500,
    initial_time: datetime | None = None,
 ) -> tuple[Optional[datetime], list[ChatMessageSkeleton]]:
+    """Returns a tuple where:
+    first element is the most recent timestamp out of the sessions iterated
+    - this timestamp can be used to paginate forward in time
+    second element is a list of messages belonging to all the sessions iterated
+    """
    chat_sessions = fetch_chat_sessions_eagerly_by_time(
        start=period[0],
        end=period[1],
@@ -52,18 +57,17 @@ def get_empty_chat_messages_entries__paginated(
    if len(chat_sessions) == 0:
        return None, []

-    return chat_sessions[0].time_created, message_skeletons
+    return chat_sessions[-1].time_created, message_skeletons


 def get_all_empty_chat_message_entries(
    db_session: Session,
    period: tuple[datetime, datetime],
 ) -> Generator[list[ChatMessageSkeleton], None, None]:
+    """period is the range of time over which to fetch messages."""
    initial_time: Optional[datetime] = period[0]
-    ind = 0
    while True:
-        ind += 1
-
+        # iterate from oldest to newest
        time_created, message_skeletons = get_empty_chat_messages_entries__paginated(
            db_session,
            period,
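
Review note: with sessions now sorted oldest to newest, the cursor for the next page has to be the last element's timestamp. Returning the first element would keep the cursor near the start of the range. The sketch below shows forward pagination on plain timestamps. `fetch_page` and `iterate_pages` are hypothetical, and the strict "newer than the cursor" comparison is an assumption about how the real query uses `initial_time`.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional


def fetch_page(
    rows: list[datetime], after: datetime, end: datetime, limit: int
) -> tuple[Optional[datetime], list[datetime]]:
    """Return (newest timestamp in the page, page), oldest to newest."""
    page = sorted(t for t in rows if after < t <= end)[:limit]
    if not page:
        return None, []
    # The newest item is last because the page is sorted ascending.
    return page[-1], page


def iterate_pages(
    rows: list[datetime], period: tuple[datetime, datetime], limit: int = 2
) -> Iterator[list[datetime]]:
    """Yield pages from oldest to newest until the range is exhausted."""
    cursor = period[0]
    while True:
        newest, page = fetch_page(rows, cursor, period[1], limit)
        if newest is None:
            return
        yield page
        cursor = newest  # advance past everything already returned


BASE = datetime(2025, 1, 1)
ROWS = [BASE + timedelta(hours=h) for h in (3, 1, 5, 2, 4)]
PAGES = list(iterate_pages(ROWS, (BASE, BASE + timedelta(days=1)), limit=2))
```

One caveat of this simplified version: rows that share the cursor's exact timestamp but fall on the next page would be skipped by the strict comparison.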
--- a/backend/ee/onyx/db/user_group.py
+++ b/backend/ee/onyx/db/user_group.py
@@ -424,7 +424,7 @@ def _validate_curator_status__no_commit(
        )

        # if the user is a curator in any of their groups, set their role to CURATOR
-        # otherwise, set their role to BASIC
+        # otherwise, set their role to BASIC only if they were previously a CURATOR
        if curator_relationships:
            user.role = UserRole.CURATOR
        elif user.role == UserRole.CURATOR:
@@ -631,7 +631,16 @@ def update_user_group(
    removed_users = db_session.scalars(
        select(User).where(User.id.in_(removed_user_ids))  # type: ignore
    ).unique()
-    _validate_curator_status__no_commit(db_session, list(removed_users))
+
+    # Filter out admin and global curator users before validating curator status
+    users_to_validate = [
+        user
+        for user in removed_users
+        if user.role not in [UserRole.ADMIN, UserRole.GLOBAL_CURATOR]
+    ]
+
+    if users_to_validate:
+        _validate_curator_status__no_commit(db_session, users_to_validate)

    # update "time_updated" to now
    db_user_group.time_last_modified_by_user = func.now()
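
Review note: the filter added in `update_user_group` matters because the validation step sets a user's role to CURATOR whenever they still curate any group, which would presumably demote an admin or global curator who was merely removed from one group. A small sketch of that guard follows, with hypothetical `DemoRole`, `DemoUser`, and `revalidate_removed_users` standing in for the real models.

```python
from dataclasses import dataclass
from enum import Enum


class DemoRole(Enum):
    ADMIN = "admin"
    GLOBAL_CURATOR = "global_curator"
    CURATOR = "curator"
    BASIC = "basic"


@dataclass
class DemoUser:
    name: str
    role: DemoRole
    curates_any_group: bool = False


# Roles that must never be rewritten by curator validation (assumption).
PROTECTED_ROLES = {DemoRole.ADMIN, DemoRole.GLOBAL_CURATOR}


def revalidate_removed_users(users: list[DemoUser]) -> list[DemoUser]:
    """Re-derive curator status, skipping protected roles; return users checked."""
    to_validate = [u for u in users if u.role not in PROTECTED_ROLES]
    for user in to_validate:
        if user.curates_any_group:
            user.role = DemoRole.CURATOR
        elif user.role == DemoRole.CURATOR:
            # Only former curators fall back to BASIC.
            user.role = DemoRole.BASIC
    return to_validate


ADMIN = DemoUser("admin", DemoRole.ADMIN, curates_any_group=True)
EX_CURATOR = DemoUser("ex-curator", DemoRole.CURATOR, curates_any_group=False)
CHECKED = revalidate_removed_users([ADMIN, EX_CURATOR])
```

Here the admin keeps the ADMIN role even though they still curate a group, while the former curator drops to BASIC.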
--- a/backend/ee/onyx/external_permissions/confluence/doc_sync.py
+++ b/backend/ee/onyx/external_permissions/confluence/doc_sync.py
@@ -9,12 +9,16 @@ from ee.onyx.external_permissions.confluence.constants import ALL_CONF_EMAILS_GR
 from onyx.access.models import DocExternalAccess
 from onyx.access.models import ExternalAccess
 from onyx.connectors.confluence.connector import ConfluenceConnector
+from onyx.connectors.confluence.onyx_confluence import (
+    get_user_email_from_username__server,
+)
 from onyx.connectors.confluence.onyx_confluence import OnyxConfluence
-from onyx.connectors.confluence.utils import get_user_email_from_username__server
+from onyx.connectors.credentials_provider import OnyxDBCredentialsProvider
 from onyx.connectors.models import SlimDocument
 from onyx.db.models import ConnectorCredentialPair
 from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
 from onyx.utils.logger import setup_logger
+from shared_configs.contextvars import get_current_tenant_id

 logger = setup_logger()

@@ -342,7 +346,8 @@ def _fetch_all_page_restrictions(


 def confluence_doc_sync(
-    cc_pair: ConnectorCredentialPair, callback: IndexingHeartbeatInterface | None
+    cc_pair: ConnectorCredentialPair,
+    callback: IndexingHeartbeatInterface | None,
 ) -> list[DocExternalAccess]:
    """
    Adds the external permissions to the documents in postgres
@@ -354,7 +359,11 @@ def confluence_doc_sync(
    confluence_connector = ConfluenceConnector(
        **cc_pair.connector.connector_specific_config
    )
-    confluence_connector.load_credentials(cc_pair.credential.credential_json)
+
+    provider = OnyxDBCredentialsProvider(
+        get_current_tenant_id(), "confluence", cc_pair.credential_id
+    )
+    confluence_connector.set_credentials_provider(provider)

    is_cloud = cc_pair.connector.connector_specific_config.get("is_cloud", False)

--- a/backend/ee/onyx/external_permissions/confluence/group_sync.py
+++ b/backend/ee/onyx/external_permissions/confluence/group_sync.py
@@ -1,9 +1,11 @@
 from ee.onyx.db.external_perm import ExternalUserGroup
 from ee.onyx.external_permissions.confluence.constants import ALL_CONF_EMAILS_GROUP_NAME
 from onyx.background.error_logging import emit_background_error
-from onyx.connectors.confluence.onyx_confluence import build_confluence_client
+from onyx.connectors.confluence.onyx_confluence import (
+    get_user_email_from_username__server,
+)
 from onyx.connectors.confluence.onyx_confluence import OnyxConfluence
-from onyx.connectors.confluence.utils import get_user_email_from_username__server
+from onyx.connectors.credentials_provider import OnyxDBCredentialsProvider
 from onyx.db.models import ConnectorCredentialPair
 from onyx.utils.logger import setup_logger

@@ -61,13 +63,27 @@ def _build_group_member_email_map(


 def confluence_group_sync(
+    tenant_id: str,
    cc_pair: ConnectorCredentialPair,
 ) -> list[ExternalUserGroup]:
-    confluence_client = build_confluence_client(
-        credentials=cc_pair.credential.credential_json,
-        is_cloud=cc_pair.connector.connector_specific_config.get("is_cloud", False),
-        wiki_base=cc_pair.connector.connector_specific_config["wiki_base"],
-    )
+    provider = OnyxDBCredentialsProvider(tenant_id, "confluence", cc_pair.credential_id)
+    is_cloud = cc_pair.connector.connector_specific_config.get("is_cloud", False)
+    wiki_base: str = cc_pair.connector.connector_specific_config["wiki_base"]
+    url = wiki_base.rstrip("/")
+
+    probe_kwargs = {
+        "max_backoff_retries": 6,
+        "max_backoff_seconds": 10,
+    }
+
+    final_kwargs = {
+        "max_backoff_retries": 10,
+        "max_backoff_seconds": 60,
+    }
+
+    confluence_client = OnyxConfluence(is_cloud, url, provider)
+    confluence_client._probe_connection(**probe_kwargs)
+    confluence_client._initialize_connection(**final_kwargs)

    group_member_email_map = _build_group_member_email_map(
        confluence_client=confluence_client,
--- a/backend/ee/onyx/external_permissions/gmail/doc_sync.py
+++ b/backend/ee/onyx/external_permissions/gmail/doc_sync.py
@@ -32,7 +32,8 @@ def _get_slim_doc_generator(


 def gmail_doc_sync(
-    cc_pair: ConnectorCredentialPair, callback: IndexingHeartbeatInterface | None
+    cc_pair: ConnectorCredentialPair,
+    callback: IndexingHeartbeatInterface | None,
 ) -> list[DocExternalAccess]:
    """
    Adds the external permissions to the documents in postgres
--- a/backend/ee/onyx/external_permissions/google_drive/doc_sync.py
+++ b/backend/ee/onyx/external_permissions/google_drive/doc_sync.py
@@ -62,12 +62,14 @@ def _fetch_permissions_for_permission_ids(
        user_email=(owner_email or google_drive_connector.primary_admin_email),
    )

+    # We continue on 404 or 403 because the document may not exist or the user may not have access to it
    fetched_permissions = execute_paginated_retrieval(
        retrieval_function=drive_service.permissions().list,
        list_key="permissions",
        fileId=doc_id,
        fields="permissions(id, emailAddress, type, domain)",
        supportsAllDrives=True,
+        continue_on_404_or_403=True,
    )

    permissions_for_doc_id = []
@@ -104,7 +106,13 @@ def _get_permissions_from_slim_doc(
    user_emails: set[str] = set()
    group_emails: set[str] = set()
    public = False
+    skipped_permissions = 0
+
    for permission in permissions_list:
+        if not permission:
+            skipped_permissions += 1
+            continue
+
        permission_type = permission["type"]
        if permission_type == "user":
            user_emails.add(permission["emailAddress"])
@@ -121,6 +129,11 @@ def _get_permissions_from_slim_doc(
        elif permission_type == "anyone":
            public = True

+    if skipped_permissions > 0:
+        logger.warning(
+            f"Skipped {skipped_permissions} permissions of {len(permissions_list)} for document {slim_doc.id}"
+        )
+
    drive_id = permission_info.get("drive_id")
    group_ids = group_emails | ({drive_id} if drive_id is not None else set())

@@ -132,7 +145,8 @@ def _get_permissions_from_slim_doc(


 def gdrive_doc_sync(
-    cc_pair: ConnectorCredentialPair, callback: IndexingHeartbeatInterface | None
+    cc_pair: ConnectorCredentialPair,
+    callback: IndexingHeartbeatInterface | None,
 ) -> list[DocExternalAccess]:
    """
    Adds the external permissions to the documents in postgres
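
Review note: once permission retrieval is allowed to continue on 404 or 403, the permissions list can contain empty entries, so the consumer has to skip them and count what it skipped. The sketch below shows that handling on plain dictionaries. `summarize_permissions` is a hypothetical helper, the "group" branch is an assumption about lines not shown in this hunk, and domain handling is left out.

```python
from typing import Optional


def summarize_permissions(
    permissions: list[Optional[dict]],
) -> tuple[set[str], set[str], bool, int]:
    """Return (user emails, group emails, public flag, skipped count)."""
    user_emails: set[str] = set()
    group_emails: set[str] = set()
    public = False
    skipped = 0

    for permission in permissions:
        if not permission:
            # Empty or missing entry, e.g. a lookup that returned 404 or 403.
            skipped += 1
            continue

        permission_type = permission["type"]
        if permission_type == "user":
            user_emails.add(permission["emailAddress"])
        elif permission_type == "group":
            group_emails.add(permission["emailAddress"])
        elif permission_type == "anyone":
            public = True

    return user_emails, group_emails, public, skipped


SAMPLE = [
    {"type": "user", "emailAddress": "a@example.com"},
    None,
    {},
    {"type": "group", "emailAddress": "team@example.com"},
    {"type": "anyone"},
]
```

Logging the skipped count once per document, as the diff does, keeps the warning volume proportional to documents and not to individual permissions.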
--- a/backend/ee/onyx/external_permissions/google_drive/group_sync.py
+++ b/backend/ee/onyx/external_permissions/google_drive/group_sync.py
@@ -119,6 +119,7 @@ def _build_onyx_groups(


 def gdrive_group_sync(
+    tenant_id: str,
    cc_pair: ConnectorCredentialPair,
 ) -> list[ExternalUserGroup]:
    # Initialize connector and build credential/service objects
--- a/backend/ee/onyx/external_permissions/slack/doc_sync.py
+++ b/backend/ee/onyx/external_permissions/slack/doc_sync.py
@@ -123,7 +123,8 @@ def _fetch_channel_permissions(


 def slack_doc_sync(
-    cc_pair: ConnectorCredentialPair, callback: IndexingHeartbeatInterface | None
+    cc_pair: ConnectorCredentialPair,
+    callback: IndexingHeartbeatInterface | None,
 ) -> list[DocExternalAccess]:
    """
    Adds the external permissions to the documents in postgres
--- a/backend/ee/onyx/external_permissions/sync_params.py
+++ b/backend/ee/onyx/external_permissions/sync_params.py
@@ -28,6 +28,7 @@ DocSyncFuncType = Callable[

 GroupSyncFuncType = Callable[
    [
+        str,
        ConnectorCredentialPair,
    ],
    list[ExternalUserGroup],
--- a/backend/ee/onyx/main.py
+++ b/backend/ee/onyx/main.py
@@ -15,7 +15,7 @@ from ee.onyx.server.enterprise_settings.api import (
 )
 from ee.onyx.server.manage.standard_answer import router as standard_answer_router
 from ee.onyx.server.middleware.tenant_tracking import add_tenant_id_middleware
-from ee.onyx.server.oauth import router as oauth_router
+from ee.onyx.server.oauth.api import router as ee_oauth_router
 from ee.onyx.server.query_and_chat.chat_backend import (
    router as chat_router,
 )
@@ -128,7 +128,7 @@ def get_application() -> FastAPI:
    include_router_with_global_prefix_prepended(application, query_router)
    include_router_with_global_prefix_prepended(application, chat_router)
    include_router_with_global_prefix_prepended(application, standard_answer_router)
-    include_router_with_global_prefix_prepended(application, oauth_router)
+    include_router_with_global_prefix_prepended(application, ee_oauth_router)

    # Enterprise-only global settings
    include_router_with_global_prefix_prepended(
@@ -152,4 +152,8 @@ def get_application() -> FastAPI:
    # environment variable. Used to automate deployment for multiple environments.
    seed_db()

+    # for debugging discovered routes
+    # for route in application.router.routes:
+    #     print(f"Path: {route.path}, Methods: {route.methods}")
+
    return application
--- a/backend/ee/onyx/onyxbot/slack/handlers/handle_standard_answers.py
+++ b/backend/ee/onyx/onyxbot/slack/handlers/handle_standard_answers.py
@@ -22,7 +22,7 @@ from onyx.onyxbot.slack.blocks import get_restate_blocks
 from onyx.onyxbot.slack.constants import GENERATE_ANSWER_BUTTON_ACTION_ID
 from onyx.onyxbot.slack.handlers.utils import send_team_member_message
 from onyx.onyxbot.slack.models import SlackMessageInfo
-from onyx.onyxbot.slack.utils import respond_in_thread
+from onyx.onyxbot.slack.utils import respond_in_thread_or_channel
 from onyx.onyxbot.slack.utils import update_emote_react
 from onyx.utils.logger import OnyxLoggingAdapter
 from onyx.utils.logger import setup_logger
@@ -216,7 +216,7 @@ def _handle_standard_answers(
        all_blocks = restate_question_blocks + answer_blocks

        try:
-            respond_in_thread(
+            respond_in_thread_or_channel(
                client=client,
                channel=message_info.channel_to_respond,
                receiver_ids=receiver_ids,
@@ -231,6 +231,7 @@ def _handle_standard_answers(
                    client=client,
                    channel=message_info.channel_to_respond,
                    thread_ts=slack_thread_id,
+                    receiver_ids=receiver_ids,
                )

            return True
--- a/backend/ee/onyx/server/oauth.py
+++ b/backend/ee/onyx/server/oauth.py
@@ -1,629 +0,0 @@
-import base64
-import json
-import uuid
-from typing import Any
-from typing import cast
-
-import requests
-from fastapi import APIRouter
-from fastapi import Depends
-from fastapi import HTTPException
-from fastapi.responses import JSONResponse
-from pydantic import BaseModel
-from sqlalchemy.orm import Session
-
-from ee.onyx.configs.app_configs import OAUTH_CONFLUENCE_CLIENT_ID
-from ee.onyx.configs.app_configs import OAUTH_CONFLUENCE_CLIENT_SECRET
-from ee.onyx.configs.app_configs import OAUTH_GOOGLE_DRIVE_CLIENT_ID
-from ee.onyx.configs.app_configs import OAUTH_GOOGLE_DRIVE_CLIENT_SECRET
-from ee.onyx.configs.app_configs import OAUTH_SLACK_CLIENT_ID
-from ee.onyx.configs.app_configs import OAUTH_SLACK_CLIENT_SECRET
-from onyx.auth.users import current_user
-from onyx.configs.app_configs import WEB_DOMAIN
-from onyx.configs.constants import DocumentSource
-from onyx.connectors.google_utils.google_auth import get_google_oauth_creds
-from onyx.connectors.google_utils.google_auth import sanitize_oauth_credentials
-from onyx.connectors.google_utils.shared_constants import (
-    DB_CREDENTIALS_AUTHENTICATION_METHOD,
-)
-from onyx.connectors.google_utils.shared_constants import (
-    DB_CREDENTIALS_DICT_TOKEN_KEY,
-)
-from onyx.connectors.google_utils.shared_constants import (
-    DB_CREDENTIALS_PRIMARY_ADMIN_KEY,
-)
-from onyx.connectors.google_utils.shared_constants import (
-    GoogleOAuthAuthenticationMethod,
-)
-from onyx.db.credentials import create_credential
-from onyx.db.engine import get_session
-from onyx.db.models import User
-from onyx.redis.redis_pool import get_redis_client
-from onyx.server.documents.models import CredentialBase
-from onyx.utils.logger import setup_logger
-from shared_configs.contextvars import get_current_tenant_id
-
-
-logger = setup_logger()
-
-router = APIRouter(prefix="/oauth")
-
-
-class SlackOAuth:
-    # https://knock.app/blog/how-to-authenticate-users-in-slack-using-oauth
-    # Example: https://api.slack.com/authentication/oauth-v2#exchanging
-
-    class OAuthSession(BaseModel):
-        """Stored in redis to be looked up on callback"""
-
-        email: str
-        redirect_on_success: str | None  # Where to send the user if OAuth flow succeeds
-
-    CLIENT_ID = OAUTH_SLACK_CLIENT_ID
-    CLIENT_SECRET = OAUTH_SLACK_CLIENT_SECRET
-
-    TOKEN_URL = "https://slack.com/api/oauth.v2.access"
-
-    # SCOPE is per https://docs.onyx.app/connectors/slack
-    BOT_SCOPE = (
-        "channels:history,"
-        "channels:read,"
-        "groups:history,"
-        "groups:read,"
-        "channels:join,"
-        "im:history,"
-        "users:read,"
-        "users:read.email,"
-        "usergroups:read"
-    )
-
-    REDIRECT_URI = f"{WEB_DOMAIN}/admin/connectors/slack/oauth/callback"
-    DEV_REDIRECT_URI = f"https://redirectmeto.com/{REDIRECT_URI}"
-
-    @classmethod
-    def generate_oauth_url(cls, state: str) -> str:
-        return cls._generate_oauth_url_helper(cls.REDIRECT_URI, state)
-
-    @classmethod
-    def generate_dev_oauth_url(cls, state: str) -> str:
-        """dev mode workaround for localhost testing
-        - https://www.nango.dev/blog/oauth-redirects-on-localhost-with-https
-        """
-
-        return cls._generate_oauth_url_helper(cls.DEV_REDIRECT_URI, state)
-
-    @classmethod
-    def _generate_oauth_url_helper(cls, redirect_uri: str, state: str) -> str:
-        url = (
-            f"https://slack.com/oauth/v2/authorize"
-            f"?client_id={cls.CLIENT_ID}"
-            f"&redirect_uri={redirect_uri}"
-            f"&scope={cls.BOT_SCOPE}"
-            f"&state={state}"
-        )
-        return url
-
-    @classmethod
-    def session_dump_json(cls, email: str, redirect_on_success: str | None) -> str:
-        """Temporary state to store in redis. to be looked up on auth response.
-        Returns a json string.
-        """
-        session = SlackOAuth.OAuthSession(
-            email=email, redirect_on_success=redirect_on_success
-        )
-        return session.model_dump_json()
-
-    @classmethod
-    def parse_session(cls, session_json: str) -> OAuthSession:
-        session = SlackOAuth.OAuthSession.model_validate_json(session_json)
-        return session
-
-
-class ConfluenceCloudOAuth:
-    """work in progress"""
-
-    # https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/
-
-    class OAuthSession(BaseModel):
-        """Stored in redis to be looked up on callback"""
-
-        email: str
-        redirect_on_success: str | None  # Where to send the user if OAuth flow succeeds
-
-    CLIENT_ID = OAUTH_CONFLUENCE_CLIENT_ID
-    CLIENT_SECRET = OAUTH_CONFLUENCE_CLIENT_SECRET
-    TOKEN_URL = "https://auth.atlassian.com/oauth/token"
-
-    # All read scopes per https://developer.atlassian.com/cloud/confluence/scopes-for-oauth-2-3LO-and-forge-apps/
-    CONFLUENCE_OAUTH_SCOPE = (
-        "read:confluence-props%20"
-        "read:confluence-content.all%20"
-        "read:confluence-content.summary%20"
-        "read:confluence-content.permission%20"
-        "read:confluence-user%20"
-        "read:confluence-groups%20"
-        "readonly:content.attachment:confluence"
-    )
-
-    REDIRECT_URI = f"{WEB_DOMAIN}/admin/connectors/confluence/oauth/callback"
-    DEV_REDIRECT_URI = f"https://redirectmeto.com/{REDIRECT_URI}"
-
-    # eventually for Confluence Data Center
-    # oauth_url = (
-    #     f"http://localhost:8090/rest/oauth/v2/authorize?client_id={CONFLUENCE_OAUTH_CLIENT_ID}"
-    #     f"&scope={CONFLUENCE_OAUTH_SCOPE_2}"
-    #     f"&redirect_uri={redirectme_uri}"
-    # )
-
-    @classmethod
-    def generate_oauth_url(cls, state: str) -> str:
-        return cls._generate_oauth_url_helper(cls.REDIRECT_URI, state)
-
-    @classmethod
-    def generate_dev_oauth_url(cls, state: str) -> str:
-        """dev mode workaround for localhost testing
-        - https://www.nango.dev/blog/oauth-redirects-on-localhost-with-https
-        """
-        return cls._generate_oauth_url_helper(cls.DEV_REDIRECT_URI, state)
-
-    @classmethod
-    def _generate_oauth_url_helper(cls, redirect_uri: str, state: str) -> str:
-        url = (
-            "https://auth.atlassian.com/authorize"
-            f"?audience=api.atlassian.com"
-            f"&client_id={cls.CLIENT_ID}"
-            f"&redirect_uri={redirect_uri}"
-            f"&scope={cls.CONFLUENCE_OAUTH_SCOPE}"
-            f"&state={state}"
-            "&response_type=code"
-            "&prompt=consent"
-        )
-        return url
-
-    @classmethod
-    def session_dump_json(cls, email: str, redirect_on_success: str | None) -> str:
-        """Temporary state to store in redis. to be looked up on auth response.
-        Returns a json string.
-        """
-        session = ConfluenceCloudOAuth.OAuthSession(
-            email=email, redirect_on_success=redirect_on_success
-        )
-        return session.model_dump_json()
-
-    @classmethod
-    def parse_session(cls, session_json: str) -> SlackOAuth.OAuthSession:
-        session = SlackOAuth.OAuthSession.model_validate_json(session_json)
-        return session
-
-
-class GoogleDriveOAuth:
-    # https://developers.google.com/identity/protocols/oauth2
-    # https://developers.google.com/identity/protocols/oauth2/web-server
-
-    class OAuthSession(BaseModel):
-        """Stored in redis to be looked up on callback"""
-
-        email: str
-        redirect_on_success: str | None  # Where to send the user if OAuth flow succeeds
-
-    CLIENT_ID = OAUTH_GOOGLE_DRIVE_CLIENT_ID
-    CLIENT_SECRET = OAUTH_GOOGLE_DRIVE_CLIENT_SECRET
-
-    TOKEN_URL = "https://oauth2.googleapis.com/token"
-
-    # SCOPE is per https://docs.onyx.app/connectors/google-drive
-    # TODO: Merge with or use google_utils.GOOGLE_SCOPES
-    SCOPE = (
-        "https://www.googleapis.com/auth/drive.readonly%20"
-        "https://www.googleapis.com/auth/drive.metadata.readonly%20"
-        "https://www.googleapis.com/auth/admin.directory.user.readonly%20"
-        "https://www.googleapis.com/auth/admin.directory.group.readonly"
-    )
-
-    REDIRECT_URI = f"{WEB_DOMAIN}/admin/connectors/google-drive/oauth/callback"
-    DEV_REDIRECT_URI = f"https://redirectmeto.com/{REDIRECT_URI}"
-
-    @classmethod
-    def generate_oauth_url(cls, state: str) -> str:
-        return cls._generate_oauth_url_helper(cls.REDIRECT_URI, state)
-
-    @classmethod
-    def generate_dev_oauth_url(cls, state: str) -> str:
-        """dev mode workaround for localhost testing
-        - https://www.nango.dev/blog/oauth-redirects-on-localhost-with-https
-        """
-
-        return cls._generate_oauth_url_helper(cls.DEV_REDIRECT_URI, state)
-
-    @classmethod
-    def _generate_oauth_url_helper(cls, redirect_uri: str, state: str) -> str:
-        # without prompt=consent, a refresh token is only issued the first time the user approves
-        url = (
-            f"https://accounts.google.com/o/oauth2/v2/auth"
-            f"?client_id={cls.CLIENT_ID}"
-            f"&redirect_uri={redirect_uri}"
-            "&response_type=code"
-            f"&scope={cls.SCOPE}"
-            "&access_type=offline"
-            f"&state={state}"
-            "&prompt=consent"
-        )
-        return url
-
-    @classmethod
-    def session_dump_json(cls, email: str, redirect_on_success: str | None) -> str:
-        """Temporary state to store in redis. to be looked up on auth response.
-        Returns a json string.
-        """
-        session = GoogleDriveOAuth.OAuthSession(
-            email=email, redirect_on_success=redirect_on_success
-        )
-        return session.model_dump_json()
-
-    @classmethod
-    def parse_session(cls, session_json: str) -> OAuthSession:
-        session = GoogleDriveOAuth.OAuthSession.model_validate_json(session_json)
-        return session
-
-
-@router.post("/prepare-authorization-request")
-def prepare_authorization_request(
-    connector: DocumentSource,
-    redirect_on_success: str | None,
-    user: User = Depends(current_user),
-) -> JSONResponse:
-    """Used by the frontend to generate the url for the user's browser during auth request.
-
-    Example: https://www.oauth.com/oauth2-servers/authorization/the-authorization-request/
-    """
-    tenant_id = get_current_tenant_id()
-
-    # create random oauth state param for security and to retrieve user data later
-    oauth_uuid = uuid.uuid4()
-    oauth_uuid_str = str(oauth_uuid)
-
-    # urlsafe b64 encode the uuid for the oauth url
-    oauth_state = (
-        base64.urlsafe_b64encode(oauth_uuid.bytes).rstrip(b"=").decode("utf-8")
-    )
-    session: str
-
-    if connector == DocumentSource.SLACK:
-        oauth_url = SlackOAuth.generate_oauth_url(oauth_state)
-        session = SlackOAuth.session_dump_json(
-            email=user.email, redirect_on_success=redirect_on_success
-        )
-    elif connector == DocumentSource.GOOGLE_DRIVE:
-        oauth_url = GoogleDriveOAuth.generate_oauth_url(oauth_state)
-        session = GoogleDriveOAuth.session_dump_json(
-            email=user.email, redirect_on_success=redirect_on_success
-        )
-    # elif connector == DocumentSource.CONFLUENCE:
-    #     oauth_url = ConfluenceCloudOAuth.generate_oauth_url(oauth_state)
-    #     session = ConfluenceCloudOAuth.session_dump_json(
-    #         email=user.email, redirect_on_success=redirect_on_success
-    #     )
-    # elif connector == DocumentSource.JIRA:
-    #     oauth_url = JiraCloudOAuth.generate_dev_oauth_url(oauth_state)
-    else:
-        oauth_url = None
-
-    if not oauth_url:
-        raise HTTPException(
-            status_code=404,
-            detail=f"The document source type {connector} does not have OAuth implemented",
-        )
-
-    r = get_redis_client(tenant_id=tenant_id)
-
-    # store important session state to retrieve when the user is redirected back
-    # 10 min is the max we want an oauth flow to be valid
-    r.set(f"da_oauth:{oauth_uuid_str}", session, ex=600)
-
-    return JSONResponse(content={"url": oauth_url})
-
-
-@router.post("/connector/slack/callback")
-def handle_slack_oauth_callback(
-    code: str,
-    state: str,
-    user: User = Depends(current_user),
-    db_session: Session = Depends(get_session),
-) -> JSONResponse:
-    if not SlackOAuth.CLIENT_ID or not SlackOAuth.CLIENT_SECRET:
-        raise HTTPException(
-            status_code=500,
-            detail="Slack client ID or client secret is not configured.",
-        )
-
-    r = get_redis_client()
-
-    # recover the state
-    padded_state = state + "=" * (
-        -len(state) % 4
-    )  # Add padding back (Base64 decoding requires padding)
-    uuid_bytes = base64.urlsafe_b64decode(
-        padded_state
-    )  # Decode the Base64 string back to bytes
-
-    # Convert bytes back to a UUID
-    oauth_uuid = uuid.UUID(bytes=uuid_bytes)
-    oauth_uuid_str = str(oauth_uuid)
-
-    r_key = f"da_oauth:{oauth_uuid_str}"
-
-    session_json_bytes = cast(bytes, r.get(r_key))
-    if not session_json_bytes:
-        raise HTTPException(
-            status_code=400,
-            detail=f"Slack OAuth failed - OAuth state key not found: key={r_key}",
-        )
-
-    session_json = session_json_bytes.decode("utf-8")
-    try:
-        session = SlackOAuth.parse_session(session_json)
-
-        # Exchange the authorization code for an access token
-        response = requests.post(
-            SlackOAuth.TOKEN_URL,
-            headers={"Content-Type": "application/x-www-form-urlencoded"},
-            data={
-                "client_id": SlackOAuth.CLIENT_ID,
-                "client_secret": SlackOAuth.CLIENT_SECRET,
-                "code": code,
-                "redirect_uri": SlackOAuth.REDIRECT_URI,
-            },
-        )
-
-        response_data = response.json()
-
-        if not response_data.get("ok"):
-            raise HTTPException(
-                status_code=400,
-                detail=f"Slack OAuth failed: {response_data.get('error')}",
-            )
-
-        # Extract token and team information
-        access_token: str = response_data.get("access_token")
-        team_id: str = response_data.get("team", {}).get("id")
-        authed_user_id: str = response_data.get("authed_user", {}).get("id")
-
-        credential_info = CredentialBase(
-            credential_json={"slack_bot_token": access_token},
-            admin_public=True,
-            source=DocumentSource.SLACK,
-            name="Slack OAuth",
-        )
-
-        create_credential(credential_info, user, db_session)
-    except Exception as e:
-        return JSONResponse(
-            status_code=500,
-            content={
-                "success": False,
-                "message": f"An error occurred during Slack OAuth: {str(e)}",
-            },
-        )
-    finally:
-        r.delete(r_key)
-
-    # return the result
-    return JSONResponse(
-        content={
-            "success": True,
-            "message": "Slack OAuth completed successfully.",
-            "team_id": team_id,
-            "authed_user_id": authed_user_id,
-            "redirect_on_success": session.redirect_on_success,
-        }
-    )
-
-
-# Work in progress
-# @router.post("/connector/confluence/callback")
-# def handle_confluence_oauth_callback(
-#     code: str,
-#     state: str,
-#     user: User = Depends(current_user),
-#     db_session: Session = Depends(get_session),
-#     tenant_id: str | None = Depends(get_current_tenant_id),
-# ) -> JSONResponse:
-#     if not ConfluenceCloudOAuth.CLIENT_ID or not ConfluenceCloudOAuth.CLIENT_SECRET:
-#         raise HTTPException(
-#             status_code=500,
-#             detail="Confluence client ID or client secret is not configured."
-#         )
-
-#     r = get_redis_client(tenant_id=tenant_id)
-
-#     # recover the state
-#     padded_state = state + '=' * (-len(state) % 4)  # Add padding back (Base64 decoding requires padding)
-#     uuid_bytes = base64.urlsafe_b64decode(padded_state)  # Decode the Base64 string back to bytes
-
-#     # Convert bytes back to a UUID
-#     oauth_uuid = uuid.UUID(bytes=uuid_bytes)
-#     oauth_uuid_str = str(oauth_uuid)
-
-#     r_key = f"da_oauth:{oauth_uuid_str}"
-
-#     result = r.get(r_key)
-#     if not result:
-#         raise HTTPException(
-#             status_code=400,
-#             detail=f"Confluence OAuth failed - OAuth state key not found: key={r_key}"
-#         )
-
-#     try:
-#         session = ConfluenceCloudOAuth.parse_session(result)
-
-#         # Exchange the authorization code for an access token
-#         response = requests.post(
-#             ConfluenceCloudOAuth.TOKEN_URL,
-#             headers={"Content-Type": "application/x-www-form-urlencoded"},
-#             data={
-#                 "client_id": ConfluenceCloudOAuth.CLIENT_ID,
-#                 "client_secret": ConfluenceCloudOAuth.CLIENT_SECRET,
-#                 "code": code,
-#                 "redirect_uri": ConfluenceCloudOAuth.DEV_REDIRECT_URI,
-#             },
-#         )
-
-#         response_data = response.json()
-
-#         if not response_data.get("ok"):
-#             raise HTTPException(
-#                 status_code=400,
-#                 detail=f"ConfluenceCloudOAuth OAuth failed: {response_data.get('error')}"
-#             )
-
-#         # Extract token and team information
-#         access_token: str = response_data.get("access_token")
-#         team_id: str = response_data.get("team", {}).get("id")
-#         authed_user_id: str = response_data.get("authed_user", {}).get("id")
-
-#         credential_info = CredentialBase(
-#             credential_json={"slack_bot_token": access_token},
-#             admin_public=True,
-#             source=DocumentSource.CONFLUENCE,
-#             name="Confluence OAuth",
-#         )
-
-#         logger.info(f"Slack access token: {access_token}")
-
-#         credential = create_credential(credential_info, user, db_session)
-
-#         logger.info(f"new_credential_id={credential.id}")
-#     except Exception as e:
-#         return JSONResponse(
-#             status_code=500,
-#             content={
-#                 "success": False,
-#                 "message": f"An error occurred during Slack OAuth: {str(e)}",
-#             },
-#         )
-#     finally:
-#         r.delete(r_key)
-
-#     # return the result
-#     return JSONResponse(
-#         content={
-#             "success": True,
-#             "message": "Slack OAuth completed successfully.",
-#             "team_id": team_id,
-#             "authed_user_id": authed_user_id,
-#             "redirect_on_success": session.redirect_on_success,
-#         }
-#     )
-
-
-@router.post("/connector/google-drive/callback")
-def handle_google_drive_oauth_callback(
-    code: str,
-    state: str,
-    user: User = Depends(current_user),
-    db_session: Session = Depends(get_session),
-) -> JSONResponse:
-    if not GoogleDriveOAuth.CLIENT_ID or not GoogleDriveOAuth.CLIENT_SECRET:
-        raise HTTPException(
-            status_code=500,
-            detail="Google Drive client ID or client secret is not configured.",
-        )
-
-    r = get_redis_client()
-
-    # recover the state
-    padded_state = state + "=" * (
-        -len(state) % 4
-    )  # Add padding back (Base64 decoding requires padding)
-    uuid_bytes = base64.urlsafe_b64decode(
-        padded_state
-    )  # Decode the Base64 string back to bytes
-
-    # Convert bytes back to a UUID
-    oauth_uuid = uuid.UUID(bytes=uuid_bytes)
-    oauth_uuid_str = str(oauth_uuid)
-
-    r_key = f"da_oauth:{oauth_uuid_str}"
-
-    session_json_bytes = cast(bytes, r.get(r_key))
-    if not session_json_bytes:
-        raise HTTPException(
-            status_code=400,
-            detail=f"Google Drive OAuth failed - OAuth state key not found: key={r_key}",
-        )
-
-    session_json = session_json_bytes.decode("utf-8")
-    session: GoogleDriveOAuth.OAuthSession
-    try:
-        session = GoogleDriveOAuth.parse_session(session_json)
-
-        # Exchange the authorization code for an access token
-        response = requests.post(
-            GoogleDriveOAuth.TOKEN_URL,
-            headers={"Content-Type": "application/x-www-form-urlencoded"},
-            data={
-                "client_id": GoogleDriveOAuth.CLIENT_ID,
-                "client_secret": GoogleDriveOAuth.CLIENT_SECRET,
-                "code": code,
-                "redirect_uri": GoogleDriveOAuth.REDIRECT_URI,
-                "grant_type": "authorization_code",
-            },
-        )
-
-        response.raise_for_status()
-
-        authorization_response: dict[str, Any] = response.json()
-
-        # the connector wants us to store the json in its authorized_user_info format
-        # returned from OAuthCredentials.get_authorized_user_info().
-        # So refresh immediately via get_google_oauth_creds with the params filled in
-        # from fields in authorization_response to get the json we need
-        authorized_user_info = {}
-        authorized_user_info["client_id"] = OAUTH_GOOGLE_DRIVE_CLIENT_ID
-        authorized_user_info["client_secret"] = OAUTH_GOOGLE_DRIVE_CLIENT_SECRET
-        authorized_user_info["refresh_token"] = authorization_response["refresh_token"]
-
-        token_json_str = json.dumps(authorized_user_info)
-        oauth_creds = get_google_oauth_creds(
-            token_json_str=token_json_str, source=DocumentSource.GOOGLE_DRIVE
-        )
-        if not oauth_creds:
-            raise RuntimeError("get_google_oauth_creds returned None.")
-
-        # save off the credentials
-        oauth_creds_sanitized_json_str = sanitize_oauth_credentials(oauth_creds)
-
-        credential_dict: dict[str, str] = {}
-        credential_dict[DB_CREDENTIALS_DICT_TOKEN_KEY] = oauth_creds_sanitized_json_str
-        credential_dict[DB_CREDENTIALS_PRIMARY_ADMIN_KEY] = session.email
-        credential_dict[
-            DB_CREDENTIALS_AUTHENTICATION_METHOD
-        ] = GoogleOAuthAuthenticationMethod.OAUTH_INTERACTIVE.value
-
-        credential_info = CredentialBase(
-            credential_json=credential_dict,
-            admin_public=True,
-            source=DocumentSource.GOOGLE_DRIVE,
-            name="OAuth (interactive)",
-        )
-
-        create_credential(credential_info, user, db_session)
-    except Exception as e:
-        return JSONResponse(
-            status_code=500,
-            content={
-                "success": False,
-                "message": f"An error occurred during Google Drive OAuth: {str(e)}",
-            },
-        )
-    finally:
-        r.delete(r_key)
-
-    # return the result
-    return JSONResponse(
-        content={
-            "success": True,
-            "message": "Google Drive OAuth completed successfully.",
-            "redirect_on_success": session.redirect_on_success,
-        }
-    )
--- a/backend/ee/onyx/server/oauth/api.py
+++ b/backend/ee/onyx/server/oauth/api.py
@@ -0,0 +1,91 @@
+import base64
+import uuid
+
+from fastapi import Depends
+from fastapi import HTTPException
+from fastapi.responses import JSONResponse
+
+from ee.onyx.server.oauth.api_router import router
+from ee.onyx.server.oauth.confluence_cloud import ConfluenceCloudOAuth
+from ee.onyx.server.oauth.google_drive import GoogleDriveOAuth
+from ee.onyx.server.oauth.slack import SlackOAuth
+from onyx.auth.users import current_admin_user
+from onyx.configs.app_configs import DEV_MODE
+from onyx.configs.constants import DocumentSource
+from onyx.db.engine import get_current_tenant_id
+from onyx.db.models import User
+from onyx.redis.redis_pool import get_redis_client
+from onyx.utils.logger import setup_logger
+
+logger = setup_logger()
+
+
+@router.post("/prepare-authorization-request")
+def prepare_authorization_request(
+    connector: DocumentSource,
+    redirect_on_success: str | None,
+    user: User = Depends(current_admin_user),
+    tenant_id: str | None = Depends(get_current_tenant_id),
+) -> JSONResponse:
+    """Used by the frontend to generate the url for the user's browser during auth request.
+
+    Example: https://www.oauth.com/oauth2-servers/authorization/the-authorization-request/
+    """
+
+    # create random oauth state param for security and to retrieve user data later
+    oauth_uuid = uuid.uuid4()
+    oauth_uuid_str = str(oauth_uuid)
+
+    # urlsafe b64 encode the uuid for the oauth url
+    oauth_state = (
+        base64.urlsafe_b64encode(oauth_uuid.bytes).rstrip(b"=").decode("utf-8")
+    )
+
+    session: str | None = None
+    if connector == DocumentSource.SLACK:
+        if not DEV_MODE:
+            oauth_url = SlackOAuth.generate_oauth_url(oauth_state)
+        else:
+            oauth_url = SlackOAuth.generate_dev_oauth_url(oauth_state)
+
+        session = SlackOAuth.session_dump_json(
+            email=user.email, redirect_on_success=redirect_on_success
+        )
+    elif connector == DocumentSource.CONFLUENCE:
+        if not DEV_MODE:
+            oauth_url = ConfluenceCloudOAuth.generate_oauth_url(oauth_state)
+        else:
+            oauth_url = ConfluenceCloudOAuth.generate_dev_oauth_url(oauth_state)
+        session = ConfluenceCloudOAuth.session_dump_json(
+            email=user.email, redirect_on_success=redirect_on_success
+        )
+    elif connector == DocumentSource.GOOGLE_DRIVE:
+        if not DEV_MODE:
+            oauth_url = GoogleDriveOAuth.generate_oauth_url(oauth_state)
+        else:
+            oauth_url = GoogleDriveOAuth.generate_dev_oauth_url(oauth_state)
+        session = GoogleDriveOAuth.session_dump_json(
+            email=user.email, redirect_on_success=redirect_on_success
+        )
+    else:
+        oauth_url = None
+
+    if not oauth_url:
+        raise HTTPException(
+            status_code=404,
+            detail=f"The document source type {connector} does not have OAuth implemented",
+        )
+
+    if not session:
+        raise HTTPException(
+            status_code=500,
+            detail=f"The document source type {connector} failed to generate an OAuth session.",
+        )
+
+    r = get_redis_client(tenant_id=tenant_id)
+
+    # store important session state to retrieve when the user is redirected back
+    # 10 min is the max we want an oauth flow to be valid
+    r.set(f"da_oauth:{oauth_uuid_str}", session, ex=600)
+
+    return JSONResponse(content={"url": oauth_url})
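Note on the `state` parameter used above and in the callbacks: the UUID is sent as unpadded URL-safe base64, and each callback restores the padding before decoding. A minimal standalone sketch of that round trip follows. The helper names `encode_state` and `decode_state` are hypothetical and do not exist in this PR; the logic mirrors the inline code.

```python
import base64
import uuid


def encode_state(oauth_uuid: uuid.UUID) -> str:
    """Hypothetical helper: URL-safe base64 of the 16 UUID bytes, '=' padding stripped."""
    return base64.urlsafe_b64encode(oauth_uuid.bytes).rstrip(b"=").decode("utf-8")


def decode_state(state: str) -> uuid.UUID:
    """Hypothetical helper: restore the stripped padding, then rebuild the UUID."""
    # Base64 decoding requires the input length to be a multiple of 4.
    padded_state = state + "=" * (-len(state) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded_state))


original = uuid.uuid4()
state = encode_state(original)

# 16 bytes encode to 24 base64 characters, 2 of which are padding.
assert len(state) == 22
assert decode_state(state) == original

# Same key format the endpoints use for the Redis lookup.
redis_key = f"da_oauth:{decode_state(state)}"
```

Factoring the round trip into shared helpers like these would be one way to avoid repeating the padding logic in every callback; that is a suggestion, not something this change does.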
--- a/backend/ee/onyx/server/oauth/api_router.py
+++ b/backend/ee/onyx/server/oauth/api_router.py
@@ -0,0 +1,3 @@
+from fastapi import APIRouter
+
+router: APIRouter = APIRouter(prefix="/oauth")
--- a/backend/ee/onyx/server/oauth/confluence_cloud.py
+++ b/backend/ee/onyx/server/oauth/confluence_cloud.py
@@ -0,0 +1,362 @@
+import base64
+import uuid
+from datetime import datetime
+from datetime import timedelta
+from datetime import timezone
+from typing import Any
+from typing import cast
+
+import requests
+from fastapi import Depends
+from fastapi import HTTPException
+from fastapi.responses import JSONResponse
+from pydantic import BaseModel
+from pydantic import ValidationError
+from sqlalchemy.orm import Session
+
+from ee.onyx.configs.app_configs import OAUTH_CONFLUENCE_CLOUD_CLIENT_ID
+from ee.onyx.configs.app_configs import OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET
+from ee.onyx.server.oauth.api_router import router
+from onyx.auth.users import current_admin_user
+from onyx.configs.app_configs import DEV_MODE
+from onyx.configs.app_configs import WEB_DOMAIN
+from onyx.configs.constants import DocumentSource
+from onyx.connectors.confluence.utils import CONFLUENCE_OAUTH_TOKEN_URL
+from onyx.db.credentials import create_credential
+from onyx.db.credentials import fetch_credential_by_id_for_user
+from onyx.db.credentials import update_credential_json
+from onyx.db.engine import get_current_tenant_id
+from onyx.db.engine import get_session
+from onyx.db.models import User
+from onyx.redis.redis_pool import get_redis_client
+from onyx.server.documents.models import CredentialBase
+from onyx.utils.logger import setup_logger
+
+logger = setup_logger()
+
+
+class ConfluenceCloudOAuth:
+    # https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/
+
+    class OAuthSession(BaseModel):
+        """Stored in redis to be looked up on callback"""
+
+        email: str
+        redirect_on_success: str | None  # Where to send the user if OAuth flow succeeds
+
+    class TokenResponse(BaseModel):
+        access_token: str
+        expires_in: int
+        token_type: str
+        refresh_token: str
+        scope: str
+
+    class AccessibleResources(BaseModel):
+        id: str
+        name: str
+        url: str
+        scopes: list[str]
+        avatarUrl: str
+
+    CLIENT_ID = OAUTH_CONFLUENCE_CLOUD_CLIENT_ID
+    CLIENT_SECRET = OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET
+    TOKEN_URL = CONFLUENCE_OAUTH_TOKEN_URL
+
+    ACCESSIBLE_RESOURCE_URL = (
+        "https://api.atlassian.com/oauth/token/accessible-resources"
+    )
+
+    # All read scopes per https://developer.atlassian.com/cloud/confluence/scopes-for-oauth-2-3LO-and-forge-apps/
+    CONFLUENCE_OAUTH_SCOPE = (
+        # classic scope
+        "read:confluence-space.summary%20"
+        "read:confluence-props%20"
+        "read:confluence-content.all%20"
+        "read:confluence-content.summary%20"
+        "read:confluence-content.permission%20"
+        "read:confluence-user%20"
+        "read:confluence-groups%20"
+        "readonly:content.attachment:confluence%20"
+        "search:confluence%20"
+        # granular scope
+        "read:attachment:confluence%20"  # possibly unneeded unless calling v2 attachments api
+        "read:content-details:confluence%20"  # for permission sync
+        "offline_access"
+    )
+
+    REDIRECT_URI = f"{WEB_DOMAIN}/admin/connectors/confluence/oauth/callback"
+    DEV_REDIRECT_URI = f"https://redirectmeto.com/{REDIRECT_URI}"
+
+    # eventually for Confluence Data Center
+    # oauth_url = (
+    #     f"http://localhost:8090/rest/oauth/v2/authorize?client_id={CONFLUENCE_OAUTH_CLIENT_ID}"
+    #     f"&scope={CONFLUENCE_OAUTH_SCOPE_2}"
+    #     f"&redirect_uri={redirectme_uri}"
+    # )
+
+    @classmethod
+    def generate_oauth_url(cls, state: str) -> str:
+        return cls._generate_oauth_url_helper(cls.REDIRECT_URI, state)
+
+    @classmethod
+    def generate_dev_oauth_url(cls, state: str) -> str:
+        """dev mode workaround for localhost testing
+        - https://www.nango.dev/blog/oauth-redirects-on-localhost-with-https
+        """
+        return cls._generate_oauth_url_helper(cls.DEV_REDIRECT_URI, state)
+
+    @classmethod
+    def _generate_oauth_url_helper(cls, redirect_uri: str, state: str) -> str:
+        # https://developer.atlassian.com/cloud/jira/platform/oauth-2-3lo-apps/#1--direct-the-user-to-the-authorization-url-to-get-an-authorization-code
+
+        url = (
+            "https://auth.atlassian.com/authorize"
+            f"?audience=api.atlassian.com"
+            f"&client_id={cls.CLIENT_ID}"
+            f"&scope={cls.CONFLUENCE_OAUTH_SCOPE}"
+            f"&redirect_uri={redirect_uri}"
+            f"&state={state}"
+            "&response_type=code"
+            "&prompt=consent"
+        )
+        return url
+
+    @classmethod
+    def session_dump_json(cls, email: str, redirect_on_success: str | None) -> str:
+        """Temporary state to store in redis, to be looked up on auth response.
+        Returns a json string.
+        """
+        session = ConfluenceCloudOAuth.OAuthSession(
+            email=email, redirect_on_success=redirect_on_success
+        )
+        return session.model_dump_json()
+
+    @classmethod
+    def parse_session(cls, session_json: str) -> OAuthSession:
+        session = ConfluenceCloudOAuth.OAuthSession.model_validate_json(session_json)
+        return session
+
+    @classmethod
+    def generate_finalize_url(cls, credential_id: int) -> str:
+        return f"{WEB_DOMAIN}/admin/connectors/confluence/oauth/finalize?credential={credential_id}"
+
+
+@router.post("/connector/confluence/callback")
+def confluence_oauth_callback(
+    code: str,
+    state: str,
+    user: User = Depends(current_admin_user),
+    db_session: Session = Depends(get_session),
+    tenant_id: str | None = Depends(get_current_tenant_id),
+) -> JSONResponse:
+    """Handles the backend logic for the frontend page that the user is redirected to
+    after visiting the oauth authorization url."""
+
+    if not ConfluenceCloudOAuth.CLIENT_ID or not ConfluenceCloudOAuth.CLIENT_SECRET:
+        raise HTTPException(
+            status_code=500,
+            detail="Confluence Cloud client ID or client secret is not configured.",
+        )
+
+    r = get_redis_client(tenant_id=tenant_id)
+
+    # recover the state
+    padded_state = state + "=" * (
+        -len(state) % 4
+    )  # Add padding back (Base64 decoding requires padding)
+    uuid_bytes = base64.urlsafe_b64decode(
+        padded_state
+    )  # Decode the Base64 string back to bytes
+
+    # Convert bytes back to a UUID
+    oauth_uuid = uuid.UUID(bytes=uuid_bytes)
+    oauth_uuid_str = str(oauth_uuid)
+
+    r_key = f"da_oauth:{oauth_uuid_str}"
+
+    session_json_bytes = cast(bytes, r.get(r_key))
+    if not session_json_bytes:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Confluence Cloud OAuth failed - OAuth state key not found: key={r_key}",
+        )
+
+    session_json = session_json_bytes.decode("utf-8")
+    try:
+        session = ConfluenceCloudOAuth.parse_session(session_json)
+
+        if not DEV_MODE:
+            redirect_uri = ConfluenceCloudOAuth.REDIRECT_URI
+        else:
+            redirect_uri = ConfluenceCloudOAuth.DEV_REDIRECT_URI
+
+        # Exchange the authorization code for an access token
+        response = requests.post(
+            ConfluenceCloudOAuth.TOKEN_URL,
+            headers={"Content-Type": "application/x-www-form-urlencoded"},
+            data={
+                "client_id": ConfluenceCloudOAuth.CLIENT_ID,
+                "client_secret": ConfluenceCloudOAuth.CLIENT_SECRET,
+                "code": code,
+                "redirect_uri": redirect_uri,
+                "grant_type": "authorization_code",
+            },
+        )
+
+        token_response: ConfluenceCloudOAuth.TokenResponse | None = None
+
+        try:
+            token_response = ConfluenceCloudOAuth.TokenResponse.model_validate_json(
+                response.text
+            )
+        except Exception:
+            raise RuntimeError(
+                "Confluence Cloud OAuth failed during code/token exchange."
+            )
+
+        now = datetime.now(timezone.utc)
+        expires_at = now + timedelta(seconds=token_response.expires_in)
+
+        credential_info = CredentialBase(
+            credential_json={
+                "confluence_access_token": token_response.access_token,
+                "confluence_refresh_token": token_response.refresh_token,
+                "created_at": now.isoformat(),
+                "expires_at": expires_at.isoformat(),
+                "expires_in": token_response.expires_in,
+                "scope": token_response.scope,
+            },
+            admin_public=True,
+            source=DocumentSource.CONFLUENCE,
+            name="Confluence Cloud OAuth",
+        )
+
+        credential = create_credential(credential_info, user, db_session)
+    except Exception as e:
+        return JSONResponse(
+            status_code=500,
+            content={
+                "success": False,
+                "message": f"An error occurred during Confluence Cloud OAuth: {str(e)}",
+            },
+        )
+    finally:
+        r.delete(r_key)
+
+    # return the result
+    return JSONResponse(
+        content={
+            "success": True,
+            "message": "Confluence Cloud OAuth completed successfully.",
+            "finalize_url": ConfluenceCloudOAuth.generate_finalize_url(credential.id),
+            "redirect_on_success": session.redirect_on_success,
+        }
+    )
+
+
+@router.get("/connector/confluence/accessible-resources")
+def confluence_oauth_accessible_resources(
+    credential_id: int,
+    user: User = Depends(current_admin_user),
+    db_session: Session = Depends(get_session),
+    tenant_id: str | None = Depends(get_current_tenant_id),
+) -> JSONResponse:
+    """Atlassian's API is weird and does not supply us with enough info to be in a
+    usable state after authorizing. All APIs require a cloud id. We have to list
+    the accessible resources/sites and let the user choose which site to use."""
+
+    credential = fetch_credential_by_id_for_user(credential_id, user, db_session)
+    if not credential:
+        raise HTTPException(400, f"Credential {credential_id} not found.")
+
+    credential_dict = credential.credential_json
+    access_token = credential_dict["confluence_access_token"]
+
+    try:
+        # List the Atlassian sites (cloud ids) this access token can reach
+        response = requests.get(
+            ConfluenceCloudOAuth.ACCESSIBLE_RESOURCE_URL,
+            headers={
+                "Authorization": f"Bearer {access_token}",
+                "Accept": "application/json",
+            },
+        )
+
+        response.raise_for_status()
+        accessible_resources_data = response.json()
+
+        # Validate the list of AccessibleResources
+        try:
+            accessible_resources = [
+                ConfluenceCloudOAuth.AccessibleResources(**resource)
+                for resource in accessible_resources_data
+            ]
+        except ValidationError as e:
+            raise RuntimeError(f"Failed to parse accessible resources: {e}")
+    except Exception as e:
+        return JSONResponse(
+            status_code=500,
+            content={
+                "success": False,
+                "message": f"An error occurred retrieving Confluence Cloud accessible resources: {str(e)}",
+            },
+        )
+
+    # return the result
+    return JSONResponse(
+        content={
+            "success": True,
+            "message": "Confluence Cloud get accessible resources completed successfully.",
+            "accessible_resources": [
+                resource.model_dump() for resource in accessible_resources
+            ],
+        }
+    )
+
+
+@router.post("/connector/confluence/finalize")
+def confluence_oauth_finalize(
+    credential_id: int,
+    cloud_id: str,
+    cloud_name: str,
+    cloud_url: str,
+    user: User = Depends(current_admin_user),
+    db_session: Session = Depends(get_session),
+    tenant_id: str | None = Depends(get_current_tenant_id),
+) -> JSONResponse:
+    """Saves the info for the selected cloud site to the credential.
+    This is the final step in the Confluence OAuth flow: after the traditional
+    OAuth process, the user has to select a site to associate with the credentials.
+    After this, the credential is usable."""
+
+    credential = fetch_credential_by_id_for_user(credential_id, user, db_session)
+    if not credential:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Confluence Cloud OAuth failed - credential {credential_id} not found.",
+        )
+
+    new_credential_json: dict[str, Any] = dict(credential.credential_json)
+    new_credential_json["cloud_id"] = cloud_id
+    new_credential_json["cloud_name"] = cloud_name
+    new_credential_json["wiki_base"] = cloud_url
+
+    try:
+        update_credential_json(credential_id, new_credential_json, user, db_session)
+    except Exception as e:
+        return JSONResponse(
+            status_code=500,
+            content={
+                "success": False,
+                "message": f"An error occurred during Confluence Cloud OAuth: {str(e)}",
+            },
+        )
+
+    # return the result
+    return JSONResponse(
+        content={
+            "success": True,
+            "message": "Confluence Cloud OAuth finalized successfully.",
+            "redirect_url": f"{WEB_DOMAIN}/admin/connectors/confluence",
+        }
+    )
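Note on `_generate_oauth_url_helper` in the file above: the query string is assembled by hand, with scopes pre-joined by `%20` and `redirect_uri` inserted without percent-encoding. A possible alternative, sketched here under the assumption that the same parameters are wanted, is to let `urllib.parse.urlencode` do the encoding. The function `build_authorize_url` and the example values are hypothetical and not part of this PR.

```python
from urllib.parse import parse_qs, quote, urlencode, urlparse


def build_authorize_url(
    client_id: str, scopes: list[str], redirect_uri: str, state: str
) -> str:
    """Hypothetical helper: build the Atlassian authorize URL with encoded params."""
    params = {
        "audience": "api.atlassian.com",
        "client_id": client_id,
        "scope": " ".join(scopes),  # scopes are space-delimited before encoding
        "redirect_uri": redirect_uri,
        "state": state,
        "response_type": "code",
        "prompt": "consent",
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return "https://auth.atlassian.com/authorize?" + urlencode(params, quote_via=quote)


# Example values are placeholders, not real credentials or domains.
url = build_authorize_url(
    client_id="example-client-id",
    scopes=["read:confluence-user", "offline_access"],
    redirect_uri="https://example.com/admin/connectors/confluence/oauth/callback",
    state="example-state",
)

# Parsing the URL back recovers the original values.
query = parse_qs(urlparse(url).query)
assert query["scope"] == ["read:confluence-user offline_access"]
assert query["redirect_uri"] == [
    "https://example.com/admin/connectors/confluence/oauth/callback"
]
```

Whether the dev redirect via redirectmeto.com tolerates an encoded `redirect_uri` is not verified here, so treat this as an option to test rather than a drop-in replacement.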
--- a/backend/ee/onyx/server/oauth/google_drive.py
+++ b/backend/ee/onyx/server/oauth/google_drive.py
@@ -0,0 +1,229 @@
+import base64
+import json
+import uuid
+from typing import Any
+from typing import cast
+
+import requests
+from fastapi import Depends
+from fastapi import HTTPException
+from fastapi.responses import JSONResponse
+from pydantic import BaseModel
+from sqlalchemy.orm import Session
+
+from ee.onyx.configs.app_configs import OAUTH_GOOGLE_DRIVE_CLIENT_ID
+from ee.onyx.configs.app_configs import OAUTH_GOOGLE_DRIVE_CLIENT_SECRET
+from ee.onyx.server.oauth.api_router import router
+from onyx.auth.users import current_admin_user
+from onyx.configs.app_configs import DEV_MODE
+from onyx.configs.app_configs import WEB_DOMAIN
+from onyx.configs.constants import DocumentSource
+from onyx.connectors.google_utils.google_auth import get_google_oauth_creds
+from onyx.connectors.google_utils.google_auth import sanitize_oauth_credentials
+from onyx.connectors.google_utils.shared_constants import (
+    DB_CREDENTIALS_AUTHENTICATION_METHOD,
+)
+from onyx.connectors.google_utils.shared_constants import (
+    DB_CREDENTIALS_DICT_TOKEN_KEY,
+)
+from onyx.connectors.google_utils.shared_constants import (
+    DB_CREDENTIALS_PRIMARY_ADMIN_KEY,
+)
+from onyx.connectors.google_utils.shared_constants import (
+    GoogleOAuthAuthenticationMethod,
+)
+from onyx.db.credentials import create_credential
+from onyx.db.engine import get_current_tenant_id
+from onyx.db.engine import get_session
+from onyx.db.models import User
+from onyx.redis.redis_pool import get_redis_client
+from onyx.server.documents.models import CredentialBase
+
+
+class GoogleDriveOAuth:
+    # https://developers.google.com/identity/protocols/oauth2
+    # https://developers.google.com/identity/protocols/oauth2/web-server
+
+    class OAuthSession(BaseModel):
+        """Stored in redis to be looked up on callback"""
+
+        email: str
+        redirect_on_success: str | None  # Where to send the user if OAuth flow succeeds
+
+    CLIENT_ID = OAUTH_GOOGLE_DRIVE_CLIENT_ID
+    CLIENT_SECRET = OAUTH_GOOGLE_DRIVE_CLIENT_SECRET
+
+    TOKEN_URL = "https://oauth2.googleapis.com/token"
+
+    # SCOPE is per https://docs.danswer.dev/connectors/google-drive
+    # TODO: Merge with or use google_utils.GOOGLE_SCOPES
+    SCOPE = (
+        "https://www.googleapis.com/auth/drive.readonly%20"
+        "https://www.googleapis.com/auth/drive.metadata.readonly%20"
+        "https://www.googleapis.com/auth/admin.directory.user.readonly%20"
+        "https://www.googleapis.com/auth/admin.directory.group.readonly"
+    )
+
+    REDIRECT_URI = f"{WEB_DOMAIN}/admin/connectors/google-drive/oauth/callback"
+    DEV_REDIRECT_URI = f"https://redirectmeto.com/{REDIRECT_URI}"
+
+    @classmethod
+    def generate_oauth_url(cls, state: str) -> str:
+        return cls._generate_oauth_url_helper(cls.REDIRECT_URI, state)
+
+    @classmethod
+    def generate_dev_oauth_url(cls, state: str) -> str:
+        """dev mode workaround for localhost testing
+        - https://www.nango.dev/blog/oauth-redirects-on-localhost-with-https
+        """
+
+        return cls._generate_oauth_url_helper(cls.DEV_REDIRECT_URI, state)
+
+    @classmethod
+    def _generate_oauth_url_helper(cls, redirect_uri: str, state: str) -> str:
+        # without prompt=consent, a refresh token is only issued the first time the user approves
+        url = (
+            f"https://accounts.google.com/o/oauth2/v2/auth"
+            f"?client_id={cls.CLIENT_ID}"
+            f"&redirect_uri={redirect_uri}"
+            "&response_type=code"
+            f"&scope={cls.SCOPE}"
+            "&access_type=offline"
+            f"&state={state}"
+            "&prompt=consent"
+        )
+        return url
+
+    @classmethod
+    def session_dump_json(cls, email: str, redirect_on_success: str | None) -> str:
+        """Temporary state to store in redis, to be looked up on auth response.
+        Returns a json string.
+        """
+        session = GoogleDriveOAuth.OAuthSession(
+            email=email, redirect_on_success=redirect_on_success
+        )
+        return session.model_dump_json()
+
+    @classmethod
+    def parse_session(cls, session_json: str) -> OAuthSession:
+        session = GoogleDriveOAuth.OAuthSession.model_validate_json(session_json)
+        return session
+
+
+@router.post("/connector/google-drive/callback")
+def handle_google_drive_oauth_callback(
+    code: str,
+    state: str,
+    user: User = Depends(current_admin_user),
+    db_session: Session = Depends(get_session),
+    tenant_id: str | None = Depends(get_current_tenant_id),
+) -> JSONResponse:
+    if not GoogleDriveOAuth.CLIENT_ID or not GoogleDriveOAuth.CLIENT_SECRET:
+        raise HTTPException(
+            status_code=500,
+            detail="Google Drive client ID or client secret is not configured.",
+        )
+
+    r = get_redis_client(tenant_id=tenant_id)
+
+    # recover the state
+    padded_state = state + "=" * (
+        -len(state) % 4
+    )  # Add padding back (Base64 decoding requires padding)
+    uuid_bytes = base64.urlsafe_b64decode(
+        padded_state
+    )  # Decode the Base64 string back to bytes
+
+    # Convert bytes back to a UUID
+    oauth_uuid = uuid.UUID(bytes=uuid_bytes)
+    oauth_uuid_str = str(oauth_uuid)
+
+    r_key = f"da_oauth:{oauth_uuid_str}"
+
+    session_json_bytes = cast(bytes, r.get(r_key))
+    if not session_json_bytes:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Google Drive OAuth failed - OAuth state key not found: key={r_key}",
+        )
+
+    session_json = session_json_bytes.decode("utf-8")
+    try:
+        session = GoogleDriveOAuth.parse_session(session_json)
+
+        if not DEV_MODE:
+            redirect_uri = GoogleDriveOAuth.REDIRECT_URI
+        else:
+            redirect_uri = GoogleDriveOAuth.DEV_REDIRECT_URI
+
+        # Exchange the authorization code for an access token
+        response = requests.post(
+            GoogleDriveOAuth.TOKEN_URL,
+            headers={"Content-Type": "application/x-www-form-urlencoded"},
+            data={
+                "client_id": GoogleDriveOAuth.CLIENT_ID,
+                "client_secret": GoogleDriveOAuth.CLIENT_SECRET,
+                "code": code,
+                "redirect_uri": redirect_uri,
+                "grant_type": "authorization_code",
+            },
+        )
+
+        response.raise_for_status()
+
+        authorization_response: dict[str, Any] = response.json()
+
+        # the connector wants us to store the json in its authorized_user_info format
+        # returned from OAuthCredentials.get_authorized_user_info().
+        # So refresh immediately via get_google_oauth_creds with the params filled in
+        # from fields in authorization_response to get the json we need
+        authorized_user_info = {}
+        authorized_user_info["client_id"] = OAUTH_GOOGLE_DRIVE_CLIENT_ID
+        authorized_user_info["client_secret"] = OAUTH_GOOGLE_DRIVE_CLIENT_SECRET
+        authorized_user_info["refresh_token"] = authorization_response["refresh_token"]
+
+        token_json_str = json.dumps(authorized_user_info)
+        oauth_creds = get_google_oauth_creds(
+            token_json_str=token_json_str, source=DocumentSource.GOOGLE_DRIVE
+        )
+        if not oauth_creds:
+            raise RuntimeError("get_google_oauth_creds returned None.")
+
+        # save off the credentials
+        oauth_creds_sanitized_json_str = sanitize_oauth_credentials(oauth_creds)
+
+        credential_dict: dict[str, str] = {}
+        credential_dict[DB_CREDENTIALS_DICT_TOKEN_KEY] = oauth_creds_sanitized_json_str
+        credential_dict[DB_CREDENTIALS_PRIMARY_ADMIN_KEY] = session.email
+        credential_dict[
+            DB_CREDENTIALS_AUTHENTICATION_METHOD
+        ] = GoogleOAuthAuthenticationMethod.OAUTH_INTERACTIVE.value
+
+        credential_info = CredentialBase(
+            credential_json=credential_dict,
+            admin_public=True,
+            source=DocumentSource.GOOGLE_DRIVE,
+            name="OAuth (interactive)",
+        )
+
+        create_credential(credential_info, user, db_session)
+    except Exception as e:
+        return JSONResponse(
+            status_code=500,
+            content={
+                "success": False,
+                "message": f"An error occurred during Google Drive OAuth: {str(e)}",
+            },
+        )
+    finally:
+        r.delete(r_key)
+
+    # return the result
+    return JSONResponse(
+        content={
+            "success": True,
+            "message": "Google Drive OAuth completed successfully.",
+            "finalize_url": None,
+            "redirect_on_success": session.redirect_on_success,
+        }
+    )
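Note on the `state` handling in the callback above: the handler restores the Base64 padding, decodes 16 bytes, and rebuilds the UUID that keys the redis entry. A minimal sketch of that round trip follows. The encoding side (`encode_oauth_state`) is not part of this diff and is an assumption about how the state was produced; both helper names are hypothetical.

```python
import base64
import uuid


def encode_oauth_state(oauth_uuid: uuid.UUID) -> str:
    """Hypothetical encoder: URL-safe Base64 of the 16 UUID bytes, padding stripped."""
    return base64.urlsafe_b64encode(oauth_uuid.bytes).rstrip(b"=").decode("utf-8")


def decode_oauth_state(state: str) -> uuid.UUID:
    """Mirrors the callback: add the padding back, decode, rebuild the UUID."""
    padded_state = state + "=" * (-len(state) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded_state))


original = uuid.uuid4()
state = encode_oauth_state(original)
restored = decode_oauth_state(state)

# Same key format as the handler uses for the redis lookup
redis_key = f"da_oauth:{restored}"
```

If `state` is not valid Base64 or does not decode to exactly 16 bytes, `decode_oauth_state` raises `ValueError` (`binascii.Error` is a subclass of it), so a caller may want to map that to a 400 response.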
--- a/backend/ee/onyx/server/oauth/slack.py
+++ b/backend/ee/onyx/server/oauth/slack.py
@@ -0,0 +1,197 @@
+import base64
+import uuid
+from typing import cast
+
+import requests
+from fastapi import Depends
+from fastapi import HTTPException
+from fastapi.responses import JSONResponse
+from pydantic import BaseModel
+from sqlalchemy.orm import Session
+
+from ee.onyx.configs.app_configs import OAUTH_SLACK_CLIENT_ID
+from ee.onyx.configs.app_configs import OAUTH_SLACK_CLIENT_SECRET
+from ee.onyx.server.oauth.api_router import router
+from onyx.auth.users import current_admin_user
+from onyx.configs.app_configs import DEV_MODE
+from onyx.configs.app_configs import WEB_DOMAIN
+from onyx.configs.constants import DocumentSource
+from onyx.db.credentials import create_credential
+from onyx.db.engine import get_current_tenant_id
+from onyx.db.engine import get_session
+from onyx.db.models import User
+from onyx.redis.redis_pool import get_redis_client
+from onyx.server.documents.models import CredentialBase
+
+
+class SlackOAuth:
+    # https://knock.app/blog/how-to-authenticate-users-in-slack-using-oauth
+    # Example: https://api.slack.com/authentication/oauth-v2#exchanging
+
+    class OAuthSession(BaseModel):
+        """Stored in redis to be looked up on callback"""
+
+        email: str
+        redirect_on_success: str | None  # Where to send the user if OAuth flow succeeds
+
+    CLIENT_ID = OAUTH_SLACK_CLIENT_ID
+    CLIENT_SECRET = OAUTH_SLACK_CLIENT_SECRET
+
+    TOKEN_URL = "https://slack.com/api/oauth.v2.access"
+
+    # SCOPE is per https://docs.danswer.dev/connectors/slack
+    BOT_SCOPE = (
+        "channels:history,"
+        "channels:read,"
+        "groups:history,"
+        "groups:read,"
+        "channels:join,"
+        "im:history,"
+        "users:read,"
+        "users:read.email,"
+        "usergroups:read"
+    )
+
+    REDIRECT_URI = f"{WEB_DOMAIN}/admin/connectors/slack/oauth/callback"
+    DEV_REDIRECT_URI = f"https://redirectmeto.com/{REDIRECT_URI}"
+
+    @classmethod
+    def generate_oauth_url(cls, state: str) -> str:
+        return cls._generate_oauth_url_helper(cls.REDIRECT_URI, state)
+
+    @classmethod
+    def generate_dev_oauth_url(cls, state: str) -> str:
+        """dev mode workaround for localhost testing
+        - https://www.nango.dev/blog/oauth-redirects-on-localhost-with-https
+        """
+
+        return cls._generate_oauth_url_helper(cls.DEV_REDIRECT_URI, state)
+
+    @classmethod
+    def _generate_oauth_url_helper(cls, redirect_uri: str, state: str) -> str:
+        url = (
+            "https://slack.com/oauth/v2/authorize"
+            f"?client_id={cls.CLIENT_ID}"
+            f"&redirect_uri={redirect_uri}"
+            f"&scope={cls.BOT_SCOPE}"
+            f"&state={state}"
+        )
+        return url
+
+    @classmethod
+    def session_dump_json(cls, email: str, redirect_on_success: str | None) -> str:
+        """Temporary state to store in redis, to be looked up on auth response.
+        Returns a json string.
+        """
+        session = SlackOAuth.OAuthSession(
+            email=email, redirect_on_success=redirect_on_success
+        )
+        return session.model_dump_json()
+
+    @classmethod
+    def parse_session(cls, session_json: str) -> OAuthSession:
+        session = SlackOAuth.OAuthSession.model_validate_json(session_json)
+        return session
+
+
+@router.post("/connector/slack/callback")
+def handle_slack_oauth_callback(
+    code: str,
+    state: str,
+    user: User = Depends(current_admin_user),
+    db_session: Session = Depends(get_session),
+    tenant_id: str | None = Depends(get_current_tenant_id),
+) -> JSONResponse:
+    if not SlackOAuth.CLIENT_ID or not SlackOAuth.CLIENT_SECRET:
+        raise HTTPException(
+            status_code=500,
+            detail="Slack client ID or client secret is not configured.",
+        )
+
+    r = get_redis_client(tenant_id=tenant_id)
+
+    # recover the state
+    padded_state = state + "=" * (
+        -len(state) % 4
+    )  # Add padding back (Base64 decoding requires padding)
+    uuid_bytes = base64.urlsafe_b64decode(
+        padded_state
+    )  # Decode the Base64 string back to bytes
+
+    # Convert bytes back to a UUID
+    oauth_uuid = uuid.UUID(bytes=uuid_bytes)
+    oauth_uuid_str = str(oauth_uuid)
+
+    r_key = f"da_oauth:{oauth_uuid_str}"
+
+    session_json_bytes = cast(bytes, r.get(r_key))
+    if not session_json_bytes:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Slack OAuth failed - OAuth state key not found: key={r_key}",
+        )
+
+    session_json = session_json_bytes.decode("utf-8")
+    try:
+        session = SlackOAuth.parse_session(session_json)
+
+        if not DEV_MODE:
+            redirect_uri = SlackOAuth.REDIRECT_URI
+        else:
+            redirect_uri = SlackOAuth.DEV_REDIRECT_URI
+
+        # Exchange the authorization code for an access token
+        response = requests.post(
+            SlackOAuth.TOKEN_URL,
+            headers={"Content-Type": "application/x-www-form-urlencoded"},
+            data={
+                "client_id": SlackOAuth.CLIENT_ID,
+                "client_secret": SlackOAuth.CLIENT_SECRET,
+                "code": code,
+                "redirect_uri": redirect_uri,
+            },
+        )
+
+        response_data = response.json()
+
+        if not response_data.get("ok"):
+            raise HTTPException(
+                status_code=400,
+                detail=f"Slack OAuth failed: {response_data.get('error')}",
+            )
+
+        # Extract token and team information
+        access_token: str = response_data.get("access_token")
+        team_id: str = response_data.get("team", {}).get("id")
+        authed_user_id: str = response_data.get("authed_user", {}).get("id")
+
+        credential_info = CredentialBase(
+            credential_json={"slack_bot_token": access_token},
+            admin_public=True,
+            source=DocumentSource.SLACK,
+            name="Slack OAuth",
+        )
+
+        create_credential(credential_info, user, db_session)
+    except Exception as e:
+        return JSONResponse(
+            status_code=500,
+            content={
+                "success": False,
+                "message": f"An error occurred during Slack OAuth: {str(e)}",
+            },
+        )
+    finally:
+        r.delete(r_key)
+
+    # return the result
+    return JSONResponse(
+        content={
+            "success": True,
+            "message": "Slack OAuth completed successfully.",
+            "finalize_url": None,
+            "redirect_on_success": session.redirect_on_success,
+            "team_id": team_id,
+            "authed_user_id": authed_user_id,
+        }
+    )
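Note on `_generate_oauth_url_helper`: the authorize URL is assembled by string interpolation, so `redirect_uri` and `scope` go into the query string unescaped. One possible alternative, sketched here under the assumption that percent-encoding the parameters is acceptable to the provider, is to build the query with `urllib.parse.urlencode`. The function name and the sample values are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SLACK_AUTHORIZE_URL = "https://slack.com/oauth/v2/authorize"


def build_authorize_url(
    client_id: str, redirect_uri: str, scope: str, state: str
) -> str:
    """Hypothetical variant of the URL helper that percent-encodes each parameter."""
    query = urlencode(
        {
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            "scope": scope,
            "state": state,
        }
    )
    return f"{SLACK_AUTHORIZE_URL}?{query}"


# Placeholder values for illustration only
url = build_authorize_url(
    client_id="123.456",
    redirect_uri="http://localhost:3000/admin/connectors/slack/oauth/callback",
    scope="channels:history,channels:read",
    state="abc123",
)

# Parsing the query back recovers the original, unescaped values
params = parse_qs(urlsplit(url).query)
```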
--- a/backend/ee/onyx/server/query_and_chat/token_limit.py
+++ b/backend/ee/onyx/server/query_and_chat/token_limit.py
@@ -13,7 +13,7 @@ from sqlalchemy import select
 from sqlalchemy.orm import Session

 from onyx.db.api_key import is_api_key_email_address
-from onyx.db.engine import get_session_with_tenant
+from onyx.db.engine import get_session_with_current_tenant
 from onyx.db.models import ChatMessage
 from onyx.db.models import ChatSession
 from onyx.db.models import TokenRateLimit
@@ -28,21 +28,21 @@ from onyx.server.query_and_chat.token_limit import _user_is_rate_limited_by_glob
 from onyx.utils.threadpool_concurrency import run_functions_tuples_in_parallel


-def _check_token_rate_limits(user: User | None, tenant_id: str) -> None:
+def _check_token_rate_limits(user: User | None) -> None:
    if user is None:
        # Unauthenticated users are only rate limited by global settings
-        _user_is_rate_limited_by_global(tenant_id)
+        _user_is_rate_limited_by_global()

    elif is_api_key_email_address(user.email):
        # API keys are only rate limited by global settings
-        _user_is_rate_limited_by_global(tenant_id)
+        _user_is_rate_limited_by_global()

    else:
        run_functions_tuples_in_parallel(
            [
-                (_user_is_rate_limited, (user.id, tenant_id)),
-                (_user_is_rate_limited_by_group, (user.id, tenant_id)),
-                (_user_is_rate_limited_by_global, (tenant_id,)),
+                (_user_is_rate_limited, (user.id,)),
+                (_user_is_rate_limited_by_group, (user.id,)),
+                (_user_is_rate_limited_by_global, ()),
            ]
        )

@@ -52,8 +52,8 @@ User rate limits
 """


-def _user_is_rate_limited(user_id: UUID, tenant_id: str) -> None:
-    with get_session_with_tenant(tenant_id=tenant_id) as db_session:
+def _user_is_rate_limited(user_id: UUID) -> None:
+    with get_session_with_current_tenant() as db_session:
        user_rate_limits = fetch_all_user_token_rate_limits(
            db_session=db_session, enabled_only=True, ordered=False
        )
@@ -93,8 +93,8 @@ User Group rate limits
 """


-def _user_is_rate_limited_by_group(user_id: UUID, tenant_id: str | None) -> None:
-    with get_session_with_tenant(tenant_id=tenant_id) as db_session:
+def _user_is_rate_limited_by_group(user_id: UUID) -> None:
+    with get_session_with_current_tenant() as db_session:
        group_rate_limits = _fetch_all_user_group_rate_limits(user_id, db_session)

        if group_rate_limits:
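Note on dropping the `tenant_id` parameters: the rate-limit checks now rely on `get_session_with_current_tenant()`, which presumably reads the tenant from a context variable. Worker threads do not inherit the caller's context by default, so the parallel helper has to propagate it. Whether `run_functions_tuples_in_parallel` does so is not visible in this diff. The sketch below shows one way to do it with the standard library; every name in it is hypothetical.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

# Hypothetical stand-in for the tenant context variable
CURRENT_TENANT: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_tenant", default="public"
)


def check_user_limit(user_id: str) -> str:
    # Reads the tenant from context instead of taking it as an argument
    return f"{CURRENT_TENANT.get()}:{user_id}"


def check_global_limit() -> str:
    return f"{CURRENT_TENANT.get()}:global"


def run_in_parallel(calls: list[tuple[Callable[..., Any], tuple]]) -> list[Any]:
    """Run each (func, args) pair in a thread with its own copy of the caller's context."""
    with ThreadPoolExecutor() as executor:
        futures = [
            # One context copy per task: a Context cannot be entered twice at once
            executor.submit(contextvars.copy_context().run, func, *args)
            for func, args in calls
        ]
        return [future.result() for future in futures]


token = CURRENT_TENANT.set("tenant_a")
try:
    results = run_in_parallel([(check_user_limit, ("u1",)), (check_global_limit, ())])
finally:
    CURRENT_TENANT.reset(token)
```

Without the `copy_context().run` wrapper, the worker threads would see the default value rather than `tenant_a`.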
--- a/backend/ee/onyx/server/query_history/api.py
+++ b/backend/ee/onyx/server/query_history/api.py
@@ -2,6 +2,7 @@ import csv
 import io
 from datetime import datetime
 from datetime import timezone
+from http import HTTPStatus
 from uuid import UUID

 from fastapi import APIRouter
@@ -21,8 +22,10 @@ from ee.onyx.server.query_history.models import QuestionAnswerPairSnapshot
 from onyx.auth.users import current_admin_user
 from onyx.auth.users import get_display_email
 from onyx.chat.chat_utils import create_chat_chain
+from onyx.configs.app_configs import ONYX_QUERY_HISTORY_TYPE
 from onyx.configs.constants import MessageType
 from onyx.configs.constants import QAFeedbackType
+from onyx.configs.constants import QueryHistoryType
 from onyx.configs.constants import SessionType
 from onyx.db.chat import get_chat_session_by_id
 from onyx.db.chat import get_chat_sessions_by_user
@@ -35,6 +38,8 @@ from onyx.server.query_and_chat.models import ChatSessionsResponse

 router = APIRouter()

+ONYX_ANONYMIZED_EMAIL = "anonymous@anonymous.invalid"
+

 def fetch_and_process_chat_session_history(
    db_session: Session,
@@ -107,6 +112,17 @@ def get_user_chat_sessions(
    _: User | None = Depends(current_admin_user),
    db_session: Session = Depends(get_session),
 ) -> ChatSessionsResponse:
+    # we specifically don't allow this endpoint if "anonymized" since
+    # this is a direct query on the user id
+    if ONYX_QUERY_HISTORY_TYPE in [
+        QueryHistoryType.DISABLED,
+        QueryHistoryType.ANONYMIZED,
+    ]:
+        raise HTTPException(
+            status_code=HTTPStatus.FORBIDDEN,
+            detail="Per user query history has been disabled by the administrator.",
+        )
+
    try:
        chat_sessions = get_chat_sessions_by_user(
            user_id=user_id, deleted=False, db_session=db_session, limit=0
@@ -122,6 +138,7 @@ def get_user_chat_sessions(
                name=chat.description,
                persona_id=chat.persona_id,
                time_created=chat.time_created.isoformat(),
+                time_updated=chat.time_updated.isoformat(),
                shared_status=chat.shared_status,
                folder_id=chat.folder_id,
                current_alternate_model=chat.current_alternate_model,
@@ -141,6 +158,12 @@ def get_chat_session_history(
    _: User | None = Depends(current_admin_user),
    db_session: Session = Depends(get_session),
 ) -> PaginatedReturn[ChatSessionMinimal]:
+    if ONYX_QUERY_HISTORY_TYPE == QueryHistoryType.DISABLED:
+        raise HTTPException(
+            status_code=HTTPStatus.FORBIDDEN,
+            detail="Query history has been disabled by the administrator.",
+        )
+
    page_of_chat_sessions = get_page_of_chat_sessions(
        page_num=page_num,
        page_size=page_size,
@@ -157,11 +180,16 @@ def get_chat_session_history(
        feedback_filter=feedback_type,
    )

+    minimal_chat_sessions: list[ChatSessionMinimal] = []
+
+    for chat_session in page_of_chat_sessions:
+        minimal_chat_session = ChatSessionMinimal.from_chat_session(chat_session)
+        if ONYX_QUERY_HISTORY_TYPE == QueryHistoryType.ANONYMIZED:
+            minimal_chat_session.user_email = ONYX_ANONYMIZED_EMAIL
+        minimal_chat_sessions.append(minimal_chat_session)
+
    return PaginatedReturn(
-        items=[
-            ChatSessionMinimal.from_chat_session(chat_session)
-            for chat_session in page_of_chat_sessions
-        ],
+        items=minimal_chat_sessions,
        total_items=total_filtered_chat_sessions_count,
    )

@@ -172,6 +200,12 @@ def get_chat_session_admin(
    _: User | None = Depends(current_admin_user),
    db_session: Session = Depends(get_session),
 ) -> ChatSessionSnapshot:
+    if ONYX_QUERY_HISTORY_TYPE == QueryHistoryType.DISABLED:
+        raise HTTPException(
+            status_code=HTTPStatus.FORBIDDEN,
+            detail="Query history has been disabled by the administrator.",
+        )
+
    try:
        chat_session = get_chat_session_by_id(
            chat_session_id=chat_session_id,
@@ -193,6 +227,9 @@ def get_chat_session_admin(
            f"Could not create snapshot for chat session with id '{chat_session_id}'",
        )

+    if ONYX_QUERY_HISTORY_TYPE == QueryHistoryType.ANONYMIZED:
+        snapshot.user_email = ONYX_ANONYMIZED_EMAIL
+
    return snapshot


@@ -203,6 +240,12 @@ def get_query_history_as_csv(
    end: datetime | None = None,
    db_session: Session = Depends(get_session),
 ) -> StreamingResponse:
+    if ONYX_QUERY_HISTORY_TYPE == QueryHistoryType.DISABLED:
+        raise HTTPException(
+            status_code=HTTPStatus.FORBIDDEN,
+            detail="Query history has been disabled by the administrator.",
+        )
+
    complete_chat_session_history = fetch_and_process_chat_session_history(
        db_session=db_session,
        start=start or datetime.fromtimestamp(0, tz=timezone.utc),
@@ -213,6 +256,9 @@ def get_query_history_as_csv(

    question_answer_pairs: list[QuestionAnswerPairSnapshot] = []
    for chat_session_snapshot in complete_chat_session_history:
+        if ONYX_QUERY_HISTORY_TYPE == QueryHistoryType.ANONYMIZED:
+            chat_session_snapshot.user_email = ONYX_ANONYMIZED_EMAIL
+
        question_answer_pairs.extend(
            QuestionAnswerPairSnapshot.from_chat_session_snapshot(chat_session_snapshot)
        )
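Note on the query-history gating above: each endpoint repeats the same two steps, reject the request when history is disabled, and overwrite `user_email` when it is anonymized. A compact sketch of that policy is below, using plain dataclasses instead of the real models. `SessionSummary`, `HistoryDisabledError`, `visible_sessions`, the `NORMAL` member, and the enum string values are assumptions for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class QueryHistoryType(str, Enum):
    # DISABLED and ANONYMIZED appear in the diff; NORMAL and the values are assumed
    DISABLED = "disabled"
    ANONYMIZED = "anonymized"
    NORMAL = "normal"


ANONYMIZED_EMAIL = "anonymous@anonymous.invalid"


@dataclass
class SessionSummary:
    session_id: str
    user_email: Optional[str]


class HistoryDisabledError(Exception):
    """Raised where the endpoints would return 403 Forbidden."""


def visible_sessions(
    sessions: list[SessionSummary], mode: QueryHistoryType
) -> list[SessionSummary]:
    if mode == QueryHistoryType.DISABLED:
        raise HistoryDisabledError("Query history has been disabled.")
    if mode == QueryHistoryType.ANONYMIZED:
        # Return copies so the stored records keep their real email
        return [replace(s, user_email=ANONYMIZED_EMAIL) for s in sessions]
    return list(sessions)


sessions = [SessionSummary("s1", "a@example.com"), SessionSummary("s2", None)]
anonymized = visible_sessions(sessions, QueryHistoryType.ANONYMIZED)
```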
--- a/backend/ee/onyx/server/tenants/billing.py
+++ b/backend/ee/onyx/server/tenants/billing.py
@@ -7,6 +7,7 @@ from ee.onyx.configs.app_configs import STRIPE_PRICE_ID
 from ee.onyx.configs.app_configs import STRIPE_SECRET_KEY
 from ee.onyx.server.tenants.access import generate_data_plane_token
 from ee.onyx.server.tenants.models import BillingInformation
+from ee.onyx.server.tenants.models import SubscriptionStatusResponse
 from onyx.configs.app_configs import CONTROL_PLANE_API_BASE_URL
 from onyx.utils.logger import setup_logger

@@ -41,7 +42,9 @@ def fetch_tenant_stripe_information(tenant_id: str) -> dict:
    return response.json()


-def fetch_billing_information(tenant_id: str) -> BillingInformation:
+def fetch_billing_information(
+    tenant_id: str,
+) -> BillingInformation | SubscriptionStatusResponse:
    logger.info("Fetching billing information")
    token = generate_data_plane_token()
    headers = {
@@ -52,8 +55,19 @@ def fetch_billing_information(tenant_id: str) -> BillingInformation:
    params = {"tenant_id": tenant_id}
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
-    billing_info = BillingInformation(**response.json())
-    return billing_info
+
+    response_data = response.json()
+
+    # Check if the response indicates no subscription
+    if (
+        isinstance(response_data, dict)
+        and "subscribed" in response_data
+        and not response_data["subscribed"]
+    ):
+        return SubscriptionStatusResponse(**response_data)
+
+    # Otherwise, parse as BillingInformation
+    return BillingInformation(**response_data)


 def register_tenant_users(tenant_id: str, number_of_users: int) -> stripe.Subscription:
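Note on the new return type of `fetch_billing_information`: the payload is parsed as a subscription-status object only when it carries an explicit falsy `subscribed` flag, and as billing information otherwise. The sketch below reproduces that branch with dataclasses. The field names other than `subscribed`, and the class and function names, are hypothetical because the real models are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class SubscriptionStatus:
    subscribed: bool


@dataclass
class BillingInfo:
    # Hypothetical fields; the real BillingInformation model is defined elsewhere
    seats: int
    status: str


def parse_billing_payload(payload: Any) -> Union[BillingInfo, SubscriptionStatus]:
    # An explicit "subscribed": false marks a tenant with no subscription
    if (
        isinstance(payload, dict)
        and "subscribed" in payload
        and not payload["subscribed"]
    ):
        return SubscriptionStatus(**payload)
    # Anything else is treated as full billing information
    return BillingInfo(**payload)


no_subscription = parse_billing_payload({"subscribed": False})
active = parse_billing_payload({"seats": 5, "status": "active"})
```

Callers now have to distinguish the two result types, for example with `isinstance`, before reading billing-specific fields.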
--- a/backend/ee/onyx/server/tenants/product_gating.py
+++ b/backend/ee/onyx/server/tenants/product_gating.py
@@ -48,4 +48,5 @@ def store_product_gating(tenant_id: str, application_status: ApplicationStatus)

 def get_gated_tenants() -> set[str]:
    redis_client = get_redis_replica_client(tenant_id=ONYX_CLOUD_TENANT_ID)
-    return cast(set[str], redis_client.smembers(GATED_TENANTS_KEY))
+    gated_tenants_bytes = cast(set[bytes], redis_client.smembers(GATED_TENANTS_KEY))
+    return {tenant_id.decode("utf-8") for tenant_id in gated_tenants_bytes}
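Note on the `get_gated_tenants` fix: a redis-py client returns `bytes` members from `smembers` unless it was created with `decode_responses=True`, so the old `cast(set[str], ...)` only changed the static type, not the runtime values. The sketch below shows the symptom and the fix on a plain set, with no redis connection; the helper name and sample values are hypothetical.

```python
def decode_members(raw_members: set[bytes]) -> set[str]:
    """Decode each raw set member to str, as the patched function does."""
    return {member.decode("utf-8") for member in raw_members}


# What a client without decode_responses=True would hand back
raw = {b"tenant_a", b"tenant_b"}

# Before the fix: a str lookup against bytes members silently misses
found_before = "tenant_a" in raw

# After the fix: membership checks with str tenant IDs behave as expected
gated = decode_members(raw)
found_after = "tenant_a" in gated
```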
--- a/backend/ee/onyx/server/tenants/provisioning.py
+++ b/backend/ee/onyx/server/tenants/provisioning.py
@@ -55,7 +55,11 @@ logger = logging.getLogger(__name__)
 async def get_or_provision_tenant(
    email: str, referral_source: str | None = None, request: Request | None = None
 ) -> str:
-    """Get existing tenant ID for an email or create a new tenant if none exists."""
+    """
+    Get existing tenant ID for an email or create a new tenant if none exists.
+    This function should only be called after we have verified we want this user's tenant to exist.
+    It returns the tenant ID associated with the email, creating a new tenant if necessary.
+    """
    if not MULTI_TENANT:
        return POSTGRES_DEFAULT_SCHEMA

@@ -104,14 +108,14 @@ async def provision_tenant(tenant_id: str, email: str) -> None:
            status_code=409, detail="User already belongs to an organization"
        )

-    logger.info(f"Provisioning tenant: {tenant_id}")
+    logger.debug(f"Provisioning tenant {tenant_id} for user {email}")
    token = None

    try:
        if not create_schema_if_not_exists(tenant_id):
-            logger.info(f"Created schema for tenant {tenant_id}")
+            logger.debug(f"Created schema for tenant {tenant_id}")
        else:
-            logger.info(f"Schema already exists for tenant {tenant_id}")
+            logger.debug(f"Schema already exists for tenant {tenant_id}")

        token = CURRENT_TENANT_ID_CONTEXTVAR.set(tenant_id)

@@ -200,33 +204,15 @@ async def rollback_tenant_provisioning(tenant_id: str) -> None:


 def configure_default_api_keys(db_session: Session) -> None:
-    if OPENAI_DEFAULT_API_KEY:
-        open_provider = LLMProviderUpsertRequest(
-            name="OpenAI",
-            provider=OPENAI_PROVIDER_NAME,
-            api_key=OPENAI_DEFAULT_API_KEY,
-            default_model_name="gpt-4",
-            fast_default_model_name="gpt-4o-mini",
-            model_names=OPEN_AI_MODEL_NAMES,
-        )
-        try:
-            full_provider = upsert_llm_provider(open_provider, db_session)
-            update_default_provider(full_provider.id, db_session)
-        except Exception as e:
-            logger.error(f"Failed to configure OpenAI provider: {e}")
-    else:
-        logger.error(
-            "OPENAI_DEFAULT_API_KEY not set, skipping OpenAI provider configuration"
-        )
-
    if ANTHROPIC_DEFAULT_API_KEY:
        anthropic_provider = LLMProviderUpsertRequest(
            name="Anthropic",
            provider=ANTHROPIC_PROVIDER_NAME,
            api_key=ANTHROPIC_DEFAULT_API_KEY,
-            default_model_name="claude-3-5-sonnet-20241022",
+            default_model_name="claude-3-7-sonnet-20250219",
            fast_default_model_name="claude-3-5-sonnet-20241022",
            model_names=ANTHROPIC_MODEL_NAMES,
+            display_model_names=["claude-3-5-sonnet-20241022"],
        )
        try:
            full_provider = upsert_llm_provider(anthropic_provider, db_session)
@@ -238,6 +224,26 @@ def configure_default_api_keys(db_session: Session) -> None:
            "ANTHROPIC_DEFAULT_API_KEY not set, skipping Anthropic provider configuration"
        )

+    if OPENAI_DEFAULT_API_KEY:
+        open_provider = LLMProviderUpsertRequest(
+            name="OpenAI",
+            provider=OPENAI_PROVIDER_NAME,
+            api_key=OPENAI_DEFAULT_API_KEY,
+            default_model_name="gpt-4o",
+            fast_default_model_name="gpt-4o-mini",
+            model_names=OPEN_AI_MODEL_NAMES,
+            display_model_names=["o1", "o3-mini", "gpt-4o", "gpt-4o-mini"],
+        )
+        try:
+            full_provider = upsert_llm_provider(open_provider, db_session)
+            update_default_provider(full_provider.id, db_session)
+        except Exception as e:
+            logger.error(f"Failed to configure OpenAI provider: {e}")
+    else:
+        logger.error(
+            "OPENAI_DEFAULT_API_KEY not set, skipping OpenAI provider configuration"
+        )
+
    if COHERE_DEFAULT_API_KEY:
        cloud_embedding_provider = CloudEmbeddingProviderCreationRequest(
            provider_type=EmbeddingProvider.COHERE,
--- a/backend/ee/onyx/server/tenants/user_mapping.py
+++ b/backend/ee/onyx/server/tenants/user_mapping.py
@@ -28,7 +28,7 @@ def get_tenant_id_for_email(email: str) -> str:


 def user_owns_a_tenant(email: str) -> bool:
-    with get_session_with_tenant(tenant_id=None) as db_session:
+    with get_session_with_tenant(tenant_id=POSTGRES_DEFAULT_SCHEMA) as db_session:
        result = (
            db_session.query(UserTenantMapping)
            .filter(UserTenantMapping.email == email)
@@ -38,7 +38,7 @@ def user_owns_a_tenant(email: str) -> bool:


 def add_users_to_tenant(emails: list[str], tenant_id: str) -> None:
-    with get_session_with_tenant(tenant_id=None) as db_session:
+    with get_session_with_tenant(tenant_id=POSTGRES_DEFAULT_SCHEMA) as db_session:
        try:
            for email in emails:
                db_session.add(UserTenantMapping(email=email, tenant_id=tenant_id))
@@ -48,7 +48,7 @@ def add_users_to_tenant(emails: list[str], tenant_id: str) -> None:


 def remove_users_from_tenant(emails: list[str], tenant_id: str) -> None:
-    with get_session_with_tenant(tenant_id=None) as db_session:
+    with get_session_with_tenant(tenant_id=POSTGRES_DEFAULT_SCHEMA) as db_session:
        try:
            mappings_to_delete = (
                db_session.query(UserTenantMapping)
@@ -71,7 +71,7 @@ def remove_users_from_tenant(emails: list[str], tenant_id: str) -> None:


 def remove_all_users_from_tenant(tenant_id: str) -> None:
-    with get_session_with_tenant(tenant_id=None) as db_session:
+    with get_session_with_tenant(tenant_id=POSTGRES_DEFAULT_SCHEMA) as db_session:
        db_session.query(UserTenantMapping).filter(
            UserTenantMapping.tenant_id == tenant_id
        ).delete()
--- a/backend/model_server/constants.py
+++ b/backend/model_server/constants.py
@@ -6,7 +6,7 @@ MODEL_WARM_UP_STRING = "hi " * 512
 DEFAULT_OPENAI_MODEL = "text-embedding-3-small"
 DEFAULT_COHERE_MODEL = "embed-english-light-v3.0"
 DEFAULT_VOYAGE_MODEL = "voyage-large-2-instruct"
-DEFAULT_VERTEX_MODEL = "text-embedding-004"
+DEFAULT_VERTEX_MODEL = "text-embedding-005"


 class EmbeddingModelTextType:
--- a/backend/model_server/encoders.py
+++ b/backend/model_server/encoders.py
@@ -5,6 +5,7 @@ from types import TracebackType
 from typing import cast
 from typing import Optional

+import aioboto3  # type: ignore
 import httpx
 import openai
 import vertexai  # type: ignore
@@ -28,11 +29,13 @@ from model_server.constants import DEFAULT_VERTEX_MODEL
 from model_server.constants import DEFAULT_VOYAGE_MODEL
 from model_server.constants import EmbeddingModelTextType
 from model_server.constants import EmbeddingProvider
+from model_server.utils import pass_aws_key
 from model_server.utils import simple_log_function_time
 from onyx.utils.logger import setup_logger
 from shared_configs.configs import API_BASED_EMBEDDING_TIMEOUT
 from shared_configs.configs import INDEXING_ONLY
 from shared_configs.configs import OPENAI_EMBEDDING_TIMEOUT
+from shared_configs.configs import VERTEXAI_EMBEDDING_LOCAL_BATCH_SIZE
 from shared_configs.enums import EmbedTextType
 from shared_configs.enums import RerankerProvider
 from shared_configs.model_server_models import Embedding
@@ -78,7 +81,7 @@ class CloudEmbedding:
        self._closed = False

    async def _embed_openai(
-        self, texts: list[str], model: str | None
+        self, texts: list[str], model: str | None, reduced_dimension: int | None
    ) -> list[Embedding]:
        if not model:
            model = DEFAULT_OPENAI_MODEL
@@ -91,19 +94,28 @@ class CloudEmbedding:
        final_embeddings: list[Embedding] = []
        try:
            for text_batch in batch_list(texts, _OPENAI_MAX_INPUT_LEN):
-                response = await client.embeddings.create(input=text_batch, model=model)
+                response = await client.embeddings.create(
+                    input=text_batch,
+                    model=model,
+                    dimensions=reduced_dimension or openai.NOT_GIVEN,
+                )
                final_embeddings.extend(
                    [embedding.embedding for embedding in response.data]
                )
            return final_embeddings
        except Exception as e:
            error_string = (
-                f"Error embedding text with OpenAI: {str(e)} \n"
-                f"Model: {model} \n"
-                f"Provider: {self.provider} \n"
-                f"Texts: {texts}"
+                f"Exception embedding text with OpenAI - {type(e)}: "
+                f"Model: {model} "
+                f"Provider: {self.provider} "
+                f"Exception: {e}"
            )
            logger.error(error_string)
+
+            # only log text when it's not an authentication error.
+            if not isinstance(e, openai.AuthenticationError):
+                logger.debug(f"Exception texts: {texts}")
+
            raise RuntimeError(error_string)

    async def _embed_cohere(
@@ -173,17 +185,24 @@ class CloudEmbedding:
        vertexai.init(project=project_id, credentials=credentials)
        client = TextEmbeddingModel.from_pretrained(model)

-        embeddings = await client.get_embeddings_async(
-            [
-                TextEmbeddingInput(
-                    text,
-                    embedding_type,
-                )
-                for text in texts
-            ],
-            auto_truncate=True,  # This is the default
-        )
-        return [embedding.values for embedding in embeddings]
+        inputs = [TextEmbeddingInput(text, embedding_type) for text in texts]
+
+        # Split into batches of at most VERTEXAI_EMBEDDING_LOCAL_BATCH_SIZE texts
+        max_texts_per_batch = VERTEXAI_EMBEDDING_LOCAL_BATCH_SIZE
+        batches = [
+            inputs[i : i + max_texts_per_batch]
+            for i in range(0, len(inputs), max_texts_per_batch)
+        ]
+
+        # Dispatch all embedding calls asynchronously at once
+        tasks = [
+            client.get_embeddings_async(batch, auto_truncate=True) for batch in batches
+        ]
+
+        # Wait for all tasks to complete in parallel
+        results = await asyncio.gather(*tasks)
+
+        return [embedding.values for batch in results for embedding in batch]

    async def _embed_litellm_proxy(
        self, texts: list[str], model_name: str | None
@@ -218,9 +237,10 @@ class CloudEmbedding:
        text_type: EmbedTextType,
        model_name: str | None = None,
        deployment_name: str | None = None,
+        reduced_dimension: int | None = None,
    ) -> list[Embedding]:
        if self.provider == EmbeddingProvider.OPENAI:
-            return await self._embed_openai(texts, model_name)
+            return await self._embed_openai(texts, model_name, reduced_dimension)
        elif self.provider == EmbeddingProvider.AZURE:
            return await self._embed_azure(texts, f"azure/{deployment_name}")
        elif self.provider == EmbeddingProvider.LITELLM:
@@ -321,6 +341,7 @@ async def embed_text(
    prefix: str | None,
    api_url: str | None,
    api_version: str | None,
+    reduced_dimension: int | None,
    gpu_type: str = "UNKNOWN",
 ) -> list[Embedding]:
    if not all(texts):
@@ -364,6 +385,7 @@ async def embed_text(
                model_name=model_name,
                deployment_name=deployment_name,
                text_type=text_type,
+                reduced_dimension=reduced_dimension,
            )

        if any(embedding is None for embedding in embeddings):
@@ -435,7 +457,7 @@ async def local_rerank(query: str, docs: list[str], model_name: str) -> list[flo
    )


-async def cohere_rerank(
+async def cohere_rerank_api(
    query: str, docs: list[str], model_name: str, api_key: str
 ) -> list[float]:
    cohere_client = CohereAsyncClient(api_key=api_key)
@@ -445,6 +467,45 @@ async def cohere_rerank(
    return [result.relevance_score for result in sorted_results]


+async def cohere_rerank_aws(
+    query: str,
+    docs: list[str],
+    model_name: str,
+    region_name: str,
+    aws_access_key_id: str,
+    aws_secret_access_key: str,
+) -> list[float]:
+    session = aioboto3.Session(
+        aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key
+    )
+    async with session.client(
+        "bedrock-runtime", region_name=region_name
+    ) as bedrock_client:
+        body = json.dumps(
+            {
+                "query": query,
+                "documents": docs,
+                "api_version": 2,
+            }
+        )
+        # Invoke the Bedrock model asynchronously
+        response = await bedrock_client.invoke_model(
+            modelId=model_name,
+            accept="application/json",
+            contentType="application/json",
+            body=body,
+        )
+
+        # Read the response asynchronously
+        response_body = json.loads(await response["body"].read())
+
+        # Extract and sort the results
+        results = response_body.get("results", [])
+        sorted_results = sorted(results, key=lambda item: item["index"])
+
+        return [result["relevance_score"] for result in sorted_results]
+
+
 async def litellm_rerank(
    query: str, docs: list[str], api_url: str, model_name: str, api_key: str | None
 ) -> list[float]:
@@ -503,6 +564,7 @@ async def process_embed_request(
            text_type=embed_request.text_type,
            api_url=embed_request.api_url,
            api_version=embed_request.api_version,
+            reduced_dimension=embed_request.reduced_dimension,
            prefix=prefix,
            gpu_type=gpu_type,
        )
@@ -559,15 +621,32 @@ async def process_rerank_request(rerank_request: RerankRequest) -> RerankRespons
        elif rerank_request.provider_type == RerankerProvider.COHERE:
            if rerank_request.api_key is None:
                raise RuntimeError("Cohere Rerank Requires an API Key")
-            sim_scores = await cohere_rerank(
+            sim_scores = await cohere_rerank_api(
                query=rerank_request.query,
                docs=rerank_request.documents,
                model_name=rerank_request.model_name,
                api_key=rerank_request.api_key,
            )
            return RerankResponse(scores=sim_scores)
+
+        elif rerank_request.provider_type == RerankerProvider.BEDROCK:
+            if rerank_request.api_key is None:
+                raise RuntimeError("Bedrock Rerank Requires an API Key")
+            aws_access_key_id, aws_secret_access_key, aws_region = pass_aws_key(
+                rerank_request.api_key
+            )
+            sim_scores = await cohere_rerank_aws(
+                query=rerank_request.query,
+                docs=rerank_request.documents,
+                model_name=rerank_request.model_name,
+                region_name=aws_region,
+                aws_access_key_id=aws_access_key_id,
+                aws_secret_access_key=aws_secret_access_key,
+            )
+            return RerankResponse(scores=sim_scores)
        else:
            raise ValueError(f"Unsupported provider: {rerank_request.provider_type}")
+
    except Exception as e:
        logger.exception(f"Error during reranking process:\n{str(e)}")
        raise HTTPException(
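
Review note on the Vertex AI embedding change in this file: the new code splits the inputs into fixed-size batches, dispatches one async call per batch, and flattens the gathered results. The following is a minimal, self-contained sketch of that pattern, not the project's implementation. `fake_embed` and `BATCH_SIZE` are hypothetical stand-ins for `client.get_embeddings_async` and `VERTEXAI_EMBEDDING_LOCAL_BATCH_SIZE`.

```python
import asyncio

BATCH_SIZE = 25  # hypothetical stand-in for VERTEXAI_EMBEDDING_LOCAL_BATCH_SIZE


async def fake_embed(batch: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for client.get_embeddings_async."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [[float(len(text))] for text in batch]


async def embed_in_batches(
    texts: list[str], batch_size: int = BATCH_SIZE
) -> list[list[float]]:
    # Slice the inputs into consecutive batches of at most batch_size items
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    # asyncio.gather returns results in the order the awaitables were passed,
    # so flattening the batches preserves the original input order
    results = await asyncio.gather(*(fake_embed(batch) for batch in batches))
    return [vector for batch in results for vector in batch]


texts = [f"doc {i}" * (i % 3 + 1) for i in range(60)]
vectors = asyncio.run(embed_in_batches(texts))
```

Because `asyncio.gather` keeps result order aligned with argument order, the flattened list lines up one-to-one with `texts` even if individual batches finish out of order.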
--- a/backend/model_server/utils.py
+++ b/backend/model_server/utils.py
@@ -70,3 +70,32 @@ def get_gpu_type() -> str:
        return GPUStatus.MAC_MPS

    return GPUStatus.NONE
+
+
+def pass_aws_key(api_key: str) -> tuple[str, str, str]:
+    """Parse AWS API key string into components.
+
+    Args:
+        api_key: String in format 'aws_ACCESSKEY_SECRETKEY_REGION'
+
+    Returns:
+        Tuple of (access_key, secret_key, region)
+
+    Raises:
+        ValueError: If key format is invalid
+    """
+    if not api_key.startswith("aws"):
+        raise ValueError("API key must start with 'aws' prefix")
+
+    parts = api_key.split("_")
+    if len(parts) != 4:
+        raise ValueError(
+            f"API key must be in format 'aws_ACCESSKEY_SECRETKEY_REGION', got {len(parts) - 1} parts. "
+            "This is an Onyx-specific format for passing the AWS secrets for Bedrock."
+        )
+
+    try:
+        _, aws_access_key_id, aws_secret_access_key, aws_region = parts
+        return aws_access_key_id, aws_secret_access_key, aws_region
+    except Exception as e:
+        raise ValueError(f"Failed to parse AWS key components: {str(e)}")
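
Review note on `pass_aws_key`: the expected key layout is `aws_ACCESSKEY_SECRETKEY_REGION`, split on underscores. A simplified sketch of the same parsing is shown below for illustration. The name `parse_aws_key` and the sample credentials are hypothetical, and the sketch checks for an exact `aws` first field, which is slightly stricter than the `startswith` check in the diff.

```python
def parse_aws_key(api_key: str) -> tuple[str, str, str]:
    """Split 'aws_ACCESSKEY_SECRETKEY_REGION' into its three components."""
    prefix, *rest = api_key.split("_")
    if prefix != "aws":
        raise ValueError("API key must start with the 'aws' prefix")
    if len(rest) != 3:
        raise ValueError(
            "API key must be in format 'aws_ACCESSKEY_SECRETKEY_REGION', "
            f"got {len(rest)} parts after the prefix"
        )
    access_key, secret_key, region = rest
    return access_key, secret_key, region


# Made-up values, not real credentials
parsed = parse_aws_key("aws_AKIAEXAMPLE_secretExample_us-east-1")
```

The split relies on none of the three components containing an underscore. The hyphens in a region name such as `us-east-1` are unaffected.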
--- a/backend/onyx/agents/agent_search/deep_search/initial/generate_initial_answer/nodes/generate_initial_answer.py
+++ b/backend/onyx/agents/agent_search/deep_search/initial/generate_initial_answer/nodes/generate_initial_answer.py
@@ -153,8 +153,9 @@ def generate_initial_answer(
    )
    for tool_response in yield_search_responses(
        query=question,
-        reranked_sections=answer_generation_documents.streaming_documents,
-        final_context_sections=answer_generation_documents.context_documents,
+        get_retrieved_sections=lambda: answer_generation_documents.context_documents,
+        get_reranked_sections=lambda: answer_generation_documents.streaming_documents,
+        get_final_context_sections=lambda: answer_generation_documents.context_documents,
        search_query_info=query_info,
        get_section_relevance=lambda: relevance_list,
        search_tool=graph_config.tooling.search_tool,
--- a/backend/onyx/agents/agent_search/deep_search/main/nodes/generate_validate_refined_answer.py
+++ b/backend/onyx/agents/agent_search/deep_search/main/nodes/generate_validate_refined_answer.py
@@ -179,8 +179,9 @@ def generate_validate_refined_answer(
    )
    for tool_response in yield_search_responses(
        query=question,
-        reranked_sections=answer_generation_documents.streaming_documents,
-        final_context_sections=answer_generation_documents.context_documents,
+        get_retrieved_sections=lambda: answer_generation_documents.context_documents,
+        get_reranked_sections=lambda: answer_generation_documents.streaming_documents,
+        get_final_context_sections=lambda: answer_generation_documents.context_documents,
        search_query_info=query_info,
        get_section_relevance=lambda: relevance_list,
        search_tool=graph_config.tooling.search_tool,
--- a/backend/onyx/agents/agent_search/deep_search/main/operations.py
+++ b/backend/onyx/agents/agent_search/deep_search/main/operations.py
@@ -13,7 +13,6 @@ from onyx.chat.models import StreamStopInfo
 from onyx.chat.models import StreamStopReason
 from onyx.chat.models import StreamType
 from onyx.chat.models import SubQuestionPiece
-from onyx.context.search.models import IndexFilters
 from onyx.tools.models import SearchQueryInfo
 from onyx.utils.logger import setup_logger

@@ -144,8 +143,6 @@ def get_query_info(results: list[QueryRetrievalResult]) -> SearchQueryInfo:
        if result.query_info is not None:
            query_info = result.query_info
            break
-    return query_info or SearchQueryInfo(
-        predicted_search=None,
-        final_filters=IndexFilters(access_control_list=None),
-        recency_bias_multiplier=1.0,
-    )
+
+    assert query_info is not None, "must have query info"
+    return query_info
--- a/backend/onyx/agents/agent_search/deep_search/shared/expanded_retrieval/nodes/format_results.py
+++ b/backend/onyx/agents/agent_search/deep_search/shared/expanded_retrieval/nodes/format_results.py
@@ -56,8 +56,9 @@ def format_results(
        relevance_list = relevance_from_docs(reranked_documents)
        for tool_response in yield_search_responses(
            query=state.question,
-            reranked_sections=state.retrieved_documents,
-            final_context_sections=reranked_documents,
+            get_retrieved_sections=lambda: reranked_documents,
+            get_reranked_sections=lambda: state.retrieved_documents,
+            get_final_context_sections=lambda: reranked_documents,
            search_query_info=query_info,
            get_section_relevance=lambda: relevance_list,
            search_tool=graph_config.tooling.search_tool,
--- a/backend/onyx/agents/agent_search/deep_search/shared/expanded_retrieval/nodes/retrieve_documents.py
+++ b/backend/onyx/agents/agent_search/deep_search/shared/expanded_retrieval/nodes/retrieve_documents.py
@@ -91,7 +91,7 @@ def retrieve_documents(
    retrieved_docs = retrieved_docs[:AGENT_MAX_QUERY_RETRIEVAL_RESULTS]

    if AGENT_RETRIEVAL_STATS:
-        pre_rerank_docs = callback_container[0]
+        pre_rerank_docs = callback_container[0] if callback_container else []
        fit_scores = get_fit_scores(
            pre_rerank_docs,
            retrieved_docs,
--- a/backend/onyx/agents/agent_search/orchestration/nodes/call_tool.py
+++ b/backend/onyx/agents/agent_search/orchestration/nodes/call_tool.py
@@ -44,7 +44,9 @@ def call_tool(
    tool = tool_choice.tool
    tool_args = tool_choice.tool_args
    tool_id = tool_choice.id
-    tool_runner = ToolRunner(tool, tool_args)
+    tool_runner = ToolRunner(
+        tool, tool_args, override_kwargs=tool_choice.search_tool_override_kwargs
+    )
    tool_kickoff = tool_runner.kickoff()

    emit_packet(tool_kickoff, writer)
--- a/backend/onyx/agents/agent_search/orchestration/nodes/choose_tool.py
+++ b/backend/onyx/agents/agent_search/orchestration/nodes/choose_tool.py
@@ -15,8 +15,17 @@ from onyx.chat.tool_handling.tool_response_handler import get_tool_by_name
 from onyx.chat.tool_handling.tool_response_handler import (
    get_tool_call_for_non_tool_calling_llm_impl,
 )
+from onyx.context.search.preprocessing.preprocessing import query_analysis
+from onyx.context.search.retrieval.search_runner import get_query_embedding
+from onyx.tools.models import SearchToolOverrideKwargs
 from onyx.tools.tool import Tool
+from onyx.tools.tool_implementations.search.search_tool import SearchTool
 from onyx.utils.logger import setup_logger
+from onyx.utils.threadpool_concurrency import run_in_background
+from onyx.utils.threadpool_concurrency import TimeoutThread
+from onyx.utils.threadpool_concurrency import wait_on_background
+from onyx.utils.timing import log_function_time
+from shared_configs.model_server_models import Embedding

 logger = setup_logger()

@@ -25,6 +34,7 @@ logger = setup_logger()
 # and a function that handles extracting the necessary fields
 # from the state and config
 # TODO: fan-out to multiple tool call nodes? Make this configurable?
+@log_function_time(print_only=True)
 def choose_tool(
    state: ToolChoiceState,
    config: RunnableConfig,
@@ -37,6 +47,31 @@ def choose_tool(
    should_stream_answer = state.should_stream_answer

    agent_config = cast(GraphConfig, config["metadata"]["config"])
+
+    force_use_tool = agent_config.tooling.force_use_tool
+
+    embedding_thread: TimeoutThread[Embedding] | None = None
+    keyword_thread: TimeoutThread[tuple[bool, list[str]]] | None = None
+    override_kwargs: SearchToolOverrideKwargs | None = None
+    if (
+        not agent_config.behavior.use_agentic_search
+        and agent_config.tooling.search_tool is not None
+        and (
+            not force_use_tool.force_use or force_use_tool.tool_name == SearchTool._NAME
+        )
+    ):
+        override_kwargs = SearchToolOverrideKwargs()
+        # Run in a background thread to avoid blocking the main thread
+        embedding_thread = run_in_background(
+            get_query_embedding,
+            agent_config.inputs.search_request.query,
+            agent_config.persistence.db_session,
+        )
+        keyword_thread = run_in_background(
+            query_analysis,
+            agent_config.inputs.search_request.query,
+        )
+
    using_tool_calling_llm = agent_config.tooling.using_tool_calling_llm
    prompt_builder = state.prompt_snapshot or agent_config.inputs.prompt_builder

@@ -47,7 +82,6 @@ def choose_tool(
    tools = [
        tool for tool in (agent_config.tooling.tools or []) if tool.name in state.tools
    ]
-    force_use_tool = agent_config.tooling.force_use_tool

    tool, tool_args = None, None
    if force_use_tool.force_use and force_use_tool.args is not None:
@@ -71,11 +105,22 @@ def choose_tool(
    # If we have a tool and tool args, we are ready to request a tool call.
    # This only happens if the tool call was forced or we are using a non-tool calling LLM.
    if tool and tool_args:
+        if embedding_thread and tool.name == SearchTool._NAME:
+            # Wait for the embedding thread to finish
+            embedding = wait_on_background(embedding_thread)
+            assert override_kwargs is not None, "must have override kwargs"
+            override_kwargs.precomputed_query_embedding = embedding
+        if keyword_thread and tool.name == SearchTool._NAME:
+            is_keyword, keywords = wait_on_background(keyword_thread)
+            assert override_kwargs is not None, "must have override kwargs"
+            override_kwargs.precomputed_is_keyword = is_keyword
+            override_kwargs.precomputed_keywords = keywords
        return ToolChoiceUpdate(
            tool_choice=ToolChoice(
                tool=tool,
                tool_args=tool_args,
                id=str(uuid4()),
+                search_tool_override_kwargs=override_kwargs,
            ),
        )

@@ -98,8 +143,16 @@ def choose_tool(
        # For tool calling LLMs, we want to insert the task prompt as part of this flow, this is because the LLM
        # may choose to not call any tools and just generate the answer, in which case the task prompt is needed.
        prompt=built_prompt,
-        tools=[tool.tool_definition() for tool in tools] or None,
-        tool_choice=("required" if tools and force_use_tool.force_use else None),
+        tools=(
+            [tool.tool_definition() for tool in tools] or None
+            if using_tool_calling_llm
+            else None
+        ),
+        tool_choice=(
+            "required"
+            if tools and force_use_tool.force_use and using_tool_calling_llm
+            else None
+        ),
        structured_response_format=structured_response_format,
    )

@@ -145,10 +198,22 @@ def choose_tool(
    logger.debug(f"Selected tool: {selected_tool.name}")
    logger.debug(f"Selected tool call request: {selected_tool_call_request}")

+    if embedding_thread and selected_tool.name == SearchTool._NAME:
+        # Wait for the embedding thread to finish
+        embedding = wait_on_background(embedding_thread)
+        assert override_kwargs is not None, "must have override kwargs"
+        override_kwargs.precomputed_query_embedding = embedding
+    if keyword_thread and selected_tool.name == SearchTool._NAME:
+        is_keyword, keywords = wait_on_background(keyword_thread)
+        assert override_kwargs is not None, "must have override kwargs"
+        override_kwargs.precomputed_is_keyword = is_keyword
+        override_kwargs.precomputed_keywords = keywords
+
    return ToolChoiceUpdate(
        tool_choice=ToolChoice(
            tool=selected_tool,
            tool_args=selected_tool_call_request["args"],
            id=selected_tool_call_request["id"],
+            search_tool_override_kwargs=override_kwargs,
        ),
    )
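
Review note on the `choose_tool` change: the query embedding and keyword analysis are started in the background before the tool-choice step, and their results are collected only if the search tool is selected. The sketch below illustrates that overlap with the standard library's `ThreadPoolExecutor` rather than the project's `run_in_background` / `wait_on_background` helpers. All function names and the `"run_search"` value are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_embedding(query: str) -> list[float]:
    """Hypothetical stand-in for get_query_embedding."""
    return [float(len(query))]


def analyze_query(query: str) -> tuple[bool, list[str]]:
    """Hypothetical stand-in for query_analysis."""
    words = query.split()
    return len(words) <= 2, words


def choose_tool_name(query: str) -> str:
    """Hypothetical stand-in for the LLM tool-choice call."""
    return "run_search"


def plan(query: str) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Start the speculative work first so it overlaps with tool selection
        embedding_future = pool.submit(compute_embedding, query)
        keyword_future = pool.submit(analyze_query, query)

        tool_name = choose_tool_name(query)

        overrides: dict = {}
        if tool_name == "run_search":
            # Block on the background results only when they are needed
            overrides["embedding"] = embedding_future.result()
            is_keyword, keywords = keyword_future.result()
            overrides["is_keyword"] = is_keyword
            overrides["keywords"] = keywords
    return {"tool": tool_name, "overrides": overrides}


result = plan("onyx search")
```

The trade-off is that the speculative work is wasted whenever a different tool is chosen, which is presumably why the diff only starts it when search is available and not ruled out by a forced tool.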
--- a/backend/onyx/agents/agent_search/orchestration/nodes/use_tool_response.py
+++ b/backend/onyx/agents/agent_search/orchestration/nodes/use_tool_response.py
@@ -9,18 +9,23 @@ from onyx.agents.agent_search.basic.states import BasicState
 from onyx.agents.agent_search.basic.utils import process_llm_stream
 from onyx.agents.agent_search.models import GraphConfig
 from onyx.chat.models import LlmDoc
-from onyx.chat.models import OnyxContexts
 from onyx.tools.tool_implementations.search.search_tool import (
-    SEARCH_DOC_CONTENT_ID,
+    SEARCH_RESPONSE_SUMMARY_ID,
+)
+from onyx.tools.tool_implementations.search.search_tool import SearchResponseSummary
+from onyx.tools.tool_implementations.search.search_utils import (
+    context_from_inference_section,
 )
 from onyx.tools.tool_implementations.search_like_tool_utils import (
    FINAL_CONTEXT_DOCUMENTS_ID,
 )
 from onyx.utils.logger import setup_logger
+from onyx.utils.timing import log_function_time

 logger = setup_logger()


+@log_function_time(print_only=True)
 def basic_use_tool_response(
    state: BasicState, config: RunnableConfig, writer: StreamWriter = lambda _: None
 ) -> BasicOutput:
@@ -50,11 +55,13 @@ def basic_use_tool_response(
    for yield_item in tool_call_responses:
        if yield_item.id == FINAL_CONTEXT_DOCUMENTS_ID:
            final_search_results = cast(list[LlmDoc], yield_item.response)
-        elif yield_item.id == SEARCH_DOC_CONTENT_ID:
-            search_contexts = cast(OnyxContexts, yield_item.response).contexts
-            for doc in search_contexts:
-                if doc.document_id not in initial_search_results:
-                    initial_search_results.append(doc)
+        elif yield_item.id == SEARCH_RESPONSE_SUMMARY_ID:
+            search_response_summary = cast(SearchResponseSummary, yield_item.response)
+            for section in search_response_summary.top_sections:
+                if section.center_chunk.document_id not in initial_search_results:
+                    initial_search_results.append(
+                        context_from_inference_section(section)
+                    )

    new_tool_call_chunk = AIMessageChunk(content="")
    if not agent_config.behavior.skip_gen_ai_answer_generation:
--- a/backend/onyx/agents/agent_search/orchestration/states.py
+++ b/backend/onyx/agents/agent_search/orchestration/states.py
@@ -2,6 +2,7 @@ from pydantic import BaseModel

 from onyx.chat.prompt_builder.answer_prompt_builder import PromptSnapshot
 from onyx.tools.message import ToolCallSummary
+from onyx.tools.models import SearchToolOverrideKwargs
 from onyx.tools.models import ToolCallFinalResult
 from onyx.tools.models import ToolCallKickoff
 from onyx.tools.models import ToolResponse
@@ -35,6 +36,7 @@ class ToolChoice(BaseModel):
    tool: Tool
    tool_args: dict
    id: str | None
+    search_tool_override_kwargs: SearchToolOverrideKwargs | None = None

    class Config:
        arbitrary_types_allowed = True
--- a/backend/onyx/agents/agent_search/shared_graph_utils/constants.py
+++ b/backend/onyx/agents/agent_search/shared_graph_utils/constants.py
@@ -13,6 +13,11 @@ AGENT_NEGATIVE_VALUE_STR = "no"
 AGENT_ANSWER_SEPARATOR = "Answer:"


+EMBEDDING_KEY = "embedding"
+IS_KEYWORD_KEY = "is_keyword"
+KEYWORDS_KEY = "keywords"
+
+
 class AgentLLMErrorType(str, Enum):
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
--- a/backend/onyx/auth/api_key.py
+++ b/backend/onyx/auth/api_key.py
@@ -10,6 +10,7 @@ from pydantic import BaseModel

 from onyx.auth.schemas import UserRole
 from onyx.configs.app_configs import API_KEY_HASH_ROUNDS
+from shared_configs.configs import MULTI_TENANT


 _API_KEY_HEADER_NAME = "Authorization"
@@ -35,8 +36,7 @@ class ApiKeyDescriptor(BaseModel):


 def generate_api_key(tenant_id: str | None = None) -> str:
-    # For backwards compatibility, if no tenant_id, generate old style key
-    if not tenant_id:
+    if not MULTI_TENANT or not tenant_id:
        return _API_KEY_PREFIX + secrets.token_urlsafe(_API_KEY_LEN)

    encoded_tenant = quote(tenant_id)  # URL encode the tenant ID
--- a/backend/onyx/auth/email_utils.py
+++ b/backend/onyx/auth/email_utils.py
@@ -2,6 +2,8 @@ import smtplib
 from datetime import datetime
 from email.mime.multipart import MIMEMultipart
 from email.mime.text import MIMEText
+from email.utils import formatdate
+from email.utils import make_msgid

 from onyx.configs.app_configs import EMAIL_CONFIGURED
 from onyx.configs.app_configs import EMAIL_FROM
@@ -10,8 +12,10 @@ from onyx.configs.app_configs import SMTP_PORT
 from onyx.configs.app_configs import SMTP_SERVER
 from onyx.configs.app_configs import SMTP_USER
 from onyx.configs.app_configs import WEB_DOMAIN
+from onyx.configs.constants import AuthType
 from onyx.configs.constants import TENANT_ID_COOKIE_NAME
 from onyx.db.models import User
+from shared_configs.configs import MULTI_TENANT

 HTML_EMAIL_TEMPLATE = """\
 <!DOCTYPE html>
@@ -149,8 +153,9 @@ def send_email(
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["To"] = user_email
-    if mail_from:
-        msg["From"] = mail_from
+    msg["From"] = mail_from
+    msg["Date"] = formatdate(localtime=True)
+    msg["Message-ID"] = make_msgid(domain="onyx.app")

    part_text = MIMEText(text_body, "plain")
    part_html = MIMEText(html_body, "html")
@@ -172,7 +177,7 @@ def send_subscription_cancellation_email(user_email: str) -> None:
    subject = "Your Onyx Subscription Has Been Canceled"
    heading = "Subscription Canceled"
    message = (
-        "<p>We’re sorry to see you go.</p>"
+        "<p>We're sorry to see you go.</p>"
        "<p>Your subscription has been canceled and will end on your next billing date.</p>"
        "<p>If you change your mind, you can always come back!</p>"
    )
@@ -187,36 +192,64 @@ def send_subscription_cancellation_email(user_email: str) -> None:
    send_email(user_email, subject, html_content, text_content)


-def send_user_email_invite(user_email: str, current_user: User) -> None:
+def send_user_email_invite(
+    user_email: str, current_user: User, auth_type: AuthType
+) -> None:
    subject = "Invitation to Join Onyx Organization"
    heading = "You've Been Invited!"
-    message = (
-        f"<p>You have been invited by {current_user.email} to join an organization on Onyx.</p>"
-        "<p>To join the organization, please click the button below to set a password "
-        "or login with Google and complete your registration.</p>"
-    )
+
+    # the exact action taken by the user, and thus the message, depends on the auth type
+    message = f"<p>You have been invited by {current_user.email} to join an organization on Onyx.</p>"
+    if auth_type == AuthType.CLOUD:
+        message += (
+            "<p>To join the organization, please click the button below to set a password "
+            "or login with Google and complete your registration.</p>"
+        )
+    elif auth_type == AuthType.BASIC:
+        message += (
+            "<p>To join the organization, please click the button below to set a password "
+            "and complete your registration.</p>"
+        )
+    elif auth_type == AuthType.GOOGLE_OAUTH:
+        message += (
+            "<p>To join the organization, please click the button below to login with Google "
+            "and complete your registration.</p>"
+        )
+    elif auth_type == AuthType.OIDC or auth_type == AuthType.SAML:
+        message += (
+            "<p>To join the organization, please click the button below to"
+            " complete your registration.</p>"
+        )
+    else:
+        raise ValueError(f"Invalid auth type: {auth_type}")
+
    cta_text = "Join Organization"
    cta_link = f"{WEB_DOMAIN}/auth/signup?email={user_email}"
    html_content = build_html_email(heading, message, cta_text, cta_link)
+
+    # text content is the fallback for clients that don't support HTML
+    # not as critical, so not having special cases for each auth type
    text_content = (
        f"You have been invited by {current_user.email} to join an organization on Onyx.\n"
        "To join the organization, please visit the following link:\n"
        f"{WEB_DOMAIN}/auth/signup?email={user_email}\n"
-        "You'll be asked to set a password or login with Google to complete your registration."
    )
+    if auth_type == AuthType.CLOUD:
+        text_content += "You'll be asked to set a password or login with Google to complete your registration."
+
    send_email(user_email, subject, html_content, text_content)


 def send_forgot_password_email(
    user_email: str,
    token: str,
+    tenant_id: str,
    mail_from: str = EMAIL_FROM,
-    tenant_id: str | None = None,
 ) -> None:
    # Builds a forgot password email with or without fancy HTML
    subject = "Onyx Forgot Password"
    link = f"{WEB_DOMAIN}/auth/reset-password?token={token}"
-    if tenant_id:
+    if MULTI_TENANT:
        link += f"&{TENANT_ID_COOKIE_NAME}={tenant_id}"
    message = f"<p>Click the following link to reset your password:</p><p>{link}</p>"
    html_content = build_html_email("Reset Your Password", message)
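
Review note on the `send_email` change: the message now always carries `From`, `Date`, and `Message-ID` headers. The sketch below shows how those headers can be built with the standard library. `build_message` and the addresses are hypothetical, and the SMTP sending step is left out.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.utils import formatdate, make_msgid


def build_message(
    subject: str, to_addr: str, from_addr: str, text_body: str, html_body: str
) -> MIMEMultipart:
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["To"] = to_addr
    msg["From"] = from_addr
    # RFC 5322 date string in the local timezone
    msg["Date"] = formatdate(localtime=True)
    # Unique ID of the form <...@onyx.app>
    msg["Message-ID"] = make_msgid(domain="onyx.app")
    # Plain text first, HTML last: clients prefer the last part they can render
    msg.attach(MIMEText(text_body, "plain"))
    msg.attach(MIMEText(html_body, "html"))
    return msg


message = build_message(
    "Reset Your Password",
    "user@example.com",
    "noreply@example.com",
    "Visit the link to reset your password.",
    "<p>Visit the link to reset your password.</p>",
)
```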
--- a/backend/onyx/auth/users.py
+++ b/backend/onyx/auth/users.py
@@ -214,7 +214,7 @@ def verify_email_is_invited(email: str) -> None:
    raise PermissionError("User not on allowed user whitelist")


-def verify_email_in_whitelist(email: str, tenant_id: str | None = None) -> None:
+def verify_email_in_whitelist(email: str, tenant_id: str) -> None:
    with get_session_with_tenant(tenant_id=tenant_id) as db_session:
        if not get_user_by_email(email, db_session):
            verify_email_is_invited(email)
@@ -411,7 +411,7 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):
                "refresh_token": refresh_token,
            }

-            user: User
+            user: User | None = None

            try:
                # Attempt to get user by OAuth account
@@ -420,15 +420,20 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):
            except exceptions.UserNotExists:
                try:
                    # Attempt to get user by email
-                    user = await self.get_by_email(account_email)
+                    user = await self.user_db.get_by_email(account_email)
                    if not associate_by_email:
                        raise exceptions.UserAlreadyExists()

-                    user = await self.user_db.add_oauth_account(
-                        user, oauth_account_dict
-                    )
+                    # Make sure user is not None before adding OAuth account
+                    if user is not None:
+                        user = await self.user_db.add_oauth_account(
+                            user, oauth_account_dict
+                        )
+                    else:
+                        # This shouldn't happen since get_by_email would raise UserNotExists
+                        # but adding as a safeguard
+                        raise exceptions.UserNotExists()

-                    # If user not found by OAuth account or email, create a new user
                except exceptions.UserNotExists:
                    password = self.password_helper.generate()
                    user_dict = {
@@ -439,26 +444,36 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):

                    user = await self.user_db.create(user_dict)

-                    # Explicitly set the Postgres schema for this session to ensure
-                    # OAuth account creation happens in the correct tenant schema
-
-                    # Add OAuth account
-                    await self.user_db.add_oauth_account(user, oauth_account_dict)
-                    await self.on_after_register(user, request)
+                    # Add OAuth account only if user creation was successful
+                    if user is not None:
+                        await self.user_db.add_oauth_account(user, oauth_account_dict)
+                        await self.on_after_register(user, request)
+                    else:
+                        raise HTTPException(
+                            status_code=500, detail="Failed to create user account"
+                        )

            else:
-                for existing_oauth_account in user.oauth_accounts:
-                    if (
-                        existing_oauth_account.account_id == account_id
-                        and existing_oauth_account.oauth_name == oauth_name
-                    ):
-                        user = await self.user_db.update_oauth_account(
-                            user,
-                            # NOTE: OAuthAccount DOES implement the OAuthAccountProtocol
-                            # but the type checker doesn't know that :(
-                            existing_oauth_account,  # type: ignore
-                            oauth_account_dict,
-                        )
+                # User exists, update OAuth account if needed
+                if user is not None:  # Add explicit check
+                    for existing_oauth_account in user.oauth_accounts:
+                        if (
+                            existing_oauth_account.account_id == account_id
+                            and existing_oauth_account.oauth_name == oauth_name
+                        ):
+                            user = await self.user_db.update_oauth_account(
+                                user,
+                                # NOTE: OAuthAccount DOES implement the OAuthAccountProtocol
+                                # but the type checker doesn't know that :(
+                                existing_oauth_account,  # type: ignore
+                                oauth_account_dict,
+                            )
+
+            # Ensure user is not None before proceeding
+            if user is None:
+                raise HTTPException(
+                    status_code=500, detail="Failed to authenticate or create user"
+                )

            # NOTE: Most IdPs have very short expiry times, and we don't want to force the user to
            # re-authenticate that frequently, so by default this is disabled
@@ -508,6 +523,7 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):
        token = CURRENT_TENANT_ID_CONTEXTVAR.set(tenant_id)
        try:
            user_count = await get_user_count()
+            logger.debug(f"Current tenant user count: {user_count}")

            with get_session_with_tenant(tenant_id=tenant_id) as db_session:
                if user_count == 1:
@@ -529,7 +545,7 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):
        finally:
            CURRENT_TENANT_ID_CONTEXTVAR.reset(token)

-        logger.notice(f"User {user.id} has registered.")
+        logger.debug(f"User {user.id} has registered.")
        optional_telemetry(
            record_type=RecordType.SIGN_UP,
            data={"action": "create"},
@@ -553,7 +569,7 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):
            async_return_default_schema,
        )(email=user.email)

-        send_forgot_password_email(user.email, token, tenant_id=tenant_id)
+        send_forgot_password_email(user.email, tenant_id=tenant_id, token=token)

    async def on_after_request_verify(
        self, user: User, token: str, request: Optional[Request] = None
@@ -571,14 +587,20 @@ class UserManager(UUIDIDMixin, BaseUserManager[User, uuid.UUID]):
    ) -> Optional[User]:
        email = credentials.username

-        # Get tenant_id from mapping table
-        tenant_id = await fetch_ee_implementation_or_noop(
-            "onyx.server.tenants.provisioning",
-            "get_or_provision_tenant",
-            async_return_default_schema,
-        )(
-            email=email,
-        )
+        tenant_id: str | None = None
+        try:
+            tenant_id = fetch_ee_implementation_or_noop(
+                "onyx.server.tenants.provisioning",
+                "get_tenant_id_for_email",
+                None,
+            )(
+                email=email,
+            )
+        except Exception as e:
+            logger.warning(
+                f"User attempted to login with invalid credentials: {str(e)}"
+            )
+
        if not tenant_id:
            # User not found in mapping
            self.password_helper.hash(credentials.password)
--- a/backend/onyx/background/celery/apps/app_base.py
+++ b/backend/onyx/background/celery/apps/app_base.py
@@ -2,6 +2,7 @@ import logging
 import multiprocessing
 import time
 from typing import Any
+from typing import cast

 import sentry_sdk
 from celery import Task
@@ -131,16 +132,16 @@ def on_task_postrun(
    # Get tenant_id directly from kwargs- each celery task has a tenant_id kwarg
    if not kwargs:
        logger.error(f"Task {task.name} (ID: {task_id}) is missing kwargs")
-        tenant_id = None
+        tenant_id = POSTGRES_DEFAULT_SCHEMA
    else:
-        tenant_id = kwargs.get("tenant_id")
+        tenant_id = cast(str, kwargs.get("tenant_id", POSTGRES_DEFAULT_SCHEMA))

    task_logger.debug(
        f"Task {task.name} (ID: {task_id}) completed with state: {state} "
        f"{f'for tenant_id={tenant_id}' if tenant_id else ''}"
    )

-    r = get_redis_client()
+    r = get_redis_client(tenant_id=tenant_id)

    if task_id.startswith(RedisConnectorCredentialPair.PREFIX):
        r.srem(RedisConnectorCredentialPair.get_taskset_key(), task_id)
--- a/backend/onyx/background/celery/celery_redis.py
+++ b/backend/onyx/background/celery/celery_redis.py
@@ -92,7 +92,8 @@ def celery_find_task(task_id: str, queue: str, r: Redis) -> int:


 def celery_get_queued_task_ids(queue: str, r: Redis) -> set[str]:
-    """This is a redis specific way to build a list of tasks in a queue.
+    """This is a redis specific way to build a list of tasks in a queue and return them
+    as a set.

    This helps us read the queue once and then efficiently look for missing tasks
    in the queue.
--- a/backend/onyx/background/celery/celery_utils.py
+++ b/backend/onyx/background/celery/celery_utils.py
@@ -34,7 +34,7 @@ def _get_deletion_status(
    connector_id: int,
    credential_id: int,
    db_session: Session,
-    tenant_id: str | None = None,
+    tenant_id: str,
 ) -> TaskQueueState | None:
    """We no longer store TaskQueueState in the DB for a deletion attempt.
    This function populates TaskQueueState by just checking redis.
@@ -67,7 +67,7 @@ def get_deletion_attempt_snapshot(
    connector_id: int,
    credential_id: int,
    db_session: Session,
-    tenant_id: str | None = None,
+    tenant_id: str,
 ) -> DeletionAttemptSnapshot | None:
    deletion_task = _get_deletion_status(
        connector_id, credential_id, db_session, tenant_id
--- a/backend/onyx/background/celery/tasks/connector_deletion/tasks.py
+++ b/backend/onyx/background/celery/tasks/connector_deletion/tasks.py
@@ -8,16 +8,21 @@ from celery import Celery
 from celery import shared_task
 from celery import Task
 from celery.exceptions import SoftTimeLimitExceeded
+from pydantic import ValidationError
 from redis import Redis
 from redis.lock import Lock as RedisLock
 from sqlalchemy.orm import Session

 from onyx.background.celery.apps.app_base import task_logger
+from onyx.background.celery.celery_redis import celery_get_queue_length
+from onyx.background.celery.celery_redis import celery_get_queued_task_ids
 from onyx.configs.app_configs import JOB_TIMEOUT
 from onyx.configs.constants import CELERY_GENERIC_BEAT_LOCK_TIMEOUT
+from onyx.configs.constants import OnyxCeleryQueues
 from onyx.configs.constants import OnyxCeleryTask
 from onyx.configs.constants import OnyxRedisConstants
 from onyx.configs.constants import OnyxRedisLocks
+from onyx.configs.constants import OnyxRedisSignals
 from onyx.db.connector import fetch_connector_by_id
 from onyx.db.connector_credential_pair import add_deletion_failure_message
 from onyx.db.connector_credential_pair import (
@@ -52,6 +57,51 @@ class TaskDependencyError(RuntimeError):
    with connector deletion."""


+def revoke_tasks_blocking_deletion(
+    redis_connector: RedisConnector, db_session: Session, app: Celery
+) -> None:
+    search_settings_list = get_all_search_settings(db_session)
+    for search_settings in search_settings_list:
+        redis_connector_index = redis_connector.new_index(search_settings.id)
+        try:
+            index_payload = redis_connector_index.payload
+            if index_payload and index_payload.celery_task_id:
+                app.control.revoke(index_payload.celery_task_id)
+                task_logger.info(
+                    f"Revoked indexing task {index_payload.celery_task_id}."
+                )
+        except Exception:
+            task_logger.exception("Exception while revoking indexing task")
+
+    try:
+        permissions_sync_payload = redis_connector.permissions.payload
+        if permissions_sync_payload and permissions_sync_payload.celery_task_id:
+            app.control.revoke(permissions_sync_payload.celery_task_id)
+            task_logger.info(
+                f"Revoked permissions sync task {permissions_sync_payload.celery_task_id}."
+            )
+    except Exception:
+        task_logger.exception("Exception while revoking permissions sync task")
+
+    try:
+        prune_payload = redis_connector.prune.payload
+        if prune_payload and prune_payload.celery_task_id:
+            app.control.revoke(prune_payload.celery_task_id)
+            task_logger.info(f"Revoked pruning task {prune_payload.celery_task_id}.")
+    except Exception:
+        task_logger.exception("Exception while revoking pruning task")
+
+    try:
+        external_group_sync_payload = redis_connector.external_group_sync.payload
+        if external_group_sync_payload and external_group_sync_payload.celery_task_id:
+            app.control.revoke(external_group_sync_payload.celery_task_id)
+            task_logger.info(
+                f"Revoked external group sync task {external_group_sync_payload.celery_task_id}."
+            )
+    except Exception:
+        task_logger.exception("Exception while revoking external group sync task")
+
+
@shared_task(
    name=OnyxCeleryTask.CHECK_FOR_CONNECTOR_DELETION,
    ignore_result=True,
@@ -59,22 +109,36 @@ class TaskDependencyError(RuntimeError):
    trail=False,
    bind=True,
 )
-def check_for_connector_deletion_task(
-    self: Task, *, tenant_id: str | None
-) -> bool | None:
+def check_for_connector_deletion_task(self: Task, *, tenant_id: str) -> bool | None:
    r = get_redis_client()
    r_replica = get_redis_replica_client()
+    r_celery: Redis = self.app.broker_connection().channel().client  # type: ignore

    lock_beat: RedisLock = r.lock(
        OnyxRedisLocks.CHECK_CONNECTOR_DELETION_BEAT_LOCK,
        timeout=CELERY_GENERIC_BEAT_LOCK_TIMEOUT,
    )

-    # these tasks should never overlap
+    # Prevent this task from overlapping with itself
    if not lock_beat.acquire(blocking=False):
        return None

    try:
+        # we want to run this less frequently than the overall task
+        lock_beat.reacquire()
+        if not r.exists(OnyxRedisSignals.BLOCK_VALIDATE_CONNECTOR_DELETION_FENCES):
+            # clear fences that don't have associated celery tasks in progress
+            try:
+                validate_connector_deletion_fences(
+                    tenant_id, r, r_replica, r_celery, lock_beat
+                )
+            except Exception:
+                task_logger.exception(
+                    "Exception while validating connector deletion fences"
+                )
+
+            r.set(OnyxRedisSignals.BLOCK_VALIDATE_CONNECTOR_DELETION_FENCES, 1, ex=300)
+
        # collect cc_pair_ids
        cc_pair_ids: list[int] = []
        with get_session_with_current_tenant() as db_session:
@@ -92,9 +156,38 @@ def check_for_connector_deletion_task(
                    )
                except TaskDependencyError as e:
                    # this means we wanted to start deleting but dependent tasks were running
-                    # Leave a stop signal to clear indexing and pruning tasks more quickly
+                    # on the first error, we set a stop signal and revoke the dependent tasks
+                    # on subsequent errors, we hard reset blocking fences after our specified timeout
+                    # is exceeded
                    task_logger.info(str(e))
-                    redis_connector.stop.set_fence(True)
+
+                    if not redis_connector.stop.fenced:
+                        # one time revoke of celery tasks
+                        task_logger.info("Revoking any tasks blocking deletion.")
+                        revoke_tasks_blocking_deletion(
+                            redis_connector, db_session, self.app
+                        )
+                        redis_connector.stop.set_fence(True)
+                        redis_connector.stop.set_timeout()
+                    else:
+                        # stop signal already set
+                        if redis_connector.stop.timed_out:
+                            # waiting too long, just reset blocking fences
+                            task_logger.info(
+                                "Timed out waiting for tasks blocking deletion. Resetting blocking fences."
+                            )
+                            search_settings_list = get_all_search_settings(db_session)
+                            for search_settings in search_settings_list:
+                                redis_connector_index = redis_connector.new_index(
+                                    search_settings.id
+                                )
+                                redis_connector_index.reset()
+                            redis_connector.prune.reset()
+                            redis_connector.permissions.reset()
+                            redis_connector.external_group_sync.reset()
+                        else:
+                            # just wait
+                            pass
                else:
                    # clear the stop signal if it exists ... no longer needed
                    redis_connector.stop.set_fence(False)
@@ -129,7 +222,7 @@ def try_generate_document_cc_pair_cleanup_tasks(
    cc_pair_id: int,
    db_session: Session,
    lock_beat: RedisLock,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> int | None:
    """Returns an int if syncing is needed. The int represents the number of sync tasks generated.
    Note that syncing can still be required even if the number of sync tasks generated is zero.
@@ -169,6 +262,7 @@ def try_generate_document_cc_pair_cleanup_tasks(
        return None

    # set a basic fence to start
+    redis_connector.delete.set_active()
    fence_payload = RedisConnectorDeletePayload(
        num_tasks=None,
        submitted=datetime.now(timezone.utc),
@@ -249,7 +343,7 @@ def try_generate_document_cc_pair_cleanup_tasks(


 def monitor_connector_deletion_taskset(
-    tenant_id: str | None, key_bytes: bytes, r: Redis
+    tenant_id: str, key_bytes: bytes, r: Redis
 ) -> None:
    fence_key = key_bytes.decode("utf-8")
    cc_pair_id_str = RedisConnector.get_id_from_fence_key(fence_key)
@@ -401,3 +495,171 @@ def monitor_connector_deletion_taskset(
    )

    redis_connector.delete.reset()
+
+
+def validate_connector_deletion_fences(
+    tenant_id: str,
+    r: Redis,
+    r_replica: Redis,
+    r_celery: Redis,
+    lock_beat: RedisLock,
+) -> None:
+    # building lookup table can be expensive, so we won't bother
+    # validating until the queue is small
+    CONNECTION_DELETION_VALIDATION_MAX_QUEUE_LEN = 1024
+
+    queue_len = celery_get_queue_length(OnyxCeleryQueues.CONNECTOR_DELETION, r_celery)
+    if queue_len > CONNECTION_DELETION_VALIDATION_MAX_QUEUE_LEN:
+        return
+
+    queued_upsert_tasks = celery_get_queued_task_ids(
+        OnyxCeleryQueues.CONNECTOR_DELETION, r_celery
+    )
+
+    # validate all existing connector deletion jobs
+    lock_beat.reacquire()
+    keys = cast(set[Any], r_replica.smembers(OnyxRedisConstants.ACTIVE_FENCES))
+    for key in keys:
+        key_bytes = cast(bytes, key)
+        key_str = key_bytes.decode("utf-8")
+        if not key_str.startswith(RedisConnectorDelete.FENCE_PREFIX):
+            continue
+
+        validate_connector_deletion_fence(
+            tenant_id,
+            key_bytes,
+            queued_upsert_tasks,
+            r,
+        )
+
+        lock_beat.reacquire()
+
+    return
+
+
+def validate_connector_deletion_fence(
+    tenant_id: str,
+    key_bytes: bytes,
+    queued_tasks: set[str],
+    r: Redis,
+) -> None:
+    """Checks for the error condition where a connector deletion fence is set but the associated celery tasks don't exist.
+    This can happen if the worker running the deletion tasks hard crashes or is terminated.
+    Being in this bad state means the fence will never clear without help, so this function
+    gives the help.
+
+    How this works:
+    1. This function renews the active signal with a 5 minute TTL under the following conditions
+    1.1. When the task is seen in the redis queue
+    1.2. When the task is seen in the reserved / prefetched list
+
+    2. Externally, the active signal is renewed when:
+    2.1. The fence is created
+    2.2. The indexing watchdog checks the spawned task.
+
+    3. The TTL allows us to get through the transitions on fence startup
+    and when the task starts executing.
+
+    More TTL clarification: it is seemingly impossible to exactly query Celery for
+    whether a task is in the queue or currently executing.
+    1. An unknown task id is always returned as state PENDING.
+    2. Redis can be inspected for the task id, but the task id is gone between the time a worker receives the task
+    and the time it actually starts on the worker.
+
+    queued_tasks: the ids of the tasks currently queued in the celery
+    connector deletion queue
+    """
+    # if the fence doesn't exist, there's nothing to do
+    fence_key = key_bytes.decode("utf-8")
+    cc_pair_id_str = RedisConnector.get_id_from_fence_key(fence_key)
+    if cc_pair_id_str is None:
+        task_logger.warning(
+            f"validate_connector_deletion_fence - could not parse id from {fence_key}"
+        )
+        return
+
+    cc_pair_id = int(cc_pair_id_str)
+    # parse out metadata and initialize the helper class with it
+    redis_connector = RedisConnector(tenant_id, int(cc_pair_id))
+
+    # check to see if the fence/payload exists
+    if not redis_connector.delete.fenced:
+        return
+
+    # in the cloud, the payload format may have changed ...
+    # it's a little sloppy, but just reset the fence for now if that happens
+    # TODO: add intentional cleanup/abort logic
+    try:
+        payload = redis_connector.delete.payload
+    except ValidationError:
+        task_logger.exception(
+            "validate_connector_deletion_fence - "
+            "Resetting fence because fence schema is out of date: "
+            f"cc_pair={cc_pair_id} "
+            f"fence={fence_key}"
+        )
+
+        redis_connector.delete.reset()
+        return
+
+    if not payload:
+        return
+
+    # OK, there's actually something for us to validate
+
+    # look up every task in the current taskset in the celery queue
+    # every entry in the taskset should have an associated entry in the celery task queue
+    # because we get the celery tasks first, the entries in our own deletion taskset
+    # should be roughly a subset of the tasks in celery
+
+    # this check isn't very exact, but should be sufficient over a period of time
+    # A single successful check over some number of attempts is sufficient.
+
+    # TODO: if the number of tasks in celery is much lower than the taskset length
+    # we might be able to shortcut the lookup since by definition some of the tasks
+    # must not exist in celery.
+
+    tasks_scanned = 0
+    tasks_not_in_celery = 0  # a non-zero number after completing our check is bad
+
+    for member in r.sscan_iter(redis_connector.delete.taskset_key):
+        tasks_scanned += 1
+
+        member_bytes = cast(bytes, member)
+        member_str = member_bytes.decode("utf-8")
+        if member_str in queued_tasks:
+            continue
+
+        tasks_not_in_celery += 1
+
+    task_logger.info(
+        "validate_connector_deletion_fence task check: "
+        f"tasks_scanned={tasks_scanned} tasks_not_in_celery={tasks_not_in_celery}"
+    )
+
+    # we're active if there are still tasks to run and those tasks all exist in celery
+    if tasks_scanned > 0 and tasks_not_in_celery == 0:
+        redis_connector.delete.set_active()
+        return
+
+    # we may want to enable this check if using the active task list somehow isn't good enough
+    # if redis_connector_index.generator_locked():
+    #     logger.info(f"{payload.celery_task_id} is currently executing.")
+
+    # if we get here, we didn't find any direct indication that the associated celery tasks exist,
+    # but they still might be there due to gaps in our ability to check states during transitions
+    # Checking the active signal safeguards us against these transition periods
+    # (which has a duration that allows us to bridge those gaps)
+    if redis_connector.delete.active():
+        return
+
+    # celery tasks don't exist and the active signal has expired, possibly due to a crash. Clean it up.
+    task_logger.warning(
+        "validate_connector_deletion_fence - "
+        "Resetting fence because no associated celery tasks were found: "
+        f"cc_pair={cc_pair_id} "
+        f"fence={fence_key}"
+    )
+
+    redis_connector.delete.reset()
+    return
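Reviewer note: the decision order in validate_connector_deletion_fence above can be summarized with a small in-memory sketch. This is only an illustration under stated assumptions: the names FenceState, Decision and decide are hypothetical and not part of the Onyx codebase, and the Redis taskset and the TTL-backed active signal are replaced by plain Python values.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """Hypothetical outcomes mirroring the three exits of the validator."""

    RENEW_ACTIVE = "renew_active"  # every taskset member is still queued
    WAIT = "wait"  # nothing conclusive, but the active signal is still alive
    RESET = "reset"  # no tasks found and the active signal has expired


@dataclass
class FenceState:
    """Stand-in for the Redis state behind one deletion fence (assumption)."""

    taskset: set[str]  # task ids recorded for this deletion
    active: bool  # True while the TTL-backed active signal still exists


def decide(fence: FenceState, queued_tasks: set[str]) -> Decision:
    # Count taskset members that cannot be found in the celery queue.
    tasks_scanned = len(fence.taskset)
    tasks_not_in_celery = len(fence.taskset - queued_tasks)

    # Active only if there is work left and all of it is visible in celery.
    if tasks_scanned > 0 and tasks_not_in_celery == 0:
        return Decision.RENEW_ACTIVE

    # The active signal bridges the windows where a task is neither
    # queued nor observably running.
    if fence.active:
        return Decision.WAIT

    # Tasks are missing and the signal expired: treat the fence as stale.
    return Decision.RESET
```

For example, decide(FenceState({"a"}, active=True), set()) yields Decision.WAIT, which corresponds to the early return on redis_connector.delete.active() rather than an immediate reset.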
--- a/backend/onyx/background/celery/tasks/doc_permission_syncing/tasks.py
+++ b/backend/onyx/background/celery/tasks/doc_permission_syncing/tasks.py
@@ -30,6 +30,7 @@ from onyx.background.celery.celery_redis import celery_find_task
 from onyx.background.celery.celery_redis import celery_get_queue_length
 from onyx.background.celery.celery_redis import celery_get_queued_task_ids
 from onyx.background.celery.celery_redis import celery_get_unacked_task_ids
+from onyx.background.celery.tasks.shared.tasks import OnyxCeleryTaskCompletionStatus
 from onyx.configs.app_configs import JOB_TIMEOUT
 from onyx.configs.constants import CELERY_GENERIC_BEAT_LOCK_TIMEOUT
 from onyx.configs.constants import CELERY_PERMISSIONS_SYNC_LOCK_TIMEOUT
@@ -42,8 +43,10 @@ from onyx.configs.constants import OnyxCeleryTask
 from onyx.configs.constants import OnyxRedisConstants
 from onyx.configs.constants import OnyxRedisLocks
 from onyx.configs.constants import OnyxRedisSignals
+from onyx.connectors.factory import validate_ccpair_for_user
 from onyx.db.connector import mark_cc_pair_as_permissions_synced
 from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
+from onyx.db.connector_credential_pair import update_connector_credential_pair
 from onyx.db.document import upsert_document_by_connector_credential_pair
 from onyx.db.engine import get_session_with_current_tenant
 from onyx.db.enums import AccessType
@@ -63,6 +66,7 @@ from onyx.redis.redis_pool import get_redis_replica_client
 from onyx.redis.redis_pool import redis_lock_dump
 from onyx.server.utils import make_short_id
 from onyx.utils.logger import doc_permission_sync_ctx
+from onyx.utils.logger import format_error_for_logging
 from onyx.utils.logger import LoggerContextVars
 from onyx.utils.logger import setup_logger

@@ -193,12 +197,19 @@ def check_for_doc_permissions_sync(self: Task, *, tenant_id: str) -> bool | None
                    monitor_ccpair_permissions_taskset(
                        tenant_id, key_bytes, r, db_session
                    )
+        task_logger.info(f"check_for_doc_permissions_sync finished: tenant={tenant_id}")
    except SoftTimeLimitExceeded:
        task_logger.info(
            "Soft time limit exceeded, task is being terminated gracefully."
        )
-    except Exception:
-        task_logger.exception(f"Unexpected exception: tenant={tenant_id}")
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Unexpected check_for_doc_permissions_sync exception: tenant={tenant_id} {error_msg}"
+        )
+        task_logger.exception(
+            f"Unexpected check_for_doc_permissions_sync exception: tenant={tenant_id}"
+        )
    finally:
        if lock_beat.owned():
            lock_beat.release()
@@ -210,7 +221,7 @@ def try_creating_permissions_sync_task(
    app: Celery,
    cc_pair_id: int,
    r: Redis,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> str | None:
    """Returns a randomized payload id on success.
    Returns None if no syncing is required."""
@@ -282,13 +293,19 @@ def try_creating_permissions_sync_task(
        redis_connector.permissions.set_fence(payload)

        payload_id = payload.id
-    except Exception:
-        task_logger.exception(f"Unexpected exception: cc_pair={cc_pair_id}")
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Unexpected try_creating_permissions_sync_task exception: cc_pair={cc_pair_id} {error_msg}"
+        )
        return None
    finally:
        if lock.owned():
            lock.release()

+    task_logger.info(
+        f"try_creating_permissions_sync_task finished: cc_pair={cc_pair_id} payload_id={payload_id}"
+    )
    return payload_id


@@ -303,7 +320,7 @@ def try_creating_permissions_sync_task(
 def connector_permission_sync_generator_task(
    self: Task,
    cc_pair_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> None:
    """
    Permission sync task that handles document permission syncing for a given connector credential pair
@@ -388,6 +405,29 @@ def connector_permission_sync_generator_task(
                    f"No connector credential pair found for id: {cc_pair_id}"
                )

+            try:
+                created = validate_ccpair_for_user(
+                    cc_pair.connector.id,
+                    cc_pair.credential.id,
+                    db_session,
+                    enforce_creation=False,
+                )
+                if not created:
+                    task_logger.warning(
+                        f"Unable to create connector credential pair for id: {cc_pair_id}"
+                    )
+            except Exception:
+                task_logger.exception(
+                    f"validate_ccpair_permissions_sync exceptioned: cc_pair={cc_pair_id}"
+                )
+                update_connector_credential_pair(
+                    db_session=db_session,
+                    connector_id=cc_pair.connector.id,
+                    credential_id=cc_pair.credential.id,
+                    status=ConnectorCredentialPairStatus.INVALID,
+                )
+                raise
+
            source_type = cc_pair.connector.source

            doc_sync_func = DOC_PERMISSIONS_FUNC_MAP.get(source_type)
@@ -439,6 +479,10 @@ def connector_permission_sync_generator_task(
            redis_connector.permissions.generator_complete = tasks_generated

    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Permission sync exceptioned: cc_pair={cc_pair_id} payload_id={payload_id} {error_msg}"
+        )
        task_logger.exception(
            f"Permission sync exceptioned: cc_pair={cc_pair_id} payload_id={payload_id}"
        )
@@ -465,7 +509,7 @@ def connector_permission_sync_generator_task(
 )
 def update_external_document_permissions_task(
    self: Task,
-    tenant_id: str | None,
+    tenant_id: str,
    serialized_doc_external_access: dict,
    source_string: str,
    connector_id: int,
@@ -473,6 +517,8 @@ def update_external_document_permissions_task(
 ) -> bool:
    start = time.monotonic()

+    completion_status = OnyxCeleryTaskCompletionStatus.UNDEFINED
+
    document_external_access = DocExternalAccess.from_dict(
        serialized_doc_external_access
    )
@@ -512,18 +558,33 @@ def update_external_document_permissions_task(
                f"elapsed={elapsed:.2f}"
            )

-    except Exception:
+        completion_status = OnyxCeleryTaskCompletionStatus.SUCCEEDED
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Exception in update_external_document_permissions_task: connector_id={connector_id} doc_id={doc_id} {error_msg}"
+        )
        task_logger.exception(
-            f"Exception in update_external_document_permissions_task: "
+            f"update_external_document_permissions_task exceptioned: "
            f"connector_id={connector_id} doc_id={doc_id}"
        )
+        completion_status = OnyxCeleryTaskCompletionStatus.NON_RETRYABLE_EXCEPTION
+    finally:
+        task_logger.info(
+            f"update_external_document_permissions_task completed: status={completion_status.value} doc={doc_id}"
+        )
+
+    if completion_status != OnyxCeleryTaskCompletionStatus.SUCCEEDED:
        return False

+    task_logger.info(
+        f"update_external_document_permissions_task finished: connector_id={connector_id} doc_id={doc_id}"
+    )
    return True


 def validate_permission_sync_fences(
-    tenant_id: str | None,
+    tenant_id: str,
    r: Redis,
    r_replica: Redis,
    r_celery: Redis,
@@ -570,7 +631,7 @@ def validate_permission_sync_fences(


 def validate_permission_sync_fence(
-    tenant_id: str | None,
+    tenant_id: str,
    key_bytes: bytes,
    queued_tasks: set[str],
    reserved_tasks: set[str],
@@ -780,7 +841,7 @@ class PermissionSyncCallback(IndexingHeartbeatInterface):


 def monitor_ccpair_permissions_taskset(
-    tenant_id: str | None, key_bytes: bytes, r: Redis, db_session: Session
+    tenant_id: str, key_bytes: bytes, r: Redis, db_session: Session
 ) -> None:
    fence_key = key_bytes.decode("utf-8")
    cc_pair_id_str = RedisConnector.get_id_from_fence_key(fence_key)
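Reviewer note: the completion-status bookkeeping added to update_external_document_permissions_task follows a try/except/finally shape that can be sketched in isolation. The sketch below is an assumption-laden illustration: CompletionStatus is a hypothetical stand-in for OnyxCeleryTaskCompletionStatus (only the three member names seen in this diff are used, and the string values are invented), and run_with_status with its list-based log is not a function from the codebase.

```python
from enum import Enum
from typing import Callable


class CompletionStatus(str, Enum):
    """Hypothetical stand-in; the string values are assumptions."""

    UNDEFINED = "undefined"
    SUCCEEDED = "succeeded"
    NON_RETRYABLE_EXCEPTION = "non_retryable_exception"


def run_with_status(work: Callable[[], None], log: list[str]) -> bool:
    # Start as UNDEFINED so an unexpected exit path is visible in the log.
    status = CompletionStatus.UNDEFINED
    try:
        work()
        status = CompletionStatus.SUCCEEDED
    except Exception as e:
        # Record the failure, then classify it instead of re-raising.
        log.append(f"exceptioned: {type(e).__name__}: {e}")
        status = CompletionStatus.NON_RETRYABLE_EXCEPTION
    finally:
        # Exactly one completion line is emitted on every path.
        log.append(f"completed: status={status.value}")

    return status == CompletionStatus.SUCCEEDED
```

The design choice this illustrates is that the return value is derived from the recorded status after the finally block, so the log line and the boolean result cannot disagree.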
--- a/backend/onyx/background/celery/tasks/external_group_syncing/tasks.py
+++ b/backend/onyx/background/celery/tasks/external_group_syncing/tasks.py
@@ -37,8 +37,11 @@ from onyx.configs.constants import OnyxCeleryTask
 from onyx.configs.constants import OnyxRedisConstants
 from onyx.configs.constants import OnyxRedisLocks
 from onyx.configs.constants import OnyxRedisSignals
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.factory import validate_ccpair_for_user
 from onyx.db.connector import mark_cc_pair_as_external_group_synced
 from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
+from onyx.db.connector_credential_pair import update_connector_credential_pair
 from onyx.db.engine import get_session_with_current_tenant
 from onyx.db.enums import AccessType
 from onyx.db.enums import ConnectorCredentialPairStatus
@@ -55,6 +58,7 @@ from onyx.redis.redis_connector_ext_group_sync import (
 from onyx.redis.redis_pool import get_redis_client
 from onyx.redis.redis_pool import get_redis_replica_client
 from onyx.server.utils import make_short_id
+from onyx.utils.logger import format_error_for_logging
 from onyx.utils.logger import setup_logger

 logger = setup_logger()
@@ -119,7 +123,7 @@ def _is_external_group_sync_due(cc_pair: ConnectorCredentialPair) -> bool:
    soft_time_limit=JOB_TIMEOUT,
    bind=True,
 )
-def check_for_external_group_sync(self: Task, *, tenant_id: str | None) -> bool | None:
+def check_for_external_group_sync(self: Task, *, tenant_id: str) -> bool | None:
    # we need to use celery's redis client to access its redis data
    # (which lives on a different db number)
    r = get_redis_client()
@@ -148,7 +152,10 @@ def check_for_external_group_sync(self: Task, *, tenant_id: str | None) -> bool
            for source in GROUP_PERMISSIONS_IS_CC_PAIR_AGNOSTIC:
                # These are ordered by cc_pair id so the first one is the one we want
                cc_pairs_to_dedupe = get_cc_pairs_by_source(
-                    db_session, source, only_sync=True
+                    db_session,
+                    source,
+                    access_type=AccessType.SYNC,
+                    status=ConnectorCredentialPairStatus.ACTIVE,
                )
                # We only want to sync one cc_pair per source type
                # in GROUP_PERMISSIONS_IS_CC_PAIR_AGNOSTIC so we dedupe here
@@ -195,12 +202,17 @@ def check_for_external_group_sync(self: Task, *, tenant_id: str | None) -> bool
        task_logger.info(
            "Soft time limit exceeded, task is being terminated gracefully."
        )
-    except Exception:
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Unexpected check_for_external_group_sync exception: tenant={tenant_id} {error_msg}"
+        )
        task_logger.exception(f"Unexpected exception: tenant={tenant_id}")
    finally:
        if lock_beat.owned():
            lock_beat.release()

+    task_logger.info(f"check_for_external_group_sync finished: tenant={tenant_id}")
    return True


@@ -208,7 +220,7 @@ def try_creating_external_group_sync_task(
    app: Celery,
    cc_pair_id: int,
    r: Redis,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> str | None:
    """Returns an int if syncing is needed. The int represents the number of sync tasks generated.
    Returns None if no syncing is required."""
@@ -267,12 +279,19 @@ def try_creating_external_group_sync_task(
        redis_connector.external_group_sync.set_fence(payload)

        payload_id = payload.id
-    except Exception:
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Unexpected try_creating_external_group_sync_task exception: cc_pair={cc_pair_id} {error_msg}"
+        )
        task_logger.exception(
            f"Unexpected exception while trying to create external group sync task: cc_pair={cc_pair_id}"
        )
        return None

+    task_logger.info(
+        f"try_creating_external_group_sync_task finished: cc_pair={cc_pair_id} payload_id={payload_id}"
+    )
    return payload_id


@@ -287,7 +306,7 @@ def try_creating_external_group_sync_task(
 def connector_external_group_sync_generator_task(
    self: Task,
    cc_pair_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> None:
    """
    External group sync task for a given connector credential pair
@@ -361,12 +380,36 @@ def connector_external_group_sync_generator_task(
            cc_pair = get_connector_credential_pair_from_id(
                db_session=db_session,
                cc_pair_id=cc_pair_id,
+                eager_load_credential=True,
            )
            if cc_pair is None:
                raise ValueError(
                    f"No connector credential pair found for id: {cc_pair_id}"
                )

+            try:
+                created = validate_ccpair_for_user(
+                    cc_pair.connector.id,
+                    cc_pair.credential.id,
+                    db_session,
+                    enforce_creation=False,
+                )
+                if not created:
+                    task_logger.warning(
+                        f"Unable to create connector credential pair for id: {cc_pair_id}"
+                    )
+            except Exception:
+                task_logger.exception(
+                    f"validate_ccpair_for_user exceptioned: cc_pair={cc_pair_id}"
+                )
+                update_connector_credential_pair(
+                    db_session=db_session,
+                    connector_id=cc_pair.connector.id,
+                    credential_id=cc_pair.credential.id,
+                    status=ConnectorCredentialPairStatus.INVALID,
+                )
+                raise
+
            source_type = cc_pair.connector.source

            ext_group_sync_func = GROUP_PERMISSIONS_FUNC_MAP.get(source_type)
@@ -378,8 +421,18 @@ def connector_external_group_sync_generator_task(
            logger.info(
                f"Syncing external groups for {source_type} for cc_pair: {cc_pair_id}"
            )
-
-            external_user_groups: list[ExternalUserGroup] = ext_group_sync_func(cc_pair)
+            external_user_groups: list[ExternalUserGroup] = []
+            try:
+                external_user_groups = ext_group_sync_func(tenant_id, cc_pair)
+            except ConnectorValidationError as e:
+                msg = f"Error syncing external groups for {source_type} for cc_pair: {cc_pair_id} {e}"
+                update_connector_credential_pair(
+                    db_session=db_session,
+                    connector_id=cc_pair.connector.id,
+                    credential_id=cc_pair.credential.id,
+                    status=ConnectorCredentialPairStatus.INVALID,
+                )
+                raise e

            logger.info(
                f"Syncing {len(external_user_groups)} external user groups for {source_type}"
@@ -405,6 +458,14 @@ def connector_external_group_sync_generator_task(
                sync_status=SyncStatus.SUCCESS,
            )
    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"External group sync exceptioned: cc_pair={cc_pair_id} payload_id={payload.id} {error_msg}"
+        )
+        task_logger.exception(
+            f"External group sync exceptioned: cc_pair={cc_pair_id} payload_id={payload.id}"
+        )
+
        msg = f"External group sync exceptioned: cc_pair={cc_pair_id} payload_id={payload.id}"
        task_logger.exception(msg)
        emit_background_error(msg + f"\n\n{e}", cc_pair_id=cc_pair_id)
@@ -432,7 +493,7 @@ def connector_external_group_sync_generator_task(


 def validate_external_group_sync_fences(
-    tenant_id: str | None,
+    tenant_id: str,
    celery_app: Celery,
    r: Redis,
    r_replica: Redis,
@@ -464,7 +525,7 @@ def validate_external_group_sync_fences(


 def validate_external_group_sync_fence(
-    tenant_id: str | None,
+    tenant_id: str,
    key_bytes: bytes,
    reserved_tasks: set[str],
    r_celery: Redis,
--- a/backend/onyx/background/celery/tasks/indexing/tasks.py
+++ b/backend/onyx/background/celery/tasks/indexing/tasks.py
@@ -23,9 +23,9 @@ from sqlalchemy.orm import Session

 from onyx.background.celery.apps.app_base import task_logger
 from onyx.background.celery.celery_utils import httpx_init_vespa_pool
-from onyx.background.celery.tasks.indexing.utils import _should_index
 from onyx.background.celery.tasks.indexing.utils import get_unfenced_index_attempt_ids
 from onyx.background.celery.tasks.indexing.utils import IndexingCallback
+from onyx.background.celery.tasks.indexing.utils import should_index
 from onyx.background.celery.tasks.indexing.utils import try_creating_indexing_task
 from onyx.background.celery.tasks.indexing.utils import validate_indexing_fences
 from onyx.background.indexing.checkpointing_utils import cleanup_checkpoint
@@ -48,7 +48,7 @@ from onyx.configs.constants import OnyxCeleryTask
 from onyx.configs.constants import OnyxRedisConstants
 from onyx.configs.constants import OnyxRedisLocks
 from onyx.configs.constants import OnyxRedisSignals
-from onyx.connectors.interfaces import ConnectorValidationError
+from onyx.connectors.exceptions import ConnectorValidationError
 from onyx.db.connector import mark_ccpair_with_indexing_trigger
 from onyx.db.connector_credential_pair import fetch_connector_credential_pairs
 from onyx.db.connector_credential_pair import get_connector_credential_pair_from_id
@@ -61,7 +61,7 @@ from onyx.db.index_attempt import mark_attempt_canceled
 from onyx.db.index_attempt import mark_attempt_failed
 from onyx.db.search_settings import get_active_search_settings_list
 from onyx.db.search_settings import get_current_search_settings
-from onyx.db.swap_index import check_index_swap
+from onyx.db.swap_index import check_and_perform_index_swap
 from onyx.natural_language_processing.search_nlp_models import EmbeddingModel
 from onyx.natural_language_processing.search_nlp_models import warm_up_bi_encoder
 from onyx.redis.redis_connector import RedisConnector
@@ -182,7 +182,7 @@ class SimpleJobResult:


 class ConnectorIndexingContext(BaseModel):
-    tenant_id: str | None
+    tenant_id: str
    cc_pair_id: int
    search_settings_id: int
    index_attempt_id: int
@@ -210,7 +210,7 @@ class ConnectorIndexingLogBuilder:


 def monitor_ccpair_indexing_taskset(
-    tenant_id: str | None, key_bytes: bytes, r: Redis, db_session: Session
+    tenant_id: str, key_bytes: bytes, r: Redis, db_session: Session
 ) -> None:
    # if the fence doesn't exist, there's nothing to do
    fence_key = key_bytes.decode("utf-8")
@@ -358,7 +358,7 @@ def monitor_ccpair_indexing_taskset(
    soft_time_limit=300,
    bind=True,
 )
-def check_for_indexing(self: Task, *, tenant_id: str | None) -> int | None:
+def check_for_indexing(self: Task, *, tenant_id: str) -> int | None:
    """a lightweight task used to kick off indexing tasks.
    Occcasionally does some validation of existing state to clear up error conditions"""

@@ -406,7 +406,7 @@ def check_for_indexing(self: Task, *, tenant_id: str | None) -> int | None:

        # check for search settings swap
        with get_session_with_current_tenant() as db_session:
-            old_search_settings = check_index_swap(db_session=db_session)
+            old_search_settings = check_and_perform_index_swap(db_session=db_session)
            current_search_settings = get_current_search_settings(db_session)
            # So that the first time users aren't surprised by really slow speed of first
            # batch of documents indexed
@@ -439,6 +439,15 @@ def check_for_indexing(self: Task, *, tenant_id: str | None) -> int | None:
            with get_session_with_current_tenant() as db_session:
                search_settings_list = get_active_search_settings_list(db_session)
                for search_settings_instance in search_settings_list:
+                    # skip non-live search settings that don't have background reindex enabled
+                    # those should just auto-change to live shortly after creation without
+                    # requiring any indexing till that point
+                    if (
+                        not search_settings_instance.status.is_current()
+                        and not search_settings_instance.background_reindex_enabled
+                    ):
+                        continue
+
                    redis_connector_index = redis_connector.new_index(
                        search_settings_instance.id
                    )
@@ -456,23 +465,18 @@ def check_for_indexing(self: Task, *, tenant_id: str | None) -> int | None:
                        cc_pair.id, search_settings_instance.id, db_session
                    )

-                    search_settings_primary = False
-                    if search_settings_instance.id == search_settings_list[0].id:
-                        search_settings_primary = True
-
-                    if not _should_index(
+                    if not should_index(
                        cc_pair=cc_pair,
                        last_index=last_attempt,
                        search_settings_instance=search_settings_instance,
-                        search_settings_primary=search_settings_primary,
                        secondary_index_building=len(search_settings_list) > 1,
                        db_session=db_session,
                    ):
                        continue

                    reindex = False
-                    if search_settings_instance.id == search_settings_list[0].id:
-                        # the indexing trigger is only checked and cleared with the primary search settings
+                    if search_settings_instance.status.is_current():
+                        # the indexing trigger is only checked and cleared with the current search settings
                        if cc_pair.indexing_trigger is not None:
                            if cc_pair.indexing_trigger == IndexingMode.REINDEX:
                                reindex = True
@@ -598,7 +602,7 @@ def connector_indexing_task(
    cc_pair_id: int,
    search_settings_id: int,
    is_ee: bool,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> int | None:
    """Indexing task. For a cc pair, this task pulls all document IDs from the source
    and compares those IDs to locally stored documents and deletes all locally stored IDs missing
@@ -890,7 +894,7 @@ def connector_indexing_proxy_task(
    index_attempt_id: int,
    cc_pair_id: int,
    search_settings_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> None:
    """celery out of process task execution strategy is pool=prefork, but it uses fork,
    and forking is inherently unstable.
@@ -899,6 +903,9 @@ def connector_indexing_proxy_task(

    TODO(rkuo): refactor this so that there is a single return path where we canonically
    log the result of running this function.
+
+    NOTE: we try/except all db access in this function because as a watchdog, this function
+    needs to be extremely stable.
    """
    start = time.monotonic()

@@ -924,6 +931,7 @@ def connector_indexing_proxy_task(
        task_logger.error("self.request.id is None!")

    client = SimpleJobClient()
+    task_logger.info(f"submitting connector_indexing_task with tenant_id={tenant_id}")

    job = client.submit(
        connector_indexing_task,
@@ -1016,7 +1024,7 @@ def connector_indexing_proxy_task(
                    job.release()
                    break

-            # if a termination signal is detected, clean up and break
+            # if a termination signal is detected, break (exit point will clean up)
            if self.request.id and redis_connector_index.terminating(self.request.id):
                task_logger.warning(
                    log_builder.build("Indexing watchdog - termination signal detected")
@@ -1025,6 +1033,7 @@ def connector_indexing_proxy_task(
                result.status = IndexingWatchdogTerminalStatus.TERMINATED_BY_SIGNAL
                break

+            # if activity timeout is detected, break (exit point will clean up)
            if not redis_connector_index.connector_active():
                task_logger.warning(
                    log_builder.build(
@@ -1033,25 +1042,6 @@ def connector_indexing_proxy_task(
                    )
                )

-                try:
-                    with get_session_with_current_tenant() as db_session:
-                        mark_attempt_failed(
-                            index_attempt_id,
-                            db_session,
-                            "Indexing watchdog - activity timeout exceeded: "
-                            f"attempt={index_attempt_id} "
-                            f"timeout={CELERY_INDEXING_WATCHDOG_CONNECTOR_TIMEOUT}s",
-                        )
-                except Exception:
-                    # if the DB exceptions, we'll just get an unfriendly failure message
-                    # in the UI instead of the cancellation message
-                    logger.exception(
-                        log_builder.build(
-                            "Indexing watchdog - transient exception marking index attempt as failed"
-                        )
-                    )
-
-                job.cancel()
                result.status = (
                    IndexingWatchdogTerminalStatus.TERMINATED_BY_ACTIVITY_TIMEOUT
                )
@@ -1070,15 +1060,15 @@ def connector_indexing_proxy_task(

                    if not index_attempt.is_finished():
                        continue
+
            except Exception:
-                # if the DB exceptioned, just restart the check.
-                # polling the index attempt status doesn't need to be strongly consistent
                task_logger.exception(
                    log_builder.build(
                        "Indexing watchdog - transient exception looking up index attempt"
                    )
                )
                continue
+
    except Exception as e:
        result.status = IndexingWatchdogTerminalStatus.WATCHDOG_EXCEPTIONED
        if isinstance(e, ConnectorValidationError):
@@ -1139,8 +1129,6 @@ def connector_indexing_proxy_task(
                    "Connector termination signal detected",
                )
        except Exception:
-            # if the DB exceptions, we'll just get an unfriendly failure message
-            # in the UI instead of the cancellation message
            task_logger.exception(
                log_builder.build(
                    "Indexing watchdog - transient exception marking index attempt as canceled"
@@ -1148,6 +1136,25 @@ def connector_indexing_proxy_task(
            )

        job.cancel()
+    elif result.status == IndexingWatchdogTerminalStatus.TERMINATED_BY_ACTIVITY_TIMEOUT:
+        try:
+            with get_session_with_current_tenant() as db_session:
+                mark_attempt_failed(
+                    index_attempt_id,
+                    db_session,
+                    "Indexing watchdog - activity timeout exceeded: "
+                    f"attempt={index_attempt_id} "
+                    f"timeout={CELERY_INDEXING_WATCHDOG_CONNECTOR_TIMEOUT}s",
+                )
+        except Exception:
+            logger.exception(
+                log_builder.build(
+                    "Indexing watchdog - transient exception marking index attempt as failed"
+                )
+            )
+        job.cancel()
+    else:
+        pass

    task_logger.info(
        log_builder.build(
@@ -1167,7 +1174,7 @@ def connector_indexing_proxy_task(
    name=OnyxCeleryTask.CHECK_FOR_CHECKPOINT_CLEANUP,
    soft_time_limit=300,
 )
-def check_for_checkpoint_cleanup(*, tenant_id: str | None) -> None:
+def check_for_checkpoint_cleanup(*, tenant_id: str) -> None:
    """Clean up old checkpoints that are older than 7 days."""
    locked = False
    redis_client = get_redis_client(tenant_id=tenant_id)
--- a/backend/onyx/background/celery/tasks/indexing/utils.py
+++ b/backend/onyx/background/celery/tasks/indexing/utils.py
@@ -187,7 +187,7 @@ class IndexingCallback(IndexingCallbackBase):


 def validate_indexing_fence(
-    tenant_id: str | None,
+    tenant_id: str,
    key_bytes: bytes,
    reserved_tasks: set[str],
    r_celery: Redis,
@@ -311,7 +311,7 @@ def validate_indexing_fence(


 def validate_indexing_fences(
-    tenant_id: str | None,
+    tenant_id: str,
    r_replica: Redis,
    r_celery: Redis,
    lock_beat: RedisLock,
@@ -346,11 +346,10 @@ def validate_indexing_fences(
    return


-def _should_index(
+def should_index(
    cc_pair: ConnectorCredentialPair,
    last_index: IndexAttempt | None,
    search_settings_instance: SearchSettings,
-    search_settings_primary: bool,
    secondary_index_building: bool,
    db_session: Session,
 ) -> bool:
@@ -415,9 +414,9 @@ def _should_index(
    ):
        return False

-    if search_settings_primary:
+    if search_settings_instance.status.is_current():
        if cc_pair.indexing_trigger is not None:
-            # if a manual indexing trigger is on the cc pair, honor it for primary search settings
+            # if a manual indexing trigger is on the cc pair, honor it for live search settings
            return True

    # if no attempt has ever occurred, we should index regardless of refresh_freq
@@ -442,7 +441,7 @@ def try_creating_indexing_task(
    reindex: bool,
    db_session: Session,
    r: Redis,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> int | None:
    """Checks for any conditions that should block the indexing task from being
    created, then creates the task.
--- a/backend/onyx/background/celery/tasks/llm_model_update/tasks.py
+++ b/backend/onyx/background/celery/tasks/llm_model_update/tasks.py
@@ -59,7 +59,7 @@ def _process_model_list_response(model_list_json: Any) -> list[str]:
    trail=False,
    bind=True,
 )
-def check_for_llm_model_update(self: Task, *, tenant_id: str | None) -> bool | None:
+def check_for_llm_model_update(self: Task, *, tenant_id: str) -> bool | None:
    if not LLM_MODEL_UPDATE_API_URL:
        raise ValueError("LLM model update API URL not configured")

--- a/backend/onyx/background/celery/tasks/monitoring/tasks.py
+++ b/backend/onyx/background/celery/tasks/monitoring/tasks.py
@@ -91,7 +91,7 @@ class Metric(BaseModel):
        }
        task_logger.info(json.dumps(data))

-    def emit(self, tenant_id: str | None) -> None:
+    def emit(self, tenant_id: str) -> None:
        # Convert value to appropriate type based on the input value
        bool_value = None
        float_value = None
@@ -656,7 +656,7 @@ def build_job_id(
    queue=OnyxCeleryQueues.MONITORING,
    bind=True,
 )
-def monitor_background_processes(self: Task, *, tenant_id: str | None) -> None:
+def monitor_background_processes(self: Task, *, tenant_id: str) -> None:
    """Collect and emit metrics about background processes.
    This task runs periodically to gather metrics about:
    - Queue lengths for different Celery queues
@@ -864,7 +864,7 @@ def cloud_monitor_celery_queues(


@shared_task(name=OnyxCeleryTask.MONITOR_CELERY_QUEUES, ignore_result=True, bind=True)
-def monitor_celery_queues(self: Task, *, tenant_id: str | None) -> None:
+def monitor_celery_queues(self: Task, *, tenant_id: str) -> None:
    return monitor_celery_queues_helper(self)


--- a/backend/onyx/background/celery/tasks/periodic/tasks.py
+++ b/backend/onyx/background/celery/tasks/periodic/tasks.py
@@ -24,7 +24,7 @@ from onyx.db.engine import get_session_with_current_tenant
    bind=True,
    base=AbortableTask,
 )
-def kombu_message_cleanup_task(self: Any, tenant_id: str | None) -> int:
+def kombu_message_cleanup_task(self: Any, tenant_id: str) -> int:
    """Runs periodically to clean up the kombu_message table"""

    # we will select messages older than this amount to clean up
--- a/backend/onyx/background/celery/tasks/pruning/tasks.py
+++ b/backend/onyx/background/celery/tasks/pruning/tasks.py
@@ -55,6 +55,7 @@ from onyx.redis.redis_connector_prune import RedisConnectorPrunePayload
 from onyx.redis.redis_pool import get_redis_client
 from onyx.redis.redis_pool import get_redis_replica_client
 from onyx.server.utils import make_short_id
+from onyx.utils.logger import format_error_for_logging
 from onyx.utils.logger import LoggerContextVars
 from onyx.utils.logger import pruning_ctx
 from onyx.utils.logger import setup_logger
@@ -113,7 +114,7 @@ def _is_pruning_due(cc_pair: ConnectorCredentialPair) -> bool:
    soft_time_limit=JOB_TIMEOUT,
    bind=True,
 )
-def check_for_pruning(self: Task, *, tenant_id: str | None) -> bool | None:
+def check_for_pruning(self: Task, *, tenant_id: str) -> bool | None:
    r = get_redis_client()
    r_replica = get_redis_replica_client()
    r_celery: Redis = self.app.broker_connection().channel().client  # type: ignore
@@ -194,12 +195,14 @@ def check_for_pruning(self: Task, *, tenant_id: str | None) -> bool | None:
        task_logger.info(
            "Soft time limit exceeded, task is being terminated gracefully."
        )
-    except Exception:
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(f"Unexpected pruning check exception: {error_msg}")
        task_logger.exception("Unexpected exception during pruning check")
    finally:
        if lock_beat.owned():
            lock_beat.release()
-
+    task_logger.info(f"check_for_pruning finished: tenant={tenant_id}")
    return True


@@ -208,7 +211,7 @@ def try_creating_prune_generator_task(
    cc_pair: ConnectorCredentialPair,
    db_session: Session,
    r: Redis,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> str | None:
    """Checks for any conditions that should block the pruning generator task from being
    created, then creates the task.
@@ -301,13 +304,19 @@ def try_creating_prune_generator_task(
        redis_connector.prune.set_fence(payload)

        payload_id = payload.id
-    except Exception:
+    except Exception as e:
+        error_msg = format_error_for_logging(e)
+        task_logger.warning(
+            f"Unexpected try_creating_prune_generator_task exception: cc_pair={cc_pair.id} {error_msg}"
+        )
        task_logger.exception(f"Unexpected exception: cc_pair={cc_pair.id}")
        return None
    finally:
        if lock.owned():
            lock.release()
-
+    task_logger.info(
+        f"try_creating_prune_generator_task finished: cc_pair={cc_pair.id} payload_id={payload_id}"
+    )
    return payload_id


@@ -324,7 +333,7 @@ def connector_pruning_generator_task(
    cc_pair_id: int,
    connector_id: int,
    credential_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> None:
    """connector pruning task. For a cc pair, this task pulls all document IDs from the source
    and compares those IDs to locally stored documents and deletes all locally stored IDs missing
@@ -512,7 +521,7 @@ def connector_pruning_generator_task(


 def monitor_ccpair_pruning_taskset(
-    tenant_id: str | None, key_bytes: bytes, r: Redis, db_session: Session
+    tenant_id: str, key_bytes: bytes, r: Redis, db_session: Session
 ) -> None:
    fence_key = key_bytes.decode("utf-8")
    cc_pair_id_str = RedisConnector.get_id_from_fence_key(fence_key)
@@ -558,7 +567,7 @@ def monitor_ccpair_pruning_taskset(


 def validate_pruning_fences(
-    tenant_id: str | None,
+    tenant_id: str,
    r: Redis,
    r_replica: Redis,
    r_celery: Redis,
@@ -606,7 +615,7 @@ def validate_pruning_fences(


 def validate_pruning_fence(
-    tenant_id: str | None,
+    tenant_id: str,
    key_bytes: bytes,
    reserved_tasks: set[str],
    queued_tasks: set[str],
--- a/backend/onyx/background/celery/tasks/shared/RetryDocumentIndex.py
+++ b/backend/onyx/background/celery/tasks/shared/RetryDocumentIndex.py
@@ -32,7 +32,7 @@ class RetryDocumentIndex:
        self,
        doc_id: str,
        *,
-        tenant_id: str | None,
+        tenant_id: str,
        chunk_count: int | None,
    ) -> int:
        return self.index.delete_single(
@@ -50,7 +50,7 @@ class RetryDocumentIndex:
        self,
        doc_id: str,
        *,
-        tenant_id: str | None,
+        tenant_id: str,
        chunk_count: int | None,
        fields: VespaDocumentFields,
    ) -> int:
--- a/backend/onyx/background/celery/tasks/shared/tasks.py
+++ b/backend/onyx/background/celery/tasks/shared/tasks.py
@@ -1,4 +1,5 @@
 import time
+from enum import Enum
 from http import HTTPStatus

 import httpx
@@ -45,6 +46,24 @@ LIGHT_SOFT_TIME_LIMIT = 105
 LIGHT_TIME_LIMIT = LIGHT_SOFT_TIME_LIMIT + 15


+class OnyxCeleryTaskCompletionStatus(str, Enum):
+    """The different statuses the watchdog can finish with.
+
+    TODO: create broader success/failure/abort categories
+    """
+
+    UNDEFINED = "undefined"
+
+    SUCCEEDED = "succeeded"
+
+    SKIPPED = "skipped"
+
+    SOFT_TIME_LIMIT = "soft_time_limit"
+
+    NON_RETRYABLE_EXCEPTION = "non_retryable_exception"
+    RETRYABLE_EXCEPTION = "retryable_exception"
+
+
@shared_task(
    name=OnyxCeleryTask.DOCUMENT_BY_CC_PAIR_CLEANUP_TASK,
    soft_time_limit=LIGHT_SOFT_TIME_LIMIT,
@@ -57,7 +76,7 @@ def document_by_cc_pair_cleanup_task(
    document_id: str,
    connector_id: int,
    credential_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> bool:
    """A lightweight subtask used to clean up document to cc pair relationships.
    Created by connection deletion and connector pruning parent tasks."""
@@ -78,6 +97,8 @@ def document_by_cc_pair_cleanup_task(

    start = time.monotonic()

+    completion_status = OnyxCeleryTaskCompletionStatus.UNDEFINED
+
    try:
        with get_session_with_current_tenant() as db_session:
            action = "skip"
@@ -110,6 +131,9 @@ def document_by_cc_pair_cleanup_task(
                    db_session=db_session,
                    document_ids=[document_id],
                )
+                db_session.commit()
+
+                completion_status = OnyxCeleryTaskCompletionStatus.SUCCEEDED
            elif count > 1:
                action = "update"

@@ -153,10 +177,11 @@ def document_by_cc_pair_cleanup_task(
                )

                mark_document_as_synced(document_id, db_session)
-            else:
-                pass
+                db_session.commit()

-            db_session.commit()
+                completion_status = OnyxCeleryTaskCompletionStatus.SUCCEEDED
+            else:
+                completion_status = OnyxCeleryTaskCompletionStatus.SKIPPED

            elapsed = time.monotonic() - start
            task_logger.info(
@@ -168,57 +193,79 @@ def document_by_cc_pair_cleanup_task(
            )
    except SoftTimeLimitExceeded:
        task_logger.info(f"SoftTimeLimitExceeded exception. doc={document_id}")
-        return False
+        completion_status = OnyxCeleryTaskCompletionStatus.SOFT_TIME_LIMIT
    except Exception as ex:
        e: Exception | None = None
-        if isinstance(ex, RetryError):
-            task_logger.warning(
-                f"Tenacity retry failed: num_attempts={ex.last_attempt.attempt_number}"
+        while True:
+            if isinstance(ex, RetryError):
+                task_logger.warning(
+                    f"Tenacity retry failed: num_attempts={ex.last_attempt.attempt_number}"
+                )
+
+                # only set the inner exception if it is of type Exception
+                e_temp = ex.last_attempt.exception()
+                if isinstance(e_temp, Exception):
+                    e = e_temp
+            else:
+                e = ex
+
+            if isinstance(e, httpx.HTTPStatusError):
+                if e.response.status_code == HTTPStatus.BAD_REQUEST:
+                    task_logger.exception(
+                        f"Non-retryable HTTPStatusError: "
+                        f"doc={document_id} "
+                        f"status={e.response.status_code}"
+                    )
+                completion_status = (
+                    OnyxCeleryTaskCompletionStatus.NON_RETRYABLE_EXCEPTION
+                )
+                break
+
+            task_logger.exception(
+                f"document_by_cc_pair_cleanup_task exceptioned: doc={document_id}"
            )

-            # only set the inner exception if it is of type Exception
-            e_temp = ex.last_attempt.exception()
-            if isinstance(e_temp, Exception):
-                e = e_temp
-        else:
-            e = ex
-
-        if isinstance(e, httpx.HTTPStatusError):
-            if e.response.status_code == HTTPStatus.BAD_REQUEST:
-                task_logger.exception(
-                    f"Non-retryable HTTPStatusError: "
-                    f"doc={document_id} "
-                    f"status={e.response.status_code}"
+            completion_status = OnyxCeleryTaskCompletionStatus.RETRYABLE_EXCEPTION
+            if (
+                self.max_retries is not None
+                and self.request.retries >= self.max_retries
+            ):
+                # This is the last attempt! mark the document as dirty in the db so that it
+                # eventually gets fixed out of band via stale document reconciliation
+                task_logger.warning(
+                    f"Max celery task retries reached. Marking doc as dirty for reconciliation: "
+                    f"doc={document_id}"
                )
-            return False
+                with get_session_with_current_tenant() as db_session:
+                    # delete the cc pair relationship now and let reconciliation clean it up
+                    # in vespa
+                    delete_document_by_connector_credential_pair__no_commit(
+                        db_session=db_session,
+                        document_id=document_id,
+                        connector_credential_pair_identifier=ConnectorCredentialPairIdentifier(
+                            connector_id=connector_id,
+                            credential_id=credential_id,
+                        ),
+                    )
+                    mark_document_as_modified(document_id, db_session)
+                completion_status = (
+                    OnyxCeleryTaskCompletionStatus.NON_RETRYABLE_EXCEPTION
+                )
+                break

-        task_logger.exception(f"Unexpected exception: doc={document_id}")
-
-        if self.request.retries < DOCUMENT_BY_CC_PAIR_CLEANUP_MAX_RETRIES:
-            # Still retrying. Exponential backoff from 2^4 to 2^6 ... i.e. 16, 32, 64
+            # Exponential backoff from 2^4 to 2^6 ... i.e. 16, 32, 64
            countdown = 2 ** (self.request.retries + 4)
-            self.retry(exc=e, countdown=countdown)
-        else:
-            # This is the last attempt! mark the document as dirty in the db so that it
-            # eventually gets fixed out of band via stale document reconciliation
-            task_logger.warning(
-                f"Max celery task retries reached. Marking doc as dirty for reconciliation: "
-                f"doc={document_id}"
-            )
-            with get_session_with_current_tenant() as db_session:
-                # delete the cc pair relationship now and let reconciliation clean it up
-                # in vespa
-                delete_document_by_connector_credential_pair__no_commit(
-                    db_session=db_session,
-                    document_id=document_id,
-                    connector_credential_pair_identifier=ConnectorCredentialPairIdentifier(
-                        connector_id=connector_id,
-                        credential_id=credential_id,
-                    ),
-                )
-                mark_document_as_modified(document_id, db_session)
+            self.retry(exc=e, countdown=countdown)  # this will raise a celery exception
+            break  # we won't hit this, but it looks weird not to have it
+    finally:
+        task_logger.info(
+            f"document_by_cc_pair_cleanup_task completed: status={completion_status.value} doc={document_id}"
+        )
+
+    if completion_status != OnyxCeleryTaskCompletionStatus.SUCCEEDED:
        return False

+    task_logger.info(f"document_by_cc_pair_cleanup_task finished: doc={document_id}")
    return True


@@ -250,7 +297,8 @@ def cloud_beat_task_generator(
        return None

    last_lock_time = time.monotonic()
-    tenant_ids: list[str] | list[None] = []
+    tenant_ids: list[str] = []
+    num_processed_tenants = 0

    try:
        tenant_ids = get_all_tenant_ids()
@@ -278,6 +326,8 @@ def cloud_beat_task_generator(
                expires=expires,
                ignore_result=True,
            )
+
+            num_processed_tenants += 1
    except SoftTimeLimitExceeded:
        task_logger.info(
            "Soft time limit exceeded, task is being terminated gracefully."
@@ -297,6 +347,7 @@ def cloud_beat_task_generator(
    task_logger.info(
        f"cloud_beat_task_generator finished: "
        f"task={task_name} "
+        f"num_processed_tenants={num_processed_tenants} "
        f"num_tenants={len(tenant_ids)} "
        f"elapsed={time_elapsed:.2f}"
    )
--- a/backend/onyx/background/celery/tasks/vespa/tasks.py
+++ b/backend/onyx/background/celery/tasks/vespa/tasks.py
@@ -19,6 +19,7 @@ from onyx.background.celery.apps.app_base import task_logger
 from onyx.background.celery.tasks.shared.RetryDocumentIndex import RetryDocumentIndex
 from onyx.background.celery.tasks.shared.tasks import LIGHT_SOFT_TIME_LIMIT
 from onyx.background.celery.tasks.shared.tasks import LIGHT_TIME_LIMIT
+from onyx.background.celery.tasks.shared.tasks import OnyxCeleryTaskCompletionStatus
 from onyx.configs.app_configs import JOB_TIMEOUT
 from onyx.configs.app_configs import VESPA_SYNC_MAX_TASKS
 from onyx.configs.constants import CELERY_VESPA_SYNC_BEAT_LOCK_TIMEOUT
@@ -75,7 +76,7 @@ logger = setup_logger()
    trail=False,
    bind=True,
 )
-def check_for_vespa_sync_task(self: Task, *, tenant_id: str | None) -> bool | None:
+def check_for_vespa_sync_task(self: Task, *, tenant_id: str) -> bool | None:
    """Runs periodically to check if any document needs syncing.
    Generates sets of tasks for Celery if syncing is needed."""

@@ -207,7 +208,7 @@ def try_generate_stale_document_sync_tasks(
    db_session: Session,
    r: Redis,
    lock_beat: RedisLock,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> int | None:
    # the fence is up, do nothing

@@ -283,7 +284,7 @@ def try_generate_document_set_sync_tasks(
    db_session: Session,
    r: Redis,
    lock_beat: RedisLock,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> int | None:
    lock_beat.reacquire()

@@ -360,7 +361,7 @@ def try_generate_user_group_sync_tasks(
    db_session: Session,
    r: Redis,
    lock_beat: RedisLock,
-    tenant_id: str | None,
+    tenant_id: str,
 ) -> int | None:
    lock_beat.reacquire()

@@ -447,7 +448,7 @@ def monitor_connector_taskset(r: Redis) -> None:


 def monitor_document_set_taskset(
-    tenant_id: str | None, key_bytes: bytes, r: Redis, db_session: Session
+    tenant_id: str, key_bytes: bytes, r: Redis, db_session: Session
 ) -> None:
    fence_key = key_bytes.decode("utf-8")
    document_set_id_str = RedisDocumentSet.get_id_from_fence_key(fence_key)
@@ -522,11 +523,11 @@ def monitor_document_set_taskset(
    time_limit=LIGHT_TIME_LIMIT,
    max_retries=3,
 )
-def vespa_metadata_sync_task(
-    self: Task, document_id: str, *, tenant_id: str | None
-) -> bool:
+def vespa_metadata_sync_task(self: Task, document_id: str, *, tenant_id: str) -> bool:
    start = time.monotonic()

+    completion_status = OnyxCeleryTaskCompletionStatus.UNDEFINED
+
    try:
        with get_session_with_current_tenant() as db_session:
            active_search_settings = get_active_search_settings(db_session)
@@ -540,75 +541,103 @@ def vespa_metadata_sync_task(

            doc = get_document(document_id, db_session)
            if not doc:
-                return False
+                elapsed = time.monotonic() - start
+                task_logger.info(
+                    f"doc={document_id} "
+                    f"action=no_operation "
+                    f"elapsed={elapsed:.2f}"
+                )
+                completion_status = OnyxCeleryTaskCompletionStatus.SKIPPED
+            else:
+                # document set sync
+                doc_sets = fetch_document_sets_for_document(document_id, db_session)
+                update_doc_sets: set[str] = set(doc_sets)

-            # document set sync
-            doc_sets = fetch_document_sets_for_document(document_id, db_session)
-            update_doc_sets: set[str] = set(doc_sets)
+                # User group sync
+                doc_access = get_access_for_document(
+                    document_id=document_id, db_session=db_session
+                )

-            # User group sync
-            doc_access = get_access_for_document(
-                document_id=document_id, db_session=db_session
-            )
+                fields = VespaDocumentFields(
+                    document_sets=update_doc_sets,
+                    access=doc_access,
+                    boost=doc.boost,
+                    hidden=doc.hidden,
+                )

-            fields = VespaDocumentFields(
-                document_sets=update_doc_sets,
-                access=doc_access,
-                boost=doc.boost,
-                hidden=doc.hidden,
-            )
+                # update Vespa. OK if doc doesn't exist. Raises exception otherwise.
+                chunks_affected = retry_index.update_single(
+                    document_id,
+                    tenant_id=tenant_id,
+                    chunk_count=doc.chunk_count,
+                    fields=fields,
+                )

-            # update Vespa. OK if doc doesn't exist. Raises exception otherwise.
-            chunks_affected = retry_index.update_single(
-                document_id,
-                tenant_id=tenant_id,
-                chunk_count=doc.chunk_count,
-                fields=fields,
-            )
+                # update db last. Worst case = we crash right before this and
+                # the sync might repeat again later
+                mark_document_as_synced(document_id, db_session)

-            # update db last. Worst case = we crash right before this and
-            # the sync might repeat again later
-            mark_document_as_synced(document_id, db_session)
-
-            elapsed = time.monotonic() - start
-            task_logger.info(
-                f"doc={document_id} "
-                f"action=sync "
-                f"chunks={chunks_affected} "
-                f"elapsed={elapsed:.2f}"
-            )
+                elapsed = time.monotonic() - start
+                task_logger.info(
+                    f"doc={document_id} "
+                    f"action=sync "
+                    f"chunks={chunks_affected} "
+                    f"elapsed={elapsed:.2f}"
+                )
+                completion_status = OnyxCeleryTaskCompletionStatus.SUCCEEDED
    except SoftTimeLimitExceeded:
        task_logger.info(f"SoftTimeLimitExceeded exception. doc={document_id}")
-        return False
+        completion_status = OnyxCeleryTaskCompletionStatus.SOFT_TIME_LIMIT
    except Exception as ex:
        e: Exception | None = None
-        if isinstance(ex, RetryError):
-            task_logger.warning(
-                f"Tenacity retry failed: num_attempts={ex.last_attempt.attempt_number}"
+        while True:
+            if isinstance(ex, RetryError):
+                task_logger.warning(
+                    f"Tenacity retry failed: num_attempts={ex.last_attempt.attempt_number}"
+                )
+
+                # only set the inner exception if it is of type Exception
+                e_temp = ex.last_attempt.exception()
+                if isinstance(e_temp, Exception):
+                    e = e_temp
+            else:
+                e = ex
+
+            if isinstance(e, httpx.HTTPStatusError):
+                if e.response.status_code == HTTPStatus.BAD_REQUEST:
+                    task_logger.exception(
+                        f"Non-retryable HTTPStatusError: "
+                        f"doc={document_id} "
+                        f"status={e.response.status_code}"
+                    )
+                completion_status = (
+                    OnyxCeleryTaskCompletionStatus.NON_RETRYABLE_EXCEPTION
+                )
+                break
+
+            task_logger.exception(
+                f"vespa_metadata_sync_task raised an exception: doc={document_id}"
            )

-            # only set the inner exception if it is of type Exception
-            e_temp = ex.last_attempt.exception()
-            if isinstance(e_temp, Exception):
-                e = e_temp
-        else:
-            e = ex
-
-        if isinstance(e, httpx.HTTPStatusError):
-            if e.response.status_code == HTTPStatus.BAD_REQUEST:
-                task_logger.exception(
-                    f"Non-retryable HTTPStatusError: "
-                    f"doc={document_id} "
-                    f"status={e.response.status_code}"
+            completion_status = OnyxCeleryTaskCompletionStatus.RETRYABLE_EXCEPTION
+            if (
+                self.max_retries is not None
+                and self.request.retries >= self.max_retries
+            ):
+                completion_status = (
+                    OnyxCeleryTaskCompletionStatus.NON_RETRYABLE_EXCEPTION
                )
-            return False

-        task_logger.exception(
-            f"Unexpected exception during vespa metadata sync: doc={document_id}"
+            # Exponential backoff from 2^4 to 2^6 ... i.e. 16, 32, 64
+            countdown = 2 ** (self.request.retries + 4)
+            self.retry(exc=e, countdown=countdown)  # this will raise a celery exception
+            break  # we won't hit this, but it looks weird not to have it
+    finally:
+        task_logger.info(
+            f"vespa_metadata_sync_task completed: status={completion_status.value} doc={document_id}"
        )

-        # Exponential backoff from 2^4 to 2^6 ... i.e. 16, 32, 64
-        countdown = 2 ** (self.request.retries + 4)
-        self.retry(exc=e, countdown=countdown)
+    if completion_status != OnyxCeleryTaskCompletionStatus.SUCCEEDED:
+        return False

    return True
--- a/backend/onyx/background/error_logging.py
+++ b/backend/onyx/background/error_logging.py
@@ -1,3 +1,5 @@
+from sqlalchemy.exc import IntegrityError
+
 from onyx.db.background_error import create_background_error
 from onyx.db.engine import get_session_with_current_tenant

@@ -9,5 +11,27 @@ def emit_background_error(
    """Currently just saves a row in the background_errors table.

    In the future, could create notifications based on the severity."""
-    with get_session_with_current_tenant() as db_session:
-        create_background_error(db_session, message, cc_pair_id)
+    error_message = ""
+
+    # try to write to the db, but handle IntegrityError specifically
+    try:
+        with get_session_with_current_tenant() as db_session:
+            create_background_error(db_session, message, cc_pair_id)
+    except IntegrityError as e:
+        # e.g. the cc_pair_id was deleted; build a message to record in a second attempt below
+        error_message = (
+            f"Failed to create background error: {str(e)}. Original message: {message}"
+        )
+    except Exception:
+        pass
+
+    if not error_message:
+        return
+
+    # if we get here from an IntegrityError, try to write the error message to the db
+    # we need a new session because the first session is now invalid
+    try:
+        with get_session_with_current_tenant() as db_session:
+            create_background_error(db_session, error_message, None)
+    except Exception:
+        pass
--- a/backend/onyx/background/indexing/job_client.py
+++ b/backend/onyx/background/indexing/job_client.py
@@ -16,7 +16,10 @@ from typing import Optional

 from onyx.configs.constants import POSTGRES_CELERY_WORKER_INDEXING_CHILD_APP_NAME
 from onyx.db.engine import SqlEngine
-from onyx.utils.logger import setup_logger
+from onyx.setup import setup_logger
+from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA
+from shared_configs.configs import TENANT_ID_PREFIX
+from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR

 logger = setup_logger()

@@ -54,6 +57,15 @@ def _initializer(
        kwargs = {}

    logger.info("Initializing spawned worker child process.")
+    # 1. Get tenant_id from args or fall back to the default
+    tenant_id = POSTGRES_DEFAULT_SCHEMA
+    for arg in reversed(args):
+        if isinstance(arg, str) and arg.startswith(TENANT_ID_PREFIX):
+            tenant_id = arg
+            break
+
+    # 2. Set the tenant context before running anything
+    token = CURRENT_TENANT_ID_CONTEXTVAR.set(tenant_id)

    # Reset the engine in the child process
    SqlEngine.reset_engine()
@@ -81,6 +93,8 @@ def _initializer(
        queue.put(error_msg)  # Send the exception to the parent process

        sys.exit(255)  # use 255 to indicate a generic exception
+    finally:
+        CURRENT_TENANT_ID_CONTEXTVAR.reset(token)


 def _run_in_process(
--- a/backend/onyx/background/indexing/run_indexing.py
+++ b/backend/onyx/background/indexing/run_indexing.py
@@ -15,13 +15,15 @@ from onyx.background.indexing.memory_tracer import MemoryTracer
 from onyx.configs.app_configs import INDEX_BATCH_SIZE
 from onyx.configs.app_configs import INDEXING_SIZE_WARNING_THRESHOLD
 from onyx.configs.app_configs import INDEXING_TRACER_INTERVAL
+from onyx.configs.app_configs import INTEGRATION_TESTS_MODE
 from onyx.configs.app_configs import LEAVE_CONNECTOR_ACTIVE_ON_INITIALIZATION_FAILURE
 from onyx.configs.app_configs import POLL_CONNECTOR_OFFSET
 from onyx.configs.constants import DocumentSource
 from onyx.configs.constants import MilestoneRecordType
 from onyx.connectors.connector_runner import ConnectorRunner
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import UnexpectedValidationError
 from onyx.connectors.factory import instantiate_connector
-from onyx.connectors.interfaces import ConnectorValidationError
 from onyx.connectors.models import ConnectorCheckpoint
 from onyx.connectors.models import ConnectorFailure
 from onyx.connectors.models import Document
@@ -54,6 +56,7 @@ from onyx.utils.logger import setup_logger
 from onyx.utils.logger import TaskAttemptSingleton
 from onyx.utils.telemetry import create_milestone_and_report
 from onyx.utils.variable_functionality import global_version
+from shared_configs.configs import MULTI_TENANT

 logger = setup_logger()

@@ -66,7 +69,6 @@ def _get_connector_runner(
    batch_size: int,
    start_time: datetime,
    end_time: datetime,
-    tenant_id: str | None,
    leave_connector_active: bool = LEAVE_CONNECTOR_ACTIVE_ON_INITIALIZATION_FAILURE,
 ) -> ConnectorRunner:
    """
@@ -85,18 +87,23 @@ def _get_connector_runner(
            input_type=task,
            connector_specific_config=attempt.connector_credential_pair.connector.connector_specific_config,
            credential=attempt.connector_credential_pair.credential,
-            tenant_id=tenant_id,
        )

        # validate the connector settings
+        if not INTEGRATION_TESTS_MODE:
+            runnable_connector.validate_connector_settings()

-        runnable_connector.validate_connector_settings()
-
+    except UnexpectedValidationError as e:
+        logger.exception(
+            "Unable to instantiate connector due to an unexpected temporary issue."
+        )
+        raise e
    except Exception as e:
-        logger.exception(f"Unable to instantiate connector due to {e}")
-
+        logger.exception("Unable to instantiate connector. Pausing until fixed.")
        # since we failed to even instantiate the connector, we pause the CCPair since
-        # it will never succeed. Sometimes there are cases where the connector will
+        # it will never succeed
+
+        # Sometimes there are cases where the connector will
        # intermittently fail to initialize in which case we should pass in
        # leave_connector_active=True to allow it to continue.
        # For example, if there is nightly maintenance on a Confluence Server instance,
@@ -240,7 +247,7 @@ def _check_failure_threshold(
 def _run_indexing(
    db_session: Session,
    index_attempt_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
    callback: IndexingHeartbeatInterface | None = None,
 ) -> None:
    """
@@ -387,7 +394,6 @@ def _run_indexing(
                batch_size=INDEX_BATCH_SIZE,
                start_time=window_start,
                end_time=window_end,
-                tenant_id=tenant_id,
            )

            # don't use a checkpoint if we're explicitly indexing from
@@ -680,7 +686,7 @@ def _run_indexing(

 def run_indexing_entrypoint(
    index_attempt_id: int,
-    tenant_id: str | None,
+    tenant_id: str,
    connector_credential_pair_id: int,
    is_ee: bool = False,
    callback: IndexingHeartbeatInterface | None = None,
@@ -700,7 +706,7 @@ def run_indexing_entrypoint(
        attempt = transition_attempt_to_in_progress(index_attempt_id, db_session)

        tenant_str = ""
-        if tenant_id is not None:
+        if MULTI_TENANT:
            tenant_str = f" for tenant {tenant_id}"

        connector_name = attempt.connector_credential_pair.connector.name
--- a/backend/onyx/chat/chat_utils.py
+++ b/backend/onyx/chat/chat_utils.py
@@ -190,7 +190,8 @@ def create_chat_chain(
            and previous_message.message_type == MessageType.ASSISTANT
            and mainline_messages
        ):
-            mainline_messages[-1] = current_message
+            if current_message.refined_answer_improvement:
+                mainline_messages[-1] = current_message
        else:
            mainline_messages.append(current_message)

--- a/backend/onyx/chat/llm_response_handler.py
+++ b/backend/onyx/chat/llm_response_handler.py
@@ -15,6 +15,8 @@ from onyx.chat.stream_processing.answer_response_handler import (
 from onyx.chat.tool_handling.tool_response_handler import ToolResponseHandler


+# This is legacy code that is no longer used.
+# It is kept here for reference.
 class LLMResponseHandlerManager:
    """
    This class is responsible for postprocessing the LLM response stream.
--- a/backend/onyx/chat/models.py
+++ b/backend/onyx/chat/models.py
@@ -142,6 +142,15 @@ class MessageResponseIDInfo(BaseModel):
    reserved_assistant_message_id: int


+class AgentMessageIDInfo(BaseModel):
+    level: int
+    message_id: int
+
+
+class AgenticMessageResponseIDInfo(BaseModel):
+    agentic_message_ids: list[AgentMessageIDInfo]
+
+
 class StreamingError(BaseModel):
    error: str
    stack_trace: str | None = None
--- a/backend/onyx/chat/process_message.py
+++ b/backend/onyx/chat/process_message.py
@@ -11,6 +11,8 @@ from onyx.agents.agent_search.orchestration.nodes.call_tool import ToolCallExcep
 from onyx.chat.answer import Answer
 from onyx.chat.chat_utils import create_chat_chain
 from onyx.chat.chat_utils import create_temporary_persona
+from onyx.chat.models import AgenticMessageResponseIDInfo
+from onyx.chat.models import AgentMessageIDInfo
 from onyx.chat.models import AgentSearchPacket
 from onyx.chat.models import AllCitations
 from onyx.chat.models import AnswerPostInfo
@@ -308,6 +310,7 @@ ChatPacket = (
    | CustomToolResponse
    | MessageSpecificCitations
    | MessageResponseIDInfo
+    | AgenticMessageResponseIDInfo
    | StreamStopInfo
    | AgentSearchPacket
 )
@@ -744,16 +747,16 @@ def stream_chat_message_objects(
                files=latest_query_files,
                single_message_history=single_message_history,
            ),
-            system_message=default_build_system_message(prompt_config),
+            system_message=default_build_system_message(prompt_config, llm.config),
            message_history=message_history,
            llm_config=llm.config,
            raw_user_query=final_msg.message,
            raw_user_uploaded_files=latest_query_files or [],
            single_message_history=single_message_history,
        )
-        prompt_builder.update_system_prompt(default_build_system_message(prompt_config))

        # LLM prompt building, response capturing, etc.
+
        answer = Answer(
            prompt_builder=prompt_builder,
            is_connected=is_connected,
@@ -867,7 +870,6 @@ def stream_chat_message_objects(
                            for img in img_generation_response
                            if img.image_data
                        ],
-                        tenant_id=tenant_id,
                    )
                    info.ai_message_files.extend(
                        [
@@ -1035,6 +1037,7 @@ def stream_chat_message_objects(
        next_level = 1
        prev_message = gen_ai_response_message
        agent_answers = answer.llm_answer_by_level()
+        agentic_message_ids = []
        while next_level in agent_answers:
            next_answer = agent_answers[next_level]
            info = info_by_subq[
@@ -1059,17 +1062,18 @@ def stream_chat_message_objects(
                refined_answer_improvement=refined_answer_improvement,
                is_agentic=True,
            )
+            agentic_message_ids.append(
+                AgentMessageIDInfo(level=next_level, message_id=next_answer_message.id)
+            )
            next_level += 1
            prev_message = next_answer_message

        logger.debug("Committing messages")
        db_session.commit()  # actually save user / assistant message

-        msg_detail_response = translate_db_message_to_chat_message_detail(
-            gen_ai_response_message
-        )
+        yield AgenticMessageResponseIDInfo(agentic_message_ids=agentic_message_ids)

-        yield msg_detail_response
+        yield translate_db_message_to_chat_message_detail(gen_ai_response_message)
    except Exception as e:
        error_msg = str(e)
        logger.exception(error_msg)
--- a/backend/onyx/chat/prompt_builder/answer_prompt_builder.py
+++ b/backend/onyx/chat/prompt_builder/answer_prompt_builder.py
@@ -12,6 +12,7 @@ from onyx.chat.prompt_builder.citations_prompt import compute_max_llm_input_toke
 from onyx.chat.prompt_builder.utils import translate_history_to_basemessages
 from onyx.file_store.models import InMemoryChatFile
 from onyx.llm.interfaces import LLMConfig
+from onyx.llm.llm_provider_options import OPENAI_PROVIDER_NAME
 from onyx.llm.models import PreviousMessage
 from onyx.llm.utils import build_content_with_imgs
 from onyx.llm.utils import check_message_tokens
@@ -19,6 +20,7 @@ from onyx.llm.utils import message_to_prompt_and_imgs
 from onyx.llm.utils import model_supports_image_input
 from onyx.natural_language_processing.utils import get_tokenizer
 from onyx.prompts.chat_prompts import CHAT_USER_CONTEXT_FREE_PROMPT
+from onyx.prompts.chat_prompts import CODE_BLOCK_MARKDOWN
 from onyx.prompts.direct_qa_prompts import HISTORY_BLOCK
 from onyx.prompts.prompt_utils import drop_messages_history_overflow
 from onyx.prompts.prompt_utils import handle_onyx_date_awareness
@@ -31,8 +33,16 @@ from onyx.tools.tool import Tool

 def default_build_system_message(
    prompt_config: PromptConfig,
+    llm_config: LLMConfig,
 ) -> SystemMessage | None:
    system_prompt = prompt_config.system_prompt.strip()
+    # See https://simonwillison.net/tags/markdown/ for context on this temporary fix
+    # for o-series markdown generation
+    if (
+        llm_config.model_provider == OPENAI_PROVIDER_NAME
+        and llm_config.model_name.startswith("o")
+    ):
+        system_prompt = CODE_BLOCK_MARKDOWN + system_prompt
    tag_handled_prompt = handle_onyx_date_awareness(
        system_prompt,
        prompt_config,
@@ -110,21 +120,8 @@ class AnswerPromptBuilder:
            ),
        )

-        self.system_message_and_token_cnt: tuple[SystemMessage, int] | None = (
-            (
-                system_message,
-                check_message_tokens(system_message, self.llm_tokenizer_encode_func),
-            )
-            if system_message
-            else None
-        )
-        self.user_message_and_token_cnt = (
-            user_message,
-            check_message_tokens(
-                user_message,
-                self.llm_tokenizer_encode_func,
-            ),
-        )
+        self.update_system_prompt(system_message)
+        self.update_user_prompt(user_message)

        self.new_messages_and_token_cnts: list[tuple[BaseMessage, int]] = []

--- a/backend/onyx/chat/stream_processing/citation_processing.py
+++ b/backend/onyx/chat/stream_processing/citation_processing.py
@@ -90,97 +90,97 @@ class CitationProcessor:
                    next(group for group in citation.groups() if group is not None)
                )

-                if 1 <= numerical_value <= self.max_citation_num:
-                    context_llm_doc = self.context_docs[numerical_value - 1]
-                    final_citation_num = self.final_order_mapping[
+                if not (1 <= numerical_value <= self.max_citation_num):
+                    continue
+
+                context_llm_doc = self.context_docs[numerical_value - 1]
+                final_citation_num = self.final_order_mapping[
+                    context_llm_doc.document_id
+                ]
+
+                if final_citation_num not in self.citation_order:
+                    self.citation_order.append(final_citation_num)
+
+                citation_order_idx = self.citation_order.index(final_citation_num) + 1
+
+                # get the value that was displayed to the user; it should always
+                # be in display_order_mapping, but check anyway
+                if context_llm_doc.document_id in self.display_order_mapping:
+                    displayed_citation_num = self.display_order_mapping[
                        context_llm_doc.document_id
                    ]
-
-                    if final_citation_num not in self.citation_order:
-                        self.citation_order.append(final_citation_num)
-
-                    citation_order_idx = (
-                        self.citation_order.index(final_citation_num) + 1
+                else:
+                    displayed_citation_num = final_citation_num
+                    logger.warning(
+                        f"Doc {context_llm_doc.document_id} not in display_doc_order_dict. Used LLM citation number instead."
                    )

-                    # get the value that was displayed to user, should always
-                    # be in the display_doc_order_dict. But check anyways
-                    if context_llm_doc.document_id in self.display_order_mapping:
-                        displayed_citation_num = self.display_order_mapping[
-                            context_llm_doc.document_id
-                        ]
-                    else:
-                        displayed_citation_num = final_citation_num
-                        logger.warning(
-                            f"Doc {context_llm_doc.document_id} not in display_doc_order_dict. Used LLM citation number instead."
-                        )
-
-                    # Skip consecutive citations of the same work
-                    if final_citation_num in self.current_citations:
-                        start, end = citation.span()
-                        real_start = length_to_add + start
-                        diff = end - start
-                        self.curr_segment = (
-                            self.curr_segment[: length_to_add + start]
-                            + self.curr_segment[real_start + diff :]
-                        )
-                        length_to_add -= diff
-                        continue
-
-                    # Handle edge case where LLM outputs citation itself
-                    if self.curr_segment.startswith("[["):
-                        match = re.match(r"\[\[(\d+)\]\]", self.curr_segment)
-                        if match:
-                            try:
-                                doc_id = int(match.group(1))
-                                context_llm_doc = self.context_docs[doc_id - 1]
-                                yield CitationInfo(
-                                    # citation_num is now the number post initial ranking, i.e. as displayed to user
-                                    citation_num=displayed_citation_num,
-                                    document_id=context_llm_doc.document_id,
-                                )
-                            except Exception as e:
-                                logger.warning(
-                                    f"Manual LLM citation didn't properly cite documents {e}"
-                                )
-                        else:
-                            logger.warning(
-                                "Manual LLM citation wasn't able to close brackets"
-                            )
-                        continue
-
-                    link = context_llm_doc.link
-
-                    self.past_cite_count = len(self.llm_out)
-                    self.current_citations.append(final_citation_num)
-
-                    if citation_order_idx not in self.cited_inds:
-                        self.cited_inds.add(citation_order_idx)
-                        yield CitationInfo(
-                            # citation number is now the one that was displayed to user
-                            citation_num=displayed_citation_num,
-                            document_id=context_llm_doc.document_id,
-                        )
-
+                # Skip consecutive citations of the same work
+                if final_citation_num in self.current_citations:
                    start, end = citation.span()
-                    if link:
-                        prev_length = len(self.curr_segment)
-                        self.curr_segment = (
-                            self.curr_segment[: start + length_to_add]
-                            + f"[[{displayed_citation_num}]]({link})"  # use the value that was displayed to user
-                            + self.curr_segment[end + length_to_add :]
-                        )
-                        length_to_add += len(self.curr_segment) - prev_length
-                    else:
-                        prev_length = len(self.curr_segment)
-                        self.curr_segment = (
-                            self.curr_segment[: start + length_to_add]
-                            + f"[[{displayed_citation_num}]]()"  # use the value that was displayed to user
-                            + self.curr_segment[end + length_to_add :]
-                        )
-                        length_to_add += len(self.curr_segment) - prev_length
+                    real_start = length_to_add + start
+                    diff = end - start
+                    self.curr_segment = (
+                        self.curr_segment[: length_to_add + start]
+                        + self.curr_segment[real_start + diff :]
+                    )
+                    length_to_add -= diff
+                    continue

-                    last_citation_end = end + length_to_add
+                # Handle edge case where LLM outputs citation itself
+                if self.curr_segment.startswith("[["):
+                    match = re.match(r"\[\[(\d+)\]\]", self.curr_segment)
+                    if match:
+                        try:
+                            doc_id = int(match.group(1))
+                            context_llm_doc = self.context_docs[doc_id - 1]
+                            yield CitationInfo(
+                                # citation_num is now the number post initial ranking, i.e. as displayed to user
+                                citation_num=displayed_citation_num,
+                                document_id=context_llm_doc.document_id,
+                            )
+                        except Exception as e:
+                            logger.warning(
+                                f"Manual LLM citation didn't properly cite documents {e}"
+                            )
+                    else:
+                        logger.warning(
+                            "Manual LLM citation wasn't able to close brackets"
+                        )
+                    continue
+
+                link = context_llm_doc.link
+
+                self.past_cite_count = len(self.llm_out)
+                self.current_citations.append(final_citation_num)
+
+                if citation_order_idx not in self.cited_inds:
+                    self.cited_inds.add(citation_order_idx)
+                    yield CitationInfo(
+                        # citation number is now the one that was displayed to user
+                        citation_num=displayed_citation_num,
+                        document_id=context_llm_doc.document_id,
+                    )
+
+                start, end = citation.span()
+                if link:
+                    prev_length = len(self.curr_segment)
+                    self.curr_segment = (
+                        self.curr_segment[: start + length_to_add]
+                        + f"[[{displayed_citation_num}]]({link})"  # use the value that was displayed to user
+                        + self.curr_segment[end + length_to_add :]
+                    )
+                    length_to_add += len(self.curr_segment) - prev_length
+                else:
+                    prev_length = len(self.curr_segment)
+                    self.curr_segment = (
+                        self.curr_segment[: start + length_to_add]
+                        + f"[[{displayed_citation_num}]]()"  # use the value that was displayed to user
+                        + self.curr_segment[end + length_to_add :]
+                    )
+                    length_to_add += len(self.curr_segment) - prev_length
+
+                last_citation_end = end + length_to_add

            if last_citation_end > 0:
                result += self.curr_segment[:last_citation_end]
--- a/backend/onyx/configs/app_configs.py
+++ b/backend/onyx/configs/app_configs.py
@@ -6,6 +6,7 @@ from typing import cast
 from onyx.auth.schemas import AuthBackend
 from onyx.configs.constants import AuthType
 from onyx.configs.constants import DocumentIndexType
+from onyx.configs.constants import QueryHistoryType
 from onyx.file_processing.enums import HtmlBasedConnectorTransformLinksStrategy

 #####
@@ -29,6 +30,9 @@ GENERATIVE_MODEL_ACCESS_CHECK_FREQ = int(
 )  # 1 day
 DISABLE_GENERATIVE_AI = os.environ.get("DISABLE_GENERATIVE_AI", "").lower() == "true"

+ONYX_QUERY_HISTORY_TYPE = QueryHistoryType(
+    (os.environ.get("ONYX_QUERY_HISTORY_TYPE") or QueryHistoryType.NORMAL.value).lower()
+)

 #####
 # Web Configs
@@ -626,6 +630,8 @@ POD_NAMESPACE = os.environ.get("POD_NAMESPACE")

 DEV_MODE = os.environ.get("DEV_MODE", "").lower() == "true"

+INTEGRATION_TESTS_MODE = os.environ.get("INTEGRATION_TESTS_MODE", "").lower() == "true"
+
 MOCK_CONNECTOR_FILE_PATH = os.environ.get("MOCK_CONNECTOR_FILE_PATH")

 TEST_ENV = os.environ.get("TEST_ENV", "").lower() == "true"
@@ -634,3 +640,6 @@ TEST_ENV = os.environ.get("TEST_ENV", "").lower() == "true"
 MOCK_LLM_RESPONSE = (
    os.environ.get("MOCK_LLM_RESPONSE") if os.environ.get("MOCK_LLM_RESPONSE") else None
 )
+
+
+DEFAULT_IMAGE_ANALYSIS_MAX_SIZE_MB = 20
--- a/backend/onyx/configs/constants.py
+++ b/backend/onyx/configs/constants.py
@@ -213,6 +213,12 @@ class AuthType(str, Enum):
    CLOUD = "cloud"


+class QueryHistoryType(str, Enum):
+    DISABLED = "disabled"
+    ANONYMIZED = "anonymized"
+    NORMAL = "normal"
+
+
 # Special characters for password validation
 PASSWORD_SPECIAL_CHARS = "!@#$%^&*()_+-=[]{}|;:,.<>?"

@@ -342,6 +348,9 @@ class OnyxRedisSignals:
    BLOCK_PRUNING = "signal:block_pruning"
    BLOCK_VALIDATE_PRUNING_FENCES = "signal:block_validate_pruning_fences"
    BLOCK_BUILD_FENCE_LOOKUP_TABLE = "signal:block_build_fence_lookup_table"
+    BLOCK_VALIDATE_CONNECTOR_DELETION_FENCES = (
+        "signal:block_validate_connector_deletion_fences"
+    )


 class OnyxRedisConstants:
--- a/backend/onyx/configs/llm_configs.py
+++ b/backend/onyx/configs/llm_configs.py
@@ -0,0 +1,38 @@
+from onyx.configs.app_configs import DEFAULT_IMAGE_ANALYSIS_MAX_SIZE_MB
+from onyx.server.settings.store import load_settings
+
+
+def get_image_extraction_and_analysis_enabled() -> bool:
+    """Get image extraction and analysis enabled setting from workspace settings, or fall back to False"""
+    try:
+        settings = load_settings()
+        if settings.image_extraction_and_analysis_enabled is not None:
+            return settings.image_extraction_and_analysis_enabled
+    except Exception:
+        pass
+
+    return False
+
+
+def get_search_time_image_analysis_enabled() -> bool:
+    """Get search time image analysis enabled setting from workspace settings, or fall back to False"""
+    try:
+        settings = load_settings()
+        if settings.search_time_image_analysis_enabled is not None:
+            return settings.search_time_image_analysis_enabled
+    except Exception:
+        pass
+
+    return False
+
+
+def get_image_analysis_max_size_mb() -> int:
+    """Get image analysis max size (MB) from workspace settings, or fall back to DEFAULT_IMAGE_ANALYSIS_MAX_SIZE_MB"""
+    try:
+        settings = load_settings()
+        if settings.image_analysis_max_size_mb is not None:
+            return settings.image_analysis_max_size_mb
+    except Exception:
+        pass
+
+    return DEFAULT_IMAGE_ANALYSIS_MAX_SIZE_MB
--- a/backend/onyx/connectors/airtable/airtable_connector.py
+++ b/backend/onyx/connectors/airtable/airtable_connector.py
@@ -200,7 +200,6 @@ class AirtableConnector(LoadConnector):
                                        return attachment_response.content

                            logger.error(f"Failed to refresh attachment for {filename}")
-
                        raise

                attachment_content = get_attachment_with_retry(url, record_id)
--- a/backend/onyx/connectors/blob/connector.py
+++ b/backend/onyx/connectors/blob/connector.py
@@ -7,11 +7,18 @@ from typing import Optional

 import boto3  # type: ignore
 from botocore.client import Config  # type: ignore
+from botocore.exceptions import ClientError
+from botocore.exceptions import NoCredentialsError
+from botocore.exceptions import PartialCredentialsError
 from mypy_boto3_s3 import S3Client  # type: ignore

 from onyx.configs.app_configs import INDEX_BATCH_SIZE
 from onyx.configs.constants import BlobType
 from onyx.configs.constants import DocumentSource
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import CredentialExpiredError
+from onyx.connectors.exceptions import InsufficientPermissionsError
+from onyx.connectors.exceptions import UnexpectedValidationError
 from onyx.connectors.interfaces import GenerateDocumentsOutput
 from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.interfaces import PollConnector
@@ -240,6 +247,73 @@ class BlobStorageConnector(LoadConnector, PollConnector):

        return None

+    def validate_connector_settings(self) -> None:
+        if self.s3_client is None:
+            raise ConnectorMissingCredentialError(
+                "Blob storage credentials not loaded."
+            )
+
+        if not self.bucket_name:
+            raise ConnectorValidationError(
+                "No bucket name was provided in connector settings."
+            )
+
+        try:
+            # We only fetch one object/page as a light-weight validation step.
+            # This ensures we trigger typical S3 permission checks (ListObjectsV2, etc.).
+            self.s3_client.list_objects_v2(
+                Bucket=self.bucket_name, Prefix=self.prefix, MaxKeys=1
+            )
+
+        except NoCredentialsError:
+            raise ConnectorMissingCredentialError(
+                "No valid blob storage credentials found or provided to boto3."
+            )
+        except PartialCredentialsError:
+            raise ConnectorMissingCredentialError(
+                "Partial or incomplete blob storage credentials provided to boto3."
+            )
+        except ClientError as e:
+            error_code = e.response["Error"].get("Code", "")
+            status_code = e.response["ResponseMetadata"].get("HTTPStatusCode")
+
+            # Most common S3 error cases
+            if error_code in [
+                "AccessDenied",
+                "InvalidAccessKeyId",
+                "SignatureDoesNotMatch",
+            ]:
+                # S3 returns HTTP 403 for all three of these codes, so branch on the
+                # error code: checking the status first would report bad or expired
+                # keys as a permissions problem.
+                if error_code in ("InvalidAccessKeyId", "SignatureDoesNotMatch"):
+                    raise CredentialExpiredError(
+                        "Provided blob storage credentials appear invalid or expired "
+                        f"({error_code})."
+                    )
+
+                raise InsufficientPermissionsError(
+                    f"Insufficient permissions to list objects in bucket '{self.bucket_name}'. "
+                    "Please check your bucket policy and/or IAM policy."
+                )
+
+            if error_code == "NoSuchBucket" or status_code == 404:
+                raise ConnectorValidationError(
+                    f"Bucket '{self.bucket_name}' does not exist or cannot be found."
+                )
+
+            raise ConnectorValidationError(
+                f"Unexpected S3 client error (code={error_code}, status={status_code}): {e}"
+            )
+
+        except Exception as e:
+            # Catch-all for anything not captured by the above.
+            # We can't tell what went wrong, so raise an unexpected error,
+            # which does not disable the connector.
+            raise UnexpectedValidationError(
+                f"Unexpected error during blob storage settings validation: {e}"
+            )
+

 if __name__ == "__main__":
    credentials_dict = {
--- a/backend/onyx/connectors/bookstack/connector.py
+++ b/backend/onyx/connectors/bookstack/connector.py
@@ -9,10 +9,10 @@ from onyx.configs.constants import DocumentSource
 from onyx.connectors.bookstack.client import BookStackApiClient
 from onyx.connectors.bookstack.client import BookStackClientRequestFailedError
 from onyx.connectors.cross_connector_utils.miscellaneous_utils import time_str_to_utc
-from onyx.connectors.interfaces import ConnectorValidationError
-from onyx.connectors.interfaces import CredentialExpiredError
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import CredentialExpiredError
+from onyx.connectors.exceptions import InsufficientPermissionsError
 from onyx.connectors.interfaces import GenerateDocumentsOutput
-from onyx.connectors.interfaces import InsufficientPermissionsError
 from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.interfaces import PollConnector
 from onyx.connectors.interfaces import SecondsSinceUnixEpoch
--- a/backend/onyx/connectors/confluence/connector.py
+++ b/backend/onyx/connectors/confluence/connector.py
@@ -4,18 +4,26 @@ from datetime import timezone
 from typing import Any
 from urllib.parse import quote

+from requests.exceptions import HTTPError
+
 from onyx.configs.app_configs import CONFLUENCE_CONNECTOR_LABELS_TO_SKIP
 from onyx.configs.app_configs import CONFLUENCE_TIMEZONE_OFFSET
 from onyx.configs.app_configs import CONTINUE_ON_CONNECTOR_FAILURE
 from onyx.configs.app_configs import INDEX_BATCH_SIZE
 from onyx.configs.constants import DocumentSource
-from onyx.connectors.confluence.onyx_confluence import build_confluence_client
+from onyx.connectors.confluence.onyx_confluence import extract_text_from_confluence_html
 from onyx.connectors.confluence.onyx_confluence import OnyxConfluence
-from onyx.connectors.confluence.utils import attachment_to_content
 from onyx.connectors.confluence.utils import build_confluence_document_id
+from onyx.connectors.confluence.utils import convert_attachment_to_content
 from onyx.connectors.confluence.utils import datetime_from_string
-from onyx.connectors.confluence.utils import extract_text_from_confluence_html
+from onyx.connectors.confluence.utils import process_attachment
 from onyx.connectors.confluence.utils import validate_attachment_filetype
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import CredentialExpiredError
+from onyx.connectors.exceptions import InsufficientPermissionsError
+from onyx.connectors.exceptions import UnexpectedValidationError
+from onyx.connectors.interfaces import CredentialsConnector
+from onyx.connectors.interfaces import CredentialsProviderInterface
 from onyx.connectors.interfaces import GenerateDocumentsOutput
 from onyx.connectors.interfaces import GenerateSlimDocumentOutput
 from onyx.connectors.interfaces import LoadConnector
@@ -27,28 +35,26 @@ from onyx.connectors.models import ConnectorMissingCredentialError
 from onyx.connectors.models import Document
 from onyx.connectors.models import Section
 from onyx.connectors.models import SlimDocument
+from onyx.connectors.vision_enabled_connector import VisionEnabledConnector
 from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
 from onyx.utils.logger import setup_logger

 logger = setup_logger()
-
 # Potential Improvements
-# 1. Include attachments, etc
-# 2. Segment into Sections for more accurate linking, can split by headers but make sure no text/ordering is lost
-
+# 1. Segment into Sections for more accurate linking, can split by headers but make sure no text/ordering is lost
 _COMMENT_EXPANSION_FIELDS = ["body.storage.value"]
 _PAGE_EXPANSION_FIELDS = [
    "body.storage.value",
    "version",
    "space",
    "metadata.labels",
+    "history.lastUpdated",
 ]
 _ATTACHMENT_EXPANSION_FIELDS = [
    "version",
    "space",
    "metadata.labels",
 ]
-
 _RESTRICTIONS_EXPANSION_FIELDS = [
    "space",
    "restrictions.read.restrictions.user",
@@ -77,7 +83,13 @@ _FULL_EXTENSION_FILTER_STRING = "".join(
 )


-class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
+class ConfluenceConnector(
+    LoadConnector,
+    PollConnector,
+    SlimConnector,
+    CredentialsConnector,
+    VisionEnabledConnector,
+):
    def __init__(
        self,
        wiki_base: str,
@@ -94,14 +106,24 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
        labels_to_skip: list[str] = CONFLUENCE_CONNECTOR_LABELS_TO_SKIP,
        timezone_offset: float = CONFLUENCE_TIMEZONE_OFFSET,
    ) -> None:
+        self.wiki_base = wiki_base
+        self.is_cloud = is_cloud
+        self.space = space
+        self.page_id = page_id
+        self.index_recursively = index_recursively
+        self.cql_query = cql_query
        self.batch_size = batch_size
        self.continue_on_failure = continue_on_failure
+        self.labels_to_skip = labels_to_skip
+        self.timezone_offset = timezone_offset
        self._confluence_client: OnyxConfluence | None = None
-        self.is_cloud = is_cloud
+        self._fetched_titles: set[str] = set()
+
+        # Initialize vision LLM using the mixin
+        self.initialize_vision_llm()

        # Remove trailing slash from wiki_base if present
        self.wiki_base = wiki_base.rstrip("/")
-
        """
        If nothing is provided, we default to fetching all pages
        Only one or none of the following options should be specified so
@@ -131,6 +153,17 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
            self.cql_label_filter = f" and label not in ({comma_separated_labels})"

        self.timezone: timezone = timezone(offset=timedelta(hours=timezone_offset))
+        self.credentials_provider: CredentialsProviderInterface | None = None
+
+        self.probe_kwargs = {
+            "max_backoff_retries": 6,
+            "max_backoff_seconds": 10,
+        }
+
+        self.final_kwargs = {
+            "max_backoff_retries": 10,
+            "max_backoff_seconds": 60,
+        }

    @property
    def confluence_client(self) -> OnyxConfluence:
@@ -138,15 +171,22 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
            raise ConnectorMissingCredentialError("Confluence")
        return self._confluence_client

-    def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
-        # see https://github.com/atlassian-api/atlassian-python-api/blob/master/atlassian/rest_client.py
-        # for a list of other hidden constructor args
-        self._confluence_client = build_confluence_client(
-            credentials=credentials,
-            is_cloud=self.is_cloud,
-            wiki_base=self.wiki_base,
+    def set_credentials_provider(
+        self, credentials_provider: CredentialsProviderInterface
+    ) -> None:
+        self.credentials_provider = credentials_provider
+
+        # raises exception if there's a problem
+        confluence_client = OnyxConfluence(
+            self.is_cloud, self.wiki_base, credentials_provider
        )
-        return None
+        confluence_client._probe_connection(**self.probe_kwargs)
+        confluence_client._initialize_connection(**self.final_kwargs)
+
+        self._confluence_client = confluence_client
+
+    def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
+        raise NotImplementedError("Use set_credentials_provider with this connector.")

    def _construct_page_query(
        self,
@@ -154,7 +194,6 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
        end: SecondsSinceUnixEpoch | None = None,
    ) -> str:
        page_query = self.base_cql_page_query + self.cql_label_filter
-
        # Add time filters
        if start:
            formatted_start_time = datetime.fromtimestamp(
@@ -166,7 +205,6 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
                "%Y-%m-%d %H:%M"
            )
            page_query += f" and lastmodified <= '{formatted_end_time}'"
-
        return page_query

    def _construct_attachment_query(self, confluence_page_id: str) -> str:
@@ -177,11 +215,10 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):

    def _get_comment_string_for_page_id(self, page_id: str) -> str:
        comment_string = ""
-
        comment_cql = f"type=comment and container='{page_id}'"
        comment_cql += self.cql_label_filter
-
        expand = ",".join(_COMMENT_EXPANSION_FIELDS)
+
        for comment in self.confluence_client.paginated_cql_retrieval(
            cql=comment_cql,
            expand=expand,
@@ -192,116 +229,177 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
                confluence_object=comment,
                fetched_titles=set(),
            )
-
        return comment_string

-    def _convert_object_to_document(
-        self, confluence_object: dict[str, Any]
-    ) -> Document | None:
+    def _convert_page_to_document(self, page: dict[str, Any]) -> Document | None:
        """
-        Takes in a confluence object, extracts all metadata, and converts it into a document.
-        If its a page, it extracts the text, adds the comments for the document text.
-        If its an attachment, it just downloads the attachment and converts that into a document.
+        Converts a Confluence page to a Document object.
+        Includes the page content, comments, and attachments.
        """
-        # The url and the id are the same
-        object_url = build_confluence_document_id(
-            self.wiki_base, confluence_object["_links"]["webui"], self.is_cloud
-        )
+        try:
+            # Extract basic page information
+            page_id = page["id"]
+            page_title = page["title"]
+            page_url = f"{self.wiki_base}{page['_links']['webui']}"

-        object_text = None
-        # Extract text from page
-        if confluence_object["type"] == "page":
-            object_text = extract_text_from_confluence_html(
-                confluence_client=self.confluence_client,
-                confluence_object=confluence_object,
-                fetched_titles={confluence_object.get("title", "")},
-            )
-            # Add comments to text
-            object_text += self._get_comment_string_for_page_id(confluence_object["id"])
-        elif confluence_object["type"] == "attachment":
-            object_text = attachment_to_content(
-                confluence_client=self.confluence_client, attachment=confluence_object
+            # Get the page content
+            page_content = extract_text_from_confluence_html(
+                self.confluence_client, page, self._fetched_titles
            )

-        if object_text is None:
-            # This only happens for attachments that are not parseable
+            # Create the main section for the page content
+            sections = [Section(text=page_content, link=page_url)]
+
+            # Process comments if available
+            comment_text = self._get_comment_string_for_page_id(page_id)
+            if comment_text:
+                sections.append(Section(text=comment_text, link=f"{page_url}#comments"))
+
+            # Process attachments
+            if "children" in page and "attachment" in page["children"]:
+                attachments = self.confluence_client.get_attachments_for_page(
+                    page_id, expand="metadata"
+                )
+
+                for attachment in attachments.get("results", []):
+                    # Process each attachment
+                    result = process_attachment(
+                        self.confluence_client,
+                        attachment,
+                        page_title,
+                        self.image_analysis_llm,
+                    )
+
+                    if result.text:
+                        # Create a section for the attachment text
+                        attachment_section = Section(
+                            text=result.text,
+                            link=f"{page_url}#attachment-{attachment['id']}",
+                            image_file_name=result.file_name,
+                        )
+                        sections.append(attachment_section)
+                    elif result.error:
+                        logger.warning(
+                            f"Error processing attachment '{attachment.get('title')}': {result.error}"
+                        )
+
+            # Extract metadata
+            metadata = {}
+            if "space" in page:
+                metadata["space"] = page["space"].get("name", "")
+
+            # Extract labels
+            labels = []
+            if "metadata" in page and "labels" in page["metadata"]:
+                for label in page["metadata"]["labels"].get("results", []):
+                    labels.append(label.get("name", ""))
+            if labels:
+                metadata["labels"] = labels
+
+            # Extract owners
+            primary_owners = []
+            if "version" in page and "by" in page["version"]:
+                author = page["version"]["by"]
+                display_name = author.get("displayName", "Unknown")
+                primary_owners.append(BasicExpertInfo(display_name=display_name))
+
+            # Create the document
+            return Document(
+                id=build_confluence_document_id(self.wiki_base, page["_links"]["webui"], self.is_cloud),
+                sections=sections,
+                source=DocumentSource.CONFLUENCE,
+                semantic_identifier=page_title,
+                metadata=metadata,
+                doc_updated_at=datetime_from_string(page["version"]["when"]),
+                primary_owners=primary_owners if primary_owners else None,
+            )
+        except Exception as e:
+            logger.error(f"Error converting page {page.get('id', 'unknown')}: {e}")
+            if not self.continue_on_failure:
+                raise
            return None

-        # Get space name
-        doc_metadata: dict[str, str | list[str]] = {
-            "Wiki Space Name": confluence_object["space"]["name"]
-        }
-
-        # Get labels
-        label_dicts = (
-            confluence_object.get("metadata", {}).get("labels", {}).get("results", [])
-        )
-        page_labels = [label.get("name") for label in label_dicts if label.get("name")]
-        if page_labels:
-            doc_metadata["labels"] = page_labels
-
-        # Get last modified and author email
-        version_dict = confluence_object.get("version", {})
-        last_modified = (
-            datetime_from_string(version_dict.get("when"))
-            if version_dict.get("when")
-            else None
-        )
-        author_email = version_dict.get("by", {}).get("email")
-
-        title = confluence_object.get("title", "Untitled Document")
-
-        return Document(
-            id=object_url,
-            sections=[Section(link=object_url, text=object_text)],
-            source=DocumentSource.CONFLUENCE,
-            semantic_identifier=title,
-            doc_updated_at=last_modified,
-            primary_owners=(
-                [BasicExpertInfo(email=author_email)] if author_email else None
-            ),
-            metadata=doc_metadata,
-        )
-
    def _fetch_document_batches(
        self,
        start: SecondsSinceUnixEpoch | None = None,
        end: SecondsSinceUnixEpoch | None = None,
    ) -> GenerateDocumentsOutput:
+        """
+        Yields batches of Documents. For each page:
+         - Create a Document with 1 Section for the page text/comments
+         - Then fetch attachments. For each attachment:
+             - Attempt to convert it with convert_attachment_to_content(...)
+             - If successful, create a new Section with the extracted text or summary.
+        """
        doc_batch: list[Document] = []
-        confluence_page_ids: list[str] = []

        page_query = self._construct_page_query(start, end)
        logger.debug(f"page_query: {page_query}")
-        # Fetch pages as Documents
+
        for page in self.confluence_client.paginated_cql_retrieval(
            cql=page_query,
            expand=",".join(_PAGE_EXPANSION_FIELDS),
            limit=self.batch_size,
        ):
-            logger.debug(f"_fetch_document_batches: {page['id']}")
-            confluence_page_ids.append(page["id"])
-            doc = self._convert_object_to_document(page)
-            if doc is not None:
-                doc_batch.append(doc)
-            if len(doc_batch) >= self.batch_size:
-                yield doc_batch
-                doc_batch = []
+            # Build doc from page
+            doc = self._convert_page_to_document(page)
+            if not doc:
+                continue
+
+            # Now get attachments for that page:
+            attachment_query = self._construct_attachment_query(page["id"])
+            # We'll use the page's XML to provide context if we summarize an image
+            confluence_xml = page.get("body", {}).get("storage", {}).get("value", "")

-        # Fetch attachments as Documents
-        for confluence_page_id in confluence_page_ids:
-            attachment_query = self._construct_attachment_query(confluence_page_id)
-            # TODO: maybe should add time filter as well?
            for attachment in self.confluence_client.paginated_cql_retrieval(
                cql=attachment_query,
                expand=",".join(_ATTACHMENT_EXPANSION_FIELDS),
            ):
-                doc = self._convert_object_to_document(attachment)
-                if doc is not None:
-                    doc_batch.append(doc)
-                if len(doc_batch) >= self.batch_size:
-                    yield doc_batch
-                    doc_batch = []
+                attachment["metadata"].get("mediaType", "")
+                if not validate_attachment_filetype(
+                    attachment, self.image_analysis_llm
+                ):
+                    continue
+
+                # Attempt to get textual content or image summarization:
+                try:
+                    logger.info(f"Processing attachment: {attachment['title']}")
+                    response = convert_attachment_to_content(
+                        confluence_client=self.confluence_client,
+                        attachment=attachment,
+                        page_context=confluence_xml,
+                        llm=self.image_analysis_llm,
+                    )
+                    if response is None:
+                        continue
+
+                    content_text, file_storage_name = response
+
+                    object_url = build_confluence_document_id(
+                        self.wiki_base, attachment["_links"]["webui"], self.is_cloud
+                    )
+
+                    if content_text:
+                        doc.sections.append(
+                            Section(
+                                text=content_text,
+                                link=object_url,
+                                image_file_name=file_storage_name,
+                            )
+                        )
+                except Exception as e:
+                    logger.error(
+                        f"Failed to extract/summarize attachment {attachment['title']}",
+                        exc_info=e,
+                    )
+                    if not self.continue_on_failure:
+                        raise
+
+            doc_batch.append(doc)
+
+            if len(doc_batch) >= self.batch_size:
+                yield doc_batch
+                doc_batch = []

        if doc_batch:
            yield doc_batch
@@ -322,55 +420,63 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
        end: SecondsSinceUnixEpoch | None = None,
        callback: IndexingHeartbeatInterface | None = None,
    ) -> GenerateSlimDocumentOutput:
+        """
+        Return 'slim' docs (IDs + minimal permission data).
+        Does not fetch actual text. Used primarily for incremental permission sync.
+        """
        doc_metadata_list: list[SlimDocument] = []
-
        restrictions_expand = ",".join(_RESTRICTIONS_EXPANSION_FIELDS)

+        # Query pages
        page_query = self.base_cql_page_query + self.cql_label_filter
        for page in self.confluence_client.cql_paginate_all_expansions(
            cql=page_query,
            expand=restrictions_expand,
            limit=_SLIM_DOC_BATCH_SIZE,
        ):
-            # If the page has restrictions, add them to the perm_sync_data
-            # These will be used by doc_sync.py to sync permissions
            page_restrictions = page.get("restrictions")
            page_space_key = page.get("space", {}).get("key")
            page_ancestors = page.get("ancestors", [])
+
            page_perm_sync_data = {
                "restrictions": page_restrictions or {},
                "space_key": page_space_key,
-                "ancestors": page_ancestors or [],
+                "ancestors": page_ancestors,
            }

            doc_metadata_list.append(
                SlimDocument(
                    id=build_confluence_document_id(
-                        self.wiki_base,
-                        page["_links"]["webui"],
-                        self.is_cloud,
+                        self.wiki_base, page["_links"]["webui"], self.is_cloud
                    ),
                    perm_sync_data=page_perm_sync_data,
                )
            )
+
+            # Query attachments for each page
            attachment_query = self._construct_attachment_query(page["id"])
            for attachment in self.confluence_client.cql_paginate_all_expansions(
                cql=attachment_query,
                expand=restrictions_expand,
                limit=_SLIM_DOC_BATCH_SIZE,
            ):
-                if not validate_attachment_filetype(attachment):
+                # If you skip images, you'll skip them in the permission sync
+                attachment["metadata"].get("mediaType", "")
+                if not validate_attachment_filetype(
+                    attachment, self.image_analysis_llm
+                ):
                    continue
-                attachment_restrictions = attachment.get("restrictions")
+
+                attachment_restrictions = attachment.get("restrictions", {})
                if not attachment_restrictions:
-                    attachment_restrictions = page_restrictions
+                    attachment_restrictions = page_restrictions or {}

                attachment_space_key = attachment.get("space", {}).get("key")
                if not attachment_space_key:
                    attachment_space_key = page_space_key

                attachment_perm_sync_data = {
-                    "restrictions": attachment_restrictions or {},
+                    "restrictions": attachment_restrictions,
                    "space_key": attachment_space_key,
                }

@@ -384,16 +490,46 @@ class ConfluenceConnector(LoadConnector, PollConnector, SlimConnector):
                        perm_sync_data=attachment_perm_sync_data,
                    )
                )
+
            if len(doc_metadata_list) > _SLIM_DOC_BATCH_SIZE:
                yield doc_metadata_list[:_SLIM_DOC_BATCH_SIZE]
                doc_metadata_list = doc_metadata_list[_SLIM_DOC_BATCH_SIZE:]

+                if callback and callback.should_stop():
+                    raise RuntimeError(
+                        "retrieve_all_slim_documents: Stop signal detected"
+                    )
                if callback:
-                    if callback.should_stop():
-                        raise RuntimeError(
-                            "retrieve_all_slim_documents: Stop signal detected"
-                        )
-
                    callback.progress("retrieve_all_slim_documents", 1)

        yield doc_metadata_list
+
+    def validate_connector_settings(self) -> None:
+        if self._confluence_client is None:
+            raise ConnectorMissingCredentialError("Confluence credentials not loaded.")
+
+        try:
+            spaces = self._confluence_client.get_all_spaces(limit=1)
+        except HTTPError as e:
+            status_code = e.response.status_code if e.response is not None else None
+            if status_code == 401:
+                raise CredentialExpiredError(
+                    "Invalid or expired Confluence credentials (HTTP 401)."
+                )
+            elif status_code == 403:
+                raise InsufficientPermissionsError(
+                    "Insufficient permissions to access Confluence resources (HTTP 403)."
+                )
+            raise UnexpectedValidationError(
+                f"Unexpected Confluence error (status={status_code}): {e}"
+            )
+        except Exception as e:
+            raise UnexpectedValidationError(
+                f"Unexpected error while validating Confluence settings: {e}"
+            )
+
+        if not spaces or not spaces.get("results"):
+            raise ConnectorValidationError(
+                "No Confluence spaces found. Either your credentials lack permissions, or "
+                "there truly are no spaces in this Confluence instance."
+            )
--- a/backend/onyx/connectors/confluence/onyx_confluence.py
+++ b/backend/onyx/connectors/confluence/onyx_confluence.py
@@ -1,16 +1,37 @@
-import math
+import io
+import json
 import time
 from collections.abc import Callable
 from collections.abc import Iterator
+from datetime import datetime
+from datetime import timedelta
+from datetime import timezone
 from typing import Any
 from typing import cast
 from typing import TypeVar
 from urllib.parse import quote

+import bs4
 from atlassian import Confluence  # type:ignore
 from pydantic import BaseModel
+from redis import Redis
 from requests import HTTPError

+from ee.onyx.configs.app_configs import OAUTH_CONFLUENCE_CLOUD_CLIENT_ID
+from ee.onyx.configs.app_configs import OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET
+from onyx.configs.app_configs import (
+    CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD,
+)
+from onyx.configs.app_configs import CONFLUENCE_CONNECTOR_ATTACHMENT_SIZE_THRESHOLD
+from onyx.connectors.confluence.utils import _handle_http_error
+from onyx.connectors.confluence.utils import confluence_refresh_tokens
+from onyx.connectors.confluence.utils import get_start_param_from_url
+from onyx.connectors.confluence.utils import update_param_in_path
+from onyx.connectors.confluence.utils import validate_attachment_filetype
+from onyx.connectors.interfaces import CredentialsProviderInterface
+from onyx.file_processing.extract_file_text import extract_file_text
+from onyx.file_processing.html_utils import format_document_soup
+from onyx.redis.redis_pool import get_redis_client
 from onyx.utils.logger import setup_logger

 logger = setup_logger()
@@ -19,12 +40,14 @@ logger = setup_logger()
 F = TypeVar("F", bound=Callable[..., Any])


-RATE_LIMIT_MESSAGE_LOWERCASE = "Rate limit exceeded".lower()
-
 # https://jira.atlassian.com/browse/CONFCLOUD-76433
 _PROBLEMATIC_EXPANSIONS = "body.storage.value"
 _REPLACEMENT_EXPANSIONS = "body.view.value"

+_USER_NOT_FOUND = "Unknown Confluence User"
+_USER_ID_TO_DISPLAY_NAME_CACHE: dict[str, str | None] = {}
+_USER_EMAIL_CACHE: dict[str, str | None] = {}
+

 class ConfluenceRateLimitError(Exception):
    pass
@@ -40,127 +63,358 @@ class ConfluenceUser(BaseModel):
    type: str


-def _handle_http_error(e: HTTPError, attempt: int) -> int:
-    MIN_DELAY = 2
-    MAX_DELAY = 60
-    STARTING_DELAY = 5
-    BACKOFF = 2
-
-    # Check if the response or headers are None to avoid potential AttributeError
-    if e.response is None or e.response.headers is None:
-        logger.warning("HTTPError with `None` as response or as headers")
-        raise e
-
-    if (
-        e.response.status_code != 429
-        and RATE_LIMIT_MESSAGE_LOWERCASE not in e.response.text.lower()
-    ):
-        raise e
-
-    retry_after = None
-
-    retry_after_header = e.response.headers.get("Retry-After")
-    if retry_after_header is not None:
-        try:
-            retry_after = int(retry_after_header)
-            if retry_after > MAX_DELAY:
-                logger.warning(
-                    f"Clamping retry_after from {retry_after} to {MAX_DELAY} seconds..."
-                )
-                retry_after = MAX_DELAY
-            if retry_after < MIN_DELAY:
-                retry_after = MIN_DELAY
-        except ValueError:
-            pass
-
-    if retry_after is not None:
-        logger.warning(
-            f"Rate limiting with retry header. Retrying after {retry_after} seconds..."
-        )
-        delay = retry_after
-    else:
-        logger.warning(
-            "Rate limiting without retry header. Retrying with exponential backoff..."
-        )
-        delay = min(STARTING_DELAY * (BACKOFF**attempt), MAX_DELAY)
-
-    delay_until = math.ceil(time.monotonic() + delay)
-    return delay_until
-
-
-# https://developer.atlassian.com/cloud/confluence/rate-limiting/
-# this uses the native rate limiting option provided by the
-# confluence client and otherwise applies a simpler set of error handling
-def handle_confluence_rate_limit(confluence_call: F) -> F:
-    def wrapped_call(*args: list[Any], **kwargs: Any) -> Any:
-        MAX_RETRIES = 5
-
-        TIMEOUT = 600
-        timeout_at = time.monotonic() + TIMEOUT
-
-        for attempt in range(MAX_RETRIES):
-            if time.monotonic() > timeout_at:
-                raise TimeoutError(
-                    f"Confluence call attempts took longer than {TIMEOUT} seconds."
-                )
-
-            try:
-                # we're relying more on the client to rate limit itself
-                # and applying our own retries in a more specific set of circumstances
-                return confluence_call(*args, **kwargs)
-            except HTTPError as e:
-                delay_until = _handle_http_error(e, attempt)
-                logger.warning(
-                    f"HTTPError in confluence call. "
-                    f"Retrying in {delay_until} seconds..."
-                )
-                while time.monotonic() < delay_until:
-                    # in the future, check a signal here to exit
-                    time.sleep(1)
-            except AttributeError as e:
-                # Some error within the Confluence library, unclear why it fails.
-                # Users reported it to be intermittent, so just retry
-                if attempt == MAX_RETRIES - 1:
-                    raise e
-
-                logger.exception(
-                    "Confluence Client raised an AttributeError. Retrying..."
-                )
-                time.sleep(5)
-
-    return cast(F, wrapped_call)
-
-
 _DEFAULT_PAGINATION_LIMIT = 1000
 _MINIMUM_PAGINATION_LIMIT = 50


-class OnyxConfluence(Confluence):
+class OnyxConfluence:
    """
-    This is a custom Confluence class that overrides the default Confluence class to add a custom CQL method.
+    This is a custom Confluence class that:
+
+    A. overrides the default Confluence class to add a custom CQL method.
+    B. wraps the underlying client so credentials can be renewed between calls.
    This is necessary because the default Confluence class does not properly support cql expansions.
    All methods are automatically wrapped with handle_confluence_rate_limit.
    """

-    def __init__(self, url: str, *args: Any, **kwargs: Any) -> None:
-        super(OnyxConfluence, self).__init__(url, *args, **kwargs)
-        self._wrap_methods()
+    CREDENTIAL_PREFIX = "connector:confluence:credential"
+    CREDENTIAL_TTL = 300  # 5 min

-    def _wrap_methods(self) -> None:
+    def __init__(
+        self,
+        is_cloud: bool,
+        url: str,
+        credentials_provider: CredentialsProviderInterface,
+    ) -> None:
+        self._is_cloud = is_cloud
+        self._url = url.rstrip("/")
+        self._credentials_provider = credentials_provider
+
+        self.redis_client: Redis | None = None
+        self.static_credentials: dict[str, Any] | None = None
+        if self._credentials_provider.is_dynamic():
+            self.redis_client = get_redis_client(
+                tenant_id=credentials_provider.get_tenant_id()
+            )
+        else:
+            self.static_credentials = self._credentials_provider.get_credentials()
+
+        self._confluence = Confluence(url)
+        self.credential_key: str = (
+            self.CREDENTIAL_PREFIX
+            + f":credential_{self._credentials_provider.get_provider_key()}"
+        )
+
+        self._kwargs: Any = None
+
+        self.shared_base_kwargs = {
+            "api_version": "cloud" if is_cloud else "latest",
+            "backoff_and_retry": True,
+            "cloud": is_cloud,
+        }
+
+    def _renew_credentials(self) -> tuple[dict[str, Any], bool]:
+        """Renew the Confluence credentials if they are due for a refresh.
+        Returns a tuple
+        1. The up to date credentials
+        2. True if the credentials were updated
+
+        This method is intended to be used within a distributed lock.
+        Lock, call this, update credentials if the tokens were refreshed, then release
        """
-        For each attribute that is callable (i.e., a method) and doesn't start with an underscore,
-        wrap it with handle_confluence_rate_limit.
-        """
-        for attr_name in dir(self):
-            if callable(getattr(self, attr_name)) and not attr_name.startswith("_"):
-                setattr(
-                    self,
-                    attr_name,
-                    handle_confluence_rate_limit(getattr(self, attr_name)),
+        # static credentials are preloaded, so no locking/redis required
+        if self.static_credentials:
+            return self.static_credentials, False
+
+        if not self.redis_client:
+            raise RuntimeError("self.redis_client is None")
+
+        # dynamic credentials need locking
+        # check redis first, then fallback to the DB
+        credential_raw = self.redis_client.get(self.credential_key)
+        if credential_raw is not None:
+            credential_bytes = cast(bytes, credential_raw)
+            credential_str = credential_bytes.decode("utf-8")
+            credential_json: dict[str, Any] = json.loads(credential_str)
+        else:
+            credential_json = self._credentials_provider.get_credentials()
+
+        if "confluence_refresh_token" not in credential_json:
+            # static credentials ... cache them permanently and return
+            self.static_credentials = credential_json
+            return credential_json, False
+
+        if not OAUTH_CONFLUENCE_CLOUD_CLIENT_ID:
+            raise RuntimeError("OAUTH_CONFLUENCE_CLOUD_CLIENT_ID must be set!")
+
+        if not OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET:
+            raise RuntimeError("OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET must be set!")
+
+        # check if we should refresh tokens. we're deciding to refresh halfway
+        # to expiration
+        now = datetime.now(timezone.utc)
+        created_at = datetime.fromisoformat(credential_json["created_at"])
+        expires_in: int = credential_json["expires_in"]
+        renew_at = created_at + timedelta(seconds=expires_in // 2)
+        if now <= renew_at:
+            # cached/current credentials are reasonably up to date
+            return credential_json, False
+
+        # we need to refresh
+        logger.info("Renewing Confluence Cloud credentials...")
+        new_credentials = confluence_refresh_tokens(
+            OAUTH_CONFLUENCE_CLOUD_CLIENT_ID,
+            OAUTH_CONFLUENCE_CLOUD_CLIENT_SECRET,
+            credential_json["cloud_id"],
+            credential_json["confluence_refresh_token"],
+        )
+
+        # store the new credentials to redis and to the db thru the provider
+        # redis: we use a 5 min TTL because we are given a 10 minute grace period
+        # when keys are rotated. it's easier to expire the cached credentials
+        # reasonably frequently rather than trying to handle strong synchronization
+        # between the db and redis everywhere the credentials might be updated
+        new_credential_str = json.dumps(new_credentials)
+        self.redis_client.set(
+            self.credential_key, new_credential_str, nx=True, ex=self.CREDENTIAL_TTL
+        )
+        self._credentials_provider.set_credentials(new_credentials)
+
+        return new_credentials, True
+
+    @staticmethod
+    def _make_oauth2_dict(credentials: dict[str, Any]) -> dict[str, Any]:
+        oauth2_dict: dict[str, Any] = {}
+        if "confluence_refresh_token" in credentials:
+            oauth2_dict["client_id"] = OAUTH_CONFLUENCE_CLOUD_CLIENT_ID
+            oauth2_dict["token"] = {}
+            oauth2_dict["token"]["access_token"] = credentials[
+                "confluence_access_token"
+            ]
+        return oauth2_dict
+
+    def _probe_connection(
+        self,
+        **kwargs: Any,
+    ) -> None:
+        merged_kwargs = {**self.shared_base_kwargs, **kwargs}
+
+        with self._credentials_provider:
+            credentials, _ = self._renew_credentials()
+
+            # probe connection with direct client, no retries
+            if "confluence_refresh_token" in credentials:
+                logger.info("Probing Confluence with OAuth Access Token.")
+
+                oauth2_dict: dict[str, Any] = OnyxConfluence._make_oauth2_dict(
+                    credentials
+                )
+                url = (
+                    f"https://api.atlassian.com/ex/confluence/{credentials['cloud_id']}"
+                )
+                confluence_client_with_minimal_retries = Confluence(
+                    url=url, oauth2=oauth2_dict, **merged_kwargs
+                )
+            else:
+                logger.info("Probing Confluence with Personal Access Token.")
+                url = self._url
+                if self._is_cloud:
+                    confluence_client_with_minimal_retries = Confluence(
+                        url=url,
+                        username=credentials["confluence_username"],
+                        password=credentials["confluence_access_token"],
+                        **merged_kwargs,
+                    )
+                else:
+                    confluence_client_with_minimal_retries = Confluence(
+                        url=url,
+                        token=credentials["confluence_access_token"],
+                        **merged_kwargs,
+                    )
+
+            spaces = confluence_client_with_minimal_retries.get_all_spaces(limit=1)
+
+            # uncomment the following for testing
+            # the following is an attempt to retrieve the user's timezone
+            # Unfortunately, all data is returned in UTC regardless of the user's time zone
+            # even though CQL parses incoming times based on the user's time zone
+            # space_key = spaces["results"][0]["key"]
+            # space_details = confluence_client_with_minimal_retries.cql(f"space.key={space_key}+AND+type=space")
+
+            if not spaces:
+                raise RuntimeError(
+                    f"No spaces found at {url}! "
+                    "Check your credentials and wiki_base and make sure "
+                    "is_cloud is set correctly."
                )

+            logger.info("Confluence probe succeeded.")
+
+    def _initialize_connection(
+        self,
+        **kwargs: Any,
+    ) -> None:
+        """Called externally to init the connection in a thread safe manner."""
+        merged_kwargs = {**self.shared_base_kwargs, **kwargs}
+        with self._credentials_provider:
+            credentials, _ = self._renew_credentials()
+            self._confluence = self._initialize_connection_helper(
+                credentials, **merged_kwargs
+            )
+            self._kwargs = merged_kwargs
+
+    def _initialize_connection_helper(
+        self,
+        credentials: dict[str, Any],
+        **kwargs: Any,
+    ) -> Confluence:
+        """Called internally to init the connection. Distributed locking
+        to prevent multiple threads from modifying the credentials
+        must be handled around this function."""
+
+        confluence = None
+
+        # build the client that matches the credential type in use
+        if "confluence_refresh_token" in credentials:
+            logger.info("Connecting to Confluence Cloud with OAuth Access Token.")
+
+            oauth2_dict: dict[str, Any] = OnyxConfluence._make_oauth2_dict(credentials)
+            url = f"https://api.atlassian.com/ex/confluence/{credentials['cloud_id']}"
+            confluence = Confluence(url=url, oauth2=oauth2_dict, **kwargs)
+        else:
+            logger.info("Connecting to Confluence with Personal Access Token.")
+            if self._is_cloud:
+                confluence = Confluence(
+                    url=self._url,
+                    username=credentials["confluence_username"],
+                    password=credentials["confluence_access_token"],
+                    **kwargs,
+                )
+            else:
+                confluence = Confluence(
+                    url=self._url,
+                    token=credentials["confluence_access_token"],
+                    **kwargs,
+                )
+
+        return confluence
+
+    # https://developer.atlassian.com/cloud/confluence/rate-limiting/
+    # this uses the native rate limiting option provided by the
+    # confluence client and otherwise applies a simpler set of error handling
+    def _make_rate_limited_confluence_method(
+        self, name: str, credential_provider: CredentialsProviderInterface | None
+    ) -> Callable[..., Any]:
+        def wrapped_call(*args: list[Any], **kwargs: Any) -> Any:
+            MAX_RETRIES = 5
+
+            TIMEOUT = 600
+            timeout_at = time.monotonic() + TIMEOUT
+
+            for attempt in range(MAX_RETRIES):
+                if time.monotonic() > timeout_at:
+                    raise TimeoutError(
+                        f"Confluence call attempts took longer than {TIMEOUT} seconds."
+                    )
+
+                # we're relying more on the client to rate limit itself
+                # and applying our own retries in a more specific set of circumstances
+                try:
+                    if credential_provider:
+                        with credential_provider:
+                            credentials, renewed = self._renew_credentials()
+                            if renewed:
+                                self._confluence = self._initialize_connection_helper(
+                                    credentials, **self._kwargs
+                                )
+                            attr = getattr(self._confluence, name, None)
+                            if attr is None:
+                                # The underlying Confluence client doesn't have this attribute
+                                raise AttributeError(
+                                    f"'{type(self).__name__}' object has no attribute '{name}'"
+                                )
+
+                            return attr(*args, **kwargs)
+                    else:
+                        attr = getattr(self._confluence, name, None)
+                        if attr is None:
+                            # The underlying Confluence client doesn't have this attribute
+                            raise AttributeError(
+                                f"'{type(self).__name__}' object has no attribute '{name}'"
+                            )
+
+                        return attr(*args, **kwargs)
+
+                except HTTPError as e:
+                    delay_until = _handle_http_error(e, attempt)
+                    logger.warning(
+                        f"HTTPError in confluence call. "
+                        f"Retrying in {delay_until - time.monotonic():.0f} seconds..."
+                    )
+                    while time.monotonic() < delay_until:
+                        # in the future, check a signal here to exit
+                        time.sleep(1)
+                except AttributeError as e:
+                    # Some error within the Confluence library, unclear why it fails.
+                    # Users reported it to be intermittent, so just retry
+                    if attempt == MAX_RETRIES - 1:
+                        raise e
+
+                    logger.exception(
+                        "Confluence Client raised an AttributeError. Retrying..."
+                    )
+                    time.sleep(5)
+
+        return wrapped_call
+
+    # def _wrap_methods(self) -> None:
+    #     """
+    #     For each attribute that is callable (i.e., a method) and doesn't start with an underscore,
+    #     wrap it with handle_confluence_rate_limit.
+    #     """
+    #     for attr_name in dir(self):
+    #         if callable(getattr(self, attr_name)) and not attr_name.startswith("_"):
+    #             setattr(
+    #                 self,
+    #                 attr_name,
+    #                 handle_confluence_rate_limit(getattr(self, attr_name)),
+    #             )
+
+    # def _ensure_token_valid(self) -> None:
+    #     if self._token_is_expired():
+    #         self._refresh_token()
+    #         # Re-init the Confluence client with the originally stored args
+    #         self._confluence = Confluence(self._url, *self._args, **self._kwargs)
+
+    def __getattr__(self, name: str) -> Any:
+        """Dynamically intercept attribute/method access."""
+        attr = getattr(self._confluence, name, None)
+        if attr is None:
+            # The underlying Confluence client doesn't have this attribute
+            raise AttributeError(
+                f"'{type(self).__name__}' object has no attribute '{name}'"
+            )
+
+        # If it's not a method, just return it as-is
+        if not callable(attr):
+            return attr
+
+        # skip methods that start with "_"
+        if name.startswith("_"):
+            return attr
+
+        # wrap the method with our retry handler
+        rate_limited_method: Callable[
+            ..., Any
+        ] = self._make_rate_limited_confluence_method(name, self._credentials_provider)
+
+        def wrapped_method(*args: Any, **kwargs: Any) -> Any:
+            return rate_limited_method(*args, **kwargs)
+
+        return wrapped_method
+
    def _paginate_url(
-        self, url_suffix: str, limit: int | None = None
+        self, url_suffix: str, limit: int | None = None, auto_paginate: bool = False
    ) -> Iterator[dict[str, Any]]:
        """
        This will paginate through the top level query.
@@ -235,9 +489,41 @@ class OnyxConfluence(Confluence):
                raise e

            # yield the results individually
-            yield from next_response.get("results", [])
+            results = cast(list[dict[str, Any]], next_response.get("results", []))
+            yield from results

-            url_suffix = next_response.get("_links", {}).get("next")
+            old_url_suffix = url_suffix
+            url_suffix = cast(str, next_response.get("_links", {}).get("next", ""))
+
+            # make sure we don't update the start by more than the amount
+            # of results we were able to retrieve. The Confluence API has a
+            # weird behavior where if you pass in a limit that is too large for
+            # the configured server, it will artificially limit the amount of
+            # results returned BUT will not apply this to the start parameter.
+            # This will cause us to miss results.
+            if url_suffix and "start" in url_suffix:
+                new_start = get_start_param_from_url(url_suffix)
+                previous_start = get_start_param_from_url(old_url_suffix)
+                if new_start - previous_start > len(results):
+                    logger.warning(
+                        f"Start was updated by more than the amount of results "
+                        f"retrieved. This is a bug with Confluence. Start: {new_start}, "
+                        f"Previous Start: {previous_start}, Len Results: {len(results)}."
+                    )
+
+                    # Update the url_suffix to use the adjusted start
+                    adjusted_start = previous_start + len(results)
+                    url_suffix = update_param_in_path(
+                        url_suffix, "start", str(adjusted_start)
+                    )
+
+            # some APIs don't properly paginate, so we need to manually update the `start` param
+            if auto_paginate and len(results) > 0:
+                previous_start = get_start_param_from_url(old_url_suffix)
+                updated_start = previous_start + len(results)
+                url_suffix = update_param_in_path(
+                    old_url_suffix, "start", str(updated_start)
+                )

    def paginated_cql_retrieval(
        self,
@@ -297,7 +583,9 @@ class OnyxConfluence(Confluence):
            url = "rest/api/search/user"
            expand_string = f"&expand={expand}" if expand else ""
            url += f"?cql={cql}{expand_string}"
-            for user_result in self._paginate_url(url, limit):
+            # endpoint doesn't properly paginate, so we need to manually update the `start` param
+            # thus the auto_paginate flag
+            for user_result in self._paginate_url(url, limit, auto_paginate=True):
                # Example response:
                # {
                #     'user': {
@@ -470,59 +758,212 @@ class OnyxConfluence(Confluence):
        return response


-def _validate_connector_configuration(
-    credentials: dict[str, Any],
-    is_cloud: bool,
-    wiki_base: str,
-) -> None:
-    # test connection with direct client, no retries
-    confluence_client_with_minimal_retries = Confluence(
-        api_version="cloud" if is_cloud else "latest",
-        url=wiki_base.rstrip("/"),
-        username=credentials["confluence_username"] if is_cloud else None,
-        password=credentials["confluence_access_token"] if is_cloud else None,
-        token=credentials["confluence_access_token"] if not is_cloud else None,
-        backoff_and_retry=True,
-        max_backoff_retries=6,
-        max_backoff_seconds=10,
+def get_user_email_from_username__server(
+    confluence_client: OnyxConfluence, user_name: str
+) -> str | None:
+    global _USER_EMAIL_CACHE
+    if _USER_EMAIL_CACHE.get(user_name) is None:
+        try:
+            response = confluence_client.get_mobile_parameters(user_name)
+            email = response.get("email")
+        except Exception:
+            logger.warning(f"failed to get confluence email for {user_name}")
+            # For now, we'll just return None and log a warning. This means
+            # we will keep retrying to get the email every group sync.
+            email = None
+            # We may want to just return a string that indicates failure so we dont
+            # keep retrying
+            # email = f"FAILED TO GET CONFLUENCE EMAIL FOR {user_name}"
+        _USER_EMAIL_CACHE[user_name] = email
+    return _USER_EMAIL_CACHE[user_name]
+
+
+def _get_user(confluence_client: OnyxConfluence, user_id: str) -> str:
+    """Get Confluence Display Name based on the account-id or userkey value
+
+    Args:
+        user_id (str): The user id (i.e: the account-id or userkey)
+        confluence_client (Confluence): The Confluence Client
+
+    Returns:
+        str: The User Display Name. 'Unknown Confluence User' if the user is deactivated or not found
+    """
+    global _USER_ID_TO_DISPLAY_NAME_CACHE
+    if _USER_ID_TO_DISPLAY_NAME_CACHE.get(user_id) is None:
+        try:
+            result = confluence_client.get_user_details_by_userkey(user_id)
+            found_display_name = result.get("displayName")
+        except Exception:
+            found_display_name = None
+
+        if not found_display_name:
+            try:
+                result = confluence_client.get_user_details_by_accountid(user_id)
+                found_display_name = result.get("displayName")
+            except Exception:
+                found_display_name = None
+
+        _USER_ID_TO_DISPLAY_NAME_CACHE[user_id] = found_display_name
+
+    return _USER_ID_TO_DISPLAY_NAME_CACHE.get(user_id) or _USER_NOT_FOUND
+
+
+def attachment_to_content(
+    confluence_client: OnyxConfluence,
+    attachment: dict[str, Any],
+    parent_content_id: str | None = None,
+) -> str | None:
+    """If it returns None, assume that we should skip this attachment."""
+    if not validate_attachment_filetype(attachment):
+        return None
+
+    if "api.atlassian.com" in confluence_client.url:
+        # https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-content---attachments/#api-wiki-rest-api-content-id-child-attachment-attachmentid-download-get
+        if not parent_content_id:
+            logger.warning(
+                "parent_content_id is required to download attachments from Confluence Cloud!"
+            )
+            return None
+
+        download_link = (
+            confluence_client.url
+            + f"/rest/api/content/{parent_content_id}/child/attachment/{attachment['id']}/download"
+        )
+    else:
+        download_link = confluence_client.url + attachment["_links"]["download"]
+
+    attachment_size = attachment["extensions"]["fileSize"]
+    if attachment_size > CONFLUENCE_CONNECTOR_ATTACHMENT_SIZE_THRESHOLD:
+        logger.warning(
+            f"Skipping {download_link} due to size. "
+            f"size={attachment_size} "
+            f"threshold={CONFLUENCE_CONNECTOR_ATTACHMENT_SIZE_THRESHOLD}"
+        )
+        return None
+
+    logger.info(f"attachment_to_content - _session.get: link={download_link}")
+
+    # why are we using session.get here? we probably won't retry these ... is that ok?
+    response = confluence_client._session.get(download_link)
+    if response.status_code != 200:
+        logger.warning(
+            f"Failed to fetch {download_link} with invalid status code {response.status_code}"
+        )
+        return None
+
+    extracted_text = extract_file_text(
+        io.BytesIO(response.content),
+        file_name=attachment["title"],
+        break_on_unprocessable=False,
    )
-    spaces = confluence_client_with_minimal_retries.get_all_spaces(limit=1)
+    if len(extracted_text) > CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD:
+        logger.warning(
+            f"Skipping {download_link} due to char count. "
+            f"char count={len(extracted_text)} "
+            f"threshold={CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD}"
+        )
+        return None

-    # uncomment the following for testing
-    # the following is an attempt to retrieve the user's timezone
-    # Unfornately, all data is returned in UTC regardless of the user's time zone
-    # even tho CQL parses incoming times based on the user's time zone
-    # space_key = spaces["results"][0]["key"]
-    # space_details = confluence_client_with_minimal_retries.cql(f"space.key={space_key}+AND+type=space")
+    return extracted_text

-    if not spaces:
-        raise RuntimeError(
-            f"No spaces found at {wiki_base}! "
-            "Check your credentials and wiki_base and make sure "
-            "is_cloud is set correctly."
+
+def extract_text_from_confluence_html(
+    confluence_client: OnyxConfluence,
+    confluence_object: dict[str, Any],
+    fetched_titles: set[str],
+) -> str:
+    """Parse a Confluence html page and replace the 'user Id' by the real
+        User Display Name
+
+    Args:
+        confluence_object (dict): The confluence object as a dict
+        confluence_client (Confluence): Confluence client
+        fetched_titles (set[str]): The titles of the pages that have already been fetched
+    Returns:
+        str: loaded and formatted Confluence page
+    """
+    body = confluence_object["body"]
+    object_html = body.get("storage", body.get("view", {})).get("value")
+
+    soup = bs4.BeautifulSoup(object_html, "html.parser")
+    for user in soup.findAll("ri:user"):
+        user_id = (
+            user.attrs["ri:account-id"]
+            if "ri:account-id" in user.attrs
+            else user.get("ri:userkey")
+        )
+        if not user_id:
+            logger.warning(
+                "ri:userkey not found in ri:user element. " f"Found attrs: {user.attrs}"
+            )
+            continue
+        # Include @ sign for tagging, more clear for LLM
+        user.replaceWith("@" + _get_user(confluence_client, user_id))
+
+    for html_page_reference in soup.findAll("ac:structured-macro"):
+        # Here, we only want to process page within page macros
+        if html_page_reference.attrs.get("ac:name") != "include":
+            continue
+
+        page_data = html_page_reference.find("ri:page")
+        if not page_data:
+            logger.warning(
+                f"Skipping retrieval of {html_page_reference} because page data is missing"
+            )
+            continue
+
+        page_title = page_data.attrs.get("ri:content-title")
+        if not page_title:
+            # only fetch pages that have a title
+            logger.warning(
+                f"Skipping retrieval of {html_page_reference} because it has no title"
+            )
+            continue
+
+        if page_title in fetched_titles:
+            # prevent recursive fetching of pages
+            logger.debug(f"Skipping {page_title} because it has already been fetched")
+            continue
+
+        fetched_titles.add(page_title)
+
+        # Wrap this in a try-except because there are some pages that might not exist
+        try:
+            page_query = f"type=page and title='{quote(page_title)}'"
+
+            page_contents: dict[str, Any] | None = None
+            # Confluence enforces title uniqueness, so we should only get one result here
+            for page in confluence_client.paginated_cql_retrieval(
+                cql=page_query,
+                expand="body.storage.value",
+                limit=1,
+            ):
+                page_contents = page
+                break
+        except Exception as e:
+            logger.warning(
+                f"Error getting page contents for object {confluence_object}: {e}"
+            )
+            continue
+
+        if not page_contents:
+            continue
+
+        text_from_page = extract_text_from_confluence_html(
+            confluence_client=confluence_client,
+            confluence_object=page_contents,
+            fetched_titles=fetched_titles,
        )

+        html_page_reference.replaceWith(text_from_page)

-def build_confluence_client(
-    credentials: dict[str, Any],
-    is_cloud: bool,
-    wiki_base: str,
-) -> OnyxConfluence:
-    _validate_connector_configuration(
-        credentials=credentials,
-        is_cloud=is_cloud,
-        wiki_base=wiki_base,
-    )
-    return OnyxConfluence(
-        api_version="cloud" if is_cloud else "latest",
-        # Remove trailing slash from wiki_base if present
-        url=wiki_base.rstrip("/"),
-        # passing in username causes issues for Confluence data center
-        username=credentials["confluence_username"] if is_cloud else None,
-        password=credentials["confluence_access_token"] if is_cloud else None,
-        token=credentials["confluence_access_token"] if not is_cloud else None,
-        backoff_and_retry=True,
-        max_backoff_retries=10,
-        max_backoff_seconds=60,
-        cloud=is_cloud,
-    )
+    for html_link_body in soup.findAll("ac:link-body"):
+        # This extracts the text from inline links in the page so they can be
+        # represented in the document text as plain text
+        try:
+            text_from_link = html_link_body.text
+            html_link_body.replaceWith(f"(LINK TEXT: {text_from_link})")
+        except Exception as e:
+            logger.warning(f"Error processing ac:link-body: {e}")
+
+    return format_document_soup(soup)
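The `fetched_titles` guard in `extract_text_from_confluence_html` above is what keeps mutually including pages from recursing forever. A minimal sketch of that idea follows, assuming a hypothetical in-memory `PAGES` dict and a made-up `[[include:Title]]` marker in place of Confluence's `ac:structured-macro` elements. Unlike the diff, which leaves a skipped macro in the soup, this sketch simply drops an already-fetched reference.

```python
import re

# Hypothetical stand-in for Confluence: page title -> body text.
PAGES = {
    "A": "intro [[include:B]] outro",
    "B": "b-text [[include:A]] [[include:C]]",
    "C": "c-text",
}

# Made-up include marker, used only for this illustration.
INCLUDE_RE = re.compile(r"\[\[include:([^\]]+)\]\]")


def expand_includes(title, fetched_titles):
    """Inline included pages, skipping any title that was already fetched."""
    body = PAGES.get(title, "")

    def replace(match):
        included = match.group(1)
        if included in fetched_titles:
            # Already expanded somewhere up the call chain: stop here.
            return ""
        fetched_titles.add(included)
        return expand_includes(included, fetched_titles)

    return INCLUDE_RE.sub(replace, body)


seen = {"A"}
result = expand_includes("A", seen)
```

Because the same set is shared across every recursive call, page A is never re-expanded when B includes it back.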
--- a/backend/onyx/connectors/confluence/utils.py
+++ b/backend/onyx/connectors/confluence/utils.py
@@ -1,236 +1,280 @@
 import io
+import math
+import time
+from collections.abc import Callable
 from datetime import datetime
+from datetime import timedelta
 from datetime import timezone
+from io import BytesIO
+from pathlib import Path
 from typing import Any
+from typing import cast
+from typing import TYPE_CHECKING
+from typing import TypeVar
+from urllib.parse import parse_qs
 from urllib.parse import quote
+from urllib.parse import urlparse

-import bs4
+import requests
+from pydantic import BaseModel
+from sqlalchemy.orm import Session

 from onyx.configs.app_configs import (
    CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD,
 )
-from onyx.configs.app_configs import CONFLUENCE_CONNECTOR_ATTACHMENT_SIZE_THRESHOLD
-from onyx.connectors.confluence.onyx_confluence import (
-    OnyxConfluence,
-)
+from onyx.configs.constants import FileOrigin
+
+if TYPE_CHECKING:
+    from onyx.connectors.confluence.onyx_confluence import OnyxConfluence
+
+from onyx.db.engine import get_session_with_current_tenant
+from onyx.db.models import PGFileStore
+from onyx.db.pg_file_store import create_populate_lobj
+from onyx.db.pg_file_store import save_bytes_to_pgfilestore
+from onyx.db.pg_file_store import upsert_pgfilestore
 from onyx.file_processing.extract_file_text import extract_file_text
-from onyx.file_processing.html_utils import format_document_soup
+from onyx.file_processing.file_validation import is_valid_image_type
+from onyx.file_processing.image_utils import store_image_and_create_section
+from onyx.llm.interfaces import LLM
 from onyx.utils.logger import setup_logger

 logger = setup_logger()

-
-_USER_EMAIL_CACHE: dict[str, str | None] = {}
+CONFLUENCE_OAUTH_TOKEN_URL = "https://auth.atlassian.com/oauth/token"
+RATE_LIMIT_MESSAGE_LOWERCASE = "Rate limit exceeded".lower()


-def get_user_email_from_username__server(
-    confluence_client: OnyxConfluence, user_name: str
-) -> str | None:
-    global _USER_EMAIL_CACHE
-    if _USER_EMAIL_CACHE.get(user_name) is None:
-        try:
-            response = confluence_client.get_mobile_parameters(user_name)
-            email = response.get("email")
-        except Exception:
-            logger.warning(f"failed to get confluence email for {user_name}")
-            # For now, we'll just return None and log a warning. This means
-            # we will keep retrying to get the email every group sync.
-            email = None
-            # We may want to just return a string that indicates failure so we dont
-            # keep retrying
-            # email = f"FAILED TO GET CONFLUENCE EMAIL FOR {user_name}"
-        _USER_EMAIL_CACHE[user_name] = email
-    return _USER_EMAIL_CACHE[user_name]
+class TokenResponse(BaseModel):
+    access_token: str
+    expires_in: int
+    token_type: str
+    refresh_token: str
+    scope: str


-_USER_NOT_FOUND = "Unknown Confluence User"
-_USER_ID_TO_DISPLAY_NAME_CACHE: dict[str, str | None] = {}
-
-
-def _get_user(confluence_client: OnyxConfluence, user_id: str) -> str:
-    """Get Confluence Display Name based on the account-id or userkey value
-
-    Args:
-        user_id (str): The user id (i.e: the account-id or userkey)
-        confluence_client (Confluence): The Confluence Client
-
-    Returns:
-        str: The User Display Name. 'Unknown User' if the user is deactivated or not found
+def validate_attachment_filetype(
+    attachment: dict[str, Any], llm: LLM | None = None
+) -> bool:
    """
-    global _USER_ID_TO_DISPLAY_NAME_CACHE
-    if _USER_ID_TO_DISPLAY_NAME_CACHE.get(user_id) is None:
-        try:
-            result = confluence_client.get_user_details_by_userkey(user_id)
-            found_display_name = result.get("displayName")
-        except Exception:
-            found_display_name = None
-
-        if not found_display_name:
-            try:
-                result = confluence_client.get_user_details_by_accountid(user_id)
-                found_display_name = result.get("displayName")
-            except Exception:
-                found_display_name = None
-
-        _USER_ID_TO_DISPLAY_NAME_CACHE[user_id] = found_display_name
-
-    return _USER_ID_TO_DISPLAY_NAME_CACHE.get(user_id) or _USER_NOT_FOUND
-
-
-def extract_text_from_confluence_html(
-    confluence_client: OnyxConfluence,
-    confluence_object: dict[str, Any],
-    fetched_titles: set[str],
-) -> str:
-    """Parse a Confluence html page and replace the 'user Id' by the real
-        User Display Name
-
-    Args:
-        confluence_object (dict): The confluence object as a dict
-        confluence_client (Confluence): Confluence client
-        fetched_titles (set[str]): The titles of the pages that have already been fetched
-    Returns:
-        str: loaded and formated Confluence page
+    Validates if the attachment is a supported file type.
+    If LLM is provided, also checks if it's an image that can be processed.
    """
-    body = confluence_object["body"]
-    object_html = body.get("storage", body.get("view", {})).get("value")
+    # A missing "metadata" entry is treated as an empty (unknown) media type.
+    media_type = attachment.get("metadata", {}).get("mediaType", "")

-    soup = bs4.BeautifulSoup(object_html, "html.parser")
-    for user in soup.findAll("ri:user"):
-        user_id = (
-            user.attrs["ri:account-id"]
-            if "ri:account-id" in user.attrs
-            else user.get("ri:userkey")
-        )
-        if not user_id:
-            logger.warning(
-                "ri:userkey not found in ri:user element. " f"Found attrs: {user.attrs}"
-            )
-            continue
-        # Include @ sign for tagging, more clear for LLM
-        user.replaceWith("@" + _get_user(confluence_client, user_id))
+    if media_type.startswith("image/"):
+        return llm is not None and is_valid_image_type(media_type)

-    for html_page_reference in soup.findAll("ac:structured-macro"):
-        # Here, we only want to process page within page macros
-        if html_page_reference.attrs.get("ac:name") != "include":
-            continue
-
-        page_data = html_page_reference.find("ri:page")
-        if not page_data:
-            logger.warning(
-                f"Skipping retrieval of {html_page_reference} because because page data is missing"
-            )
-            continue
-
-        page_title = page_data.attrs.get("ri:content-title")
-        if not page_title:
-            # only fetch pages that have a title
-            logger.warning(
-                f"Skipping retrieval of {html_page_reference} because it has no title"
-            )
-            continue
-
-        if page_title in fetched_titles:
-            # prevent recursive fetching of pages
-            logger.debug(f"Skipping {page_title} because it has already been fetched")
-            continue
-
-        fetched_titles.add(page_title)
-
-        # Wrap this in a try-except because there are some pages that might not exist
-        try:
-            page_query = f"type=page and title='{quote(page_title)}'"
-
-            page_contents: dict[str, Any] | None = None
-            # Confluence enforces title uniqueness, so we should only get one result here
-            for page in confluence_client.paginated_cql_retrieval(
-                cql=page_query,
-                expand="body.storage.value",
-                limit=1,
-            ):
-                page_contents = page
-                break
-        except Exception as e:
-            logger.warning(
-                f"Error getting page contents for object {confluence_object}: {e}"
-            )
-            continue
-
-        if not page_contents:
-            continue
-
-        text_from_page = extract_text_from_confluence_html(
-            confluence_client=confluence_client,
-            confluence_object=page_contents,
-            fetched_titles=fetched_titles,
-        )
-
-        html_page_reference.replaceWith(text_from_page)
-
-    for html_link_body in soup.findAll("ac:link-body"):
-        # This extracts the text from inline links in the page so they can be
-        # represented in the document text as plain text
-        try:
-            text_from_link = html_link_body.text
-            html_link_body.replaceWith(f"(LINK TEXT: {text_from_link})")
-        except Exception as e:
-            logger.warning(f"Error processing ac:link-body: {e}")
-
-    return format_document_soup(soup)
+    # For non-image files, check if we support the extension
+    title = attachment.get("title", "")
+    extension = Path(title).suffix.lstrip(".").lower() if "." in title else ""
+    return extension in ["pdf", "doc", "docx", "txt", "md", "rtf"]


-def validate_attachment_filetype(attachment: dict[str, Any]) -> bool:
-    return attachment["metadata"]["mediaType"] not in [
-        "image/jpeg",
-        "image/png",
-        "image/gif",
-        "image/svg+xml",
-        "video/mp4",
-        "video/quicktime",
-    ]
+class AttachmentProcessingResult(BaseModel):
+    """
+    A container for results after processing a Confluence attachment.
+    'text' is the textual content of the attachment.
+    'file_name' is the final file name used in PGFileStore to store the content.
+    'error' holds an error message (str) if something failed, otherwise None.
+    """
+
+    text: str | None
+    file_name: str | None
+    error: str | None = None


-def attachment_to_content(
-    confluence_client: OnyxConfluence,
-    attachment: dict[str, Any],
-) -> str | None:
-    """If it returns None, assume that we should skip this attachment."""
-    if not validate_attachment_filetype(attachment):
-        return None
-
+def _download_attachment(
+    confluence_client: "OnyxConfluence", attachment: dict[str, Any]
+) -> bytes | None:
+    """
+    Retrieves the raw bytes of an attachment from Confluence. Returns None on error.
+    """
    download_link = confluence_client.url + attachment["_links"]["download"]
-
-    attachment_size = attachment["extensions"]["fileSize"]
-    if attachment_size > CONFLUENCE_CONNECTOR_ATTACHMENT_SIZE_THRESHOLD:
+    resp = confluence_client._session.get(download_link)
+    if resp.status_code != 200:
        logger.warning(
-            f"Skipping {download_link} due to size. "
-            f"size={attachment_size} "
-            f"threshold={CONFLUENCE_CONNECTOR_ATTACHMENT_SIZE_THRESHOLD}"
+            f"Failed to fetch {download_link} with status code {resp.status_code}"
        )
        return None
+    return resp.content

-    logger.info(f"_attachment_to_content - _session.get: link={download_link}")
-    response = confluence_client._session.get(download_link)
-    if response.status_code != 200:
-        logger.warning(
-            f"Failed to fetch {download_link} with invalid status code {response.status_code}"
+
+def process_attachment(
+    confluence_client: "OnyxConfluence",
+    attachment: dict[str, Any],
+    page_context: str,
+    llm: LLM | None,
+) -> AttachmentProcessingResult:
+    """
+    Processes a Confluence attachment. If it's a document, extracts text,
+    or if it's an image and an LLM is available, summarizes it. Returns a structured result.
+    """
+    try:
+        # Get the media type from the attachment metadata
+        media_type = attachment.get("metadata", {}).get("mediaType", "")
+
+        # Validate the attachment type
+        if not validate_attachment_filetype(attachment, llm):
+            return AttachmentProcessingResult(
+                text=None,
+                file_name=None,
+                error=f"Unsupported file type: {media_type}",
+            )
+
+        # Download the attachment
+        raw_bytes = _download_attachment(confluence_client, attachment)
+        if raw_bytes is None:
+            return AttachmentProcessingResult(
+                text=None, file_name=None, error="Failed to download attachment"
+            )
+
+        # Process image attachments with LLM if available
+        if media_type.startswith("image/") and llm:
+            return _process_image_attachment(
+                confluence_client, attachment, page_context, llm, raw_bytes, media_type
+            )
+
+        # Process document attachments
+        try:
+            text = extract_file_text(
+                file=BytesIO(raw_bytes),
+                file_name=attachment["title"],
+            )
+
+            # Skip if the text is too long
+            if len(text) > CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD:
+                return AttachmentProcessingResult(
+                    text=None,
+                    file_name=None,
+                    error=f"Attachment text too long: {len(text)} chars",
+                )
+
+            return AttachmentProcessingResult(text=text, file_name=None, error=None)
+        except Exception as e:
+            return AttachmentProcessingResult(
+                text=None, file_name=None, error=f"Failed to extract text: {e}"
+            )
+
+    except Exception as e:
+        return AttachmentProcessingResult(
+            text=None, file_name=None, error=f"Failed to process attachment: {e}"
        )
-        return None

-    extracted_text = extract_file_text(
-        io.BytesIO(response.content),
-        file_name=attachment["title"],
-        break_on_unprocessable=False,
-    )
+
+def _process_image_attachment(
+    confluence_client: "OnyxConfluence",
+    attachment: dict[str, Any],
+    page_context: str,
+    llm: LLM,
+    raw_bytes: bytes,
+    media_type: str,
+) -> AttachmentProcessingResult:
+    """Process an image attachment by saving it and generating a summary."""
+    try:
+        # Use the standardized image storage and section creation
+        with get_session_with_current_tenant() as db_session:
+            section, file_name = store_image_and_create_section(
+                db_session=db_session,
+                image_data=raw_bytes,
+                file_name=Path(attachment["id"]).name,
+                display_name=attachment["title"],
+                media_type=media_type,
+                llm=llm,
+                file_origin=FileOrigin.CONNECTOR,
+            )
+
+            return AttachmentProcessingResult(
+                text=section.text, file_name=file_name, error=None
+            )
+    except Exception as e:
+        msg = f"Image summarization failed for {attachment['title']}: {e}"
+        logger.error(msg, exc_info=e)
+        return AttachmentProcessingResult(text=None, file_name=None, error=msg)
+
+
+def _process_text_attachment(
+    attachment: dict[str, Any],
+    raw_bytes: bytes,
+    media_type: str,
+) -> AttachmentProcessingResult:
+    """Process a text-based attachment by extracting its content."""
+    try:
+        extracted_text = extract_file_text(
+            io.BytesIO(raw_bytes),
+            file_name=attachment["title"],
+            break_on_unprocessable=False,
+        )
+    except Exception as e:
+        msg = f"Failed to extract text for '{attachment['title']}': {e}"
+        logger.error(msg, exc_info=e)
+        return AttachmentProcessingResult(text=None, file_name=None, error=msg)
+
+    # Check length constraints
+    if extracted_text is None or len(extracted_text) == 0:
+        msg = f"No text extracted for {attachment['title']}"
+        logger.warning(msg)
+        return AttachmentProcessingResult(text=None, file_name=None, error=msg)
+
    if len(extracted_text) > CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD:
+        msg = (
+            f"Skipping attachment {attachment['title']} due to char count "
+            f"({len(extracted_text)} > {CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD})"
+        )
+        logger.warning(msg)
+        return AttachmentProcessingResult(text=None, file_name=None, error=msg)
+
+    # Save the attachment
+    try:
+        with get_session_with_current_tenant() as db_session:
+            saved_record = save_bytes_to_pgfilestore(
+                db_session=db_session,
+                raw_bytes=raw_bytes,
+                media_type=media_type,
+                identifier=attachment["id"],
+                display_name=attachment["title"],
+            )
+    except Exception as e:
+        msg = f"Failed to save attachment '{attachment['title']}' to PG: {e}"
+        logger.error(msg, exc_info=e)
+        return AttachmentProcessingResult(
+            text=extracted_text, file_name=None, error=msg
+        )
+
+    return AttachmentProcessingResult(
+        text=extracted_text, file_name=saved_record.file_name, error=None
+    )
+
+
+def convert_attachment_to_content(
+    confluence_client: "OnyxConfluence",
+    attachment: dict[str, Any],
+    page_context: str,
+    llm: LLM | None,
+) -> tuple[str | None, str | None] | None:
+    """
+    Facade function which:
+      1. Validates attachment type
+      2. Extracts or summarizes content
+      3. Returns (content_text, stored_file_name) or None if we should skip it
+    """
+    media_type = attachment["metadata"]["mediaType"]
+    # Quick check for unsupported types:
+    if media_type.startswith("video/") or media_type == "application/gliffy+json":
        logger.warning(
-            f"Skipping {download_link} due to char count. "
-            f"char count={len(extracted_text)} "
-            f"threshold={CONFLUENCE_CONNECTOR_ATTACHMENT_CHAR_COUNT_THRESHOLD}"
+            f"Skipping unsupported attachment type: '{media_type}' for {attachment['title']}"
        )
        return None

-    return extracted_text
+    result = process_attachment(confluence_client, attachment, page_context, llm)
+    if result.error is not None:
+        logger.warning(
+            f"Attachment {attachment['title']} encountered error: {result.error}"
+        )
+        return None
+
+    # Return the text and the file name
+    return result.text, result.file_name


 def build_confluence_document_id(
@@ -251,23 +295,6 @@ def build_confluence_document_id(
    return f"{base_url}{content_url}"


-def _extract_referenced_attachment_names(page_text: str) -> list[str]:
-    """Parse a Confluence html page to generate a list of current
-        attachments in use
-
-    Args:
-        text (str): The page content
-
-    Returns:
-        list[str]: List of filenames currently in use by the page text
-    """
-    referenced_attachment_filenames = []
-    soup = bs4.BeautifulSoup(page_text, "html.parser")
-    for attachment in soup.findAll("ri:attachment"):
-        referenced_attachment_filenames.append(attachment.attrs["ri:filename"])
-    return referenced_attachment_filenames
-
-
 def datetime_from_string(datetime_string: str) -> datetime:
    datetime_object = datetime.fromisoformat(datetime_string)

@@ -279,3 +306,197 @@ def datetime_from_string(datetime_string: str) -> datetime:
        datetime_object = datetime_object.astimezone(timezone.utc)

    return datetime_object
+
+
+def confluence_refresh_tokens(
+    client_id: str, client_secret: str, cloud_id: str, refresh_token: str
+) -> dict[str, Any]:
+    # rotate the refresh and access token
+    # Note that access tokens are only good for an hour in confluence cloud,
+    # so we're going to have problems if the connector runs for longer
+    # https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/#use-a-refresh-token-to-get-another-access-token-and-refresh-token-pair
+    response = requests.post(
+        CONFLUENCE_OAUTH_TOKEN_URL,
+        headers={"Content-Type": "application/x-www-form-urlencoded"},
+        data={
+            "grant_type": "refresh_token",
+            "client_id": client_id,
+            "client_secret": client_secret,
+            "refresh_token": refresh_token,
+        },
+    )
+
+    try:
+        token_response = TokenResponse.model_validate_json(response.text)
+    except Exception:
+        raise RuntimeError("Confluence Cloud token refresh failed.")
+
+    now = datetime.now(timezone.utc)
+    expires_at = now + timedelta(seconds=token_response.expires_in)
+
+    new_credentials: dict[str, Any] = {}
+    new_credentials["confluence_access_token"] = token_response.access_token
+    new_credentials["confluence_refresh_token"] = token_response.refresh_token
+    new_credentials["created_at"] = now.isoformat()
+    new_credentials["expires_at"] = expires_at.isoformat()
+    new_credentials["expires_in"] = token_response.expires_in
+    new_credentials["scope"] = token_response.scope
+    new_credentials["cloud_id"] = cloud_id
+    return new_credentials
+
+
+F = TypeVar("F", bound=Callable[..., Any])
+
+
+# https://developer.atlassian.com/cloud/confluence/rate-limiting/
+# this uses the native rate limiting option provided by the
+# confluence client and otherwise applies a simpler set of error handling
+def handle_confluence_rate_limit(confluence_call: F) -> F:
+    def wrapped_call(*args: Any, **kwargs: Any) -> Any:
+        MAX_RETRIES = 5
+
+        TIMEOUT = 600
+        timeout_at = time.monotonic() + TIMEOUT
+
+        for attempt in range(MAX_RETRIES):
+            if time.monotonic() > timeout_at:
+                raise TimeoutError(
+                    f"Confluence call attempts took longer than {TIMEOUT} seconds."
+                )
+
+            try:
+                # we're relying more on the client to rate limit itself
+                # and applying our own retries in a more specific set of circumstances
+                return confluence_call(*args, **kwargs)
+            except requests.HTTPError as e:
+                delay_until = _handle_http_error(e, attempt)
+                logger.warning(
+                    f"HTTPError in confluence call. "
+                    f"Retrying in {delay_until - time.monotonic():.0f} seconds..."
+                )
+                while time.monotonic() < delay_until:
+                    # in the future, check a signal here to exit
+                    time.sleep(1)
+            except AttributeError as e:
+                # Some error within the Confluence library, unclear why it fails.
+                # Users reported it to be intermittent, so just retry
+                if attempt == MAX_RETRIES - 1:
+                    raise e
+
+                logger.exception(
+                    "Confluence Client raised an AttributeError. Retrying..."
+                )
+                time.sleep(5)
+
+    return cast(F, wrapped_call)
+
+
+def _handle_http_error(e: requests.HTTPError, attempt: int) -> int:
+    MIN_DELAY = 2
+    MAX_DELAY = 60
+    STARTING_DELAY = 5
+    BACKOFF = 2
+
+    # Check if the response or headers are None to avoid potential AttributeError
+    if e.response is None or e.response.headers is None:
+        logger.warning("HTTPError with `None` as response or as headers")
+        raise e
+
+    if (
+        e.response.status_code != 429
+        and RATE_LIMIT_MESSAGE_LOWERCASE not in e.response.text.lower()
+    ):
+        raise e
+
+    retry_after = None
+
+    retry_after_header = e.response.headers.get("Retry-After")
+    if retry_after_header is not None:
+        try:
+            retry_after = int(retry_after_header)
+            if retry_after > MAX_DELAY:
+                logger.warning(
+                    f"Clamping retry_after from {retry_after} to {MAX_DELAY} seconds..."
+                )
+                retry_after = MAX_DELAY
+            if retry_after < MIN_DELAY:
+                retry_after = MIN_DELAY
+        except ValueError:
+            pass
+
+    if retry_after is not None:
+        logger.warning(
+            f"Rate limiting with retry header. Retrying after {retry_after} seconds..."
+        )
+        delay = retry_after
+    else:
+        logger.warning(
+            "Rate limiting without retry header. Retrying with exponential backoff..."
+        )
+        delay = min(STARTING_DELAY * (BACKOFF**attempt), MAX_DELAY)
+
+    delay_until = math.ceil(time.monotonic() + delay)
+    return delay_until
+
+
+def get_single_param_from_url(url: str, param: str) -> str | None:
+    """Get a parameter from a url"""
+    parsed_url = urlparse(url)
+    return parse_qs(parsed_url.query).get(param, [None])[0]
+
+
+def get_start_param_from_url(url: str) -> int:
+    """Get the start parameter from a url"""
+    start_str = get_single_param_from_url(url, "start")
+    if start_str is None:
+        return 0
+    return int(start_str)
+
+
+def update_param_in_path(path: str, param: str, value: str) -> str:
+    """Update a parameter in a path. Path should look something like:
+
+    /api/rest/users?start=0&limit=10
+    """
+    parsed_url = urlparse(path)
+    query_params = parse_qs(parsed_url.query)
+    query_params[param] = [value]
+    return (
+        path.split("?")[0]
+        + "?"
+        + "&".join(f"{k}={quote(v[0])}" for k, v in query_params.items())
+    )
+
+
+def attachment_to_file_record(
+    confluence_client: "OnyxConfluence",
+    attachment: dict[str, Any],
+    db_session: Session,
+) -> tuple[PGFileStore, bytes]:
+    """Save an attachment to the file store and return the file record."""
+    download_link = _attachment_to_download_link(confluence_client, attachment)
+    image_data = confluence_client.get(
+        download_link, absolute=True, not_json_response=True
+    )
+
+    # Save image to file store
+    file_name = f"confluence_attachment_{attachment['id']}"
+    lobj_oid = create_populate_lobj(BytesIO(image_data), db_session)
+    pgfilestore = upsert_pgfilestore(
+        file_name=file_name,
+        display_name=attachment["title"],
+        file_origin=FileOrigin.OTHER,
+        file_type=attachment["metadata"]["mediaType"],
+        lobj_oid=lobj_oid,
+        db_session=db_session,
+        commit=True,
+    )
+
+    return pgfilestore, image_data
+
+
+def _attachment_to_download_link(
+    confluence_client: "OnyxConfluence", attachment: dict[str, Any]
+) -> str:
+    """Extracts the download link to images."""
+    return confluence_client.url + attachment["_links"]["download"]
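The delay selection inside `_handle_http_error` in the diff above can be read as a pure function: clamp a numeric `Retry-After` header, otherwise fall back to capped exponential backoff. The sketch below reuses the constants from the diff. The helper name `compute_retry_delay` is hypothetical, and it returns a duration in seconds rather than the monotonic deadline the real function returns.

```python
from typing import Optional

MIN_DELAY = 2
MAX_DELAY = 60
STARTING_DELAY = 5
BACKOFF = 2


def compute_retry_delay(retry_after_header: Optional[str], attempt: int) -> int:
    """Return the number of seconds to wait before the next attempt."""
    if retry_after_header is not None:
        try:
            # Clamp the server-provided value into [MIN_DELAY, MAX_DELAY].
            return max(MIN_DELAY, min(int(retry_after_header), MAX_DELAY))
        except ValueError:
            # Non-numeric header: fall through to exponential backoff.
            pass
    return min(STARTING_DELAY * (BACKOFF**attempt), MAX_DELAY)
```

Keeping the arithmetic separate from `time.monotonic()` makes the clamping and backoff rules easy to unit test without sleeping.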
--- a/backend/onyx/connectors/credentials_provider.py
+++ b/backend/onyx/connectors/credentials_provider.py
@@ -0,0 +1,135 @@
+import uuid
+from types import TracebackType
+from typing import Any
+
+from redis.lock import Lock as RedisLock
+from sqlalchemy import select
+
+from onyx.connectors.interfaces import CredentialsProviderInterface
+from onyx.db.engine import get_session_with_tenant
+from onyx.db.models import Credential
+from onyx.redis.redis_pool import get_redis_client
+
+
+class OnyxDBCredentialsProvider(
+    CredentialsProviderInterface["OnyxDBCredentialsProvider"]
+):
+    """Implementation to allow the connector to callback and update credentials in the db.
+    Required in cases where credentials can rotate while the connector is running.
+    """
+
+    LOCK_TTL = 900  # TTL of the lock
+
+    def __init__(self, tenant_id: str, connector_name: str, credential_id: int):
+        self._tenant_id = tenant_id
+        self._connector_name = connector_name
+        self._credential_id = credential_id
+
+        self.redis_client = get_redis_client(tenant_id=tenant_id)
+
+        # lock used to prevent overlapping renewal of credentials
+        self.lock_key = f"da_lock:connector:{connector_name}:credential_{credential_id}"
+        self._lock: RedisLock = self.redis_client.lock(self.lock_key, self.LOCK_TTL)
+
+    def __enter__(self) -> "OnyxDBCredentialsProvider":
+        acquired = self._lock.acquire(blocking_timeout=self.LOCK_TTL)
+        if not acquired:
+            raise RuntimeError(f"Could not acquire lock for key: {self.lock_key}")
+
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_value: BaseException | None,
+        traceback: TracebackType | None,
+    ) -> None:
+        """Release the lock when exiting the context."""
+        if self._lock and self._lock.owned():
+            self._lock.release()
+
+    def get_tenant_id(self) -> str | None:
+        return self._tenant_id
+
+    def get_provider_key(self) -> str:
+        return str(self._credential_id)
+
+    def get_credentials(self) -> dict[str, Any]:
+        with get_session_with_tenant(tenant_id=self._tenant_id) as db_session:
+            credential = db_session.execute(
+                select(Credential).where(Credential.id == self._credential_id)
+            ).scalar_one_or_none()
+
+            if credential is None:
+                raise ValueError(
+                    f"No credential found: credential={self._credential_id}"
+                )
+
+            return credential.credential_json
+
+    def set_credentials(self, credential_json: dict[str, Any]) -> None:
+        with get_session_with_tenant(tenant_id=self._tenant_id) as db_session:
+            try:
+                credential = db_session.execute(
+                    select(Credential)
+                    .where(Credential.id == self._credential_id)
+                    .with_for_update()
+                ).scalar_one_or_none()
+
+                if credential is None:
+                    raise ValueError(
+                        f"No credential found: credential={self._credential_id}"
+                    )
+
+                credential.credential_json = credential_json
+                db_session.commit()
+            except Exception:
+                db_session.rollback()
+                raise
+
+    def is_dynamic(self) -> bool:
+        return True
+
+
+class OnyxStaticCredentialsProvider(
+    CredentialsProviderInterface["OnyxStaticCredentialsProvider"]
+):
+    """Implementation (a very simple one!) to handle static credentials."""
+
+    def __init__(
+        self,
+        tenant_id: str | None,
+        connector_name: str,
+        credential_json: dict[str, Any],
+    ):
+        self._tenant_id = tenant_id
+        self._connector_name = connector_name
+        self._credential_json = credential_json
+
+        self._provider_key = str(uuid.uuid4())
+
+    def __enter__(self) -> "OnyxStaticCredentialsProvider":
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_value: BaseException | None,
+        traceback: TracebackType | None,
+    ) -> None:
+        pass
+
+    def get_tenant_id(self) -> str | None:
+        return self._tenant_id
+
+    def get_provider_key(self) -> str:
+        return self._provider_key
+
+    def get_credentials(self) -> dict[str, Any]:
+        return self._credential_json
+
+    def set_credentials(self, credential_json: dict[str, Any]) -> None:
+        self._credential_json = credential_json
+
+    def is_dynamic(self) -> bool:
+        return False
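Review note: a minimal sketch of how a connector might use a provider through the context-manager interface added above. `InMemoryProvider` and `refresh_token` are hypothetical names that do not appear in this diff, and a `threading.Lock` stands in for the Redis lock so the example runs without external services.

```python
import threading
from typing import Any


class InMemoryProvider:
    """Hypothetical stand-in that mirrors the provider interface in this diff."""

    def __init__(self, credential_json: dict[str, Any]) -> None:
        self._credential_json = credential_json
        # Assumption: threading.Lock replaces the Redis lock of the DB-backed provider.
        self._lock = threading.Lock()

    def __enter__(self) -> "InMemoryProvider":
        if not self._lock.acquire(timeout=1):
            raise RuntimeError("Could not acquire lock")
        return self

    def __exit__(self, exc_type: Any, exc_value: Any, traceback: Any) -> None:
        self._lock.release()

    def get_credentials(self) -> dict[str, Any]:
        # Return a copy so callers cannot mutate the stored value by accident.
        return dict(self._credential_json)

    def set_credentials(self, credential_json: dict[str, Any]) -> None:
        self._credential_json = credential_json


def refresh_token(provider: InMemoryProvider, new_token: str) -> dict[str, Any]:
    """Read, modify, and write back under the lock so renewals cannot overlap."""
    with provider:
        creds = provider.get_credentials()
        creds["access_token"] = new_token
        provider.set_credentials(creds)
        return creds
```

The lock is held only for the read-modify-write, which is the overlap the `lock_key` comment in the DB-backed provider describes.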
--- a/backend/onyx/connectors/dropbox/connector.py
+++ b/backend/onyx/connectors/dropbox/connector.py
@@ -10,10 +10,10 @@ from dropbox.files import FolderMetadata  # type:ignore

 from onyx.configs.app_configs import INDEX_BATCH_SIZE
 from onyx.configs.constants import DocumentSource
-from onyx.connectors.interfaces import ConnectorValidationError
-from onyx.connectors.interfaces import CredentialInvalidError
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import CredentialInvalidError
+from onyx.connectors.exceptions import InsufficientPermissionsError
 from onyx.connectors.interfaces import GenerateDocumentsOutput
-from onyx.connectors.interfaces import InsufficientPermissionsError
 from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.interfaces import PollConnector
 from onyx.connectors.interfaces import SecondsSinceUnixEpoch
--- a/backend/onyx/connectors/exceptions.py
+++ b/backend/onyx/connectors/exceptions.py
@@ -0,0 +1,52 @@
+class ValidationError(Exception):
+    """General exception for validation errors."""
+
+    def __init__(self, message: str):
+        self.message = message
+        super().__init__(self.message)
+
+
+class ConnectorValidationError(ValidationError):
+    """General exception for connector validation errors."""
+
+    def __init__(self, message: str):
+        self.message = message
+        super().__init__(self.message)
+
+
+class UnexpectedValidationError(ValidationError):
+    """Raised when an unexpected error occurs during connector validation.
+
+    Unexpected errors don't necessarily mean the credential is invalid,
+    but rather that there was an error during the validation process
+    or we encountered a currently unhandled error case.
+
+    Currently, unexpected validation errors are defined as transient and should not be
+    used to disable the connector.
+    """
+
+    def __init__(self, message: str = "Unexpected error during connector validation"):
+        super().__init__(message)
+
+
+class CredentialInvalidError(ConnectorValidationError):
+    """Raised when a connector's credential is invalid."""
+
+    def __init__(self, message: str = "Credential is invalid"):
+        super().__init__(message)
+
+
+class CredentialExpiredError(ConnectorValidationError):
+    """Raised when a connector's credential is expired."""
+
+    def __init__(self, message: str = "Credential has expired"):
+        super().__init__(message)
+
+
+class InsufficientPermissionsError(ConnectorValidationError):
+    """Raised when the credential does not have sufficient API permissions."""
+
+    def __init__(
+        self, message: str = "Insufficient permissions for the requested operation"
+    ):
+        super().__init__(message)
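Review note: a small sketch of what the new hierarchy implies for callers. The classes are re-declared in reduced form so the block runs on its own, and `should_disable_connector` is a hypothetical policy helper, not something defined in this diff. It relies on `UnexpectedValidationError` deriving from `ValidationError` rather than `ConnectorValidationError`, as the module above declares.

```python
class ValidationError(Exception):
    def __init__(self, message: str):
        self.message = message
        super().__init__(self.message)


class ConnectorValidationError(ValidationError):
    pass


class UnexpectedValidationError(ValidationError):
    def __init__(self, message: str = "Unexpected error during connector validation"):
        super().__init__(message)


class CredentialExpiredError(ConnectorValidationError):
    def __init__(self, message: str = "Credential has expired"):
        super().__init__(message)


def should_disable_connector(exc: Exception) -> bool:
    """Hypothetical policy: only definite connector validation errors disable.

    Unexpected validation errors are treated as transient, matching the
    docstring on UnexpectedValidationError.
    """
    return isinstance(exc, ConnectorValidationError)
```

An `except ConnectorValidationError` clause therefore does not catch `UnexpectedValidationError`; code that wants both has to catch `ValidationError`.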
--- a/backend/onyx/connectors/factory.py
+++ b/backend/onyx/connectors/factory.py
@@ -3,8 +3,8 @@ from typing import Type

 from sqlalchemy.orm import Session

+from onyx.configs.app_configs import INTEGRATION_TESTS_MODE
 from onyx.configs.constants import DocumentSource
-from onyx.configs.constants import DocumentSourceRequiringTenantContext
 from onyx.connectors.airtable.airtable_connector import AirtableConnector
 from onyx.connectors.asana.connector import AsanaConnector
 from onyx.connectors.axero.connector import AxeroConnector
@@ -12,11 +12,13 @@ from onyx.connectors.blob.connector import BlobStorageConnector
 from onyx.connectors.bookstack.connector import BookstackConnector
 from onyx.connectors.clickup.connector import ClickupConnector
 from onyx.connectors.confluence.connector import ConfluenceConnector
+from onyx.connectors.credentials_provider import OnyxDBCredentialsProvider
 from onyx.connectors.discord.connector import DiscordConnector
 from onyx.connectors.discourse.connector import DiscourseConnector
 from onyx.connectors.document360.connector import Document360Connector
 from onyx.connectors.dropbox.connector import DropboxConnector
 from onyx.connectors.egnyte.connector import EgnyteConnector
+from onyx.connectors.exceptions import ConnectorValidationError
 from onyx.connectors.file.connector import LocalFileConnector
 from onyx.connectors.fireflies.connector import FirefliesConnector
 from onyx.connectors.freshdesk.connector import FreshdeskConnector
@@ -31,7 +33,7 @@ from onyx.connectors.guru.connector import GuruConnector
 from onyx.connectors.hubspot.connector import HubSpotConnector
 from onyx.connectors.interfaces import BaseConnector
 from onyx.connectors.interfaces import CheckpointConnector
-from onyx.connectors.interfaces import ConnectorValidationError
+from onyx.connectors.interfaces import CredentialsConnector
 from onyx.connectors.interfaces import EventConnector
 from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.interfaces import PollConnector
@@ -55,9 +57,9 @@ from onyx.connectors.zendesk.connector import ZendeskConnector
 from onyx.connectors.zulip.connector import ZulipConnector
 from onyx.db.connector import fetch_connector_by_id
 from onyx.db.credentials import backend_update_credential_json
-from onyx.db.credentials import fetch_credential_by_id_for_user
+from onyx.db.credentials import fetch_credential_by_id
 from onyx.db.models import Credential
-from onyx.db.models import User
+from shared_configs.contextvars import get_current_tenant_id


 class ConnectorMissingException(Exception):
@@ -164,18 +166,21 @@ def instantiate_connector(
    input_type: InputType,
    connector_specific_config: dict[str, Any],
    credential: Credential,
-    tenant_id: str | None = None,
 ) -> BaseConnector:
    connector_class = identify_connector_class(source, input_type)

-    if source in DocumentSourceRequiringTenantContext:
-        connector_specific_config["tenant_id"] = tenant_id
-
    connector = connector_class(**connector_specific_config)
-    new_credentials = connector.load_credentials(credential.credential_json)

-    if new_credentials is not None:
-        backend_update_credential_json(credential, new_credentials, db_session)
+    if isinstance(connector, CredentialsConnector):
+        provider = OnyxDBCredentialsProvider(
+            get_current_tenant_id(), str(source), credential.id
+        )
+        connector.set_credentials_provider(provider)
+    else:
+        new_credentials = connector.load_credentials(credential.credential_json)
+
+        if new_credentials is not None:
+            backend_update_credential_json(credential, new_credentials, db_session)

    return connector

@@ -184,22 +189,30 @@ def validate_ccpair_for_user(
    connector_id: int,
    credential_id: int,
    db_session: Session,
-    user: User | None,
-    tenant_id: str | None,
-) -> None:
+    enforce_creation: bool = True,
+) -> bool:
+    if INTEGRATION_TESTS_MODE:
+        return True
+
    # Validate the connector settings
    connector = fetch_connector_by_id(connector_id, db_session)
-    credential = fetch_credential_by_id_for_user(
+    credential = fetch_credential_by_id(
        credential_id,
-        user,
        db_session,
-        get_editable=False,
    )
-    if not credential:
-        raise ValueError("Credential not found")
+
    if not connector:
        raise ValueError("Connector not found")

+    if (
+        connector.source == DocumentSource.INGESTION_API
+        or connector.source == DocumentSource.MOCK_CONNECTOR
+    ):
+        return True
+
+    if not credential:
+        raise ValueError("Credential not found")
+
    try:
        runnable_connector = instantiate_connector(
            db_session=db_session,
@@ -207,9 +220,14 @@ def validate_ccpair_for_user(
            input_type=connector.input_type,
            connector_specific_config=connector.connector_specific_config,
            credential=credential,
-            tenant_id=tenant_id,
        )
+    except ConnectorValidationError as e:
+        raise
    except Exception as e:
-        raise ConnectorValidationError(str(e))
+        if enforce_creation:
+            raise ConnectorValidationError(str(e))
+        else:
+            return False

    runnable_connector.validate_connector_settings()
+    return True
--- a/backend/onyx/connectors/file/connector.py
+++ b/backend/onyx/connectors/file/connector.py
@@ -10,25 +10,24 @@ from sqlalchemy.orm import Session

 from onyx.configs.app_configs import INDEX_BATCH_SIZE
 from onyx.configs.constants import DocumentSource
+from onyx.configs.constants import FileOrigin
 from onyx.connectors.cross_connector_utils.miscellaneous_utils import time_str_to_utc
 from onyx.connectors.interfaces import GenerateDocumentsOutput
 from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.models import BasicExpertInfo
 from onyx.connectors.models import Document
 from onyx.connectors.models import Section
-from onyx.db.engine import get_session_with_tenant
-from onyx.file_processing.extract_file_text import detect_encoding
-from onyx.file_processing.extract_file_text import extract_file_text
+from onyx.connectors.vision_enabled_connector import VisionEnabledConnector
+from onyx.db.engine import get_session_with_current_tenant
+from onyx.db.pg_file_store import get_pgfilestore_by_file_name
+from onyx.file_processing.extract_file_text import extract_text_and_images
 from onyx.file_processing.extract_file_text import get_file_ext
-from onyx.file_processing.extract_file_text import is_text_file_extension
 from onyx.file_processing.extract_file_text import is_valid_file_ext
 from onyx.file_processing.extract_file_text import load_files_from_zip
-from onyx.file_processing.extract_file_text import read_pdf_file
-from onyx.file_processing.extract_file_text import read_text_file
+from onyx.file_processing.image_utils import store_image_and_create_section
 from onyx.file_store.file_store import get_default_file_store
+from onyx.llm.interfaces import LLM
 from onyx.utils.logger import setup_logger
-from shared_configs.configs import POSTGRES_DEFAULT_SCHEMA
-from shared_configs.contextvars import CURRENT_TENANT_ID_CONTEXTVAR

 logger = setup_logger()

@@ -37,81 +36,115 @@ def _read_files_and_metadata(
    file_name: str,
    db_session: Session,
 ) -> Iterator[tuple[str, IO, dict[str, Any]]]:
-    """Reads the file into IO, in the case of a zip file, yields each individual
-    file contained within, also includes the metadata dict if packaged in the zip"""
+    """
+    Reads the file from Postgres. If the file is a .zip, yields subfiles.
+    """
    extension = get_file_ext(file_name)
    metadata: dict[str, Any] = {}
    directory_path = os.path.dirname(file_name)

+    # Read file from Postgres store
    file_content = get_default_file_store(db_session).read_file(file_name, mode="b")

+    # If it's a zip, expand it
    if extension == ".zip":
-        for file_info, file, metadata in load_files_from_zip(
+        for file_info, subfile, metadata in load_files_from_zip(
            file_content, ignore_dirs=True
        ):
-            yield os.path.join(directory_path, file_info.filename), file, metadata
+            yield os.path.join(directory_path, file_info.filename), subfile, metadata
    elif is_valid_file_ext(extension):
        yield file_name, file_content, metadata
    else:
        logger.warning(f"Skipping file '{file_name}' with extension '{extension}'")


+def _create_image_section(
+    llm: LLM | None,
+    image_data: bytes,
+    db_session: Session,
+    parent_file_name: str,
+    display_name: str,
+    idx: int = 0,
+) -> tuple[Section, str | None]:
+    """
+    Create a Section object for a single image and store the image in PGFileStore.
+    If summarization is enabled and we have an LLM, summarize the image.
+
+    Returns:
+        tuple: (Section object, file_name in PGFileStore or None if storage failed)
+    """
+    # Create a unique file name for the embedded image
+    file_name = f"{parent_file_name}_embedded_{idx}"
+
+    # Use the standardized utility to store the image and create a section
+    return store_image_and_create_section(
+        db_session=db_session,
+        image_data=image_data,
+        file_name=file_name,
+        display_name=display_name,
+        llm=llm,
+        file_origin=FileOrigin.OTHER,
+    )
+
+
 def _process_file(
    file_name: str,
    file: IO[Any],
-    metadata: dict[str, Any] | None = None,
-    pdf_pass: str | None = None,
+    metadata: dict[str, Any] | None,
+    pdf_pass: str | None,
+    db_session: Session,
+    llm: LLM | None,
 ) -> list[Document]:
+    """
+    Processes a single file, returning a list of Documents (typically one).
+    Also handles embedded images if 'EMBEDDED_IMAGE_EXTRACTION_ENABLED' is true.
+    """
    extension = get_file_ext(file_name)
-    if not is_valid_file_ext(extension):
-        logger.warning(f"Skipping file '{file_name}' with extension '{extension}'")
+
+    # Fetch the DB record so we know the ID for internal URL
+    pg_record = get_pgfilestore_by_file_name(file_name=file_name, db_session=db_session)
+    if not pg_record:
+        logger.warning(f"No file record found for '{file_name}' in PG; skipping.")
        return []

-    file_metadata: dict[str, Any] = {}
-
-    if is_text_file_extension(file_name):
-        encoding = detect_encoding(file)
-        file_content_raw, file_metadata = read_text_file(
-            file, encoding=encoding, ignore_onyx_metadata=False
+    if not is_valid_file_ext(extension):
+        logger.warning(
+            f"Skipping file '{file_name}' with unrecognized extension '{extension}'"
        )
+        return []

-    # Using the PDF reader function directly to pass in password cleanly
-    elif extension == ".pdf" and pdf_pass is not None:
-        file_content_raw, file_metadata = read_pdf_file(file=file, pdf_pass=pdf_pass)
+    # Prepare doc metadata
+    if metadata is None:
+        metadata = {}
+    file_display_name = metadata.get("file_display_name") or os.path.basename(file_name)

-    else:
-        file_content_raw = extract_file_text(
-            file=file,
-            file_name=file_name,
-            break_on_unprocessable=True,
-        )
-
-    all_metadata = {**metadata, **file_metadata} if metadata else file_metadata
-
-    # add a prefix to avoid conflicts with other connectors
-    doc_id = f"FILE_CONNECTOR__{file_name}"
-    if metadata:
-        doc_id = metadata.get("document_id") or doc_id
-
-    # If this is set, we will show this in the UI as the "name" of the file
-    file_display_name = all_metadata.get("file_display_name") or os.path.basename(
-        file_name
-    )
-    title = (
-        all_metadata["title"] or "" if "title" in all_metadata else file_display_name
-    )
-
-    time_updated = all_metadata.get("time_updated", datetime.now(timezone.utc))
+    # Timestamps
+    current_datetime = datetime.now(timezone.utc)
+    time_updated = metadata.get("time_updated", current_datetime)
    if isinstance(time_updated, str):
        time_updated = time_str_to_utc(time_updated)

-    dt_str = all_metadata.get("doc_updated_at")
+    dt_str = metadata.get("doc_updated_at")
    final_time_updated = time_str_to_utc(dt_str) if dt_str else time_updated

-    # Metadata tags separate from the Onyx specific fields
+    # Collect owners
+    p_owner_names = metadata.get("primary_owners")
+    s_owner_names = metadata.get("secondary_owners")
+    p_owners = (
+        [BasicExpertInfo(display_name=name) for name in p_owner_names]
+        if p_owner_names
+        else None
+    )
+    s_owners = (
+        [BasicExpertInfo(display_name=name) for name in s_owner_names]
+        if s_owner_names
+        else None
+    )
+
+    # Additional tags we store as doc metadata
    metadata_tags = {
        k: v
-        for k, v in all_metadata.items()
+        for k, v in metadata.items()
        if k
        not in [
            "document_id",
@@ -124,91 +157,151 @@ def _process_file(
            "file_display_name",
            "title",
            "connector_type",
+            "pdf_password",
        ]
    }

-    source_type_str = all_metadata.get("connector_type")
-    source_type = DocumentSource(source_type_str) if source_type_str else None
-
-    p_owner_names = all_metadata.get("primary_owners")
-    s_owner_names = all_metadata.get("secondary_owners")
-    p_owners = (
-        [BasicExpertInfo(display_name=name) for name in p_owner_names]
-        if p_owner_names
-        else None
-    )
-    s_owners = (
-        [BasicExpertInfo(display_name=name) for name in s_owner_names]
-        if s_owner_names
-        else None
+    source_type_str = metadata.get("connector_type")
+    source_type = (
+        DocumentSource(source_type_str) if source_type_str else DocumentSource.FILE
    )

+    doc_id = metadata.get("document_id") or f"FILE_CONNECTOR__{file_name}"
+    title = metadata.get("title") or file_display_name
+
+    # 1) If the file itself is an image, handle that scenario quickly
+    IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
+    if extension in IMAGE_EXTENSIONS:
+        # Summarize or produce empty doc
+        image_data = file.read()
+        image_section, _ = _create_image_section(
+            llm, image_data, db_session, pg_record.file_name, title
+        )
+        return [
+            Document(
+                id=doc_id,
+                sections=[image_section],
+                source=source_type,
+                semantic_identifier=file_display_name,
+                title=title,
+                doc_updated_at=final_time_updated,
+                primary_owners=p_owners,
+                secondary_owners=s_owners,
+                metadata=metadata_tags,
+            )
+        ]
+
+    # 2) Otherwise: text-based approach. Possibly with embedded images if enabled.
+    #    (For example .docx with inline images).
+    file.seek(0)
+    text_content = ""
+    embedded_images: list[tuple[bytes, str]] = []
+
+    text_content, embedded_images = extract_text_and_images(
+        file=file,
+        file_name=file_name,
+        pdf_pass=pdf_pass,
+    )
+
+    # Build sections: first the text as a single Section
+    sections = []
+    link_in_meta = metadata.get("link")
+    if text_content.strip():
+        sections.append(Section(link=link_in_meta, text=text_content.strip()))
+
+    # Then any extracted images from docx, etc.
+    for idx, (img_data, img_name) in enumerate(embedded_images, start=1):
+        # Store each embedded image as a separate file in PGFileStore
+        # and create a section with the image summary
+        image_section, _ = _create_image_section(
+            llm,
+            img_data,
+            db_session,
+            pg_record.file_name,
+            f"{title} - image {idx}",
+            idx,
+        )
+        sections.append(image_section)
    return [
        Document(
            id=doc_id,
-            sections=[
-                Section(link=all_metadata.get("link"), text=file_content_raw.strip())
-            ],
-            source=source_type or DocumentSource.FILE,
+            sections=sections,
+            source=source_type,
            semantic_identifier=file_display_name,
            title=title,
            doc_updated_at=final_time_updated,
            primary_owners=p_owners,
            secondary_owners=s_owners,
-            # currently metadata just houses tags, other stuff like owners / updated at have dedicated fields
            metadata=metadata_tags,
        )
    ]


-class LocalFileConnector(LoadConnector):
+class LocalFileConnector(LoadConnector, VisionEnabledConnector):
+    """
+    Connector that reads files from Postgres and yields Documents, including
+    optional embedded image extraction.
+    """
+
    def __init__(
        self,
        file_locations: list[Path | str],
-        tenant_id: str = POSTGRES_DEFAULT_SCHEMA,
        batch_size: int = INDEX_BATCH_SIZE,
    ) -> None:
-        self.file_locations = [Path(file_location) for file_location in file_locations]
+        self.file_locations = [str(loc) for loc in file_locations]
        self.batch_size = batch_size
-        self.tenant_id = tenant_id
        self.pdf_pass: str | None = None

+        # Initialize vision LLM using the mixin
+        self.initialize_vision_llm()
+
    def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
        self.pdf_pass = credentials.get("pdf_password")
+
        return None

    def load_from_state(self) -> GenerateDocumentsOutput:
+        """
+        Iterates over each file path, fetches from Postgres, tries to parse text
+        or images, and yields Document batches.
+        """
        documents: list[Document] = []
-        token = CURRENT_TENANT_ID_CONTEXTVAR.set(self.tenant_id)

-        with get_session_with_tenant(tenant_id=self.tenant_id) as db_session:
+        with get_session_with_current_tenant() as db_session:
            for file_path in self.file_locations:
                current_datetime = datetime.now(timezone.utc)
-                files = _read_files_and_metadata(
-                    file_name=str(file_path), db_session=db_session
+
+                files_iter = _read_files_and_metadata(
+                    file_name=file_path,
+                    db_session=db_session,
                )

-                for file_name, file, metadata in files:
+                for actual_file_name, file, metadata in files_iter:
                    metadata["time_updated"] = metadata.get(
                        "time_updated", current_datetime
                    )
-                    documents.extend(
-                        _process_file(file_name, file, metadata, self.pdf_pass)
+                    new_docs = _process_file(
+                        file_name=actual_file_name,
+                        file=file,
+                        metadata=metadata,
+                        pdf_pass=self.pdf_pass,
+                        db_session=db_session,
+                        llm=self.image_analysis_llm,
                    )
+                    documents.extend(new_docs)

                    if len(documents) >= self.batch_size:
                        yield documents
+
                        documents = []

            if documents:
                yield documents

-        CURRENT_TENANT_ID_CONTEXTVAR.reset(token)
-

 if __name__ == "__main__":
    connector = LocalFileConnector(file_locations=[os.environ["TEST_FILE"]])
-    connector.load_credentials({"pdf_password": os.environ["PDF_PASSWORD"]})
-
-    document_batches = connector.load_from_state()
-    print(next(document_batches))
+    connector.load_credentials({"pdf_password": os.environ.get("PDF_PASSWORD")})
+    doc_batches = connector.load_from_state()
+    for batch in doc_batches:
+        print("BATCH:", batch)
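Review note: the batching in `load_from_state` follows a common accumulate-and-flush pattern. The sketch below isolates it; `batched` is a hypothetical helper, not part of this diff, and it appends one item at a time, whereas the connector extends by a list and so may yield batches slightly larger than `batch_size`.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to batch_size items, flushing the remainder at the end."""
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            # Start a fresh list so the yielded batch is not mutated afterwards.
            batch = []
    if batch:
        yield batch
```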
--- a/backend/onyx/connectors/gitbook/connector.py
+++ b/backend/onyx/connectors/gitbook/connector.py
@@ -229,16 +229,20 @@ class GitbookConnector(LoadConnector, PollConnector):

        try:
            content = self.client.get(f"/spaces/{self.space_id}/content")
-            pages = content.get("pages", [])
-
+            pages: list[dict[str, Any]] = content.get("pages", [])
            current_batch: list[Document] = []
-            for page in pages:
-                updated_at = datetime.fromisoformat(page["updatedAt"])

+            while pages:
+                page = pages.pop(0)
+
+                updated_at_raw = page.get("updatedAt")
+                if updated_at_raw is None:
+                    # if updatedAt is not present, that means the page has never been edited
+                    continue
+
+                updated_at = datetime.fromisoformat(updated_at_raw)
                if start and updated_at < start:
-                    if current_batch:
-                        yield current_batch
-                    return
+                    continue
                if end and updated_at > end:
                    continue

@@ -250,6 +254,8 @@ class GitbookConnector(LoadConnector, PollConnector):
                    yield current_batch
                    current_batch = []

+                pages.extend(page.get("pages", []))
+
            if current_batch:
                yield current_batch

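Review note: the change above turns the flat loop into a queue-based walk over nested pages. A minimal sketch of that traversal follows, using `collections.deque` instead of `list.pop(0)`. `iter_pages` is a hypothetical name, and the sketch enqueues children before filtering, which is an assumption about intended behavior: in the diff, the `continue` branches run before `pages.extend(...)`, so children of skipped pages are not visited there.

```python
from collections import deque
from datetime import datetime
from typing import Any, Iterator, Optional


def iter_pages(
    pages: list[dict[str, Any]],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> Iterator[dict[str, Any]]:
    """Breadth-first walk over nested pages, filtered by updatedAt."""
    queue = deque(pages)
    while queue:
        page = queue.popleft()
        # Assumption: enqueue children first so a filtered parent never hides them.
        queue.extend(page.get("pages", []))

        updated_at_raw = page.get("updatedAt")
        if updated_at_raw is None:
            # A missing updatedAt means the page has never been edited.
            continue

        updated_at = datetime.fromisoformat(updated_at_raw)
        if start and updated_at < start:
            continue
        if end and updated_at > end:
            continue
        yield page
```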
--- a/backend/onyx/connectors/github/connector.py
+++ b/backend/onyx/connectors/github/connector.py
@@ -17,14 +17,14 @@ from github.PullRequest import PullRequest
 from onyx.configs.app_configs import GITHUB_CONNECTOR_BASE_URL
 from onyx.configs.app_configs import INDEX_BATCH_SIZE
 from onyx.configs.constants import DocumentSource
-from onyx.connectors.interfaces import ConnectorValidationError
-from onyx.connectors.interfaces import CredentialExpiredError
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import CredentialExpiredError
+from onyx.connectors.exceptions import InsufficientPermissionsError
+from onyx.connectors.exceptions import UnexpectedValidationError
 from onyx.connectors.interfaces import GenerateDocumentsOutput
-from onyx.connectors.interfaces import InsufficientPermissionsError
 from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.interfaces import PollConnector
 from onyx.connectors.interfaces import SecondsSinceUnixEpoch
-from onyx.connectors.interfaces import UnexpectedError
 from onyx.connectors.models import ConnectorMissingCredentialError
 from onyx.connectors.models import Document
 from onyx.connectors.models import Section
@@ -124,14 +124,14 @@ class GithubConnector(LoadConnector, PollConnector):
    def __init__(
        self,
        repo_owner: str,
-        repo_name: str,
+        repositories: str | None = None,
        batch_size: int = INDEX_BATCH_SIZE,
        state_filter: str = "all",
        include_prs: bool = True,
        include_issues: bool = False,
    ) -> None:
        self.repo_owner = repo_owner
-        self.repo_name = repo_name
+        self.repositories = repositories
        self.batch_size = batch_size
        self.state_filter = state_filter
        self.include_prs = include_prs
@@ -157,58 +157,123 @@ class GithubConnector(LoadConnector, PollConnector):
            )

        try:
-            return github_client.get_repo(f"{self.repo_owner}/{self.repo_name}")
+            return github_client.get_repo(f"{self.repo_owner}/{self.repositories}")
        except RateLimitExceededException:
            _sleep_after_rate_limit_exception(github_client)
            return self._get_github_repo(github_client, attempt_num + 1)

+    def _get_github_repos(
+        self, github_client: Github, attempt_num: int = 0
+    ) -> list[Repository.Repository]:
+        """Get specific repositories based on the comma-separated `repositories` string."""
+        if attempt_num > _MAX_NUM_RATE_LIMIT_RETRIES:
+            raise RuntimeError(
+                "Re-tried fetching repos too many times. Something is going wrong with fetching objects from Github"
+            )
+
+        try:
+            repos = []
+            # Split the repositories string by comma and strip whitespace
+            repo_names = [
+                name.strip() for name in (cast(str, self.repositories)).split(",")
+            ]
+
+            for repo_name in repo_names:
+                if repo_name:  # Skip empty strings
+                    try:
+                        repo = github_client.get_repo(f"{self.repo_owner}/{repo_name}")
+                        repos.append(repo)
+                    except GithubException as e:
+                        logger.warning(
+                            f"Could not fetch repo {self.repo_owner}/{repo_name}: {e}"
+                        )
+
+            return repos
+        except RateLimitExceededException:
+            _sleep_after_rate_limit_exception(github_client)
+            return self._get_github_repos(github_client, attempt_num + 1)
+
+    def _get_all_repos(
+        self, github_client: Github, attempt_num: int = 0
+    ) -> list[Repository.Repository]:
+        if attempt_num > _MAX_NUM_RATE_LIMIT_RETRIES:
+            raise RuntimeError(
+                "Re-tried fetching repos too many times. Something is going wrong with fetching objects from Github"
+            )
+
+        try:
+            # Try to get organization first
+            try:
+                org = github_client.get_organization(self.repo_owner)
+                return list(org.get_repos())
+            except GithubException:
+                # If not an org, try as a user
+                user = github_client.get_user(self.repo_owner)
+                return list(user.get_repos())
+        except RateLimitExceededException:
+            _sleep_after_rate_limit_exception(github_client)
+            return self._get_all_repos(github_client, attempt_num + 1)
+
    def _fetch_from_github(
        self, start: datetime | None = None, end: datetime | None = None
    ) -> GenerateDocumentsOutput:
        if self.github_client is None:
            raise ConnectorMissingCredentialError("GitHub")

-        repo = self._get_github_repo(self.github_client)
+        repos = []
+        if self.repositories:
+            if "," in self.repositories:
+                # Multiple repositories specified
+                repos = self._get_github_repos(self.github_client)
+            else:
+                # Single repository (backward compatibility)
+                repos = [self._get_github_repo(self.github_client)]
+        else:
+            # All repositories
+            repos = self._get_all_repos(self.github_client)

-        if self.include_prs:
-            pull_requests = repo.get_pulls(
-                state=self.state_filter, sort="updated", direction="desc"
-            )
+        for repo in repos:
+            if self.include_prs:
+                logger.info(f"Fetching PRs for repo: {repo.name}")
+                pull_requests = repo.get_pulls(
+                    state=self.state_filter, sort="updated", direction="desc"
+                )

-            for pr_batch in _batch_github_objects(
-                pull_requests, self.github_client, self.batch_size
-            ):
-                doc_batch: list[Document] = []
-                for pr in pr_batch:
-                    if start is not None and pr.updated_at < start:
-                        yield doc_batch
-                        return
-                    if end is not None and pr.updated_at > end:
-                        continue
-                    doc_batch.append(_convert_pr_to_document(cast(PullRequest, pr)))
-                yield doc_batch
+                for pr_batch in _batch_github_objects(
+                    pull_requests, self.github_client, self.batch_size
+                ):
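+                    doc_batch: list[Document] = []
+                    reached_start = False
+                    for pr in pr_batch:
+                        if start is not None and pr.updated_at < start:
+                            # Sorted by most recently updated first, so every
+                            # remaining PR in this repo is older than `start`.
+                            reached_start = True
+                            break
+                        if end is not None and pr.updated_at > end:
+                            continue
+                        doc_batch.append(_convert_pr_to_document(cast(PullRequest, pr)))
+                    yield doc_batch
+                    if reached_start:
+                        # Stop paging PRs for this repo; issues and other repos still run
+                        break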

-        if self.include_issues:
-            issues = repo.get_issues(
-                state=self.state_filter, sort="updated", direction="desc"
-            )
+            if self.include_issues:
+                logger.info(f"Fetching issues for repo: {repo.name}")
+                issues = repo.get_issues(
+                    state=self.state_filter, sort="updated", direction="desc"
+                )

-            for issue_batch in _batch_github_objects(
-                issues, self.github_client, self.batch_size
-            ):
-                doc_batch = []
-                for issue in issue_batch:
-                    issue = cast(Issue, issue)
-                    if start is not None and issue.updated_at < start:
-                        yield doc_batch
-                        return
-                    if end is not None and issue.updated_at > end:
-                        continue
-                    if issue.pull_request is not None:
-                        # PRs are handled separately
-                        continue
-                    doc_batch.append(_convert_issue_to_document(issue))
-                yield doc_batch
+                for issue_batch in _batch_github_objects(
+                    issues, self.github_client, self.batch_size
+                ):
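+                    doc_batch = []
+                    reached_start = False
+                    for issue in issue_batch:
+                        issue = cast(Issue, issue)
+                        if start is not None and issue.updated_at < start:
+                            # Sorted by most recently updated first, so every
+                            # remaining issue in this repo is older than `start`.
+                            reached_start = True
+                            break
+                        if end is not None and issue.updated_at > end:
+                            continue
+                        if issue.pull_request is not None:
+                            # PRs are handled separately
+                            continue
+                        doc_batch.append(_convert_issue_to_document(issue))
+                    yield doc_batch
+                    if reached_start:
+                        # Stop paging issues for this repo; move on to the next repo
+                        break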

    def load_from_state(self) -> GenerateDocumentsOutput:
        return self._fetch_from_github()
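
The removed lines used `return` to stop at the first item older than `start`; inside a per-repo loop, `return` would end the whole run, while a bare `break` leaves only the innermost loop. A minimal sketch of one way to stop just the current listing, assuming items arrive newest first. `batch_until_older_than` and its plain `datetime` items are hypothetical stand-ins, not part of the connector.

```python
from datetime import datetime
from typing import Iterable, Iterator, Optional


def batch_until_older_than(
    batches: Iterable[list[datetime]], start: Optional[datetime]
) -> Iterator[list[datetime]]:
    """Yield filtered batches from a newest-first stream, stopping at `start`."""
    for batch in batches:
        kept: list[datetime] = []
        reached_start = False
        for updated_at in batch:
            if start is not None and updated_at < start:
                # Everything after this point is older, so stop scanning.
                reached_start = True
                break
            kept.append(updated_at)
        # Yield exactly once per batch, then leave the outer loop if needed.
        yield kept
        if reached_start:
            break
```

The flag keeps a single `yield` per batch, so the caller never sees the same partial batch twice, and the caller's own loop over repositories is free to continue.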
@@ -234,19 +299,66 @@ class GithubConnector(LoadConnector, PollConnector):
        if self.github_client is None:
            raise ConnectorMissingCredentialError("GitHub credentials not loaded.")

-        if not self.repo_owner or not self.repo_name:
+        if not self.repo_owner:
            raise ConnectorValidationError(
-                "Invalid connector settings: 'repo_owner' and 'repo_name' must be provided."
+                "Invalid connector settings: 'repo_owner' must be provided."
            )

        try:
-            test_repo = self.github_client.get_repo(
-                f"{self.repo_owner}/{self.repo_name}"
-            )
-            test_repo.get_contents("")
+            if self.repositories:
+                if "," in self.repositories:
+                    # Multiple repositories specified
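+                    # Drop empty entries so the emptiness check below can fire
+                    # (str.split always returns at least one element)
+                    repo_names = [
+                        name.strip()
+                        for name in self.repositories.split(",")
+                        if name.strip()
+                    ]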
+                    if not repo_names:
+                        raise ConnectorValidationError(
+                            "Invalid connector settings: No valid repository names provided."
+                        )
+
+                    # Validate at least one repository exists and is accessible
+                    valid_repos = False
+                    validation_errors = []
+
+                    for repo_name in repo_names:
+                        if not repo_name:
+                            continue
+
+                        try:
+                            test_repo = self.github_client.get_repo(
+                                f"{self.repo_owner}/{repo_name}"
+                            )
+                            test_repo.get_contents("")
+                            valid_repos = True
+                            # If at least one repo is valid, we can proceed
+                            break
+                        except GithubException as e:
+                            validation_errors.append(
+                                f"Repository '{repo_name}': {e.data.get('message', str(e))}"
+                            )
+
+                    if not valid_repos:
+                        error_msg = (
+                            "None of the specified repositories could be accessed: "
+                        )
+                        error_msg += ", ".join(validation_errors)
+                        raise ConnectorValidationError(error_msg)
+                else:
+                    # Single repository (backward compatibility)
+                    test_repo = self.github_client.get_repo(
+                        f"{self.repo_owner}/{self.repositories}"
+                    )
+                    test_repo.get_contents("")
+            else:
+                # Try to get organization first
+                try:
+                    org = self.github_client.get_organization(self.repo_owner)
+                    org.get_repos().totalCount  # Just check if we can access repos
+                except GithubException:
+                    # If not an org, try as a user
+                    user = self.github_client.get_user(self.repo_owner)
+                    user.get_repos().totalCount  # Just check if we can access repos

        except RateLimitExceededException:
-            raise UnexpectedError(
+            raise UnexpectedValidationError(
                "Validation failed due to GitHub rate-limits being exceeded. Please try again later."
            )

@@ -260,13 +372,24 @@ class GithubConnector(LoadConnector, PollConnector):
                    "Your GitHub token does not have sufficient permissions for this repository (HTTP 403)."
                )
            elif e.status == 404:
-                raise ConnectorValidationError(
-                    f"GitHub repository not found with name: {self.repo_owner}/{self.repo_name}"
-                )
+                if self.repositories:
+                    if "," in self.repositories:
+                        raise ConnectorValidationError(
+                            f"None of the specified GitHub repositories could be found for owner: {self.repo_owner}"
+                        )
+                    else:
+                        raise ConnectorValidationError(
+                            f"GitHub repository not found with name: {self.repo_owner}/{self.repositories}"
+                        )
+                else:
+                    raise ConnectorValidationError(
+                        f"GitHub user or organization not found: {self.repo_owner}"
+                    )
            else:
                raise ConnectorValidationError(
                    f"Unexpected GitHub error (status={e.status}): {e.data}"
                )
+
        except Exception as exc:
            raise Exception(
                f"Unexpected error during GitHub settings validation: {exc}"
@@ -278,7 +401,7 @@ if __name__ == "__main__":

    connector = GithubConnector(
        repo_owner=os.environ["REPO_OWNER"],
-        repo_name=os.environ["REPO_NAME"],
+        repositories=os.environ["REPOSITORIES"],
    )
    connector.load_credentials(
        {"github_access_token": os.environ["GITHUB_ACCESS_TOKEN"]}
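
The comma-separated `repositories` setting is parsed in two places above (fetching and validation). A minimal sketch of a shared parser, assuming the same split-and-strip rules; `parse_repo_names` is a hypothetical helper name, not something this change adds.

```python
from typing import Optional


def parse_repo_names(repositories: Optional[str]) -> list[str]:
    """Split a comma-separated repo list, dropping blanks and whitespace."""
    if not repositories:
        return []
    # "a, b ,".split(",") gives ["a", " b ", ""], so strip and filter.
    return [name.strip() for name in repositories.split(",") if name.strip()]
```

An empty result can then be treated as "no valid repository names", which a bare `split(",")` cannot signal because it never returns an empty list.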
--- a/backend/onyx/connectors/gmail/connector.py
+++ b/backend/onyx/connectors/gmail/connector.py
@@ -305,6 +305,7 @@ class GmailConnector(LoadConnector, PollConnector, SlimConnector):
                    userId=user_email,
                    fields=THREAD_FIELDS,
                    id=thread["id"],
+                    continue_on_404_or_403=True,
                )
                # full_threads is an iterator containing a single thread
                # so we need to convert it to a list and grab the first element
@@ -336,6 +337,7 @@ class GmailConnector(LoadConnector, PollConnector, SlimConnector):
                userId=user_email,
                fields=THREAD_LIST_FIELDS,
                q=query,
+                continue_on_404_or_403=True,
            ):
                doc_batch.append(
                    SlimDocument(
--- a/backend/onyx/connectors/google_drive/connector.py
+++ b/backend/onyx/connectors/google_drive/connector.py
@@ -4,15 +4,16 @@ from concurrent.futures import as_completed
 from concurrent.futures import ThreadPoolExecutor
 from functools import partial
 from typing import Any
-from typing import cast

 from google.oauth2.credentials import Credentials as OAuthCredentials  # type: ignore
 from google.oauth2.service_account import Credentials as ServiceAccountCredentials  # type: ignore
 from googleapiclient.errors import HttpError  # type: ignore

 from onyx.configs.app_configs import INDEX_BATCH_SIZE
-from onyx.configs.app_configs import MAX_FILE_SIZE_BYTES
 from onyx.configs.constants import DocumentSource
+from onyx.connectors.exceptions import ConnectorValidationError
+from onyx.connectors.exceptions import CredentialExpiredError
+from onyx.connectors.exceptions import InsufficientPermissionsError
 from onyx.connectors.google_drive.doc_conversion import build_slim_document
 from onyx.connectors.google_drive.doc_conversion import (
    convert_drive_item_to_document,
@@ -33,7 +34,6 @@ from onyx.connectors.google_utils.shared_constants import (
 )
 from onyx.connectors.google_utils.shared_constants import MISSING_SCOPES_ERROR_STR
 from onyx.connectors.google_utils.shared_constants import ONYX_SCOPE_INSTRUCTIONS
-from onyx.connectors.google_utils.shared_constants import SCOPE_DOC_URL
 from onyx.connectors.google_utils.shared_constants import SLIM_BATCH_SIZE
 from onyx.connectors.google_utils.shared_constants import USER_FIELDS
 from onyx.connectors.interfaces import GenerateDocumentsOutput
@@ -42,7 +42,10 @@ from onyx.connectors.interfaces import LoadConnector
 from onyx.connectors.interfaces import PollConnector
 from onyx.connectors.interfaces import SecondsSinceUnixEpoch
 from onyx.connectors.interfaces import SlimConnector
+from onyx.connectors.models import ConnectorMissingCredentialError
+from onyx.connectors.vision_enabled_connector import VisionEnabledConnector
 from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
+from onyx.llm.interfaces import LLM
 from onyx.utils.logger import setup_logger
 from onyx.utils.retry_wrapper import retry_builder

@@ -62,7 +65,10 @@ def _extract_ids_from_urls(urls: list[str]) -> list[str]:


 def _convert_single_file(
-    creds: Any, primary_admin_email: str, file: dict[str, Any]
+    creds: Any,
+    primary_admin_email: str,
+    file: dict[str, Any],
+    image_analysis_llm: LLM | None,
 ) -> Any:
    user_email = file.get("owners", [{}])[0].get("emailAddress") or primary_admin_email
    user_drive_service = get_drive_service(creds, user_email=user_email)
@@ -71,11 +77,14 @@ def _convert_single_file(
        file=file,
        drive_service=user_drive_service,
        docs_service=docs_service,
+        image_analysis_llm=image_analysis_llm,  # pass the LLM so doc_conversion can summarize images
    )


 def _process_files_batch(
-    files: list[GoogleDriveFileType], convert_func: Callable, batch_size: int
+    files: list[GoogleDriveFileType],
+    convert_func: Callable[[GoogleDriveFileType], Any],
+    batch_size: int,
 ) -> GenerateDocumentsOutput:
    doc_batch = []
    with ThreadPoolExecutor(max_workers=min(16, len(files))) as executor:
@@ -107,7 +116,9 @@ def _clean_requested_drive_ids(
    return valid_requested_drive_ids, filtered_folder_ids


-class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
+class GoogleDriveConnector(
+    LoadConnector, PollConnector, SlimConnector, VisionEnabledConnector
+):
    def __init__(
        self,
        include_shared_drives: bool = False,
@@ -125,23 +136,23 @@ class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
        continue_on_failure: bool | None = None,
    ) -> None:
        # Check for old input parameters
-        if (
-            folder_paths is not None
-            or include_shared is not None
-            or follow_shortcuts is not None
-            or only_org_public is not None
-            or continue_on_failure is not None
-        ):
-            logger.exception(
-                "Google Drive connector received old input parameters. "
-                "Please visit the docs for help with the new setup: "
-                f"{SCOPE_DOC_URL}"
+        if folder_paths is not None:
+            logger.warning(
+                "The 'folder_paths' parameter is deprecated. Use 'shared_folder_urls' instead."
            )
-            raise ValueError(
-                "Google Drive connector received old input parameters. "
-                "Please visit the docs for help with the new setup: "
-                f"{SCOPE_DOC_URL}"
+        if include_shared is not None:
+            logger.warning(
+                "The 'include_shared' parameter is deprecated. Use 'include_files_shared_with_me' instead."
            )
+        if follow_shortcuts is not None:
+            logger.warning("The 'follow_shortcuts' parameter is deprecated.")
+        if only_org_public is not None:
+            logger.warning("The 'only_org_public' parameter is deprecated.")
+        if continue_on_failure is not None:
+            logger.warning("The 'continue_on_failure' parameter is deprecated.")
+
+        # Initialize vision LLM using the mixin
+        self.initialize_vision_llm()

        if (
            not include_shared_drives
@@ -151,7 +162,7 @@ class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
            and not my_drive_emails
            and not shared_drive_urls
        ):
-            raise ValueError(
+            raise ConnectorValidationError(
                "Nothing to index. Please specify at least one of the following: "
                "include_shared_drives, include_my_drives, include_files_shared_with_me, "
                "shared_folder_urls, or my_drive_emails"
@@ -220,12 +231,20 @@ class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
        return self._creds

    def load_credentials(self, credentials: dict[str, Any]) -> dict[str, str] | None:
-        self._primary_admin_email = credentials[DB_CREDENTIALS_PRIMARY_ADMIN_KEY]
+        try:
+            self._primary_admin_email = credentials[DB_CREDENTIALS_PRIMARY_ADMIN_KEY]
+        except KeyError:
+            raise ValueError(
+                "Primary admin email missing from the supplied credentials."
+            )

        self._creds, new_creds_dict = get_google_creds(
            credentials=credentials,
            source=DocumentSource.GOOGLE_DRIVE,
        )
+
        return new_creds_dict

    def _update_traversed_parent_ids(self, folder_id: str) -> None:
@@ -297,7 +316,9 @@ class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
        # validate that the user has access to the drive APIs by performing a simple
        # request and checking for a 401
        try:
-            retry_builder()(get_root_folder_id)(drive_service)
+            # The default is ~17 minutes of retries. Use far fewer here so we don't
+            # waste 17 minutes every time we hit a user without access to the Drive APIs.
+            retry_builder(tries=3, delay=1)(get_root_folder_id)(drive_service)
        except HttpError as e:
            if e.status_code == 401:
                # fail gracefully, let the other impersonations continue
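
For readers unfamiliar with bounded retries, the idea behind passing `tries=3, delay=1` can be sketched as below. This is an illustrative stand-in only: `bounded_retry` is a hypothetical decorator with a fixed delay, and it makes no claim about how Onyx's `retry_builder` is implemented.

```python
import time
from functools import wraps
from typing import Any, Callable


def bounded_retry(
    tries: int = 3, delay: float = 1.0
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Retry a callable up to `tries` times, sleeping `delay` seconds between attempts."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        # Out of attempts: surface the last error to the caller.
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator
```

With three tries and a one-second delay, a permanently failing call gives up after roughly two seconds of waiting, which suits a quick access check.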
@@ -512,37 +533,53 @@ class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
        end: SecondsSinceUnixEpoch | None = None,
    ) -> GenerateDocumentsOutput:
        # Create a larger process pool for file conversion
-        convert_func = partial(
-            _convert_single_file, self.creds, self.primary_admin_email
-        )
-
-        # Process files in larger batches
-        LARGE_BATCH_SIZE = self.batch_size * 4
-        files_to_process = []
-        # Gather the files into batches to be processed in parallel
-        for file in self._fetch_drive_items(is_slim=False, start=start, end=end):
-            if (
-                file.get("size")
-                and int(cast(str, file.get("size"))) > MAX_FILE_SIZE_BYTES
-            ):
-                logger.warning(
-                    f"Skipping file {file.get('name', 'Unknown')} as it is too large: {file.get('size')} bytes"
-                )
-                continue
-
-            files_to_process.append(file)
-            if len(files_to_process) >= LARGE_BATCH_SIZE:
-                yield from _process_files_batch(
-                    files_to_process, convert_func, self.batch_size
-                )
-                files_to_process = []
-
-        # Process any remaining files
-        if files_to_process:
-            yield from _process_files_batch(
-                files_to_process, convert_func, self.batch_size
+        with ThreadPoolExecutor(max_workers=8) as executor:
+            # Prepare a partial function with the credentials and admin email
+            convert_func = partial(
+                _convert_single_file,
+                self.creds,
+                self.primary_admin_email,
+                image_analysis_llm=self.image_analysis_llm,  # Use the mixin's LLM
            )

+            # Fetch files in batches
+            files_batch: list[GoogleDriveFileType] = []
+            for file in self._fetch_drive_items(is_slim=False, start=start, end=end):
+                files_batch.append(file)
+
+                if len(files_batch) >= self.batch_size:
+                    # Process the batch
+                    futures = [
+                        executor.submit(convert_func, file) for file in files_batch
+                    ]
+                    documents = []
+                    for future in as_completed(futures):
+                        try:
+                            doc = future.result()
+                            if doc is not None:
+                                documents.append(doc)
+                        except Exception as e:
+                            logger.error(f"Error converting file: {e}")
+
+                    if documents:
+                        yield documents
+                    files_batch = []
+
+            # Process any remaining files
+            if files_batch:
+                futures = [executor.submit(convert_func, file) for file in files_batch]
+                documents = []
+                for future in as_completed(futures):
+                    try:
+                        doc = future.result()
+                        if doc is not None:
+                            documents.append(doc)
+                    except Exception as e:
+                        logger.error(f"Error converting file: {e}")
+
+                if documents:
+                    yield documents
+
    def load_from_state(self) -> GenerateDocumentsOutput:
        try:
            yield from self._extract_docs_from_google_drive()
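
The batch loop added above repeats the submit/collect logic for full batches and for the final partial batch. A minimal sketch of the same pattern with that logic factored into one helper; `convert_in_batches` and `run_batch` are hypothetical names, and the standard `logging` module stands in for the project's logger.

```python
import logging
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")
logger = logging.getLogger(__name__)


def convert_in_batches(
    items: Iterable[T],
    convert: Callable[[T], Optional[R]],
    batch_size: int,
    max_workers: int = 8,
) -> Iterator[list[R]]:
    """Convert items concurrently, yielding one list of results per input batch."""

    def run_batch(executor: Executor, batch: list[T]) -> list[R]:
        futures = [executor.submit(convert, item) for item in batch]
        results: list[R] = []
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                # One failed item should not abort the rest of the batch.
                logger.error(f"Error converting item: {e}")
                continue
            if result is not None:
                results.append(result)
        return results

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        batch: list[T] = []
        for item in items:
            batch.append(item)
            if len(batch) >= batch_size:
                results = run_batch(executor, batch)
                if results:
                    yield results
                batch = []
        # Flush whatever is left after the input is exhausted.
        if batch:
            results = run_batch(executor, batch)
            if results:
                yield results
```

Results within a batch arrive in completion order rather than input order, as with any use of `as_completed`.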
@@ -602,3 +639,50 @@ class GoogleDriveConnector(LoadConnector, PollConnector, SlimConnector):
            if MISSING_SCOPES_ERROR_STR in str(e):
                raise PermissionError(ONYX_SCOPE_INSTRUCTIONS) from e
            raise e
+
+    def validate_connector_settings(self) -> None:
+        if self._creds is None:
+            raise ConnectorMissingCredentialError(
+                "Google Drive credentials not loaded."
+            )
+
+        if self._primary_admin_email is None:
+            raise ConnectorValidationError(
+                "Primary admin email not found in credentials. "
+                "Ensure DB_CREDENTIALS_PRIMARY_ADMIN_KEY is set."
+            )
+
+        try:
+            drive_service = get_drive_service(self._creds, self._primary_admin_email)
+            drive_service.files().list(pageSize=1, fields="files(id)").execute()
+
+            if isinstance(self._creds, ServiceAccountCredentials):
+                retry_builder()(get_root_folder_id)(drive_service)
+
+        except HttpError as e:
+            status_code = e.resp.status if e.resp else None
+            if status_code == 401:
+                raise CredentialExpiredError(
+                    "Invalid or expired Google Drive credentials (401)."
+                )
+            elif status_code == 403:
+                raise InsufficientPermissionsError(
+                    "Google Drive app lacks required permissions (403). "
+                    "Please ensure the necessary scopes are granted and Drive "
+                    "apps are enabled."
+                )
+            else:
+                raise ConnectorValidationError(
+                    f"Unexpected Google Drive error (status={status_code}): {e}"
+                )
+
+        except Exception as e:
+            # Check for scope-related hints from the error message
+            if MISSING_SCOPES_ERROR_STR in str(e):
+                raise InsufficientPermissionsError(
+                    "Google Drive credentials are missing required scopes. "
+                    f"{ONYX_SCOPE_INSTRUCTIONS}"
+                )
+            raise ConnectorValidationError(
+                f"Unexpected error during Google Drive validation: {e}"
+            )
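
The new `validate_connector_settings` maps HTTP status codes to specific validation errors. A minimal sketch of that mapping in isolation; the exception classes below are local stand-ins that only borrow the names used in the diff, and `error_for_status` is a hypothetical helper.

```python
from typing import Optional


# Local stand-ins for illustration; the real classes live in onyx.connectors.exceptions.
class ConnectorValidationError(Exception):
    pass


class CredentialExpiredError(Exception):
    pass


class InsufficientPermissionsError(Exception):
    pass


def error_for_status(status_code: Optional[int], detail: str) -> Exception:
    """Pick the exception that best describes an HTTP failure during validation."""
    if status_code == 401:
        return CredentialExpiredError("Invalid or expired credentials (401).")
    if status_code == 403:
        return InsufficientPermissionsError("Missing required permissions (403).")
    # Anything else, including a missing status, is reported generically.
    return ConnectorValidationError(
        f"Unexpected error (status={status_code}): {detail}"
    )
```

A caller would `raise error_for_status(status, str(e))` from its `except` block, which keeps the status-to-exception policy in one testable place.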
Author	SHA1	Message	Date
pablonyx	0593d33cbe	tests fixed	2025-03-06 16:58:27 -08:00
pablonyx	56b108c313	k	2025-03-06 16:51:17 -08:00
pablonyx	e713a0c58a	k	2025-03-06 16:37:45 -08:00
pablonyx	d5d124c5db	possible confluence fix	2025-03-06 16:30:07 -08:00
pablonyx	3ec1d79034	run connector tests	2025-03-06 16:30:07 -08:00
pablonyx	bf4983e35a	Ensure consistent UX (#4222 ) * ux consistent * nit * Update web/src/app/admin/configuration/llm/interfaces.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>	2025-03-06 23:13:32 +00:00
evan-danswer	b7da91e3ae	improved basic search latency (#4186 ) * improved basic search latency * address PR comments + minor cleanup	2025-03-06 22:22:59 +00:00
Weves	29382656fc	Stop trying a million times for the user validity check	2025-03-06 15:35:49 -08:00
pablonyx	7d6db8d500	Comma separated list for Github repos (#4199 )	2025-03-06 14:46:57 -08:00
Chris Weaver	a7a374dc81	Confluence fixes (#4220 ) * Confluence fixes * Small tweak * Address greptile comments	2025-03-06 20:57:07 +00:00
rkuo-danswer	facc8cc2fa	add scope needed for permission sync (#4198 ) Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-06 20:03:38 +00:00
rkuo-danswer	2c0af0a0ca	Feature/helm updates (#4201 ) * add ingress for api and web * helm setup docs * add letsencrypt. close blocks * use pathType ImplementationSpecific as Prefix is deprecated * fix backend labels. configure nginx routes. update annotations * fix linting --------- Co-authored-by: Sajjad Anwar <sajjadkm@gmail.com> Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-06 19:48:20 +00:00
pablonyx	bfbc1cd954	k (#4172 )	2025-03-06 18:55:12 +00:00
pablonyx	626da583aa	Fix gated tenants (#4177 ) * fix * mypy .	2025-03-06 18:07:15 +00:00
pablonyx	92faca139d	Fix extra tenant mystery (#4197 ) * fix extra tenant mystery * nit	2025-03-06 18:06:49 +00:00
pablonyx	cec05c5ee9	Revert "k" This reverts commit `687122911d`.	2025-03-06 09:38:31 -08:00
Richard Kuo (Danswer)	eaf054ef06	oauth router went missing?	2025-03-05 15:50:23 -08:00
pablonyx	a7a1a24658	minor nit	2025-03-05 15:35:02 -08:00
pablonyx	687122911d	k	2025-03-05 15:27:14 -08:00
pablonyx	40953bd4fe	Workspace configs (#4202 )	2025-03-05 12:28:44 -08:00
rkuo-danswer	a7acc07e79	fix usage report pagination (#4183 ) * early work in progress * rename utility script * move actual data seeding to a shareable function * add test * make the test pass with the fix * fix comment --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-05 19:13:51 +00:00
pablonyx	b6e9e65bb8	* Replaces Amazon and Anthropic Icons with version better suitable fo… (#4190 ) * * Replaces Amazon and Anthropic Icons with version better suitable for both Dark and Light modes; * Adds icon for DeepSeek; * Simplify logic on icon selection; * Adds entries for Phi-4, Claude 3.7, Ministral and Gemini 2.0 models * nit * k * k --------- Co-authored-by: Emerson Gomes <emerson.gomes@thalesgroup.com>	2025-03-05 17:57:39 +00:00
pablonyx	20f2b9b2bb	Add image support for search (#4090 ) * add support for image search * quick fix up * k * k * k * k * nit * quick fix for connector tests	2025-03-05 17:44:18 +00:00
Chris Weaver	f731beca1f	Add ONYX_QUERY_HISTORY_TYPE to the dev compose files (#4196 )	2025-03-05 17:34:55 +00:00
Weves	fe246aecbb	Attempt to address tool happy claude	2025-03-05 09:47:27 -08:00
pablonyx	50ad066712	Better filtering (#4185 ) * k * k * k * k * k	2025-03-05 04:35:50 +00:00
rkuo-danswer	870b59a1cc	Bugfix/vertex crash (#4181 ) * Update text embedding model to version 005 and enhance embedding retrieval process * re * Fix formatting issues * Add support for Bedrock reranking provider and AWS credentials handling * fix: improve AWS key format validation and error messages * Fix vertex embedding model crash * feat: add environment template for local development setup * Add display name for Claude 3.7 Sonnet model * Add display names for Gemini 2.0 models and update Claude 3.7 Sonnet entry * Fix ruff errors by ensuring lines are within 130 characters * revert to currently default onyx browser settings * add / fix boto requirements --------- Co-authored-by: ferdinand loesch <f.loesch@sportradar.com> Co-authored-by: Ferdinand Loesch <ferdinandloesch@me.com> Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-05 01:59:46 +00:00
pablonyx	5c896cb0f7	add minor fixes (#4170 )	2025-03-04 20:29:28 +00:00
pablonyx	184b30643d	Nit: logging adjustments (#4182 )	2025-03-04 11:39:53 -08:00
pablonyx	ae585fd84c	Delete all chats (#4171 ) * nit * k	2025-03-04 10:00:08 -08:00
rkuo-danswer	61e8f371b9	fix blowing up the entire task on exception and trying to reuse an in… (#4179 ) * fix blowing up the entire task on exception and trying to reuse an invalid db session * list comprehension --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-04 00:57:27 +00:00
rkuo-danswer	33cc4be492	Bugfix/GitHub validation (#4173 ) * fixing unexpected errors disabling connectors * rename UnexpectedError to UnexpectedValidationError --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-04 00:09:49 +00:00
joachim-danswer	117c8c0d78	Enable ephemeral message responses by Onyx Slack Bots (#4142 ) A new setting 'is_ephemeral' has been added to the Slack channel configurations. Key features/effects: - if is_ephemeral is set for standard channel (and a Search Assistant is chosen): - the answer is only shown to user as an ephemeral message - the user has access to his private documents for a search (as the answer is only shown to them) - the user has the ability to share the answer with the channel or keep private - a recipient list cannot be defined if the channel is set up as ephemeral - if is_ephemeral is set and DM with bot: - the user has access to private docs in searches - the message is not sent as ephemeral, as it is a 1:1 discussion with bot - if is_ephemeral is not set but recipient list is set: - the user search does not have access to their private documents as the information goes to the recipient list team members, and they may have different access rights - Overall: - Unless the channel is set to is_ephemeral or it is a direct conversation with the Bot, only public docs are accessible - The ACL is never bypassed, also not in cases where the admin explicitly attached a document set to the bot config.	2025-03-03 15:02:21 -08:00
rkuo-danswer	9bb8cdfff1	fix web connector tests to handle new deduping (#4175 ) Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-03-03 20:54:20 +00:00
Weves	a52d0d29be	Small tweak to NumberInput	2025-03-03 11:20:53 -08:00
Chris Weaver	f25e1e80f6	Add option to not re-index (#4157 ) * Add option to not re-index * Add quantizaton / dimensionality override support * Fix build / ut	2025-03-03 10:54:11 -08:00
Yuhong Sun	39fd6919ad	Fix web scrolling	2025-03-03 09:00:05 -08:00
Yuhong Sun	7f0653d173	Handling of #! sites (#4169 )	2025-03-03 08:18:44 -08:00
SubashMohan	e9905a398b	Enhance iframe content extraction and add thresholds for JavaScript disabled scenarios (#4167 )	2025-03-02 19:29:10 -08:00
Brad Slavin	3ed44e8bae	Update Unstructured documentation URL to new location (#4168 )	2025-03-02 19:16:38 -08:00
pablonyx	64158a5bdf	silence_logs (#4165 )	2025-03-02 19:00:59 +00:00
pablonyx	afb2393596	fix dark mode index attempt failure (#4163 )	2025-03-02 01:23:16 +00:00
pablonyx	d473c4e876	Fix curator default persona editing (#4158 ) * k * k	2025-03-02 00:40:14 +00:00
pablonyx	692058092f	fix typo	2025-03-01 13:00:07 -08:00
pablonyx	e88325aad6	bump version (#4164 )	2025-03-01 01:58:45 +00:00
pablonyx	7490250e91	Fix user group edge case (#4159 ) * fix user group * k	2025-02-28 23:55:21 +00:00
pablonyx	e5369fcef8	Update warning copy (#4160 ) * k * k * quick nit	2025-02-28 23:46:21 +00:00
Yuhong Sun	b0f00953bc	Add CODEOWNERS	2025-02-28 13:57:33 -08:00
rkuo-danswer	f6a75c86c6	Bugfix/emit background error (#4156) * print the test name when it runs * type hints * can't reuse session after an exception * better logging --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-28 18:35:24 +00:00
pablonyx	ed9989282f	nit- update casing enforcement on frontend	2025-02-28 10:09:06 -08:00
pablonyx	e80a0f2716	Improved google connector flow (#4155) * fix handling * k * k * fix function * k * k	2025-02-28 05:13:39 +00:00
rkuo-danswer	909403a648	Feature/confluence oauth (#3477) * first cut at slack oauth flow * fix usage of hooks * fix button spacing * add additional error logging * no dev redirect * early cut at google drive oauth * second pass * switch to production uri's * try handling oauth_interactive differently * pass through client id and secret if uploaded * fix call * fix test * temporarily disable check for testing * Revert "temporarily disable check for testing" This reverts commit `4b5a022a5f`. * support visibility in test * missed file * first cut at confluence oauth * work in progress * work in progress * work in progress * work in progress * work in progress * first cut at distributed locking * WIP to make test work * add some dev mode affordances and gate usage of redis behind dynamic credentials * mypy and credentials provider fixes * WIP * fix created at * fix setting initialValue on everything * remove debugging, fix ??? some TextFormField issues * npm fixes * comment cleanup * fix comments * pin the size of the card section * more review fixes * more fixes --------- Co-authored-by: Richard Kuo <rkuo@rkuo.com> Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-28 03:48:51 +00:00
pablonyx	cd84b65011	quick fix (#4154)	2025-02-28 02:03:34 +00:00
pablonyx	413f21cec0	Filter assistants fix (#4153) * k * quick nit * minor assistant filtering fix	2025-02-28 02:03:21 +00:00
pablonyx	eb369384a7	Log server side auth error + slackbot pagination fix (#4149)	2025-02-27 18:05:28 -08:00
pablonyx	0a24dbc52c	k# Please enter the commit message for your changes. Lines starting (#4144)	2025-02-27 23:34:20 +00:00
pablonyx	a7ba0da8cc	Lowercase multi tenant email mapping (#4141)	2025-02-27 15:33:40 -08:00
Richard Kuo (Danswer)	aaced6d551	scan images	2025-02-27 15:25:29 -08:00
Richard Kuo (Danswer)	4c230f92ea	trivy test	2025-02-27 15:05:03 -08:00
Richard Kuo (Danswer)	07d75b04d1	enable trivy scan	2025-02-27 14:22:44 -08:00
evan-danswer	a8d10750c1	fix propagation of is_agentic (#4150)	2025-02-27 11:56:51 -08:00
pablonyx	85e3ed57f1	Order chat sessions by time updated, not created (#4143) * order chat sessions by time updated, not created * quick update * k	2025-02-27 17:35:42 +00:00
pablonyx	e10cc8ccdb	Multi tenant user google auth fix (#4145)	2025-02-27 10:35:38 -08:00
pablonyx	7018bc974b	Better looking errors (#4050) * add error handling * fix * k	2025-02-27 04:58:25 +00:00
pablonyx	9c9075d71d	Minor improvements to provisioning (#4109) * quick fix * k * nit	2025-02-27 04:57:31 +00:00
pablonyx	338e084062	Improved tenant handling for slack bot (#4099)	2025-02-27 04:06:26 +00:00
pablonyx	2f64031f5c	Improved tenant handling for slack bot1 (#4104)	2025-02-27 03:40:50 +00:00
pablonyx	abb74f2eaa	Improved chat search (#4137) * functional + fast * k * adapt * k * nit * k * k * fix typing * k	2025-02-27 02:27:45 +00:00
pablonyx	a3e3d83b7e	Improve viewable assistant logic (#4125) * k * quick fix * k	2025-02-27 01:24:39 +00:00
pablonyx	4dc88ca037	debug playwright failure case	2025-02-26 17:32:26 -08:00
rkuo-danswer	11e7e1c4d6	log processed tenant count (#4139) Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-26 17:26:48 -08:00
pablonyx	f2d74ce540	Address Auth Edge Case (#4138)	2025-02-26 17:24:23 -08:00
rkuo-danswer	25389c5120	first cut at anonymizing query history (#4123) Co-authored-by: Richard Kuo <rkuo@rkuo.com>	2025-02-26 21:32:01 +00:00
pablonyx	ad0721ecd8	update (#4086)	2025-02-26 18:12:07 +00:00
pablonyx	426a8842ae	Markdown copying / html formatting (#4120) * k * delete unnecessary util	2025-02-26 04:56:38 +00:00
pablonyx	a98dcbc7de	Update tenant logic (#4122) * k * k * k * quick nit * nit	2025-02-26 03:53:46 +00:00
pablonyx	6f389dc100	Improve lengthy chats (#4126) * remove scroll * working well * nit * k * nit	2025-02-26 03:22:21 +00:00
pablonyx	d56177958f	fix email headers (#4100)	2025-02-26 03:12:30 +00:00
Kaveen Jayamanna	0e42ae9024	Content of .xlsl are not properly read during indexing. (#4035)	2025-02-25 21:10:47 -08:00
Weves	ce2b4de245	temp remove	2025-02-25 20:46:55 -08:00
Chris Weaver	a515aa78d2	Fix confluence test (#4130)	2025-02-26 03:03:54 +00:00
Weves	23073d91b9	reduce number of chars to index for search	2025-02-25 19:27:50 -08:00
Chris Weaver	f767b1f476	Fix confluence permission syncing at scale (#4129) * Fix confluence permission syncing at scale * Remove line * Better log message * Adjust log	2025-02-25 19:22:52 -08:00
pablonyx	9ffc8cb2c4	k	2025-02-25 18:15:49 -08:00
pablonyx	98bfb58147	Handle bad slack configurations– multi tenant (#4118) * k * quick nit * k * k	2025-02-25 22:22:54 +00:00
evan-danswer	6ce810e957	faster indexing status at scale plus minor cleanups (#4081) * faster indexing status at scale plus minor cleanups * mypy * address chris comments * remove extra prints	2025-02-25 21:22:26 +00:00
pablonyx	07b0b57b31	(nit) bump timeout	2025-02-25 14:10:30 -08:00
pablonyx	118cdd7701	Chat search (#4113) * add chat search * don't add the bible * base functional * k * k * functioning * functioning well * functioning well * k * delete bible * quick cleanup * quick cleanup * k * fixed frontend hooks * delete bible * nit * nit * nit * fix build * k * improved debouncing * address comments * fix alembic * k	2025-02-25 20:49:46 +00:00
rkuo-danswer	ac83b4c365	validate connector deletion (#4108) * validate connector deletion * fixes --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-25 20:35:21 +00:00
pablonyx	fa408ff447	add 3.7 (#4116)	2025-02-25 12:41:40 -08:00
rkuo-danswer	4aa8eb8b75	fix scrolling test (#4117) Co-authored-by: Richard Kuo <rkuo@rkuo.com>	2025-02-25 10:23:04 -08:00
rkuo-danswer	60bd9271f7	Bugfix/model tests (#4092) * trying out a fix * add ability to manually run model tests * add log dump * check status code, not text? * just the model server * add port mapping to host * pass through more api keys * add azure tests * fix litellm env vars * fix env vars in github workflow * temp disable litellm test --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-25 04:53:51 +00:00
Weves	5d58a5e3ea	Add ability to index all of Github	2025-02-24 18:56:36 -08:00
Chris Weaver	a99dd05533	Add option to index all Jira projects (#4106) * Add option to index all Jira projects * Fix test * Fix web build * Address comment	2025-02-25 02:07:00 +00:00
pablonyx	0dce67094e	Prettier formatting for bedrock (#4111) * k * k	2025-02-25 02:05:29 +00:00
pablonyx	ffd14435a4	Text overflow logic (#4051) * proper components * k * k * k	2025-02-25 01:05:22 +00:00
rkuo-danswer	c9a3b45ad4	more aggressive handling of tasks blocking deletion (#4093) * more aggressive handling of tasks blocking deletion * comment updated --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-24 22:41:13 +00:00
pablonyx	7d40676398	Heavy task improvements, logging, and validation (#4058)	2025-02-24 13:48:53 -08:00
rkuo-danswer	b9e79e5db3	tighten up logs (#4076) Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-24 19:23:00 +00:00
rkuo-danswer	558bbe16e4	Bugfix/termination cleanup (#4077) * move activity timeout cleanup to the function exit * fix excessive logging --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-24 19:21:55 +00:00
evan-danswer	076619ce2c	make Settings model match db (#4087)	2025-02-24 19:04:36 +00:00
pablonyx	1263e21eb5	k (#4102)	2025-02-24 17:44:18 +00:00
pablonyx	f0c13b6558	fix starter message editing (#4101)	2025-02-24 01:01:01 +00:00
evan-danswer	a7125662f1	Fix gpt o-series code block formatting (#4089) * prompt addition for gpt o-series to encourage markdown formatting of code blocks * fix to match https://simonwillison.net/tags/markdown/ * chris comment * chris comment	2025-02-24 00:59:48 +00:00
evan-danswer	4a4e4a6c50	thread utils respect contextvars (#4074) * thread utils respect contextvars now * address pablo comments * removed tenant id from places it was already being passed * fix rate limit check and pablo comment	2025-02-24 00:43:21 +00:00
pablonyx	1f2af373e1	improve scroll (#4096)	2025-02-23 19:20:07 +00:00
Weves	bdaa293ae4	Fix nginx for prod compose file	2025-02-21 16:57:54 -08:00
pablonyx	5a131f4547	Fix integration tests (#4059)	2025-02-21 15:56:11 -08:00
rkuo-danswer	ffb7d5b85b	enable manual testing for model server (#4003) * trying out a fix * add ability to manually run model tests --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-21 14:00:32 -08:00
rkuo-danswer	fe8a5d671a	don't spam the logs with texts on auth errors (#4085) * don't spam the logs with texts on auth errors * refactor the logging a bit --------- Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-21 13:40:07 -08:00
Yuhong Sun	6de53ebf60	README Touchup (#4088)	2025-02-21 13:31:07 -08:00
rkuo-danswer	61d536c782	tool fixes (#4075)	2025-02-21 12:30:33 -08:00
Chris Weaver	e1ff9086a4	Fix LLM selection (#4078)	2025-02-21 11:32:57 -08:00
evan-danswer	ba21bacbbf	coerce useLanggraph to boolean (#4084) * coerce useLanggraph to boolean	2025-02-21 09:43:46 -08:00
pablonyx	158bccc3fc	Default on for non-ee (#4083)	2025-02-21 09:11:45 -08:00
Weves	599b7705c2	Fix gitbook connector issues	2025-02-20 15:29:11 -08:00
rkuo-danswer	4958a5355d	try more efficient query (#4047)	2025-02-20 12:58:50 -08:00
Chris Weaver	c4b8519381	Add support for sending email invites for single tenant users (#4065)	2025-02-19 21:05:23 -08:00
rkuo-danswer	8b4413694a	fix usage of tenant_id (#4062) Co-authored-by: Richard Kuo (Danswer) <rkuo@onyx.app>	2025-02-19 17:50:58 -08:00
pablonyx	57cf7d9fac	default agent search `on`	2025-02-19 17:21:26 -08:00
Chris Weaver	ad4efb5f20	Pin xmlsec version + improve SAML flow (#4054) * Pin xmlsec version * testing * test nginx conf change * Pass through more * Cleanup + remove DOMAIN across the board	2025-02-19 16:02:05 -08:00
evan-danswer	e304ec4ab6	Agent search history displayed answer (#4052)	2025-02-19 15:52:16 -08:00
joachim-danswer	1690dc45ba	timout bumps (#4057)	2025-02-19 15:51:45 -08:00
pablonyx	7582ba1640	Fix streaming (#4055)	2025-02-19 15:23:40 -08:00
pablonyx	99fc546943	Miscellaneous indexing fixes (#4042)	2025-02-19 11:34:49 -08:00