2026-01-06 22:59:58 +08:00

Design Document

Overview

This design addresses a bug in SecondaryCircuitInspectionResultAppService.FindDatas: daily deduplication of inspection results runs after pagination instead of before it. As a result, a page can contain fewer records than PageSize, duplicates can reappear on later pages, and TotalCount counts raw rather than deduplicated records. The fix refactors the MongoDB aggregation pipeline to deduplicate before paginating, so that records sharing the same (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) collapse into a single record.

Architecture

The current architecture has a flawed query flow:

Current (Incorrect) Flow:

Query → Filter → Sort → Paginate (Skip/Limit) → Deduplicate → Return

New (Correct) Flow:

Query → Filter → Sort → Deduplicate (Group) → Re-sort → Count → Paginate (Skip/Limit) → Return

The key change is moving the deduplication step before pagination. This requires:

  1. Modifying the MongoDB aggregation pipeline to include a $group stage
  2. Calculating the total count after deduplication
  3. Applying pagination to the deduplicated result set
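
To make the difference between the two flows concrete, here is a minimal in-memory sketch. It does not touch MongoDB; the sample records, the shortened ItemId field, and the helper names (dedupe, paginate, oldFlow, newFlow) are assumptions made for illustration only.

```javascript
// Hypothetical sample data: two raw records per (day, item, status) combination.
const records = [
  { Year: 2025, Month: 1, Day: 1, ItemId: "A", Status: "Normal", Seq: 1 },
  { Year: 2025, Month: 1, Day: 1, ItemId: "A", Status: "Normal", Seq: 2 },
  { Year: 2025, Month: 1, Day: 2, ItemId: "A", Status: "Normal", Seq: 3 },
  { Year: 2025, Month: 1, Day: 2, ItemId: "A", Status: "Normal", Seq: 4 },
  { Year: 2025, Month: 1, Day: 3, ItemId: "A", Status: "Normal", Seq: 5 },
  { Year: 2025, Month: 1, Day: 3, ItemId: "A", Status: "Normal", Seq: 6 },
];

// Deduplication key: (Year, Month, Day, ItemId, Status).
const keyOf = (r) => JSON.stringify([r.Year, r.Month, r.Day, r.ItemId, r.Status]);

// Keep the first record seen for each key (input is assumed to be sorted already).
function dedupe(rows) {
  const firstByKey = new Map();
  for (const row of rows) {
    if (!firstByKey.has(keyOf(row))) firstByKey.set(keyOf(row), row);
  }
  return [...firstByKey.values()];
}

const paginate = (rows, skip, limit) => rows.slice(skip, skip + limit);

// Current flow: paginate first, then deduplicate the page; count raw records.
function oldFlow(rows, skip, limit) {
  return { items: dedupe(paginate(rows, skip, limit)), totalCount: rows.length };
}

// New flow: deduplicate first, then count and paginate.
function newFlow(rows, skip, limit) {
  const unique = dedupe(rows);
  return { items: paginate(unique, skip, limit), totalCount: unique.length };
}

const oldPage1 = oldFlow(records, 0, 2);
const newPage1 = newFlow(records, 0, 2);
console.log(oldPage1.items.length, oldPage1.totalCount); // 1 6
console.log(newPage1.items.length, newPage1.totalCount); // 2 3
```

With a page size of 2, the old flow returns a short page and a raw count, while the new flow returns a full page and the deduplicated count.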

Components and Interfaces

Modified Methods

1. FindDatas Method

  • Current Behavior: Calls QueryPagedResultsAcrossCollectionsAsync then deduplicates in memory
  • New Behavior: Calls a new method that performs deduplication in the aggregation pipeline

2. New Method: QueryPagedDeduplicatedResultsAcrossCollectionsAsync

private async Task<(List<SecondaryCircuitInspectionResult> Results, long TotalCount)> 
    QueryPagedDeduplicatedResultsAcrossCollectionsAsync(
        List<string> collectionNames,
        FilterDefinition<SecondaryCircuitInspectionResult> filter,
        string sortField,
        bool isDescending,
        int skipCount,
        int pageSize,
        CancellationToken cancellationToken = default)

Purpose: Performs cross-collection query with deduplication before pagination

Returns: Tuple containing both the paginated results and the total count of deduplicated records

3. Modified Method: CountAcrossCollectionsAsync

  • Current Behavior: Counts all records matching the filter
  • New Behavior: FindDatas no longer uses it for the total; the count returned by the new query method, taken after deduplication, replaces it

Data Models

No changes to data models are required. The existing SecondaryCircuitInspectionResult entity already contains all necessary fields:

  • Year (int)
  • Month (int)
  • Day (int)
  • SecondaryCircuitInspectionItemId (Guid)
  • Status (string)

MongoDB Aggregation Pipeline Design

Pipeline Stages

The new aggregation pipeline will have the following stages:

  1. $match: Filter records based on search conditions
  2. $project: Project necessary fields for processing
  3. $unionWith: Combine data from multiple time-sharded collections (one stage per additional collection, each with its own $match and $project)
  4. $sort: Sort records before grouping (to ensure consistent "first" record selection)
  5. $group: Group by (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) and take first record
  6. $replaceRoot: Restore the preserved document as the output document
  7. $sort: Sort again, because $group does not preserve the order of its input
  8. $facet: Split into two pipelines:
    • totalCount: Count total deduplicated records
    • data: Apply skip/limit for pagination

Detailed Pipeline Structure

[
  // Stage 1: Match filter conditions
  { $match: <filterDocument> },
  
  // Stage 2: Project fields
  { $project: <projectionDocument> },
  
  // Stage 3: Union with other collections (repeated for each collection)
  { $unionWith: { 
      coll: "SecondaryCircuitInspectionResult_YYYY_MM",
      pipeline: [
        { $match: <filterDocument> },
        { $project: <projectionDocument> }
      ]
    }
  },
  
  // Stage 4: Sort before grouping
  { $sort: { <sortField>: <sortDirection> } },
  
  // Stage 5: Group by deduplication keys
  { $group: {
      _id: {
        Year: "$Year",
        Month: "$Month",
        Day: "$Day",
        SecondaryCircuitInspectionItemId: "$SecondaryCircuitInspectionItemId",
        Status: "$Status"
      },
      doc: { $first: "$$ROOT" }
    }
  },
  
  // Stage 6: Replace root with the preserved document
  { $replaceRoot: { newRoot: "$doc" } },
  
  // Stage 7: Sort again after grouping (to maintain user-requested sort order)
  { $sort: { <sortField>: <sortDirection> } },
  
  // Stage 8: Facet for count and pagination
  { $facet: {
      totalCount: [{ $count: "count" }],
      data: [
        { $skip: <skipCount> },
        { $limit: <pageSize> }
      ]
    }
  }
]
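
The structure above can be assembled programmatically. The following is a sketch only: buildDedupPipeline and its parameter object are hypothetical names, the first entry of collectionNames is assumed to be the collection the aggregation runs against, and the filter and projection are passed in as ready-made documents. It builds plain objects and does not call a driver.

```javascript
// Fields that form the deduplication key (taken from the design above).
const GROUP_FIELDS = ["Year", "Month", "Day", "SecondaryCircuitInspectionItemId", "Status"];

// Builds the aggregation pipeline as an array of plain stage objects.
// Assumption: collectionNames[0] is the collection the aggregation runs on;
// every further name is attached through its own $unionWith stage.
function buildDedupPipeline({ collectionNames, filter, projection, sortField, isDescending, skipCount, pageSize }) {
  const sortSpec = { [sortField]: isDescending ? -1 : 1 };
  const perCollection = [{ $match: filter }, { $project: projection }];

  const unions = collectionNames.slice(1).map((coll) => ({
    $unionWith: { coll, pipeline: perCollection },
  }));

  // { Year: "$Year", Month: "$Month", ... }
  const groupId = Object.fromEntries(GROUP_FIELDS.map((field) => [field, `$${field}`]));

  return [
    ...perCollection,
    ...unions,
    { $sort: sortSpec },                                      // order within each group
    { $group: { _id: groupId, doc: { $first: "$$ROOT" } } },  // keep first document per key
    { $replaceRoot: { newRoot: "$doc" } },
    { $sort: sortSpec },                                      // $group output order is not guaranteed
    { $facet: {
        totalCount: [{ $count: "count" }],
        data: [{ $skip: skipCount }, { $limit: pageSize }],
    } },
  ];
}

// Example input; the filter and projection documents are placeholders.
const pipeline = buildDedupPipeline({
  collectionNames: [
    "SecondaryCircuitInspectionResult_2025_01",
    "SecondaryCircuitInspectionResult_2025_02",
    "SecondaryCircuitInspectionResult_2025_03",
  ],
  filter: { Year: 2025 },
  projection: { Year: 1, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: 1, Status: 1, ExecutionTime: 1 },
  sortField: "ExecutionTime",
  isDescending: true,
  skipCount: 0,
  pageSize: 10,
});

const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> "));
```

Note that $limit requires a positive value, so a caller would need to validate pageSize before building the pipeline.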

Implementation Details

Key Changes in FindDatas Method

// OLD CODE (Lines 704-730):
var pagedResults = await QueryPagedResultsAcrossCollectionsAsync(
    existingCollections, filter, sortField, isDescending, skipCount, pageSize, default);

var deduplicatedResults = pagedResults
    .GroupBy(x => new { ExecutionTime = x.ExecutionTime.Date, x.SecondaryCircuitInspectionItemId, x.Status })
    .Select(g => g.First())
    .ToList();

// NEW CODE:
var (deduplicatedResults, totalCount) = await QueryPagedDeduplicatedResultsAcrossCollectionsAsync(
    existingCollections, filter, sortField, isDescending, skipCount, pageSize, default);

Grouping Key Structure

The grouping key must include:

  • Year: Integer year value
  • Month: Integer month value (1-12)
  • Day: Integer day value (1-31)
  • SecondaryCircuitInspectionItemId: GUID of the inspection item
  • Status: String status value

Handling $first Operator

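The $first operator in the $group stage preserves the first document in each group. Because we sort before grouping, the selected record is the one that ranks first under the requested sort. When the sort field is ExecutionTime, this is the earliest (ascending) or latest (descending) record in each group; for any other sort field, it is simply the group's first record under that ordering. Records that tie on the sort field are returned in no guaranteed order, so a tie-breaker such as _id is needed if the selection must be deterministic.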
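
As an illustration of how the sort decides which record survives, the sketch below emulates "$sort then $first" for one group in memory. The records and the firstAfterSort helper are hypothetical; ExecutionTime values are ISO strings so that string comparison matches chronological order.

```javascript
// Hypothetical records belonging to one group (same day, item, and status).
const sameDay = [
  { Id: 1, ExecutionTime: "2025-01-01T08:00:00Z" },
  { Id: 2, ExecutionTime: "2025-01-01T20:00:00Z" },
  { Id: 3, ExecutionTime: "2025-01-01T12:00:00Z" },
];

// Emulates "$sort by sortField, then $first" for a single group.
function firstAfterSort(docs, sortField, isDescending) {
  const sign = isDescending ? -1 : 1;
  const sorted = [...docs].sort((a, b) => {
    if (a[sortField] < b[sortField]) return -sign;
    if (a[sortField] > b[sortField]) return sign;
    return 0;
  });
  return sorted[0];
}

console.log(firstAfterSort(sameDay, "ExecutionTime", true).Id);  // 2 (latest)
console.log(firstAfterSort(sameDay, "ExecutionTime", false).Id); // 1 (earliest)
console.log(firstAfterSort(sameDay, "Id", true).Id);             // 3 (highest Id, not the latest)
```

The last call shows that sorting by a field other than ExecutionTime selects a record that is neither the earliest nor the latest of the day.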

Error Handling

Potential Issues and Solutions

  1. Empty Result Sets

    • Issue: The $facet stage always returns exactly one document; when nothing matches, its totalCount and data arrays are both empty
    • Solution: Treat an empty totalCount array as a count of 0 and return an empty result list
  2. Collection Not Found

    • Issue: Querying non-existent collections throws exceptions
    • Solution: Use FilterExistingCollectionsAsync before building pipeline (already implemented)
  3. Memory Limits

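    • Issue: The $sort and $group stages process every matched record before pagination. Each of these stages is limited to 100 MB of RAM and fails when it exceeds that limit unless it is allowed to spill to disk
    • Solution: Run the aggregation with allowDiskUse enabled (MongoDB 6.0 and later allow disk use by default) and keep the $match filter selective. The data facet stays small because $limit caps it at the page size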
  4. Index Performance

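    • Issue: The $sort and $group stages run after $unionWith, so they operate on the merged stream and cannot use a collection index; only the $match stages can
    • Solution: Ensure the $match filters are index-backed in every collection. A compound index on (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) helps filters that use a prefix of those fields; confirm the plan with explain()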
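
The empty-result case in item 1 can be handled when unwrapping the $facet output. The following is a minimal sketch; unwrapFacetResult is a hypothetical helper, and the facet names totalCount and data follow the pipeline shown earlier.

```javascript
// facetDocs is the array returned by the aggregation.
// $facet yields one document shaped like { totalCount: [{ count: N }], data: [...] };
// when nothing matches, totalCount is an empty array.
function unwrapFacetResult(facetDocs) {
  const facet = facetDocs[0] ?? { totalCount: [], data: [] };
  const totalCount = facet.totalCount.length > 0 ? facet.totalCount[0].count : 0;
  return { results: facet.data, totalCount };
}

// No matching records: $count emits nothing, so totalCount is [].
const empty = unwrapFacetResult([{ totalCount: [], data: [] }]);
console.log(empty); // { results: [], totalCount: 0 }

// Some matching records.
const page = unwrapFacetResult([{ totalCount: [{ count: 30 }], data: [{ Day: 1 }, { Day: 2 }] }]);
console.log(page.totalCount, page.results.length); // 30 2
```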

Testing Strategy

Unit Tests

  1. Test Deduplication Logic

    • Create multiple records with same (Year, Month, Day, ItemId, Status)
    • Verify only one record is returned per group
    • Verify the correct record is selected (first after sorting)
  2. Test Pagination After Deduplication

    • Create 25 unique day-item-status combinations
    • Query with PageSize=10
    • Verify Page 1 has 10 records, Page 2 has 10 records, Page 3 has 5 records
    • Verify TotalCount = 25
  3. Test Total Count Accuracy

    • Create 100 raw records that deduplicate to 30 unique combinations
    • Verify TotalCount returns 30, not 100
  4. Test Cross-Collection Deduplication

    • Insert records in multiple month-sharded collections
    • Query across collections
    • Verify deduplication works across collection boundaries
  5. Test Sort Order Preservation

    • Create records with different ExecutionTime values on the same day
    • Sort by ExecutionTime descending
    • Verify the latest record is selected for each day
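
The expected page sizes in test 2 follow from simple arithmetic, sketched below. The helpers skipCountFor and expectedPageLength are hypothetical, and PageIndex is assumed to be 1-based; adjust the offset if the service counts pages from 0.

```javascript
// Assumption: PageIndex is 1-based.
function skipCountFor(pageIndex, pageSize) {
  return (pageIndex - 1) * pageSize;
}

// Number of records a page should contain, given the deduplicated total.
function expectedPageLength(totalCount, pageIndex, pageSize) {
  const remaining = totalCount - skipCountFor(pageIndex, pageSize);
  return Math.max(0, Math.min(pageSize, remaining));
}

// 25 unique day-item-status combinations, PageSize = 10.
const lengths = [1, 2, 3, 4].map((pageIndex) => expectedPageLength(25, pageIndex, 10));
console.log(lengths); // [ 10, 10, 5, 0 ]
```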

Edge Cases

  1. Empty Input: No records match filter → Return empty list with TotalCount=0
  2. Single Record Per Day: No deduplication needed → Return all records
  3. All Records Same Day: Multiple items/statuses on same day → Deduplicate correctly
  4. Null Status Values: Handle null status in grouping key
  5. Large Page Size: PageSize > TotalCount → Return all deduplicated records

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system—essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Deduplication Completeness

For any set of inspection results with matching (Year, Month, Day, SecondaryCircuitInspectionItemId, Status), the query result should contain at most one record per unique combination of these fields.

Validates: Requirements 1.2

Property 2: Pagination Consistency

For any query with pagination parameters (PageIndex, PageSize), the total number of records across all pages should equal the TotalCount returned in the first page response.

Validates: Requirements 2.2

Property 3: Sort Order Preservation

For any sort field and direction, the records within each deduplicated group should be selected according to the sort order (first record after sorting).

Validates: Requirements 1.3

Property 4: Count Accuracy

For any query filter, the TotalCount returned should equal the number of unique (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) combinations in the filtered dataset.

Validates: Requirements 2.1

Property 5: Cross-Collection Consistency

For any time range spanning multiple collections, deduplication should produce the same result as if all data were in a single collection.

Validates: Requirements 4.2

Property 6: Idempotent Deduplication

For any dataset, applying the deduplication logic multiple times should produce the same result as applying it once.

Validates: Requirements 1.1
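
Properties 1, 2, 4, and 6 lend themselves to randomized checks. The sketch below runs them against an in-memory model of the deduplication rather than against the real service; the generator, the value pools, and all helper names are assumptions. A real test would replace dedupe with calls to FindDatas on seeded data.

```javascript
// Small deterministic pseudo-random generator so that runs are repeatable.
function makeRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

const keyOf = (r) =>
  JSON.stringify([r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status]);

// In-memory model of the deduplication: keep the first record per key.
function dedupe(rows) {
  const firstByKey = new Map();
  for (const row of rows) {
    if (!firstByKey.has(keyOf(row))) firstByKey.set(keyOf(row), row);
  }
  return [...firstByKey.values()];
}

// Generates n records drawn from small, hypothetical value pools.
function randomRecords(random, n) {
  const pick = (values) => values[Math.floor(random() * values.length)];
  return Array.from({ length: n }, () => ({
    Year: 2025,
    Month: pick([1, 2]),
    Day: pick([1, 2, 3]),
    SecondaryCircuitInspectionItemId: pick(["item-a", "item-b"]),
    Status: pick(["Normal", "Abnormal", null]),
  }));
}

function checkProperties(rows, pageSize) {
  const unique = dedupe(rows);
  const keys = unique.map(keyOf);
  let pagedTotal = 0;
  for (let skip = 0; skip < unique.length; skip += pageSize) {
    pagedTotal += unique.slice(skip, skip + pageSize).length;
  }
  const again = dedupe(unique);
  return {
    completeness: new Set(keys).size === keys.length,                   // Property 1
    paginationConsistency: pagedTotal === unique.length,                // Property 2
    countAccuracy: unique.length === new Set(rows.map(keyOf)).size,     // Property 4
    idempotent: again.length === unique.length &&
      again.every((row, i) => row === unique[i]),                       // Property 6
  };
}

// Run the checks over 50 seeded datasets of 100 records each.
const allHold = Array.from({ length: 50 }, (_, seed) =>
  checkProperties(randomRecords(makeRandom(seed + 1), 100), 7)
).every((result) => Object.values(result).every(Boolean));
console.log(allHold);
```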

Performance Considerations

Index Requirements

Create a compound index to optimize the grouping operation:

db.SecondaryCircuitInspectionResult_YYYY_MM.createIndex({
  Year: 1,
  Month: 1,
  Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1
});

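This index supports:

  • $match filters that use a prefix of (Year, Month, Day, ItemId, Status)
  • Sorting by ExecutionTime within a single collection

In this pipeline the $sort and $group stages run after $unionWith, so they process the merged stream and cannot use the index directly. Check the actual query plan with explain() before relying on it.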

Memory Usage

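The $group stage accumulates documents in memory. Because it keeps only the first document per group (using $first), its memory usage is bounded by the number of unique groups, not the total number of documents. The preceding $sort is a blocking stage that must process every matched document, so it is the stage most likely to reach the per-stage memory limit.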

Estimated Memory: ~1KB per unique day-item-status combination

Query Performance

Expected Performance:

  • Small datasets (<10K records): <100ms
  • Medium datasets (10K-100K records): 100-500ms
  • Large datasets (>100K records): 500-2000ms

Performance will be monitored using the existing LogQueryPerformance method.

Backward Compatibility

API Compatibility

The FindDatas method signature remains unchanged:

public async Task<RequestPageResult<SecondaryCircuitInspectionResultDetailOutput>> FindDatas(
    PageSearchCondition<SecondaryCircuitInspectionResultSearchConditionInput> searchCondition)

Response Format

The response format remains identical. The only difference is that duplicate records are now properly eliminated before pagination.

Client Impact

Clients may observe:

  • Fewer total records: TotalCount will be lower whenever duplicates exist (it now reflects the deduplicated count)
  • Different records per page: Since deduplication happens first, page contents may differ
  • Same API contract: No code changes required in clients

Migration Strategy

Deployment Steps

  1. Deploy Code: Update the SecondaryCircuitInspectionResultAppService with new implementation
  2. Monitor Performance: Watch query performance metrics for any degradation
  3. Verify Results: Spot-check query results to ensure deduplication is working correctly

Rollback Plan

If issues arise:

  1. Revert to previous version of SecondaryCircuitInspectionResultAppService
  2. The old in-memory deduplication logic will resume (with its bugs)
  3. No data migration needed (this is a query-only change)

Index Creation

Indexes should be created during a maintenance window:

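// For each existing collection
db.SecondaryCircuitInspectionResult_2025_01.createIndex({
  Year: 1, Month: 1, Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1
});

The background option is deprecated. MongoDB 4.2 and later ignore it and build every index with a process that locks the collection only briefly at the start and end of the build. On servers older than 4.2, pass { background: true } to avoid blocking other operations.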
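
Since the index has to exist on each month-sharded collection, the list of collection names can be generated instead of typed by hand. This is a sketch under the assumption that every collection follows the SecondaryCircuitInspectionResult_YYYY_MM pattern used in this document; collectionNamesForRange and DEDUP_INDEX_KEYS are hypothetical names.

```javascript
// Returns the month-sharded collection names from (startYear, startMonth)
// through (endYear, endMonth), inclusive.
function collectionNamesForRange(startYear, startMonth, endYear, endMonth) {
  const names = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    names.push(`SecondaryCircuitInspectionResult_${year}_${String(month).padStart(2, "0")}`);
    month += 1;
    if (month > 12) {
      month = 1;
      year += 1;
    }
  }
  return names;
}

// Index keys from the Index Requirements section.
const DEDUP_INDEX_KEYS = {
  Year: 1,
  Month: 1,
  Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1,
};

const names = collectionNamesForRange(2024, 11, 2025, 2);
console.log(names);

// In mongosh, the index could then be created per collection, for example:
//   names.forEach((name) => db.getCollection(name).createIndex(DEDUP_INDEX_KEYS));
```

Names generated this way may include months for which no collection exists yet, so the list should be intersected with the collections actually present before creating indexes.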