SOMS/.kiro/specs/inspection-result-daily-deduplication/design.md

# Design Document

## Overview

This design fixes a bug in `SecondaryCircuitInspectionResultAppService.FindDatas`: daily deduplication of inspection results runs after pagination instead of before it. As a result, a page can contain fewer records than the requested page size, and the reported total counts raw records rather than deduplicated ones. The fix refactors the MongoDB aggregation pipeline so that deduplication happens first, merging records that share the same (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) into a single record per day before any skip/limit is applied.

## Architecture

The current architecture has a flawed query flow:

**Current (Incorrect) Flow:**
```
Query → Filter → Sort → Paginate (Skip/Limit) → Deduplicate → Return
```

**New (Correct) Flow:**
```
Query → Filter → Sort → Deduplicate (Group) → Re-sort → Count → Paginate (Skip/Limit) → Return
```

The key change is moving the deduplication step before pagination. This requires:
1. Modifying the MongoDB aggregation pipeline to include a `$group` stage
2. Calculating the total count after deduplication
3. Applying pagination to the deduplicated result set
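
The effect of the ordering can be illustrated with a small in-memory model. This is only a sketch: the sample records and the helper names `deduplicate` and `paginate` are hypothetical, and plain arrays stand in for the MongoDB collections.

```javascript
// Hypothetical sample data; field names follow the grouping key used in this design.
const records = [
  { id: 1, Year: 2025, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: "A", Status: "OK" },
  { id: 2, Year: 2025, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: "A", Status: "OK" },
  { id: 3, Year: 2025, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: "A", Status: "OK" },
  { id: 4, Year: 2025, Month: 1, Day: 2, SecondaryCircuitInspectionItemId: "A", Status: "OK" },
  { id: 5, Year: 2025, Month: 1, Day: 3, SecondaryCircuitInspectionItemId: "A", Status: "OK" },
];

// Keep the first record seen for each (Year, Month, Day, ItemId, Status) key.
function deduplicate(rows) {
  const seen = new Map();
  for (const r of rows) {
    const key = [r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status].join("|");
    if (!seen.has(key)) seen.set(key, r);
  }
  return [...seen.values()];
}

// Model of Skip/Limit.
const paginate = (rows, skip, limit) => rows.slice(skip, skip + limit);

// Current flow: paginate first, then deduplicate. The page shrinks to one record.
const oldPage1 = deduplicate(paginate(records, 0, 3));

// New flow: deduplicate first, then paginate. The page is full.
const newPage1 = paginate(deduplicate(records), 0, 3);

console.log(oldPage1.map((r) => r.id)); // [ 1 ]
console.log(newPage1.map((r) => r.id)); // [ 1, 4, 5 ]
```

With a page size of 3, the current flow returns a single record for page 1 even though three distinct days exist, which is the behavior this design removes.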

## Components and Interfaces

### Modified Methods

#### 1. `FindDatas` Method
- **Current Behavior**: Calls `QueryPagedResultsAcrossCollectionsAsync` then deduplicates in memory
- **New Behavior**: Calls a new method that performs deduplication in the aggregation pipeline

#### 2. New Method: `QueryPagedDeduplicatedResultsAcrossCollectionsAsync`
```csharp
private async Task<(List<SecondaryCircuitInspectionResult> Results, long TotalCount)>
    QueryPagedDeduplicatedResultsAcrossCollectionsAsync(
        List<string> collectionNames,
        FilterDefinition<SecondaryCircuitInspectionResult> filter,
        string sortField,
        bool isDescending,
        int skipCount,
        int pageSize,
        CancellationToken cancellationToken = default)
```

**Purpose**: Performs a cross-collection query in which deduplication is applied before pagination

**Returns**: A tuple containing the requested page of results and the total count of deduplicated records

#### 3. Modified Method: `CountAcrossCollectionsAsync`
- **Current Behavior**: Counts all records matching the filter, including duplicates
- **New Behavior**: `FindDatas` no longer relies on it for the total; the total is the deduplicated count returned by the new query method

### Data Models

No changes to data models are required. The existing `SecondaryCircuitInspectionResult` entity already contains all necessary fields:
- `Year` (int)
- `Month` (int)
- `Day` (int)
- `SecondaryCircuitInspectionItemId` (Guid)
- `Status` (string)

## MongoDB Aggregation Pipeline Design

### Pipeline Stages

The new aggregation pipeline will have the following stages:

1. **$match**: Filter records based on search conditions
2. **$project**: Project necessary fields for processing
3. **$unionWith**: Combine data from multiple time-sharded collections, applying the same `$match` and `$project` to each
4. **$sort**: Sort records before grouping (to ensure consistent "first" record selection)
5. **$group**: Group by (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) and take first record
6. **$replaceRoot**: Promote the preserved record back to the top level of the document
7. **$sort**: Re-apply the requested sort, because `$group` does not guarantee the order of its output
8. **$facet**: Split into two pipelines:
   - **totalCount**: Count total deduplicated records
   - **data**: Apply skip/limit for pagination

### Detailed Pipeline Structure

```javascript
[
  // Stage 1: Match filter conditions
  { $match: <filterDocument> },

  // Stage 2: Project fields
  { $project: <projectionDocument> },

  // Stage 3: Union with other collections (repeated for each collection)
  { $unionWith: {
      coll: "SecondaryCircuitInspectionResult_YYYY_MM",
      pipeline: [
        { $match: <filterDocument> },
        { $project: <projectionDocument> }
      ]
    }
  },

  // Stage 4: Sort before grouping; _id breaks ties so $first is deterministic
  //          (assumes the projection keeps _id)
  { $sort: { <sortField>: <sortDirection>, _id: 1 } },

  // Stage 5: Group by deduplication keys
  { $group: {
      _id: {
        Year: "$Year",
        Month: "$Month",
        Day: "$Day",
        SecondaryCircuitInspectionItemId: "$SecondaryCircuitInspectionItemId",
        Status: "$Status"
      },
      doc: { $first: "$$ROOT" }
    }
  },

  // Stage 6: Replace root with the preserved document
  { $replaceRoot: { newRoot: "$doc" } },

  // Stage 7: Sort again after grouping ($group does not preserve order);
  //          _id breaks ties so pages stay stable (assumes the projection keeps _id)
  { $sort: { <sortField>: <sortDirection>, _id: 1 } },

  // Stage 8: Facet for count and pagination
  { $facet: {
      totalCount: [{ $count: "count" }],
      data: [
        { $skip: <skipCount> },
        { $limit: <pageSize> }
      ]
    }
  }
]
```
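
The pipeline template above could be assembled programmatically along the following lines. This is a sketch, not the service's implementation: the function name `buildDeduplicationPipeline` and its option names are hypothetical, the aggregation is assumed to run on the first collection in the list, and the `_id` tie-breaker is an added assumption that requires the projection to keep `_id`.

```javascript
// Sketch only: names are illustrative, not the service's actual API.
function buildDeduplicationPipeline(options) {
  const { collectionNames, filter, projection, sortField, isDescending, skipCount, pageSize } = options;

  // Assumption: _id is appended as a tie-breaker so equal sort values order deterministically.
  const sortSpec = { [sortField]: isDescending ? -1 : 1 };
  if (sortField !== "_id") sortSpec._id = 1;
  const sortStage = { $sort: sortSpec };

  // The aggregation runs on the first collection; the remaining ones are unioned in.
  const unionStages = collectionNames.slice(1).map((coll) => ({
    $unionWith: { coll, pipeline: [{ $match: filter }, { $project: projection }] },
  }));

  return [
    { $match: filter },
    { $project: projection },
    ...unionStages,
    sortStage, // sort before grouping so $first is well defined
    {
      $group: {
        _id: {
          Year: "$Year",
          Month: "$Month",
          Day: "$Day",
          SecondaryCircuitInspectionItemId: "$SecondaryCircuitInspectionItemId",
          Status: "$Status",
        },
        doc: { $first: "$$ROOT" },
      },
    },
    { $replaceRoot: { newRoot: "$doc" } },
    sortStage, // sort again because $group does not preserve order
    {
      $facet: {
        totalCount: [{ $count: "count" }],
        data: [{ $skip: skipCount }, { $limit: pageSize }],
      },
    },
  ];
}

const pipeline = buildDeduplicationPipeline({
  collectionNames: ["SecondaryCircuitInspectionResult_2025_01", "SecondaryCircuitInspectionResult_2025_02"],
  filter: { Year: 2025 },
  projection: { Year: 1, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: 1, Status: 1, ExecutionTime: 1 },
  sortField: "ExecutionTime",
  isDescending: true,
  skipCount: 0,
  pageSize: 10,
});

// $match -> $project -> $unionWith -> $sort -> $group -> $replaceRoot -> $sort -> $facet
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
```

Building the stages as plain objects keeps the stage order easy to assert in a unit test before the pipeline is handed to a driver.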

## Implementation Details

### Key Changes in FindDatas Method

```csharp
// OLD CODE (Lines 704-730):
var pagedResults = await QueryPagedResultsAcrossCollectionsAsync(
    existingCollections, filter, sortField, isDescending, skipCount, pageSize, default);

var deduplicatedResults = pagedResults
    .GroupBy(x => new { ExecutionTime = x.ExecutionTime.Date, x.SecondaryCircuitInspectionItemId, x.Status })
    .Select(g => g.First())
    .ToList();

// NEW CODE:
var (deduplicatedResults, totalCount) = await QueryPagedDeduplicatedResultsAcrossCollectionsAsync(
    existingCollections, filter, sortField, isDescending, skipCount, pageSize, default);
```

### Grouping Key Structure

The grouping key must include:
- `Year`: Integer year value
- `Month`: Integer month value (1-12)
- `Day`: Integer day value (1-31)
- `SecondaryCircuitInspectionItemId`: GUID of the inspection item
- `Status`: String status value

### Handling $first Operator

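The `$first` accumulator in the `$group` stage keeps the first document it receives for each group. Because the pipeline sorts before grouping, that document is the one that ranks first under the requested sort. When sorting by `ExecutionTime`, this is the earliest record of the day for an ascending sort and the latest for a descending sort. When sorting by another field, the retained record is the first one under that field's order, which is not necessarily the earliest or latest.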
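
The selection rule can be modeled in memory as follows. This is a sketch of the "sort, then keep the first per group" semantics rather than driver code; the helper name `keepFirstPerGroup` and the sample timestamps are hypothetical.

```javascript
// In-memory model of "$sort then $group with $first", sorting by ExecutionTime.
function keepFirstPerGroup(rows, isDescending) {
  const sorted = [...rows].sort((a, b) => {
    const diff = a.ExecutionTime.getTime() - b.ExecutionTime.getTime();
    return isDescending ? -diff : diff;
  });
  const groups = new Map();
  for (const row of sorted) {
    const key = JSON.stringify([row.Year, row.Month, row.Day, row.SecondaryCircuitInspectionItemId, row.Status]);
    if (!groups.has(key)) groups.set(key, row); // like $first: keep the first document seen
  }
  return [...groups.values()];
}

// Three hypothetical records that share one grouping key.
const base = { Year: 2025, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: "item-1", Status: "Normal" };
const sameDay = [
  { ...base, ExecutionTime: new Date("2025-01-01T12:00:00Z") },
  { ...base, ExecutionTime: new Date("2025-01-01T08:00:00Z") },
  { ...base, ExecutionTime: new Date("2025-01-01T17:30:00Z") },
];

const earliest = keepFirstPerGroup(sameDay, false)[0].ExecutionTime.toISOString();
const latest = keepFirstPerGroup(sameDay, true)[0].ExecutionTime.toISOString();
console.log(earliest); // 2025-01-01T08:00:00.000Z
console.log(latest); // 2025-01-01T17:30:00.000Z
```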

## Error Handling

### Potential Issues and Solutions

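1. **Empty Result Sets**
   - **Issue**: `$facet` always returns exactly one document. When no records match, its `totalCount` array is empty rather than containing a count of 0
   - **Solution**: Treat an empty `totalCount` array as a total of 0 and return an empty result list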

2. **Collection Not Found**
   - **Issue**: Querying non-existent collections throws exceptions
   - **Solution**: Use `FilterExistingCollectionsAsync` before building pipeline (already implemented)

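3. **Memory Limits**
   - **Issue**: `$sort` and `$group` are blocking stages that must process every matched record before pagination, and each pipeline stage is limited to 100 MB of memory unless it is allowed to spill to disk
   - **Solution**: Run the aggregation with `allowDiskUse` enabled, and keep the `$match` filter selective so that fewer records reach the blocking stages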

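4. **Index Performance**
   - **Issue**: The `$sort` and `$group` stages run after `$unionWith`, so they operate on the merged stream and cannot use an index; without a supporting index, the per-collection `$match` stages fall back to collection scans
   - **Solution**: Ensure each collection has a compound index on (Year, Month, Day, SecondaryCircuitInspectionItemId, Status), which helps `$match` when the filter constrains those fields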
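
For the empty result case in item 1, the `$facet` output could be unpacked as sketched below. The helper name `unpackFacetResult` is hypothetical, and plain objects stand in for the documents a driver would return.

```javascript
// Sketch: convert the single $facet document into { results, totalCount }.
function unpackFacetResult(facetDocs) {
  // $facet yields one document; fall back defensively if the array is empty.
  const facet = facetDocs[0] ?? { totalCount: [], data: [] };
  // $count emits nothing for empty input, so an empty array means zero.
  const totalCount = facet.totalCount.length > 0 ? Number(facet.totalCount[0].count) : 0;
  return { results: facet.data, totalCount };
}

const emptyResult = unpackFacetResult([{ totalCount: [], data: [] }]);
const pageResult = unpackFacetResult([{ totalCount: [{ count: 25 }], data: [{ Day: 1 }, { Day: 2 }] }]);

console.log(emptyResult.totalCount, emptyResult.results.length); // 0 0
console.log(pageResult.totalCount, pageResult.results.length); // 25 2
```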

## Testing Strategy

### Unit Tests

1. **Test Deduplication Logic**
   - Create multiple records with same (Year, Month, Day, ItemId, Status)
   - Verify only one record is returned per group
   - Verify the correct record is selected (first after sorting)

2. **Test Pagination After Deduplication**
   - Create 25 unique day-item-status combinations
   - Query with PageSize=10
   - Verify Page 1 has 10 records, Page 2 has 10 records, Page 3 has 5 records
   - Verify TotalCount = 25

3. **Test Total Count Accuracy**
   - Create 100 raw records that deduplicate to 30 unique combinations
   - Verify TotalCount returns 30, not 100

4. **Test Cross-Collection Deduplication**
   - Insert records in multiple month-sharded collections
   - Query across collections
   - Verify deduplication works across collection boundaries

5. **Test Sort Order Preservation**
   - Create records with different ExecutionTime values on the same day
   - Sort by ExecutionTime descending
   - Verify the latest record is selected for each day

### Edge Cases

1. **Empty Input**: No records match filter → Return empty list with TotalCount=0
2. **Single Record Per Day**: No deduplication needed → Return all records
3. **All Records Same Day**: Multiple items/statuses on same day → Deduplicate correctly
4. **Null Status Values**: Handle null status in grouping key
5. **Large Page Size**: PageSize > TotalCount → Return all deduplicated records

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system—essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Deduplication Completeness

*For any* set of inspection results with matching (Year, Month, Day, SecondaryCircuitInspectionItemId, Status), the query result should contain at most one record per unique combination of these fields.

**Validates: Requirements 1.2**

### Property 2: Pagination Consistency

*For any* query with pagination parameters (PageIndex, PageSize), the total number of records across all pages should equal the TotalCount returned in the first page response.

**Validates: Requirements 2.2**

### Property 3: Sort Order Preservation

*For any* sort field and direction, the record retained from each deduplicated group should be the one that ranks first in that group under the requested sort order.

**Validates: Requirements 1.3**

### Property 4: Count Accuracy

*For any* query filter, the TotalCount returned should equal the number of unique (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) combinations in the filtered dataset.

**Validates: Requirements 2.1**

### Property 5: Cross-Collection Consistency

*For any* time range spanning multiple collections, deduplication should produce the same result as if all data were in a single collection.

**Validates: Requirements 4.2**

### Property 6: Idempotent Deduplication

*For any* dataset, applying the deduplication logic multiple times should produce the same result as applying it once.

**Validates: Requirements 1.1**
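
Several of these properties can be checked against an in-memory model before they are exercised against MongoDB. The sketch below covers Properties 1, 2, 4, and 6 on a generated dataset; the helpers `groupKey`, `dedupeByKey`, and `queryPage` are hypothetical stand-ins for the real query path.

```javascript
// Grouping key shared by all checks.
const groupKey = (r) =>
  [r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status].join("|");

// Keep the first record seen per key.
function dedupeByKey(rows) {
  const kept = new Map();
  for (const r of rows) {
    if (!kept.has(groupKey(r))) kept.set(groupKey(r), r);
  }
  return [...kept.values()];
}

// Model of the new flow: deduplicate, count, then paginate.
function queryPage(rows, skipCount, pageSize) {
  const unique = dedupeByKey(rows);
  return { totalCount: unique.length, results: unique.slice(skipCount, skipCount + pageSize) };
}

// 100 raw records spread over 30 distinct days of one item and status.
const rawRecords = Array.from({ length: 100 }, (_, i) => ({
  id: i, Year: 2025, Month: 1, Day: (i % 30) + 1,
  SecondaryCircuitInspectionItemId: "item-1", Status: "Normal",
}));

const pageSize = 10;
const { totalCount } = queryPage(rawRecords, 0, pageSize);
const pages = [];
for (let skip = 0; skip < totalCount; skip += pageSize) {
  pages.push(queryPage(rawRecords, skip, pageSize).results);
}
const allRows = pages.flat();

// Property 1: at most one record per key across all pages.
const property1 = new Set(allRows.map(groupKey)).size === allRows.length;
// Property 2: the pages together contain exactly TotalCount records.
const property2 = allRows.length === totalCount;
// Property 4: TotalCount equals the number of unique keys in the raw data.
const property4 = totalCount === new Set(rawRecords.map(groupKey)).size;
// Property 6: deduplicating twice gives the same result as deduplicating once.
const once = dedupeByKey(rawRecords);
const property6 = JSON.stringify(dedupeByKey(once)) === JSON.stringify(once);

console.log({ totalCount, property1, property2, property4, property6 });
```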

## Performance Considerations

### Index Requirements

Create a compound index on each collection to support the query's filter on the grouping fields:

```javascript
db.SecondaryCircuitInspectionResult_YYYY_MM.createIndex({
  Year: 1,
  Month: 1,
  Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1
});
```

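This index is expected to help with:
- The per-collection `$match` stages, when the filter constrains the leading fields (Year, Month, Day, ItemId, Status)
- Reading records in ExecutionTime order within a single group of one collection

Because the `$sort` and `$group` stages run after `$unionWith`, they work on the merged stream and do not use the index directly.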

### Memory Usage

The `$group` stage keeps only one document per group (via `$first`), so its memory usage is bounded by the number of unique groups, not the total number of documents. The preceding `$sort` is different: it must order every matched document before grouping begins, so its cost grows with the size of the filtered set.

**Estimated Memory**: ~1KB per unique day-item-status combination

### Query Performance

**Expected Performance**:
- Small datasets (<10K records): <100ms
- Medium datasets (10K-100K records): 100-500ms
- Large datasets (>100K records): 500-2000ms

Performance will be monitored using the existing `LogQueryPerformance` method.

## Backward Compatibility

### API Compatibility

The `FindDatas` method signature remains unchanged:

```csharp
public async Task<RequestPageResult<SecondaryCircuitInspectionResultDetailOutput>> FindDatas(
    PageSearchCondition<SecondaryCircuitInspectionResultSearchConditionInput> searchCondition)
```

### Response Format

The response format remains identical. The only difference is that duplicate records are now properly eliminated before pagination.

### Client Impact

Clients may observe:
- **Fewer total records**: TotalCount may be lower, since it now reflects the deduplicated count
- **Different records per page**: Since deduplication happens first, page contents may differ
- **Same API contract**: No code changes required in clients

## Migration Strategy

### Deployment Steps

1. **Deploy Code**: Update the `SecondaryCircuitInspectionResultAppService` with new implementation
2. **Monitor Performance**: Watch query performance metrics for any degradation
3. **Verify Results**: Spot-check query results to ensure deduplication is working correctly

### Rollback Plan

If issues arise:
1. Revert to previous version of `SecondaryCircuitInspectionResultAppService`
2. The old in-memory deduplication logic will resume (with its bugs)
3. No data migration needed (this is a query-only change)

### Index Creation

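Indexes should be created on every existing collection, preferably during a maintenance window:

```javascript
// For each existing collection
db.SecondaryCircuitInspectionResult_2025_01.createIndex({
  Year: 1, Month: 1, Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1
});
```

The `background: true` option is omitted. The pipeline relies on `$unionWith`, which requires MongoDB 4.4 or later, and from MongoDB 4.2 onward the `background` option is deprecated and ignored: index builds hold an exclusive lock on the collection only briefly at the start and end of the build.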
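
To apply the index across a range of monthly collections, the collection names could be generated as sketched below. The helper `monthlyCollectionNames` is hypothetical, and the name pattern is assumed from the `SecondaryCircuitInspectionResult_YYYY_MM` convention used in this document.

```javascript
// Sketch: list monthly collection names for an inclusive (year, month) range.
function monthlyCollectionNames(startYear, startMonth, endYear, endMonth) {
  const names = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    names.push(`SecondaryCircuitInspectionResult_${year}_${String(month).padStart(2, "0")}`);
    month += 1;
    if (month > 12) {
      month = 1;
      year += 1;
    }
  }
  return names;
}

const indexSpec = {
  Year: 1, Month: 1, Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1,
};

const targets = monthlyCollectionNames(2024, 11, 2025, 2);
console.log(targets.length); // 4

// In mongosh, each name could then be passed to:
//   db.getCollection(name).createIndex(indexSpec);
```

Because `createIndex` creates a collection that does not exist yet, the generated list should first be narrowed to collections that are actually present.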