ตัวอย่าง: Matrix Multiply

โปรแกรมภาษา C คำนวณการคูณเมทริกซ์ 20×20 — เหมาะสำหรับศึกษาผลกระทบของ cache ต่อประสิทธิภาพ และเปรียบเทียบ pipeline แบบต่าง ๆ

เป้าหมายของบทเรียน

เห็นการเขียนภาษา C ใน Ripes แบบ -nostdlib ที่ provide startup เอง
วัด CPI (cycles per instruction) ของ processor model ต่าง ๆ
เห็น cache miss rate ของ algorithm ที่ access memory แบบ stride
เรียนรู้การ profile โปรแกรมจริงในระดับ instruction

วิธีโหลดโปรแกรม

ตั้งค่า C compiler ใน Edit → Settings → Editor (ดู บท Editor)
เลือกเมนู File → Load Example → C → Matrix Multiply
เปลี่ยน Input type เป็น C
กด Build (Ctrl+B)

โค้ดทั้งหมด

/**
 * คูณเมทริกซ์ขนาด WxW สองตัว
 * ใช้ -nostdlib (ไม่พึ่ง C standard library)
 */

asm("li sp, 0x100000");  // ตั้ง stack pointer ที่ 1 MB
asm("jal main");         // เรียก main
asm("mv a1, a0");        // เก็บ return value ใน a1
asm("li a7, 10");        // เตรียม ecall exit
asm("ecall");            // หยุด simulator

#define W 20  // ขนาดของเมทริกซ์

void mmul(const int a[W][W], const int b[W][W], int c[W][W]) {
    for (int row = 0; row < W; row++) {
        for (int col = 0; col < W; col++) {
            for (int k = 0; k < W; k++) {
                c[row][col] += a[row][k] * b[k][col];
            }
        }
    }
}

int main() {
    int A[W][W];
    int B[W][W];
    int C[W][W];
    mmul(A, B, C);
    return 0;
}

วิเคราะห์โครงสร้าง

1) Startup code แบบเขียนเอง

เนื่องจาก compile แบบ -nostdlib โปรแกรมต้องเตรียม environment เอง:

li sp, 0x100000 — ตั้ง stack pointer ที่ address 1 MB เพื่อให้มี stack ใช้
jal main — เรียก main เหมือนเป็นการ call function ทั่วไป
หลัง main return ค่ามาที่ a0 ให้ย้ายไป a1 และเรียก ecall 10 เพื่อหยุด

ทำไมต้อง -nostdlib? เพื่อให้ executable เล็กลงและอ่านง่าย เพราะ standard library จะ link โค้ดเสริมเข้ามา (เช่น __libc_start_main, soft-float routines) ที่ไม่จำเป็นสำหรับการเรียน

2) Loop access pattern (สำคัญสำหรับ cache!)

การคูณเมทริกซ์มี loop 3 ชั้น:

for (row = 0; row < W; row++)
    for (col = 0; col < W; col++)
        for (k = 0; k < W; k++)
            c[row][col] += a[row][k] * b[k][col];

การ access matrix ทั้ง 3 ตัวมี pattern ต่างกัน:

a[row][k] — access ตามแถว → spatial locality สูง → cache friendly
b[k][col] — access ตามคอลัมน์ → stride = W × 4 byte → cache unfriendly
c[row][col] — เขียนซ้ำที่เดิม → temporal locality สูงมาก

ลองคิด: ถ้า cache line = 4 word (16 byte) และ W = 20 matrix b มี stride = 80 byte ทำให้ทุก access เกินขนาด cache line — เกิด cache miss ทุกครั้ง!

การทดลองที่ 1: เปรียบเทียบ Pipeline

ลองที่ 1.1: Single Cycle

เลือก Single Cycle processor
กด Run
บันทึก: จำนวน cycles และ instructions retired
คำนวณ CPI = cycles / instructions

CPI จะใกล้ 1.0 เพราะ single cycle ทำ 1 instruction ใน 1 cycle เสมอ

ลองที่ 1.2: 5-Stage Pipelined

เปลี่ยนเป็น 5-Stage Processor (ตัวสมบูรณ์ที่มี forwarding + hazard detection)
กด Reset แล้ว Run
บันทึกค่าใหม่

CPI จะลดลงต่ำกว่า 1 (น่าจะอยู่ ~1.1-1.3) เพราะ pipeline ทำให้ทำงานคู่ขนานได้ แต่ก็มี stall จาก data hazard และ load-use

ลองที่ 1.3: 6-Stage Dual-issue

เปลี่ยนเป็น 6-Stage Dual-issue แล้วรันใหม่ — CPI ควรลดลงอีก เพราะรัน 2 instruction พร้อมกันได้

การทดลองที่ 2: Cache Configuration

เปิดแท็บ Cache ในระหว่างรัน Matrix Multiply:

ลองที่ 2.1: Cache เล็ก

เลือก preset "8-entry 4-word direct mapped"
กด Reset แล้ว Run
ดู Cache plot → เลือก Numerator = Hits, Denominator = Access count
บันทึก hit rate

ลองที่ 2.2: Cache ใหญ่ขึ้น

เปลี่ยนเป็น "32-entry 4-word direct mapped"
Reset → Run
เปรียบเทียบ hit rate

ลองที่ 2.3: Set-associative

เปลี่ยนเป็น "32-entry 4-word 2-way set associative"
Reset → Run
เปรียบเทียบ hit rate กับสองครั้งก่อนหน้า

คำถามชวนคิด

ทำไม hit rate ไม่เพิ่มขึ้นเท่ากับขนาด cache?
ถ้าเปลี่ยน loop ordering จาก row-col-k เป็น row-k-col hit rate จะดีขึ้นไหม? (ลองแก้โค้ดดู!)
เพิ่ม W เป็น 32 หรือ 40 hit rate จะเปลี่ยนอย่างไร?

การทดลองที่ 3: Loop Reordering

ลองแก้ไขลำดับ loop ในโค้ดเป็น row → k → col:

void mmul(const int a[W][W], const int b[W][W], int c[W][W]) {
    for (int row = 0; row < W; row++) {
        for (int k = 0; k < W; k++) {
            for (int col = 0; col < W; col++) {
                c[row][col] += a[row][k] * b[k][col];
            }
        }
    }
}

ตอนนี้ b[k][col] ถูก access แบบเรียงตาม column ทำให้ spatial locality ดีขึ้นมาก → hit rate ควรกระโดดขึ้น!

แบบฝึกหัด

เพิ่มขนาด W เป็น 50 — สังเกตว่า cycles เพิ่มขึ้นกี่เท่า?
เปรียบเทียบ CPI ของแต่ละ pipeline รวมตารางส่งอาจารย์
สำหรับ Matrix ขนาด W=20, จำนวน multiplication คือ W³ = 8000 ครั้ง — โปรแกรมใช้ instruction กี่ตัวต่อ multiplication?
ลอง blocking: แบ่ง matrix เป็น sub-block ขนาด 4×4 แล้วคูณทีละ block — hit rate ดีขึ้นแค่ไหน?