Why Does My LBM D2Q9 Implementation in CUDA C Not Conserve Density?

csp256 · Jul 1, 2013

Homework Statement

I am doing a self directed project in Cuda C for an undergraduate parallel computing class. I chose to focus on the lattice Boltzmann method (for computational fluid dynamics). I have not yet had a class on statistical mechanics, nor any other formal exposure to the physics at hand.

My attempt does not seem conserve the total particle density (it trends to 0 pretty quickly).

Homework Equations

I am primarily using the code on this page as my guide: http://www.exolete.com/code/lbm

It is D2Q9 using BGK. It worked on my Matlab instal without any tinkering on my part.

I think I understand the code, though I do not know the math used to get the equations for the FEQ matrix.

The Attempt at a Solution

I tried doing a straight forward implementation of the MATLAB code in Cuda C. I am not to the point of caring about performance: I am very well aware that some parts of the code are painfully slow.

I did do a couple things differently from the reference implementation. My matrices are all flat arrays, row-order. Specifically the F and BOUNCE arrays are 3 dimensional flat arrays, which are row-then-column-order. Each one of the densities at each cell is therefor stored with offset direction*(width of matrix * height of matrix)+x+(width of matrix)*y. Direction is 0 through 9, as in the reference, but because Cuda C is 0 indexed (unlike Matlab) the "center" cell is at direction 0 (instead of 9). The other directions are the same.

Also, in my implementation I used a left handed coordinate system (origin is upper left, y-axis increases downwards). For some reason I thought that was what the reference was using (whoops!). I *think* all I need to do to counter this is to flip some negative signs on the y direction. Even if I did mess something up here, I don't think it would account for the problem I am seeing.

Note also that my UX and UY matrices are computed on the host (CPU), not the device (GPU). They are just to help in debugging and aren't directly used in computation.

I have hard-coded the simulation size to 10 by 10 with two obstacles, one at (4,3) and the other at (6,6) [0 indexed]. Unlike the reference implementation I zeroed out those cells before the first propagation step. This is definitely not responsible for the discrepancy I am seeing (the per cell density decreases everywhere after the first step, not just on cells near the obstacles).

I would really appreciate it if anyone could take a look at this. My teacher can not help me (he is a physicist) and I suspect that whatever I did wrong is a fairly obvious conceptual error that I can't recognize because of my lack of experience.

Also, any (free) references on the LBM (or other practical computational physics techniques) that are suitable for a beginner would be greatly appreciated!

Code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
// #include <timer.h>
#include <project_main.h>
#define density 1.0
#define nx 10
#define ny 10
#define msize nx*ny
//#define numberOfActiveNodes msize-2 // see the comments in BOUND's declaration for a justification.
//#define deltaU 0.00001

#define cudaCheckErrors(msg) \
   do { \
      cudaError_t __err = cudaGetLastError();\
      if (__err != cudaSuccess) { \
         printf("Fatal error %s (%s at %s:%d)\n", msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
         exit(1); \
      } \
   } while (0)

int main(int argc, char **argv) {
	int dev_id = 0;
	if (DeviceSelect(dev_id) < 0) {
		fprintf(stderr, "Err: No GPU device found\n");
		exit(1);
	}
	//DeviceInfo(dev_id);

	// The UX, UY, and DENSITY matrices are currently only used for debugging. 
        float *UX = (float *) malloc(msize * sizeof(float));
        float *UY = (float *) malloc(msize * sizeof(float));
        float *DENSITY = (float *) malloc(msize * sizeof(float));

	// The F matrix is 3 dimensional. (really it is a flat array). The first two dimensions are the x and y coordinates.
	// The third is 9 long, and details the densities moving in each of 9 directions.
	// 0 is "center", 1 is right, 2 is upper right, 3 is up, etc.
	float *F = NULL;
	float *BOUNCE = NULL;  
	//float *TEMP = NULL;
	int F_size = 9 * msize * sizeof(float);
	F = (float *) malloc(F_size);
	BOUNCE = (float *) malloc(F_size);
	for (int i=0; i<9*msize; i++) {
		F[i] = density/9.0;
		BOUNCE[i] = 0; // take care that this continues to be initialized to 0.
	}

	bool *OBSTACLES = NULL;
	OBSTACLES = (bool *) malloc(msize*sizeof(bool));
	for (int i=0; i<msize; i++) {
		// for debugging purposes it is easier to see how things are working if we have SOME obstruction.
		// we define the two obstructions as being at (3,5) and (7,7)... in the MATLAB code.
		// BUT we are 0 indexed here so the obstructions are at (2,4) and (6,6). 
		// This is hardcoded solely for debugging.
		OBSTACLES[i] = false;
		if (i == 4+2*nx || i == 6+6*nx) {
			OBSTACLES[i] = true;
			for (int dir=0; dir<9; ++dir) {
				F[i+dir*msize] = 0.0;
			}
		}
	} 

	float *F_d, *BOUNCE_d, *TEMP_d;
	bool *OBSTACLES_d;
	cudaMalloc(&F_d, F_size);
	cudaCheckErrors("allocating F_d");
	cudaMalloc(&BOUNCE_d, F_size);
	cudaCheckErrors("allocating BOUNCE_d");
	cudaMalloc(&OBSTACLES_d, msize*sizeof(bool));
	cudaCheckErrors("allocating OBSTACLES_d");

	cudaMemcpy(F_d, F, F_size, cudaMemcpyHostToDevice);
	cudaCheckErrors("copying F to device");
	cudaMemcpy(BOUNCE_d, BOUNCE, F_size, cudaMemcpyHostToDevice);
	cudaCheckErrors("copying BOUNCE to device");	
	cudaMemcpy(OBSTACLES_d, OBSTACLES, msize*sizeof(bool), cudaMemcpyHostToDevice);
	cudaCheckErrors("copying OBSTACLES to device");        for (int i=0; i<9; ++i) {
		printf("Printing F in direction %d\n", i);
                printMatrix(F, i);
        }

	createMatrices(F,UX,UY,DENSITY);

	for (int i=0; i<10; ++i) {
                printf("\n###########\n");
                printf("###########\n");
		printf("###########\n");
        	printf("###########  Main loop number: %d\n", (i+1));
        	printf("###########\n");
                printf("###########\n");
                printf("###########\n");

// ###############################################################################################################################
		printf("\n&&& Launching kernels.\n");
		LBM_D2Q9(F_d, BOUNCE_d, OBSTACLES_d, 1); 
// ###############################################################################################################################

		printf("\nSwapping the F_d and BOUNCE_d pointers in main.\n");
		TEMP_d = F_d;
		F_d = BOUNCE_d;
		BOUNCE_d = TEMP_d;

        	cudaMemcpy(F, F_d, F_size, cudaMemcpyDeviceToHost);
        	cudaCheckErrors("copying F_d to host");
		cudaMemcpy(BOUNCE, BOUNCE_d, F_size, cudaMemcpyDeviceToHost);
		cudaCheckErrors("copying BOUNCE_d to host");

		printf("\nThis is the F matrix:\n");
		for (int i=0; i<9; ++i) {
			printf("Printing F in direction %d\n", i);
			printMatrix(F, i);
		}

		printf("\nThis is the BOUNCE matrix:\n");
                for (int i=0; i<9; ++i) {
			printf("Printing BOUNCE in direction %d\n", i);
                        printMatrix(BOUNCE, i);
                }

		printf("\n\n\n\nMaking UX, UY, and DENSITY matrices from F...\n");
		createMatrices(F, UX, UY, DENSITY);
		printMatrix(UX, 0);
		printMatrix(UY, 0);
		printMatrix(DENSITY, 0);

	}

	cudaFree(F_d);
	cudaCheckErrors("freeing F_d");
	cudaFree(BOUNCE_d);
	cudaCheckErrors("freeing BOUNCE_d");
	cudaFree(OBSTACLES_d);
	cudaCheckErrors("freeing OBSTACLES_d");

	printf("\n\n");
//	drawMatrix();
}

void printMatrix(float *F, int direction) {
	printf("Printing some matrix in direction %d... \n", direction);
	for (int j=0; j<ny; j++) {
		for (int i=0; i<nx; i++) {
			printf("%f ", F[direction*msize + i+j*nx]);
		}
		printf("\n");
	}
	printf("\n");
}

void createMatrices(float *F, float *UX, float *UY, float *DENSITY) {
	float totalDensity = 0;

	for (int i=0; i<msize; ++i) {
		DENSITY[i] = 0;
	}

	for (int dir=0; dir<9; ++dir) {
		for (int i=0; i<msize; ++i) {
			//totalDensity += F[dir*msize + i];
			DENSITY[i] += F[dir*msize + i];
			if (dir == 8 | dir == 1 | dir == 2) {
				UX[i] += F[dir*msize + i];
			} else if (dir == 4 | dir == 5 | dir == 6) {
				UX[i] -= F[dir*msize + i];
			}
			
			if (dir == 2 | dir == 3 | dir == 4) {
				UY[i] -= F[dir*msize + i];
			} else if (dir == 6 | dir == 7 | dir == 8) {
				UY[i] += F[dir*msize + i];
			}
		}
	} 

	totalDensity = 0.0; // less floating point errors this way
	for (int i=0; i<msize; ++i) {
		totalDensity += DENSITY[i];
	}
	printf("Total density is %f\n", totalDensity);
}

Code:

#include <stdio.h>
#include <project_main.h>

#define omega 1.0

#define wrapRight(x) if (nx <= x) x = 0;
#define wrapLeft(x) if (x < 0) x = nx - 1;
#define wrapUp(y) if (y < 0) y = ny - 1;
#define wrapDown(y) if (ny <= y) y = 0;

// pass them F_old and BOUNCE, then swap the pointers after the function call
__global__ void LBM_D2Q9_Kernel_Propogate(float *F, float *F_new) {
	int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;
	if (nx <= x || ny <= y) return;

	int x_target, y_target;

	// Forgive me, for I have sinned...
	// right - 1
	x_target = x+1;
	wrapRight(x_target);
	y_target = y;
	F_new[msize*1 + y_target*nx + x_target] += F[msize*1 + y*nx + x]; // indexing gets tricky here; watch yourself!

	// upper right - 2
	x_target = x+1;
	wrapRight(x_target);
	y_target = y-1; // we are using a LEFT handed coordinate system, going to upper right is decreasing y!
	wrapUp(y_target);
	F_new[msize*2 + y_target*nx + x_target] += F[msize*2 + y*nx + x];

	// up - 3
	x_target = x;
	y_target = y-1;
	wrapUp(y_target);
	F_new[msize*3 + y_target*nx + x_target] += F[msize*3 + y*nx + x];

	// upper left - 4
        x_target = x-1;
	wrapLeft(x_target);
        y_target = y-1;
	wrapUp(y_target);
        F_new[msize*4 + y_target*nx + x_target] += F[msize*4 + y*nx + x];

	// left - 5
        x_target = x-1;
	wrapLeft(x_target); 
        y_target = y;
        F_new[msize*5 + y_target*nx + x_target] += F[msize*5 + y*nx + x];

	// bottom left - 6
        x_target = x-1;
        wrapLeft(x_target);
        y_target = y+1;
	wrapDown(y_target);
        F_new[msize*6 + y_target*nx + x_target] += F[msize*6 + y*nx + x];

	// down - 7
        x_target = x;
        y_target = y+1;
        wrapDown(y_target);
        F_new[msize*7 + y_target*nx + x_target] += F[msize*7 + y*nx + x];

	// bottom right - 8
        x_target = x+1;
	wrapRight(x_target);
        y_target = y+1;
        wrapDown(y_target);
        F_new[msize*8 + y_target*nx + x_target] += F[msize*8 + y*nx + x];
	
	// *weeps openly*

	F[y*nx + x] = 0.0;
        F[msize + y*nx + x] = 0.0;
        F[2*msize + y*nx + x] = 0.0;
        F[3*msize + y*nx + x] = 0.0;
        F[4*msize + y*nx + x] = 0.0;
        F[5*msize + y*nx + x] = 0.0;
        F[6*msize + y*nx + x] = 0.0;
        F[7*msize + y*nx + x] = 0.0;
        F[8*msize + y*nx + x] = 0.0;
}
__global__ void LBM_D2Q9_Kernel_Collide(float *F, bool *OBSTACLE, float *BOUNCE) {
        int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;
        if (nx <= x || ny <= y) return;

	if (OBSTACLE[y*nx + x]) {
		int target_x, target_y;

		// write densities to bounce back...
		// (reverse density direction, and move it in that direction)
		// right - 1
		target_x = x-1;
		wrapLeft(target_x);
		target_y = y;
		BOUNCE[5*msize + target_x+target_y*nx] = F[1*msize + y*nx + x];

		// upper right - 2
                target_x = x-1;
                wrapLeft(target_x);
                target_y = y+1;
		wrapDown(target_y);
                BOUNCE[6*msize + target_x+target_y*nx] = F[2*msize + y*nx + x];

		// up - 3
		target_x = x;
		target_y = y+1;
		wrapDown(target_y);
                BOUNCE[7*msize + target_x+target_y*nx] = F[3*msize + y*nx + x];

		// upper left - 4
		target_x = x+1;
		wrapRight(target_x);
		target_y = y+1;
		wrapDown(target_y);
                BOUNCE[8*msize + target_x+target_y*nx] = F[4*msize + y*nx + x];

		// left - 5	
		target_x = x+1;
		wrapRight(target_x);
		target_y = y;
                BOUNCE[msize + target_x+target_y*nx] = F[5*msize + y*nx + x];

		// bottom left - 6
		target_x = x+1;
		wrapRight(target_x);
		target_y = y-1;
		wrapUp(target_y);
                BOUNCE[2*msize + target_x+target_y*nx] = F[6*msize + y*nx + x];

		// down - 7
		target_x = x;
		target_y = y-1;
		wrapUp(target_y);
                BOUNCE[3*msize + target_x+target_y*nx] = F[7*msize + y*nx + x];

		// bottom right - 8
		target_x = x-1;
		wrapLeft(target_x);
		target_y = y-1;
		wrapUp(target_y);
                BOUNCE[4*msize + target_x+target_y*nx] = F[8*msize + y*nx + x];

		// delete the densities at this cell (they have bounced)
		F[x+y*nx] = 0.0;
		F[msize + y*nx + x] = 0;
		F[2*msize + y*nx + x] = 0;
                F[3*msize + y*nx + x] = 0;
                F[4*msize + y*nx + x] = 0;
                F[5*msize + y*nx + x] = 0;
                F[6*msize + y*nx + x] = 0;
                F[7*msize + y*nx + x] = 0;
                F[8*msize + y*nx + x] = 0;
	} else {
		// Calculate local density
		float UX, UY, temp;
		float DENSITY = F[y*nx + x];

		// right - 1
		temp = F[msize + y*nx + x];
		DENSITY += temp;
		UX = temp;

		// upper right - 2
       		temp = F[2*msize + y*nx + x];
        	DENSITY += temp;
        	UX += temp;
		UY = -temp;

		// up - 3
		temp = F[3*msize + y*nx + x];
        	DENSITY += temp;
		UY -= temp;

		// upper left - 4
        	temp = F[4*msize + y*nx + x];
        	DENSITY += temp;
		UX -= temp;
        	UY -= temp;
	
		// left - 5
        	temp = F[5*msize + y*nx + x];
        	DENSITY += temp;
        	UX -= temp;

		// bottom left - 6
        	temp = F[6*msize + y*nx + x];
        	DENSITY += temp;
        	UX -= temp;
		UY += temp;

		// bottom - 7
        	temp = F[7*msize + y*nx + x];
        	DENSITY += temp;
        	UY += temp;

		// bottom right - 8
        	temp = F[8*msize + y*nx + x];
        	DENSITY += temp;
        	UX += temp;
		UY += temp;

		// normalize the pressures
		UX /= DENSITY;
		UY /= DENSITY;

		// adjust inlet pressure
		if (x == 0) UX += deltaU;

		// calculate temp variables, U_SQU, U_C2, U_C4, U_C6, U_C8
		float U_SQU = UX*UX + UY*UY;

		// calculate equilibrium distribution for stationary particles
		F[y*nx + x] = omega*(t1*DENSITY*(1-U_SQU/(2*c_squ)))  +  (1-omega)*F[y*nx + x];	
	
		// ... nearest neighbors
		F[msize + y*nx + x] = omega*(t2*DENSITY*(1+UX/c_squ + 0.5*pow((UX/c_squ),2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[msize + y*nx + x];
		F[3*msize + y*nx + x] = omega*(t2*DENSITY*(1-UY/c_squ + 0.5*pow((UY/c_squ),2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[3*msize + y*nx + x];
		F[5*msize + y*nx + x] = omega*(t2*DENSITY*(1-UX/c_squ + 0.5*pow((UX/c_squ),2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[5*msize + y*nx + x];
		F[7*msize + y*nx + x] = omega*(t2*DENSITY*(1+UY/c_squ + 0.5*pow((UY/c_squ),2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[7*msize + y*nx + x];

		// ... next-nearest neighbors
		F[2*msize + y*nx + x] = omega*(t3*DENSITY*(1+(UX-UY)/c_squ + 0.5*pow((UX-UY)/c_squ,2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[2*msize + y*nx + x];
                F[4*msize + y*nx + x] = omega*(t3*DENSITY*(1+(-UX-UY)/c_squ + 0.5*pow((-UX-UY)/c_squ,2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[4*msize + y*nx + x];
                F[6*msize + y*nx + x] = omega*(t3*DENSITY*(1+(-UX+UY)/c_squ + 0.5*pow((-UX+UY)/c_squ,2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[6*msize + y*nx + x];
                F[8*msize + y*nx + x] = omega*(t3*DENSITY*(1+(UX+UY)/c_squ + 0.5*pow((UX+UY)/c_squ,2.0) - U_SQU/(2*c_squ) ))  +  (1-omega)*F[8*msize + y*nx + x];

		// Relax to equilibrium done inline, above.
	}
	// "Look on my works, ye Mighty, and despair!"
}void LBM_D2Q9(float *F_d, float *BOUNCE_d, bool *OBSTACLES_d, int iterations) {
	printf("Iterations: %d\n\n", iterations);
	float *temp;

	dim3 numberOfBlocks_Propogate(1,1,1);
	dim3 threadsPerBlock_Propogate(nx,ny,1);

	dim3 numberOfBlocks_Collide(1,1,1);
	dim3 threadsPerBlock_Collide(nx,ny,1);

	for (int i=0; i<iterations; ++i) {
		LBM_D2Q9_Kernel_Propogate<<<numberOfBlocks_Propogate, threadsPerBlock_Propogate>>>(F_d, BOUNCE_d);

		temp = BOUNCE_d;
		BOUNCE_d = F_d;
		F_d = temp;

		LBM_D2Q9_Kernel_Collide<<<numberOfBlocks_Collide, threadsPerBlock_Collide>>>(F_d, OBSTACLES_d, BOUNCE_d);
//		LBM_D2Q9_Kernel_Collide<<<numberOfBlocks_Collide, threadsPerBlock_Collide>>>(BOUNCE_d, OBSTACLES_d, F_d);
	}
}

Why Does My LBM D2Q9 Implementation in CUDA C Not Conserve Density?

Homework Statement

Homework Equations

The Attempt at a Solution

1. What is LBM D2Q9 BGK in Cuda C?

2. What are the benefits of using LBM D2Q9 BGK in Cuda C?

3. How does LBM D2Q9 BGK in Cuda C work?

4. What are the limitations of using LBM D2Q9 BGK in Cuda C?

5. What are some real-world applications of LBM D2Q9 BGK in Cuda C?

Similar threads

Hot Threads

Recent Insights