img_56ad20f016e88

Hardware/ Software Co-Design of 1024-Point FFT

Share on Facebook0Tweet about this on TwitterShare on LinkedIn0Email this to someone

By Kelvin Yuk and Chi Ho Yue
EEC282
June 14, 2002

Introduction

The purpose of this project is to explore details and benefits of co-design. We choose to implement the 1024-Point FFT for its applications in signal processing and communication. The 1024-Point FFT also presents a challenging problem in terms of speed and complexity demanding thoroughly designed hardware and software components. The first part of this report consists of a general description of the design requirements. We then continue by discussing a purely software implementation of the FFT and estimate its performance. Next, applying co-design concepts, we present our two possible implementations of the FFT in a shared HW/SW realization and analyze the performance benefits over the software implementation. Finally, we summarize the analysis and key points of the project.

Design specifications

The 1024-point FFT is a faster implementation of the DFT equation given by

The inputs to our 1024-point FFT are a set of discrete values of a sampled signal x(n). The outputs are the complex coefficients representing the frequency spectrum of the signal.

Data Representation

Since computation of the FFT coefficients requires the manipulation of complex numbers, we designate our own fixed-point number representation. We store the data needed to perform the computation in the following format

Each number is contained in a 32-bit word size. The upper 16 bits contain the real part of the complex number. The lower 16 bits contain the imaginary part of the number. Of the 16 bits used for each part, we use 1.15 fixed-point format that represents numbers from . The FFT is computed by stages and at each stage the values are divided by 2 in order to preserve the precision at the output to 1.15 format. The final results are the actual coefficients scaled by 210.

Target Architecture and Performance Requirements

The performance is highly depended on two major aspects: the speed of the processor and memory and the custom hardware built for the FFT(e.g. butterfly complex multipliers). The performance of the software implementation will be determined primarily by the speed of the processor and memory. The hardware/software implementation will dependant on the processor speed and memory as well as the speed of the custom hardware used to accelerate the system. We hope that the custom designed hardware will produce significant performance increases over the standard software implementation.

The target architecture for measuring the performance of the algorithms is below and this information is taken from

http://www.specbench.org/osg/cpu95/results/res96q4/cpu95-960912-01292.html

Hardware Information:

Although the system is designed specifically for producing a 1024-point result, the end user may perhaps need to sample less data points. Therefore or implementation of the system is unspecific to the actually number of points as long as it is a power of two. Both the software and hardware are easily configurable to accept up to less that 1024-points.

Software Implementation

Operation

We designed a software implementation of the 1024-point FFT in C. The program is written with a hardware components perspective to make the later hardware/software partitioning easier to perform. The main algorithm is described as follows:

For the first stage, the addresses in memory are computed using address reversal. The source memory array is accessed and the FFT butterfly is executed for all sequential data pairs and the result is stored into another memory array.

In subsequent stages, each stage is broken into blocks depending on the stage number. Each stage has N/(2stage) blocks. Each block is traversed by applying the butterfly function to its contents. The data required by the 2-input Butterfly function fetched from memory and the addresses are computed using the block size, block number and current element. Also, for each call to the 2-input butterfly function, the WNk value is computed. The outputs of the Butterfly function are stored back into memory into the opposite data array. The process is repeated for all stages totaling ten.

The code is included in Fig A1 for your perusal.

Verification of our program

The program was verified using MATLAB FFT function. Below are two verification trials. The first one is a signal composed of 3 cosine functions and the second is a randomly generated signal.

(1)        x[n]=0.33*cos(n/256*400pi)+0.33*cos(n/256*600pi)+0.33*cos(n/256*100pi).
(2)        x[n] = random signal

The data from our program are all multiplied by 210 scale the values correctly.

Figs 1 and 2 contain the graphs of these two functions, respectively.

Memory requirements

The memory used in this software implementation consists mainly of the two arrays, which alternate storage when computing from one stage to another. The integer arrays “arra1[1024]” and “arra2[1024]” hold the initial, intermediate and final computed values of the FFT.

We also estimate about 50 extra words needed for intermediate computations, which is still minor compared to the large 1024 size arrays.

1024 arra1 + 1024 arra2 + 50 extra words for intermediate computations = 2098 words * 4 bytes/word = 8.4kB

Performance Estimation

Measuring execution time of the algorithm

We initially estimate the execution time using the C function ftime(), which gives the time accurate to the millisecond. Using this function, the time was taken at the beginning and at the end of the algorithm. The difference those times is the runtime. This time measurement is accurate since I/O is not added into the runtime. After repeated 10 trials, the runtime was approximately

Run time estimation and profiling

For the run time estimation, we first convert our C code into assembly code. From the assembly code, we count the number of instructions each function runs including and also estimate the looping of some code. For each function, the number of instructions in the function and the number of times the function is called is multiplied to get an estimate of total instructions executed from that function. The clock speed will determine how long it takes for one instruction to execute and helps us determine how much time each function consumed during the processing.

There are three major function is our 1024 FFT computation: fftfunc2(), bit_reverse(), and wn_compute().

Therefore, the estimated run time = 36.9ms

Comparing the estimated result with the measured result, we have a percent error off by 40%. This is probably due to the memory accesses and communication overhead. We observed that the function that computes WNk is very large in terms of instructions compared to other functions. The other two functions that compute the FFT term and indexes contain many instructions too. Therefore, for the hardware acceleration part, we convert those functions into hardware for a major speed up in performance.

Hardware Implementation

Software Improvement

Before identifying the kernels of the program for eventual conversion into hardware, we improve make some improvements to the software implementation. We found that in our code, computing the WNk values was time consuming. This is because the function inefficiently runs every time the FFT butterfly runs for every stage (except for the first stage where the value=1) when there are only 2stage-1 unique WNk values. The function itself is processor consuming in that it requires the use of sine, cosine and power functions. To eliminate this bottleneck, we compute the WNk values beforehand and store them into memory and each value will be passed to the butterfly unit when needed.

Identification of the most time consuming parts of the computation (kernels)

In reviewing the software implementation of the FFT, we identify the following parts of the program to be the most time consuming and that can be reduced by hardware implementation

void fftfunc2(int index1, int index2, int wn, int outdex1, int outdex2)

Performs the 2-input butterfly function. Within this function there is one complex multiplication, one complex addition and one complex subtraction. This function receives the addresses of the values it must retrieve in memory and also the destination addresses. As a result, the function loads and stores to the memory costing more communication overhead.

int rev_bits(int number, int bits)

Reverses the bit order of the index. This function is called 1024 times for the first FFT stage and performs several shift instructions for every call.

The two sets of kernels we are modeling in hardware are:

Model #1:

  • fftfunc2 with an 2-input butterfly computation unit.
  • Two sets of dedicated memory for data storage

Model #2:

  • fftfunc2 with an 2-input butterfly computation unit.
  • Two sets of dedicated memory for data storage
  • bit_reverse capability within the ALU

We will refer to each of the custom memory elements as MEM1 and MEM2.

VHDL model of the kernels

VHDL model is included in Fig A2.

The VHDL models include the above two hardware functions only and have not been verified since the VHDL code is actually only a trivial matter.

Communication scheme between the CPU (software) and the dedicated hardware kernels and memory

The overall block diagram for the proposed hardware/software implementation is shown in Fig 3. The reverse address calculator is shown in Fig. 4. A detailed schematic of the FFT components is shown in Fig 5. The a basic diagram for the functional unit for the Butterfly is shown in Fig. 6 and the detailed schematic of the Butterfly controller is shown in Fig 7.

Although the kernels for the FFT butterfly and the index reversal have been modeled in hardware, there is no simple way to communicate the data between the CPU and this hardware while minimizing clock cycles. We will address each of the kernels separately but their operation is ultimately interlinked.

Address bit reversal

The address reversal is built into the CPU and implemented as an ALU function. It will consume almost no time at all since it just a reversal of the wires of the bits. To initialize the first stage, the x[n] values from the memory are loaded into the MEM1 into the order determined by the reverse bit function.

Butterfly computation

In order to control the proper operation of the Butterfly unit and movement of data between the memory or CPU and the Butterfly unit, a few dedicated instructions need to be added to the instruction set.

  1. read_in(start_index1) takes the data from memory and loads it into the MEM1 of the butterfly unit. It sets a readin control signal to the FFT controller.
  2. write_out(start_index1) takes the data from the butterfly unit cache1 and sends it into memory. It sets a writeout control signal to the FFT controller
  3. fetch_wn(wn_index) commands the butterfly controller to fetch the needed WNk value from the pre-computed set in the computer memory.
  4. bfly2_run(index1) signals the unit to fetch data from one MEM, perform the parallel butterfly computation, and then store the result back into the other MEM.
  5. write_stage(stage) sends the value of the current stage into the butterfly controller for control purposes
  6. write_blocksize(blocksize/2) stores the value blocksize/2 into the butterfly controller in order to compute index2.

Hardware and memory requirements

The Butterfly computation requires two dedicated memory elements MEM1 and MEM2, a Butterfly computational unit and a Butterfly control unit. The dedicated memory is used to store the source and target values from each intermediate stage. The Butterfly computational unit takes two data and one WNk data value, performs complex multiplication and additions to the data, and outputs the two results.

The Butterfly control unit receives the stage number, the blocksize/2 value and the index of the first element and sends it into the functional unit. It also addresses the WNk and controls how the data is passed between the two Butterfly caches.

In addition to the dedicated hardware memory, the system still needs 1024 words from the computer memory to store the initial x(n) values. The final results after FFT computation can be stored back into this memory array and overwrite the original source data. The system also needs 1024 words to store the pre-computed WNk values and about 50 words for temporary variables.

Operation of FFT unit

The FFT computation operates as follows:

  1. The CPU is set to output the reversed bit address to retrieve the x[n] elements from the general purpose memory and store them into the first cache in the reversed address order via the instruction read_in(index1).
  2. The number of the current stage is sent to the Butterfly controller via the new instruction write_stage(stage). It is used for control purposes to alternate which cache is the source one and which is the destination one.
  3. The blocksize is computed for the current stage. This value is sent into the Butterfly controller via the instruction write_blocksize(blocksize/2). This will be used to compute the start_index2 to select the correct second input to the Butterfly functional unit since we have designed the Butterfly controller to only require one input, the start_index1.
  4. The FFT function is set to run via the instruction bfly2_run(index1).
    1. The index 1 is read into the control unit and index2 is computed from that.
    2. The two indexes are sent to the source cache and the values are retrieved and sent to the functional unit.
    3. The required WNk value is fetched from the external memory via the instruction fetch_wn(wn_index). The index of the correct WNk is wn_index=k%(blocksize/2).
    4. The functional unit computes the two results.
    5. The results are stored into the destination MEM.
    6. This step repeats for all pairs of elements in each block in the stage.
  5. Steps (2)-(5) are repeated for all stages.
  6. After ten stages, the resulting FFT values should be stored into MEM1. We store the each element from MEM1 into the general purpose memory via the instruction write_out(index1).

Performance Estimation

We estimated the performance in two ways: profiling of the execution time by assembly code and runtime estimation using the ftime() function.

Performance estimation based on actual clock cycles

First, we did a performance estimation based on the number of clock cycles each kernel requires and also how many times the kernel is called. The execution time is then estimated using the clock frequency of the processor and that total number of clock cycles needed. The results are shown below:

Compared to the software performance analysis, the performance increase of kernel #1 is:

36.6ms / 2.7ms (estimated) = 13.56 times faster (estimated)
36.6ms / 4.0ms (actual) = 9.15 times faster (actual)

Compared to the software performance analysis, the performance increase of kernel #2 is:

36.6ms / 0.75ms (estimated) = 48.8 times faster (estimated)
36.6ms / 1.2ms (actual) = 1.2 times faster (actual)

Performance estimation in software

We performed an estimation in software by replacing the software kernels with a “hardware call” that is modeled only in terms of clock cycles. These clock cycles are added via the use of dummy instructions which act as placeholders of the time consumed to activate and run the kernel. The number of clock cycles per kernel is a rough estimate based on the proposed hardware design.

After replacing utilizing the kernels, the execution time was computed again using the C function ftime which gives the execution time accurate to the millisecond

The total measured execution time for kernel #1 and kernel #2 was about 4.0ms and 1.2ms, respectively. The implementation with kernel #2 is 50 times faster than our previous software implementation which took 60ms.

The code for the performance estimation is included in Fig A3.

Summary and Conclusions

After putting the hardware accelerator, with the kernel #1, the area increase is estimated to be ~50%. Its corresponding performance increase is about 15 times faster (4.0ms is 15 times faster than 60ms). With the kernel #2, the area increase is estimated to ~50% but slightly is larger then kernel #1 by about 1K sq. lambda. And the corresponding performance is 50 times faster than the pure software implementation and 3.3 times faster than kernel #1. Therefore, we conclude that kernel #2 is superior to the 2 other implementations.

By implementing the 1024-point FFT in software and comparing the performance to our two HW/SW partitioned configurations, we have learned the significance of co-design and how it can affect performance and costs. We were able to successfully measure the performance gains using this hybrid approach to system level design and explore the design space of HW/SW realization.

Software Implementation

Verification

See plots

Fig A1. C code for software implementation for FFT
/*--------------------------project1.c--------------------

This is a software implementation of the Fast Fourier Transform
The major memory elements are declared global.

----------------------------------------------------------*/


#include <stdio.h>
#include <math.h>
#include <time.h>
#include <sys/types.h>
#include <sys/times.h>
#include <sys/timeb.h>

int arra1[1024];
int arra2[1024];
int start_index[1024];
int elements; 
int trial;
/*
int commul_ex=0, comadd_ex=0, comsub_ex=0;
int fftfunc2_ex=0, compute_wn_ex=0;
int rev_bits_ex=0, read_input_ex=0, write_output_ex=0;
*/
/*-------------- complex number operations --------------*/

/* complex multiply of two array elemts*/
void
commul(int in1r, int in1i, int in2r, int in2i, int* outr, int* outi)
{
//  commul_ex+=1;

  int tempr, tempi, tempx;
  tempr = in1r * in2r;
  tempx = in1i * in2i;
  *outr = ((tempr - tempx) &0xffffffff) >> 15;

  tempi = in1r * in2i;
  tempx = in2r * in1i;
  *outi = ((tempi + tempx) &0xffffffff) >> 15;

}

/* complex addition of two array elements */
// when called, eg.. comadd(a, b, c, d, &e, &f);
void
comadd(int in1r, int in1i, int in2r, int in2i, int* outr, int* outi)
{
//   comadd_ex+=1;

   int tempr, tempi;
   *outr = (in1r + in2r) >> 1;
   *outi = ((in1i + in2i) & 0x0001ffff) >> 1;
}

/* complex subtraction of two array elements */
void
comsub(int in1r, int in1i, int in2r, int in2i, int* outr, int* outi)
{
//   comsub_ex+=1;

   int tempr, tempi;
   *outr = (in1r - in2r) >> 1;
   *outi = ((in1i - in2i) & 0x0001ffff) >> 1;
}

/* 2-input FFT butterfly receives the indexes for the 2 input and 2
output elements and the W_N^k value*/
void
fftfunc2(int index1, int index2, int wn, int outdex1, int outdex2)
{
//  fftfunc2_ex+=1;

  int in1r, in1i, in2r, in2i, out1, out2, wnr, wni;
  int temp1, temp2, temp3, temp4, temp5, temp6;

  if( (trial % 2) == 0)
  { 
    in1i = arra1[index1];
    in2i = arra1[index2];
  }else
  {
    in1i = arra2[index1];
    in2i = arra2[index2];
  }

  in1r = (in1i >> 16);
  in2r = (in2i >> 16);
  wnr  = (wn   >> 16);
  in1i = (in1i << 16) >> 16;
  in2i = (in2i << 16) >> 16;
  wni  = (wn   << 16) >> 16;
  
  commul(in2r, in2i, wnr, wni, &temp1, &temp2);
  comadd(in1r, in1i, temp1, temp2, &temp3, &temp4);  
  comsub(in1r, in1i, temp1, temp2, &temp5, &temp6);

  temp1 = (temp3 << 16) + (temp4 & 0x0000ffff);
  temp2 = (temp5 << 16) + (temp6 & 0x0000ffff);

  if( (trial % 2 ) == 0)
  {
    arra2[outdex1] = temp1;
    arra2[outdex2] = temp2;
  }else
  {
    arra1[outdex1] = temp1;
    arra1[outdex2] = temp2;
  }
}

int compute_wn(int k, int blocksize)
{
//   compute_wn_ex+=1;

   int wn;

   float real, rrr;
   float imag, iii;
   real=(cos(-2*3.1415*k/blocksize)*pow(2,15))/1.00001;
   rrr = real/pow(2,15);
   imag=(sin(-2*3.1415*k/blocksize)*pow(2,15))/1.00001;
   iii = imag/pow(2,15);
   wn = (int(real) << 16) + (int(imag) &0x0000ffff);

   return wn;
}

int rev_bits(int number, int bits){
   
/*receives a number and reverses its bit order
depending on how many bits wide the number is */

//   rev_bits_ex+=1;
   
   int mask;
   int rev_num=0;
   
   for(int i=0; i<bits; i++){  
      if(i==0)
         mask=1;
      else
         mask <<= 1;
      rev_num <<= 1;
      if(mask & number)
         rev_num=rev_num+1;
   }
   return rev_num;
} 


void read_input()
{
//   read_input_ex+=1;

   float tempin;
   FILE* fp;
   fp=fopen("input.txt","r");

   if(fp==NULL){
      printf("No file.\n");
   }
   else{
      printf("File opened successfully.\n");

      int i = 0;
      while(!feof(fp)){
         fscanf(fp, "%f", &tempin);
         arra1[i] = (int(tempin*pow(2,31)))&0xffff0000;
         i++;
      }
   }
   fclose(fp);
}

void write_output()
{
//   write_output_ex+=1;

   int temp;
   float tempxf, tempyf, magnn;
 
   FILE* fp;
   fp=fopen("output.txt","w");
   
   for(int i=0;i<1024;i++){
        temp = arra1[i];
        tempyf = float(temp >> 16)/pow(2,15);
        tempxf = float((temp << 16) >> 16)/pow(2,15);
      
        magnn = sqrt(tempxf*tempxf + tempyf*tempyf);
        fprintf(fp,"%f ",magnn);
   }
   fclose(fp);
}

void main(int argc, char* argv){

   timeb time1,time2;

   int i=1;
   int wn;
   int start_ind1;
   int start_ind2;
   int stages = 10;
   int blocksize;
   int blocks;


   trial=0;
   elements=1024;
   int N=elements;
   int bits;
   bits=int(ceil(log10(elements)/log10(2)));

   read_input();

   ftime(&time1);

   //do first stage  
   for(int k=0; k<(elements/2); k++){  //compute indexes
      start_ind1=rev_bits(2*k,bits);
      start_ind2=rev_bits(2*k+1,bits);

      wn=0x7fff0000;
      fftfunc2(start_ind1, start_ind2,wn, 2*k, 2*k+1);

   }

   //complete the rest of the stages
   for(i=2; i<= stages; i++){        //go through each stage
      trial++;
      blocksize=int(pow(2,i));            //compute size of block and
      blocks=N/blocksize;            //  number of blocks

      for(int j=0; j<blocks; j++){        //go through each block
         int block_offset;
         block_offset=j*(N/blocks);    //starting index of block
         //compute indexes
         for(int k=0; k<(blocksize/2); k++){
            start_ind1=block_offset + k;
            start_ind2=start_ind1 + (blocksize/2);

            //compute wn_blocksize^k
            wn=compute_wn(k,blocksize);

            //compute the values needed
            fftfunc2(start_ind1, start_ind2, wn, start_ind1, start_ind2);
         }
      }


   }

   ftime(&time2);
   printf("time= %dsec%dms\n",(int)(time2.time-time1.time),
          time2.millitm-time1.millitm);

   //write the output.txt file
   write_output();
/*
printf("commul=%d comadd=%d comsub=%d\n",commul_ex,comadd_ex,comsub_ex);
printf("fftfunc2=%d compute_wn=%d\n",fftfunc2_ex,compute_wn_ex);
printf("rev_bits=%d read_input=%d ",rev_bits_ex,read_input_ex);
printf("write_output=%d\n",write_output_ex);
*/
}

 

Hardware Implementation

Fig. A2 VHDL code for Butterfly and Index Generator
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_signed.all;

--use WORK.my_pkg.all;

entity butfly2 is
    port (ins1, ins2, wn : in std_logic_vector(31 downto 0);
            outs1, outs2 : out std_logic_vector(31 downto 0));
end butfly2;

architecture abutfly2 of butfly2 is

signal ins1r, ins1i   : std_logic_vector(15 downto 0);
signal ins2r, ins2i   : std_logic_vector(15 downto 0);
signal wnr,   wni     : std_logic_vector(15 downto 0);
signal outs1r, outs1i : std_logic_vector(16 downto 0);
signal outs2r, outs2i : std_logic_vector(16 downto 0);
signal tempr,  tempi  : std_logic_vector(31 downto 0);
signal temx1r, temx1i : std_logic_vector(31 downto 0);
signal temx2r, temx2i : std_logic_vector(31 downto 0);
signal temp3r, temp3i : std_logic_vector(16 downto 0);

begin

ins1r <= ins1(31 downto 16);
ins1i <= ins1(15 downto 0 );
ins2r <= ins2(31 downto 16);
ins2i <= ins2(15 downto 0 );
wnr <= wn(31 downto 16);
wni <= wn(15 downto 0 );

---------------   multiply Wn with IN2 -----------------
temx1r <= ins2r * wnr;
temx2r <= ins2i * wni;
tempr  <= temx2r - temx1r;
temp3r <= tempr(31 downto 15);

temx1i <= ins2r * wni;
temx2i <= ins2i * wnr;
tempi  <= temx1i + temx2i;
temp3i <= tempi(31 downto 15);

----------- addition and subtraction --------------------

outs1r <= ins1r + temp3r;
outs1i <= ins1i + temp3i;

outs2r <= ins1r - temp3r;
outs2i <= ins1i - temp3i;

--------------------------------------------------------

outs1(31 downto 16) <= outs1r(16 downto 1);
outs1(15 downto 0 ) <= outs1i(16 downto 1);
outs2(31 downto 16) <= outs2r(16 downto 1);
outs2(15 downto 0 ) <= outs2i(16 downto 1);

end abutfly2; -- architecture

configuration butfly2_cfg of butfly2 is
  for abutfly2
  end for;
end butfly2_cfg;

------------------------------------------------------------------------------------------------------------
VHDL code for index computation kernel

Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

--use WORK.my_pkg.all;

entity indcom is
    port (index: in std_logic_vector(9 downto 0);
            outdex : out std_logic_vector(9 downto 0));
end indcom;

architecture aindcom of indcom is
begin

outdex(9 downto 0) <= index(0 to 9);

end aindcom; -- architecture

configuration indcom_cfg of indcom is
  for aindcom
  end for;
end indcom_cfg;

 

Fig A3. Modified C code to simulate hardware calls
/*--------------------------project1.c--------------------

This is a software implementation of the Fast Fourier Tranform
The major memory elements are declared global.

----------------------------------------------------------*/


#include <stdio.h>
#include <math.h>
#include <time.h>
#include <sys/types.h>
#include <sys/times.h>
#include <sys/timeb.h>

int arra1[1024];
int arra2[1024];
int wn[1024];
int blocksize[11];
int elements; 
int trial;
int dummywil;

/*
int commul_ex=0, comadd_ex=0, comsub_ex=0;
int fftfunc2_ex=0, compute_wn_ex=0;
int rev_bits_ex=0, read_input_ex=0, write_output_ex=0;
*/
/*-------------- complex number operations --------------*/

/* complex multiply of two array elemts*/
void
commul(int in1r, int in1i, int in2r, int in2i, int* outr, int* outi)
{
//  commul_ex+=1;

  int tempr, tempi, tempx;
  tempr = in1r * in2r;
  tempx = in1i * in2i;
  *outr = ((tempr - tempx) &0xffffffff) >> 15;

  tempi = in1r * in2i;
  tempx = in2r * in1i;
  *outi = ((tempi + tempx) &0xffffffff) >> 15;

}

/* complex addition of two array elements */
// when called, eg.. comadd(a, b, c, d, &e, &f);
void
comadd(int in1r, int in1i, int in2r, int in2i, int* outr, int* outi)
{
//   comadd_ex+=1;

   int tempr, tempi;
   *outr = (in1r + in2r) >> 1;
   *outi = ((in1i + in2i) & 0x0001ffff) >> 1;
}

/* complex subtraction of two array elements */
void
comsub(int in1r, int in1i, int in2r, int in2i, int* outr, int* outi)
{
//   comsub_ex+=1;

   int tempr, tempi;
   *outr = (in1r - in2r) >> 1;
   *outi = ((in1i - in2i) & 0x0001ffff) >> 1;
}

/* 2-input FFT butterfly receives the indexes for the 2 input and 2
output elements and the W_N^k value*/
void
fftfunc2(int index1, int index2, int wn, int outdex1, int outdex2)
{
//  fftfunc2_ex+=1;

  int in1r, in1i, in2r, in2i, out1, out2, wnr, wni;
  int temp1, temp2, temp3, temp4, temp5, temp6;

  if( (trial % 2) == 0)
  { 
    in1i = arra1[index1];
    in2i = arra1[index2];
  }else
  {
    in1i = arra2[index1];
    in2i = arra2[index2];
  }

//two clock cycles for the two stages in the butterfly
// multiply 1 stage, add sub 2nd stage done in parallel
  temp1+=0;
  temp1+=0;

  if( (trial % 2 ) == 0)
  {
    arra2[outdex1] = temp1;
    arra2[outdex2] = temp2;
  }else
  {
    arra1[outdex1] = temp1;
    arra1[outdex2] = temp2;
  }
}

int compute_wn(int k, int blocksize)
{
//   compute_wn_ex+=1;

   int wn;

   float real, rrr;
   float imag, iii;
   real=(cos(-2*3.1415*k/blocksize)*pow(2,15))/1.00001;
   rrr = real/pow(2,15);
   imag=(sin(-2*3.1415*k/blocksize)*pow(2,15))/1.00001;
   iii = imag/pow(2,15);
   wn = (int(real) << 16) + (int(imag) &0x0000ffff);

   return wn;
}

int rev_bits(int number, int bits){
   
/*receives a number and reverses its bit order
depending on how many bits wide the number is */
return 0;

} 


void read_input()
{
//   read_input_ex+=1;

   float tempin;
   FILE* fp;
   fp=fopen("input.txt","r");

   if(fp==NULL){
      printf("No file.\n");
   }
   else{
      printf("File opened successfully.\n");

      int i = 0;
      while(!feof(fp)){
         fscanf(fp, "%f", &tempin);
         arra1[i] = (int(tempin*pow(2,31)))&0xffff0000;
         i++;
      }
   }
   fclose(fp);
}

void write_output()
{
//   write_output_ex+=1;

   int temp;
   float tempxf, tempyf, magnn;
 
   FILE* fp;
   fp=fopen("output.txt","w");
   
   for(int i=0;i<1024;i++){
        temp = arra1[i];
        tempyf = float(temp >> 16)/pow(2,15);
        tempxf = float((temp << 16) >> 16)/pow(2,15);
      
        magnn = sqrt(tempxf*tempxf + tempyf*tempyf);
        fprintf(fp,"%f ",magnn);
   }
   fclose(fp);
}

void main(int argc, char* argv){

   timeb time1,time2;

   int i;
   int start_ind1;
   int start_ind2;
   int stages = 10;
   int blocks;


   trial=0;
   elements=1024;
   int N=elements;
   int bits;
   int blockoff=0;
//   bits=int(ceil(log10(elements)/log10(2)));

   read_input();
   for(i=1; i<= stages; i++){        //go through each stage
      trial++;
      blocksize[i]=int(pow(2,i));            //compute size of block and
      blocks=N/blocksize[i];                //  number of blocks

      for(int k=0; k<(blocksize[i]/2); k++){      //calculate wn for one
      wn[k+blockoff]=compute_wn(k,blocksize[i]);    // block before calc fft2
      }                               // for all blocks
    blockoff += blocksize[i]/2;
   }


   ftime(&time1);

  //map the x[n] data from the memory into cache1 to initialize the FFT
  //  computation.  There are 1024 elements to be copied and the
  //  following loop has 512 branch and 512 increments on t
  //here the reverse bit addressing is activated to retrieve the correct
  //  address
   for(int t=0;t<512;t++){
   }

   //compute the stages
   for(i=1; i<= stages; i++){        //go through each stage
      trial++;

//butterfly controller receives a index1 and computes index2 from the
//  stored blocksize/2.  This value is sent in at the beginning of each
//  stage to avoid repeated calculation in the inner loop and to avoid
//  spending multiple cycles to send in multiple addresses into butt cont
//  butterfly also receives stage number in order to determine which
//  cache is the source and destination.


      //send stage number to butterfly controller
      dummywil=1;

      //send the blocksize/2 to butterfly controller
      dummywil=0;

      blocks=N/blocksize[i];            //  number of blocks
               // for all blocks

      for(int j=0; j<blocks; j++){        //go through each block
         int block_offset;
         block_offset=j*(N/blocks);    //starting index of block
         //compute indexes

         for(int k=0; k<(blocksize[i]/2); k++){
            start_ind1=block_offset + k;
        //start_ind2=start_ind1 + (blocksize/2);

            //compute the values needed
        //fftfunc2(start_ind1, start_ind2, 
        //wn[k%(blocksize/2)],start_ind1, start_ind2);
        //send start_ind1 into butterfly controller and it will
        // instruct the hardware to compute the butterfly value
        dummywil=0;

         }
      }


   }

   ftime(&time2);
   printf("time= %dsec%dms\n",(int)(time2.time-time1.time),
          time2.millitm-time1.millitm);

   //write the output.txt file
   write_output();
/*
printf("commul=%d comadd=%d comsub=%d\n",commul_ex,comadd_ex,comsub_ex);
printf("fftfunc2=%d compute_wn=%d\n",fftfunc2_ex,compute_wn_ex);
printf("rev_bits=%d read_input=%d ",rev_bits_ex,read_input_ex);
printf("write_output=%d\n",write_output_ex);
*/
}
Kelvin Yuk
Kelvin Yuk obtained his PhD in Electrical Engineering in 2012.
See Comments

Leave a Reply