REAL TIME FOURIER TRANSFORM USING PARALLEL RTX PROCESSORS

 

 

C. H. Ting, Applied Biosystems, Inc.

 and Rick VanNorman, Harris Semiconductor

 

 

 

Abstract

 

 

The Continuous Fourier Transform (CFT) algorithm is well suited for parallel processing using an array of fast RTX processors to speed up the transform process.  Using four 6 MHz RTX2000 processors linked together, the input sampling rate is pushed to 2 kHz.  The resulting frequency spectrum is displayed simultaneously while the transform is being computed.  This experiment demonstrates conclusively that the speed of CFT is linearly proportional to the number of processors dedicated to the transform computation.

 

 

1. Introduction

 

 

In the 1989 Rochester Forth conference we reported the implementation of the Continuous Fourier Transform (CFT) on an RTX2000 processor (1).  CFT has many unique characteristics which take advantage of a fast computing element like RTX2000.  Since CFT guarantees that computation error does not accumulate, integer arithmetic operations can be used for all its computations.  Another important property of CFT is that the computation of each component in the frequency spectrum is independent of the other components.  Therefore, the computation can be segregated to as many independent processors as there are frequency components, and the throughput is linearly proportional to the number of processors dedicated to the computational task.  This report details the construction of such a parallel processor array and its performance.

 

 

2. Theory of Continuous Fourier Transform (CFT)

 

 

The theory of CFT was discussed fully in our prior report.  Here we shall give only a brief summary of the theory to facilitate later discussions. Assume that we have a continuous stream of data, sampled at a constant interval, from which the frequency spectrum is computed in real time:

 

   f(0), f(1), f(2), ... , f(k), ...

 

where k may run to infinity.  A window of N contiguous elements from f(k) on are taken to obtain a frequency spectrum:

 

   f(k), f(k+1), f(k+2), ... , f(k+N-1)

 

The Continuous Fourier Transform computes the following N frequency components:

 

      k+N-1

   T(k,n) =     ?   f(m) exp{-2j¹mn/N}

      m=k

 

   n = 0, 1, 2, ... , N-1

 

Once this set of frequency components is computed, the next set with a new data point f(k+N) can be obtained easily with the following recursive equation:

 

   T(k+1,n)  =  T(k,n)  +  [f(k+N) - f(k)] exp{-2j¹kn/N}

   n =  0, 1, 2, ... , N-1

 

Each T component can be evaluated independent of other T components.  As many as N processors can be employed to solve the Fourier transform.  The processors only need the data stream f(k), and the assignments of the frequency identifier n.  There is no inter-processor communication necessary.

 

In this experiment, we distribute the T components among four RTX2000 processors.  The speed of the Fourier transform is increased by a factor of 4.  Only the real part of the phase factor is used in computation, and the result is that of a cosine transform.  Computing only the cosine transform further increases the processing speed by a factor of two.  As the spectrum is computed on data in a sliding window in real time, it is not necessary to do the sine transform.

 

 

3. The System Architecture

 

 

This parallel processor architecture more closely approximates the Parallel CFT Processor proposed in the original article by Ting (2), in which one slave processor would be used to process one frequency component in the frequency spectrum.  However, as the number of slave processors increases, the throughput bottle neck will shift from the computational section to the output section of the Parallel CFT Processor System.  In this experiment, the spectrum output is neatly integrated into the slave RTX processor array.  The overall balance among input, processing and output in the current system is a significant achievement.

 

The Real Time CFT Parallel Processor System is designed with simplicity and versatility as the primary considerations.  Its structure is shown schematically in Figure 1.  A real time analog signal is fed into an A/D converter, which broadcasts the digitized data over the A/D Data Bus to four Slave Processors.  These Slave Processors, each handles a quarter of the computation load, present the cosine transform results to the D/A Data Bus.  A D/A converter generates the frequency spectrum continuously from the data on the D/A Data Bus.  A Master Processor generates all the timing signals to the A/D converter, the Slave Processors and the D/A converter.  In the current setup, each component in the frequency spectrum is computed from a window of 1024 contiguous data points in the input signal, and 512 components in the cosine transform are displayed continuously on an oscilloscope.

 

 

4. Hardware Implementation

 

 

The Master Processor and the four Slave Processors are identical single board computers based on the INDELKO Forth Kit, with RTX2000GI-8 CPU from Harris Semiconductor Corp.  Each board has 64 Kbytes of RAM memory and the cmForth operating system is in EPROM's.  Each board derives the system clock from an individual 12 MHz crystal and operates at 6 MHz cycling rate.  The Master Processor drives the A/D converter, the Slave Processors and the D/A converter through a 74HCT374 latch.  The Slave Processors receive data from the A/D converter through a set of '374 latch and they put the transformed data onto the D/A Data bus through a second set of '374 latches.  The A/D converter and the D/A converter are contained in a single KSV3100A chip from Samsung.  Both the A/D and D/A Data Buses are 8 bits in width and they share the same I/O port (Port 1CH) on the G-Bus of RTX2000.

 

The schematics of the Master Processor and the Slave Processors are shown in Figure 2 and 3.  Each processor is on a circuit board, which plugs into a DIN style socket on the mother board.  The A/D and D/A converter chip KSV3100A, the '374 latches and '138 decoders are all located on the mother board.

 

Although the D/A in KSV3100A is a 10 bit converter, only the most significant 8 bits are used for the D/A Data Bus.  Both A/D and D/A converters have a voltage range of 0 to 2 volts.  Since the outputs from the Slave Processors are in signed two's complement form, the most significant sign bit is inverted before feeding into the D/A through a '04 inverter.

 

The reset and RS232 lines on each processor board are brought to the mother board.  They are connected to a single reset switch and a DB25 connector through a flexible cable header.  This arrangement allows a host computer to communicate to each individual processor through a common RS232 cable.  All processors boot cmForth from their own on-board EPROM's. The same CFT program is downloaded to each processor.  The Master Processor executes the word MASTER to generate the timing signals to the rest of the system.  Slave Processor 1 executes CFT1.  Slave Processor 2 executes CFT2, etc.

 

To facilitate the testing and verification of the system, a 555 timer is also built on the mother board. It feeds a square wave of variable frequency to the A/D.

 

It takes 24 cycles (4 us) for a Slave processor to generate a new component in the frequency spectrum.  The four Slave Processors are interleaved and the Master Processor clocks out consecutive components by sending 0.5 us pulses at 1 us intervals sequentially to Slaves 1 to 4 and then repeats the sequence 128 times.  Thus the 512 components in the cosine transform can be displayed continuously on an oscilloscope. The Slaves synchronize themselves by monitoring the EI2 input lines driven by the Master. The Master samples the A/D converter and loads the data into slave processors' input '374 latches.  Figure 4 shows all the timing signals generated by the Master.

 

While initiating the A/D converter, the Master also raises the SyncSlaves line.  When the Slaves sense the SyncSlaves signals on their EI2 pins, they read the A/D data and start their own transform routine.  The first transform results will be latched to the output '374 latches in about 10 us.  The slave's waiting loop for EI2 signal is a very tight 5 cycle loop.  Consequently, the jitter in the output of the slaves is limited to 0.84 us, which does not disturb the interleaved sequencing of the Slaves' outputs.

 

Using interrupt servicing technique would reduce the jittering to within 0.5 us.  This will be helpful when we build a 24 Slave Processor Array. However, the polling technique works very satisfactorily in this four processor array and it eliminates four resettable flip-flops to handle the EI2 interrupts.

 

This scheme works because RTX2000 is a very well behaved machine.  Its cycles and actions can be timed precisely as all instructions are executed in one or two machine cycles.  For most other processors, one would have to use a deep FIFO chip to buffer the output data from the processor to the D/A converter.  In our current experiment, the '374 latch, which is a single stage FIFO, can perform flawlessly because the latched data will be picked up assuredly before the next data point arrives.

 

 

5. Software Implementation

 

 

The complete software package used in this experiment is shown in Listing 1.  It is loaded to all processors on the top of the cmForth kernel which is copied from the EPROM's into the RAM memory during cold boot.  The Master Processor executes the word MASTER in Screen 2.  The Slave Processors execute respectively one of the CFT1, CFT2, CFT3 or CFT4 commands in Screens 12 to 15.

 

In Screen 2, MASTER is the word which causes the Master Processor to generate all the timing signals needed by the Slave Processors and the A/D-D/A converters, as shown in Figure 4.  HOLD is the command disabling the output '374 latches of the Slaves, and preventing them from fighting against one another over the D/A Data Bus.  T1 is a test routine to be used by the Slave Processors.  It causes the Slave to display a sawtooth wave through the D/A converter.  It is useful to verify that the Master generates the correct timing signals, and that the output data latch in a Slave is working properly.

 

Screens 4 to 7 contain the cmForth optimizing compiler, which compresses as much Forth functionality into a RTX2000 machine instruction as possible.  The optimizing compiler takes care of optimizing regular Forth code into highly efficient RTX machine code.  Therefore, hand optimizing is necessary only in the most time critical routine (TERM) in CFT.

 

Screen 8 contains some important slave test routines.  DD dumps 256 bytes of memory for inspection.  Address of the next byte above the dumped region is left on the stack so that DD can be used again to continue the dumping and inspection.  VECTORS displays the contents of the Interrupt Vector Register in RTX2000.  It shows whether one can read the SyncSlaves signal through the EI2 pin.  The initialization routine in cmForth often leaves the Data Stack Underflow Interrupt on and then VECTORS will display a pageful of 1A0, which is the vector address of data stack underflow interrupt.  This condition must be corrected by pushing a few dummy numbers on the stack.

 

INPUT in Screen 8 reads and displays 256 samples of data from the A/D converter.  This command will ensure that the A/D converter and the '374 data input latches are working properly.

 

Screens 9 and 10 contain code to initialize the sine table and other memory areas used by the CFT routine.  Entries in the sine table are 14 bits two's complement values.  They are selected so that after multiplying with the 8 bit input data and accumulated 1024 times, the result fills in a 32 bit accumulator optimally.

 

(TERM) in Screen 11 is the most critical piece of code in this parallel processor CFT system.  It updates every fourth element in the frequency domain, after an input data is read in from the A/D converter.  After the 32 bit accumulated result is obtained, the most significant 16 bit is written to the output G port.  Only the most significant 8 bits are latched into the '374 output data latch, from where it will be clocked out to the D/A converter by the Master Processor.  The inner loop in (TERM) is executed 128 times to compute a quarter of the 512 cosine transform components.  It takes 24 machine cycles to complete this inner loop.  With four Slave Processors interleaved in their outputs, the effective system throughput is one output data per 6 machine cycles-- 1 us on 6 MHz RTX2000 processors.

 

CFT1 to CFT4 in Screens 12 to 15 are similar.  The difference is that each word selects a difference initial phase factor for the computation.  The initial phase factor determines the frequency of the cosine transform component in the output frequency spectrum.  CFT2 to CFT4 also contain code to delay the starting of (TERM) so that the output data from the Slave Processors are interleaved correctly.

 

 

6. Conclusion

 

 

With this Parallel CFT Processor System we have proved conclusively that the CFT algorithm can be implemented on a parallel processing structure and that the processing speed can be increased linearly proportional to the number of processors dedicated to the task.  It is most gratifying to observe that the CFT system takes analog signals and generates the Fourier spectrum in real time, without degradation while running unattended continuously .

 

The architecture described here can be expanded to include up to 24 RTX2000 Slave Processors, using only one RTX2000 as the Master.  This structure will be able to produce one frequency component in every machine cycle of the Master Processor.  The sampling rate will be 12 kHz using 6 MHz RTX's, or 20 kHz using 10 MHz RTX's.  To use more slave processors, we will have to use a faster master processor or use specialized control circuits to generate the required timing signals to all the slave processors.

 

RTX2000 is very fast as a general purpose CPU and controller.  However, it is still far from being optimized for the task of CFT.  In the 24 cycle inner loop most of the cycles are fetching data from the memory and storing results back to memory.  The ghost of the von Neumann bottle neck follows us wherever we go.  The only way to get rid of the bottle neck is to build specialized CFT slave processors with integrated multiplier/accumulator and table memory in ROM.  Then the ultimate speed limiting factor is how fast can we route the transformed spectrum out to the display device and to the system which will ultimately make use of the spectrum.

 

 

References

 

 

1.   C. H. Ting and Rick VanNorman, Continuous Fourier Transform, Proceedings of the 1989 Rochester Forth Conference, p. 121, 1989.

 

2.  C. H. Ting, Fourier Transform Faster Than Fast Fourier Transform, Real Time Signal Processing, SPIE Proceedings, Vol. 241, p. 167, 1980.

 

 


Figure 1.   Architecture of the Parallel CFT Processor System.

 

 

 

 

 

 

 


Figure 2.   Schematics of the Master Processor.

 

 

 

 

 


Figure 3.   Schematics of the Slave Processor 1.

 

 

 

 

 


Figure 4.   Timing signals generated by the Master Processor.