
PySpark in Microsoft Fabric: 100 Quick Revision Questions and Answers


1. What is PySpark? – PySpark is the Python API for Apache Spark used for distributed data processing.

2. What is Microsoft Fabric? – A unified analytics platform integrating data engineering, data science, and Power BI.

3. What is the role of Spark in Fabric? – Spark provides distributed data processing within Fabric notebooks.

4. How to view Spark session in Fabric? – Use the 'spark' variable; it is auto-created in Fabric notebooks.

5. How to read a CSV file? – df = spark.read.csv('path', header=True, inferSchema=True)

6. How to read a Parquet file? – df = spark.read.parquet('path')
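A quick sketch that ties questions 4–6 together; the file paths point at the Files area of an attached Lakehouse and are purely illustrative:

# 'spark' is created automatically in Fabric notebooks, so no SparkSession setup is needed
csv_df = spark.read.csv("Files/raw/sales.csv", header=True, inferSchema=True)
parquet_df = spark.read.parquet("Files/raw/sales.parquet")
csv_df.show(5)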

7. How to show top rows? – df.show(5)

8. How to print schema? – df.printSchema()

9. How to count total rows? – df.count()

10. How to display columns? – df.columns

11. How to describe data? – df.describe().show()

12. How to select columns? – df.select('col1', 'col2')

13. How to filter rows? – df.filter(df.age > 30)

14. How to use SQL in PySpark? – df.createOrReplaceTempView('table'); spark.sql('SELECT * FROM table')
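A minimal sketch of question 14 in practice, assuming a DataFrame df with an age column is already loaded (the view name sales_view is just an example):

# Register the DataFrame as a temporary view (visible only to this Spark session)
df.createOrReplaceTempView("sales_view")

# Query it with Spark SQL; the result is itself a DataFrame
result = spark.sql("SELECT * FROM sales_view WHERE age > 30")
result.show(5)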

15. What is a DataFrame? – A distributed collection of data organized into columns.

16. What is an RDD? – Resilient Distributed Dataset; low-level distributed data structure in Spark.

17. How to convert DataFrame to RDD? – df.rdd

18. How to add a new column? – df.withColumn('newCol', df.col1 + 10)

19. How to drop a column? – df.drop('colName')

20. How to rename a column? – df.withColumnRenamed('old', 'new')

21. How to get distinct values? – df.select('col').distinct()

22. How to sort data? – df.sort('col') or df.orderBy('col')

23. How to group and aggregate? – df.groupBy('col').agg({'sales': 'sum'})
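As a quick illustration of questions 22–23, assuming a DataFrame with hypothetical region and sales columns:

from pyspark.sql import functions as F

# Total and average sales per region, largest totals first
summary = (df.groupBy("region")
             .agg(F.sum("sales").alias("total_sales"),
                  F.avg("sales").alias("avg_sales"))
             .orderBy(F.desc("total_sales")))
summary.show()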

24. What is a transformation? – Operation that creates a new DataFrame (lazy).

25. What is an action? – Operation that triggers computation (e.g., show, collect).

26. How to cache a DataFrame? – df.cache()

27. How to unpersist a DataFrame? – df.unpersist()

28. How to join two DataFrames? – df1.join(df2, 'key', 'inner')



29. Types of joins? – inner, left, right, full, cross.
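A short sketch of questions 28–29, assuming two hypothetical DataFrames, orders and customers, that share a customer_id key:

# Inner join keeps only matching keys; swap 'inner' for 'left', 'right', 'full', or 'cross' as needed
joined = orders.join(customers, on="customer_id", how="inner")
joined.show(5)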

30. How to save DataFrame as Parquet? – df.write.parquet('path')

31. How to overwrite existing data? – df.write.mode('overwrite').csv('path')

32. How to append data? – df.write.mode('append').csv('path')

33. What is a Lakehouse in Fabric? – A storage architecture that combines data lake and data warehouse features.

34. How to write to Lakehouse table? – df.write.format('delta').saveAsTable('lakehouse.table')

35. How to read a Delta table? – spark.read.format('delta').load('path')

36. What is a Delta table? – A table format supporting ACID transactions in Fabric.
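Putting questions 34–36 together, a minimal sketch of writing and reading a Delta table from a Fabric notebook (the table name sales and the Tables/sales path are illustrative and assume a Lakehouse is attached):

# Write the DataFrame as a managed Delta table in the attached Lakehouse
df.write.format("delta").mode("overwrite").saveAsTable("sales")

# Read it back by table name, or by path if you prefer
sales_df = spark.read.table("sales")
# sales_df = spark.read.format("delta").load("Tables/sales")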

37. How to drop duplicates? – df.dropDuplicates()

38. How to replace null values? – df.fillna({'col': 0})

39. How to check for nulls? – df.filter(df.col.isNull())

40. How to union DataFrames? – df1.union(df2)

41. How to sample data? – df.sample(0.1)

42. How to get unique count? – df.select(countDistinct('col')) (countDistinct comes from pyspark.sql.functions)

43. How to use when/otherwise? – df.withColumn('new', when(df.col > 0, 1).otherwise(0))
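Question 43 needs an import from pyspark.sql.functions; a minimal sketch assuming a hypothetical amount column:

from pyspark.sql.functions import when

# Flag positive amounts with 1, everything else (including nulls) with 0
df = df.withColumn("is_positive", when(df.amount > 0, 1).otherwise(0))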

44. How to repartition data? – df.repartition(4)

45. How to coalesce partitions? – df.coalesce(1)

46. What is lazy evaluation? – Spark waits until an action is called to execute transformations.

47. How to get DataFrame schema? – df.schema

48. What is a broadcast join? – A join where the smaller DataFrame is wrapped in broadcast(df_small) so it is copied to every executor instead of being shuffled.
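A sketch of question 48, assuming a hypothetical small dimension DataFrame dim_country that fits comfortably in executor memory:

from pyspark.sql.functions import broadcast

# Ship the small table to every executor so the large table is not shuffled
joined = large_df.join(broadcast(dim_country), on="country_code", how="left")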

49. What is the default file format in Spark? – Parquet.

50. How to register a temp view? – df.createOrReplaceTempView('view')

51. How to execute SQL query? – spark.sql('SELECT * FROM view')

52. How to measure execution time? – Use %time or Python’s time module.

53. How to read JSON? – spark.read.json('path')

54. How to save JSON? – df.write.json('path')

55. How to check Spark version? – spark.version

56. How to show Spark config? – spark.sparkContext.getConf().getAll()

57. What is a Fabric notebook? – An interactive environment in Fabric for writing Spark, SQL, and Python code.

58. What is autoscale in Fabric Spark? – Automatically adjusts resources based on workload.

59. How to optimize joins? – Use broadcast joins for small tables.

60. How to improve performance? – Cache reused data, use partitioning, avoid shuffles.

61. How to create a Delta table in Fabric? – spark.sql("CREATE TABLE tbl USING DELTA LOCATION 'path'")

62. How to show tables? – spark.sql('SHOW TABLES')

63. How to delete a table? – spark.sql('DROP TABLE table_name')
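Questions 61–63 as one hedged sketch; the table name and location are illustrative, and note that the path itself must be quoted inside the SQL string:

# Create a Delta table over an existing location, list tables, then drop it
spark.sql("CREATE TABLE IF NOT EXISTS sales_delta USING DELTA LOCATION 'Tables/sales_delta'")
spark.sql("SHOW TABLES").show()
spark.sql("DROP TABLE IF EXISTS sales_delta")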

64. What is schema inference? – Spark automatically detects column types from data.

65. What is a partition? – Logical division of data for parallel processing.

66. What is a shuffle? – Data redistribution between partitions during wide transformations.

67. What is collect()? – Returns all rows as a list to the driver.

68. What is take()? – Returns first n rows to the driver.



69. How to convert DataFrame to Pandas? – df.toPandas()

70. What is show() used for? – Displays top rows of a DataFrame.

71. How to read Excel in Fabric PySpark? – Use pandas.read_excel() in a Spark notebook cell.
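Spark has no built-in Excel reader, so question 71 goes through pandas; a sketch assuming the openpyxl engine is available and that the file sits under the default Lakehouse Files area (the path is illustrative):

import pandas as pd

# Read with pandas, then convert to a Spark DataFrame for distributed processing
pdf = pd.read_excel("/lakehouse/default/Files/sales.xlsx")
df = spark.createDataFrame(pdf)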

72. How to stop Spark session? – spark.stop()

73. What is the advantage of Delta format? – Supports ACID, versioning, and time travel.

74. What is display() in Fabric? – Renders DataFrame output in notebook cell with visualization options.

75. How to check number of partitions? – df.rdd.getNumPartitions()

76. How to repartition by column? – df.repartition('col')

77. What is collect_set()? – Returns unique values as array per group.

78. What is a window function? – A function that performs operations across a window of rows.

79. How to use rank()? – from pyspark.sql.functions import rank; from pyspark.sql.window import Window; rank().over(Window.partitionBy('col').orderBy('val'))
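A fuller sketch of questions 78–79, ranking rows within each group (the category and sales column names are hypothetical):

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Rank rows within each category, highest sales first
w = Window.partitionBy("category").orderBy(df.sales.desc())
ranked = df.withColumn("sales_rank", rank().over(w))
ranked.show(5)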

80. What is persist()? – Caches DataFrame in memory or disk for reuse.

81. How to check data types? – df.dtypes

82. What is a checkpoint? – Saves the data to stable storage and truncates the RDD lineage, breaking a long dependency chain.

83. How to drop rows with nulls? – df.na.drop()

84. How to replace nulls with a value? – df.na.fill('N/A')

85. What is explode()? – Converts array or map column into multiple rows.

86. How to convert string to date? – to_date(df.col, 'yyyy-MM-dd')
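A sketch of question 86, assuming a string column order_date in yyyy-MM-dd format (the column name is hypothetical):

from pyspark.sql.functions import to_date

# Parse the string column into a proper DateType column
df = df.withColumn("order_date", to_date(df.order_date, "yyyy-MM-dd"))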

87. How to write to SQL table? – df.write.jdbc(url, table, mode, properties)

88. How to read from SQL table? – spark.read.jdbc(url, table, properties)
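A hedged sketch of questions 87–88; the connection URL, table names, credentials, and driver are placeholders to replace with your own:

jdbc_url = "jdbc:sqlserver://<server>:1433;databaseName=<db>"  # placeholder
props = {"user": "<user>", "password": "<password>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# Read a table into a DataFrame, then write results back to another table
src_df = spark.read.jdbc(url=jdbc_url, table="dbo.SourceTable", properties=props)
src_df.write.jdbc(url=jdbc_url, table="dbo.TargetTable", mode="append", properties=props)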



89. How to rename multiple columns? – Use a loop or functools.reduce with withColumnRenamed (see the sketch below).
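A minimal sketch of question 89 using a simple loop (the rename mapping is illustrative):

# Map of old column names to new ones
renames = {"cust_nm": "customer_name", "ord_dt": "order_date"}

for old_name, new_name in renames.items():
    df = df.withColumnRenamed(old_name, new_name)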

90. How to get summary statistics? – df.summary().show()

91. What is collect_list()? – Aggregates values into a list per group.

92. What is countDistinct()? – Returns count of unique values.

93. What is broadcast()? – pyspark.sql.functions.broadcast(df) marks a DataFrame for a broadcast join; spark.sparkContext.broadcast(value) ships a read-only variable to all worker nodes.

94. What is df.limit()? – Limits rows in output.

95. How to cache a table in SQL? – spark.sql('CACHE TABLE table_name')

96. What is MLlib? – Spark’s machine learning library.

97. How to install packages in Fabric notebook? – Use %pip install package_name.

98. How to run Python code in Fabric? – Use Python cell in notebook.

99. How to schedule a notebook in Fabric? – Use Fabric Data Pipeline or Job scheduling feature.

100. How to monitor Spark jobs in Fabric? – Check Fabric Monitoring hub under Spark Job runs.

Thanks for reading this post! I hope you found it helpful. Feel free to share it with others or your teammates so they can benefit from it too. 😊
